Text to speech (free read-aloud)
Turn a script, an article or a voiceover draft into something you can actually hear: that's what text-to-speech is for. ConvertMeow uses Kokoro, an open-source speech model, to synthesize a natural-sounding voice right in your browser and gives you a WAV file to play and download, with a choice of US and UK voices. Honest note: the model downloads once (~80MB, then cached in your browser so reuse is instant) and synthesis runs entirely on your device, so the text is never uploaded. That makes it free, unlimited and watermark-free, a zero-cost alternative to paid voiceover services such as ElevenLabs (from $22/mo). Voices are English-first for now; for long scripts, split the text into sentences and generate in chunks.
The first run downloads the Kokoro voice model (~80MB), which is then cached in your browser for instant reuse. Everything runs on your device, and the text is never uploaded.
How to use text to speech
1. Paste or type the text you want read aloud (up to ~2,000 characters per run).
2. Pick a voice (different genders, US/UK accents).
3. Click Generate speech. The first run downloads the voice model (~80MB, then cached); synthesis is quick.
4. Listen in the browser, then download it as a WAV file.
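If your script is longer than the roughly 2,000 characters a single run accepts, you can split it before pasting. The sketch below is one possible way to do that offline in Python; it is not part of ConvertMeow. The function name `split_into_chunks` and the `MAX_CHARS` constant are made up for illustration, and the sentence detection is a rough heuristic (it breaks after `.`, `!` or `?` followed by whitespace).

```python
import re

MAX_CHARS = 2000  # assumption: mirrors the approximate per-run limit in step 1


def split_into_chunks(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Group whole sentences into chunks of at most max_chars characters."""
    # Rough heuristic: a sentence ends at ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # A single sentence longer than the limit is cut at the last space
        # that fits, or mid-word if it contains no space at all.
        while len(sentence) > max_chars:
            cut = sentence.rfind(" ", 0, max_chars + 1)
            if cut <= 0:
                cut = max_chars
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:cut].strip())
            sentence = sentence[cut:].strip()
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    script = "First sentence. Second sentence! A third one? And a fourth."
    for number, chunk in enumerate(split_into_chunks(script, max_chars=35), 1):
        print(number, chunk)
```

Each chunk can then be pasted and generated as its own clip. Keeping whole sentences together avoids cutting the voice off mid-phrase.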
Why use ConvertMeow's Text to speech?
- Text stays on your machine: synthesis runs in your browser, so scripts, drafts and private notes never touch a server.
- Free, unlimited, no watermark: generate as many clips as you like. Nothing is stamped on the audio and there's no upgrade nag.
- Apache-licensed, commercial-friendly model: it uses Kokoro-82M (Apache-2.0), so the speech you generate is safe to use in video voiceovers, podcasts and more.
Frequently asked questions
Yes, the speech you generate can be used commercially. ConvertMeow uses the Kokoro-82M model under the Apache-2.0 license (which permits commercial use), synthesis happens locally, and the WAV output is yours to use freely. As always, whether your final content is compliant still depends on the text you choose to read.
On the first run, the ~80MB voice model is downloaded to your browser and cached, so later runs skip the download. After that, every synthesis is fast. Chrome and Edge with WebGPU are noticeably quicker; other browsers automatically fall back to WASM and still work.
The current voices are English (US/UK) and work best on English text. Other languages aren't its strength yet. For long passages, split them into sentences or paragraphs and generate in chunks, then join the clips.
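Joining the downloaded clips can be done with any audio editor. As one possible alternative, here is a minimal Python sketch using the standard-library `wave` module. It is an illustration, not a ConvertMeow feature: the function name `join_wav_clips` and the file names are hypothetical, and it assumes the clips are plain PCM WAV files with identical channel count, sample width and sample rate. The `wave` module does not read floating-point WAV files, so clips in that format would need converting first.

```python
import wave


def join_wav_clips(clip_paths: list[str], output_path: str) -> None:
    """Concatenate PCM WAV clips that all share the same audio format."""
    if not clip_paths:
        raise ValueError("no clips to join")
    fmt = None
    frames = []
    for path in clip_paths:
        with wave.open(path, "rb") as clip:
            clip_fmt = (
                clip.getnchannels(),
                clip.getsampwidth(),
                clip.getframerate(),
            )
            if fmt is None:
                fmt = clip_fmt
            elif clip_fmt != fmt:
                # Mixing formats would produce garbled audio, so stop early.
                raise ValueError(f"{path} has format {clip_fmt}, expected {fmt}")
            frames.append(clip.readframes(clip.getnframes()))
    with wave.open(output_path, "wb") as out:
        out.setnchannels(fmt[0])
        out.setsampwidth(fmt[1])
        out.setframerate(fmt[2])
        for data in frames:
            out.writeframes(data)
```

A call such as `join_wav_clips(["part1.wav", "part2.wav"], "full.wav")` writes the clips back to back in the order given, so list them in reading order.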
No, your text is not uploaded. The whole process runs in your browser on your device: neither the text nor the generated audio is sent to any server, so nothing is collected or used for training.
Updated · ConvertMeow team