MetaVoice-1B is a 1.2B parameter base model trained on 100K hours of speech for TTS (text-to-speech). It has been built with the following priorities:
- Emotional speech rhythm and tone in English. No hallucinations.
- Zero-shot cloning for American & British voices, with 30s reference audio.
- Support for (cross-lingual) voice cloning with finetuning.
  - We have had success with as little as 1 minute training data for Indian speakers.
- Support for long-form synthesis.
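Since the list above quotes 30s of reference audio for zero-shot cloning, here is a minimal sketch for checking a clip's length before using it as a reference. It assumes the clip is an uncompressed PCM WAV file and uses only Python's standard `wave` module; the function names and the 30-second default are illustrative and not part of MetaVoice itself.

```python
import wave


def clip_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        # Duration = number of audio frames / frames per second.
        return wav.getnframes() / wav.getframerate()


def is_long_enough(path, min_seconds=30.0):
    """Check a clip against a minimum length (30s is taken from the list above)."""
    return clip_duration_seconds(path) >= min_seconds
```

Calling `is_long_enough("reference.wav")` on a hypothetical clip returns `True` once it reaches 30 seconds. MP3 or other compressed formats would need converting to WAV first.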
Got really good clone results with 1-minute clips, but it only generates output from text, which isn't where most of the fun is. What you really want is a version that copies the emotion from the original speech, though this might still work for JOI stuff. I also think speech-to-speech is possible, but it's behind a paywall:
MetaVoice - Text to Speech & AI Voice Changer
Emotive, human-like speech at scale in any voice or style. Perfect for content creators, developers, and businesses. Use text to speech to voice content for your videos, brands, characters, or AI agents. Alternatively, use our AI voice changer to transform your voice to a different style, whilst...
themetavoice.xyz
You can also download the ~5 GB model, but the Python training code needs a module called flash-attention, which only works on Linux. Maybe some of the Windows Subsystem for Linux (WSL) folks will be able to get it running. I'd love to dabble with the Python and figure out how to change the input from TTS to STS (speech-to-speech), since the clone result was good, but typing out the text or auto-transcribing it from a video is going to lose emotion and throw the sync off.
GitHub - metavoiceio/metavoice-src: Foundational model for human-like, expressive TTS
github.com
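For the flash-attention issue mentioned above, a quick sketch for checking whether a machine looks ready before downloading anything. It assumes the package is imported under the name `flash_attn` (the import name of the `flash-attn` package) and only tests whether it can be found, not whether it works with your GPU. The function name is made up for illustration.

```python
import importlib.util
import platform


def flash_attention_status():
    """Report whether we are on Linux and whether flash_attn can be found."""
    # platform.system() returns "Linux" inside WSL as well.
    on_linux = platform.system() == "Linux"
    # find_spec returns None when the top-level module is not installed.
    installed = importlib.util.find_spec("flash_attn") is not None
    return {"on_linux": on_linux, "flash_attn_installed": installed}


if __name__ == "__main__":
    print(flash_attention_status())
```

If `on_linux` comes back `False` on Windows, running the same check inside a WSL shell shows whether that route is an option.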