MrDeepFakes Forums

Voice Cloning: New Open-Source Zero-Shot Model "MetaVoice" (only TTS is free; S2S seems possible but is not public)

666VR999

DF Enthusiast
MetaVoice-1B is a 1.2B parameter base model trained on 100K hours of speech for TTS (text-to-speech). It has been built with the following priorities:

  • Emotional speech rhythm and tone in English. No hallucinations.
  • Zero-shot cloning for American & British voices, with 30s reference audio.
  • Support for (cross-lingual) voice cloning with finetuning.
    • We have had success with as little as 1 minute training data for Indian speakers.
  • Support for long-form synthesis.

I got really good clone results with 1-minute clips, but it only generates output from text, which isn't where most of the fun is. What you really want is a version that copies the emotion from the original speech, though this might still work for JOI stuff. I also think speech-to-speech is possible but behind a paywall:



You can also download the ~5 GB model, but the Python training code needs a module called flash-attention, which only works on Linux. Maybe some of the Windows Subsystem for Linux (WSL) folks will be able to get it running. I'd love to dabble with the Python and figure out how to change the input from TTS to STS, as the clone result was good, but typing out the text or auto-transcribing it from a video is going to lose emotion and throw the sync off.

 