By training on diverse data, Voicebox can generate speech that sounds more conversational, regardless of the languages involved. Meta’s researchers have found that speech recognition models trained on Voicebox-generated synthetic speech perform almost as well as models trained on real speech. Remarkably, the recognition models trained on the synthetic speech showed only a 1 percent degradation in error rate, compared with the 45 to 70 percent degradation typically observed when the synthetic speech comes from existing text-to-speech (TTS) models.
The initial training process involved teaching the model to predict speech segments based on their surrounding segments and the passage’s transcript. Once it learned to generate speech from context, the model could apply this knowledge to various speech generation tasks. It can generate portions of speech in the middle of an audio recording without recreating the entire input, showcasing its flexibility.
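The idea of predicting a hidden segment from the audio around it can be illustrated with a small sketch. This is not Voicebox's actual training code: the feature shapes, the function names (`make_span_mask`, `masked_context`, `masked_loss`, `infill`), and the all-zeros stand-in for the model are assumptions made for illustration, and the transcript input is left out. It only shows the masking pattern: hide a span of frames, score predictions on that span alone, and keep the rest of the recording untouched.

```python
import numpy as np

def make_span_mask(num_frames, span_start, span_len):
    """Return a boolean mask that is True for the frames to be predicted."""
    mask = np.zeros(num_frames, dtype=bool)
    mask[span_start:span_start + span_len] = True
    return mask

def masked_context(features, mask):
    """Zero out the masked frames so only the surrounding audio remains."""
    context = features.copy()
    context[mask] = 0.0
    return context

def masked_loss(predicted, target, mask):
    """Mean squared error computed only over the masked frames."""
    diff = predicted[mask] - target[mask]
    return float(np.mean(diff ** 2))

def infill(original, predicted, mask):
    """Keep original frames outside the mask; use predictions inside it."""
    return np.where(mask[:, None], predicted, original)

rng = np.random.default_rng(0)
features = rng.normal(size=(10, 4))  # 10 frames, 4 hypothetical feature dims
mask = make_span_mask(num_frames=10, span_start=3, span_len=4)
context = masked_context(features, mask)

# Stand-in for a model (assumption): a real system would predict the hidden
# frames from the surrounding context and the transcript.
predicted = np.zeros_like(features)

loss = masked_loss(predicted, features, mask)
edited = infill(features, predicted, mask)
```

Because `infill` only overwrites the masked frames, the rest of the recording passes through unchanged, which mirrors the ability to edit the middle of a clip without regenerating the whole input.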