Meta’s Voicebox AI Creates Lifelike Speech from Text, Like DALL-E for Text-to-Speech

Voicebox can also edit audio clips, removing background noise and replacing misspoken words. Users identify the segment of speech that is corrupted by noise, crop it, and instruct the model to regenerate that segment. The workflow resembles the way image-editing software is used to enhance and clean up photographs.

While text-to-speech generators have been available for some time, Voicebox’s novel training method, known as Flow Matching, sets it apart. Unlike previous systems, Voicebox doesn’t require vast amounts of specific source material for each subject it mimics. Meta’s AI outperforms the previous state of the art in both intelligibility and audio similarity, achieving a word error rate of 1.9 percent (compared to 5.9 percent) and an audio similarity score of 0.681 (compared to 0.580). Additionally, Voicebox operates up to 20 times faster than existing TTS systems.
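
For readers unfamiliar with the intelligibility metric, word error rate is conventionally computed as the number of word-level substitutions, insertions, and deletions needed to turn a transcript of the generated audio into the reference text, divided by the number of words in the reference. The Python sketch below illustrates that standard definition only; it is not Meta’s evaluation code, and the function name `word_error_rate` and the sample sentences are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative only: a textbook WER, not Meta's evaluation pipeline.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical sentences: one substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

By this definition, a 1.9 percent word error rate means roughly two word-level errors per hundred reference words, against about six per hundred for the 5.9 percent figure.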