Meta, the company formerly known as Facebook, has unveiled its latest innovation in artificial intelligence: Voicebox, a generative text-to-speech model designed to revolutionize audio generation. This new system aims to do for spoken words what OpenAI’s ChatGPT and DALL-E did for text and image generation.
Like GPT and DALL-E, Voicebox generates output from text input, but it produces audio clips rather than text or images. Meta describes it as “a non-autoregressive flow-matching model trained to infill speech, given audio context and text.” The model has been trained on more than 50,000 hours of unfiltered audio, consisting of recorded speech and transcripts from public domain audiobooks in English, French, Spanish, German, Polish, and Portuguese.
By training on diverse data, Voicebox can generate speech that sounds more conversational, regardless of the languages involved. Meta’s researchers have found that speech recognition models trained on Voicebox-generated synthetic speech perform almost as well as models trained on real speech. Their error rate degraded by only 1 percent, compared to the 45 to 70 percent degradation typically observed when training on synthetic speech from existing text-to-speech (TTS) models.
The initial training process involved teaching the model to predict speech segments based on their surrounding segments and the passage’s transcript. Once it learned to generate speech from context, the model could apply this knowledge to various speech generation tasks. It can generate portions of speech in the middle of an audio recording without recreating the entire input, showcasing its flexibility.
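The infilling setup described above can be illustrated with a minimal Python sketch. It is not Meta's implementation: the function names `make_infill_example` and `splice` are hypothetical, and audio frames are stood in for by plain floats. It only shows the bookkeeping, which is to hide a span of frames, keep the surrounding frames as context, and later splice a regenerated span back without touching the rest of the recording.

```python
from typing import List, Sequence, Tuple


def make_infill_example(
    frames: Sequence[float], start: int, end: int
) -> Tuple[List[float], List[bool], List[float]]:
    """Hide frames[start:end] and return (context, mask, target).

    Hypothetical helper: a real model would work on spectrogram
    frames and also receive the transcript of the passage.
    """
    if not 0 <= start < end <= len(frames):
        raise ValueError("masked span must lie inside the recording")
    # True marks the frames the model is asked to predict.
    mask = [start <= i < end for i in range(len(frames))]
    # The context keeps the surrounding audio and blanks the hidden span.
    context = [0.0 if hidden else f for f, hidden in zip(frames, mask)]
    target = list(frames[start:end])
    return context, mask, target


def splice(
    frames: Sequence[float], start: int, end: int, generated: Sequence[float]
) -> List[float]:
    """Replace only frames[start:end] with generated frames."""
    if len(generated) != end - start:
        raise ValueError("generated span must match the masked length")
    return list(frames[:start]) + list(generated) + list(frames[end:])


if __name__ == "__main__":
    recording = [0.1, 0.2, 0.3, 0.4, 0.5]
    context, mask, target = make_infill_example(recording, 1, 3)
    print(context, mask, target)
    print(splice(recording, 1, 3, [9.0, 8.0]))
```

In this toy form the same two helpers cover both training (predict `target` from `context` and the transcript) and editing (regenerate a noisy span and splice it back).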
Voicebox also boasts the capability to actively edit audio clips, removing background noise and replacing misspoken words. Users can identify and crop segments of speech that are corrupted by noise, instructing the model to regenerate those segments. This functionality resembles image-editing software’s ability to enhance and clean up photographs.
While text-to-speech generators have been available for some time, Voicebox’s novel training method, known as Flow Matching, sets it apart. Unlike previous systems, Voicebox doesn’t require vast amounts of specific source material for each subject it mimics. Meta says its model outperforms the current state of the art in both intelligibility and audio similarity, achieving a word error rate of 1.9 percent (compared to the previous state of the art’s 5.9 percent) and an audio similarity composite score of 0.681 (compared to 0.580). Additionally, Voicebox operates up to 20 times faster than existing TTS systems.
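For readers unfamiliar with the intelligibility metric, word error rate is conventionally computed as the word-level edit distance between a reference transcript and a recognized transcript, divided by the number of reference words. The Python sketch below shows that standard definition. The function name `word_error_rate` and the lowercase, whitespace-based tokenization are assumptions made for illustration, and nothing here reflects how Meta's evaluation pipeline normalizes text.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Assumption: words are compared after lowercasing and splitting
    on whitespace; real evaluations usually normalize text further.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] is the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Under this definition, a 1.9 percent word error rate corresponds to roughly two word errors per hundred reference words.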
However, Meta has clarified that the Voicebox app and its source code will not be released to the public at this time due to concerns about potential misuse. Despite this decision, Meta has provided audio examples and released the initial research paper, allowing the public to glimpse the capabilities of the technology. The research team envisions future applications for Voicebox in areas such as prosthetics for patients with vocal cord damage, in-game non-player characters (NPCs), and digital assistants.
Meta’s Voicebox represents a significant advancement in generative speech models, opening up possibilities for enhanced audio experiences and practical applications across various domains.