Meta’s Voicebox AI Creates Lifelike Speech from Text, Like Dall-E for Text-to-Speech

Meta, the company formerly known as Facebook, has unveiled its latest innovation in artificial intelligence: Voicebox, a generative text-to-speech model designed to revolutionize audio generation. The new system aims to do for spoken words what OpenAI’s ChatGPT and Dall-E did for text and image generation.

Voicebox operates as a text-to-output generator, similar to GPT and Dall-E, but it produces audio clips rather than text or images. Meta describes it as “a non-autoregressive flow-matching model trained to infill speech, given audio context and text.” The model was trained on more than 50,000 hours of unfiltered audio, specifically recorded speech and transcripts from public domain audiobooks in six languages: English, French, Spanish, German, Polish, and Portuguese.