OpenAI Enhances ChatGPT with Voice Conversations and Image Recognition

Voice conversations and image-based queries are coming to ChatGPT, rolling out first to Plus and Enterprise users

OpenAI is unveiling substantial upgrades to ChatGPT, enabling the chatbot to hold voice conversations and answer questions about images. The features are rolling out now, initially to Plus and Enterprise subscribers, with the image capabilities reaching other users later.

To engage in voice conversations with ChatGPT on Android and iOS, users will need to opt in within the ChatGPT app by navigating to Settings and then New Features. Once activated, users can select from five distinct voices by tapping the microphone icon.

OpenAI’s voice conversations are powered by a new text-to-speech model capable of generating “human-like audio from just text and a few seconds of sample speech.” OpenAI collaborated with professional actors to create the five available voices. In the other direction, its Whisper speech recognition system transcribes spoken words into text.
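A similar speech stack is exposed through OpenAI’s public API, so the round trip can be sketched in Python. This is an illustrative sketch, not code from the announcement: the `openai` v1 client, the model names `whisper-1` and `tts-1`, and the supported-format list are assumptions about the API rather than details confirmed by the article.

```python
import os

# Audio formats the Whisper API accepts (listed here as an assumption from OpenAI's docs).
SUPPORTED_FORMATS = {"flac", "m4a", "mp3", "mp4", "mpeg", "mpga", "oga", "ogg", "wav", "webm"}


def is_supported(path: str) -> bool:
    """Check whether a file's extension is one the speech-to-text endpoint can ingest."""
    return path.rsplit(".", 1)[-1].lower() in SUPPORTED_FORMATS


def transcribe(path: str) -> str:
    """Speech -> text with Whisper. Requires OPENAI_API_KEY to be set."""
    from openai import OpenAI  # imported lazily so the sketch loads without the package
    client = OpenAI()
    with open(path, "rb") as audio:
        result = client.audio.transcriptions.create(model="whisper-1", file=audio)
    return result.text


def speak(text: str, out_path: str = "reply.mp3", voice: str = "alloy") -> str:
    """Text -> speech with a TTS model. Requires OPENAI_API_KEY to be set."""
    from openai import OpenAI
    client = OpenAI()
    response = client.audio.speech.create(model="tts-1", voice=voice, input=text)
    response.write_to_file(out_path)
    return out_path


if __name__ == "__main__" and os.environ.get("OPENAI_API_KEY"):
    # Only hits the network when an API key is actually configured.
    print(transcribe("question.mp3"))
```

Chaining `transcribe` and `speak` approximates what the app does behind the microphone icon: the user’s audio becomes text for the model, and the model’s reply becomes audio in one of the preset voices.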

The image-based features are equally intriguing. Users can show the chatbot images for various purposes, such as diagnosing a malfunctioning grill, meal planning based on fridge contents, or solving math problems from a picture. OpenAI is utilizing GPT-3.5 and GPT-4 to drive the image recognition capabilities. To use ChatGPT’s image-based functions, users can tap the photo button (iOS and Android require tapping the plus button first) to capture a new image or select an existing one. Multiple images can be discussed, and a drawing tool is available to focus on specific image details.
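Under the hood, an image question is just a chat message carrying both text and image parts. A minimal sketch of such a request payload, assuming the multi-part content format used by vision-capable chat models (the helper name and example URL are hypothetical, not from the article):

```python
import json


def build_image_message(question: str, image_url: str) -> dict:
    """Assemble a single user message that mixes a text part and an image
    reference, in the multi-part content format vision models accept."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }


# Example: ask why a grill won't start, pointing at a photo of it.
message = build_image_message(
    "Why won't my grill start?",
    "https://example.com/grill.jpg",  # hypothetical image location
)
print(json.dumps(message, indent=2))
```

Sending this message to a vision-capable model returns an ordinary text answer, and appending more image parts to the same `content` list mirrors the app’s multi-image conversations.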

In its announcement, OpenAI acknowledged the potential for misuse, such as bad actors impersonating voices to commit fraud. For that reason, it is initially limiting the new text-to-speech model to voice conversations in ChatGPT and to select partners with specific use cases.

Regarding images, OpenAI has worked with Be My Eyes, an app that connects blind and low-vision users with sighted volunteers over video calls. ChatGPT can discuss images in general terms, even when people appear in the background, but OpenAI limits its ability to analyze and make direct statements about people in images to protect privacy. The company has also published a paper on the safety properties of its image-based functionality, which it refers to as GPT-4 with vision.

ChatGPT is better at reading English text in images than text in other languages, and performs notably worse on languages written in non-Roman scripts. OpenAI recommends that non-English users avoid relying on ChatGPT to interpret text in images for now.

In another development, Spotify has partnered with OpenAI to put the voice technology to a different use. Spotify is piloting a tool called “Voice Translation for podcasters,” which translates podcasts into other languages while preserving the original speaker’s voice characteristics. It is starting with select English-language shows, and Spanish versions of some episodes are already available.

Expanding ChatGPT’s Horizons

OpenAI’s enhancements to ChatGPT represent a significant step forward, bringing voice conversations and image recognition capabilities to the chatbot. These features broaden its utility, enabling users to interact more naturally and obtain information from images. While OpenAI acknowledges the potential for misuse, it aims to strike a balance between innovation and safety.

Furthermore, Spotify’s collaboration with OpenAI highlights the versatility of voice-based technology. The ability to translate podcasts while preserving the speaker’s unique voice characteristics opens new avenues for content creators and audiences alike, fostering accessibility and diversity in the podcasting landscape.