GPT-4o's voice functionality has finally arrived, bringing the scenario of the sci-fi film "Her" closer to reality. Some users granted access in the limited rollout have begun trying the new feature; for now, OpenAI offers only four preset voices. Separately, the output limit of the new GPT-4o model has jumped to 64K tokens, 16 times the previous 4K.
Just before the end of July, GPT-4o's voice mode entered a limited rollout, with some ChatGPT Plus users already granted access. OpenAI says the advanced voice mode delivers a more natural, real-time conversation experience: users can interrupt freely, and the system can even perceive and respond to their emotions. All ChatGPT Plus users are expected to get the feature this fall.
Moreover, more powerful features like video and screen sharing will be launched later. Users will be able to turn on their cameras for "face-to-face" communication with ChatGPT.
Some users with early access have started exploring application scenarios for GPT-4o's voice mode. For instance, some are using it as a foreign-language coach to practice speaking; ChatGPT can correct their pronunciation of French words such as "croissant" and "baguette".
Meanwhile, GPT-4o's output limit has grown substantially. OpenAI quietly listed a test version of a new model, gpt-4o-64k-output-alpha, on its website, raising the output limit from the original 4,000 tokens to 64,000. That is enough to produce roughly four complete full-length movie scripts in a single response.
OpenAI explains that the delayed launch of GPT-4o's voice functionality was due to months of safety and quality testing: it tested the model's voice capabilities across 45 languages with more than 100 red-teamers. To protect user privacy, the system converses only in the four preset voices and blocks output in any other voice. Content filters have also been put in place to prevent the generation of violent or copyright-infringing material.
OpenAI plans to release a detailed report in early August, introducing GPT-4o's capabilities, limitations, and safety assessment results.
Users have shared various use cases for GPT-4o's voice mode, including beatboxing, telling jokes in different emotional tones, and imitating animal sounds. Tests show that ChatGPT's advanced voice mode responds with almost no delay and can accurately mimic a range of voices and accents.
Alongside the voice functionality, a GPT-4o variant with a much larger output limit has also launched. OpenAI announced it is providing the alpha version to testers, supporting up to 64K output tokens per request, roughly the length of a 200-page novel. Testers can access the long-output functionality through "gpt-4o-64k-output-alpha".
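The paragraph above names the model identifier testers use. As a minimal sketch, a request to it would plausibly follow the standard Chat Completions request shape; the prompt and token limit below are illustrative, and only the model name comes from the article:

```python
import json

# Hypothetical request body for the long-output alpha model.
# "gpt-4o-64k-output-alpha" is the identifier quoted in the article;
# the rest follows the usual Chat Completions schema.
payload = {
    "model": "gpt-4o-64k-output-alpha",
    "max_tokens": 64_000,  # the new ceiling: 16x the old 4K limit
    "messages": [
        {
            "role": "user",
            "content": "Translate the following document into French.",
        },
    ],
}

# Serialized, this is what would be POSTed to the API endpoint.
body = json.dumps(payload)
print(payload["max_tokens"] // 4_000)  # 16 -> 16x the previous 4K limit
```

Nothing here is sent over the network; the point is simply that the only change for a tester is the model name and the much larger `max_tokens` cap.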
The new model costs more: $6 per million input tokens and $18 per million output tokens. So although it can emit 16 times as many output tokens as standard GPT-4o, the output price is only $3 higher per million tokens.
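A quick back-of-the-envelope check using the prices quoted above (the request sizes are made-up examples, not figures from OpenAI):

```python
INPUT_PRICE_PER_M = 6.0    # USD per 1M input tokens (quoted above)
OUTPUT_PRICE_PER_M = 18.0  # USD per 1M output tokens (quoted above)

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of a single request at the alpha model's rates."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# Example: a 2,000-token prompt that uses the full 64K output window.
cost = request_cost(2_000, 64_000)
print(round(cost, 3))  # 1.164
```

In other words, maxing out the 64K output window costs a bit over a dollar per request, with the output side dominating the bill.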
Researcher Simon Willison notes that long outputs matter mainly for data-transformation tasks, such as translating a document from one language to another or extracting structured data from it. Until now, the model with the longest output was GPT-4o mini, at 16K tokens.
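Willison's point can be made concrete: in a transformation task like translation, the output is roughly as long as the input, so it is the output cap, not the context window, that limits document size. A toy sketch (the 40K document size is an illustrative assumption):

```python
def fits_in_output_window(doc_tokens: int, output_cap: int) -> bool:
    """For translation-style tasks the output is roughly the size of the
    input, so a document fits only if it stays under the output cap."""
    return doc_tokens <= output_cap

DOC_TOKENS = 40_000  # a long document, illustrative size

# Under the previous 16K ceiling the translation would be truncated;
# under the new 64K ceiling it completes in one request.
print(fits_in_output_window(DOC_TOKENS, 16_000))  # False
print(fits_in_output_window(DOC_TOKENS, 64_000))  # True
```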