OpenAI has announced a set of major features for developers. The headline is the ability to work with voice data directly through a public API, making it possible to build natural spoken-conversation applications in the style of ChatGPT's Advanced Voice Mode, something that was previously only available inside OpenAI's own products and that outside developers could not replicate.
Voice is handled through the new Realtime API, which connects to OpenAI's servers over WebSocket rather than conventional HTTP requests. Although designed primarily for speech, it also supports text chat. Separately, the existing Chat Completions API now accepts and returns audio with GPT-4o, though responses are not as low-latency as with the Realtime API.
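As a rough illustration of the WebSocket-based flow, the sketch below opens a Realtime API session and requests a text response. The endpoint path, "OpenAI-Beta" header, and event shapes follow the preview documentation and may change; the model alias and the `websockets` package usage are assumptions.

```python
# Minimal sketch: connect to the Realtime API over WebSocket and request a reply.
import asyncio
import json
import os

import websockets  # third-party package: pip install websockets

URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"

async def main():
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }
    # Note: older websockets releases call this parameter `extra_headers`.
    async with websockets.connect(URL, additional_headers=headers) as ws:
        # Ask the model to respond. Audio would arrive as base64-encoded chunks,
        # but plain text works over the same connection.
        await ws.send(json.dumps({
            "type": "response.create",
            "response": {"modalities": ["text"], "instructions": "Say hello."},
        }))
        async for message in ws:
            event = json.loads(message)
            print(event.get("type"))
            if event.get("type") == "response.done":
                break

asyncio.run(main())
```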
The voice-capable models, gpt-4o-realtime-preview and gpt-4o-audio-preview, cost roughly $0.06 per minute of audio input and $0.24 per minute of audio output. Another announcement is fine-tuning with images (vision fine-tuning), which tailors GPT-4o to visual tasks such as reading traffic signs, locating elements to click on screen, or generating website code from screenshots; Coframe used it for the last of these and reports a 26% improvement over the base model.
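For context, a single vision fine-tuning example is a chat-format record whose user turn includes an image. The sketch below writes one such JSONL line; the field names follow the standard chat message format, while the URL and labels are made up for illustration.

```python
# One hypothetical training example for vision fine-tuning, written as JSONL.
import json

example = {
    "messages": [
        {"role": "system", "content": "You identify traffic signs."},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What sign is this?"},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/sign.jpg"}},
            ],
        },
        {"role": "assistant", "content": "A stop sign."},
    ]
}

# Each line of the training file is one JSON object like the above.
with open("train.jsonl", "w") as f:
    f.write(json.dumps(example) + "\n")
```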
Vision fine-tuning uses the gpt-4o-2024-08-06 model, priced at $25 per million training tokens, with inference at $3.75 per million input tokens and $15 per million output tokens. Another new feature, Model Distillation, uses outputs from large models such as GPT-4o or o1-preview to fine-tune GPT-4o mini for a specialized task. Developers could already do this by hand; OpenAI now provides a service that stores the large model's outputs and offers evaluation metrics, making the workflow more convenient.
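A rough sketch of that distillation workflow is shown below: store GPT-4o outputs with `store=True`, then launch a fine-tune of GPT-4o mini on a file built from those stored completions. Exporting the stored completions into a training file is omitted here, and the file name, metadata values, and prompt are assumptions.

```python
# Sketch of the Model Distillation flow using the OpenAI Python SDK.
from openai import OpenAI

client = OpenAI()

# 1) Generate and store outputs from the larger model for later reuse.
completion = client.chat.completions.create(
    model="gpt-4o",
    store=True,  # keep the request/response pair as a stored completion
    metadata={"task": "support-replies"},  # hypothetical tag for filtering
    messages=[{"role": "user", "content": "Summarize our refund policy."}],
)

# 2) After exporting stored completions to a JSONL training file,
#    fine-tune the smaller model on it.
training_file = client.files.create(
    file=open("distillation_data.jsonl", "rb"), purpose="fine-tune"
)
job = client.fine_tuning.jobs.create(
    model="gpt-4o-mini-2024-07-18",
    training_file=training_file.id,
)
print(job.id)
```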
Prompt Caching, the final feature, cuts costs by recognizing repeated prefixes in prompts. A prompt must share at least 1,024 identical leading tokens for the cache to activate, and caching is applied in increments of 128 tokens. Cache hits are billed at half price for the cached portion, caches are shared only within the same organization, and unused caches are cleared after 5-10 minutes of inactivity.
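A back-of-the-envelope calculation makes the discount concrete. The sketch below assumes the 50% cached-input rate described above and a hypothetical $2.50 per million input tokens price; the token counts are made up.

```python
# Illustrative cost comparison for a request with a long cached prefix.
PRICE_PER_M = 2.50      # assumed uncached input price, USD per 1M tokens
CACHED_DISCOUNT = 0.5   # cached tokens billed at half price

prompt_tokens = 4_000   # total input tokens in this request
cached_tokens = 3_072   # cached prefix: a multiple of 128, at least 1,024

uncached_tokens = prompt_tokens - cached_tokens
full_cost = prompt_tokens * PRICE_PER_M / 1_000_000
cached_cost = (uncached_tokens * PRICE_PER_M
               + cached_tokens * PRICE_PER_M * CACHED_DISCOUNT) / 1_000_000
print(f"without cache: ${full_cost:.6f}, with cache: ${cached_cost:.6f}")
```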
TLDR: OpenAI has introduced developer features that give direct API access to voice data, enabling natural conversation applications. The gpt-4o-realtime-preview and gpt-4o-audio-preview models support voice interaction, and fine-tuning (including with images) covers specialized tasks. Model Distillation and Prompt Caching further optimize performance and reduce costs.