OpenAI has released whisper-large-v3-turbo, a speech-to-text model optimized by cutting the decoder from 32 layers down to 4, shrinking the parameter count from 1.55 billion to 809 million.
After pruning, the team retrained the model for two additional epochs and found that it regained quality close to the original large-v3, except in Thai and Cantonese, where performance dropped noticeably. On the Common Voice dataset, the error rate for Thai increased nearly fourfold.
The approach was adapted from the Distil-Whisper research, in which a smaller model is trained on the outputs of a larger one; OpenAI, however, chose to train on the full dataset instead.
whisper-large-v3-turbo is now the default model in the latest version of the openai-whisper package. Thai-language users should be cautious and may want to switch to another model.
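For readers who want to opt out of the new default, a minimal sketch of selecting a model explicitly with the openai-whisper package (the file name `speech.mp3` is a placeholder; `"turbo"` and `"large-v3"` are the model names accepted by `whisper.load_model`):

```python
def transcribe(path, model_name="turbo"):
    """Transcribe an audio file with openai-whisper.

    Requires `pip install openai-whisper`; the import is done lazily so
    this module loads even without the package installed.
    """
    import whisper

    # "turbo" selects whisper-large-v3-turbo in recent openai-whisper
    # releases; Thai or Cantonese users may prefer the full "large-v3".
    model = whisper.load_model(model_name)
    result = model.transcribe(path)
    return result["text"]

# Example usage (downloads model weights on first run):
#   text = transcribe("speech.mp3")                # default: turbo
#   text = transcribe("speech.mp3", "large-v3")    # full large-v3 model
```

Falling back to `"large-v3"` trades speed for the original model's accuracy on the affected languages.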
Source: OpenAI/Whisper
TLDR: OpenAI released whisper-large-v3-turbo, a smaller, faster Whisper model that retains near-original quality after retraining, though Thai-language users are advised to be cautious.