OpenAI Enhances Tokenizer to Support 20 Additional Languages, Optimizing Token Costs While Excluding Thai Language

The launch of OpenAI’s GPT-4o is not only about a more efficient model: the tokenizer has also been optimized for languages other than English. OpenAI highlighted 20 languages, including English, but Thai is not among them. For these languages, the same text is now encoded in fewer tokens, which makes using them with the model more efficient.

Among the optimized languages, Gujarati, spoken by approximately 55 million people, shows a token reduction of up to 4.4 times in a sample sentence. Arabic shows a reduction of 2 times and Vietnamese 1.5 times, and even widely used languages such as English, French, Spanish, and Portuguese now need about 1.1 times fewer tokens.
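To make the reduction factors above concrete, here is a minimal Python sketch of the arithmetic. The factors are the ones quoted above for sample sentences; the 1,000-token starting length and the helper name `tokens_after_reduction` are assumptions chosen purely for illustration, not measurements from any real tokenizer.

```python
import math

# Reduction factors quoted in the article for sample sentences.
REDUCTION_FACTORS = {
    "Gujarati": 4.4,
    "Arabic": 2.0,
    "Vietnamese": 1.5,
    "English": 1.1,
}


def tokens_after_reduction(tokens_before: int, factor: float) -> int:
    """Estimate the token count once a text needs `factor` times fewer tokens.

    Hypothetical helper: rounds up, because a partial token still
    occupies a whole token.
    """
    if tokens_before <= 0 or factor <= 0:
        raise ValueError("tokens_before and factor must be positive")
    return math.ceil(tokens_before / factor)


if __name__ == "__main__":
    baseline = 1000  # assumed token count under the old tokenizer
    for language, factor in REDUCTION_FACTORS.items():
        print(language, tokens_after_reduction(baseline, factor))
```

Real savings depend on the actual text, so these figures are only a rough guide to what a given factor means in practice.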

The number of tokens a language needs directly affects usage, because large language models measure their input and output in tokens, not characters. Fewer tokens per sentence therefore means more text fits in the same context window. With GPT-4’s tokenizer, Thai text uses approximately 2 times as many tokens as English.
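The effect on the context window can be sketched with simple integer arithmetic. In the example below, the 8,000-token window and the 20 tokens per English sentence are assumed values for illustration only; the Thai figure just applies the roughly 2 times ratio mentioned above.

```python
def sentences_that_fit(context_window: int, tokens_per_sentence: int) -> int:
    """Return how many whole sentences fit in a context window.

    Hypothetical helper: assumes every sentence costs the same number of tokens.
    """
    if context_window < 0 or tokens_per_sentence <= 0:
        raise ValueError("context_window must be >= 0 and tokens_per_sentence > 0")
    return context_window // tokens_per_sentence


if __name__ == "__main__":
    window = 8000                      # assumed context window size, in tokens
    english_tokens = 20                # assumed tokens per English sentence
    thai_tokens = english_tokens * 2   # about 2 times as many tokens, per the article

    print("English sentences:", sentences_that_fit(window, english_tokens))
    print("Thai sentences:", sentences_that_fit(window, thai_tokens))
```

Under these assumptions, the same window holds half as many Thai sentences as English ones, which is why a lower token count per sentence matters.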

It remains uncertain how the new tokenizer will affect Thai, as GPT-4o’s tokenizer is still being tested and is not yet available for use.

TLDR: OpenAI’s GPT-4o introduces a tokenizer optimized for 20 languages, reducing the number of tokens needed for the same text. Thai is not among them, and the effect on Thai remains unclear while the tokenizer is still in testing.
