Cerebras, the developer of specialized chips for running large-scale AI models, has unveiled the Cerebras Inference service. It serves the Llama 3.1 405B model at full 16-bit precision, delivering up to 969 tokens/s and a time to first token of just 240 ms, approaching real-time responsiveness.
Cerebras showcased the speed of its chips a month ago by demonstrating the Llama 3.1 70B model running at 2,100 tokens/s, though at the time it did not say when the service would become available. This time, Cerebras has stated that the 405B service will launch in the first quarter of 2025, priced at $6 per million input tokens and $12 per million output tokens (compared to Azure's $5.33 for input and $15 for output).
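To put the quoted rates side by side, here is a minimal back-of-envelope sketch using only the per-million-token prices above; the sample workload of 2,000 input and 500 output tokens is an illustrative assumption, not from the announcement.

```python
def request_cost(tokens_in: int, tokens_out: int,
                 in_rate: float, out_rate: float) -> float:
    """Cost in USD given per-million-token input/output rates."""
    return tokens_in / 1e6 * in_rate + tokens_out / 1e6 * out_rate

# Quoted rates (USD per million tokens): (input, output)
RATES = {"Cerebras": (6.00, 12.00), "Azure": (5.33, 15.00)}

# Hypothetical chat request: 2,000 input tokens, 500 output tokens
for name, (in_rate, out_rate) in RATES.items():
    print(f"{name}: ${request_cost(2_000, 500, in_rate, out_rate):.4f} per request")
# Cerebras: $0.0180 per request; Azure: $0.0182 per request
```

On this input-heavy mix the two come out nearly even; workloads that generate more output tokens than they consume would tilt further in Cerebras's favor.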
The service is currently in closed beta; interested users can sign up to join the waitlist.
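For those planning ahead, the sketch below shows what a call to the 405B model might look like, assuming the new service keeps Cerebras's existing OpenAI-compatible chat API; the base URL, model identifier, and `CEREBRAS_API_KEY` environment variable are assumptions, not confirmed details of the beta.

```python
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.cerebras.ai/v1",   # assumed endpoint
    api_key=os.environ["CEREBRAS_API_KEY"],  # assumed env var
)

response = client.chat.completions.create(
    model="llama-3.1-405b",  # hypothetical model identifier
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```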
Source: Cerebras
TLDR: Cerebras showcases its high-speed chips with the Cerebras Inference service, offering Llama 3.1 405B at up to 969 tokens/s at competitive pricing. Interested users can now sign up for the closed beta waitlist.