Cerebras, the developer of specialized chips for running large-scale AI models, has unveiled the Cerebras Inference service. It serves the Llama 3.1 405B model at full 16-bit precision, delivering up to 969 tokens/s and a time to first token of just 240 ms, approaching real-time responsiveness.
Cerebras showcased the speed of its chips a month ago by demonstrating the Llama 3.1 70B model running at 2,100 tokens/s, though at the time it did not say when the service would become available. This time, Cerebras has stated that the 405B service will launch in the first quarter of 2025, priced at $6 per million input tokens and $12 per million output tokens (compared to Azure's $5.33 for input and $15 for output).
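To put the quoted rates side by side, here is a minimal back-of-envelope sketch using only the per-million-token prices above; the sample workload of 2,000 input and 500 output tokens is an illustrative assumption, not from the announcement.

```python
def request_cost(tokens_in: int, tokens_out: int,
                 in_rate: float, out_rate: float) -> float:
    """Cost in USD given per-million-token input/output rates."""
    return tokens_in / 1e6 * in_rate + tokens_out / 1e6 * out_rate

# Quoted rates (USD per million tokens): (input, output)
RATES = {"Cerebras": (6.00, 12.00), "Azure": (5.33, 15.00)}

# Hypothetical chat request: 2,000 input tokens, 500 output tokens
for name, (in_rate, out_rate) in RATES.items():
    print(f"{name}: ${request_cost(2_000, 500, in_rate, out_rate):.4f} per request")
# Cerebras: $0.0180 per request; Azure: $0.0182 per request
```

On this input-heavy mix the two come out nearly even; workloads that generate more output tokens than they consume would tilt further in Cerebras's favor.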
The service is currently in closed beta; interested users can sign up to join the waitlist.
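For those planning ahead, the sketch below shows what a call to the 405B model might look like, assuming the new service keeps Cerebras's existing OpenAI-compatible chat API; the base URL, model identifier, and `CEREBRAS_API_KEY` environment variable are assumptions, not confirmed details of the beta.

```python
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.cerebras.ai/v1",   # assumed endpoint
    api_key=os.environ["CEREBRAS_API_KEY"],  # assumed env var
)

response = client.chat.completions.create(
    model="llama-3.1-405b",  # hypothetical model identifier
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```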
Source: Cerebras
TLDR: Cerebras showcases its high-speed chips with the Cerebras Inference service, offering Llama 3.1 405B at up to 969 tokens/s at competitive pricing. Interested users can now sign up for the closed beta waitlist.