Cerebras, a leading AI chip maker, has launched Cerebras Inference, a service that runs Llama 3.1 models at exceptional speed. Llama 3.1 70B reaches 450 tokens/s, and Llama 3.1 8B reaches up to 1,800 tokens/s, making it the fastest inference service available and surpassing Groq's previous record of 750 tokens/s.
The key to this performance is the Wafer Scale Engine, which places 44 GB of high-speed SRAM directly on the chip, connected to the processing units with up to 21 Petabytes/s of bandwidth; by comparison, NVIDIA's H100, despite its high-bandwidth memory, provides 3.3 Terabytes/s. Bandwidth is the decisive factor in inference because generating each token requires reading the weights of the entire model. For instance, running the 70B model at 1,000 tokens/s would demand roughly 140 Terabytes/s of memory bandwidth.
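A minimal back-of-the-envelope sketch of that bandwidth figure, assuming 16-bit weights and that every weight is read once per generated token (ignoring KV-cache traffic and any quantization):

```python
# Memory bandwidth needed for autoregressive inference, under the
# assumption that all model weights are streamed once per token.

params = 70e9            # Llama 3.1 70B parameter count
bytes_per_param = 2      # assumed fp16/bf16 weights
tokens_per_second = 1000 # target generation speed

bytes_per_token = params * bytes_per_param           # ~140 GB read per token
required_bandwidth = bytes_per_token * tokens_per_second

print(f"{required_bandwidth / 1e12:.0f} TB/s")       # -> 140 TB/s
```

At 140 TB/s, the requirement is far beyond the H100's 3.3 TB/s but well within the Wafer Scale Engine's 21 PB/s on-chip SRAM bandwidth, which is why keeping the weights in SRAM matters.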
Using the 70B model is priced at an estimated $0.60 per million tokens, with the limitation of an 8,000-token input context.
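For a sense of scale, a quick sketch of what that pricing implies (the single-request figure assumes a maxed-out 8,000-token input):

```python
# Rough cost at the quoted rate of $0.60 per million tokens.
price_per_million_tokens = 0.60

tokens_in_request = 8_000  # one full-context request (assumed)
cost = tokens_in_request / 1e6 * price_per_million_tokens

print(f"${cost:.4f} per 8K-token request")  # -> $0.0048
```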
TLDR: Cerebras introduces the Cerebras Inference service running Llama 3.1 models, with throughput that surpasses previous industry records.