Cerebras, a leading AI chip maker, has launched Cerebras Inference, a service that runs Llama 3.1 models at exceptional speed. Llama 3.1 70B reaches 450 tokens/s, and Llama 3.1 8B reaches up to 1,800 tokens/s, making it the fastest inference service available and surpassing Groq's previous record of 750 tokens/s.
The key to this performance is the Wafer Scale Engine, which places 44 GB of high-speed SRAM directly on the chip, connected to the processing units with up to 21 Petabytes/s of bandwidth; by comparison, NVIDIA's H100, despite its high-bandwidth memory, provides 3.3 Terabytes/s. Bandwidth is the decisive factor in inference because generating each token requires reading the weights of the entire model. For instance, running the 70B model at 1,000 tokens/s would demand roughly 140 Terabytes/s of memory bandwidth.
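A minimal back-of-the-envelope sketch of that bandwidth figure, assuming 16-bit weights and that every weight is read once per generated token (ignoring KV-cache traffic and any quantization):

```python
# Memory bandwidth needed for autoregressive inference, under the
# assumption that all model weights are streamed once per token.

params = 70e9            # Llama 3.1 70B parameter count
bytes_per_param = 2      # assumed fp16/bf16 weights
tokens_per_second = 1000 # target generation speed

bytes_per_token = params * bytes_per_param           # ~140 GB read per token
required_bandwidth = bytes_per_token * tokens_per_second

print(f"{required_bandwidth / 1e12:.0f} TB/s")       # -> 140 TB/s
```

At 140 TB/s, the requirement is far beyond the H100's 3.3 TB/s but well within the Wafer Scale Engine's 21 PB/s on-chip SRAM bandwidth, which is why keeping the weights in SRAM matters.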
Using the 70B model is priced at an estimated $0.60 per million tokens, with the limitation of an 8,000-token input context.
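For a sense of scale, a quick sketch of what that pricing implies (the single-request figure assumes a maxed-out 8,000-token input):

```python
# Rough cost at the quoted rate of $0.60 per million tokens.
price_per_million_tokens = 0.60

tokens_in_request = 8_000  # one full-context request (assumed)
cost = tokens_in_request / 1e6 * price_per_million_tokens

print(f"${cost:.4f} per 8K-token request")  # -> $0.0048
```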
TLDR: Cerebras introduces the Cerebras Inference service running Llama 3.1 models, with throughput that surpasses previous industry records.