MLCommons has released the results of MLPerf Training 3.1, its benchmark suite for machine learning training. In this round, the spotlight was on training large language models (LLMs). NVIDIA once again showcased the fastest training machine: NVIDIA Eos, powered by 10,752 NVIDIA H100 GPUs. This system trained the GPT-3 benchmark in just 3.9 minutes, a 2.8x improvement over NVIDIA's previous-round submission at roughly triple the chip count, which works out to a scaling efficiency of about 93%.
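The 93% figure follows directly from those two numbers. A back-of-envelope sketch, assuming the baseline is NVIDIA's previous-round submission at roughly one third the GPU count:

```python
# Sanity-check of the ~93% scaling-efficiency claim.
# Assumption: the comparison baseline is NVIDIA's prior-round
# 3,584-GPU Eos submission (one third of the current 10,752 GPUs).
prev_gpus, curr_gpus = 3_584, 10_752
speedup = 2.8                      # reported round-over-round speedup
scale_up = curr_gpus / prev_gpus   # 3.0x more GPUs
efficiency = speedup / scale_up    # fraction of ideal linear speedup
print(f"scale-up: {scale_up:.1f}x, scaling efficiency: {efficiency:.0%}")
# -> scale-up: 3.0x, scaling efficiency: 93%
```

Perfectly linear scaling would have yielded a 3.0x speedup; achieving 2.8x at this cluster size is what the 93% refers to.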
What makes this round special is that Azure also submitted a system with the same specifications. The Azure ND machine finished only about 2% behind NVIDIA's, showing that a cluster of this size is truly viable in the cloud.
On the Google side, they submitted results on the TPU v5e, demonstrating high-accuracy quantization techniques based on INT8 arithmetic. Although training the GPT-3 benchmark on a 4,096-chip TPU v5e pod took 44.68 minutes, Google argued that the performance-to-cost ratio of the TPU v5e is significantly better: rental pricing is just $1.20 per chip per hour. Alongside the results, Google also announced general availability (GA) of TPU v5e chip rental.
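The cost argument is easy to check from the figures above. A rough estimate, using only the quoted chip count, run time, and hourly price (real-world cost would add overhead, storage, and networking):

```python
# Rough rental-cost estimate for the reported TPU v5e GPT-3 run,
# using only the numbers quoted in the article.
chips = 4_096
minutes = 44.68
price_per_chip_hour = 1.20   # USD, as quoted
cost = chips * (minutes / 60) * price_per_chip_hour
print(f"estimated run cost: ${cost:,.0f}")
# -> estimated run cost: $3,660
```

Roughly $3,700 of on-demand compute for the benchmark run is the kind of number behind Google's performance-per-dollar pitch.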
Intel, for its part, again submitted results with the Gaudi2 chip, this time speeding up training by switching to the FP8 data type. With a total training time of 153.58 minutes on a 384-chip Gaudi2 system, Gaudi2's cost-effectiveness makes it a realistic option for organizations to purchase and operate themselves.
TL;DR: MLCommons released the MLPerf Training 3.1 benchmark results for training large language models. NVIDIA's machine was the fastest, Azure came in slightly slower but proved such clusters viable in the cloud, Google showcased a cost-efficient TPU v5e with high-accuracy INT8 quantization, and Intel's Gaudi2 gained speed from FP8 while remaining cost-effective.