MLPerf is a benchmark suite that evaluates how well computers and accelerators handle machine learning tasks. Version 4.1 specifically assesses the training of generative AI models for text and image generation. In the latest round of results, only two major competitors, NVIDIA and Google, stood out.
NVIDIA showcased fine-tuning results for Llama 2 70B on a single DGX B200 server equipped with eight B200-SXM GPUs (180 GB of memory each) and Xeon Platinum 8570 CPUs. The fine-tuning run completed in 12.958 minutes, versus roughly 24 minutes on an H200-based system. Training GPT-3 on eight DGX B200 machines, meanwhile, finished in 193.738 minutes, making it the smallest cluster in this test batch.
Google introduced its sixth-generation TPU, named Trillium, highlighting model training that is up to 45% more cost-effective than TPUv5p, although the raw performance gain was not explicitly stated. In the GPT-3 test, a cluster of 2048 TPUv5p accelerator cards finished training in 29.616 minutes, while a Trillium cluster of the same size completed it in 27.330 minutes.
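To put the reported times in perspective, the relative speedups can be computed directly from the figures above. This is a minimal illustrative sketch, not part of any vendor's tooling; note that the ~24-minute H200 figure is approximate, so the B200 comparison is only indicative.

```python
def speedup(baseline_minutes: float, new_minutes: float) -> float:
    """Return how many times faster the new system finished the same workload."""
    return baseline_minutes / new_minutes

# B200 vs. H200 on Llama 2 70B fine-tuning (single 8-GPU server);
# 24.0 is the approximate H200 time quoted in the article.
b200_vs_h200 = speedup(24.0, 12.958)

# Trillium vs. TPUv5p on GPT-3 (2048-chip clusters)
trillium_vs_v5p = speedup(29.616, 27.330)

print(f"B200 vs H200: {b200_vs_h200:.2f}x")          # roughly 1.85x
print(f"Trillium vs TPUv5p: {trillium_vs_v5p:.2f}x")  # roughly 1.08x
```

The gap between the two comparisons is striking: the generational jump on NVIDIA's side is far larger than the chip-for-chip gain Google reported, which is consistent with Google emphasizing cost-efficiency rather than raw speed.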
Source – MLCommons, NVIDIA, Google
TLDR: MLPerf testing assesses computer and accelerator capabilities for machine learning, with NVIDIA and Google showcasing impressive training results for generative AI models like Llama 2 70B and GPT3. NVIDIA demonstrated faster training times using DGX B200 setups, while Google’s Trillium TPU model delivered cost-effective training solutions with competitive performance metrics.