Running LLMs with OpenAI Triton Instead of CUDA: Reaching Up to 82% of CUDA Performance

A team of engineers from IBM and Meta reported on an experiment in which they replaced the CUDA kernels at the core of LLM inference in PyTorch with kernels written in OpenAI's Triton language, finding that performance approaches that of CUDA.

OpenAI introduced the Triton project in 2021 with the aim of developing a language that makes it easier for programmers to write GPU kernels directly. Beyond replacing the general CUDA kernels, the team also had to choose a Triton-based Flash Attention kernel to stand in for cuDNN Flash Attention when running LLM models, and found that AMD's Triton Flash Attention kernel worked correctly in all modes.
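To give a sense of the programming model the article describes, here is a minimal Triton kernel sketch, adapted from the style of Triton's introductory vector-addition tutorial. The names `add_kernel` and `add` are illustrative and do not come from the IBM/Meta experiment.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance computes one BLOCK_SIZE-wide slice of the output.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard the final, possibly partial, block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)


def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # Launch a 1D grid with enough program instances to cover all elements.
    out = torch.empty_like(x)
    n = out.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```

Production LLM kernels such as Flash Attention are far more involved, but they follow the same pattern: Python-level code that Triton compiles to run on the GPU, with no CUDA C++ required.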

With CUDA removed entirely in this way, overall LLM performance reaches 76-78% of the CUDA baseline on A100 chips and 62-82% on H100 chips.

CUDA is a key selling point of NVIDIA chips: it assures developers that a wide range of AI models will run efficiently and integrate smoothly, even when other vendors offer cheaper hardware.

Source: PyTorch

TLDR: Engineers replaced CUDA with OpenAI's Triton language for running LLM models in PyTorch and found performance approaching CUDA's, showcasing the potential of alternative technologies in the AI field.
