Running LLMs with OpenAI Triton Instead of CUDA: Reaching Up to 82% of CUDA Performance

A team of engineers from IBM and Meta reported on an experiment in which they replaced the CUDA kernels at the core of LLM inference in PyTorch with kernels written in OpenAI's Triton language, finding that performance approaches that of CUDA.

OpenAI introduced the Triton project in 2021 with the aim of developing a language that makes it easier for programmers to write GPU kernels directly. Beyond replacing the general CUDA kernels, the team also had to choose a Triton-based Flash Attention kernel to stand in for cuDNN Flash Attention when running LLM models, and found that AMD's Triton Flash Attention kernel worked correctly in all modes.
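To give a sense of the programming model the article describes, here is a minimal Triton kernel sketch, adapted from the style of Triton's introductory vector-addition tutorial. The names `add_kernel` and `add` are illustrative and do not come from the IBM/Meta experiment.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance computes one BLOCK_SIZE-wide slice of the output.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard the final, possibly partial, block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)


def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # Launch a 1D grid with enough program instances to cover all elements.
    out = torch.empty_like(x)
    n = out.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```

Production LLM kernels such as Flash Attention are far more involved, but they follow the same pattern: Python-level code that Triton compiles to run on the GPU, with no CUDA C++ required.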

With CUDA removed entirely in this way, overall LLM performance reaches 76-78% of the CUDA baseline on A100 chips and 62-82% on H100 chips.

CUDA is a key selling point of NVIDIA chips: it assures developers that a wide range of AI models will run efficiently and integrate smoothly, even when other vendors offer cheaper hardware.

Source: PyTorch

TLDR: Engineers replaced CUDA with OpenAI's Triton language for running LLM models in PyTorch and found performance approaching CUDA's, showcasing the potential of alternative technologies in the AI field.
