Running LLMs with OpenAI Triton Instead of CUDA: Reaching up to 82% of CUDA Performance
A team of engineers from IBM and Meta has reported on an experiment in which the core engine for running LLMs in PyTorch was switched from CUDA to Triton...