DeepSeek recently held an open-source week, releasing a suite of infrastructure software it uses for model development and serving. Several of these components have drawn strong interest because they substantially boost speed and make better use of memory and cache. Noteworthy projects include:
FlashMLA: an MLA decoding kernel optimized for NVIDIA Hopper GPUs, building on ideas from FlashAttention. vLLM has already adopted it, reporting up to a 3x throughput improvement for DeepSeek models and roughly 10x more tokens held in memory.
DeepEP: a cross-GPU communication library tailored to running Mixture-of-Experts (MoE) models, focused on minimizing the latency of expert-parallel communication.
DeepGEMM: a CUDA library for FP8 matrix multiplication, providing up to a 2.7x speedup over tuned baselines without regressing in the worst case (see the sketch after this list for the block-scaling idea behind FP8 GEMM).
EPLB: an expert-parallel load balancer for serving MoE models, redistributing experts across GPUs according to how heavily each one is used.
DualPipe: a pipeline-parallelism scheduling algorithm first introduced with DeepSeek-V3, now released as a standalone library.
3FS: a distributed file system for efficient data access in AI workloads, designed to extract maximum performance from SSDs.
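To make the FP8 idea behind DeepGEMM more concrete, here is a minimal NumPy sketch of block-scaled low-precision matrix multiplication: each 128-wide block along the inner dimension gets its own scale so values fit FP8's narrow range, the multiply happens in the reduced precision, and partial products are accumulated in float32. This is only an emulation of the concept under simplified assumptions (one scale per 128-wide block, crude mantissa rounding); it is not DeepGEMM's CUDA implementation or API.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 e4m3

def quantize_fp8_like(x, mantissa_bits=3):
    """Crude FP8 emulation: keep sign and exponent, round the mantissa to 3 bits."""
    m, e = np.frexp(x)                      # x = m * 2**e with 0.5 <= |m| < 1
    step = 2.0 ** (mantissa_bits + 1)
    return np.ldexp(np.round(m * step) / step, e)

def fp8_blockwise_gemm(A, B, block=128):
    """Emulated FP8 GEMM with one scale per 128-wide block of the inner (K) dim.

    Real kernels use finer per-tile scales and hardware FP8; float32 is used
    here only to accumulate the rescaled partial products.
    """
    m, k = A.shape
    _, n = B.shape
    out = np.zeros((m, n), dtype=np.float32)
    for s in range(0, k, block):
        a_blk = A[:, s:s + block]
        b_blk = B[s:s + block, :]
        # Scale each block so its largest magnitude maps to the FP8 maximum.
        a_scale = np.abs(a_blk).max() / FP8_E4M3_MAX + 1e-12
        b_scale = np.abs(b_blk).max() / FP8_E4M3_MAX + 1e-12
        a_q = quantize_fp8_like(a_blk / a_scale)
        b_q = quantize_fp8_like(b_blk / b_scale)
        out += (a_q @ b_q).astype(np.float32) * np.float32(a_scale * b_scale)
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((64, 512)).astype(np.float32)
    B = rng.standard_normal((512, 32)).astype(np.float32)
    ref = A @ B
    approx = fp8_blockwise_gemm(A, B)
    rel_err = np.abs(approx - ref).max() / np.abs(ref).max()
    print(f"max relative error vs float32 GEMM: {rel_err:.4f}")
```

The per-block scales are what let FP8's narrow dynamic range track the data closely enough for the result to stay accurate.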
DeepSeek concluded the event by publishing an overview of its high-performance inference system architecture, which runs on costly NVIDIA H800 GPUs (rented at roughly $2 per GPU-hour); discounted off-peak pricing applies from midnight to 9 AM. If every token were billed at DeepSeek-R1 prices, the system could in theory generate up to $562,027 in daily revenue (server costs excluded), a 545% cost-profit margin. In practice revenue is lower, since many users stick with the free service or the cheaper DeepSeek-V3, and traffic drops during off-peak hours.
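For readers who want to see how the quoted numbers fit together, the snippet below backs out the implied daily server cost from the theoretical revenue and the 545% margin; the two input figures come from the paragraph above, and interpreting 545% as profit divided by cost is an assumption.

```python
# Figures quoted above; interpreting 545% as profit/cost is an assumption.
theoretical_daily_revenue = 562_027   # USD, all tokens billed at R1 pricing
margin = 5.45                         # 545% cost-profit margin

# revenue = cost * (1 + margin)  =>  cost = revenue / (1 + margin)
implied_daily_cost = theoretical_daily_revenue / (1 + margin)
print(f"Implied daily server cost: ~${implied_daily_cost:,.0f}")  # roughly $87k
```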
TLDR: DeepSeek open-sourced a set of high-performance software components and documented the system architecture it uses to serve its AI models, with a focus on efficiency and cost-effectiveness.