NVIDIA has announced a collaboration with Hugging Face to offer a pay-as-you-go service for running models in real time on NVIDIA DGX Cloud servers with H100 chips.
Users who wish to access this service must be Hugging Face Enterprise members (at $20 per month). They can access the models through the “NVIDIA NIM Enterprise” option and call them via the openai library in Python. Previously, Hugging Face had offered H100 chips for model training.
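Since the service is called through the openai library, the endpoint is presumably OpenAI-compatible. The sketch below shows what the equivalent raw HTTP request might look like using only the standard library; the base URL and model identifier are assumptions for illustration, not values confirmed by the article, so check Hugging Face's documentation for the real ones.

```python
import json
import os
import urllib.request

# Assumed endpoint and model name for illustration only.
BASE_URL = "https://huggingface.co/api/integrations/dgx/v1"
MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"


def build_chat_request(prompt: str, model: str = MODEL) -> dict:
    """Build an OpenAI-compatible chat-completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 100,
    }


def call_nim(prompt: str, token: str) -> dict:
    """POST the payload to the (assumed) DGX Cloud endpoint."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Requires a Hugging Face token with Enterprise access.
    hf_token = os.environ.get("HF_TOKEN")
    if hf_token:
        print(call_nim("Hello!", hf_token))
```

With the openai library, the same call would pass the assumed base URL as `base_url` and the Hugging Face token as `api_key` when constructing the client.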
NVIDIA and Hugging Face charge based on actual GPU time at $8.25 per hour per H100, not on token count. Typically, the Llama 3 8B model takes about 1 second on a single H100 for a request with 500 input tokens and 100 output tokens, costing approximately $0.0023. The Llama 3 70B model, however, requires 4 H100 chips and about 2 seconds per request, resulting in a cost of approximately $0.0184.
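The per-request figures above follow directly from the hourly rate. A minimal calculation, assuming billing is simply GPU count times seconds at the quoted rate:

```python
H100_RATE = 8.25  # USD per H100 per hour, as quoted in the article


def request_cost(gpus: int, seconds: float, rate: float = H100_RATE) -> float:
    """Cost of one inference request billed by GPU-time."""
    return gpus * seconds * rate / 3600


# Llama 3 8B: 1 H100 for ~1 second -> ~$0.0023
cost_8b = request_cost(gpus=1, seconds=1)

# Llama 3 70B: 4 H100s for ~2 seconds -> ~$0.018
cost_70b = request_cost(gpus=4, seconds=2)
```

The exact 70B figure is $8.25 × 8 GPU-seconds / 3600 ≈ $0.0183, which the article rounds to $0.0184.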
The available models are limited to popular ones such as Mixtral 8x22B, Llama 3.1, Mistral 7B, and Llama 3. Users already on Hugging Face who want to experiment further may find this option suitable, while others may find using a cloud provider directly more cost-effective.
Source: Hugging Face
TLDR: NVIDIA and Hugging Face collaborate to provide a pay-as-you-go service for running models in real time on NVIDIA DGX Cloud servers with H100 chips. Users can access models through the “NVIDIA NIM Enterprise” option, priced at $8.25 per GPU-hour regardless of token count. Popular models such as Mixtral 8x22B and Llama 3 are available, catering to users already on Hugging Face who want to experiment further.