Deploy your own Large Language Model (LLM) using NVIDIA NIM - a powerful service that automatically optimizes your model for maximum performance on NVIDIA GPUs.
NVIDIA NIM is a containerized solution that:
- Automatically optimizes your LLM for the best performance
- Handles all the complexity of model deployment and serving
- Provides an OpenAI-compatible API for easy integration
- Supports multiple model formats without manual conversion
This repository contains three hands-on notebooks showing different deployment approaches:
Best for: Quick deployment of popular models from HuggingFace
- Deploy models directly from HuggingFace with a single command
- Download and deploy from local storage for offline use
- Choose between TensorRT-LLM and vLLM backends
Best for: Maximum performance with custom optimization
- Convert HuggingFace models to TensorRT-LLM format
- Compile optimized engines for production deployment
- Fine-tune performance parameters
Best for: Memory-efficient deployment with quantized models
- Deploy pre-quantized GGUF models (4-bit, 8-bit, etc.)
- Reduce GPU memory requirements significantly
- Handle external configuration requirements
Important
NVIDIA cannot guarantee the security of any models hosted on non-NVIDIA systems such as HuggingFace. Malicious or insecure models can result in serious security risks up to and including full remote code execution. We strongly recommend that before attempting to load it you manually verify the safety of any model not provided by NVIDIA, through such mechanisms as a) ensuring that the model weights are serialized using the safetensors format, b) conducting a manual review of any model or inference code to ensure that it is free of obfuscated or malicious code, and c) validating the signature of the model, if available, to ensure that it comes from a trusted source and has not been modified.
- What is NIM? NVIDIA Inference Microservice (NIM) is a containerized solution that automatically optimizes LLM deployment
- Why use NIM? Handles model optimization, backend selection, and scaling automatically
- What you need to know: Basic Docker commands and Python. No prior LLM deployment experience required.
docker pull nvcr.io/nim/nvidia/llm-nim:latestdocker run -it --rm --name=my-first-nim \
--runtime=nvidia --gpus all \
-p 8000:8000 \
-e NIM_MODEL_NAME="hf://Qwen/Qwen2.5-0.5B" \
nvcr.io/nim/nvidia/llm-nim:latestcurl http://localhost:8000/v1/completions \
-d '{"model": "Qwen/Qwen2.5-0.5B", "prompt": "Hello world"}'This blueprint demonstrates how to deploy almost any Large Language Model using NVIDIA NIM (NVIDIA Inference Microservice). NIM provides a streamlined way to deploy and serve LLMs with optimized performance, handling model analysis, backend selection, and configuration automatically.
Key capabilities:
- Deploy models directly from Hugging Face Hub
- Deploy models from local storage
- Multiple inference backends (TensorRT-LLM, vLLM)
- Customizable deployment parameters
- OpenAI-compatible REST API
- NVIDIA NIM LLM Container:
nvcr.io/nim/nvidia/llm-nim:latest - Supported Inference Backends:
- TensorRT-LLM (high performance)
- vLLM (versatile deployment)
- Container Runtime: Docker with NVIDIA Container Runtime
- Model Sources: Hugging Face Hub, Local filesystem
- GPU: NVIDIA GPU with compute capability 7.0+ (V100, T4, A10, A100, H100, etc.)
- GPU Memory: Varies by model size (minimum 8GB for small models, 40GB+ for large models)
- System Memory: 16GB+ RAM recommended
- Storage: Sufficient disk space for model weights and cache
- NVIDIA Driver: Version 535+ required
The deployed NIM service provides OpenAI-compatible REST API endpoints:
- Health Check:
GET /v1/health/ready - Completions:
POST /v1/completions - Chat Completions:
POST /v1/chat/completions - Models:
GET /v1/models
Example completion request:
{
"model": "model-name",
"prompt": "Your prompt here",
"max_tokens": 250,
"temperature": 0.7
}- NVIDIA Developer License: Access to NVIDIA Container Registry
- NGC API Key: Required for pulling NIM containers (Generate here)
- Hugging Face Token: Required for downloading models from Hugging Face Hub
- Docker: With NVIDIA Container Runtime installed