Production-tested Kubernetes deployment recipes for LLM inference using NVIDIA Dynamo.
Prerequisites: This guide assumes you have already installed the Dynamo Kubernetes Platform. If not, follow the Kubernetes Deployment Guide first.
These recipes showcase Dynamo performance features with benchmark results; each includes both a baseline and an optimized deployment configuration:
| Model | Framework | Configuration | GPUs | Features |
|---|---|---|---|---|
| Qwen3-32B | vLLM | Disagg + KV-Router | 16x H200 | Disaggregated Serving + KV-Aware Routing — benchmark comparison with real-world Mooncake traces |
| DeepSeek-V3.2-NVFP4 | TensorRT-LLM | Agg + Disagg WideEP | 32x GB200 | Disaggregated Serving + KV-Aware Routing — benchmark comparison with Mooncake-based synthetic coding trace |
| Qwen3-VL-30B-A3B-FP8 | vLLM | Agg + Embedding Cache | 1x GB200 | Multimodal Embedding Cache — benchmark comparison showing +16% throughput, -28% TTFT |
These recipes demonstrate aggregated and disaggregated serving:
GAIE Column: Indicates whether the recipe includes integration with the Gateway API Inference Extension (GAIE) — a Kubernetes SIG project that extends the Gateway API for AI inference workloads, providing load balancing, model routing, and request management.
| Model | Framework | Mode | GPUs | Deployment | Benchmark | Notes | GAIE |
|---|---|---|---|---|---|---|---|
| Llama-3-70B | vLLM | Aggregated | 4x H100/H200 | ✅ | ✅ | FP8 dynamic quantization | ✅ |
| Llama-3-70B | vLLM | Disagg (Single-Node) | 8x H100/H200 | ✅ | ✅ | Prefill + Decode separation | ❌ |
| Llama-3-70B | vLLM | Disagg (Multi-Node) | 16x H100/H200 | ✅ | ✅ | 2 nodes, 8 GPUs each | ❌ |
| Qwen3-32B-FP8 | TensorRT-LLM | Aggregated | 2x H100/H200/A100 | ✅ | ✅ | FP8 quantization | ❌ |
| Qwen3-32B-FP8 | TensorRT-LLM | Disaggregated | 8x H100/H200/A100 | ✅ | ✅ | Prefill + Decode separation | ❌ |
| Qwen3-235B-A22B-FP8 | TensorRT-LLM | Aggregated | 16x H100/H200 | ✅ | ✅ | MoE model, TP4×EP4 | ❌ |
| Qwen3-235B-A22B-FP8 | TensorRT-LLM | Disaggregated | 16x H100/H200 | ✅ | ✅ | MoE model, Prefill + Decode | ❌ |
| GPT-OSS-120B | TensorRT-LLM | Aggregated | 4x GB200 | ✅ | ✅ | Blackwell only, WideEP | ❌ |
| DeepSeek-R1 | SGLang | Disagg WideEP | 16x H200 | ✅ | ❌ | TP=8, single-node. Use `model-download-sglang.yaml` | ❌ |
| DeepSeek-R1 | SGLang | Disagg WideEP | 32x H200 | ✅ | ❌ | TP=16, multi-node. Use `model-download-sglang.yaml` | ❌ |
| DeepSeek-R1 | TensorRT-LLM | Disagg WideEP (GB200) | 36x GB200 | ✅ | ✅ | Multi-node: 8 decode + 1 prefill nodes | ❌ |
| DeepSeek-R1 | vLLM | Disagg DEP16 | 32x H200 | ✅ | ❌ | Multi-node, data-expert parallel | ❌ |
Legend:
- Deployment: ✅ = Complete `deploy.yaml` manifest available
- Benchmark: ✅ = Includes `perf.yaml` for running AIPerf benchmarks
These recipes demonstrate functional deployments with Dynamo features, but have not yet been performance-tuned or paired with benchmark manifests.
| Model | Framework | Mode | GPUs | Deployment | Notes |
|---|---|---|---|---|---|
| Nemotron-3-Super-FP8 | vLLM | Aggregated | 4x H100/H200 | ✅ | TP=4, KV-aware routing |
| Nemotron-3-Super-FP8 | SGLang | Aggregated | 4x H100/H200 | ✅ | TP=4, KV-aware routing, 1.0+ |
| Nemotron-3-Super-FP8 | TensorRT-LLM | Disaggregated | 4x H100/H200 | ✅ | TP=2 prefill/decode split, UCX KV transfer |
| Nemotron-3-Super-FP8 | SGLang | Disaggregated | 4x H100/H200 | ✅ | TP=2 prefill/decode split, nixl KV transfer, 1.0+ |
| Kimi-K2.5 (Baseten) | TensorRT-LLM | Aggregated | 8x B200 | ✅ | Text only — MoE model, TP8×EP8, reasoning + tool calling |
These recipes are under active development and may require additional setup steps (e.g., container patching). They are functional but not yet fully validated for production use.
| Model | Framework | Mode | GPUs | Deployment | Notes |
|---|---|---|---|---|---|
| nvidia/Kimi-K2.5-NVFP4 | TensorRT-LLM | Aggregated | 8x B200 | ✅ | Text only — MoE model, TP8×EP8, reasoning + tool calling. Requires container patch. Vision input not yet functional with the patch. |
Each complete recipe follows this standard structure:
```text
<model-name>/
├── README.md (optional)         # Model-specific deployment notes
├── model-cache/
│   ├── model-cache.yaml         # PersistentVolumeClaim for model storage
│   └── model-download.yaml      # Job to download model from HuggingFace
└── <framework>/                 # vllm, sglang, or trtllm
    └── <deployment-mode>/       # agg, disagg, disagg-single-node, etc.
        ├── deploy.yaml          # Complete DynamoGraphDeployment manifest
        └── perf.yaml (optional) # AIPerf benchmark job
```
1. Dynamo Platform Installed
The recipes require the Dynamo Kubernetes Platform to be installed. Follow the installation guide:
- Kubernetes Deployment Guide - Quickstart (~10 minutes)
- Detailed Installation Guide - Advanced options
2. GPU Cluster Requirements
Ensure your cluster has:
- GPU nodes matching recipe requirements (see table above)
- GPU operator installed
- Appropriate GPU drivers and container runtime
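As a quick sanity check before deploying, you can confirm that GPU capacity is visible to the scheduler. This sketch assumes the standard `nvidia.com/gpu` resource name advertised by the NVIDIA device plugin, and the default `gpu-operator` namespace; adjust both for your environment:

```shell
# List allocatable GPUs per node (assumes the nvidia.com/gpu resource name)
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'

# Confirm the GPU operator pods are running (namespace may differ in your cluster)
kubectl get pods -n gpu-operator
```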
3. HuggingFace Access
Configure authentication to download models:
```bash
export NAMESPACE=your-namespace
kubectl create namespace ${NAMESPACE}

# Create HuggingFace token secret
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN="your-token-here" \
  -n ${NAMESPACE}
```
4. Storage Configuration
Update the `storageClassName` in `<model>/model-cache/model-cache.yaml` to match your cluster:
```bash
# Find your storage class name
kubectl get storageclass

# Edit the model-cache.yaml file and update:
# spec:
#   storageClassName: "your-actual-storage-class"
```
Step 1: Download Model
```bash
cd recipes

# Update storageClassName in model-cache.yaml first!
kubectl apply -f <model>/model-cache/ -n ${NAMESPACE}

# Wait for download to complete (may take 10-60 minutes depending on model size)
kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=6000s

# Monitor progress
kubectl logs -f job/model-download -n ${NAMESPACE}
```
Step 2: Deploy Service
Update the image in `<model>/<framework>/<mode>/deploy.yaml`.
```bash
kubectl apply -f <model>/<framework>/<mode>/deploy.yaml -n ${NAMESPACE}

# Check deployment status
kubectl get dynamographdeployment -n ${NAMESPACE}

# Check pod status
kubectl get pods -n ${NAMESPACE}

# Wait for pods to be ready
kubectl wait --for=condition=ready pod -l nvidia.com/dynamo-graph-deployment-name=<deployment-name> -n ${NAMESPACE} --timeout=600s
```
Step 3: Test Deployment
```bash
# Port forward to access the service locally
kubectl port-forward svc/<deployment-name>-frontend 8000:8000 -n ${NAMESPACE}

# In another terminal, test the endpoint
curl http://localhost:8000/v1/models

# Send a test request
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<model-name>",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 50
  }'
```
Step 4: Run Benchmark (Optional)
```bash
# Only if perf.yaml exists in the recipe directory
kubectl apply -f <model>/<framework>/<mode>/perf.yaml -n ${NAMESPACE}

# Monitor benchmark progress
kubectl logs -f job/<benchmark-job-name> -n ${NAMESPACE}

# View results after completion
kubectl logs job/<benchmark-job-name> -n ${NAMESPACE} | tail -50
```
Complete example: Llama-3-70B with vLLM (aggregated):
```bash
export NAMESPACE=dynamo-demo
kubectl create namespace ${NAMESPACE}

# Create HF token secret
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN="your-token" \
  -n ${NAMESPACE}

# Deploy
cd recipes
kubectl apply -f llama-3-70b/model-cache/ -n ${NAMESPACE}
kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=6000s
kubectl apply -f llama-3-70b/vllm/agg/deploy.yaml -n ${NAMESPACE}

# Test
kubectl port-forward svc/llama3-70b-agg-frontend 8000:8000 -n ${NAMESPACE}
```
For Llama-3-70B with vLLM (Aggregated), an example of integration with the Inference Gateway is provided.
First, deploy the Dynamo Graph per instructions above.
Then follow Deploy Inference Gateway Section 2 to install GAIE.
Update the `containers.epp.image` in the deployment file, i.e. `llama-3-70b/vllm/agg/gaie/k8s-manifests/epp/deployment.yaml`. It should match the release tag and follow the format `nvcr.io/nvidia/ai-dynamo/frontend:<version>`, e.g. `nvcr.io/nvidia/ai-dynamo/frontend:0.9.0`.
The recipe assumes you are using the Kubernetes discovery backend and sets the `DYN_DISCOVERY_BACKEND` environment variable in the EPP deployment. To use etcd instead, enable the lines below and remove the `DYN_DISCOVERY_BACKEND` env var:
```yaml
- name: ETCD_ENDPOINTS
  value: "dynamo-platform-etcd.$(PLATFORM_NAMESPACE):2379" # update dynamo-platform to appropriate namespace
```
```bash
export DEPLOY_PATH=llama-3-70b/vllm/agg/
# DEPLOY_PATH=<model>/<framework>/<mode>/
kubectl apply -R -f "$DEPLOY_PATH/gaie/k8s-manifests" -n "$NAMESPACE"
```
See `deepseek-r1/trtllm/disagg/wide_ep/gb200/deploy.yaml` for the complete multi-node WideEP configuration.
Each `deploy.yaml` contains:
- ConfigMap: Engine-specific configuration (embedded in the manifest)
- DynamoGraphDeployment: Kubernetes resource definitions
- Resource limits: GPU count, memory, CPU requests/limits
- Image references: Container images with version tags
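Pulling those pieces together, the overall shape of a `deploy.yaml` is roughly the sketch below. This is an illustrative skeleton only: the deployment name, service names, image tag, and args are placeholders assembled from the customization snippets in this section, and the `apiVersion` should be verified against the API Reference and an actual recipe manifest.

```yaml
# Illustrative skeleton only; take authoritative field names and apiVersion
# from a real recipe manifest and the DynamoGraphDeployment CRD reference.
apiVersion: nvidia.com/v1alpha1   # verify against the API Reference
kind: DynamoGraphDeployment
metadata:
  name: my-model-agg              # placeholder deployment name
spec:
  services:
    Frontend:
      image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:x.y.z
      args:
        - python3 -m dynamo.frontend --router-mode kv --http-port 8000
    VllmDecodeWorker:
      replicas: 2
      resources:
        limits:
          gpu: "4"
      image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:x.y.z
      args:
        - python3 -m dynamo.vllm --model <your-model-path> --served-model-name <name>
```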
Model Configuration:
```yaml
# In deploy.yaml under worker args:
args:
  - python3 -m dynamo.vllm --model <your-model-path> --served-model-name <name>
```
GPU Resources:
```yaml
resources:
  limits:
    gpu: "4"   # Adjust based on your requirements
  requests:
    gpu: "4"
```
Scaling:
```yaml
services:
  VllmDecodeWorker:
    replicas: 2   # Scale to multiple workers
```
Router Mode:
```yaml
# In Frontend args:
args:
  - python3 -m dynamo.frontend --router-mode kv --http-port 8000
# Options: round-robin, kv (KV-aware routing)
```
Container Images:
```yaml
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:x.y.z
# Update version tag as needed
```
Pods stuck in Pending:
- Check GPU availability: `kubectl describe node <node-name>`
- Verify storage class exists: `kubectl get storageclass`
- Check resource requests vs. available resources
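When a pod stays Pending, the scheduler records its reason (for example, insufficient `nvidia.com/gpu`) as events. A quick way to surface them:

```shell
# Show the pod's events, where the scheduling failure reason appears
kubectl describe pod <pod-name> -n ${NAMESPACE} | grep -A5 Events

# Or list recent events across the namespace, newest last
kubectl get events -n ${NAMESPACE} --sort-by=.lastTimestamp
```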
Model download fails:
- Verify HuggingFace token is correct
- Check network connectivity from cluster
- Review job logs: `kubectl logs job/model-download -n ${NAMESPACE}`
Workers fail to start:
- Check GPU compatibility (driver version, CUDA version)
- Verify image pull secrets if using private registries
- Review pod logs: `kubectl logs <pod-name> -n ${NAMESPACE}`
For more troubleshooting:
- Kubernetes Deployment Guide - Platform installation and concepts
- API Reference - DynamoGraphDeployment CRD specification
- vLLM Backend Guide - vLLM-specific features
- SGLang Backend Guide - SGLang-specific features
- TensorRT-LLM Backend Guide - TensorRT-LLM features
- Observability - Monitoring and logging
- Benchmarking Guide - Performance testing
We welcome contributions of new recipes! See CONTRIBUTING.md for:
- Recipe submission guidelines
- Required components checklist
- Testing and validation requirements
- Documentation standards
A production-ready recipe must include:
- ✅ Complete `deploy.yaml` with DynamoGraphDeployment
- ✅ Model cache PVC and download job
- ✅ Benchmark recipe (`perf.yaml`) for performance testing
- ✅ Verification on target hardware
- ✅ Documentation of GPU requirements