Dynamo Production-Ready Recipes

Production-tested Kubernetes deployment recipes for LLM inference using NVIDIA Dynamo.

Prerequisites: This guide assumes you have already installed the Dynamo Kubernetes Platform. If not, follow the Kubernetes Deployment Guide first.

Available Recipes

Feature Comparison Recipes

These recipes benchmark individual Dynamo performance features. Each one includes both a baseline and an optimized deployment configuration, so the results can be compared directly:

Model	Framework	Configuration	GPUs	Features
Qwen3-32B	vLLM	Disagg + KV-Router	16x H200	Disaggregated Serving + KV-Aware Routing — benchmark comparison with real-world Mooncake traces
DeepSeek-V3.2-NVFP4	TensorRT-LLM	Agg + Disagg WideEP	32x GB200	Disaggregated Serving + KV-Aware Routing — benchmark comparison with Mooncake-based synthetic coding trace
Qwen3-VL-30B-A3B-FP8	vLLM	Agg + Embedding Cache	1x GB200	Multimodal Embedding Cache — benchmark comparison showing +16% throughput, -28% TTFT

Aggregated & Disaggregated Recipes

These recipes demonstrate aggregated or disaggregated serving:

GAIE Column: Indicates whether the recipe includes integration with the Gateway API Inference Extension (GAIE) — a Kubernetes SIG project that extends the Gateway API for AI inference workloads, providing load balancing, model routing, and request management.

Model	Framework	Mode	GPUs	Deployment	Benchmark	Notes	GAIE
Llama-3-70B	vLLM	Aggregated	4x H100/H200	✅	✅	FP8 dynamic quantization	✅
Llama-3-70B	vLLM	Disagg (Single-Node)	8x H100/H200	✅	✅	Prefill + Decode separation	❌
Llama-3-70B	vLLM	Disagg (Multi-Node)	16x H100/H200	✅	✅	2 nodes, 8 GPUs each	❌
Qwen3-32B-FP8	TensorRT-LLM	Aggregated	2x H100/H200/A100	✅	✅	FP8 quantization	❌
Qwen3-32B-FP8	TensorRT-LLM	Disaggregated	8x H100/H200/A100	✅	✅	Prefill + Decode separation	❌
Qwen3-235B-A22B-FP8	TensorRT-LLM	Aggregated	16x H100/H200	✅	✅	MoE model, TP4×EP4	❌
Qwen3-235B-A22B-FP8	TensorRT-LLM	Disaggregated	16x H100/H200	✅	✅	MoE model, Prefill + Decode	❌
GPT-OSS-120B	TensorRT-LLM	Aggregated	4x GB200	✅	✅	Blackwell only, WideEP	❌
DeepSeek-R1	SGLang	Disagg WideEP	16x H200	✅	❌	TP=8, single-node. Use `model-download-sglang.yaml`	❌
DeepSeek-R1	SGLang	Disagg WideEP	32x H200	✅	❌	TP=16, multi-node. Use `model-download-sglang.yaml`	❌
DeepSeek-R1	TensorRT-LLM	Disagg WideEP (GB200)	36x GB200	✅	✅	Multi-node: 8 decode + 1 prefill nodes	❌
DeepSeek-R1	vLLM	Disagg DEP16	32x H200	✅	❌	Multi-node, data-expert parallel	❌

Legend:

Deployment: ✅ = Complete deploy.yaml manifest available
Benchmark: ✅ = Includes perf.yaml for running AIPerf benchmarks

Functional Recipes (Not Yet Benchmarked)

These recipes demonstrate functional deployments with Dynamo features, but have not yet been performance-tuned or paired with benchmark manifests.

Model	Framework	Mode	GPUs	Deployment	Notes
Nemotron-3-Super-FP8	vLLM	Aggregated	4x H100/H200	✅	TP=4, KV-aware routing
Nemotron-3-Super-FP8	SGLang	Aggregated	4x H100/H200	✅	TP=4, KV-aware routing, 1.0+
Nemotron-3-Super-FP8	TensorRT-LLM	Disaggregated	4x H100/H200	✅	TP=2 prefill/decode split, UCX KV transfer
Nemotron-3-Super-FP8	SGLang	Disaggregated	4x H100/H200	✅	TP=2 prefill/decode split, nixl KV transfer, 1.0+
Kimi-K2.5 (Baseten)	TensorRT-LLM	Aggregated	8x B200	✅	Text only — MoE model, TP8×EP8, reasoning + tool calling

Experimental Recipes

These recipes are under active development and may require additional setup steps (e.g., container patching). They are functional but not yet fully validated for production use.

Model	Framework	Mode	GPUs	Deployment	Notes
nvidia/Kimi-K2.5-NVFP4	TensorRT-LLM	Aggregated	8x B200	✅	Text only — MoE model, TP8×EP8, reasoning + tool calling. Requires container patch. Vision input not yet functional with the patch.

Recipe Structure

Each complete recipe follows this standard structure:

<model-name>/
├── README.md (optional)           # Model-specific deployment notes
├── model-cache/
│   ├── model-cache.yaml          # PersistentVolumeClaim for model storage
│   └── model-download.yaml       # Job to download model from HuggingFace
└── <framework>/                  # vllm, sglang, or trtllm
    └── <deployment-mode>/        # agg, disagg, disagg-single-node, etc.
        ├── deploy.yaml           # Complete DynamoGraphDeployment manifest
        └── perf.yaml (optional)  # AIPerf benchmark job
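
Before deploying, the layout above can be sanity-checked with a small shell function. This is a minimal sketch, not something shipped with the recipes: `check_recipe` is a hypothetical helper name, and it only looks for the files named in the tree above.

```shell
# Hypothetical helper: verify that a recipe directory has the expected files.
# Usage: check_recipe <model-dir> <framework> <deployment-mode>
check_recipe() {
  local model_dir="$1" framework="$2" mode="$3"
  local missing=0
  local f
  # Files the standard structure treats as required.
  for f in \
    "model-cache/model-cache.yaml" \
    "model-cache/model-download.yaml" \
    "${framework}/${mode}/deploy.yaml"
  do
    if [ ! -f "${model_dir}/${f}" ]; then
      echo "missing: ${f}"
      missing=1
    fi
  done
  # perf.yaml is optional, so only report whether it is present.
  if [ -f "${model_dir}/${framework}/${mode}/perf.yaml" ]; then
    echo "benchmark: available"
  else
    echo "benchmark: not provided"
  fi
  return "${missing}"
}
```

Run from the recipes directory, a call such as `check_recipe llama-3-70b vllm agg` returns a non-zero status if any required file is absent.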

Quick Start

Prerequisites

1. Dynamo Platform Installed

The recipes require the Dynamo Kubernetes Platform to be installed. Follow one of the installation guides:

Kubernetes Deployment Guide - Quickstart (~10 minutes)
Detailed Installation Guide - Advanced options

2. GPU Cluster Requirements

Ensure your cluster has:

GPU nodes matching the recipe's requirements (see the GPUs column in the tables above)
GPU operator installed
Appropriate GPU drivers and container runtime

3. HuggingFace Access

Configure authentication to download models:

export NAMESPACE=your-namespace
kubectl create namespace ${NAMESPACE}

# Create HuggingFace token secret
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN="your-token-here" \
  -n ${NAMESPACE}

4. Storage Configuration

Update the storageClassName in <model>/model-cache/model-cache.yaml to match your cluster:

# Find your storage class name
kubectl get storageclass

# Edit the model-cache.yaml file and update:
# spec:
#   storageClassName: "your-actual-storage-class"
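
The manual edit described above can also be scripted. The sketch below assumes the manifest writes `storageClassName` on a single line, which may not hold for every recipe. `set_storage_class` is a hypothetical helper, not part of the recipes.

```shell
# Hypothetical helper: rewrite the storageClassName value in a PVC manifest.
# Assumes the field sits on one line, e.g.   storageClassName: "some-class"
# Usage: set_storage_class <manifest-file> <storage-class-name>
set_storage_class() {
  local file="$1" class="$2"
  local tmp status
  tmp="$(mktemp)" || return 1
  # Keep the original indentation and key; replace only the value.
  sed -E "s|^([[:space:]]*storageClassName:).*|\1 \"${class}\"|" "$file" > "$tmp" \
    && cat "$tmp" > "$file"
  status=$?
  rm -f "$tmp"
  return "$status"
}
```

For example, `set_storage_class <model>/model-cache/model-cache.yaml your-actual-storage-class` updates the file in place. Review the result before applying it.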

Deploy a Recipe

Step 1: Download Model

cd recipes
# Update storageClassName in model-cache.yaml first!
kubectl apply -f <model>/model-cache/ -n ${NAMESPACE}

# Wait for download to complete (may take 10-60 minutes depending on model size)
kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=6000s

# Monitor progress (run in a separate terminal, because the wait above blocks until the job finishes)
kubectl logs -f job/model-download -n ${NAMESPACE}

Step 2: Deploy Service

Update the container image in <model>/<framework>/<mode>/deploy.yaml to the runtime image and version tag you intend to deploy.

kubectl apply -f <model>/<framework>/<mode>/deploy.yaml -n ${NAMESPACE}

# Check deployment status
kubectl get dynamographdeployment -n ${NAMESPACE}

# Check pod status
kubectl get pods -n ${NAMESPACE}

# Wait for pods to be ready
kubectl wait --for=condition=ready pod -l nvidia.com/dynamo-graph-deployment-name=<deployment-name> -n ${NAMESPACE} --timeout=600s

Step 3: Test Deployment

# Port forward to access the service locally
kubectl port-forward svc/<deployment-name>-frontend 8000:8000 -n ${NAMESPACE}

# In another terminal, test the endpoint
curl http://localhost:8000/v1/models

# Send a test request
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<model-name>",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 50
  }'
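
The endpoint may not answer immediately after the port-forward starts, so a small retry wrapper can make the checks above less brittle. This is a sketch under that assumption. `retry` is a hypothetical helper, and the attempt count and delay in the example are arbitrary.

```shell
# Hypothetical helper: run a command until it succeeds or attempts run out.
# Usage: retry <max-attempts> <delay-seconds> <command> [args...]
retry() {
  local attempts="$1" delay="$2"
  shift 2
  local n=1
  while ! "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "gave up after ${n} attempts: $*" >&2
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}

# Example (not executed here): poll the models endpoint up to 12 times, 5s apart.
# retry 12 5 curl -sf http://localhost:8000/v1/models
```

The `-f` flag makes curl return a non-zero status on HTTP errors, which is what lets the loop distinguish a failing endpoint from a healthy one.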

Step 4: Run Benchmark (Optional)

# Only if perf.yaml exists in the recipe directory
kubectl apply -f <model>/<framework>/<mode>/perf.yaml -n ${NAMESPACE}

# Monitor benchmark progress
kubectl logs -f job/<benchmark-job-name> -n ${NAMESPACE}

# View results after completion
kubectl logs job/<benchmark-job-name> -n ${NAMESPACE} | tail -50

Example Deployments

Llama-3-70B with vLLM (Aggregated)

export NAMESPACE=dynamo-demo
kubectl create namespace ${NAMESPACE}

# Create HF token secret
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN="your-token" \
  -n ${NAMESPACE}

# Deploy
cd recipes
kubectl apply -f llama-3-70b/model-cache/ -n ${NAMESPACE}
kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=6000s
kubectl apply -f llama-3-70b/vllm/agg/deploy.yaml -n ${NAMESPACE}

# Test
kubectl port-forward svc/llama3-70b-agg-frontend 8000:8000 -n ${NAMESPACE}

Inference Gateway (GAIE) Integration (Optional)

For Llama-3-70B with vLLM (Aggregated), an example of integration with the Inference Gateway is provided.

First, deploy the Dynamo Graph per instructions above.

Then follow Deploy Inference Gateway Section 2 to install GAIE.

Update the containers.epp.image in the deployment file, i.e. llama-3-70b/vllm/agg/gaie/k8s-manifests/epp/deployment.yaml. The image should match the release tag and use the format nvcr.io/nvidia/ai-dynamo/frontend:<version>, e.g. nvcr.io/nvidia/ai-dynamo/frontend:0.9.0.

The recipe assumes you are using the Kubernetes discovery backend and sets the DYN_DISCOVERY_BACKEND environment variable in the epp deployment. To use etcd instead, enable the lines below and remove the DYN_DISCOVERY_BACKEND environment variable.

- name: ETCD_ENDPOINTS
  value: "dynamo-platform-etcd.$(PLATFORM_NAMESPACE):2379" #  update dynamo-platform to appropriate namespace

# General form: DEPLOY_PATH=<model>/<framework>/<mode>
export DEPLOY_PATH=llama-3-70b/vllm/agg
kubectl apply -R -f "$DEPLOY_PATH/gaie/k8s-manifests" -n "$NAMESPACE"

DeepSeek-R1 on GB200 (Multi-node)

See deepseek-r1/trtllm/disagg/wide_ep/gb200/deploy.yaml for the complete multi-node WideEP configuration.

Customization

Each deploy.yaml contains:

ConfigMap: Engine-specific configuration (embedded in the manifest)
DynamoGraphDeployment: Kubernetes resource definitions
Resource limits: GPU count, memory, CPU requests/limits
Image references: Container images with version tags

Key Customization Points

Model Configuration:

# In deploy.yaml under worker args:
args:
  - python3 -m dynamo.vllm --model <your-model-path> --served-model-name <name>

GPU Resources:

resources:
  limits:
    gpu: "4"  # Adjust based on your requirements
  requests:
    gpu: "4"

Scaling:

services:
  VllmDecodeWorker:
    replicas: 2  # Scale to multiple workers

Router Mode:

# In Frontend args:
args:
  - python3 -m dynamo.frontend --router-mode kv --http-port 8000
# Options: round-robin, kv (KV-aware routing)

Container Images:

image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:x.y.z
# Update version tag as needed
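
When several services in one manifest share the same runtime image, the tag can be updated in one pass instead of by hand. The sketch below assumes each image is written on its own line as `image: <repository>:<tag>` with no quoting. `set_image_tag` is a hypothetical helper, not part of the recipes.

```shell
# Hypothetical helper: set the tag on every image line for one repository.
# Assumes lines of the form   image: <repository>:<tag>
# Usage: set_image_tag <manifest-file> <repository> <new-tag>
set_image_tag() {
  local file="$1" repo="$2" tag="$3"
  local tmp status
  tmp="$(mktemp)" || return 1
  # Match only image lines for the given repository; other images are untouched.
  sed -E "s|^([[:space:]]*image:[[:space:]]*${repo}):[^[:space:]]+|\1:${tag}|" \
    "$file" > "$tmp" && cat "$tmp" > "$file"
  status=$?
  rm -f "$tmp"
  return "$status"
}
```

A call would pass the deploy.yaml path, the repository (for instance nvcr.io/nvidia/ai-dynamo/vllm-runtime), and the version tag that matches your installed platform release. Because the repository string is used inside a regular expression, its dots match any character. That is harmless here but worth knowing.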

Troubleshooting

Common Issues

Pods stuck in Pending:

Check GPU availability: kubectl describe node <node-name>
Verify storage class exists: kubectl get storageclass
Check resource requests vs. available resources

Model download fails:

Verify HuggingFace token is correct
Check network connectivity from cluster
Review job logs: kubectl logs job/model-download -n ${NAMESPACE}

Workers fail to start:

Check GPU compatibility (driver version, CUDA version)
Verify image pull secrets if using private registries
Review pod logs: kubectl logs <pod-name> -n ${NAMESPACE}

For more troubleshooting:

Contributing

We welcome contributions of new recipes! See CONTRIBUTING.md for:

Recipe submission guidelines
Required components checklist
Testing and validation requirements
Documentation standards

Recipe Quality Standards

A production-ready recipe must include:

✅ Complete deploy.yaml with DynamoGraphDeployment
✅ Model cache PVC and download job
✅ Benchmark recipe (perf.yaml) for performance testing
✅ Verification on target hardware
✅ Documentation of GPU requirements
