<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->

# Dynamo Production-Ready Recipes

Production-tested Kubernetes deployment recipes for LLM inference using NVIDIA Dynamo.

## Available Recipes

### Feature Comparison Recipes

These recipes compare Dynamo performance features with benchmark results; each includes both baseline and optimized deployment configurations:

| Model | Framework | Configuration | GPUs | Features |
|-------|-----------|---------------|------|----------|
| **[Qwen3-32B](qwen3-32b/)** | vLLM | Disagg + KV-Router | 16x H200 | **Disaggregated Serving + KV-Aware Routing** — benchmark comparison with real-world Mooncake traces |
| **[DeepSeek-V3.2-NVFP4](deepseek-v32-fp4/)** | TensorRT-LLM | Agg + Disagg WideEP | 32x GB200 | **Disaggregated Serving + KV-Aware Routing** — benchmark comparison with a Mooncake-based synthetic coding trace |
| **[Qwen3-VL-30B-A3B-FP8](qwen3-vl-30b/)** | vLLM | Agg + Embedding Cache | 1x GB200 | **Multimodal Embedding Cache** — benchmark comparison showing +16% throughput, -28% TTFT |

### Aggregated & Disaggregated Recipes

These recipes demonstrate aggregated or disaggregated serving:

**GAIE Column**: Indicates whether the recipe includes integration with the [Gateway API Inference Extension (GAIE)](../deploy/inference-gateway/README.md) — a Kubernetes SIG project that extends the Gateway API for AI inference workloads, providing load balancing, model routing, and request management.

| Model | Framework | Mode | GPUs | Deployment | Benchmark | Notes | GAIE |
|-------|-----------|------|------|------------|-----------|-------|------|
| **[Llama-3-70B](llama-3-70b/vllm/agg/)** | vLLM | Aggregated | 4x H100/H200 | ✅ | ✅ | FP8 dynamic quantization | ✅ |
| **[Llama-3-70B](llama-3-70b/vllm/disagg-single-node/)** | vLLM | Disagg (Single-Node) | 8x H100/H200 | ✅ | ✅ | Prefill + Decode separation | ❌ |
| **[Llama-3-70B](llama-3-70b/vllm/disagg-multi-node/)** | vLLM | Disagg (Multi-Node) | 16x H100/H200 | ✅ | ✅ | 2 nodes, 8 GPUs each | ❌ |
| **[Qwen3-32B-FP8](qwen3-32b-fp8/trtllm/agg/)** | TensorRT-LLM | Aggregated | 2x H100/H200/A100 | ✅ | ✅ | FP8 quantization | ❌ |
| **[Qwen3-32B-FP8](qwen3-32b-fp8/trtllm/disagg/)** | TensorRT-LLM | Disaggregated | 8x H100/H200/A100 | ✅ | ✅ | Prefill + Decode separation | ❌ |
| **[Qwen3-235B-A22B-FP8](qwen3-235b-a22b-fp8/trtllm/agg/)** | TensorRT-LLM | Aggregated | 16x H100/H200 | ✅ | ✅ | MoE model, TP4×EP4 | ❌ |
| **[Qwen3-235B-A22B-FP8](qwen3-235b-a22b-fp8/trtllm/disagg/)** | TensorRT-LLM | Disaggregated | 16x H100/H200 | ✅ | ✅ | MoE model, Prefill + Decode | ❌ |
| **[GPT-OSS-120B](gpt-oss-120b/trtllm/agg/)** | TensorRT-LLM | Aggregated | 4x GB200 | ✅ | ✅ | Blackwell only, WideEP | ❌ |
| **[DeepSeek-R1](deepseek-r1/sglang/disagg-8gpu/)** | SGLang | Disagg WideEP | 16x H200 | ✅ | ❌ | TP=8, single-node. Use `model-download-sglang.yaml` | ❌ |
| **[DeepSeek-R1](deepseek-r1/sglang/disagg-16gpu/)** | SGLang | Disagg WideEP | 32x H200 | ✅ | ❌ | TP=16, multi-node. Use `model-download-sglang.yaml` | ❌ |
| **[DeepSeek-R1](deepseek-r1/trtllm/disagg/wide_ep/gb200/)** | TensorRT-LLM | Disagg WideEP (GB200) | 36x GB200 | ✅ | ✅ | Multi-node: 8 decode + 1 prefill nodes | ❌ |
| **[DeepSeek-R1](deepseek-r1/)** | vLLM | Disagg DEP16 | 32x H200 | ✅ | ❌ | Multi-node, data-expert parallel | ❌ |

**Legend:**
- **Deployment**: ✅ = Complete `deploy.yaml` manifest available
- **Benchmark**: ✅ = Includes `perf.yaml` for running AIPerf benchmarks

### Functional Recipes (Not Yet Benchmarked)

These recipes demonstrate functional deployments with Dynamo features, but have not yet been performance-tuned or paired with benchmark manifests.

| Model | Framework | Mode | GPUs | Deployment | Notes |
|-------|-----------|------|------|------------|-------|
| **[Nemotron-3-Super-FP8](nemotron-3-super-fp8/vllm/agg/)** | vLLM | Aggregated | 4x H100/H200 | ✅ | TP=4, KV-aware routing |
| **[Nemotron-3-Super-FP8](nemotron-3-super-fp8/sglang/agg/)** | SGLang | Aggregated | 4x H100/H200 | ✅ | TP=4, KV-aware routing, 1.0+ |
| **[Nemotron-3-Super-FP8](nemotron-3-super-fp8/trtllm/disagg/)** | TensorRT-LLM | Disaggregated | 4x H100/H200 | ✅ | TP=2 prefill/decode split, UCX KV transfer |
| **[Nemotron-3-Super-FP8](nemotron-3-super-fp8/sglang/disagg/)** | SGLang | Disaggregated | 4x H100/H200 | ✅ | TP=2 prefill/decode split, NIXL KV transfer, 1.0+ |

## Recipe Structure

```bash
# Update storageClassName in model-cache.yaml first!
kubectl apply -f <model>/model-cache/ -n ${NAMESPACE}

# Wait for download to complete (may take 10-60 minutes depending on model size)
kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=6000s
```

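If the `kubectl wait` above times out, the standard Kubernetes debugging commands apply. A quick sketch, assuming the download Job created by the model-cache manifests is named `model-download` (as in the `wait` command above):

```bash
# Stream the downloader's logs to watch progress
kubectl logs -f job/model-download -n ${NAMESPACE}

# Inspect Job events (image pulls, PVC binding, scheduling issues)
kubectl describe job/model-download -n ${NAMESPACE}

# Confirm the model cache PVC is bound to a volume
kubectl get pvc -n ${NAMESPACE}
```

A Job stuck in `Pending` usually points at the PVC: check that `storageClassName` in `model-cache.yaml` matches a storage class available in your cluster.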
```bash
# Deploy
cd recipes
kubectl apply -f llama-3-70b/model-cache/ -n ${NAMESPACE}
kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=6000s
kubectl apply -f llama-3-70b/vllm/agg/deploy.yaml -n ${NAMESPACE}

# Test
kubectl port-forward svc/llama3-70b-agg-frontend 8000:8000 -n ${NAMESPACE}
```
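
With the port-forward active, the frontend can be exercised through its OpenAI-compatible HTTP API. A minimal smoke test; the served model name below is an assumption — query `/v1/models` first and substitute a name it returns:

```bash
# List the models the frontend is serving
curl -s http://localhost:8000/v1/models

# Send a small chat completion request
# NOTE: "llama3-70b" is a placeholder model name; use one returned by /v1/models
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama3-70b",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 32
      }'
```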

### Inference Gateway (GAIE) Integration (Optional)

For Llama-3-70B with vLLM (Aggregated), an example of integration with the Inference Gateway is provided.
