Commit 9bdda19

docs: cherry-pick recipes landing page fixes and Nemotron recipes to release/1.0.0 (#7254)

Authored by: dagil-nvidia, nealvaidya, BenHamm, claude

Signed-off-by: Neal Vaidya <nealv@nvidia.com>
Signed-off-by: Dan Gil <dagil@nvidia.com>
Co-authored-by: Neal Vaidya <nealv@nvidia.com>
Co-authored-by: Ben Hamm <ben.hamm@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

1 parent ac31663 commit 9bdda19

File tree

10 files changed: +833 −57 lines changed

recipes/README.md

Lines changed: 34 additions & 23 deletions
@@ -1,3 +1,8 @@
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+-->
+
 # Dynamo Production-Ready Recipes
 
 Production-tested Kubernetes deployment recipes for LLM inference using NVIDIA Dynamo.
@@ -7,41 +12,51 @@ Production-tested Kubernetes deployment recipes for LLM inference using NVIDIA D
 
 ## Available Recipes
 
-### Multi-Feature Recipe
+### Feature Comparison Recipes
 
-This recipe combines multiple Dynamo performance features (disaggregated serving + KV-aware routing):
+These recipes compare Dynamo performance features with benchmark results, each including both baseline and optimized deployment configurations:
 
 | Model | Framework | Configuration | GPUs | Features |
 |-------|-----------|---------------|------|----------|
-| **[Qwen3-32B](qwen3-32b/)** | vLLM | Disagg + KV-Router | 16x H200 | **Disaggregated Serving + KV-Aware Routing** — includes benchmark comparison with real-world Mooncake traces |
+| **[Qwen3-32B](qwen3-32b/)** | vLLM | Disagg + KV-Router | 16x H200 | **Disaggregated Serving + KV-Aware Routing** — benchmark comparison with real-world Mooncake traces |
+| **[DeepSeek-V3.2-NVFP4](deepseek-v32-fp4/)** | TensorRT-LLM | Agg + Disagg WideEP | 32x GB200 | **Disaggregated Serving + KV-Aware Routing** — benchmark comparison with Mooncake-based synthetic coding trace |
+| **[Qwen3-VL-30B-A3B-FP8](qwen3-vl-30b/)** | vLLM | Agg + Embedding Cache | 1x GB200 | **Multimodal Embedding Cache** — benchmark comparison showing +16% throughput, -28% TTFT |
 
 ### Aggregated & Disaggregated Recipes
 
 These recipes demonstrate aggregated or disaggregated serving:
 
 **GAIE Column**: Indicates whether the recipe includes integration with the [Gateway API Inference Extension (GAIE)](../deploy/inference-gateway/README.md) — a Kubernetes SIG project that extends the Gateway API for AI inference workloads, providing load balancing, model routing, and request management.
 
-| Model | Framework | Mode | GPUs | Deployment | Benchmark Recipe | Notes | GAIE |
-|-------|-----------|------|------|------------|------------------|-------|------|
+| Model | Framework | Mode | GPUs | Deployment | Benchmark | Notes | GAIE |
+|-------|-----------|------|------|------------|-----------|-------|------|
 | **[Llama-3-70B](llama-3-70b/vllm/agg/)** | vLLM | Aggregated | 4x H100/H200 ||| FP8 dynamic quantization ||
 | **[Llama-3-70B](llama-3-70b/vllm/disagg-single-node/)** | vLLM | Disagg (Single-Node) | 8x H100/H200 ||| Prefill + Decode separation ||
 | **[Llama-3-70B](llama-3-70b/vllm/disagg-multi-node/)** | vLLM | Disagg (Multi-Node) | 16x H100/H200 ||| 2 nodes, 8 GPUs each ||
-| **[Qwen3-32B-FP8](qwen3-32b-fp8/trtllm/agg/)** | TensorRT-LLM | Aggregated | 2x GPU ||| FP8 quantization ||
-| **[Qwen3-32B-FP8](qwen3-32b-fp8/trtllm/disagg/)** | TensorRT-LLM | Disaggregated | 8x GPU ||| Prefill + Decode separation ||
-| **[Qwen3-235B-A22B-FP8](qwen3-235b-a22b-fp8/trtllm/agg/)** | TensorRT-LLM | Aggregated | 16x GPU ||| MoE model, TP4×EP4 ||
-| **[Qwen3-235B-A22B-FP8](qwen3-235b-a22b-fp8/trtllm/disagg/)** | TensorRT-LLM | Disaggregated | 16x GPU ||| MoE model, Prefill + Decode ||
+| **[Qwen3-32B-FP8](qwen3-32b-fp8/trtllm/agg/)** | TensorRT-LLM | Aggregated | 2x H100/H200/A100 ||| FP8 quantization ||
+| **[Qwen3-32B-FP8](qwen3-32b-fp8/trtllm/disagg/)** | TensorRT-LLM | Disaggregated | 8x H100/H200/A100 ||| Prefill + Decode separation ||
+| **[Qwen3-235B-A22B-FP8](qwen3-235b-a22b-fp8/trtllm/agg/)** | TensorRT-LLM | Aggregated | 16x H100/H200 ||| MoE model, TP4×EP4 ||
+| **[Qwen3-235B-A22B-FP8](qwen3-235b-a22b-fp8/trtllm/disagg/)** | TensorRT-LLM | Disaggregated | 16x H100/H200 ||| MoE model, Prefill + Decode ||
 | **[GPT-OSS-120B](gpt-oss-120b/trtllm/agg/)** | TensorRT-LLM | Aggregated | 4x GB200 ||| Blackwell only, WideEP ||
-| **[GPT-OSS-120B](gpt-oss-120b/trtllm/disagg/)** | TensorRT-LLM | Disaggregated | TBD ||| Engine configs only, no K8s manifest ||
-| **[DeepSeek-R1](deepseek-r1/sglang/disagg-8gpu/)** | SGLang | Disagg WideEP | 16x H200 |*1 || TP=8 per worker, single-node ||
-| **[DeepSeek-R1](deepseek-r1/sglang/disagg-16gpu/)** | SGLang | Disagg WideEP | 32x H200 |*1 || TP=16 per worker, multi-node ||
-| **[DeepSeek-R1](deepseek-r1/trtllm/disagg/wide_ep/gb200/)** | TensorRT-LLM | Disagg WideEP (GB200) | 32+4 GB200 ||| Multi-node: 8 decode + 1 prefill nodes ||
-| **[DeepSeek-R1](deepseek-r1/vllm/disagg/)** | vLLM | Disagg DEP16 | 32x H200 ||| Multi-node, data-expert parallel ||
-
-*1: Please use `deepseek-r1/model-cache/model-download-sglang.yaml` to download the model into the PVC.
+| **[DeepSeek-R1](deepseek-r1/sglang/disagg-8gpu/)** | SGLang | Disagg WideEP | 16x H200 ||| TP=8, single-node. Use `model-download-sglang.yaml` ||
+| **[DeepSeek-R1](deepseek-r1/sglang/disagg-16gpu/)** | SGLang | Disagg WideEP | 32x H200 ||| TP=16, multi-node. Use `model-download-sglang.yaml` ||
+| **[DeepSeek-R1](deepseek-r1/trtllm/disagg/wide_ep/gb200/)** | TensorRT-LLM | Disagg WideEP (GB200) | 36x GB200 ||| Multi-node: 8 decode + 1 prefill nodes ||
+| **[DeepSeek-R1](deepseek-r1/)** | vLLM | Disagg DEP16 | 32x H200 ||| Multi-node, data-expert parallel ||
 
 **Legend:**
-- **Deployment**: ✅ = Complete `deploy.yaml` manifest available | ❌ = Missing or incomplete
-- **Benchmark Recipe**: ✅ = Includes `perf.yaml` for running AIPerf benchmarks | ❌ = No benchmark recipe provided
+- **Deployment**: ✅ = Complete `deploy.yaml` manifest available
+- **Benchmark**: ✅ = Includes `perf.yaml` for running AIPerf benchmarks
+
+### Functional Recipes (Not Yet Benchmarked)
+
+These recipes demonstrate functional deployments with Dynamo features, but have not yet been performance-tuned or paired with benchmark manifests.
+
+| Model | Framework | Mode | GPUs | Deployment | Notes |
+|-------|-----------|-------|------|------------|-------|
+| **[Nemotron-3-Super-FP8](nemotron-3-super-fp8/vllm/agg/)** | vLLM | Aggregated | 4x H100/H200 || TP=4, KV-aware routing |
+| **[Nemotron-3-Super-FP8](nemotron-3-super-fp8/sglang/agg/)** | SGLang | Aggregated | 4x H100/H200 || TP=4, KV-aware routing, 1.0+ |
+| **[Nemotron-3-Super-FP8](nemotron-3-super-fp8/trtllm/disagg/)** | TensorRT-LLM | Disaggregated | 4x H100/H200 || TP=2 prefill/decode split, UCX KV transfer |
+| **[Nemotron-3-Super-FP8](nemotron-3-super-fp8/sglang/disagg/)** | SGLang | Disaggregated | 4x H100/H200 || TP=2 prefill/decode split, nixl KV transfer, 1.0+ |
 
 ## Recipe Structure
 
@@ -113,9 +128,6 @@ cd recipes
 # Update storageClassName in model-cache.yaml first!
 kubectl apply -f <model>/model-cache/ -n ${NAMESPACE}
 
-# Create model cache PVC
-kubectl apply -f <model>/model-cache/model-download.yaml -n ${NAMESPACE}
-
 # Wait for download to complete (may take 10-60 minutes depending on model size)
 kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=6000s
 
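The `kubectl wait` step in the hunk above blocks until the download Job reports a `Complete` condition. A minimal sketch of interpreting that condition yourself, e.g. from a monitoring script (the `check_download` helper is hypothetical; the job name `model-download` comes from the recipe manifests):

```shell
# Hypothetical helper: interpret the Complete-condition status string of
# the model-download Job, as returned by the kubectl query sketched below.
check_download() {
  if [ "$1" = "True" ]; then
    echo "model download complete"
  else
    echo "model download not complete yet (or failed)"
  fi
}

# Against a live cluster, feed it the job's condition status:
#   status=$(kubectl get job model-download -n "${NAMESPACE}" \
#     -o jsonpath='{.status.conditions[?(@.type=="Complete")].status}')
#   check_download "$status"
check_download "True"   # prints: model download complete
```

Streaming `kubectl logs -f job/model-download -n ${NAMESPACE}` in a second terminal is a simpler way to watch progress while the wait runs.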
@@ -189,15 +201,14 @@ kubectl create secret generic hf-token-secret \
 # Deploy
 cd recipes
 kubectl apply -f llama-3-70b/model-cache/ -n ${NAMESPACE}
-kubectl apply -f llama-3-70b/model-cache/model-download.yaml -n ${NAMESPACE}
 kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=6000s
 kubectl apply -f llama-3-70b/vllm/agg/deploy.yaml -n ${NAMESPACE}
 
 # Test
 kubectl port-forward svc/llama3-70b-agg-frontend 8000:8000 -n ${NAMESPACE}
 ```
 
-### Inference Gateway (GAIE) Integration (Optional)**
+### Inference Gateway (GAIE) Integration (Optional)
 
 For Llama-3-70B with vLLM (Aggregated), an example of integration with the Inference Gateway is provided.

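Once the port-forward from the hunk above is running, the frontend can be exercised with an OpenAI-style request. A sketch, assuming the frontend exposes the standard chat completions route (the model name below is an assumption; match it to the model served by the recipe's `deploy.yaml`):

```shell
# Build an OpenAI-style chat completion payload. The model name is an
# assumption for illustration, not taken from the recipe.
PAYLOAD='{
  "model": "meta-llama/Llama-3-70B",
  "messages": [{"role": "user", "content": "Hello!"}],
  "max_tokens": 50
}'

# With the port-forward active, send it to the frontend:
#   curl http://localhost:8000/v1/chat/completions \
#     -H "Content-Type: application/json" \
#     -d "$PAYLOAD"

# Sanity-check that the payload is valid JSON before sending:
echo "$PAYLOAD" | python3 -m json.tool > /dev/null && echo "payload ok"
```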
recipes/gpt-oss-120b/README.md

Lines changed: 40 additions & 33 deletions
@@ -1,52 +1,59 @@
-# GPT-OSS-120B Recipe Guide
+# GPT-OSS-120B Recipes
 
-This guide will help you run the GPT-OSS-120B language model using Dynamo's optimized setup.
+Production-ready deployment for **GPT-OSS-120B** using TensorRT-LLM on Blackwell (GB200) hardware.
+
+## Available Configurations
+
+| Configuration | GPUs | Mode | Description |
+|--------------|------|------|-------------|
+| [**trtllm/agg**](trtllm/agg/) | 4x GB200 | Aggregated | WideEP, ARM64 |
+
+> **Note:** A [disaggregated configuration](trtllm/disagg/) exists with engine configs but is not yet production-ready. See [trtllm/disagg/README.md](trtllm/disagg/README.md) for details.
 
 ## Prerequisites
 
-Follow the instructions in recipe [README.md](../README.md) to create a namespace and kubernetes secret for huggingface token.
+1. **Dynamo Platform installed** — See [Kubernetes Deployment Guide](../../docs/kubernetes/README.md)
+2. **GPU cluster** with GB200 (Blackwell) GPUs
+3. **HuggingFace token** with access to the model
 
 ## Quick Start
 
-To run the model, simply execute this command in your terminal:
-
 ```bash
-cd recipe
-./run.sh --model gpt-oss-120b --framework trtllm agg
-```
+# Set namespace
+export NAMESPACE=dynamo-demo
+kubectl create namespace ${NAMESPACE}
 
-## (Alternative) Step by Step Guide
+# Create HuggingFace token secret
+kubectl create secret generic hf-token-secret \
+  --from-literal=HF_TOKEN="your-token-here" \
+  -n ${NAMESPACE}
 
-### 1. Download the Model
+# Download model (update storageClassName in model-cache/model-cache.yaml first!)
+kubectl apply -f model-cache/ -n ${NAMESPACE}
+kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=3600s
 
-```bash
-cd recipes/gpt-oss-120b
-kubectl apply -n $NAMESPACE -f ./model-cache
+# Deploy
+kubectl apply -f trtllm/agg/deploy.yaml -n ${NAMESPACE}
 ```
 
-### 2. Deploy and Benchmark the Model
+## Test the Deployment
 
 ```bash
-cd recipes/gpt-oss-120b
-kubectl apply -n $NAMESPACE -f ./trtllm/agg
-```
-
-### Container Image
-This recipe was tested with dynamo trtllm runtime container for ARM64 processors.
-
-**Important Note:**
-
-Before dynamo v0.5.1 release, following container image is supported:
-```
-nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.5.1-rc0.pre3
-```
-
-After dynamo v0.5.1 release, following container image will be supported:
-```
-nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.5.1
+# Port-forward the frontend
+kubectl port-forward svc/gpt-oss-agg-frontend 8000:8000 -n ${NAMESPACE}
+
+# Send a test request
+curl http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "openai/gpt-oss-120b",
+    "messages": [{"role": "user", "content": "Hello!"}],
+    "max_tokens": 50
+  }'
 ```
 
 ## Notes
-1. The benchmark container image uses a specific commit of aiperf to ensure reproducible results and compatibility with the benchmarking setup.
 
-2. storage class is not specified in the recipe, you need to specify it in the `deploy.yaml` file.
+- Update `storageClassName` in `model-cache/model-cache.yaml` before deploying
+- This recipe requires ARM64 (GB200) nodes — it will not run on x86 Hopper/Ampere hardware
+- Update the container image tag in `deploy.yaml` to match your Dynamo release version
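The curl test added in this hunk returns an OpenAI-style chat completion. A sketch of pulling the assistant text out of it (the `RESPONSE` literal is illustrative, not real model output; a live deployment returns the full schema with `id`, `usage`, etc.):

```shell
# Illustrative response in the OpenAI chat-completions shape; a real
# deployment returns this from the curl request above.
RESPONSE='{"choices":[{"message":{"role":"assistant","content":"Hi there!"}}]}'

# Extract the assistant message with a small inline Python snippet:
echo "$RESPONSE" | python3 -c 'import json,sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
# prints: Hi there!
```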
