Commit 9bdda19

docs: cherry-pick recipes landing page fixes and Nemotron recipes to release/1.0.0 (#7254)

Authored by: dagil-nvidia, nealvaidya, BenHamm, claude

Signed-off-by: Neal Vaidya <nealv@nvidia.com>
Signed-off-by: Dan Gil <dagil@nvidia.com>
Co-authored-by: Neal Vaidya <nealv@nvidia.com>
Co-authored-by: Ben Hamm <ben.hamm@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

1 parent ac31663 commit 9bdda19

File tree

10 files changed: +833 −57 lines changed

recipes/README.md

Lines changed: 34 additions & 23 deletions
@@ -1,3 +1,8 @@
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+-->
+
 # Dynamo Production-Ready Recipes
 
 Production-tested Kubernetes deployment recipes for LLM inference using NVIDIA Dynamo.
@@ -7,41 +12,51 @@ Production-tested Kubernetes deployment recipes for LLM inference using NVIDIA D
 
 ## Available Recipes
 
-### Multi-Feature Recipe
+### Feature Comparison Recipes
 
-This recipe combines multiple Dynamo performance features (disaggregated serving + KV-aware routing):
+These recipes compare Dynamo performance features with benchmark results, each including both baseline and optimized deployment configurations:
 
 | Model | Framework | Configuration | GPUs | Features |
 |-------|-----------|---------------|------|----------|
-| **[Qwen3-32B](qwen3-32b/)** | vLLM | Disagg + KV-Router | 16x H200 | **Disaggregated Serving + KV-Aware Routing** — includes benchmark comparison with real-world Mooncake traces |
+| **[Qwen3-32B](qwen3-32b/)** | vLLM | Disagg + KV-Router | 16x H200 | **Disaggregated Serving + KV-Aware Routing** — benchmark comparison with real-world Mooncake traces |
+| **[DeepSeek-V3.2-NVFP4](deepseek-v32-fp4/)** | TensorRT-LLM | Agg + Disagg WideEP | 32x GB200 | **Disaggregated Serving + KV-Aware Routing** — benchmark comparison with Mooncake-based synthetic coding trace |
+| **[Qwen3-VL-30B-A3B-FP8](qwen3-vl-30b/)** | vLLM | Agg + Embedding Cache | 1x GB200 | **Multimodal Embedding Cache** — benchmark comparison showing +16% throughput, -28% TTFT |
 
 ### Aggregated & Disaggregated Recipes
 
 These recipes demonstrate aggregated or disaggregated serving:
 
 **GAIE Column**: Indicates whether the recipe includes integration with the [Gateway API Inference Extension (GAIE)](../deploy/inference-gateway/README.md) — a Kubernetes SIG project that extends the Gateway API for AI inference workloads, providing load balancing, model routing, and request management.
 
-| Model | Framework | Mode | GPUs | Deployment | Benchmark Recipe | Notes | GAIE |
-|-------|-----------|------|------|------------|------------------|-------|------|
+| Model | Framework | Mode | GPUs | Deployment | Benchmark | Notes | GAIE |
+|-------|-----------|------|------|------------|-----------|-------|------|
 | **[Llama-3-70B](llama-3-70b/vllm/agg/)** | vLLM | Aggregated | 4x H100/H200 ||| FP8 dynamic quantization ||
 | **[Llama-3-70B](llama-3-70b/vllm/disagg-single-node/)** | vLLM | Disagg (Single-Node) | 8x H100/H200 ||| Prefill + Decode separation ||
 | **[Llama-3-70B](llama-3-70b/vllm/disagg-multi-node/)** | vLLM | Disagg (Multi-Node) | 16x H100/H200 ||| 2 nodes, 8 GPUs each ||
-| **[Qwen3-32B-FP8](qwen3-32b-fp8/trtllm/agg/)** | TensorRT-LLM | Aggregated | 2x GPU ||| FP8 quantization ||
-| **[Qwen3-32B-FP8](qwen3-32b-fp8/trtllm/disagg/)** | TensorRT-LLM | Disaggregated | 8x GPU ||| Prefill + Decode separation ||
-| **[Qwen3-235B-A22B-FP8](qwen3-235b-a22b-fp8/trtllm/agg/)** | TensorRT-LLM | Aggregated | 16x GPU ||| MoE model, TP4×EP4 ||
-| **[Qwen3-235B-A22B-FP8](qwen3-235b-a22b-fp8/trtllm/disagg/)** | TensorRT-LLM | Disaggregated | 16x GPU ||| MoE model, Prefill + Decode ||
+| **[Qwen3-32B-FP8](qwen3-32b-fp8/trtllm/agg/)** | TensorRT-LLM | Aggregated | 2x H100/H200/A100 ||| FP8 quantization ||
+| **[Qwen3-32B-FP8](qwen3-32b-fp8/trtllm/disagg/)** | TensorRT-LLM | Disaggregated | 8x H100/H200/A100 ||| Prefill + Decode separation ||
+| **[Qwen3-235B-A22B-FP8](qwen3-235b-a22b-fp8/trtllm/agg/)** | TensorRT-LLM | Aggregated | 16x H100/H200 ||| MoE model, TP4×EP4 ||
+| **[Qwen3-235B-A22B-FP8](qwen3-235b-a22b-fp8/trtllm/disagg/)** | TensorRT-LLM | Disaggregated | 16x H100/H200 ||| MoE model, Prefill + Decode ||
 | **[GPT-OSS-120B](gpt-oss-120b/trtllm/agg/)** | TensorRT-LLM | Aggregated | 4x GB200 ||| Blackwell only, WideEP ||
-| **[GPT-OSS-120B](gpt-oss-120b/trtllm/disagg/)** | TensorRT-LLM | Disaggregated | TBD ||| Engine configs only, no K8s manifest ||
-| **[DeepSeek-R1](deepseek-r1/sglang/disagg-8gpu/)** | SGLang | Disagg WideEP | 16x H200 |*1 || TP=8 per worker, single-node ||
-| **[DeepSeek-R1](deepseek-r1/sglang/disagg-16gpu/)** | SGLang | Disagg WideEP | 32x H200 |*1 || TP=16 per worker, multi-node ||
-| **[DeepSeek-R1](deepseek-r1/trtllm/disagg/wide_ep/gb200/)** | TensorRT-LLM | Disagg WideEP (GB200) | 32+4 GB200 ||| Multi-node: 8 decode + 1 prefill nodes ||
-| **[DeepSeek-R1](deepseek-r1/vllm/disagg/)** | vLLM | Disagg DEP16 | 32x H200 ||| Multi-node, data-expert parallel ||
-
-*1: Please use `deepseek-r1/model-cache/model-download-sglang.yaml` to download the model into the PVC.
+| **[DeepSeek-R1](deepseek-r1/sglang/disagg-8gpu/)** | SGLang | Disagg WideEP | 16x H200 ||| TP=8, single-node. Use `model-download-sglang.yaml` ||
+| **[DeepSeek-R1](deepseek-r1/sglang/disagg-16gpu/)** | SGLang | Disagg WideEP | 32x H200 ||| TP=16, multi-node. Use `model-download-sglang.yaml` ||
+| **[DeepSeek-R1](deepseek-r1/trtllm/disagg/wide_ep/gb200/)** | TensorRT-LLM | Disagg WideEP (GB200) | 36x GB200 ||| Multi-node: 8 decode + 1 prefill nodes ||
+| **[DeepSeek-R1](deepseek-r1/)** | vLLM | Disagg DEP16 | 32x H200 ||| Multi-node, data-expert parallel ||
 
 **Legend:**
-- **Deployment**: ✅ = Complete `deploy.yaml` manifest available | ❌ = Missing or incomplete
-- **Benchmark Recipe**: ✅ = Includes `perf.yaml` for running AIPerf benchmarks | ❌ = No benchmark recipe provided
+- **Deployment**: ✅ = Complete `deploy.yaml` manifest available
+- **Benchmark**: ✅ = Includes `perf.yaml` for running AIPerf benchmarks
+
+### Functional Recipes (Not Yet Benchmarked)
+
+These recipes demonstrate functional deployments with Dynamo features, but have not yet been performance-tuned or paired with benchmark manifests.
+
+| Model | Framework | Mode | GPUs | Deployment | Notes |
+|-------|-----------|-------|------|------------|-------|
+| **[Nemotron-3-Super-FP8](nemotron-3-super-fp8/vllm/agg/)** | vLLM | Aggregated | 4x H100/H200 || TP=4, KV-aware routing |
+| **[Nemotron-3-Super-FP8](nemotron-3-super-fp8/sglang/agg/)** | SGLang | Aggregated | 4x H100/H200 || TP=4, KV-aware routing, 1.0+ |
+| **[Nemotron-3-Super-FP8](nemotron-3-super-fp8/trtllm/disagg/)** | TensorRT-LLM | Disaggregated | 4x H100/H200 || TP=2 prefill/decode split, UCX KV transfer |
+| **[Nemotron-3-Super-FP8](nemotron-3-super-fp8/sglang/disagg/)** | SGLang | Disaggregated | 4x H100/H200 || TP=2 prefill/decode split, nixl KV transfer, 1.0+ |
 
 ## Recipe Structure
 
@@ -113,9 +128,6 @@ cd recipes
 # Update storageClassName in model-cache.yaml first!
 kubectl apply -f <model>/model-cache/ -n ${NAMESPACE}
 
-# Create model cache PVC
-kubectl apply -f <model>/model-cache/model-download.yaml -n ${NAMESPACE}
-
 # Wait for download to complete (may take 10-60 minutes depending on model size)
 kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=6000s
 
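The `kubectl wait` step in the hunk above blocks until the download Job reports a `Complete` condition. A minimal sketch of interpreting that condition yourself, e.g. from a monitoring script (the `check_download` helper is hypothetical; the job name `model-download` comes from the recipe manifests):

```shell
# Hypothetical helper: interpret the Complete-condition status string of
# the model-download Job, as returned by the kubectl query sketched below.
check_download() {
  if [ "$1" = "True" ]; then
    echo "model download complete"
  else
    echo "model download not complete yet (or failed)"
  fi
}

# Against a live cluster, feed it the job's condition status:
#   status=$(kubectl get job model-download -n "${NAMESPACE}" \
#     -o jsonpath='{.status.conditions[?(@.type=="Complete")].status}')
#   check_download "$status"
check_download "True"   # prints: model download complete
```

Streaming `kubectl logs -f job/model-download -n ${NAMESPACE}` in a second terminal is a simpler way to watch progress while the wait runs.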
@@ -189,15 +201,14 @@ kubectl create secret generic hf-token-secret \
 # Deploy
 cd recipes
 kubectl apply -f llama-3-70b/model-cache/ -n ${NAMESPACE}
-kubectl apply -f llama-3-70b/model-cache/model-download.yaml -n ${NAMESPACE}
 kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=6000s
 kubectl apply -f llama-3-70b/vllm/agg/deploy.yaml -n ${NAMESPACE}
 
 # Test
 kubectl port-forward svc/llama3-70b-agg-frontend 8000:8000 -n ${NAMESPACE}
 ```
 
-### Inference Gateway (GAIE) Integration (Optional)**
+### Inference Gateway (GAIE) Integration (Optional)
 
 For Llama-3-70B with vLLM (Aggregated), an example of integration with the Inference Gateway is provided.

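Once the port-forward from the hunk above is running, the frontend can be exercised with an OpenAI-style request. A sketch, assuming the frontend exposes the standard chat completions route (the model name below is an assumption; match it to the model served by the recipe's `deploy.yaml`):

```shell
# Build an OpenAI-style chat completion payload. The model name is an
# assumption for illustration, not taken from the recipe.
PAYLOAD='{
  "model": "meta-llama/Llama-3-70B",
  "messages": [{"role": "user", "content": "Hello!"}],
  "max_tokens": 50
}'

# With the port-forward active, send it to the frontend:
#   curl http://localhost:8000/v1/chat/completions \
#     -H "Content-Type: application/json" \
#     -d "$PAYLOAD"

# Sanity-check that the payload is valid JSON before sending:
echo "$PAYLOAD" | python3 -m json.tool > /dev/null && echo "payload ok"
```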
recipes/gpt-oss-120b/README.md

Lines changed: 40 additions & 33 deletions
@@ -1,52 +1,59 @@
-# GPT-OSS-120B Recipe Guide
+# GPT-OSS-120B Recipes
 
-This guide will help you run the GPT-OSS-120B language model using Dynamo's optimized setup.
+Production-ready deployment for **GPT-OSS-120B** using TensorRT-LLM on Blackwell (GB200) hardware.
+
+## Available Configurations
+
+| Configuration | GPUs | Mode | Description |
+|--------------|------|------|-------------|
+| [**trtllm/agg**](trtllm/agg/) | 4x GB200 | Aggregated | WideEP, ARM64 |
+
+> **Note:** A [disaggregated configuration](trtllm/disagg/) exists with engine configs but is not yet production-ready. See [trtllm/disagg/README.md](trtllm/disagg/README.md) for details.
 
 ## Prerequisites
 
-Follow the instructions in recipe [README.md](../README.md) to create a namespace and kubernetes secret for huggingface token.
+1. **Dynamo Platform installed** — See [Kubernetes Deployment Guide](../../docs/kubernetes/README.md)
+2. **GPU cluster** with GB200 (Blackwell) GPUs
+3. **HuggingFace token** with access to the model
 
 ## Quick Start
 
-To run the model, simply execute this command in your terminal:
-
 ```bash
-cd recipe
-./run.sh --model gpt-oss-120b --framework trtllm agg
-```
+# Set namespace
+export NAMESPACE=dynamo-demo
+kubectl create namespace ${NAMESPACE}
 
-## (Alternative) Step by Step Guide
+# Create HuggingFace token secret
+kubectl create secret generic hf-token-secret \
+  --from-literal=HF_TOKEN="your-token-here" \
+  -n ${NAMESPACE}
 
-### 1. Download the Model
+# Download model (update storageClassName in model-cache/model-cache.yaml first!)
+kubectl apply -f model-cache/ -n ${NAMESPACE}
+kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=3600s
 
-```bash
-cd recipes/gpt-oss-120b
-kubectl apply -n $NAMESPACE -f ./model-cache
+# Deploy
+kubectl apply -f trtllm/agg/deploy.yaml -n ${NAMESPACE}
 ```
 
-### 2. Deploy and Benchmark the Model
+## Test the Deployment
 
 ```bash
-cd recipes/gpt-oss-120b
-kubectl apply -n $NAMESPACE -f ./trtllm/agg
-```
-
-### Container Image
-This recipe was tested with dynamo trtllm runtime container for ARM64 processors.
-
-**Important Note:**
-
-Before dynamo v0.5.1 release, following container image is supported:
-```
-nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.5.1-rc0.pre3
-```
-
-After dynamo v0.5.1 release, following container image will be supported:
-```
-nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.5.1
+# Port-forward the frontend
+kubectl port-forward svc/gpt-oss-agg-frontend 8000:8000 -n ${NAMESPACE}
+
+# Send a test request
+curl http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "openai/gpt-oss-120b",
+    "messages": [{"role": "user", "content": "Hello!"}],
+    "max_tokens": 50
+  }'
 ```
 
 ## Notes
-1. The benchmark container image uses a specific commit of aiperf to ensure reproducible results and compatibility with the benchmarking setup.
 
-2. storage class is not specified in the recipe, you need to specify it in the `deploy.yaml` file.
+- Update `storageClassName` in `model-cache/model-cache.yaml` before deploying
+- This recipe requires ARM64 (GB200) nodes — it will not run on x86 Hopper/Ampere hardware
+- Update the container image tag in `deploy.yaml` to match your Dynamo release version
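The curl test added in this hunk returns an OpenAI-style chat completion. A sketch of pulling the assistant text out of it (the `RESPONSE` literal is illustrative, not real model output; a live deployment returns the full schema with `id`, `usage`, etc.):

```shell
# Illustrative response in the OpenAI chat-completions shape; a real
# deployment returns this from the curl request above.
RESPONSE='{"choices":[{"message":{"role":"assistant","content":"Hi there!"}}]}'

# Extract the assistant message with a small inline Python snippet:
echo "$RESPONSE" | python3 -c 'import json,sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
# prints: Hi there!
```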
