From 8e9bcd6b0f7da008bedc9295cd009c8808ced079 Mon Sep 17 00:00:00 2001 From: Ben Hamm Date: Wed, 5 Nov 2025 18:08:01 -0800 Subject: [PATCH 1/4] recipes: Clean up incomplete recipes and clarify Kubernetes-only focus Remove incomplete model directories and non-Kubernetes configurations to streamline the recipes directory for production Kubernetes deployments. Changes: - Remove 5 incomplete model directories (deepseek-r1-distill-llama-8b, gemma3, llama4, qwen2-vl-7b-instruct, qwen3) that lack proper Kubernetes deployment manifests - Delete run.sh script (non-Kubernetes automation tool) - Remove standalone engine config YAMLs from deepseek-r1/trtllm that were not wrapped in Kubernetes manifests - Document incomplete gpt-oss-120b disagg recipe with README explaining missing components README improvements: - Restructure Available Recipes table with 'Deployment' and 'Benchmark Recipe' columns to clarify that perf.yaml files are tools for users to run benchmarks, not published performance results - Add comprehensive quick start guide with prerequisites - Link to correct Kubernetes deployment guides - Add troubleshooting section - Remove extraneous links (docs.nvidia.com, license section) Result: 4 models with 10 complete deployment recipes (7 with benchmark scripts), focused exclusively on Kubernetes deployments. Signed-off-by: Ben Hamm --- recipes/README.md | 396 ++++++++---------- .../trtllm/agg.yaml | 34 -- .../trtllm/decode.yaml | 31 -- .../trtllm/prefill.yaml | 30 -- .../deepseek-r1/trtllm/agg/mtp/mtp_agg.yaml | 51 --- .../deepseek-r1/trtllm/agg/simple/agg.yaml | 56 --- .../trtllm/agg/wide_ep/dep16_agg.yaml | 29 -- .../deepseek-r1/trtllm/agg/wide_ep/eplb.yaml | 7 - .../trtllm/agg/wide_ep/wide_ep_agg.yaml | 39 -- .../trtllm/disagg/mtp/mtp_decode.yaml | 57 --- .../trtllm/disagg/mtp/mtp_prefill.yaml | 41 -- .../trtllm/disagg/simple/decode.yaml | 60 --- .../trtllm/disagg/simple/prefill.yaml | 39 -- .../trtllm/disagg/wide_ep/eplb.yaml | 7 - .../trtllm/disagg/wide_ep/wide_ep_decode.yaml | 66 --- .../disagg/wide_ep/wide_ep_prefill.yaml | 44 -- recipes/gemma3/trtllm/vswa_agg.yaml | 26 -- recipes/gemma3/trtllm/vswa_decode.yaml | 29 -- recipes/gemma3/trtllm/vswa_prefill.yaml | 30 -- recipes/gpt-oss-120b/trtllm/disagg/README.md | 25 ++ recipes/llama4/trtllm/eagle/eagle_agg.yml | 39 -- recipes/llama4/trtllm/eagle/eagle_decode.yaml | 52 --- .../llama4/trtllm/eagle/eagle_prefill.yaml | 37 -- recipes/llama4/trtllm/multimodal/agg.yaml | 33 -- recipes/llama4/trtllm/multimodal/decode.yaml | 29 -- recipes/llama4/trtllm/multimodal/prefill.yaml | 31 -- recipes/qwen2-vl-7b-instruct/trtllm/agg.yaml | 33 -- .../qwen2-vl-7b-instruct/trtllm/decode.yaml | 29 -- .../qwen2-vl-7b-instruct/trtllm/encode.yaml | 30 -- .../qwen2-vl-7b-instruct/trtllm/prefill.yaml | 31 -- recipes/qwen3/trtllm/agg.yaml | 34 -- recipes/qwen3/trtllm/decode.yaml | 31 -- recipes/qwen3/trtllm/prefill.yaml | 30 -- recipes/run.sh | 261 ------------ 34 files changed, 210 insertions(+), 1587 deletions(-) delete mode 100644 recipes/deepseek-r1-distill-llama-8b/trtllm/agg.yaml delete mode 100644 recipes/deepseek-r1-distill-llama-8b/trtllm/decode.yaml delete mode 100644 recipes/deepseek-r1-distill-llama-8b/trtllm/prefill.yaml delete mode 100644 recipes/deepseek-r1/trtllm/agg/mtp/mtp_agg.yaml delete mode 100644 recipes/deepseek-r1/trtllm/agg/simple/agg.yaml delete mode 100644 recipes/deepseek-r1/trtllm/agg/wide_ep/dep16_agg.yaml delete mode 100644 recipes/deepseek-r1/trtllm/agg/wide_ep/eplb.yaml delete mode 100644 
recipes/deepseek-r1/trtllm/agg/wide_ep/wide_ep_agg.yaml delete mode 100644 recipes/deepseek-r1/trtllm/disagg/mtp/mtp_decode.yaml delete mode 100644 recipes/deepseek-r1/trtllm/disagg/mtp/mtp_prefill.yaml delete mode 100644 recipes/deepseek-r1/trtllm/disagg/simple/decode.yaml delete mode 100644 recipes/deepseek-r1/trtllm/disagg/simple/prefill.yaml delete mode 100644 recipes/deepseek-r1/trtllm/disagg/wide_ep/eplb.yaml delete mode 100644 recipes/deepseek-r1/trtllm/disagg/wide_ep/wide_ep_decode.yaml delete mode 100644 recipes/deepseek-r1/trtllm/disagg/wide_ep/wide_ep_prefill.yaml delete mode 100644 recipes/gemma3/trtllm/vswa_agg.yaml delete mode 100644 recipes/gemma3/trtllm/vswa_decode.yaml delete mode 100644 recipes/gemma3/trtllm/vswa_prefill.yaml create mode 100644 recipes/gpt-oss-120b/trtllm/disagg/README.md delete mode 100644 recipes/llama4/trtllm/eagle/eagle_agg.yml delete mode 100644 recipes/llama4/trtllm/eagle/eagle_decode.yaml delete mode 100644 recipes/llama4/trtllm/eagle/eagle_prefill.yaml delete mode 100644 recipes/llama4/trtllm/multimodal/agg.yaml delete mode 100644 recipes/llama4/trtllm/multimodal/decode.yaml delete mode 100644 recipes/llama4/trtllm/multimodal/prefill.yaml delete mode 100644 recipes/qwen2-vl-7b-instruct/trtllm/agg.yaml delete mode 100644 recipes/qwen2-vl-7b-instruct/trtllm/decode.yaml delete mode 100644 recipes/qwen2-vl-7b-instruct/trtllm/encode.yaml delete mode 100644 recipes/qwen2-vl-7b-instruct/trtllm/prefill.yaml delete mode 100644 recipes/qwen3/trtllm/agg.yaml delete mode 100644 recipes/qwen3/trtllm/decode.yaml delete mode 100644 recipes/qwen3/trtllm/prefill.yaml delete mode 100755 recipes/run.sh diff --git a/recipes/README.md b/recipes/README.md index 236a38a71a..abcac27198 100644 --- a/recipes/README.md +++ b/recipes/README.md @@ -1,297 +1,271 @@ -# Dynamo Model Serving Recipes +# Dynamo Production-Ready Recipes -This repository contains production-ready recipes for deploying large language models using the Dynamo platform. Each recipe includes deployment configurations, performance benchmarking, and model caching setup. +Production-tested Kubernetes deployment recipes for LLM inference using NVIDIA Dynamo. -## Contents -- [Available Models](#available-models) -- [Quick Start](#quick-start) -- [Prerequisites](#prerequisites) -- Deployment Methods - - [Option 1: Automated Deployment](#option-1-automated-deployment) - - [Option 2: Manual Deployment](#option-2-manual-deployment) +> **Prerequisites:** This guide assumes you have already installed the Dynamo Kubernetes Platform. +> If not, follow the **[Kubernetes Deployment Guide](../docs/kubernetes/README.md)** first. 
+## 📊 Available Recipes

-## Available Models
-
-| Model Family | Framework | Deployment Mode | GPU Requirements | Status | Benchmark |GAIE-integration |
-|-----------------|-----------|---------------------|------------------|--------|-----------|------------------|
-| llama-3-70b | vllm | agg | 4x H100/H200 | ✅ | ✅ |✅ |
-| llama-3-70b | vllm | disagg (1 node) | 8x H100/H200 | ✅ | ✅ | 🚧 |
-| llama-3-70b | vllm | disagg (multi-node) | 16x H100/H200 | ✅ | ✅ |🚧 |
-| deepseek-r1 | sglang | disagg (1 node, wide-ep) | 8x H200 | ✅ | 🚧 |🚧 |
-| deepseek-r1 | sglang | disagg (multi-node, wide-ep) | 16x H200 | ✅ | 🚧 |🚧 |
-| gpt-oss-120b | trtllm | agg | 4x GB200 | ✅ | ✅ |🚧 |
+| Model | Framework | Mode | GPUs | Deployment | Benchmark Recipe | Notes |
+|-------|-----------|------|------|------------|------------------|-------|
+| **Llama-3-70B** | vLLM | Aggregated | 4x H100/H200 | ✅ | ✅ | FP8 dynamic quantization |
+| **Llama-3-70B** | vLLM | Disagg (Single-Node) | 8x H100/H200 | ✅ | ✅ | Prefill + Decode separation |
+| **Llama-3-70B** | vLLM | Disagg (Multi-Node) | 16x H100/H200 | ✅ | ✅ | 2 nodes, 8 GPUs each |
+| **Qwen3-32B-FP8** | TensorRT-LLM | Aggregated | 4x GPU | ✅ | ✅ | FP8 quantization |
+| **Qwen3-32B-FP8** | TensorRT-LLM | Disaggregated | 8x GPU | ✅ | ✅ | Prefill + Decode separation |
+| **GPT-OSS-120B** | TensorRT-LLM | Aggregated | 4x GB200 | ✅ | ✅ | Blackwell only, WideEP |
+| **GPT-OSS-120B** | TensorRT-LLM | Disaggregated | TBD | ❌ | ❌ | Engine configs only, no K8s manifest |
+| **DeepSeek-R1** | SGLang | Disagg WideEP | 8x H200 | ✅ | ❌ | Benchmark recipe pending |
+| **DeepSeek-R1** | SGLang | Disagg WideEP | 16x H200 | ✅ | ❌ | Benchmark recipe pending |
+| **DeepSeek-R1** | TensorRT-LLM | Disagg WideEP (GB200) | 32+4 GB200 | ✅ | ✅ | Multi-node: 8 decode + 1 prefill nodes |

**Legend:**
-- ✅ Functional
-- 🚧 Under development
+- **Deployment**: ✅ = Complete `deploy.yaml` manifest available | ❌ = Missing or incomplete
+- **Benchmark Recipe**: ✅ = Includes `perf.yaml` for running AIPerf benchmarks | ❌ = No benchmark recipe provided
+
+## 📁 Recipe Structure
+Each complete recipe follows this standard structure (a sketch of a typical model-download job follows the tree):
-**Recipe Directory Structure:**
-Recipes are organized into a directory structure that follows the pattern:
-```text
+```
+<model>/
+├── README.md (optional)          # Model-specific deployment notes
├── model-cache/
-│   ├── model-cache.yaml          # PVC for model cache
-│   └── model-download.yaml       # Job for model download
-├── <framework>/
-│   └── <deployment-mode>/
-│       ├── deploy.yaml           # DynamoGraphDeployment CRD and optional configmap for custom configuration
-│       └── perf.yaml (optional)  # Performance benchmark
-└── README.md (optional)          # Model documentation
+│   ├── model-cache.yaml          # PersistentVolumeClaim for model storage
+│   └── model-download.yaml       # Job to download model from HuggingFace
+└── <framework>/                  # vllm, sglang, or trtllm
+    └── <mode>/                   # agg, disagg, disagg-single-node, etc.
+        ├── deploy.yaml           # Complete DynamoGraphDeployment manifest
+        └── perf.yaml (optional)  # AIPerf benchmark job
```
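+As a concrete illustration of the `model-download.yaml` piece, the sketch below shows the general shape such a job takes. It is illustrative only — the container image, the download command, and the resource names (`model-cache`, `hf-token-secret`) are assumptions; the actual file in each recipe is the source of truth:
+
+```yaml
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: model-download
+spec:
+  backoffLimit: 2
+  template:
+    spec:
+      restartPolicy: Never
+      containers:
+        - name: download
+          image: python:3.12-slim            # assumed image; recipes may pin a different one
+          command: ["/bin/sh", "-c"]
+          args:
+            - pip install -q "huggingface_hub[cli]" &&
+              huggingface-cli download <hf-repo-id> --local-dir /model-cache/<model>
+          env:
+            - name: HF_TOKEN                 # read from the secret created in the prerequisites
+              valueFrom:
+                secretKeyRef:
+                  name: hf-token-secret
+                  key: HF_TOKEN
+          volumeMounts:
+            - name: model-cache
+              mountPath: /model-cache
+      volumes:
+        - name: model-cache
+          persistentVolumeClaim:
+            claimName: model-cache           # the PVC defined in model-cache.yaml
+```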
-## Quick Start
-
-Follow the instructions in the [Prerequisites](#prerequisites) section to set up your environment.
-
-Choose your preferred deployment method: using the `run.sh` script or manual deployment steps.
+## 🚀 Quick Start

+### Prerequisites

-## Prerequisites
+**1. Dynamo Platform Installed**

-### 1. Environment Setup
+The recipes require the Dynamo Kubernetes Platform to be installed.
+Follow the installation guide:

-Create a Kubernetes namespace and set environment variable:
+- **[Kubernetes Deployment Guide](../docs/kubernetes/README.md)** - Quickstart (~10 minutes)
+- **[Detailed Installation Guide](../docs/kubernetes/installation_guide.md)** - Advanced options

-```bash
-export NAMESPACE=your-namespace
-kubectl create namespace ${NAMESPACE}
-```
+**2. GPU Cluster Requirements**

-### 2. Deploy Dynamo Platform
-
-Install the Dynamo Cloud Platform following the [Quickstart Guide](../docs/kubernetes/README.md).
-
-### 3. GPU Cluster
-
-Ensure your Kubernetes cluster has:
-- GPU nodes with appropriate GPU types (see model requirements above)
+Ensure your cluster has:
+- GPU nodes matching recipe requirements (see table above)
- GPU operator installed
-- Sufficient GPU memory and compute resources
-
-### 4. Container Registry Access
+- Appropriate GPU drivers and container runtime

-Ensure access to NVIDIA container registry for runtime images:
-- `nvcr.io/nvidia/ai-dynamo/vllm-runtime:x.y.z`
-- `nvcr.io/nvidia/ai-dynamo/trtllm-runtime:x.y.z`
-- `nvcr.io/nvidia/ai-dynamo/sglang-runtime:x.y.z`
+**3. HuggingFace Access**

-### 5. HuggingFace Access and Kubernetes Secret Creation
-
-Set up a kubernetes secret with the HuggingFace token for model download:
+Configure authentication to download models:

```bash
-# Update the token in the secret file
-vim hf_hub_secret/hf_hub_secret.yaml
+export NAMESPACE=your-namespace
+kubectl create namespace ${NAMESPACE}

-# Apply the secret
-kubectl apply -f hf_hub_secret/hf_hub_secret.yaml -n ${NAMESPACE}
+# Create HuggingFace token secret
+kubectl create secret generic hf-token-secret \
+  --from-literal=HF_TOKEN="your-token-here" \
+  -n ${NAMESPACE}
```

-6. Configure Storage Class
+**4. Storage Configuration**
+
+Update the `storageClassName` in `<model>/model-cache/model-cache.yaml` to match your cluster (a sketch of the PVC follows the commands below):

```bash
-# Check available storage classes
+# Find your storage class name
kubectl get storageclass
-```
-
-Replace "your-storage-class-name" with your actual storage class in the file: `<model>/model-cache/model-cache.yaml`
-```yaml
-# In <model>/model-cache/model-cache.yaml
-spec:
-  storageClassName: "your-actual-storage-class" # Replace this
+# Edit the model-cache.yaml file and update:
+#   spec:
+#     storageClassName: "your-actual-storage-class"
```
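+For orientation, the relevant portion of a `model-cache.yaml` PVC looks roughly like the sketch below. The name, size, and access mode are illustrative assumptions — check the actual file in each recipe:
+
+```yaml
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  name: model-cache
+spec:
+  accessModes:
+    - ReadWriteMany                                # assumed; a shared model cache is typically RWX
+  storageClassName: "your-actual-storage-class"    # replace with a class from `kubectl get storageclass`
+  resources:
+    requests:
+      storage: 300Gi                               # size depends on the model weights
+```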
-## Option 1: Automated Deployment
-
-Use the `run.sh` script for fully automated deployment:
-
-**Note:** The script automatically:
-- Create model cache PVC and downloads the model
-- Deploy the model service
-- Runs performance benchmark if a `perf.yaml` file is present in the deployment directory
+### Deploy a Recipe

-#### Script Usage
+**Step 1: Download Model**

```bash
-./run.sh [OPTIONS] --model <model> --framework <framework> --deployment <deployment>
-```
+# Update storageClassName in model-cache.yaml first!
+kubectl apply -f <model>/model-cache/ -n ${NAMESPACE}

-**Required Options:**
-- `--model <model>`: Model name matching the directory name in the recipes directory (e.g., llama-3-70b, gpt-oss-120b, deepseek-r1)
-- `--framework <framework>`: Backend framework (`vllm`, `trtllm`, `sglang`)
-- `--deployment <deployment>`: Deployment mode (e.g., agg, disagg, disagg-single-node, disagg-multi-node)
+# Wait for download to complete (may take 10-60 minutes depending on model size)
+kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=6000s

-**Optional Options:**
-- `--namespace <namespace>`: Kubernetes namespace (default: dynamo)
-- `--dry-run`: Show commands without executing them
-- `-h, --help`: Show help message
+# Monitor progress
+kubectl logs -f job/model-download -n ${NAMESPACE}
+```

-**Environment Variables:**
-- `NAMESPACE`: Kubernetes namespace (default: dynamo)

**Step 2: Deploy Service**

-#### Example Usage
```bash
-# Set up environment
-export NAMESPACE=your-namespace
-kubectl create namespace ${NAMESPACE}
-# Configure HuggingFace token
-kubectl apply -f hf_hub_secret/hf_hub_secret.yaml -n ${NAMESPACE}
-
-# use run.sh script to deploy the model
-# Deploy Llama-3-70B with vLLM (aggregated mode)
-./run.sh --model llama-3-70b --framework vllm --deployment agg
+kubectl apply -f <model>/<framework>/<mode>/deploy.yaml -n ${NAMESPACE}

-# Deploy GPT-OSS-120B with TensorRT-LLM
-./run.sh --model gpt-oss-120b --framework trtllm --deployment agg
-
-# Deploy DeepSeek-R1 with SGLang (disaggregated mode)
-./run.sh --model deepseek-r1 --framework sglang --deployment disagg
+# Check deployment status
+kubectl get dynamographdeployment -n ${NAMESPACE}

-# Deploy with custom namespace
-./run.sh --namespace my-namespace --model llama-3-70b --framework vllm --deployment agg
+# Check pod status
+kubectl get pods -n ${NAMESPACE}

-# Dry run to see what would be executed
-./run.sh --dry-run --model llama-3-70b --framework vllm --deployment agg
+# Wait for pods to be ready
+kubectl wait --for=condition=ready pod -l nvidia.com/dynamo-graph-deployment-name=<deployment-name> -n ${NAMESPACE} --timeout=600s
```

-## If deploying with Gateway API Inference extension GAIE

**Step 3: Test Deployment**

-1. Follow [Deploy Inference Gateway Section 2](../deploy/inference-gateway/README.md#2-deploy-inference-gateway) to install GAIE.
+```bash
+# Port forward to access the service locally
+kubectl port-forward svc/<deployment-name>-frontend 8000:8000 -n ${NAMESPACE}

-2. Apply manifests by running a script.
+# In another terminal, test the endpoint
+curl http://localhost:8000/v1/models

```bash
-# Match the block size to the cli value in your deployment file deploy.yaml:
-   "python3 -m dynamo.vllm ... --block-size 128"
-export DYNAMO_KV_BLOCK_SIZE=128
-export EPP_IMAGE=nvcr.io/you/epp:tag
-# Add --gaie argument to the script i.e.:
-./run.sh --model llama-3-70b --framework vllm --gaie agg
+# Send a test request
+curl http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "<model-name>",
+    "messages": [{"role": "user", "content": "Hello!"}],
+    "max_tokens": 50
+  }'
```

-The script will perform gateway checks and apply the manifests.
-## Option 2: Manual Deployment
-
-For step-by-step manual deployment follow these steps :

**Step 4: Run Benchmark (Optional)**

```bash
-# 0. Set up environment (see Prerequisites section)
-export NAMESPACE=your-namespace
-kubectl create namespace ${NAMESPACE}
-kubectl apply -f hf_hub_secret/hf_hub_secret.yaml -n ${NAMESPACE}
+# Only if perf.yaml exists in the recipe directory
+kubectl apply -f <model>/<framework>/<mode>/perf.yaml -n ${NAMESPACE}
-# 1. Download model (see Model Download section)
-kubectl apply -n $NAMESPACE -f <model>/model-cache/
+# Monitor benchmark progress
+kubectl logs -f job/<benchmark-job-name> -n ${NAMESPACE}

-# 2. Deploy model (see Deployment section)
-kubectl apply -n $NAMESPACE -f <model>/<framework>/<mode>/deploy.yaml
+# View results after completion
+kubectl logs job/<benchmark-job-name> -n ${NAMESPACE} | tail -50

-# 3. Run benchmarks (optional, if perf.yaml exists)
-kubectl apply -n $NAMESPACE -f <model>/<framework>/<mode>/perf.yaml
```

-### Step 1: Download Model
+## 📖 Example Deployments

-```bash
-# Start the download job
-kubectl apply -n $NAMESPACE -f <model>/model-cache
-
-# Verify job creation
-kubectl get jobs -n $NAMESPACE | grep model-download
-```
-
-Monitor and wait for the model download to complete:
+### Llama-3-70B with vLLM (Aggregated)

```bash
+export NAMESPACE=dynamo-demo
+kubectl create namespace ${NAMESPACE}

-
-# Wait for job completion (timeout after 100 minutes)
-kubectl wait --for=condition=Complete job/model-download -n $NAMESPACE --timeout=6000s
+# Create HF token secret
+kubectl create secret generic hf-token-secret \
+  --from-literal=HF_TOKEN="your-token" \
+  -n ${NAMESPACE}

-# Check job status
-kubectl get job model-download -n $NAMESPACE
+# Deploy
+kubectl apply -f llama-3-70b/model-cache/ -n ${NAMESPACE}
+kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=6000s
+kubectl apply -f llama-3-70b/vllm/agg/deploy.yaml -n ${NAMESPACE}

-# View download logs
-kubectl logs job/model-download -n $NAMESPACE
+# Test
+kubectl port-forward svc/llama3-70b-agg-frontend 8000:8000 -n ${NAMESPACE}
```

-### Step 2: Deploy Model Service
+### DeepSeek-R1 on GB200 (Multi-node)

-```bash
-# Navigate to the specific deployment configuration
-cd <model>/<framework>/<mode>/
-
-# Deploy the model service
-kubectl apply -n $NAMESPACE -f deploy.yaml
-
-# Verify deployment creation
-kubectl get deployments -n $NAMESPACE
-```
+See [deepseek-r1/trtllm/disagg/wide_ep/gb200/deploy.yaml](deepseek-r1/trtllm/disagg/wide_ep/gb200/deploy.yaml) for the complete multi-node WideEP configuration.
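+For orientation, the sketch below shows the overall shape these `deploy.yaml` manifests share. It is a trimmed, illustrative example rather than a deployable manifest — the service names, the `extraPodSpec`/`mainContainer` nesting, and the version tag are assumptions; use the actual recipe files as the source of truth:
+
+```yaml
+apiVersion: nvidia.com/v1alpha1            # assumed CRD group/version
+kind: DynamoGraphDeployment
+metadata:
+  name: llama3-70b-agg
+spec:
+  services:
+    Frontend:
+      replicas: 1
+      extraPodSpec:
+        mainContainer:
+          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1
+          args:
+            - python3 -m dynamo.frontend --router-mode kv --http-port 8000
+    VllmDecodeWorker:
+      replicas: 1
+      resources:
+        limits:
+          gpu: "4"                         # one worker spanning 4 GPUs
+      extraPodSpec:
+        mainContainer:
+          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1
+          args:
+            - python3 -m dynamo.vllm --model <model> --served-model-name <name>
+```
+
+The customization points described next map directly onto this structure.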
-#### Wait for Deployment Ready
+## 🛠️ Customization

-```bash
-# Get deployment name from the deploy.yaml file
-DEPLOYMENT_NAME=$(grep "name:" deploy.yaml | head -1 | awk '{print $2}')
-
-# Wait for deployment to be ready (timeout after 10 minutes)
-kubectl wait --for=condition=available deployment/$DEPLOYMENT_NAME -n $NAMESPACE --timeout=1200s
+Each `deploy.yaml` contains:
+- **ConfigMap**: Engine-specific configuration (embedded in the manifest)
+- **DynamoGraphDeployment**: Kubernetes resource definitions
+- **Resource limits**: GPU count, memory, CPU requests/limits
+- **Image references**: Container images with version tags

-# Check deployment status
-kubectl get deployment $DEPLOYMENT_NAME -n $NAMESPACE

### Key Customization Points

-# Check pod status
-kubectl get pods -n $NAMESPACE -l app=$DEPLOYMENT_NAME

**Model Configuration:**
+```yaml
+# In deploy.yaml under worker args:
+args:
+  - python3 -m dynamo.vllm --model <model> --served-model-name <name>
+```

-#### Verify Model Service
-
-```bash
-# Check if service is running
-kubectl get services -n $NAMESPACE
+**GPU Resources:**
+```yaml
+resources:
+  limits:
+    gpu: "4"  # Adjust based on your requirements
+  requests:
+    gpu: "4"
+```

-# Test model endpoint (port-forward to test locally)
-kubectl port-forward service/${DEPLOYMENT_NAME}-frontend 8000:8000 -n $NAMESPACE
+**Scaling:**
+```yaml
+services:
+  VllmDecodeWorker:
+    replicas: 2  # Scale to multiple workers
+```

-# Test the model API (in another terminal)
-curl http://localhost:8000/v1/models
+**Router Mode:**
+```yaml
+# In Frontend args:
+args:
+  - python3 -m dynamo.frontend --router-mode kv --http-port 8000
+# Options: round-robin, kv (KV-aware routing)
+```

-# Stop port-forward when done
-pkill -f "kubectl port-forward"
+**Container Images:**
+```yaml
+image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1
+# Update version tag as needed
```

-### Step 3: Performance Benchmarking (Optional)
+## 🔧 Troubleshooting

-Run performance benchmarks to evaluate model performance. Note that benchmarking is only available for models that include a `perf.yaml` file (optional):

### Common Issues

-#### Launch Benchmark Job
+**Pods stuck in Pending:**
+- Check GPU availability: `kubectl describe node <node-name>` (see the sketch after this list)
+- Verify storage class exists: `kubectl get storageclass`
+- Check resource requests vs. available resources
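+A quick way to survey schedulable GPUs across nodes (a sketch; the resource name assumes the NVIDIA device plugin is installed):
+
+```bash
+# List allocatable nvidia.com/gpu per node
+kubectl get nodes -o custom-columns='NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
+
+# Then compare against what the recipe requests, e.g. 4 GPUs per worker replica
+kubectl describe node <node-name> | grep -A8 'Allocated resources'
+```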
-```bash
-# From the deployment directory
-kubectl apply -n $NAMESPACE -f perf.yaml
+**Model download fails:**
+- Verify HuggingFace token is correct
+- Check network connectivity from cluster
+- Review job logs: `kubectl logs job/model-download -n ${NAMESPACE}`

-# Verify benchmark job creation
-kubectl get jobs -n $NAMESPACE
-```
+**Workers fail to start:**
+- Check GPU compatibility (driver version, CUDA version)
+- Verify image pull secrets if using private registries
+- Review pod logs: `kubectl logs <pod-name> -n ${NAMESPACE}`

-#### Monitor Benchmark Progress
+**For more troubleshooting:**
+- [Kubernetes Deployment Guide](../docs/kubernetes/README.md#troubleshooting)
+- [Observability Documentation](../docs/kubernetes/observability/)

-```bash
-# Get benchmark job name
-PERF_JOB_NAME=$(grep "name:" perf.yaml | head -1 | awk '{print $2}')

## 📖 Related Documentation

-# Monitor benchmark logs in real-time
-kubectl logs -f job/$PERF_JOB_NAME -n $NAMESPACE
+- **[Kubernetes Deployment Guide](../docs/kubernetes/README.md)** - Platform installation and concepts
+- **[API Reference](../docs/kubernetes/api_reference.md)** - DynamoGraphDeployment CRD specification
+- **[vLLM Backend Guide](../docs/backends/vllm/README.md)** - vLLM-specific features
+- **[SGLang Backend Guide](../docs/backends/sglang/README.md)** - SGLang-specific features
+- **[TensorRT-LLM Backend Guide](../docs/backends/trtllm/README.md)** - TensorRT-LLM features
+- **[Observability](../docs/kubernetes/observability/)** - Monitoring and logging
+- **[Benchmarking Guide](../docs/benchmarks/benchmarking.md)** - Performance testing

-# Wait for benchmark completion (timeout after 100 minutes)
-kubectl wait --for=condition=Complete job/$PERF_JOB_NAME -n $NAMESPACE --timeout=6000s
```

## 🤝 Contributing

-#### View Benchmark Results
+We welcome contributions of new recipes! See [CONTRIBUTING.md](CONTRIBUTING.md) for:
+- Recipe submission guidelines
+- Required components checklist
+- Testing and validation requirements
+- Documentation standards

-```bash
-# Check final benchmark results
-kubectl logs job/$PERF_JOB_NAME -n $NAMESPACE | tail -50
-```
\ No newline at end of file
+### Recipe Quality Standards
+
+A production-ready recipe must include:
+- ✅ Complete `deploy.yaml` with DynamoGraphDeployment
+- ✅ Model cache PVC and download job
+- ✅ Benchmark recipe (`perf.yaml`) for performance testing
+- ✅ Verification on target hardware
+- ✅ Documentation of GPU requirements
diff --git a/recipes/deepseek-r1-distill-llama-8b/trtllm/agg.yaml b/recipes/deepseek-r1-distill-llama-8b/trtllm/agg.yaml
deleted file mode 100644
index 53e0e6ce38..0000000000
--- a/recipes/deepseek-r1-distill-llama-8b/trtllm/agg.yaml
+++ /dev/null
@@ -1,34 +0,0 @@
-# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-tensor_parallel_size: 1 -moe_expert_parallel_size: 1 -enable_attention_dp: false -max_num_tokens: 8192 -max_batch_size: 16 -trust_remote_code: true -backend: pytorch -enable_chunked_prefill: true - -kv_cache_config: - free_gpu_memory_fraction: 0.85 - -# NOTE: pytorch_backend_config section flattened since: https://github.com/NVIDIA/TensorRT-LLM/pull/4603 -# NOTE: overlap_scheduler enabled by default since this commit and changed -# config field from 'enable_overlap_scheduler' to 'disable_overlap_scheduler': -# https://github.com/NVIDIA/TensorRT-LLM/commit/b4e5df0ee0024eda3eeb83a6ba822245a30ab428 - - -cuda_graph_config: - max_batch_size: 16 \ No newline at end of file diff --git a/recipes/deepseek-r1-distill-llama-8b/trtllm/decode.yaml b/recipes/deepseek-r1-distill-llama-8b/trtllm/decode.yaml deleted file mode 100644 index a0154bb6e3..0000000000 --- a/recipes/deepseek-r1-distill-llama-8b/trtllm/decode.yaml +++ /dev/null @@ -1,31 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -tensor_parallel_size: 1 -moe_expert_parallel_size: 1 -enable_attention_dp: false -max_num_tokens: 8192 -trust_remote_code: true -backend: pytorch -enable_chunked_prefill: true -disable_overlap_scheduler: false - -cuda_graph_config: - max_batch_size: 16 - -kv_cache_config: - free_gpu_memory_fraction: 0.85 - -cache_transceiver_config: - backend: DEFAULT diff --git a/recipes/deepseek-r1-distill-llama-8b/trtllm/prefill.yaml b/recipes/deepseek-r1-distill-llama-8b/trtllm/prefill.yaml deleted file mode 100644 index 4996c1fdc6..0000000000 --- a/recipes/deepseek-r1-distill-llama-8b/trtllm/prefill.yaml +++ /dev/null @@ -1,30 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -tensor_parallel_size: 1 -moe_expert_parallel_size: 1 -enable_attention_dp: false -max_num_tokens: 8192 -trust_remote_code: true -backend: pytorch -enable_chunked_prefill: true -# Overlap scheduler not currently supported in prefill only workers. 
-disable_overlap_scheduler: true -cuda_graph_config: - max_batch_size: 16 -kv_cache_config: - free_gpu_memory_fraction: 0.85 - -cache_transceiver_config: - backend: DEFAULT \ No newline at end of file diff --git a/recipes/deepseek-r1/trtllm/agg/mtp/mtp_agg.yaml b/recipes/deepseek-r1/trtllm/agg/mtp/mtp_agg.yaml deleted file mode 100644 index 25fae60abf..0000000000 --- a/recipes/deepseek-r1/trtllm/agg/mtp/mtp_agg.yaml +++ /dev/null @@ -1,51 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -# NOTE: FP4 only supported starting with Blackwell GPUs. -# https://huggingface.co/nvidia/DeepSeek-R1-FP4 -# You can also specify the full path to locally downloaded weights -# instead of a HuggingFace ID here. - -backend: pytorch -tensor_parallel_size: 4 -moe_expert_parallel_size: 4 -enable_attention_dp: true -max_batch_size: 256 -# 8448 = 8192 ISL + 256 OSL -max_num_tokens: 8448 -max_seq_len: 8448 -kv_cache_config: - free_gpu_memory_fraction: 0.30 - dtype: fp8 - -# Enable the MTP(Multi-Token Prediction) in the model engine -speculative_config: - decoding_type: MTP - num_nextn_predict_layers: 1 - -cuda_graph_config: - enable_padding: true - batch_sizes: - - 1 - - 2 - - 4 - - 8 - - 16 - - 32 - - 64 - - 128 - - 256 - -print_iter_log: true diff --git a/recipes/deepseek-r1/trtllm/agg/simple/agg.yaml b/recipes/deepseek-r1/trtllm/agg/simple/agg.yaml deleted file mode 100644 index db2377a92a..0000000000 --- a/recipes/deepseek-r1/trtllm/agg/simple/agg.yaml +++ /dev/null @@ -1,56 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -backend: pytorch - -# TP/EP/PP/DP -tensor_parallel_size: 4 -moe_expert_parallel_size: 4 -pipeline_parallel_size: 1 -enable_attention_dp: false - -max_batch_size: 256 -# 8448 = 8192 ISL + 256 OSL -max_num_tokens: 8448 -max_seq_len: 8448 - -kv_cache_config: - # With dp attention disabled: high free_gpu_memory_fraction is fine. - free_gpu_memory_fraction: 0.85 - # With dp attention enabled: large ISL at high concurrency may need - # free_gpu_memory_fraction low to have enough available memory. 
- # free_gpu_memory_fraction: 0.30 - dtype: fp8 - - -# NOTE: pytorch_backend_config section flattened since: https://github.com/NVIDIA/TensorRT-LLM/pull/4603 -# NOTE: overlap_scheduler enabled by default since this commit and changed -# config field from 'enable_overlap_scheduler' to 'disable_overlap_scheduler': -# https://github.com/NVIDIA/TensorRT-LLM/commit/b4e5df0ee0024eda3eeb83a6ba822245a30ab428 -cuda_graph_config: - enable_padding: true -# NOTE: For larger max batch size, you may want to add larger cuda graph -# batch sizes below to match. - batch_sizes: - - 1 - - 2 - - 4 - - 8 - - 16 - - 32 - - 64 - - 128 - - 256 - -print_iter_log: true diff --git a/recipes/deepseek-r1/trtllm/agg/wide_ep/dep16_agg.yaml b/recipes/deepseek-r1/trtllm/agg/wide_ep/dep16_agg.yaml deleted file mode 100644 index 844c4ffa72..0000000000 --- a/recipes/deepseek-r1/trtllm/agg/wide_ep/dep16_agg.yaml +++ /dev/null @@ -1,29 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Example of a Multi-node worker, but no WideEP or EPLB. -# See wide_ep*.yaml for WideEP example configs. -backend: pytorch -tensor_parallel_size: 16 -moe_expert_parallel_size: 16 -enable_attention_dp: true -max_batch_size: 256 -max_num_tokens: 256 -max_seq_len: 8448 - -kv_cache_config: - free_gpu_memory_fraction: 0.7 - dtype: fp8 - -cuda_graph_config: - enable_padding: true - batch_sizes: - - 1 - - 2 - - 4 - - 8 - - 16 - - 32 - - 64 - - 128 - - 256 diff --git a/recipes/deepseek-r1/trtllm/agg/wide_ep/eplb.yaml b/recipes/deepseek-r1/trtllm/agg/wide_ep/eplb.yaml deleted file mode 100644 index f2fe0a13c6..0000000000 --- a/recipes/deepseek-r1/trtllm/agg/wide_ep/eplb.yaml +++ /dev/null @@ -1,7 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 - -# moe_load_balancer settings for TRTLLM based on: -# https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/ep_load_balancer/README.md#online-ep-load-balancer -num_slots: 288 -layer_updates_per_iter: 2 diff --git a/recipes/deepseek-r1/trtllm/agg/wide_ep/wide_ep_agg.yaml b/recipes/deepseek-r1/trtllm/agg/wide_ep/wide_ep_agg.yaml deleted file mode 100644 index bcd6ae87e0..0000000000 --- a/recipes/deepseek-r1/trtllm/agg/wide_ep/wide_ep_agg.yaml +++ /dev/null @@ -1,39 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -backend: pytorch - -# WideEP related settings -moe_config: - backend: WIDEEP - # moe_max_num_tokens will default to max_num_tokens if left unspecified. 
- # - # If you want to set this value explicitly, one recommendation is below: - # moe_max_num_tokens = max_batch_size * moe_expert_parallel_size - # 4096 = 256 * 16 - # moe_max_num_tokens: 4096 - load_balancer: /mnt/recipes/deepseek-r1/trtllm/agg/wide_ep/eplb.yaml - -tensor_parallel_size: 16 -moe_expert_parallel_size: 16 - -enable_attention_dp: true -max_batch_size: 256 -max_num_tokens: 256 -max_seq_len: 8448 - -kv_cache_config: - free_gpu_memory_fraction: 0.3 - dtype: fp8 - -cuda_graph_config: - enable_padding: true - batch_sizes: - - 1 - - 2 - - 4 - - 8 - - 16 - - 32 - - 64 - - 128 - - 256 \ No newline at end of file diff --git a/recipes/deepseek-r1/trtllm/disagg/mtp/mtp_decode.yaml b/recipes/deepseek-r1/trtllm/disagg/mtp/mtp_decode.yaml deleted file mode 100644 index 8f0bd83919..0000000000 --- a/recipes/deepseek-r1/trtllm/disagg/mtp/mtp_decode.yaml +++ /dev/null @@ -1,57 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -# NOTE: FP4 only supported starting with Blackwell GPUs. -# https://huggingface.co/nvidia/DeepSeek-R1-FP4 -# You can also specify the full path to locally downloaded weights -# instead of a HuggingFace ID here. - -backend: pytorch -tensor_parallel_size: 4 -moe_expert_parallel_size: 4 -enable_attention_dp: false -max_batch_size: 256 -# Note: When MPT is enabled and `cuda_graph_batch_sizes` is specified, `max_num_tokens` must satisfy the following formula: -# max_num_tokens >= max(cuda_graph_batch_sizes) * (num_nextn_predict_layers + 1) -# This is a known issue in TensorRT-LLM and will be resolved in the next release. -max_num_tokens: 512 -# 8704 = 8192 ISL + 512 OSL -max_seq_len: 8704 -kv_cache_config: - free_gpu_memory_fraction: 0.85 - dtype: fp8 - -# Enable the MTP(Multi-Token Prediction) in decode model engine -speculative_config: - decoding_type: MTP - num_nextn_predict_layers: 1 - -cuda_graph_config: - enable_padding: true - batch_sizes: - - 1 - - 2 - - 4 - - 8 - - 16 - - 32 - - 64 - - 128 - - 256 - -print_iter_log: true - -cache_transceiver_config: - backend: DEFAULT diff --git a/recipes/deepseek-r1/trtllm/disagg/mtp/mtp_prefill.yaml b/recipes/deepseek-r1/trtllm/disagg/mtp/mtp_prefill.yaml deleted file mode 100644 index 46494e8d68..0000000000 --- a/recipes/deepseek-r1/trtllm/disagg/mtp/mtp_prefill.yaml +++ /dev/null @@ -1,41 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
-# See the License for the specific language governing permissions and -# limitations under the License. - -# NOTE: FP4 only supported starting with Blackwell GPUs. -# https://huggingface.co/nvidia/DeepSeek-R1-FP4 -# You can also specify the full path to locally downloaded weights -# instead of a HuggingFace ID here. - -backend: pytorch -tensor_parallel_size: 4 -moe_expert_parallel_size: 4 -enable_attention_dp: true -max_batch_size: 1 -max_num_tokens: 8192 -max_seq_len: 8192 -kv_cache_config: - free_gpu_memory_fraction: 0.75 - dtype: fp8 - -print_iter_log: true -disable_overlap_scheduler: true - -# Enable the MTP(Multi-Token Prediction) in the prefill model engine -speculative_config: - decoding_type: MTP - num_nextn_predict_layers: 1 - -cache_transceiver_config: - backend: DEFAULT diff --git a/recipes/deepseek-r1/trtllm/disagg/simple/decode.yaml b/recipes/deepseek-r1/trtllm/disagg/simple/decode.yaml deleted file mode 100644 index 28f246574b..0000000000 --- a/recipes/deepseek-r1/trtllm/disagg/simple/decode.yaml +++ /dev/null @@ -1,60 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -backend: pytorch - -# TP/EP/PP/DP -tensor_parallel_size: 4 -moe_expert_parallel_size: 4 -pipeline_parallel_size: 1 -enable_attention_dp: false - -max_batch_size: 256 -max_num_tokens: 256 -# 8448 = 8192 ISL + 256 OSL -max_seq_len: 8448 - -kv_cache_config: - # With dp attention disabled: high free_gpu_memory_fraction is fine. - free_gpu_memory_fraction: 0.85 - # With dp attention enabled: large ISL at high concurrency may need - # free_gpu_memory_fraction low to have enough available memory. - # free_gpu_memory_fraction: 0.30 - dtype: fp8 - -# NOTE: pytorch_backend_config section flattened since: https://github.com/NVIDIA/TensorRT-LLM/pull/4603 -# NOTE: overlap_scheduler enabled by default since this commit and changed -# config field from 'enable_overlap_scheduler' to 'disable_overlap_scheduler': -# https://github.com/NVIDIA/TensorRT-LLM/commit/b4e5df0ee0024eda3eeb83a6ba822245a30ab428 -disable_overlap_scheduler: false - -cuda_graph_config: - enable_padding: true - # NOTE: For larger max batch size, you may want to - # add larger cuda graph batch sizes below to match. - batch_sizes: - - 1 - - 2 - - 4 - - 8 - - 16 - - 32 - - 64 - - 128 - - 256 - -print_iter_log: true - -cache_transceiver_config: - backend: DEFAULT diff --git a/recipes/deepseek-r1/trtllm/disagg/simple/prefill.yaml b/recipes/deepseek-r1/trtllm/disagg/simple/prefill.yaml deleted file mode 100644 index 13b2410a67..0000000000 --- a/recipes/deepseek-r1/trtllm/disagg/simple/prefill.yaml +++ /dev/null @@ -1,39 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -backend: pytorch - -# TP/EP/PP/DP -tensor_parallel_size: 4 -moe_expert_parallel_size: 4 -pipeline_parallel_size: 1 -enable_attention_dp: true - -max_batch_size: 1 -max_num_tokens: 8192 -max_seq_len: 8192 - -kv_cache_config: - free_gpu_memory_fraction: 0.75 - dtype: fp8 # NOTE: This dtype must match in both prefill/decode configs - -# NOTE: pytorch_backend_config section flattened since: https://github.com/NVIDIA/TensorRT-LLM/pull/4603 -# NOTE: overlap_scheduler enabled by default since this commit and changed -# config field from 'enable_overlap_scheduler' to 'disable_overlap_scheduler': -# https://github.com/NVIDIA/TensorRT-LLM/commit/b4e5df0ee0024eda3eeb83a6ba822245a30ab428 -disable_overlap_scheduler: true -print_iter_log: true - -cache_transceiver_config: - backend: DEFAULT diff --git a/recipes/deepseek-r1/trtllm/disagg/wide_ep/eplb.yaml b/recipes/deepseek-r1/trtllm/disagg/wide_ep/eplb.yaml deleted file mode 100644 index f2fe0a13c6..0000000000 --- a/recipes/deepseek-r1/trtllm/disagg/wide_ep/eplb.yaml +++ /dev/null @@ -1,7 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 - -# moe_load_balancer settings for TRTLLM based on: -# https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/ep_load_balancer/README.md#online-ep-load-balancer -num_slots: 288 -layer_updates_per_iter: 2 diff --git a/recipes/deepseek-r1/trtllm/disagg/wide_ep/wide_ep_decode.yaml b/recipes/deepseek-r1/trtllm/disagg/wide_ep/wide_ep_decode.yaml deleted file mode 100644 index 39d392afe9..0000000000 --- a/recipes/deepseek-r1/trtllm/disagg/wide_ep/wide_ep_decode.yaml +++ /dev/null @@ -1,66 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -backend: pytorch - -# WideEP related settings -moe_config: - backend: WIDEEP - load_balancer: /mnt/recipes/deepseek-r1/trtllm/disagg/wide_ep/eplb.yaml - -# TP/EP/PP/DP -tensor_parallel_size: 16 -moe_expert_parallel_size: 16 -pipeline_parallel_size: 1 -enable_attention_dp: true - -max_batch_size: 256 -max_num_tokens: 256 -# 8448 = 8192 ISL + 256 OSL -max_seq_len: 8448 - -kv_cache_config: - # With dp attention disabled: high free_gpu_memory_fraction is fine. - # free_gpu_memory_fraction: 0.85 - # With dp attention enabled: large ISL at high concurrency may need - # free_gpu_memory_fraction low to have enough available memory. 
- free_gpu_memory_fraction: 0.30 - dtype: fp8 - - -# NOTE: pytorch_backend_config section flattened since: https://github.com/NVIDIA/TensorRT-LLM/pull/4603 -# NOTE: overlap_scheduler enabled by default since this commit and changed -# config field from 'enable_overlap_scheduler' to 'disable_overlap_scheduler': -# https://github.com/NVIDIA/TensorRT-LLM/commit/b4e5df0ee0024eda3eeb83a6ba822245a30ab428 -disable_overlap_scheduler: false -cuda_graph_config: - enable_padding: true - # NOTE: For larger max batch size, you may want to - # add larger cuda graph batch sizes below to match. - batch_sizes: - - 1 - - 2 - - 4 - - 8 - - 16 - - 32 - - 64 - - 128 - - 256 - - -print_iter_log: true - -cache_transceiver_config: - backend: DEFAULT diff --git a/recipes/deepseek-r1/trtllm/disagg/wide_ep/wide_ep_prefill.yaml b/recipes/deepseek-r1/trtllm/disagg/wide_ep/wide_ep_prefill.yaml deleted file mode 100644 index 56e862a855..0000000000 --- a/recipes/deepseek-r1/trtllm/disagg/wide_ep/wide_ep_prefill.yaml +++ /dev/null @@ -1,44 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -backend: pytorch - -# WideEP related settings -moe_config: - backend: WIDEEP - load_balancer: /mnt/recipes/deepseek-r1/trtllm/disagg/wide_ep/eplb.yaml - -# TP/EP/PP/DP -tensor_parallel_size: 16 -moe_expert_parallel_size: 16 -pipeline_parallel_size: 1 -enable_attention_dp: true - -max_batch_size: 1 -max_num_tokens: 8192 -max_seq_len: 8192 - -kv_cache_config: - free_gpu_memory_fraction: 0.3 - dtype: fp8 # NOTE: This dtype must match in both prefill/decode configs - -# NOTE: pytorch_backend_config section flattened since: https://github.com/NVIDIA/TensorRT-LLM/pull/4603 -# NOTE: overlap_scheduler enabled by default since this commit and changed -# config field from 'enable_overlap_scheduler' to 'disable_overlap_scheduler': -# https://github.com/NVIDIA/TensorRT-LLM/commit/b4e5df0ee0024eda3eeb83a6ba822245a30ab428 -disable_overlap_scheduler: true -print_iter_log: true - -cache_transceiver_config: - backend: DEFAULT diff --git a/recipes/gemma3/trtllm/vswa_agg.yaml b/recipes/gemma3/trtllm/vswa_agg.yaml deleted file mode 100644 index 6cd7d1dc7f..0000000000 --- a/recipes/gemma3/trtllm/vswa_agg.yaml +++ /dev/null @@ -1,26 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
- -tensor_parallel_size: 1 -backend: pytorch - -kv_cache_config: - max_attention_window: - - 512 - - 512 - - 512 - - 512 - - 512 - - 32768 diff --git a/recipes/gemma3/trtllm/vswa_decode.yaml b/recipes/gemma3/trtllm/vswa_decode.yaml deleted file mode 100644 index c3ea683857..0000000000 --- a/recipes/gemma3/trtllm/vswa_decode.yaml +++ /dev/null @@ -1,29 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -tensor_parallel_size: 1 -backend: pytorch - -kv_cache_config: - max_attention_window: - - 512 - - 512 - - 512 - - 512 - - 512 - - 32768 - -cache_transceiver_config: - backend: DEFAULT diff --git a/recipes/gemma3/trtllm/vswa_prefill.yaml b/recipes/gemma3/trtllm/vswa_prefill.yaml deleted file mode 100644 index 663d241b58..0000000000 --- a/recipes/gemma3/trtllm/vswa_prefill.yaml +++ /dev/null @@ -1,30 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -tensor_parallel_size: 1 -backend: pytorch -disable_overlap_scheduler: true - -kv_cache_config: - max_attention_window: - - 512 - - 512 - - 512 - - 512 - - 512 - - 32768 - -cache_transceiver_config: - backend: DEFAULT diff --git a/recipes/gpt-oss-120b/trtllm/disagg/README.md b/recipes/gpt-oss-120b/trtllm/disagg/README.md new file mode 100644 index 0000000000..10390c9587 --- /dev/null +++ b/recipes/gpt-oss-120b/trtllm/disagg/README.md @@ -0,0 +1,25 @@ +# GPT-OSS-120B Disaggregated Mode + +> **⚠️ INCOMPLETE**: This directory contains only engine configuration files and is not ready for Kubernetes deployment. 
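+For contributors: the missing `deploy.yaml` (see "Missing Components" below) would pair a prefill worker and a decode worker under a single DynamoGraphDeployment, roughly like the sketch here. Every field is an assumption patterned on the aggregated recipe — this skeleton is untested and only outlines the shape:
+
+```yaml
+# Hypothetical skeleton of the missing manifest (not a working recipe)
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: gpt-oss-120b-disagg
+spec:
+  services:
+    Frontend:
+      replicas: 1
+    TrtllmPrefillWorker:
+      replicas: 1    # would embed prefill.yaml as its engine config
+    TrtllmDecodeWorker:
+      replicas: 1    # would embed decode.yaml as its engine config
+```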
+ +## Current Status + +This directory contains TensorRT-LLM engine configurations for disaggregated serving: +- `decode.yaml` - Decode worker engine configuration +- `prefill.yaml` - Prefill worker engine configuration + +## Missing Components + +To complete this recipe, the following files are needed: +- `deploy.yaml` - Kubernetes DynamoGraphDeployment manifest +- `perf.yaml` - Performance benchmarking job (optional) + +## Alternative + +For a production-ready GPT-OSS-120B deployment, use the **aggregated mode**: +- [gpt-oss-120b/trtllm/agg/](../agg/) - Complete with `deploy.yaml` and `perf.yaml` + +## Contributing + +If you'd like to complete this recipe, see [recipes/CONTRIBUTING.md](../../../CONTRIBUTING.md) for guidelines on creating proper Kubernetes deployment manifests. + diff --git a/recipes/llama4/trtllm/eagle/eagle_agg.yml b/recipes/llama4/trtllm/eagle/eagle_agg.yml deleted file mode 100644 index f4144e42ce..0000000000 --- a/recipes/llama4/trtllm/eagle/eagle_agg.yml +++ /dev/null @@ -1,39 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -backend: pytorch -tensor_parallel_size: 4 -moe_expert_parallel_size: 4 -max_batch_size: 192 -max_num_tokens: 3072 -disable_overlap_scheduler: false - -# Enable Speculative Decoding in the model engine -speculative_config: - decoding_type: Eagle - max_draft_len: 3 - speculative_model_dir: nvidia/Llama-4-Maverick-17B-128E-Eagle3 - eagle3_one_model: true - -kv_cache_config: - free_gpu_memory_fraction: 0.2 - enable_block_reuse: false - -cuda_graph_config: - enable_padding: true - batch_sizes: [1,2,3,4,5,6,7,8,16,32,48,64,128,190,191,192] - -print_iter_log: true - diff --git a/recipes/llama4/trtllm/eagle/eagle_decode.yaml b/recipes/llama4/trtllm/eagle/eagle_decode.yaml deleted file mode 100644 index 019cac5ac6..0000000000 --- a/recipes/llama4/trtllm/eagle/eagle_decode.yaml +++ /dev/null @@ -1,52 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
- -backend: pytorch -tensor_parallel_size: 4 -moe_expert_parallel_size: 4 -max_batch_size: 256 -max_num_tokens: 1024 -# 8704 = 8192 ISL + 512 OSL -max_seq_len: 8704 -disable_overlap_scheduler: true - -# Enable Speculative Decoding in the model engine -speculative_config: - decoding_type: Eagle - max_draft_len: 3 - speculative_model_dir: nvidia/Llama-4-Maverick-17B-128E-Eagle3 - eagle3_one_model: true - -kv_cache_config: - free_gpu_memory_fraction: 0.5 - enable_block_reuse: false - -cuda_graph_config: - enable_padding: true - batch_sizes: - - 1 - - 2 - - 4 - - 8 - - 16 - - 32 - - 64 - - 128 - - 256 - -print_iter_log: true - -cache_transceiver_config: - backend: DEFAULT diff --git a/recipes/llama4/trtllm/eagle/eagle_prefill.yaml b/recipes/llama4/trtllm/eagle/eagle_prefill.yaml deleted file mode 100644 index 5b978deece..0000000000 --- a/recipes/llama4/trtllm/eagle/eagle_prefill.yaml +++ /dev/null @@ -1,37 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -backend: pytorch -tensor_parallel_size: 4 -moe_expert_parallel_size: 4 -max_batch_size: 1 -max_num_tokens: 8192 -max_seq_len: 8192 -print_iter_log: true -disable_overlap_scheduler: true - -# Enable Speculative Decoding in the model engine -speculative_config: - decoding_type: Eagle - max_draft_len: 3 - speculative_model_dir: nvidia/Llama-4-Maverick-17B-128E-Eagle3 - eagle3_one_model: true - -kv_cache_config: - free_gpu_memory_fraction: 0.5 - enable_block_reuse: false - -cache_transceiver_config: - backend: DEFAULT diff --git a/recipes/llama4/trtllm/multimodal/agg.yaml b/recipes/llama4/trtllm/multimodal/agg.yaml deleted file mode 100644 index 754f8ce759..0000000000 --- a/recipes/llama4/trtllm/multimodal/agg.yaml +++ /dev/null @@ -1,33 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
-tensor_parallel_size: 8 -moe_expert_parallel_size: 1 -enable_attention_dp: false -max_num_tokens: 4096 -max_batch_size: 8 -trust_remote_code: true -backend: pytorch -enable_chunked_prefill: true - -kv_cache_config: - free_gpu_memory_fraction: 0.3 - enable_block_reuse: false - -cache_transceiver_config: - backend: DEFAULT -# NOTE: pytorch_backend_config section flattened since: https://github.com/NVIDIA/TensorRT-LLM/pull/4603 -# NOTE: overlap_scheduler enabled by default since this commit and changed -# config field from 'enable_overlap_scheduler' to 'disable_overlap_scheduler': -# https://github.com/NVIDIA/TensorRT-LLM/commit/b4e5df0ee0024eda3eeb83a6ba822245a30ab428 diff --git a/recipes/llama4/trtllm/multimodal/decode.yaml b/recipes/llama4/trtllm/multimodal/decode.yaml deleted file mode 100644 index 262a2be1cc..0000000000 --- a/recipes/llama4/trtllm/multimodal/decode.yaml +++ /dev/null @@ -1,29 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -tensor_parallel_size: 8 -moe_expert_parallel_size: 1 -enable_attention_dp: false -max_num_tokens: 8192 -max_batch_size: 16 -trust_remote_code: true -backend: pytorch -enable_chunked_prefill: true -disable_overlap_scheduler: false -kv_cache_config: - free_gpu_memory_fraction: 0.30 - enable_block_reuse: false - -cache_transceiver_config: - backend: DEFAULT \ No newline at end of file diff --git a/recipes/llama4/trtllm/multimodal/prefill.yaml b/recipes/llama4/trtllm/multimodal/prefill.yaml deleted file mode 100644 index 3d2c144015..0000000000 --- a/recipes/llama4/trtllm/multimodal/prefill.yaml +++ /dev/null @@ -1,31 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -tensor_parallel_size: 8 -moe_expert_parallel_size: 1 -enable_attention_dp: false -max_num_tokens: 8192 -max_batch_size: 16 -trust_remote_code: true -backend: pytorch -enable_chunked_prefill: true -# Overlap scheduler not currently supported in prefill only workers. 
-disable_overlap_scheduler: true
-
-kv_cache_config:
-  free_gpu_memory_fraction: 0.30
-  enable_block_reuse: false
-
-cache_transceiver_config:
-  backend: DEFAULT
\ No newline at end of file
diff --git a/recipes/qwen2-vl-7b-instruct/trtllm/agg.yaml b/recipes/qwen2-vl-7b-instruct/trtllm/agg.yaml
deleted file mode 100644
index 754f8ce759..0000000000
--- a/recipes/qwen2-vl-7b-instruct/trtllm/agg.yaml
+++ /dev/null
@@ -1,33 +0,0 @@
-# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-tensor_parallel_size: 8
-moe_expert_parallel_size: 1
-enable_attention_dp: false
-max_num_tokens: 4096
-max_batch_size: 8
-trust_remote_code: true
-backend: pytorch
-enable_chunked_prefill: true
-
-kv_cache_config:
-  free_gpu_memory_fraction: 0.3
-  enable_block_reuse: false
-
-cache_transceiver_config:
-  backend: DEFAULT
-# NOTE: pytorch_backend_config section flattened since: https://github.com/NVIDIA/TensorRT-LLM/pull/4603
-# NOTE: overlap_scheduler enabled by default since this commit and changed
-# config field from 'enable_overlap_scheduler' to 'disable_overlap_scheduler':
-# https://github.com/NVIDIA/TensorRT-LLM/commit/b4e5df0ee0024eda3eeb83a6ba822245a30ab428
diff --git a/recipes/qwen2-vl-7b-instruct/trtllm/decode.yaml b/recipes/qwen2-vl-7b-instruct/trtllm/decode.yaml
deleted file mode 100644
index 6dbd676ee4..0000000000
--- a/recipes/qwen2-vl-7b-instruct/trtllm/decode.yaml
+++ /dev/null
@@ -1,29 +0,0 @@
-# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-tensor_parallel_size: 1
-moe_expert_parallel_size: 1
-enable_attention_dp: false
-max_num_tokens: 8192
-max_batch_size: 16
-trust_remote_code: true
-backend: pytorch
-enable_chunked_prefill: true
-disable_overlap_scheduler: false
-kv_cache_config:
-  free_gpu_memory_fraction: 0.30
-  enable_block_reuse: false
-
-cache_transceiver_config:
-  backend: DEFAULT
\ No newline at end of file
diff --git a/recipes/qwen2-vl-7b-instruct/trtllm/encode.yaml b/recipes/qwen2-vl-7b-instruct/trtllm/encode.yaml
deleted file mode 100644
index 6f0c20990f..0000000000
--- a/recipes/qwen2-vl-7b-instruct/trtllm/encode.yaml
+++ /dev/null
@@ -1,30 +0,0 @@
-# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-tensor_parallel_size: 1
-moe_expert_parallel_size: 1
-enable_attention_dp: false
-max_num_tokens: 8192
-trust_remote_code: true
-backend: pytorch
-disable_overlap_scheduler: false
-
-cuda_graph_config:
-  max_batch_size: 16
-
-kv_cache_config:
-  free_gpu_memory_fraction: 0.85
-
-cache_transceiver_config:
-  backend: DEFAULT
diff --git a/recipes/qwen2-vl-7b-instruct/trtllm/prefill.yaml b/recipes/qwen2-vl-7b-instruct/trtllm/prefill.yaml
deleted file mode 100644
index 83a65e8bf3..0000000000
--- a/recipes/qwen2-vl-7b-instruct/trtllm/prefill.yaml
+++ /dev/null
@@ -1,31 +0,0 @@
-# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-tensor_parallel_size: 1
-moe_expert_parallel_size: 1
-enable_attention_dp: false
-max_num_tokens: 8192
-max_batch_size: 16
-trust_remote_code: true
-backend: pytorch
-enable_chunked_prefill: true
-# Overlap scheduler not currently supported in prefill only workers.
-disable_overlap_scheduler: true
-
-kv_cache_config:
-  free_gpu_memory_fraction: 0.30
-  enable_block_reuse: false
-
-cache_transceiver_config:
-  backend: DEFAULT
\ No newline at end of file
diff --git a/recipes/qwen3/trtllm/agg.yaml b/recipes/qwen3/trtllm/agg.yaml
deleted file mode 100644
index 53e0e6ce38..0000000000
--- a/recipes/qwen3/trtllm/agg.yaml
+++ /dev/null
@@ -1,34 +0,0 @@
-# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-tensor_parallel_size: 1
-moe_expert_parallel_size: 1
-enable_attention_dp: false
-max_num_tokens: 8192
-max_batch_size: 16
-trust_remote_code: true
-backend: pytorch
-enable_chunked_prefill: true
-
-kv_cache_config:
-  free_gpu_memory_fraction: 0.85
-
-# NOTE: pytorch_backend_config section flattened since: https://github.com/NVIDIA/TensorRT-LLM/pull/4603
-# NOTE: overlap_scheduler enabled by default since this commit and changed
-# config field from 'enable_overlap_scheduler' to 'disable_overlap_scheduler':
-# https://github.com/NVIDIA/TensorRT-LLM/commit/b4e5df0ee0024eda3eeb83a6ba822245a30ab428
-
-
-cuda_graph_config:
-  max_batch_size: 16
\ No newline at end of file
diff --git a/recipes/qwen3/trtllm/decode.yaml b/recipes/qwen3/trtllm/decode.yaml
deleted file mode 100644
index a0154bb6e3..0000000000
--- a/recipes/qwen3/trtllm/decode.yaml
+++ /dev/null
@@ -1,31 +0,0 @@
-# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-tensor_parallel_size: 1
-moe_expert_parallel_size: 1
-enable_attention_dp: false
-max_num_tokens: 8192
-trust_remote_code: true
-backend: pytorch
-enable_chunked_prefill: true
-disable_overlap_scheduler: false
-
-cuda_graph_config:
-  max_batch_size: 16
-
-kv_cache_config:
-  free_gpu_memory_fraction: 0.85
-
-cache_transceiver_config:
-  backend: DEFAULT
diff --git a/recipes/qwen3/trtllm/prefill.yaml b/recipes/qwen3/trtllm/prefill.yaml
deleted file mode 100644
index 4996c1fdc6..0000000000
--- a/recipes/qwen3/trtllm/prefill.yaml
+++ /dev/null
@@ -1,30 +0,0 @@
-# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-tensor_parallel_size: 1
-moe_expert_parallel_size: 1
-enable_attention_dp: false
-max_num_tokens: 8192
-trust_remote_code: true
-backend: pytorch
-enable_chunked_prefill: true
-# Overlap scheduler not currently supported in prefill only workers.
-disable_overlap_scheduler: true
-cuda_graph_config:
-  max_batch_size: 16
-kv_cache_config:
-  free_gpu_memory_fraction: 0.85
-
-cache_transceiver_config:
-  backend: DEFAULT
\ No newline at end of file
diff --git a/recipes/run.sh b/recipes/run.sh
deleted file mode 100755
index 980c9333b6..0000000000
--- a/recipes/run.sh
+++ /dev/null
@@ -1,261 +0,0 @@
-#!/usr/bin/env bash
-# SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-set -euo pipefail
-IFS=$'\n\t'
-
-RECIPES_DIR="$( cd "$( dirname "$0" )" && pwd )"
-# Default values
-NAMESPACE="${NAMESPACE:-dynamo}"
-DEPLOY_TYPE=""
-GAIE="${GAIE:-false}"
-DEPLOYMENT=""
-MODEL=""
-FRAMEWORK=""
-DRY_RUN=""
-
-# Frameworks - following container/build.sh pattern
-declare -A FRAMEWORKS=(["VLLM"]=1 ["TRTLLM"]=2 ["SGLANG"]=3)
-DEFAULT_FRAMEWORK=VLLM
-
-# Function to show usage
-usage() {
-    echo "Usage: $0 [OPTIONS] --model <model> --framework <framework> --deployment <deployment-type>"
-    echo ""
-    echo "Required Options:"
-    echo "  --model <model>            Model name (e.g., llama-3-70b)"
-    echo "  --framework <framework>    Framework, one of ${!FRAMEWORKS[*]} (default: ${DEFAULT_FRAMEWORK})"
-    echo "  --deployment <type>        Deployment type (e.g., agg, disagg, etc.; please refer to the README.md for available deployment types)"
-    echo ""
-    echo "Optional:"
-    echo "  --namespace <namespace>    Kubernetes namespace (default: dynamo)"
-    echo "  --dry-run                  Print commands without executing them"
-    echo "  --gaie[=true|false]        Enable GAIE integration subfolder (applies GAIE manifests, skips benchmark) (default: ${GAIE})"
-    echo "  -h, --help                 Show this help message"
-    echo ""
-    echo "Environment Variables:"
-    echo "  NAMESPACE                  Kubernetes namespace (default: dynamo)"
-    echo ""
-    echo "Examples:"
-    echo "  $0 --model llama-3-70b --framework vllm --deployment agg"
-    echo "  $0 --model llama-3-70b --framework trtllm --deployment disagg-single-node"
-    echo "  $0 --namespace my-ns --model llama-3-70b --framework vllm --deployment disagg-multi-node"
-    exit 1
-}
-
-missing_requirement() {
-    echo "ERROR: $1 requires an argument."
-    usage
-}
-
-error() {
-    printf '%s %s\n' "$1" "$2" >&2
-    exit 1
-}
-
-while [[ $# -gt 0 ]]; do
-    case $1 in
-    --dry-run)
-        DRY_RUN="echo"
-        shift
-        ;;
-    --model)
-        if [ "$2" ]; then
-            MODEL=$2
-            shift 2
-        else
-            missing_requirement "$1"
-        fi
-        ;;
-    --framework)
-        if [ "$2" ]; then
-            FRAMEWORK=$2
-            shift 2
-        else
-            missing_requirement "$1"
-        fi
-        ;;
-    --deployment)
-        if [ "$2" ]; then
-            DEPLOYMENT=$2
-            shift 2
-        else
-            missing_requirement "$1"
-        fi
-        ;;
-    --namespace)
-        if [ "$2" ]; then
-            NAMESPACE=$2
-            shift 2
-        else
-            missing_requirement "$1"
-        fi
-        ;;
-    --gaie)
-        GAIE=true
-        shift
-        ;;
-    --gaie=false)
-        GAIE=false
-        shift
-        ;;
-    --gaie=*)
-        GAIE="${1#*=}"
-        case "${GAIE,,}" in
-        true|false) GAIE="${GAIE,,}";;
-        *) echo "ERROR: --gaie must be true or false"; exit 1;;
-        esac
-        shift
-        ;;
-    -h|--help)
-        usage
-        ;;
-    -*)
-        error 'ERROR: Unknown option: ' "$1"
-        ;;
-    *)
-        error "ERROR: Unknown argument: " "$1"
-        ;;
-    esac
-done
-
-if [ -z "$FRAMEWORK" ]; then
-    FRAMEWORK=$DEFAULT_FRAMEWORK
-fi
-
-if [ -n "$FRAMEWORK" ]; then
-    FRAMEWORK=${FRAMEWORK^^}
-    if [[ -z "${FRAMEWORKS[$FRAMEWORK]}" ]]; then
-        error 'ERROR: Unknown framework: ' "$FRAMEWORK"
-    fi
-fi
-
-# Validate required arguments
-if [[ -z "$MODEL" ]] || [[ -z "$DEPLOYMENT" ]]; then
-    if [[ -z "$MODEL" ]]; then
-        echo "ERROR: --model argument is required"
-    fi
-    if [[ -z "$DEPLOYMENT" ]]; then
-        echo "ERROR: --deployment argument is required"
-    fi
-    echo ""
-    usage
-fi
-
-# Construct paths based on new structure: recipes/<model>/<framework>/<deployment>/
-MODEL_DIR="$RECIPES_DIR/$MODEL"
-FRAMEWORK_DIR="$MODEL_DIR/${FRAMEWORK,,}"
-DEPLOY_PATH="$FRAMEWORK_DIR/$DEPLOYMENT"
-INTEGRATION="$([[ "${GAIE,,}" == "true" ]] && echo gaie || echo "")"
-
-# Check if model directory exists
-if [[ ! -d "$MODEL_DIR" ]]; then
-    echo "Error: Model directory '$MODEL' does not exist in $RECIPES_DIR"
-    echo "Available models:"
-    ls -1 "$RECIPES_DIR" | grep -v "\.sh$\|\.md$\|model-cache$" | sed 's/^/ /'
-    exit 1
-fi
-
-# Check if framework directory exists
-if [[ ! -d "$FRAMEWORK_DIR" ]]; then
-    echo "Error: Framework directory '${FRAMEWORK,,}' does not exist in $MODEL_DIR"
-    echo "Available frameworks for $MODEL:"
-    ls -1 "$MODEL_DIR" | grep -v "\.sh$\|\.md$" | sed 's/^/ /'
-    exit 1
-fi
-
-# Check if deployment directory exists
-if [[ ! -d "$DEPLOY_PATH" ]]; then
-    echo "Error: Deployment type '$DEPLOYMENT' does not exist in $FRAMEWORK_DIR"
-    echo "Available deployment types for $MODEL/${FRAMEWORK,,}:"
-    ls -1 "$FRAMEWORK_DIR" | grep -v "\.sh$\|\.md$" | sed 's/^/ /'
-    exit 1
-fi
-
-# Check if deployment files exist
-DEPLOY_FILE="$DEPLOY_PATH/deploy.yaml"
-PERF_FILE="$DEPLOY_PATH/perf.yaml"
-
-if [[ ! -f "$DEPLOY_FILE" ]]; then
-    echo "Error: Deployment file '$DEPLOY_FILE' not found"
-    exit 1
-fi
-
-# Check if perf file exists (optional)
-PERF_AVAILABLE=false
-if [[ -f "$PERF_FILE" ]]; then
-    PERF_AVAILABLE=true
-    echo "Performance benchmark file found: $PERF_FILE"
-else
-    echo "Performance benchmark file not found: $PERF_FILE (skipping benchmarks)"
-fi
-
-# Show deployment information
-echo "======================================"
-echo "Dynamo Recipe Deployment"
-echo "======================================"
-echo "Model: $MODEL"
-echo "Framework: ${FRAMEWORK,,}"
-echo "Deployment Type: $DEPLOYMENT"
-echo "Namespace: $NAMESPACE"
-echo "GAIE integration: $GAIE"
-echo "======================================"
-
-# Handle model downloading
-MODEL_CACHE_DIR="$MODEL_DIR/model-cache"
-echo "Creating PVC for model cache and downloading model..."
-$DRY_RUN kubectl apply -n $NAMESPACE -f $MODEL_CACHE_DIR/model-cache.yaml
-$DRY_RUN kubectl apply -n $NAMESPACE -f $MODEL_CACHE_DIR/model-download.yaml
-
-# Wait for the model download to complete
-MODEL_DOWNLOAD_JOB_NAME=$(grep "name:" $MODEL_CACHE_DIR/model-download.yaml | head -1 | awk '{print $2}')
-echo "Waiting for job '$MODEL_DOWNLOAD_JOB_NAME' to complete..."
-$DRY_RUN kubectl wait --for=condition=Complete job/$MODEL_DOWNLOAD_JOB_NAME -n $NAMESPACE --timeout=6000s
-
-# Deploy the specified configuration
-echo "Deploying $MODEL ${FRAMEWORK,,} $DEPLOYMENT configuration..."
-$DRY_RUN kubectl apply -n $NAMESPACE -f $DEPLOY_FILE
-
-if [[ "$INTEGRATION" == "gaie" ]]; then
-    # run gaie checks.
-    SCRIPT_DIR="$(cd -- "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
-    "${SCRIPT_DIR}/gaie_checks.sh"
-    kubectl apply -f "$DEPLOY_PATH/gaie/k8s-manifests" -n "$NAMESPACE"
-    # For now do not run the benchmark
-    exit
-fi
-
-# Launch the benchmark job (if available)
-if [[ "$PERF_AVAILABLE" == "true" ]]; then
-    echo "Launching benchmark job..."
-    $DRY_RUN kubectl apply -n $NAMESPACE -f $PERF_FILE
-
-    # Construct job name from the perf file
-    JOB_NAME=$(grep "name:" $PERF_FILE | head -1 | awk '{print $2}')
-    echo "Waiting for job '$JOB_NAME' to complete..."
-    $DRY_RUN kubectl wait --for=condition=Complete job/$JOB_NAME -n $NAMESPACE --timeout=6000s
-
-    # Print logs from the benchmark job
-    echo "======================================"
-    echo "Benchmark completed. Logs:"
-    echo "======================================"
-    $DRY_RUN kubectl logs job/$JOB_NAME -n $NAMESPACE
-else
-    echo "======================================"
-    echo "Deployment completed successfully!"
-    echo "No performance benchmark available for this configuration."
-    echo "======================================"
-fi
\ No newline at end of file

From 4947aaf89f9ea3a834d68d51f4447ccfbbf66480 Mon Sep 17 00:00:00 2001
From: Ben Hamm
Date: Thu, 6 Nov 2025 09:11:44 -0800
Subject: [PATCH 2/4] Remove emojis from README section headings
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Address feedback to make the README look less AI-generated by removing
decorative emojis from section headings while keeping status indicators
(✅ ❌) in tables and content.

Signed-off-by: Ben Hamm

---
 recipes/README.md | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/recipes/README.md b/recipes/README.md
index abcac27198..6b751aba32 100644
--- a/recipes/README.md
+++ b/recipes/README.md
@@ -5,7 +5,7 @@ Production-tested Kubernetes deployment recipes for LLM inference using NVIDIA D
 > **Prerequisites:** This guide assumes you have already installed the Dynamo Kubernetes Platform. 
 > If not, follow the **[Kubernetes Deployment Guide](../docs/kubernetes/README.md)** first.
 
-## 📊 Available Recipes
+## Available Recipes
 
 | Model | Framework | Mode | GPUs | Deployment | Benchmark Recipe | Notes |
 |-------|-----------|------|------|------------|------------------|-------|
@@ -24,7 +24,7 @@
 - **Deployment**: ✅ = Complete `deploy.yaml` manifest available | ❌ = Missing or incomplete
 - **Benchmark Recipe**: ✅ = Includes `perf.yaml` for running AIPerf benchmarks | ❌ = No benchmark recipe provided
 
-## 📁 Recipe Structure
+## Recipe Structure
 
 Each complete recipe follows this standard structure:
 
@@ -40,7 +40,7 @@ Each complete recipe follows this standard structure:
 └── perf.yaml (optional) # AIPerf benchmark job
 ```
 
-## 🚀 Quick Start
+## Quick Start
 
 ### Prerequisites
 
@@ -147,7 +147,7 @@ kubectl logs -f job/<job-name> -n ${NAMESPACE}
 kubectl logs job/<job-name> -n ${NAMESPACE} | tail -50
 ```
 
-## 📖 Example Deployments
+## Example Deployments
 
 ### Llama-3-70B with vLLM (Aggregated)
 
@@ -173,7 +173,7 @@ kubectl port-forward svc/llama3-70b-agg-frontend 8000:8000 -n ${NAMESPACE}
 See [deepseek-r1/trtllm/disagg/wide_ep/gb200/deploy.yaml](deepseek-r1/trtllm/disagg/wide_ep/gb200/deploy.yaml) for the complete multi-node WideEP configuration.
 
-## 🛠️ Customization
+## Customization
 
 Each `deploy.yaml` contains:
 - **ConfigMap**: Engine-specific configuration (embedded in the manifest)
@@ -220,7 +220,7 @@ image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1
 # Update version tag as needed
 ```
 
-## 🔧 Troubleshooting
+## Troubleshooting
 
 ### Common Issues
 
@@ -243,7 +243,7 @@
 - [Kubernetes Deployment Guide](../docs/kubernetes/README.md#troubleshooting)
 - [Observability Documentation](../docs/kubernetes/observability/)
 
-## 📖 Related Documentation
+## Related Documentation
 
 - **[Kubernetes Deployment Guide](../docs/kubernetes/README.md)** - Platform installation and concepts
 - **[API Reference](../docs/kubernetes/api_reference.md)** - DynamoGraphDeployment CRD specification
@@ -253,7 +253,7 @@ image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1
 - **[Observability](../docs/kubernetes/observability/)** - Monitoring and logging
 - **[Benchmarking Guide](../docs/benchmarks/benchmarking.md)** - Performance testing
 
-## 🤝 Contributing
+## Contributing
 
 We welcome contributions of new recipes! See [CONTRIBUTING.md](CONTRIBUTING.md) for:
 - Recipe submission guidelines

From 30d7e2cb10e10de233bdca7d79729ad73029b85f Mon Sep 17 00:00:00 2001
From: Ben Hamm
Date: Thu, 6 Nov 2025 10:30:54 -0800
Subject: [PATCH 3/4] docs: add hyperlinks to recipe table for easier
 navigation

---
 recipes/README.md | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/recipes/README.md b/recipes/README.md
index 6b751aba32..fd96639244 100644
--- a/recipes/README.md
+++ b/recipes/README.md
@@ -9,16 +9,16 @@ Production-tested Kubernetes deployment recipes for LLM inference using NVIDIA D
 
 | Model | Framework | Mode | GPUs | Deployment | Benchmark Recipe | Notes |
 |-------|-----------|------|------|------------|------------------|-------|
-| **Llama-3-70B** | vLLM | Aggregated | 4x H100/H200 | ✅ | ✅ | FP8 dynamic quantization |
-| **Llama-3-70B** | vLLM | Disagg (Single-Node) | 8x H100/H200 | ✅ | ✅ | Prefill + Decode separation |
-| **Llama-3-70B** | vLLM | Disagg (Multi-Node) | 16x H100/H200 | ✅ | ✅ | 2 nodes, 8 GPUs each |
-| **Qwen3-32B-FP8** | TensorRT-LLM | Aggregated | 4x GPU | ✅ | ✅ | FP8 quantization |
-| **Qwen3-32B-FP8** | TensorRT-LLM | Disaggregated | 8x GPU | ✅ | ✅ | Prefill + Decode separation |
-| **GPT-OSS-120B** | TensorRT-LLM | Aggregated | 4x GB200 | ✅ | ✅ | Blackwell only, WideEP |
-| **GPT-OSS-120B** | TensorRT-LLM | Disaggregated | TBD | ❌ | ❌ | Engine configs only, no K8s manifest |
-| **DeepSeek-R1** | SGLang | Disagg WideEP | 8x H200 | ✅ | ❌ | Benchmark recipe pending |
-| **DeepSeek-R1** | SGLang | Disagg WideEP | 16x H200 | ✅ | ❌ | Benchmark recipe pending |
-| **DeepSeek-R1** | TensorRT-LLM | Disagg WideEP (GB200) | 32+4 GB200 | ✅ | ✅ | Multi-node: 8 decode + 1 prefill nodes |
+| **[Llama-3-70B](llama-3-70b/vllm/agg/)** | vLLM | Aggregated | 4x H100/H200 | ✅ | ✅ | FP8 dynamic quantization |
+| **[Llama-3-70B](llama-3-70b/vllm/disagg-single-node/)** | vLLM | Disagg (Single-Node) | 8x H100/H200 | ✅ | ✅ | Prefill + Decode separation |
+| **[Llama-3-70B](llama-3-70b/vllm/disagg-multi-node/)** | vLLM | Disagg (Multi-Node) | 16x H100/H200 | ✅ | ✅ | 2 nodes, 8 GPUs each |
+| **[Qwen3-32B-FP8](qwen3-32b-fp8/trtllm/agg/)** | TensorRT-LLM | Aggregated | 4x GPU | ✅ | ✅ | FP8 quantization |
+| **[Qwen3-32B-FP8](qwen3-32b-fp8/trtllm/disagg/)** | TensorRT-LLM | Disaggregated | 8x GPU | ✅ | ✅ | Prefill + Decode separation |
+| **[GPT-OSS-120B](gpt-oss-120b/trtllm/agg/)** | TensorRT-LLM | Aggregated | 4x GB200 | ✅ | ✅ | Blackwell only, WideEP |
+| **[GPT-OSS-120B](gpt-oss-120b/trtllm/disagg/)** | TensorRT-LLM | Disaggregated | TBD | ❌ | ❌ | Engine configs only, no K8s manifest |
+| **[DeepSeek-R1](deepseek-r1/sglang/disagg-8gpu/)** | SGLang | Disagg WideEP | 8x H200 | ✅ | ❌ | Benchmark recipe pending |
+| **[DeepSeek-R1](deepseek-r1/sglang/disagg-16gpu/)** | SGLang | Disagg WideEP | 16x H200 | ✅ | ❌ | Benchmark recipe pending |
+| **[DeepSeek-R1](deepseek-r1/trtllm/disagg/wide_ep/gb200/)** | TensorRT-LLM | Disagg WideEP (GB200) | 32+4 GB200 | ✅ | ✅ | Multi-node: 8 decode + 1 prefill nodes |
 
 **Legend:**
 - **Deployment**: ✅ = Complete `deploy.yaml` manifest available | ❌ = Missing or incomplete
 
From e0592e63693114d6f5cf6476c1e62f1324e87147 Mon Sep 17 00:00:00 2001
From: Ben Hamm
Date: Thu, 6 Nov 2025 10:34:06 -0800
Subject: [PATCH 4/4] docs: fix trailing whitespace in README

---
 recipes/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/recipes/README.md b/recipes/README.md
index fd96639244..af5d97ff44 100644
--- a/recipes/README.md
+++ b/recipes/README.md
@@ -2,7 +2,7 @@
 
 Production-tested Kubernetes deployment recipes for LLM inference using NVIDIA Dynamo.
 
-> **Prerequisites:** This guide assumes you have already installed the Dynamo Kubernetes Platform. 
+> **Prerequisites:** This guide assumes you have already installed the Dynamo Kubernetes Platform.
 > If not, follow the **[Kubernetes Deployment Guide](../docs/kubernetes/README.md)** first.
 
 ## Available Recipes