From 8e9bcd6b0f7da008bedc9295cd009c8808ced079 Mon Sep 17 00:00:00 2001 From: Ben Hamm Date: Wed, 5 Nov 2025 18:08:01 -0800 Subject: [PATCH 1/4] recipes: Clean up incomplete recipes and clarify Kubernetes-only focus Remove incomplete model directories and non-Kubernetes configurations to streamline the recipes directory for production Kubernetes deployments. Changes: - Remove 5 incomplete model directories (deepseek-r1-distill-llama-8b, gemma3, llama4, qwen2-vl-7b-instruct, qwen3) that lack proper Kubernetes deployment manifests - Delete run.sh script (non-Kubernetes automation tool) - Remove standalone engine config YAMLs from deepseek-r1/trtllm that were not wrapped in Kubernetes manifests - Document incomplete gpt-oss-120b disagg recipe with README explaining missing components README improvements: - Restructure Available Recipes table with 'Deployment' and 'Benchmark Recipe' columns to clarify that perf.yaml files are tools for users to run benchmarks, not published performance results - Add comprehensive quick start guide with prerequisites - Link to correct Kubernetes deployment guides - Add troubleshooting section - Remove extraneous links (docs.nvidia.com, license section) Result: 4 models with 10 complete deployment recipes (7 with benchmark scripts), focused exclusively on Kubernetes deployments. Signed-off-by: Ben Hamm --- recipes/README.md | 396 ++++++++---------- .../trtllm/agg.yaml | 34 -- .../trtllm/decode.yaml | 31 -- .../trtllm/prefill.yaml | 30 -- .../deepseek-r1/trtllm/agg/mtp/mtp_agg.yaml | 51 --- .../deepseek-r1/trtllm/agg/simple/agg.yaml | 56 --- .../trtllm/agg/wide_ep/dep16_agg.yaml | 29 -- .../deepseek-r1/trtllm/agg/wide_ep/eplb.yaml | 7 - .../trtllm/agg/wide_ep/wide_ep_agg.yaml | 39 -- .../trtllm/disagg/mtp/mtp_decode.yaml | 57 --- .../trtllm/disagg/mtp/mtp_prefill.yaml | 41 -- .../trtllm/disagg/simple/decode.yaml | 60 --- .../trtllm/disagg/simple/prefill.yaml | 39 -- .../trtllm/disagg/wide_ep/eplb.yaml | 7 - .../trtllm/disagg/wide_ep/wide_ep_decode.yaml | 66 --- .../disagg/wide_ep/wide_ep_prefill.yaml | 44 -- recipes/gemma3/trtllm/vswa_agg.yaml | 26 -- recipes/gemma3/trtllm/vswa_decode.yaml | 29 -- recipes/gemma3/trtllm/vswa_prefill.yaml | 30 -- recipes/gpt-oss-120b/trtllm/disagg/README.md | 25 ++ recipes/llama4/trtllm/eagle/eagle_agg.yml | 39 -- recipes/llama4/trtllm/eagle/eagle_decode.yaml | 52 --- .../llama4/trtllm/eagle/eagle_prefill.yaml | 37 -- recipes/llama4/trtllm/multimodal/agg.yaml | 33 -- recipes/llama4/trtllm/multimodal/decode.yaml | 29 -- recipes/llama4/trtllm/multimodal/prefill.yaml | 31 -- recipes/qwen2-vl-7b-instruct/trtllm/agg.yaml | 33 -- .../qwen2-vl-7b-instruct/trtllm/decode.yaml | 29 -- .../qwen2-vl-7b-instruct/trtllm/encode.yaml | 30 -- .../qwen2-vl-7b-instruct/trtllm/prefill.yaml | 31 -- recipes/qwen3/trtllm/agg.yaml | 34 -- recipes/qwen3/trtllm/decode.yaml | 31 -- recipes/qwen3/trtllm/prefill.yaml | 30 -- recipes/run.sh | 261 ------------ 34 files changed, 210 insertions(+), 1587 deletions(-) delete mode 100644 recipes/deepseek-r1-distill-llama-8b/trtllm/agg.yaml delete mode 100644 recipes/deepseek-r1-distill-llama-8b/trtllm/decode.yaml delete mode 100644 recipes/deepseek-r1-distill-llama-8b/trtllm/prefill.yaml delete mode 100644 recipes/deepseek-r1/trtllm/agg/mtp/mtp_agg.yaml delete mode 100644 recipes/deepseek-r1/trtllm/agg/simple/agg.yaml delete mode 100644 recipes/deepseek-r1/trtllm/agg/wide_ep/dep16_agg.yaml delete mode 100644 recipes/deepseek-r1/trtllm/agg/wide_ep/eplb.yaml delete mode 100644 
recipes/deepseek-r1/trtllm/agg/wide_ep/wide_ep_agg.yaml delete mode 100644 recipes/deepseek-r1/trtllm/disagg/mtp/mtp_decode.yaml delete mode 100644 recipes/deepseek-r1/trtllm/disagg/mtp/mtp_prefill.yaml delete mode 100644 recipes/deepseek-r1/trtllm/disagg/simple/decode.yaml delete mode 100644 recipes/deepseek-r1/trtllm/disagg/simple/prefill.yaml delete mode 100644 recipes/deepseek-r1/trtllm/disagg/wide_ep/eplb.yaml delete mode 100644 recipes/deepseek-r1/trtllm/disagg/wide_ep/wide_ep_decode.yaml delete mode 100644 recipes/deepseek-r1/trtllm/disagg/wide_ep/wide_ep_prefill.yaml delete mode 100644 recipes/gemma3/trtllm/vswa_agg.yaml delete mode 100644 recipes/gemma3/trtllm/vswa_decode.yaml delete mode 100644 recipes/gemma3/trtllm/vswa_prefill.yaml create mode 100644 recipes/gpt-oss-120b/trtllm/disagg/README.md delete mode 100644 recipes/llama4/trtllm/eagle/eagle_agg.yml delete mode 100644 recipes/llama4/trtllm/eagle/eagle_decode.yaml delete mode 100644 recipes/llama4/trtllm/eagle/eagle_prefill.yaml delete mode 100644 recipes/llama4/trtllm/multimodal/agg.yaml delete mode 100644 recipes/llama4/trtllm/multimodal/decode.yaml delete mode 100644 recipes/llama4/trtllm/multimodal/prefill.yaml delete mode 100644 recipes/qwen2-vl-7b-instruct/trtllm/agg.yaml delete mode 100644 recipes/qwen2-vl-7b-instruct/trtllm/decode.yaml delete mode 100644 recipes/qwen2-vl-7b-instruct/trtllm/encode.yaml delete mode 100644 recipes/qwen2-vl-7b-instruct/trtllm/prefill.yaml delete mode 100644 recipes/qwen3/trtllm/agg.yaml delete mode 100644 recipes/qwen3/trtllm/decode.yaml delete mode 100644 recipes/qwen3/trtllm/prefill.yaml delete mode 100755 recipes/run.sh diff --git a/recipes/README.md b/recipes/README.md index 236a38a71a..abcac27198 100644 --- a/recipes/README.md +++ b/recipes/README.md @@ -1,297 +1,271 @@ -# Dynamo Model Serving Recipes +# Dynamo Production-Ready Recipes -This repository contains production-ready recipes for deploying large language models using the Dynamo platform. Each recipe includes deployment configurations, performance benchmarking, and model caching setup. +Production-tested Kubernetes deployment recipes for LLM inference using NVIDIA Dynamo. -## Contents -- [Available Models](#available-models) -- [Quick Start](#quick-start) -- [Prerequisites](#prerequisites) -- Deployment Methods - - [Option 1: Automated Deployment](#option-1-automated-deployment) - - [Option 2: Manual Deployment](#option-2-manual-deployment) +> **Prerequisites:** This guide assumes you have already installed the Dynamo Kubernetes Platform. +> If not, follow the **[Kubernetes Deployment Guide](../docs/kubernetes/README.md)** first. 
+## 📊 Available Recipes

-## Available Models
-
-| Model Family | Framework | Deployment Mode | GPU Requirements | Status | Benchmark |GAIE-integration |
-|-----------------|-----------|---------------------|------------------|--------|-----------|------------------|
-| llama-3-70b | vllm | agg | 4x H100/H200 | ✅ | ✅ |✅ |
-| llama-3-70b | vllm | disagg (1 node) | 8x H100/H200 | ✅ | ✅ | 🚧 |
-| llama-3-70b | vllm | disagg (multi-node) | 16x H100/H200 | ✅ | ✅ |🚧 |
-| deepseek-r1 | sglang | disagg (1 node, wide-ep) | 8x H200 | ✅ | 🚧 |🚧 |
-| deepseek-r1 | sglang | disagg (multi-node, wide-ep) | 16x H200 | ✅ | 🚧 |🚧 |
-| gpt-oss-120b | trtllm | agg | 4x GB200 | ✅ | ✅ |🚧 |
+| Model | Framework | Mode | GPUs | Deployment | Benchmark Recipe | Notes |
+|-------|-----------|------|------|------------|------------------|-------|
+| **Llama-3-70B** | vLLM | Aggregated | 4x H100/H200 | ✅ | ✅ | FP8 dynamic quantization |
+| **Llama-3-70B** | vLLM | Disagg (Single-Node) | 8x H100/H200 | ✅ | ✅ | Prefill + Decode separation |
+| **Llama-3-70B** | vLLM | Disagg (Multi-Node) | 16x H100/H200 | ✅ | ✅ | 2 nodes, 8 GPUs each |
+| **Qwen3-32B-FP8** | TensorRT-LLM | Aggregated | 4x GPU | ✅ | ✅ | FP8 quantization |
+| **Qwen3-32B-FP8** | TensorRT-LLM | Disaggregated | 8x GPU | ✅ | ✅ | Prefill + Decode separation |
+| **GPT-OSS-120B** | TensorRT-LLM | Aggregated | 4x GB200 | ✅ | ✅ | Blackwell only, WideEP |
+| **GPT-OSS-120B** | TensorRT-LLM | Disaggregated | TBD | ❌ | ❌ | Engine configs only, no K8s manifest |
+| **DeepSeek-R1** | SGLang | Disagg WideEP | 8x H200 | ✅ | ❌ | Benchmark recipe pending |
+| **DeepSeek-R1** | SGLang | Disagg WideEP | 16x H200 | ✅ | ❌ | Benchmark recipe pending |
+| **DeepSeek-R1** | TensorRT-LLM | Disagg WideEP (GB200) | 32+4 GB200 | ✅ | ✅ | Multi-node: 8 decode + 1 prefill nodes |

**Legend:**
-- ✅ Functional
-- 🚧 Under development
+- **Deployment**: ✅ = Complete `deploy.yaml` manifest available | ❌ = Missing or incomplete
+- **Benchmark Recipe**: ✅ = Includes `perf.yaml` for running AIPerf benchmarks | ❌ = No benchmark recipe provided
+
+## 📁 Recipe Structure
+Each complete recipe follows this standard structure (a sketch of a typical model-download job follows the tree):
-**Recipe Directory Structure:**
-Recipes are organized into a directory structure that follows the pattern:
-```text
+```
+<model>/
+├── README.md (optional)          # Model-specific deployment notes
├── model-cache/
-│   ├── model-cache.yaml          # PVC for model cache
-│   └── model-download.yaml       # Job for model download
-├── <framework>/
-│   └── <deployment-mode>/
-│       ├── deploy.yaml           # DynamoGraphDeployment CRD and optional configmap for custom configuration
-│       └── perf.yaml (optional)  # Performance benchmark
-└── README.md (optional)          # Model documentation
+│   ├── model-cache.yaml          # PersistentVolumeClaim for model storage
+│   └── model-download.yaml       # Job to download model from HuggingFace
+└── <framework>/                  # vllm, sglang, or trtllm
+    └── <mode>/                   # agg, disagg, disagg-single-node, etc.
+        ├── deploy.yaml           # Complete DynamoGraphDeployment manifest
+        └── perf.yaml (optional)  # AIPerf benchmark job
```
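+As a concrete illustration of the `model-download.yaml` piece, the sketch below shows the general shape such a job takes. It is illustrative only — the container image, the download command, and the resource names (`model-cache`, `hf-token-secret`) are assumptions; the actual file in each recipe is the source of truth:
+
+```yaml
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: model-download
+spec:
+  backoffLimit: 2
+  template:
+    spec:
+      restartPolicy: Never
+      containers:
+        - name: download
+          image: python:3.12-slim            # assumed image; recipes may pin a different one
+          command: ["/bin/sh", "-c"]
+          args:
+            - pip install -q "huggingface_hub[cli]" &&
+              huggingface-cli download <hf-repo-id> --local-dir /model-cache/<model>
+          env:
+            - name: HF_TOKEN                 # read from the secret created in the prerequisites
+              valueFrom:
+                secretKeyRef:
+                  name: hf-token-secret
+                  key: HF_TOKEN
+          volumeMounts:
+            - name: model-cache
+              mountPath: /model-cache
+      volumes:
+        - name: model-cache
+          persistentVolumeClaim:
+            claimName: model-cache           # the PVC defined in model-cache.yaml
+```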
-## Quick Start
-
-Follow the instructions in the [Prerequisites](#prerequisites) section to set up your environment.
-
-Choose your preferred deployment method: using the `run.sh` script or manual deployment steps.
+## 🚀 Quick Start

+### Prerequisites

-## Prerequisites
+**1. Dynamo Platform Installed**

-### 1. Environment Setup
+The recipes require the Dynamo Kubernetes Platform to be installed.
+Follow the installation guide:

-Create a Kubernetes namespace and set environment variable:
+- **[Kubernetes Deployment Guide](../docs/kubernetes/README.md)** - Quickstart (~10 minutes)
+- **[Detailed Installation Guide](../docs/kubernetes/installation_guide.md)** - Advanced options

-```bash
-export NAMESPACE=your-namespace
-kubectl create namespace ${NAMESPACE}
-```
+**2. GPU Cluster Requirements**

-### 2. Deploy Dynamo Platform
-
-Install the Dynamo Cloud Platform following the [Quickstart Guide](../docs/kubernetes/README.md).
-
-### 3. GPU Cluster
-
-Ensure your Kubernetes cluster has:
-- GPU nodes with appropriate GPU types (see model requirements above)
+Ensure your cluster has:
+- GPU nodes matching recipe requirements (see table above)
- GPU operator installed
-- Sufficient GPU memory and compute resources
-
-### 4. Container Registry Access
+- Appropriate GPU drivers and container runtime

-Ensure access to NVIDIA container registry for runtime images:
-- `nvcr.io/nvidia/ai-dynamo/vllm-runtime:x.y.z`
-- `nvcr.io/nvidia/ai-dynamo/trtllm-runtime:x.y.z`
-- `nvcr.io/nvidia/ai-dynamo/sglang-runtime:x.y.z`
+**3. HuggingFace Access**

-### 5. HuggingFace Access and Kubernetes Secret Creation
-
-Set up a kubernetes secret with the HuggingFace token for model download:
+Configure authentication to download models:

```bash
-# Update the token in the secret file
-vim hf_hub_secret/hf_hub_secret.yaml
+export NAMESPACE=your-namespace
+kubectl create namespace ${NAMESPACE}

-# Apply the secret
-kubectl apply -f hf_hub_secret/hf_hub_secret.yaml -n ${NAMESPACE}
+# Create HuggingFace token secret
+kubectl create secret generic hf-token-secret \
+  --from-literal=HF_TOKEN="your-token-here" \
+  -n ${NAMESPACE}
```

-6. Configure Storage Class
+**4. Storage Configuration**
+
+Update the `storageClassName` in `<model>/model-cache/model-cache.yaml` to match your cluster (a sketch of the PVC follows the commands below):

```bash
-# Check available storage classes
+# Find your storage class name
kubectl get storageclass
-```
-
-Replace "your-storage-class-name" with your actual storage class in the file: `<model>/model-cache/model-cache.yaml`
-```yaml
-# In <model>/model-cache/model-cache.yaml
-spec:
-  storageClassName: "your-actual-storage-class" # Replace this
+# Edit the model-cache.yaml file and update:
+#   spec:
+#     storageClassName: "your-actual-storage-class"
```
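+For orientation, the relevant portion of a `model-cache.yaml` PVC looks roughly like the sketch below. The name, size, and access mode are illustrative assumptions — check the actual file in each recipe:
+
+```yaml
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  name: model-cache
+spec:
+  accessModes:
+    - ReadWriteMany                                # assumed; a shared model cache is typically RWX
+  storageClassName: "your-actual-storage-class"    # replace with a class from `kubectl get storageclass`
+  resources:
+    requests:
+      storage: 300Gi                               # size depends on the model weights
+```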
-## Option 1: Automated Deployment
-
-Use the `run.sh` script for fully automated deployment:
-
-**Note:** The script automatically:
-- Create model cache PVC and downloads the model
-- Deploy the model service
-- Runs performance benchmark if a `perf.yaml` file is present in the deployment directory
+### Deploy a Recipe

-#### Script Usage
+**Step 1: Download Model**

```bash
-./run.sh [OPTIONS] --model <model> --framework <framework> --deployment <deployment>
-```
+# Update storageClassName in model-cache.yaml first!
+kubectl apply -f <model>/model-cache/ -n ${NAMESPACE}

-**Required Options:**
-- `--model <model>`: Model name matching the directory name in the recipes directory (e.g., llama-3-70b, gpt-oss-120b, deepseek-r1)
-- `--framework <framework>`: Backend framework (`vllm`, `trtllm`, `sglang`)
-- `--deployment <deployment>`: Deployment mode (e.g., agg, disagg, disagg-single-node, disagg-multi-node)
+# Wait for download to complete (may take 10-60 minutes depending on model size)
+kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=6000s

-**Optional Options:**
-- `--namespace <namespace>`: Kubernetes namespace (default: dynamo)
-- `--dry-run`: Show commands without executing them
-- `-h, --help`: Show help message
+# Monitor progress
+kubectl logs -f job/model-download -n ${NAMESPACE}
+```

-**Environment Variables:**
-- `NAMESPACE`: Kubernetes namespace (default: dynamo)

**Step 2: Deploy Service**

-#### Example Usage
```bash
-# Set up environment
-export NAMESPACE=your-namespace
-kubectl create namespace ${NAMESPACE}
-# Configure HuggingFace token
-kubectl apply -f hf_hub_secret/hf_hub_secret.yaml -n ${NAMESPACE}
-
-# use run.sh script to deploy the model
-# Deploy Llama-3-70B with vLLM (aggregated mode)
-./run.sh --model llama-3-70b --framework vllm --deployment agg
+kubectl apply -f <model>/<framework>/<mode>/deploy.yaml -n ${NAMESPACE}

-# Deploy GPT-OSS-120B with TensorRT-LLM
-./run.sh --model gpt-oss-120b --framework trtllm --deployment agg
-
-# Deploy DeepSeek-R1 with SGLang (disaggregated mode)
-./run.sh --model deepseek-r1 --framework sglang --deployment disagg
+# Check deployment status
+kubectl get dynamographdeployment -n ${NAMESPACE}

-# Deploy with custom namespace
-./run.sh --namespace my-namespace --model llama-3-70b --framework vllm --deployment agg
+# Check pod status
+kubectl get pods -n ${NAMESPACE}

-# Dry run to see what would be executed
-./run.sh --dry-run --model llama-3-70b --framework vllm --deployment agg
+# Wait for pods to be ready
+kubectl wait --for=condition=ready pod -l nvidia.com/dynamo-graph-deployment-name=<deployment-name> -n ${NAMESPACE} --timeout=600s
```

-## If deploying with Gateway API Inference extension GAIE

**Step 3: Test Deployment**

-1. Follow [Deploy Inference Gateway Section 2](../deploy/inference-gateway/README.md#2-deploy-inference-gateway) to install GAIE.
+```bash
+# Port forward to access the service locally
+kubectl port-forward svc/<deployment-name>-frontend 8000:8000 -n ${NAMESPACE}

-2. Apply manifests by running a script.
+# In another terminal, test the endpoint
+curl http://localhost:8000/v1/models

```bash
-# Match the block size to the cli value in your deployment file deploy.yaml:
-   "python3 -m dynamo.vllm ... --block-size 128"
-export DYNAMO_KV_BLOCK_SIZE=128
-export EPP_IMAGE=nvcr.io/you/epp:tag
-# Add --gaie argument to the script i.e.:
-./run.sh --model llama-3-70b --framework vllm --gaie agg
+# Send a test request
+curl http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "<model-name>",
+    "messages": [{"role": "user", "content": "Hello!"}],
+    "max_tokens": 50
+  }'
```

-The script will perform gateway checks and apply the manifests.
-## Option 2: Manual Deployment
-
-For step-by-step manual deployment follow these steps :

**Step 4: Run Benchmark (Optional)**

```bash
-# 0. Set up environment (see Prerequisites section)
-export NAMESPACE=your-namespace
-kubectl create namespace ${NAMESPACE}
-kubectl apply -f hf_hub_secret/hf_hub_secret.yaml -n ${NAMESPACE}
+# Only if perf.yaml exists in the recipe directory
+kubectl apply -f <model>/<framework>/<mode>/perf.yaml -n ${NAMESPACE}
-# 1. Download model (see Model Download section)
-kubectl apply -n $NAMESPACE -f <model>/model-cache/
+# Monitor benchmark progress
+kubectl logs -f job/<benchmark-job-name> -n ${NAMESPACE}

-# 2. Deploy model (see Deployment section)
-kubectl apply -n $NAMESPACE -f <model>/<framework>/<mode>/deploy.yaml
+# View results after completion
+kubectl logs job/<benchmark-job-name> -n ${NAMESPACE} | tail -50

-# 3. Run benchmarks (optional, if perf.yaml exists)
-kubectl apply -n $NAMESPACE -f <model>/<framework>/<mode>/perf.yaml
```

-### Step 1: Download Model
+## 📖 Example Deployments

-```bash
-# Start the download job
-kubectl apply -n $NAMESPACE -f <model>/model-cache
-
-# Verify job creation
-kubectl get jobs -n $NAMESPACE | grep model-download
-```
-
-Monitor and wait for the model download to complete:
+### Llama-3-70B with vLLM (Aggregated)

```bash
+export NAMESPACE=dynamo-demo
+kubectl create namespace ${NAMESPACE}

-
-# Wait for job completion (timeout after 100 minutes)
-kubectl wait --for=condition=Complete job/model-download -n $NAMESPACE --timeout=6000s
+# Create HF token secret
+kubectl create secret generic hf-token-secret \
+  --from-literal=HF_TOKEN="your-token" \
+  -n ${NAMESPACE}

-# Check job status
-kubectl get job model-download -n $NAMESPACE
+# Deploy
+kubectl apply -f llama-3-70b/model-cache/ -n ${NAMESPACE}
+kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=6000s
+kubectl apply -f llama-3-70b/vllm/agg/deploy.yaml -n ${NAMESPACE}

-# View download logs
-kubectl logs job/model-download -n $NAMESPACE
+# Test
+kubectl port-forward svc/llama3-70b-agg-frontend 8000:8000 -n ${NAMESPACE}
```

-### Step 2: Deploy Model Service
+### DeepSeek-R1 on GB200 (Multi-node)

-```bash
-# Navigate to the specific deployment configuration
-cd <model>/<framework>/<mode>/
-
-# Deploy the model service
-kubectl apply -n $NAMESPACE -f deploy.yaml
-
-# Verify deployment creation
-kubectl get deployments -n $NAMESPACE
-```
+See [deepseek-r1/trtllm/disagg/wide_ep/gb200/deploy.yaml](deepseek-r1/trtllm/disagg/wide_ep/gb200/deploy.yaml) for the complete multi-node WideEP configuration.
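+For orientation, the sketch below shows the overall shape these `deploy.yaml` manifests share. It is a trimmed, illustrative example rather than a deployable manifest — the service names, the `extraPodSpec`/`mainContainer` nesting, and the version tag are assumptions; use the actual recipe files as the source of truth:
+
+```yaml
+apiVersion: nvidia.com/v1alpha1            # assumed CRD group/version
+kind: DynamoGraphDeployment
+metadata:
+  name: llama3-70b-agg
+spec:
+  services:
+    Frontend:
+      replicas: 1
+      extraPodSpec:
+        mainContainer:
+          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1
+          args:
+            - python3 -m dynamo.frontend --router-mode kv --http-port 8000
+    VllmDecodeWorker:
+      replicas: 1
+      resources:
+        limits:
+          gpu: "4"                         # one worker spanning 4 GPUs
+      extraPodSpec:
+        mainContainer:
+          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1
+          args:
+            - python3 -m dynamo.vllm --model <model> --served-model-name <name>
+```
+
+The customization points described next map directly onto this structure.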
-#### Wait for Deployment Ready
+## 🛠️ Customization

-```bash
-# Get deployment name from the deploy.yaml file
-DEPLOYMENT_NAME=$(grep "name:" deploy.yaml | head -1 | awk '{print $2}')
-
-# Wait for deployment to be ready (timeout after 10 minutes)
-kubectl wait --for=condition=available deployment/$DEPLOYMENT_NAME -n $NAMESPACE --timeout=1200s
+Each `deploy.yaml` contains:
+- **ConfigMap**: Engine-specific configuration (embedded in the manifest)
+- **DynamoGraphDeployment**: Kubernetes resource definitions
+- **Resource limits**: GPU count, memory, CPU requests/limits
+- **Image references**: Container images with version tags

-# Check deployment status
-kubectl get deployment $DEPLOYMENT_NAME -n $NAMESPACE

### Key Customization Points

-# Check pod status
-kubectl get pods -n $NAMESPACE -l app=$DEPLOYMENT_NAME

**Model Configuration:**
+```yaml
+# In deploy.yaml under worker args:
+args:
+  - python3 -m dynamo.vllm --model <model> --served-model-name <name>
+```

-#### Verify Model Service
-
-```bash
-# Check if service is running
-kubectl get services -n $NAMESPACE
+**GPU Resources:**
+```yaml
+resources:
+  limits:
+    gpu: "4"  # Adjust based on your requirements
+  requests:
+    gpu: "4"
+```

-# Test model endpoint (port-forward to test locally)
-kubectl port-forward service/${DEPLOYMENT_NAME}-frontend 8000:8000 -n $NAMESPACE
+**Scaling:**
+```yaml
+services:
+  VllmDecodeWorker:
+    replicas: 2  # Scale to multiple workers
+```

-# Test the model API (in another terminal)
-curl http://localhost:8000/v1/models
+**Router Mode:**
+```yaml
+# In Frontend args:
+args:
+  - python3 -m dynamo.frontend --router-mode kv --http-port 8000
+# Options: round-robin, kv (KV-aware routing)
+```

-# Stop port-forward when done
-pkill -f "kubectl port-forward"
+**Container Images:**
+```yaml
+image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1
+# Update version tag as needed
```

-### Step 3: Performance Benchmarking (Optional)
+## 🔧 Troubleshooting

-Run performance benchmarks to evaluate model performance. Note that benchmarking is only available for models that include a `perf.yaml` file (optional):

### Common Issues

-#### Launch Benchmark Job
+**Pods stuck in Pending:**
+- Check GPU availability: `kubectl describe node <node-name>` (see the sketch after this list)
+- Verify storage class exists: `kubectl get storageclass`
+- Check resource requests vs. available resources
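+A quick way to survey schedulable GPUs across nodes (a sketch; the resource name assumes the NVIDIA device plugin is installed):
+
+```bash
+# List allocatable nvidia.com/gpu per node
+kubectl get nodes -o custom-columns='NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
+
+# Then compare against what the recipe requests, e.g. 4 GPUs per worker replica
+kubectl describe node <node-name> | grep -A8 'Allocated resources'
+```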
-```bash
-# From the deployment directory
-kubectl apply -n $NAMESPACE -f perf.yaml
+**Model download fails:**
+- Verify HuggingFace token is correct
+- Check network connectivity from cluster
+- Review job logs: `kubectl logs job/model-download -n ${NAMESPACE}`

-# Verify benchmark job creation
-kubectl get jobs -n $NAMESPACE
-```
+**Workers fail to start:**
+- Check GPU compatibility (driver version, CUDA version)
+- Verify image pull secrets if using private registries
+- Review pod logs: `kubectl logs <pod-name> -n ${NAMESPACE}`

-#### Monitor Benchmark Progress
+**For more troubleshooting:**
+- [Kubernetes Deployment Guide](../docs/kubernetes/README.md#troubleshooting)
+- [Observability Documentation](../docs/kubernetes/observability/)

-```bash
-# Get benchmark job name
-PERF_JOB_NAME=$(grep "name:" perf.yaml | head -1 | awk '{print $2}')

## 📖 Related Documentation

-# Monitor benchmark logs in real-time
-kubectl logs -f job/$PERF_JOB_NAME -n $NAMESPACE
+- **[Kubernetes Deployment Guide](../docs/kubernetes/README.md)** - Platform installation and concepts
+- **[API Reference](../docs/kubernetes/api_reference.md)** - DynamoGraphDeployment CRD specification
+- **[vLLM Backend Guide](../docs/backends/vllm/README.md)** - vLLM-specific features
+- **[SGLang Backend Guide](../docs/backends/sglang/README.md)** - SGLang-specific features
+- **[TensorRT-LLM Backend Guide](../docs/backends/trtllm/README.md)** - TensorRT-LLM features
+- **[Observability](../docs/kubernetes/observability/)** - Monitoring and logging
+- **[Benchmarking Guide](../docs/benchmarks/benchmarking.md)** - Performance testing

-# Wait for benchmark completion (timeout after 100 minutes)
-kubectl wait --for=condition=Complete job/$PERF_JOB_NAME -n $NAMESPACE --timeout=6000s
```

## 🤝 Contributing

-#### View Benchmark Results
+We welcome contributions of new recipes! See [CONTRIBUTING.md](CONTRIBUTING.md) for:
+- Recipe submission guidelines
+- Required components checklist
+- Testing and validation requirements
+- Documentation standards

-```bash
-# Check final benchmark results
-kubectl logs job/$PERF_JOB_NAME -n $NAMESPACE | tail -50
-```
\ No newline at end of file
+### Recipe Quality Standards
+
+A production-ready recipe must include:
+- ✅ Complete `deploy.yaml` with DynamoGraphDeployment
+- ✅ Model cache PVC and download job
+- ✅ Benchmark recipe (`perf.yaml`) for performance testing
+- ✅ Verification on target hardware
+- ✅ Documentation of GPU requirements
diff --git a/recipes/deepseek-r1-distill-llama-8b/trtllm/agg.yaml b/recipes/deepseek-r1-distill-llama-8b/trtllm/agg.yaml
deleted file mode 100644
index 53e0e6ce38..0000000000
--- a/recipes/deepseek-r1-distill-llama-8b/trtllm/agg.yaml
+++ /dev/null
@@ -1,34 +0,0 @@
-# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-tensor_parallel_size: 1 -moe_expert_parallel_size: 1 -enable_attention_dp: false -max_num_tokens: 8192 -max_batch_size: 16 -trust_remote_code: true -backend: pytorch -enable_chunked_prefill: true - -kv_cache_config: - free_gpu_memory_fraction: 0.85 - -# NOTE: pytorch_backend_config section flattened since: https://github.com/NVIDIA/TensorRT-LLM/pull/4603 -# NOTE: overlap_scheduler enabled by default since this commit and changed -# config field from 'enable_overlap_scheduler' to 'disable_overlap_scheduler': -# https://github.com/NVIDIA/TensorRT-LLM/commit/b4e5df0ee0024eda3eeb83a6ba822245a30ab428 - - -cuda_graph_config: - max_batch_size: 16 \ No newline at end of file diff --git a/recipes/deepseek-r1-distill-llama-8b/trtllm/decode.yaml b/recipes/deepseek-r1-distill-llama-8b/trtllm/decode.yaml deleted file mode 100644 index a0154bb6e3..0000000000 --- a/recipes/deepseek-r1-distill-llama-8b/trtllm/decode.yaml +++ /dev/null @@ -1,31 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -tensor_parallel_size: 1 -moe_expert_parallel_size: 1 -enable_attention_dp: false -max_num_tokens: 8192 -trust_remote_code: true -backend: pytorch -enable_chunked_prefill: true -disable_overlap_scheduler: false - -cuda_graph_config: - max_batch_size: 16 - -kv_cache_config: - free_gpu_memory_fraction: 0.85 - -cache_transceiver_config: - backend: DEFAULT diff --git a/recipes/deepseek-r1-distill-llama-8b/trtllm/prefill.yaml b/recipes/deepseek-r1-distill-llama-8b/trtllm/prefill.yaml deleted file mode 100644 index 4996c1fdc6..0000000000 --- a/recipes/deepseek-r1-distill-llama-8b/trtllm/prefill.yaml +++ /dev/null @@ -1,30 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -tensor_parallel_size: 1 -moe_expert_parallel_size: 1 -enable_attention_dp: false -max_num_tokens: 8192 -trust_remote_code: true -backend: pytorch -enable_chunked_prefill: true -# Overlap scheduler not currently supported in prefill only workers. 
-disable_overlap_scheduler: true -cuda_graph_config: - max_batch_size: 16 -kv_cache_config: - free_gpu_memory_fraction: 0.85 - -cache_transceiver_config: - backend: DEFAULT \ No newline at end of file diff --git a/recipes/deepseek-r1/trtllm/agg/mtp/mtp_agg.yaml b/recipes/deepseek-r1/trtllm/agg/mtp/mtp_agg.yaml deleted file mode 100644 index 25fae60abf..0000000000 --- a/recipes/deepseek-r1/trtllm/agg/mtp/mtp_agg.yaml +++ /dev/null @@ -1,51 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -# NOTE: FP4 only supported starting with Blackwell GPUs. -# https://huggingface.co/nvidia/DeepSeek-R1-FP4 -# You can also specify the full path to locally downloaded weights -# instead of a HuggingFace ID here. - -backend: pytorch -tensor_parallel_size: 4 -moe_expert_parallel_size: 4 -enable_attention_dp: true -max_batch_size: 256 -# 8448 = 8192 ISL + 256 OSL -max_num_tokens: 8448 -max_seq_len: 8448 -kv_cache_config: - free_gpu_memory_fraction: 0.30 - dtype: fp8 - -# Enable the MTP(Multi-Token Prediction) in the model engine -speculative_config: - decoding_type: MTP - num_nextn_predict_layers: 1 - -cuda_graph_config: - enable_padding: true - batch_sizes: - - 1 - - 2 - - 4 - - 8 - - 16 - - 32 - - 64 - - 128 - - 256 - -print_iter_log: true diff --git a/recipes/deepseek-r1/trtllm/agg/simple/agg.yaml b/recipes/deepseek-r1/trtllm/agg/simple/agg.yaml deleted file mode 100644 index db2377a92a..0000000000 --- a/recipes/deepseek-r1/trtllm/agg/simple/agg.yaml +++ /dev/null @@ -1,56 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -backend: pytorch - -# TP/EP/PP/DP -tensor_parallel_size: 4 -moe_expert_parallel_size: 4 -pipeline_parallel_size: 1 -enable_attention_dp: false - -max_batch_size: 256 -# 8448 = 8192 ISL + 256 OSL -max_num_tokens: 8448 -max_seq_len: 8448 - -kv_cache_config: - # With dp attention disabled: high free_gpu_memory_fraction is fine. - free_gpu_memory_fraction: 0.85 - # With dp attention enabled: large ISL at high concurrency may need - # free_gpu_memory_fraction low to have enough available memory. 
- # free_gpu_memory_fraction: 0.30 - dtype: fp8 - - -# NOTE: pytorch_backend_config section flattened since: https://github.com/NVIDIA/TensorRT-LLM/pull/4603 -# NOTE: overlap_scheduler enabled by default since this commit and changed -# config field from 'enable_overlap_scheduler' to 'disable_overlap_scheduler': -# https://github.com/NVIDIA/TensorRT-LLM/commit/b4e5df0ee0024eda3eeb83a6ba822245a30ab428 -cuda_graph_config: - enable_padding: true -# NOTE: For larger max batch size, you may want to add larger cuda graph -# batch sizes below to match. - batch_sizes: - - 1 - - 2 - - 4 - - 8 - - 16 - - 32 - - 64 - - 128 - - 256 - -print_iter_log: true diff --git a/recipes/deepseek-r1/trtllm/agg/wide_ep/dep16_agg.yaml b/recipes/deepseek-r1/trtllm/agg/wide_ep/dep16_agg.yaml deleted file mode 100644 index 844c4ffa72..0000000000 --- a/recipes/deepseek-r1/trtllm/agg/wide_ep/dep16_agg.yaml +++ /dev/null @@ -1,29 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Example of a Multi-node worker, but no WideEP or EPLB. -# See wide_ep*.yaml for WideEP example configs. -backend: pytorch -tensor_parallel_size: 16 -moe_expert_parallel_size: 16 -enable_attention_dp: true -max_batch_size: 256 -max_num_tokens: 256 -max_seq_len: 8448 - -kv_cache_config: - free_gpu_memory_fraction: 0.7 - dtype: fp8 - -cuda_graph_config: - enable_padding: true - batch_sizes: - - 1 - - 2 - - 4 - - 8 - - 16 - - 32 - - 64 - - 128 - - 256 diff --git a/recipes/deepseek-r1/trtllm/agg/wide_ep/eplb.yaml b/recipes/deepseek-r1/trtllm/agg/wide_ep/eplb.yaml deleted file mode 100644 index f2fe0a13c6..0000000000 --- a/recipes/deepseek-r1/trtllm/agg/wide_ep/eplb.yaml +++ /dev/null @@ -1,7 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 - -# moe_load_balancer settings for TRTLLM based on: -# https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/ep_load_balancer/README.md#online-ep-load-balancer -num_slots: 288 -layer_updates_per_iter: 2 diff --git a/recipes/deepseek-r1/trtllm/agg/wide_ep/wide_ep_agg.yaml b/recipes/deepseek-r1/trtllm/agg/wide_ep/wide_ep_agg.yaml deleted file mode 100644 index bcd6ae87e0..0000000000 --- a/recipes/deepseek-r1/trtllm/agg/wide_ep/wide_ep_agg.yaml +++ /dev/null @@ -1,39 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -backend: pytorch - -# WideEP related settings -moe_config: - backend: WIDEEP - # moe_max_num_tokens will default to max_num_tokens if left unspecified. 
- # - # If you want to set this value explicitly, one recommendation is below: - # moe_max_num_tokens = max_batch_size * moe_expert_parallel_size - # 4096 = 256 * 16 - # moe_max_num_tokens: 4096 - load_balancer: /mnt/recipes/deepseek-r1/trtllm/agg/wide_ep/eplb.yaml - -tensor_parallel_size: 16 -moe_expert_parallel_size: 16 - -enable_attention_dp: true -max_batch_size: 256 -max_num_tokens: 256 -max_seq_len: 8448 - -kv_cache_config: - free_gpu_memory_fraction: 0.3 - dtype: fp8 - -cuda_graph_config: - enable_padding: true - batch_sizes: - - 1 - - 2 - - 4 - - 8 - - 16 - - 32 - - 64 - - 128 - - 256 \ No newline at end of file diff --git a/recipes/deepseek-r1/trtllm/disagg/mtp/mtp_decode.yaml b/recipes/deepseek-r1/trtllm/disagg/mtp/mtp_decode.yaml deleted file mode 100644 index 8f0bd83919..0000000000 --- a/recipes/deepseek-r1/trtllm/disagg/mtp/mtp_decode.yaml +++ /dev/null @@ -1,57 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -# NOTE: FP4 only supported starting with Blackwell GPUs. -# https://huggingface.co/nvidia/DeepSeek-R1-FP4 -# You can also specify the full path to locally downloaded weights -# instead of a HuggingFace ID here. - -backend: pytorch -tensor_parallel_size: 4 -moe_expert_parallel_size: 4 -enable_attention_dp: false -max_batch_size: 256 -# Note: When MPT is enabled and `cuda_graph_batch_sizes` is specified, `max_num_tokens` must satisfy the following formula: -# max_num_tokens >= max(cuda_graph_batch_sizes) * (num_nextn_predict_layers + 1) -# This is a known issue in TensorRT-LLM and will be resolved in the next release. -max_num_tokens: 512 -# 8704 = 8192 ISL + 512 OSL -max_seq_len: 8704 -kv_cache_config: - free_gpu_memory_fraction: 0.85 - dtype: fp8 - -# Enable the MTP(Multi-Token Prediction) in decode model engine -speculative_config: - decoding_type: MTP - num_nextn_predict_layers: 1 - -cuda_graph_config: - enable_padding: true - batch_sizes: - - 1 - - 2 - - 4 - - 8 - - 16 - - 32 - - 64 - - 128 - - 256 - -print_iter_log: true - -cache_transceiver_config: - backend: DEFAULT diff --git a/recipes/deepseek-r1/trtllm/disagg/mtp/mtp_prefill.yaml b/recipes/deepseek-r1/trtllm/disagg/mtp/mtp_prefill.yaml deleted file mode 100644 index 46494e8d68..0000000000 --- a/recipes/deepseek-r1/trtllm/disagg/mtp/mtp_prefill.yaml +++ /dev/null @@ -1,41 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
-# See the License for the specific language governing permissions and -# limitations under the License. - -# NOTE: FP4 only supported starting with Blackwell GPUs. -# https://huggingface.co/nvidia/DeepSeek-R1-FP4 -# You can also specify the full path to locally downloaded weights -# instead of a HuggingFace ID here. - -backend: pytorch -tensor_parallel_size: 4 -moe_expert_parallel_size: 4 -enable_attention_dp: true -max_batch_size: 1 -max_num_tokens: 8192 -max_seq_len: 8192 -kv_cache_config: - free_gpu_memory_fraction: 0.75 - dtype: fp8 - -print_iter_log: true -disable_overlap_scheduler: true - -# Enable the MTP(Multi-Token Prediction) in the prefill model engine -speculative_config: - decoding_type: MTP - num_nextn_predict_layers: 1 - -cache_transceiver_config: - backend: DEFAULT diff --git a/recipes/deepseek-r1/trtllm/disagg/simple/decode.yaml b/recipes/deepseek-r1/trtllm/disagg/simple/decode.yaml deleted file mode 100644 index 28f246574b..0000000000 --- a/recipes/deepseek-r1/trtllm/disagg/simple/decode.yaml +++ /dev/null @@ -1,60 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -backend: pytorch - -# TP/EP/PP/DP -tensor_parallel_size: 4 -moe_expert_parallel_size: 4 -pipeline_parallel_size: 1 -enable_attention_dp: false - -max_batch_size: 256 -max_num_tokens: 256 -# 8448 = 8192 ISL + 256 OSL -max_seq_len: 8448 - -kv_cache_config: - # With dp attention disabled: high free_gpu_memory_fraction is fine. - free_gpu_memory_fraction: 0.85 - # With dp attention enabled: large ISL at high concurrency may need - # free_gpu_memory_fraction low to have enough available memory. - # free_gpu_memory_fraction: 0.30 - dtype: fp8 - -# NOTE: pytorch_backend_config section flattened since: https://github.com/NVIDIA/TensorRT-LLM/pull/4603 -# NOTE: overlap_scheduler enabled by default since this commit and changed -# config field from 'enable_overlap_scheduler' to 'disable_overlap_scheduler': -# https://github.com/NVIDIA/TensorRT-LLM/commit/b4e5df0ee0024eda3eeb83a6ba822245a30ab428 -disable_overlap_scheduler: false - -cuda_graph_config: - enable_padding: true - # NOTE: For larger max batch size, you may want to - # add larger cuda graph batch sizes below to match. - batch_sizes: - - 1 - - 2 - - 4 - - 8 - - 16 - - 32 - - 64 - - 128 - - 256 - -print_iter_log: true - -cache_transceiver_config: - backend: DEFAULT diff --git a/recipes/deepseek-r1/trtllm/disagg/simple/prefill.yaml b/recipes/deepseek-r1/trtllm/disagg/simple/prefill.yaml deleted file mode 100644 index 13b2410a67..0000000000 --- a/recipes/deepseek-r1/trtllm/disagg/simple/prefill.yaml +++ /dev/null @@ -1,39 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -backend: pytorch - -# TP/EP/PP/DP -tensor_parallel_size: 4 -moe_expert_parallel_size: 4 -pipeline_parallel_size: 1 -enable_attention_dp: true - -max_batch_size: 1 -max_num_tokens: 8192 -max_seq_len: 8192 - -kv_cache_config: - free_gpu_memory_fraction: 0.75 - dtype: fp8 # NOTE: This dtype must match in both prefill/decode configs - -# NOTE: pytorch_backend_config section flattened since: https://github.com/NVIDIA/TensorRT-LLM/pull/4603 -# NOTE: overlap_scheduler enabled by default since this commit and changed -# config field from 'enable_overlap_scheduler' to 'disable_overlap_scheduler': -# https://github.com/NVIDIA/TensorRT-LLM/commit/b4e5df0ee0024eda3eeb83a6ba822245a30ab428 -disable_overlap_scheduler: true -print_iter_log: true - -cache_transceiver_config: - backend: DEFAULT diff --git a/recipes/deepseek-r1/trtllm/disagg/wide_ep/eplb.yaml b/recipes/deepseek-r1/trtllm/disagg/wide_ep/eplb.yaml deleted file mode 100644 index f2fe0a13c6..0000000000 --- a/recipes/deepseek-r1/trtllm/disagg/wide_ep/eplb.yaml +++ /dev/null @@ -1,7 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 - -# moe_load_balancer settings for TRTLLM based on: -# https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/ep_load_balancer/README.md#online-ep-load-balancer -num_slots: 288 -layer_updates_per_iter: 2 diff --git a/recipes/deepseek-r1/trtllm/disagg/wide_ep/wide_ep_decode.yaml b/recipes/deepseek-r1/trtllm/disagg/wide_ep/wide_ep_decode.yaml deleted file mode 100644 index 39d392afe9..0000000000 --- a/recipes/deepseek-r1/trtllm/disagg/wide_ep/wide_ep_decode.yaml +++ /dev/null @@ -1,66 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -backend: pytorch - -# WideEP related settings -moe_config: - backend: WIDEEP - load_balancer: /mnt/recipes/deepseek-r1/trtllm/disagg/wide_ep/eplb.yaml - -# TP/EP/PP/DP -tensor_parallel_size: 16 -moe_expert_parallel_size: 16 -pipeline_parallel_size: 1 -enable_attention_dp: true - -max_batch_size: 256 -max_num_tokens: 256 -# 8448 = 8192 ISL + 256 OSL -max_seq_len: 8448 - -kv_cache_config: - # With dp attention disabled: high free_gpu_memory_fraction is fine. - # free_gpu_memory_fraction: 0.85 - # With dp attention enabled: large ISL at high concurrency may need - # free_gpu_memory_fraction low to have enough available memory. 
- free_gpu_memory_fraction: 0.30 - dtype: fp8 - - -# NOTE: pytorch_backend_config section flattened since: https://github.com/NVIDIA/TensorRT-LLM/pull/4603 -# NOTE: overlap_scheduler enabled by default since this commit and changed -# config field from 'enable_overlap_scheduler' to 'disable_overlap_scheduler': -# https://github.com/NVIDIA/TensorRT-LLM/commit/b4e5df0ee0024eda3eeb83a6ba822245a30ab428 -disable_overlap_scheduler: false -cuda_graph_config: - enable_padding: true - # NOTE: For larger max batch size, you may want to - # add larger cuda graph batch sizes below to match. - batch_sizes: - - 1 - - 2 - - 4 - - 8 - - 16 - - 32 - - 64 - - 128 - - 256 - - -print_iter_log: true - -cache_transceiver_config: - backend: DEFAULT diff --git a/recipes/deepseek-r1/trtllm/disagg/wide_ep/wide_ep_prefill.yaml b/recipes/deepseek-r1/trtllm/disagg/wide_ep/wide_ep_prefill.yaml deleted file mode 100644 index 56e862a855..0000000000 --- a/recipes/deepseek-r1/trtllm/disagg/wide_ep/wide_ep_prefill.yaml +++ /dev/null @@ -1,44 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -backend: pytorch - -# WideEP related settings -moe_config: - backend: WIDEEP - load_balancer: /mnt/recipes/deepseek-r1/trtllm/disagg/wide_ep/eplb.yaml - -# TP/EP/PP/DP -tensor_parallel_size: 16 -moe_expert_parallel_size: 16 -pipeline_parallel_size: 1 -enable_attention_dp: true - -max_batch_size: 1 -max_num_tokens: 8192 -max_seq_len: 8192 - -kv_cache_config: - free_gpu_memory_fraction: 0.3 - dtype: fp8 # NOTE: This dtype must match in both prefill/decode configs - -# NOTE: pytorch_backend_config section flattened since: https://github.com/NVIDIA/TensorRT-LLM/pull/4603 -# NOTE: overlap_scheduler enabled by default since this commit and changed -# config field from 'enable_overlap_scheduler' to 'disable_overlap_scheduler': -# https://github.com/NVIDIA/TensorRT-LLM/commit/b4e5df0ee0024eda3eeb83a6ba822245a30ab428 -disable_overlap_scheduler: true -print_iter_log: true - -cache_transceiver_config: - backend: DEFAULT diff --git a/recipes/gemma3/trtllm/vswa_agg.yaml b/recipes/gemma3/trtllm/vswa_agg.yaml deleted file mode 100644 index 6cd7d1dc7f..0000000000 --- a/recipes/gemma3/trtllm/vswa_agg.yaml +++ /dev/null @@ -1,26 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
- -tensor_parallel_size: 1 -backend: pytorch - -kv_cache_config: - max_attention_window: - - 512 - - 512 - - 512 - - 512 - - 512 - - 32768 diff --git a/recipes/gemma3/trtllm/vswa_decode.yaml b/recipes/gemma3/trtllm/vswa_decode.yaml deleted file mode 100644 index c3ea683857..0000000000 --- a/recipes/gemma3/trtllm/vswa_decode.yaml +++ /dev/null @@ -1,29 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -tensor_parallel_size: 1 -backend: pytorch - -kv_cache_config: - max_attention_window: - - 512 - - 512 - - 512 - - 512 - - 512 - - 32768 - -cache_transceiver_config: - backend: DEFAULT diff --git a/recipes/gemma3/trtllm/vswa_prefill.yaml b/recipes/gemma3/trtllm/vswa_prefill.yaml deleted file mode 100644 index 663d241b58..0000000000 --- a/recipes/gemma3/trtllm/vswa_prefill.yaml +++ /dev/null @@ -1,30 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -tensor_parallel_size: 1 -backend: pytorch -disable_overlap_scheduler: true - -kv_cache_config: - max_attention_window: - - 512 - - 512 - - 512 - - 512 - - 512 - - 32768 - -cache_transceiver_config: - backend: DEFAULT diff --git a/recipes/gpt-oss-120b/trtllm/disagg/README.md b/recipes/gpt-oss-120b/trtllm/disagg/README.md new file mode 100644 index 0000000000..10390c9587 --- /dev/null +++ b/recipes/gpt-oss-120b/trtllm/disagg/README.md @@ -0,0 +1,25 @@ +# GPT-OSS-120B Disaggregated Mode + +> **⚠️ INCOMPLETE**: This directory contains only engine configuration files and is not ready for Kubernetes deployment. 
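+For contributors: the missing `deploy.yaml` (see "Missing Components" below) would pair a prefill worker and a decode worker under a single DynamoGraphDeployment, roughly like the sketch here. Every field is an assumption patterned on the aggregated recipe — this skeleton is untested and only outlines the shape:
+
+```yaml
+# Hypothetical skeleton of the missing manifest (not a working recipe)
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: gpt-oss-120b-disagg
+spec:
+  services:
+    Frontend:
+      replicas: 1
+    TrtllmPrefillWorker:
+      replicas: 1    # would embed prefill.yaml as its engine config
+    TrtllmDecodeWorker:
+      replicas: 1    # would embed decode.yaml as its engine config
+```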
+ +## Current Status + +This directory contains TensorRT-LLM engine configurations for disaggregated serving: +- `decode.yaml` - Decode worker engine configuration +- `prefill.yaml` - Prefill worker engine configuration + +## Missing Components + +To complete this recipe, the following files are needed: +- `deploy.yaml` - Kubernetes DynamoGraphDeployment manifest +- `perf.yaml` - Performance benchmarking job (optional) + +## Alternative + +For a production-ready GPT-OSS-120B deployment, use the **aggregated mode**: +- [gpt-oss-120b/trtllm/agg/](../agg/) - Complete with `deploy.yaml` and `perf.yaml` + +## Contributing + +If you'd like to complete this recipe, see [recipes/CONTRIBUTING.md](../../../CONTRIBUTING.md) for guidelines on creating proper Kubernetes deployment manifests. + diff --git a/recipes/llama4/trtllm/eagle/eagle_agg.yml b/recipes/llama4/trtllm/eagle/eagle_agg.yml deleted file mode 100644 index f4144e42ce..0000000000 --- a/recipes/llama4/trtllm/eagle/eagle_agg.yml +++ /dev/null @@ -1,39 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -backend: pytorch -tensor_parallel_size: 4 -moe_expert_parallel_size: 4 -max_batch_size: 192 -max_num_tokens: 3072 -disable_overlap_scheduler: false - -# Enable Speculative Decoding in the model engine -speculative_config: - decoding_type: Eagle - max_draft_len: 3 - speculative_model_dir: nvidia/Llama-4-Maverick-17B-128E-Eagle3 - eagle3_one_model: true - -kv_cache_config: - free_gpu_memory_fraction: 0.2 - enable_block_reuse: false - -cuda_graph_config: - enable_padding: true - batch_sizes: [1,2,3,4,5,6,7,8,16,32,48,64,128,190,191,192] - -print_iter_log: true - diff --git a/recipes/llama4/trtllm/eagle/eagle_decode.yaml b/recipes/llama4/trtllm/eagle/eagle_decode.yaml deleted file mode 100644 index 019cac5ac6..0000000000 --- a/recipes/llama4/trtllm/eagle/eagle_decode.yaml +++ /dev/null @@ -1,52 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
- -backend: pytorch -tensor_parallel_size: 4 -moe_expert_parallel_size: 4 -max_batch_size: 256 -max_num_tokens: 1024 -# 8704 = 8192 ISL + 512 OSL -max_seq_len: 8704 -disable_overlap_scheduler: true - -# Enable Speculative Decoding in the model engine -speculative_config: - decoding_type: Eagle - max_draft_len: 3 - speculative_model_dir: nvidia/Llama-4-Maverick-17B-128E-Eagle3 - eagle3_one_model: true - -kv_cache_config: - free_gpu_memory_fraction: 0.5 - enable_block_reuse: false - -cuda_graph_config: - enable_padding: true - batch_sizes: - - 1 - - 2 - - 4 - - 8 - - 16 - - 32 - - 64 - - 128 - - 256 - -print_iter_log: true - -cache_transceiver_config: - backend: DEFAULT diff --git a/recipes/llama4/trtllm/eagle/eagle_prefill.yaml b/recipes/llama4/trtllm/eagle/eagle_prefill.yaml deleted file mode 100644 index 5b978deece..0000000000 --- a/recipes/llama4/trtllm/eagle/eagle_prefill.yaml +++ /dev/null @@ -1,37 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -backend: pytorch -tensor_parallel_size: 4 -moe_expert_parallel_size: 4 -max_batch_size: 1 -max_num_tokens: 8192 -max_seq_len: 8192 -print_iter_log: true -disable_overlap_scheduler: true - -# Enable Speculative Decoding in the model engine -speculative_config: - decoding_type: Eagle - max_draft_len: 3 - speculative_model_dir: nvidia/Llama-4-Maverick-17B-128E-Eagle3 - eagle3_one_model: true - -kv_cache_config: - free_gpu_memory_fraction: 0.5 - enable_block_reuse: false - -cache_transceiver_config: - backend: DEFAULT diff --git a/recipes/llama4/trtllm/multimodal/agg.yaml b/recipes/llama4/trtllm/multimodal/agg.yaml deleted file mode 100644 index 754f8ce759..0000000000 --- a/recipes/llama4/trtllm/multimodal/agg.yaml +++ /dev/null @@ -1,33 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
-tensor_parallel_size: 8 -moe_expert_parallel_size: 1 -enable_attention_dp: false -max_num_tokens: 4096 -max_batch_size: 8 -trust_remote_code: true -backend: pytorch -enable_chunked_prefill: true - -kv_cache_config: - free_gpu_memory_fraction: 0.3 - enable_block_reuse: false - -cache_transceiver_config: - backend: DEFAULT -# NOTE: pytorch_backend_config section flattened since: https://github.com/NVIDIA/TensorRT-LLM/pull/4603 -# NOTE: overlap_scheduler enabled by default since this commit and changed -# config field from 'enable_overlap_scheduler' to 'disable_overlap_scheduler': -# https://github.com/NVIDIA/TensorRT-LLM/commit/b4e5df0ee0024eda3eeb83a6ba822245a30ab428 diff --git a/recipes/llama4/trtllm/multimodal/decode.yaml b/recipes/llama4/trtllm/multimodal/decode.yaml deleted file mode 100644 index 262a2be1cc..0000000000 --- a/recipes/llama4/trtllm/multimodal/decode.yaml +++ /dev/null @@ -1,29 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -tensor_parallel_size: 8 -moe_expert_parallel_size: 1 -enable_attention_dp: false -max_num_tokens: 8192 -max_batch_size: 16 -trust_remote_code: true -backend: pytorch -enable_chunked_prefill: true -disable_overlap_scheduler: false -kv_cache_config: - free_gpu_memory_fraction: 0.30 - enable_block_reuse: false - -cache_transceiver_config: - backend: DEFAULT \ No newline at end of file diff --git a/recipes/llama4/trtllm/multimodal/prefill.yaml b/recipes/llama4/trtllm/multimodal/prefill.yaml deleted file mode 100644 index 3d2c144015..0000000000 --- a/recipes/llama4/trtllm/multimodal/prefill.yaml +++ /dev/null @@ -1,31 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -tensor_parallel_size: 8 -moe_expert_parallel_size: 1 -enable_attention_dp: false -max_num_tokens: 8192 -max_batch_size: 16 -trust_remote_code: true -backend: pytorch -enable_chunked_prefill: true -# Overlap scheduler not currently supported in prefill only workers. 
-disable_overlap_scheduler: true
-
-kv_cache_config:
-  free_gpu_memory_fraction: 0.30
-  enable_block_reuse: false
-
-cache_transceiver_config:
-  backend: DEFAULT
\ No newline at end of file
diff --git a/recipes/qwen2-vl-7b-instruct/trtllm/agg.yaml b/recipes/qwen2-vl-7b-instruct/trtllm/agg.yaml
deleted file mode 100644
index 754f8ce759..0000000000
--- a/recipes/qwen2-vl-7b-instruct/trtllm/agg.yaml
+++ /dev/null
@@ -1,33 +0,0 @@
-# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-tensor_parallel_size: 8
-moe_expert_parallel_size: 1
-enable_attention_dp: false
-max_num_tokens: 4096
-max_batch_size: 8
-trust_remote_code: true
-backend: pytorch
-enable_chunked_prefill: true
-
-kv_cache_config:
-  free_gpu_memory_fraction: 0.3
-  enable_block_reuse: false
-
-cache_transceiver_config:
-  backend: DEFAULT
-# NOTE: pytorch_backend_config section flattened since: https://github.com/NVIDIA/TensorRT-LLM/pull/4603
-# NOTE: overlap_scheduler enabled by default since this commit and changed
-# config field from 'enable_overlap_scheduler' to 'disable_overlap_scheduler':
-# https://github.com/NVIDIA/TensorRT-LLM/commit/b4e5df0ee0024eda3eeb83a6ba822245a30ab428
diff --git a/recipes/qwen2-vl-7b-instruct/trtllm/decode.yaml b/recipes/qwen2-vl-7b-instruct/trtllm/decode.yaml
deleted file mode 100644
index 6dbd676ee4..0000000000
--- a/recipes/qwen2-vl-7b-instruct/trtllm/decode.yaml
+++ /dev/null
@@ -1,29 +0,0 @@
-# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-tensor_parallel_size: 1
-moe_expert_parallel_size: 1
-enable_attention_dp: false
-max_num_tokens: 8192
-max_batch_size: 16
-trust_remote_code: true
-backend: pytorch
-enable_chunked_prefill: true
-disable_overlap_scheduler: false
-kv_cache_config:
-  free_gpu_memory_fraction: 0.30
-  enable_block_reuse: false
-
-cache_transceiver_config:
-  backend: DEFAULT
\ No newline at end of file
diff --git a/recipes/qwen2-vl-7b-instruct/trtllm/encode.yaml b/recipes/qwen2-vl-7b-instruct/trtllm/encode.yaml
deleted file mode 100644
index 6f0c20990f..0000000000
--- a/recipes/qwen2-vl-7b-instruct/trtllm/encode.yaml
+++ /dev/null
@@ -1,30 +0,0 @@
-# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-tensor_parallel_size: 1
-moe_expert_parallel_size: 1
-enable_attention_dp: false
-max_num_tokens: 8192
-trust_remote_code: true
-backend: pytorch
-disable_overlap_scheduler: false
-
-cuda_graph_config:
-  max_batch_size: 16
-
-kv_cache_config:
-  free_gpu_memory_fraction: 0.85
-
-cache_transceiver_config:
-  backend: DEFAULT
diff --git a/recipes/qwen2-vl-7b-instruct/trtllm/prefill.yaml b/recipes/qwen2-vl-7b-instruct/trtllm/prefill.yaml
deleted file mode 100644
index 83a65e8bf3..0000000000
--- a/recipes/qwen2-vl-7b-instruct/trtllm/prefill.yaml
+++ /dev/null
@@ -1,31 +0,0 @@
-# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-tensor_parallel_size: 1
-moe_expert_parallel_size: 1
-enable_attention_dp: false
-max_num_tokens: 8192
-max_batch_size: 16
-trust_remote_code: true
-backend: pytorch
-enable_chunked_prefill: true
-# Overlap scheduler not currently supported in prefill only workers.
-disable_overlap_scheduler: true
-
-kv_cache_config:
-  free_gpu_memory_fraction: 0.30
-  enable_block_reuse: false
-
-cache_transceiver_config:
-  backend: DEFAULT
\ No newline at end of file
diff --git a/recipes/qwen3/trtllm/agg.yaml b/recipes/qwen3/trtllm/agg.yaml
deleted file mode 100644
index 53e0e6ce38..0000000000
--- a/recipes/qwen3/trtllm/agg.yaml
+++ /dev/null
@@ -1,34 +0,0 @@
-# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-tensor_parallel_size: 1
-moe_expert_parallel_size: 1
-enable_attention_dp: false
-max_num_tokens: 8192
-max_batch_size: 16
-trust_remote_code: true
-backend: pytorch
-enable_chunked_prefill: true
-
-kv_cache_config:
-  free_gpu_memory_fraction: 0.85
-
-# NOTE: pytorch_backend_config section flattened since: https://github.com/NVIDIA/TensorRT-LLM/pull/4603
-# NOTE: overlap_scheduler enabled by default since this commit and changed
-# config field from 'enable_overlap_scheduler' to 'disable_overlap_scheduler':
-# https://github.com/NVIDIA/TensorRT-LLM/commit/b4e5df0ee0024eda3eeb83a6ba822245a30ab428
-
-
-cuda_graph_config:
-  max_batch_size: 16
\ No newline at end of file
diff --git a/recipes/qwen3/trtllm/decode.yaml b/recipes/qwen3/trtllm/decode.yaml
deleted file mode 100644
index a0154bb6e3..0000000000
--- a/recipes/qwen3/trtllm/decode.yaml
+++ /dev/null
@@ -1,31 +0,0 @@
-# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-tensor_parallel_size: 1
-moe_expert_parallel_size: 1
-enable_attention_dp: false
-max_num_tokens: 8192
-trust_remote_code: true
-backend: pytorch
-enable_chunked_prefill: true
-disable_overlap_scheduler: false
-
-cuda_graph_config:
-  max_batch_size: 16
-
-kv_cache_config:
-  free_gpu_memory_fraction: 0.85
-
-cache_transceiver_config:
-  backend: DEFAULT
diff --git a/recipes/qwen3/trtllm/prefill.yaml b/recipes/qwen3/trtllm/prefill.yaml
deleted file mode 100644
index 4996c1fdc6..0000000000
--- a/recipes/qwen3/trtllm/prefill.yaml
+++ /dev/null
@@ -1,30 +0,0 @@
-# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-tensor_parallel_size: 1
-moe_expert_parallel_size: 1
-enable_attention_dp: false
-max_num_tokens: 8192
-trust_remote_code: true
-backend: pytorch
-enable_chunked_prefill: true
-# Overlap scheduler not currently supported in prefill only workers.
-disable_overlap_scheduler: true
-cuda_graph_config:
-  max_batch_size: 16
-kv_cache_config:
-  free_gpu_memory_fraction: 0.85
-
-cache_transceiver_config:
-  backend: DEFAULT
\ No newline at end of file
diff --git a/recipes/run.sh b/recipes/run.sh
deleted file mode 100755
index 980c9333b6..0000000000
--- a/recipes/run.sh
+++ /dev/null
@@ -1,261 +0,0 @@
-#!/usr/bin/env bash
-# SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-set -euo pipefail
-IFS=$'\n\t'
-
-RECIPES_DIR="$( cd "$( dirname "$0" )" && pwd )"
-# Default values
-NAMESPACE="${NAMESPACE:-dynamo}"
-DEPLOY_TYPE=""
-GAIE="${GAIE:-false}"
-DEPLOYMENT=""
-MODEL=""
-FRAMEWORK=""
-DRY_RUN=""
-
-# Frameworks - following container/build.sh pattern
-declare -A FRAMEWORKS=(["VLLM"]=1 ["TRTLLM"]=2 ["SGLANG"]=3)
-DEFAULT_FRAMEWORK=VLLM
-
-# Function to show usage
-usage() {
-    echo "Usage: $0 [OPTIONS] --model <model> --framework <framework> --deployment <deployment-type>"
-    echo ""
-    echo "Required Options:"
-    echo "  --model <model>            Model name (e.g., llama-3-70b)"
-    echo "  --framework <framework>    Framework, one of ${!FRAMEWORKS[*]} (default: ${DEFAULT_FRAMEWORK})"
-    echo "  --deployment <type>        Deployment type (e.g., agg, disagg, etc.; please refer to the README.md for available deployment types)"
-    echo ""
-    echo "Optional:"
-    echo "  --namespace <namespace>    Kubernetes namespace (default: dynamo)"
-    echo "  --dry-run                  Print commands without executing them"
-    echo "  --gaie[=true|false]        Enable GAIE integration subfolder (applies GAIE manifests, skips benchmark) (default: ${GAIE})"
-    echo "  -h, --help                 Show this help message"
-    echo ""
-    echo "Environment Variables:"
-    echo "  NAMESPACE                  Kubernetes namespace (default: dynamo)"
-    echo ""
-    echo "Examples:"
-    echo "  $0 --model llama-3-70b --framework vllm --deployment agg"
-    echo "  $0 --model llama-3-70b --framework trtllm --deployment disagg-single-node"
-    echo "  $0 --namespace my-ns --model llama-3-70b --framework vllm --deployment disagg-multi-node"
-    exit 1
-}
-
-missing_requirement() {
-    echo "ERROR: $1 requires an argument."
-    usage
-}
-
-error() {
-    printf '%s %s\n' "$1" "$2" >&2
-    exit 1
-}
-
-while [[ $# -gt 0 ]]; do
-    case $1 in
-    --dry-run)
-        DRY_RUN="echo"
-        shift
-        ;;
-    --model)
-        if [ "$2" ]; then
-            MODEL=$2
-            shift 2
-        else
-            missing_requirement "$1"
-        fi
-        ;;
-    --framework)
-        if [ "$2" ]; then
-            FRAMEWORK=$2
-            shift 2
-        else
-            missing_requirement "$1"
-        fi
-        ;;
-    --deployment)
-        if [ "$2" ]; then
-            DEPLOYMENT=$2
-            shift 2
-        else
-            missing_requirement "$1"
-        fi
-        ;;
-    --namespace)
-        if [ "$2" ]; then
-            NAMESPACE=$2
-            shift 2
-        else
-            missing_requirement "$1"
-        fi
-        ;;
-    --gaie)
-        GAIE=true
-        shift
-        ;;
-    --gaie=false)
-        GAIE=false
-        shift
-        ;;
-    --gaie=*)
-        GAIE="${1#*=}"
-        case "${GAIE,,}" in
-        true|false) GAIE="${GAIE,,}";;
-        *) echo "ERROR: --gaie must be true or false"; exit 1;;
-        esac
-        shift
-        ;;
-    -h|--help)
-        usage
-        ;;
-    -*)
-        error 'ERROR: Unknown option: ' "$1"
-        ;;
-    *)
-        error "ERROR: Unknown argument: " "$1"
-        ;;
-    esac
-done
-
-if [ -z "$FRAMEWORK" ]; then
-    FRAMEWORK=$DEFAULT_FRAMEWORK
-fi
-
-if [ -n "$FRAMEWORK" ]; then
-    FRAMEWORK=${FRAMEWORK^^}
-    if [[ -z "${FRAMEWORKS[$FRAMEWORK]}" ]]; then
-        error 'ERROR: Unknown framework: ' "$FRAMEWORK"
-    fi
-fi
-
-# Validate required arguments
-if [[ -z "$MODEL" ]] || [[ -z "$DEPLOYMENT" ]]; then
-    if [[ -z "$MODEL" ]]; then
-        echo "ERROR: --model argument is required"
-    fi
-    if [[ -z "$DEPLOYMENT" ]]; then
-        echo "ERROR: --deployment argument is required"
-    fi
-    echo ""
-    usage
-fi
-
-# Construct paths based on new structure: recipes/<model>/<framework>/<deployment>/
-MODEL_DIR="$RECIPES_DIR/$MODEL"
-FRAMEWORK_DIR="$MODEL_DIR/${FRAMEWORK,,}"
-DEPLOY_PATH="$FRAMEWORK_DIR/$DEPLOYMENT"
-INTEGRATION="$([[ "${GAIE,,}" == "true" ]] && echo gaie || echo "")"
-
-# Check if model directory exists
-if [[ ! -d "$MODEL_DIR" ]]; then
-    echo "Error: Model directory '$MODEL' does not exist in $RECIPES_DIR"
-    echo "Available models:"
-    ls -1 "$RECIPES_DIR" | grep -v "\.sh$\|\.md$\|model-cache$" | sed 's/^/ /'
-    exit 1
-fi
-
-# Check if framework directory exists
-if [[ ! -d "$FRAMEWORK_DIR" ]]; then
-    echo "Error: Framework directory '${FRAMEWORK,,}' does not exist in $MODEL_DIR"
-    echo "Available frameworks for $MODEL:"
-    ls -1 "$MODEL_DIR" | grep -v "\.sh$\|\.md$" | sed 's/^/ /'
-    exit 1
-fi
-
-# Check if deployment directory exists
-if [[ ! -d "$DEPLOY_PATH" ]]; then
-    echo "Error: Deployment type '$DEPLOYMENT' does not exist in $FRAMEWORK_DIR"
-    echo "Available deployment types for $MODEL/${FRAMEWORK,,}:"
-    ls -1 "$FRAMEWORK_DIR" | grep -v "\.sh$\|\.md$" | sed 's/^/ /'
-    exit 1
-fi
-
-# Check if deployment files exist
-DEPLOY_FILE="$DEPLOY_PATH/deploy.yaml"
-PERF_FILE="$DEPLOY_PATH/perf.yaml"
-
-if [[ ! -f "$DEPLOY_FILE" ]]; then
-    echo "Error: Deployment file '$DEPLOY_FILE' not found"
-    exit 1
-fi
-
-# Check if perf file exists (optional)
-PERF_AVAILABLE=false
-if [[ -f "$PERF_FILE" ]]; then
-    PERF_AVAILABLE=true
-    echo "Performance benchmark file found: $PERF_FILE"
-else
-    echo "Performance benchmark file not found: $PERF_FILE (skipping benchmarks)"
-fi
-
-# Show deployment information
-echo "======================================"
-echo "Dynamo Recipe Deployment"
-echo "======================================"
-echo "Model: $MODEL"
-echo "Framework: ${FRAMEWORK,,}"
-echo "Deployment Type: $DEPLOYMENT"
-echo "Namespace: $NAMESPACE"
-echo "GAIE integration: $GAIE"
-echo "======================================"
-
-# Handle model downloading
-MODEL_CACHE_DIR="$MODEL_DIR/model-cache"
-echo "Creating PVC for model cache and downloading model..."
-$DRY_RUN kubectl apply -n $NAMESPACE -f $MODEL_CACHE_DIR/model-cache.yaml
-$DRY_RUN kubectl apply -n $NAMESPACE -f $MODEL_CACHE_DIR/model-download.yaml
-
-# Wait for the model download to complete
-MODEL_DOWNLOAD_JOB_NAME=$(grep "name:" $MODEL_CACHE_DIR/model-download.yaml | head -1 | awk '{print $2}')
-echo "Waiting for job '$MODEL_DOWNLOAD_JOB_NAME' to complete..."
-$DRY_RUN kubectl wait --for=condition=Complete job/$MODEL_DOWNLOAD_JOB_NAME -n $NAMESPACE --timeout=6000s
-
-# Deploy the specified configuration
-echo "Deploying $MODEL ${FRAMEWORK,,} $DEPLOYMENT configuration..."
-$DRY_RUN kubectl apply -n $NAMESPACE -f $DEPLOY_FILE
-
-if [[ "$INTEGRATION" == "gaie" ]]; then
-    # run gaie checks.
-    SCRIPT_DIR="$(cd -- "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
-    "${SCRIPT_DIR}/gaie_checks.sh"
-    kubectl apply -f "$DEPLOY_PATH/gaie/k8s-manifests" -n "$NAMESPACE"
-    # For now do not run the benchmark
-    exit
-fi
-
-# Launch the benchmark job (if available)
-if [[ "$PERF_AVAILABLE" == "true" ]]; then
-    echo "Launching benchmark job..."
-    $DRY_RUN kubectl apply -n $NAMESPACE -f $PERF_FILE
-
-    # Construct job name from the perf file
-    JOB_NAME=$(grep "name:" $PERF_FILE | head -1 | awk '{print $2}')
-    echo "Waiting for job '$JOB_NAME' to complete..."
-    $DRY_RUN kubectl wait --for=condition=Complete job/$JOB_NAME -n $NAMESPACE --timeout=6000s
-
-    # Print logs from the benchmark job
-    echo "======================================"
-    echo "Benchmark completed. Logs:"
-    echo "======================================"
-    $DRY_RUN kubectl logs job/$JOB_NAME -n $NAMESPACE
-else
-    echo "======================================"
-    echo "Deployment completed successfully!"
-    echo "No performance benchmark available for this configuration."
-    echo "======================================"
-fi
\ No newline at end of file

From 4947aaf89f9ea3a834d68d51f4447ccfbbf66480 Mon Sep 17 00:00:00 2001
From: Ben Hamm
Date: Thu, 6 Nov 2025 09:11:44 -0800
Subject: [PATCH 2/4] Remove emojis from README section headings
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Address feedback to make the README look less AI-generated by removing
decorative emojis from section headings while keeping status indicators
(✅ ❌) in tables and content.

Signed-off-by: Ben Hamm

---
 recipes/README.md | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/recipes/README.md b/recipes/README.md
index abcac27198..6b751aba32 100644
--- a/recipes/README.md
+++ b/recipes/README.md
@@ -5,7 +5,7 @@ Production-tested Kubernetes deployment recipes for LLM inference using NVIDIA D
 > **Prerequisites:** This guide assumes you have already installed the Dynamo Kubernetes Platform. 
 > If not, follow the **[Kubernetes Deployment Guide](../docs/kubernetes/README.md)** first.
 
-## 📊 Available Recipes
+## Available Recipes
 
 | Model | Framework | Mode | GPUs | Deployment | Benchmark Recipe | Notes |
 |-------|-----------|------|------|------------|------------------|-------|
@@ -24,7 +24,7 @@
 - **Deployment**: ✅ = Complete `deploy.yaml` manifest available | ❌ = Missing or incomplete
 - **Benchmark Recipe**: ✅ = Includes `perf.yaml` for running AIPerf benchmarks | ❌ = No benchmark recipe provided
 
-## 📁 Recipe Structure
+## Recipe Structure
 
 Each complete recipe follows this standard structure:
 
@@ -40,7 +40,7 @@ Each complete recipe follows this standard structure:
 └── perf.yaml (optional) # AIPerf benchmark job
 ```
 
-## 🚀 Quick Start
+## Quick Start
 
 ### Prerequisites
 
@@ -147,7 +147,7 @@ kubectl logs -f job/<job-name> -n ${NAMESPACE}
 kubectl logs job/<job-name> -n ${NAMESPACE} | tail -50
 ```
 
-## 📖 Example Deployments
+## Example Deployments
 
 ### Llama-3-70B with vLLM (Aggregated)
 
@@ -173,7 +173,7 @@ kubectl port-forward svc/llama3-70b-agg-frontend 8000:8000 -n ${NAMESPACE}
 See [deepseek-r1/trtllm/disagg/wide_ep/gb200/deploy.yaml](deepseek-r1/trtllm/disagg/wide_ep/gb200/deploy.yaml) for the complete multi-node WideEP configuration.
 
-## 🛠️ Customization
+## Customization
 
 Each `deploy.yaml` contains:
 - **ConfigMap**: Engine-specific configuration (embedded in the manifest)
@@ -220,7 +220,7 @@ image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1
 # Update version tag as needed
 ```
 
-## 🔧 Troubleshooting
+## Troubleshooting
 
 ### Common Issues
 
@@ -243,7 +243,7 @@
 - [Kubernetes Deployment Guide](../docs/kubernetes/README.md#troubleshooting)
 - [Observability Documentation](../docs/kubernetes/observability/)
 
-## 📖 Related Documentation
+## Related Documentation
 
 - **[Kubernetes Deployment Guide](../docs/kubernetes/README.md)** - Platform installation and concepts
 - **[API Reference](../docs/kubernetes/api_reference.md)** - DynamoGraphDeployment CRD specification
@@ -253,7 +253,7 @@ image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1
 - **[Observability](../docs/kubernetes/observability/)** - Monitoring and logging
 - **[Benchmarking Guide](../docs/benchmarks/benchmarking.md)** - Performance testing
 
-## 🤝 Contributing
+## Contributing
 
 We welcome contributions of new recipes! See [CONTRIBUTING.md](CONTRIBUTING.md) for:
 - Recipe submission guidelines

From 30d7e2cb10e10de233bdca7d79729ad73029b85f Mon Sep 17 00:00:00 2001
From: Ben Hamm
Date: Thu, 6 Nov 2025 10:30:54 -0800
Subject: [PATCH 3/4] docs: add hyperlinks to recipe table for easier
 navigation

---
 recipes/README.md | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/recipes/README.md b/recipes/README.md
index 6b751aba32..fd96639244 100644
--- a/recipes/README.md
+++ b/recipes/README.md
@@ -9,16 +9,16 @@ Production-tested Kubernetes deployment recipes for LLM inference using NVIDIA D
 
 | Model | Framework | Mode | GPUs | Deployment | Benchmark Recipe | Notes |
 |-------|-----------|------|------|------------|------------------|-------|
-| **Llama-3-70B** | vLLM | Aggregated | 4x H100/H200 | ✅ | ✅ | FP8 dynamic quantization |
-| **Llama-3-70B** | vLLM | Disagg (Single-Node) | 8x H100/H200 | ✅ | ✅ | Prefill + Decode separation |
-| **Llama-3-70B** | vLLM | Disagg (Multi-Node) | 16x H100/H200 | ✅ | ✅ | 2 nodes, 8 GPUs each |
-| **Qwen3-32B-FP8** | TensorRT-LLM | Aggregated | 4x GPU | ✅ | ✅ | FP8 quantization |
-| **Qwen3-32B-FP8** | TensorRT-LLM | Disaggregated | 8x GPU | ✅ | ✅ | Prefill + Decode separation |
-| **GPT-OSS-120B** | TensorRT-LLM | Aggregated | 4x GB200 | ✅ | ✅ | Blackwell only, WideEP |
-| **GPT-OSS-120B** | TensorRT-LLM | Disaggregated | TBD | ❌ | ❌ | Engine configs only, no K8s manifest |
-| **DeepSeek-R1** | SGLang | Disagg WideEP | 8x H200 | ✅ | ❌ | Benchmark recipe pending |
-| **DeepSeek-R1** | SGLang | Disagg WideEP | 16x H200 | ✅ | ❌ | Benchmark recipe pending |
-| **DeepSeek-R1** | TensorRT-LLM | Disagg WideEP (GB200) | 32+4 GB200 | ✅ | ✅ | Multi-node: 8 decode + 1 prefill nodes |
+| **[Llama-3-70B](llama-3-70b/vllm/agg/)** | vLLM | Aggregated | 4x H100/H200 | ✅ | ✅ | FP8 dynamic quantization |
+| **[Llama-3-70B](llama-3-70b/vllm/disagg-single-node/)** | vLLM | Disagg (Single-Node) | 8x H100/H200 | ✅ | ✅ | Prefill + Decode separation |
+| **[Llama-3-70B](llama-3-70b/vllm/disagg-multi-node/)** | vLLM | Disagg (Multi-Node) | 16x H100/H200 | ✅ | ✅ | 2 nodes, 8 GPUs each |
+| **[Qwen3-32B-FP8](qwen3-32b-fp8/trtllm/agg/)** | TensorRT-LLM | Aggregated | 4x GPU | ✅ | ✅ | FP8 quantization |
+| **[Qwen3-32B-FP8](qwen3-32b-fp8/trtllm/disagg/)** | TensorRT-LLM | Disaggregated | 8x GPU | ✅ | ✅ | Prefill + Decode separation |
+| **[GPT-OSS-120B](gpt-oss-120b/trtllm/agg/)** | TensorRT-LLM | Aggregated | 4x GB200 | ✅ | ✅ | Blackwell only, WideEP |
+| **[GPT-OSS-120B](gpt-oss-120b/trtllm/disagg/)** | TensorRT-LLM | Disaggregated | TBD | ❌ | ❌ | Engine configs only, no K8s manifest |
+| **[DeepSeek-R1](deepseek-r1/sglang/disagg-8gpu/)** | SGLang | Disagg WideEP | 8x H200 | ✅ | ❌ | Benchmark recipe pending |
+| **[DeepSeek-R1](deepseek-r1/sglang/disagg-16gpu/)** | SGLang | Disagg WideEP | 16x H200 | ✅ | ❌ | Benchmark recipe pending |
+| **[DeepSeek-R1](deepseek-r1/trtllm/disagg/wide_ep/gb200/)** | TensorRT-LLM | Disagg WideEP (GB200) | 32+4 GB200 | ✅ | ✅ | Multi-node: 8 decode + 1 prefill nodes |
 
 **Legend:**
 - **Deployment**: ✅ = Complete `deploy.yaml` manifest available | ❌ = Missing or incomplete
 
From e0592e63693114d6f5cf6476c1e62f1324e87147 Mon Sep 17 00:00:00 2001
From: Ben Hamm
Date: Thu, 6 Nov 2025 10:34:06 -0800
Subject: [PATCH 4/4] docs: fix trailing whitespace in README

---
 recipes/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/recipes/README.md b/recipes/README.md
index fd96639244..af5d97ff44 100644
--- a/recipes/README.md
+++ b/recipes/README.md
@@ -2,7 +2,7 @@
 
 Production-tested Kubernetes deployment recipes for LLM inference using NVIDIA Dynamo.
 
-> **Prerequisites:** This guide assumes you have already installed the Dynamo Kubernetes Platform. 
+> **Prerequisites:** This guide assumes you have already installed the Dynamo Kubernetes Platform.
 > If not, follow the **[Kubernetes Deployment Guide](../docs/kubernetes/README.md)** first.
 
 ## Available Recipes