# TensorRT-LLM
We recommend using the latest stable release of Dynamo to avoid breaking changes.
Dynamo TensorRT-LLM integrates TensorRT-LLM engines into Dynamo's distributed runtime, enabling disaggregated serving, KV-aware routing, multi-node deployments, and request cancellation. It supports LLM inference, multimodal models, video diffusion, and advanced features like speculative decoding and attention data parallelism.
| Feature | TensorRT-LLM | Notes |
|---|---|---|
| Disaggregated Serving | ✅ | |
| Conditional Disaggregation | 🚧 | Not supported yet |
| KV-Aware Routing | ✅ | |
| SLA-Based Planner | ✅ | |
| Load Based Planner | 🚧 | Planned |
| KVBM | ✅ | |

| Feature | TensorRT-LLM | Notes |
|---|---|---|
| WideEP | ✅ | |
| DP Rank Routing | ✅ | |
| GB200 Support | ✅ | |
Step 1 (host terminal): Start infrastructure services:

```shell
docker compose -f deploy/docker-compose.yml up -d
```

Step 2 (host terminal): Pull and run the prebuilt container:

```shell
DYNAMO_VERSION=0.9.0
docker pull nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:$DYNAMO_VERSION
docker run --gpus all -it --network host --ipc host \
  nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:$DYNAMO_VERSION
```

Note: The `DYNAMO_VERSION` variable above can be set to any available version of the container. To find the available `tensorrtllm-runtime` versions for Dynamo, visit the NVIDIA NGC Catalog page for the Dynamo TensorRT-LLM Runtime.
Step 3 (inside the container): Launch an aggregated serving deployment (uses Qwen/Qwen3-0.6B by default):

```shell
cd $DYNAMO_HOME/examples/backends/trtllm
./launch/agg.sh
```

The launch script automatically downloads the model and starts the TensorRT-LLM engine. You can override the model by setting the `MODEL_PATH` and `SERVED_MODEL_NAME` environment variables before running the script.
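For example, the override might look like the following sketch. The specific model ID and served name below are placeholders, not defaults shipped with Dynamo; substitute your own:

```shell
#!/bin/sh
# Hypothetical model override before launching the aggregated deployment.
# MODEL_PATH: a Hugging Face model ID or local checkpoint path (placeholder value).
# SERVED_MODEL_NAME: the name clients pass in the request's "model" field.
export MODEL_PATH="Qwen/Qwen3-4B"
export SERVED_MODEL_NAME="qwen3-4b"
echo "Serving ${MODEL_PATH} as ${SERVED_MODEL_NAME}"
# ./launch/agg.sh   # run from $DYNAMO_HOME/examples/backends/trtllm
```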
Step 4 (host terminal): Verify the deployment:

```shell
curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "Explain why Roger Federer is considered one of the greatest tennis players of all time"}],
    "stream": true,
    "max_tokens": 30
  }'
```

You can deploy TensorRT-LLM with Dynamo on Kubernetes using a DynamoGraphDeployment. For more details, see the TensorRT-LLM Kubernetes Deployment Guide.
- Reference Guide: Features, configuration, and operational details
- Examples: All deployment patterns with launch scripts
- KV Cache Transfer: KV cache transfer methods for disaggregated serving
- Prometheus Metrics: Metrics and monitoring
- Multinode Examples: Multi-node deployment with SLURM
- Deploying TensorRT-LLM with Dynamo on Kubernetes: Kubernetes deployment guide