# TensorRT-LLM

> [!Note]
> **Use the Latest Release**: We recommend using the latest stable release of Dynamo to avoid breaking changes.


Dynamo TensorRT-LLM integrates TensorRT-LLM engines into Dynamo's distributed runtime, enabling disaggregated serving, KV-aware routing, multi-node deployments, and request cancellation. It supports LLM inference, multimodal models, video diffusion, and advanced features like speculative decoding and attention data parallelism.

## Feature Support Matrix

### Core Dynamo Features

| Feature                    | TensorRT-LLM | Notes             |
| -------------------------- | ------------ | ----------------- |
| Disaggregated Serving      | ✅           |                   |
| Conditional Disaggregation | 🚧           | Not supported yet |
| KV-Aware Routing           | ✅           |                   |
| SLA-Based Planner          | ✅           |                   |
| Load Based Planner         | 🚧           | Planned           |
| KVBM                       | ✅           |                   |

### Large Scale P/D and WideEP Features

| Feature         | TensorRT-LLM | Notes |
| --------------- | ------------ | ----- |
| WideEP          | ✅           |       |
| DP Rank Routing | ✅           |       |
| GB200 Support   | ✅           |       |

## Quick Start

**Step 1 (host terminal):** Start the infrastructure services:

```bash
docker compose -f deploy/docker-compose.yml up -d
```

**Step 2 (host terminal):** Pull and run the prebuilt container:

```bash
DYNAMO_VERSION=0.9.0
docker pull nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:$DYNAMO_VERSION
docker run --gpus all -it --network host --ipc host \
  nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:$DYNAMO_VERSION
```

> [!Note]
> The `DYNAMO_VERSION` variable above can be set to any available version of the container. To find the available `tensorrtllm-runtime` versions for Dynamo, visit the NVIDIA NGC Catalog for Dynamo TensorRT-LLM Runtime.

**Step 3 (inside the container):** Launch an aggregated serving deployment (uses `Qwen/Qwen3-0.6B` by default):

```bash
cd $DYNAMO_HOME/examples/backends/trtllm
./launch/agg.sh
```

The launch script automatically downloads the model and starts the TensorRT-LLM engine. To serve a different model, set the `MODEL_PATH` and `SERVED_MODEL_NAME` environment variables before running the script.
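The override described above can be sketched as follows (a minimal example; `Qwen/Qwen3-8B` and the served name are hypothetical choices, and the exact variables a given release's script reads may differ):

```shell
# Assumption (per the note above): the launch script reads these two variables.
export MODEL_PATH="Qwen/Qwen3-8B"      # model to download and load (hypothetical choice)
export SERVED_MODEL_NAME="qwen3-8b"    # name clients reference in API requests

# Then start the deployment as before:
# ./launch/agg.sh
```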

**Step 4 (host terminal):** Verify the deployment:

```bash
curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "Explain why Roger Federer is considered one of the greatest tennis players of all time"}],
    "stream": true,
    "max_tokens": 30
  }'
```
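For scripted checks, the same verification request can be built with only the Python standard library (a sketch; it assumes the deployment from Step 3 is listening on `localhost:8000`):

```python
import json
import urllib.request

# Same request body as the curl command above.
payload = {
    "model": "Qwen/Qwen3-0.6B",
    "messages": [
        {
            "role": "user",
            "content": "Explain why Roger Federer is considered one of the "
            "greatest tennis players of all time",
        }
    ],
    "stream": True,
    "max_tokens": 30,
}

request = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

def stream_completion(req: urllib.request.Request) -> None:
    """Print each server-sent-event chunk of the streamed response."""
    with urllib.request.urlopen(req) as resp:
        for line in resp:
            print(line.decode("utf-8").rstrip())

# Run against a live deployment:
# stream_completion(request)
```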

## Kubernetes Deployment

You can deploy TensorRT-LLM with Dynamo on Kubernetes using a `DynamoGraphDeployment` resource. For more details, see the TensorRT-LLM Kubernetes Deployment Guide.

## Next Steps