This directory contains examples and reference implementations for deploying Large Language Models (LLMs) in various configurations using TensorRT-LLM.
We recommend using the latest stable release of dynamo to avoid breaking changes:
You can find the latest release here and check out the corresponding branch with:

```bash
git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
```

- Feature Support Matrix
- Quick Start
- Single Node Examples
- Advanced Examples
- KV Cache Transfer
- Client
- Benchmarking
- Multimodal Support
- Logits Processing
- Performance Sweep
| Feature | TensorRT-LLM | Notes |
|---|---|---|
| Disaggregated Serving | ✅ | |
| Conditional Disaggregation | 🚧 | Not supported yet |
| KV-Aware Routing | ✅ | |
| SLA-Based Planner | ✅ | |
| Load Based Planner | 🚧 | Planned |
| KVBM | ✅ | |
| Feature | TensorRT-LLM | Notes |
|---|---|---|
| WideEP | ✅ | |
| DP Rank Routing | ✅ | |
| GB200 Support | ✅ | |
Below we provide a guide that lets you run all of the common deployment patterns on a single node.
For local/bare-metal development, start etcd and optionally NATS using Docker Compose:
```bash
docker compose -f deploy/docker-compose.yml up -d
```

Note

- etcd is optional but is the default local discovery backend. You can also use `--kv_store file` to use file-system-based discovery.
- NATS is optional - only needed if using KV routing with events (default). You can disable it with the `--no-kv-events` flag for prediction-based routing.
- On Kubernetes, neither is required when using the Dynamo operator, which explicitly sets `DYN_DISCOVERY_BACKEND=kubernetes` to enable native K8s service discovery (DynamoWorkerMetadata CRD).
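For example, a bare-metal run that avoids both etcd and NATS could combine these flags (a sketch based on the flags above; verify flag names and placement against `--help` for your release):

```bash
# File-system discovery instead of etcd; prediction-based routing instead of NATS KV events.
python3 -m dynamo.frontend --kv_store file --no-kv-events
```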
```bash
# TensorRT-LLM uses git-lfs, which needs to be installed in advance.
apt-get update && apt-get -y install git git-lfs

# On an x86 machine:
./container/build.sh --framework trtllm

# On an ARM machine:
./container/build.sh --framework trtllm --platform linux/arm64

# Build the container with the default experimental TensorRT-LLM commit
# WARNING: This is for experimental feature testing only.
# The container should not be used in a production environment.
./container/build.sh --framework trtllm --tensorrtllm-git-url https://github.com/NVIDIA/TensorRT-LLM.git --tensorrtllm-commit main
```

```bash
./container/run.sh --framework trtllm -it
```

Important
Below we provide some simple shell scripts that run the components for each configuration. Each shell script simply runs `python3 -m dynamo.frontend <args>` to start the ingress and `python3 -m dynamo.trtllm <args>` to start the workers. You can easily take each command and run it in a separate terminal.
For detailed information about the architecture and how KV-aware routing works, see the KV Cache Routing documentation.
```bash
cd $DYNAMO_HOME/examples/backends/trtllm
./launch/agg.sh
```

```bash
cd $DYNAMO_HOME/examples/backends/trtllm
./launch/agg_router.sh
```

```bash
cd $DYNAMO_HOME/examples/backends/trtllm
./launch/disagg.sh
```

Important
In the disaggregated workflow, requests are routed to the prefill worker to maximize KV cache reuse.
```bash
cd $DYNAMO_HOME/examples/backends/trtllm
./launch/disagg_router.sh
```

```bash
cd $DYNAMO_HOME/examples/backends/trtllm
export AGG_ENGINE_ARGS=./engine_configs/deepseek-r1/agg/mtp/mtp_agg.yaml
export SERVED_MODEL_NAME="nvidia/DeepSeek-R1-FP4"
# nvidia/DeepSeek-R1-FP4 is a large model
export MODEL_PATH="nvidia/DeepSeek-R1-FP4"
./launch/agg.sh
```

Notes:
- There is a noticeable latency for the first two inference requests. Please send warm-up requests before starting the benchmark.
- MTP performance may vary depending on the acceptance rate of predicted tokens, which depends on the dataset or queries used while benchmarking. Additionally, `ignore_eos` should generally be omitted or set to `false` when using MTP to avoid speculating garbage outputs and getting unrealistic acceptance rates.
Below we provide a selected list of advanced examples. Please open up an issue if you'd like to see a specific example!
For comprehensive instructions on multinode serving, see the multinode-examples.md guide. It provides step-by-step deployment examples and configuration tips for running Dynamo with TensorRT-LLM across multiple nodes. While the walkthrough uses DeepSeek-R1 as the model, you can easily adapt the process for any supported model by updating the relevant configuration files. See the Llama4+Eagle guide to learn how to use these scripts when a single worker fits on a single node.
For complete Kubernetes deployment instructions, configurations, and troubleshooting, see TensorRT-LLM Kubernetes Deployment Guide.
See the client section to learn how to send a request to the deployment.
NOTE: To send a request to a multi-node deployment, target the node which is running `python3 -m dynamo.frontend <args>`.
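For a quick smoke test, you can send an OpenAI-compatible chat completion request to the frontend directly. This sketch assumes the default frontend port 8000 and a served model name of `Qwen/Qwen3-0.6B`; substitute your own host, port, and model name:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 32
  }'
```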
To benchmark your deployment with AIPerf, see this utility script, configuring the
model name and host based on your deployment: perf.sh
Dynamo with TensorRT-LLM supports two methods for transferring KV cache in disaggregated serving: UCX (default) and NIXL (experimental). For detailed information and configuration instructions for each method, see the KV cache transfer guide.
You can enable request migration to handle worker failures gracefully. Use the `--migration-limit` flag to specify how many times a request can be migrated to another worker:

```bash
# For decode and aggregated workers
python3 -m dynamo.trtllm ... --migration-limit=3
```

Important
Prefill workers do not support request migration and must use `--migration-limit=0` (the default). Prefill workers only process prompts and return KV cache state - they don't maintain long-running generation requests that would benefit from migration.
See the Request Migration Architecture documentation for details on how this works.
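To build intuition, the migration limit can be thought of as a bounded retry budget: a failed request may be handed off to another worker at most `--migration-limit` times. A minimal pure-Python sketch of that idea (illustrative only; this is not Dynamo's actual implementation, and `worker` here is just a callable):

```python
def run_with_migration(request, workers, migration_limit=3):
    """Try each worker in turn; every hand-off after a failure counts as one migration."""
    migrations = 0
    for worker in workers:
        try:
            return worker(request)
        except ConnectionError:
            # Out of migration budget: surface the failure to the client.
            if migrations >= migration_limit:
                raise
            migrations += 1
    raise RuntimeError("no workers left to migrate to")


def flaky_worker(request):
    raise ConnectionError("worker died mid-request")


def healthy_worker(request):
    return f"response to {request!r}"


# One failure with a budget of 1 migration: the request survives.
print(run_with_migration("hi", [flaky_worker, healthy_worker], migration_limit=1))
```

With `migration_limit=0` (the prefill default), the first failure propagates to the caller immediately instead of being retried.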
When a user cancels a request (e.g., by disconnecting from the frontend), the request is automatically cancelled across all workers, freeing compute resources for other requests.
| | Prefill | Decode |
|---|---|---|
| Aggregated | ✅ | ✅ |
| Disaggregated | ✅ | ✅ |
For more details, see the Request Cancellation Architecture documentation.
Dynamo with the TensorRT-LLM backend supports multimodal models, enabling you to process both text and images (or pre-computed embeddings) in a single request. For detailed setup instructions, example requests, and best practices, see the TensorRT-LLM Multimodal Guide.
Logits processors let you modify the next-token logits at every decoding step (e.g., to apply custom constraints or sampling transforms). Dynamo provides a backend-agnostic interface and an adapter for TensorRT-LLM so you can plug in custom processors.
- Interface: Implement `dynamo.logits_processing.BaseLogitsProcessor`, which defines `__call__(input_ids, logits)` and modifies `logits` in-place.
- TRT-LLM adapter: Use `dynamo.trtllm.logits_processing.adapter.create_trtllm_adapters(...)` to convert Dynamo processors into TRT-LLM-compatible processors and assign them to `SamplingParams.logits_processor`.
- Examples: See example processors in `lib/bindings/python/src/dynamo/logits_processing/examples/` (temperature, hello_world).
You can enable a test-only processor that forces the model to respond with "Hello world!". This is useful to verify the wiring without modifying your model or engine code.
```bash
cd $DYNAMO_HOME/examples/backends/trtllm
export DYNAMO_ENABLE_TEST_LOGITS_PROCESSOR=1
./launch/agg.sh
```

Notes:
- When enabled, Dynamo initializes the tokenizer so the HelloWorld processor can map text to token IDs.
- Expected chat response contains "Hello world".
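Conceptually, such a processor works by masking the logits so that only the next token of a fixed sequence remains sampleable. Here is a standalone pure-Python sketch of that masking idea (illustrative only; the real processor operates on torch tensors and uses the tokenizer to map "Hello world!" to token IDs):

```python
NEG_INF = float("-inf")


class ForceSequenceProcessor:
    """Force generation to follow `forced_token_ids` by masking all other logits."""

    def __init__(self, forced_token_ids):
        self.forced = forced_token_ids
        self.step = 0

    def __call__(self, input_ids, logits):
        if self.step >= len(self.forced):
            return  # forced sequence exhausted; leave logits untouched
        target = self.forced[self.step]
        for token_id in range(len(logits)):
            # In-place update: only the forced token keeps a finite logit.
            logits[token_id] = 0.0 if token_id == target else NEG_INF
        self.step += 1


proc = ForceSequenceProcessor([2, 0])  # pretend these are the token IDs we want
logits = [0.3, 1.2, 0.5, 0.9]
proc([], logits)
print(logits.index(max(logits)))  # → 2 (the forced token wins greedy decoding)
```

The key point mirrored from the real interface is that the processor mutates `logits` in-place rather than returning a new object.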
Implement a processor by conforming to `BaseLogitsProcessor` and modify logits in-place. For example, temperature scaling:

```python
from typing import Sequence

import torch

from dynamo.logits_processing import BaseLogitsProcessor


class TemperatureProcessor(BaseLogitsProcessor):
    def __init__(self, temperature: float = 1.0):
        if temperature <= 0:
            raise ValueError("Temperature must be positive")
        self.temperature = temperature

    def __call__(self, input_ids: Sequence[int], logits: torch.Tensor):
        if self.temperature == 1.0:
            return
        logits.div_(self.temperature)
```

Wire it into TRT-LLM by adapting and attaching to `SamplingParams`:
```python
from dynamo.trtllm.logits_processing.adapter import create_trtllm_adapters
from dynamo.logits_processing.examples import TemperatureProcessor

processors = [TemperatureProcessor(temperature=0.7)]
sampling_params.logits_processor = create_trtllm_adapters(processors)
```

- Per-request processing only (batch size must be 1); beam width > 1 is not supported.
- Processors must modify logits in-place and not return a new tensor.
- If your processor needs tokenization, ensure the tokenizer is initialized (do not skip tokenizer init).
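As a sanity check on what temperature scaling actually does, here is a standalone pure-Python illustration (no Dynamo or torch dependency) of how dividing logits by a temperature below 1.0 sharpens the next-token distribution:

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]
p_default = softmax(logits)                    # temperature = 1.0
p_sharp = softmax([x / 0.7 for x in logits])   # temperature = 0.7

# Lower temperature concentrates probability mass on the top token.
print(round(p_default[0], 3), round(p_sharp[0], 3))
```

Temperatures above 1.0 have the opposite effect, flattening the distribution toward uniform.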
For detailed instructions on running comprehensive performance sweeps across both aggregated and disaggregated serving configurations, see the TensorRT-LLM Benchmark Scripts for DeepSeek R1 model. This guide covers recommended benchmarking setups, usage of provided scripts, and best practices for evaluating system performance.
Dynamo with TensorRT-LLM currently supports integration with the Dynamo KV Block Manager. This integration can significantly reduce time-to-first-token (TTFT) latency, particularly in usage patterns such as multi-turn conversations and repeated long-context requests.
For instructions, see Running KVBM in TensorRT-LLM.