Examples

For quick start instructions, see the TensorRT-LLM README. This document covers all deployment patterns for running TensorRT-LLM with Dynamo, including single-node, multi-node, and Kubernetes deployments.


Infrastructure Setup

For local/bare-metal development, start etcd and optionally NATS using Docker Compose:

docker compose -f deploy/docker-compose.yml up -d
- **etcd** is optional but is the default local discovery backend. You can also pass `--discovery-backend file` to use file-system-based discovery.
- **NATS** is optional and only needed when using KV routing with events. Workers must be explicitly configured to publish events; use `--no-router-kv-events` on the frontend for prediction-based routing without events.
- **On Kubernetes**, neither is required when using the Dynamo operator, which explicitly sets `DYN_DISCOVERY_BACKEND=kubernetes` to enable native K8s service discovery (DynamoWorkerMetadata CRD).

Each launch script runs the frontend and worker(s) in a single terminal; for testing, you can run each command separately in different terminals. Each shell script simply runs `python3 -m dynamo.frontend` to start the ingress and `python3 -m dynamo.trtllm` to start the workers.
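Run separately, the two halves of a launch script look roughly like the sketch below. The worker flags and model name here are illustrative assumptions; consult the actual `launch/*.sh` scripts for the exact arguments used by each deployment pattern.

```shell
# Terminal 1: start the OpenAI-compatible HTTP ingress
python3 -m dynamo.frontend

# Terminal 2: start a TensorRT-LLM worker
# (model path and flag names are assumptions -- see launch/*.sh for the real arguments)
python3 -m dynamo.trtllm --model-path "$MODEL_PATH"
```

Keeping the two processes in separate terminals makes it easier to read each component's logs while testing.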

For detailed information about the architecture and how KV-aware routing works, see the Router Guide.

Single Node Examples

Aggregated

cd $DYNAMO_HOME/examples/backends/trtllm
./launch/agg.sh

Aggregated with KV Routing

cd $DYNAMO_HOME/examples/backends/trtllm
./launch/agg_router.sh

Disaggregated

cd $DYNAMO_HOME/examples/backends/trtllm
./launch/disagg.sh

Disaggregated with KV Routing

In the disaggregated workflow, requests are routed to the prefill worker so as to maximize KV cache reuse.

cd $DYNAMO_HOME/examples/backends/trtllm
./launch/disagg_router.sh

Aggregated with Multi-Token Prediction (MTP) and DeepSeek R1

cd $DYNAMO_HOME/examples/backends/trtllm

export AGG_ENGINE_ARGS=./engine_configs/deepseek-r1/agg/mtp/mtp_agg.yaml
export SERVED_MODEL_NAME="nvidia/DeepSeek-R1-FP4"
# nvidia/DeepSeek-R1-FP4 is a large model
export MODEL_PATH="nvidia/DeepSeek-R1-FP4"
./launch/agg.sh
- There is noticeable latency on the first two inference requests. Send warm-up requests before starting a benchmark.
- MTP performance may vary with the acceptance rate of predicted tokens, which depends on the dataset or queries used while benchmarking. Additionally, `ignore_eos` should generally be omitted or set to `false` when using MTP, to avoid speculating on garbage outputs and measuring unrealistic acceptance rates.
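A warm-up pass before benchmarking can be as simple as the sketch below. The endpoint URL and port are assumptions (an OpenAI-compatible frontend on `localhost:8000`); adjust them to match your deployment, and reuse the `SERVED_MODEL_NAME` exported above.

```shell
# Send two short warm-up requests before benchmarking an MTP deployment.
# URL and port are assumptions -- point this at your actual frontend node.
URL="http://localhost:8000/v1/chat/completions"
for i in 1 2; do
  curl -s "$URL" \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"$SERVED_MODEL_NAME\", \"messages\": [{\"role\": \"user\", \"content\": \"warm-up $i\"}], \"max_tokens\": 8}" \
    > /dev/null
done
```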

Advanced Examples

Multinode Deployment

For comprehensive instructions on multinode serving, see the Multinode Examples guide. It provides step-by-step deployment examples and configuration tips for running Dynamo with TensorRT-LLM across multiple nodes. While the walkthrough uses DeepSeek-R1 as the model, you can adapt the process to any supported model by updating the relevant configuration files. See the Llama4 + Eagle guide for how to use these scripts when a single worker fits on a single node.

Speculative Decoding

Model-Specific Guides

Kubernetes Deployment

For complete Kubernetes deployment instructions, configurations, and troubleshooting, see the TensorRT-LLM Kubernetes Deployment Guide.

Performance Sweep

For detailed instructions on running comprehensive performance sweeps across both aggregated and disaggregated serving configurations, see the TensorRT-LLM Benchmark Scripts for DeepSeek R1 model.

Client

See the client section to learn how to send requests to the deployment.

To send a request to a multi-node deployment, target the node running `python3 -m dynamo.frontend`.
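As a minimal sketch, you can compose and validate an OpenAI-style request body locally before sending it to the frontend. The model name, host, and port below are assumptions; match the model to your `SERVED_MODEL_NAME` and the host to the node running the frontend.

```shell
# Compose a minimal chat completion request body
# (model name is an assumption -- use your deployment's served model name)
cat > /tmp/request.json <<'EOF'
{
  "model": "nvidia/DeepSeek-R1-FP4",
  "messages": [{"role": "user", "content": "Hello!"}],
  "max_tokens": 64,
  "stream": false
}
EOF

# Sanity-check that the body is well-formed JSON before issuing the request
python3 -m json.tool /tmp/request.json > /dev/null && echo "payload ok"

# Send it to the frontend node (host and port are assumptions):
# curl -s http://localhost:8000/v1/chat/completions \
#   -H "Content-Type: application/json" \
#   -d @/tmp/request.json
```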

Benchmarking

To benchmark your deployment with AIPerf, see the perf.sh utility script, configuring the model name and host to match your deployment.