Skip to content

Latest commit

 

History

History
474 lines (377 loc) · 17.6 KB

File metadata and controls

474 lines (377 loc) · 17.6 KB

Dynamo Deployment with Aiconfigurator Guide

This guide walks through

  • installing aiconfigurator
  • building the Dynamo container
  • generating configuration files(currently, we only support trtllm)
  • deploying Dynamo (single-node and two-node)
  • benchmarking the service and comparison

Take qwen3-32b-fp8 model as an example.

Currently auto configuration / script generation only support trtllm backend

All-in-one Automation process

we're now supporting automate everything in one script, starting from configuring the deployment, generating the configs, preparing docker image and container, pulling model checkpoints, deploying the service, benchmarking and summarizing. Refer to Automation for more details.

Step-by-step Manual Deployment and Performance Alignment

Methodology

First, we need to define the problem we want to solve clearly. For a given model, we need to understand what's the constraint, we use ISL, OSL to define the target seqeunce length, and TTFT, TPOT to set the SLA constraint. Let's define the problem:

Can we find a config {parellel, concurrency} for given {ISL, OSL, Model, GPU}, which maximizes tokens/s/gpu, under TPOT and TTFT 
constraints.

Let's take a look at the pareto frontier,

  Pareto Frontier:
              QWEN3_32B Pareto Frontier: tokens/s/gpu vs tokens/s/user          
    ┌──────────────────────────────────────────────────────────────────────────┐
2250┤ •• disagg                                                                │
    │ ff agg                                                                   │
    │ xx disagg best                                                           │
    │                                                                          │
1875┤  ff                                                                      │
    │   fff                                                                    │
    │     ff                                                                   │
    │      fff••                                                               │
1500┤         f •••                                                            │
    │         ff   ••••••••                                                    │
    │          ffff       •                                                    │
    │              f       •••••••                                             │
1125┤               ff            •                                            │
    │                ff            ••••                                        │
    │                  ffff            ••••x                                   │
    │                     fff              ••••                                │
 750┤                        fff               •                               │
    │                          ffffff           •                              │
    │                                ffffff      ••                            │
    │                                      fffffff ••••••                      │
 375┤                                             ff    •                      │
    │                                               fffffff•••••••••           │
    │                                                      ffffffffff          │
    │                                                                          │
   0┤                                                                          │
    └┬─────────────────┬──────────────────┬─────────────────┬─────────────────┬┘
     0                60                 120               180              240 
tokens/s/gpu                        tokens/s/user                               

Here the TPOT_limit=10ms. All datapoints shown on the pareto frontier meet TTFT_limit=1000ms
Each point on the pareto frontier can represent a different config {parallel, concurrency}.
The pareto frontier means, no matter how you change your deployment parallel strategy and benchmark with different concurrency, the datapoint will be under the frontier.
What we need is the highest point on the frontier which is left to 1000ms/TPOT_limit = 100 tokens/s/user. The point tagged as x is the one we find. This point indicates the parellel strategy as well as the concurrency level
We can find that, the config in this parallel strategy is potentially only best for this given concurrency instead of being generally better.
Thus we need corresponding benchmark way to make it work. Set concurrency sweep from 1 to target_concurrency predicted by aiconfigurator. E.g., [1 2 4 8 ... target_concurrency] Compare the result at target_concurrency with TTFT, TPOT, tokens/s/gpu and previous baseline you have

In order to reduce the impact of first batch of requests, we use concurrency * 10 as num_requests In order to aovid undefined cache hit rate when benchmarking with random data, we delebrately disable cache reuse to make it fair.

Step-by-step Manual Deployment

Problem:
16 H200 in total. QWen3 32B FP8.
ISL=4000, OSL=512, TTFT=300ms, TPOT=10ms, optimize tokens/s/gpu

If you would like to deploy by your own, when running the aiconfigurator cli exp|default, engine configuration files and executable scripts are automatically generated under the --save_dir, in the topx folder. The directory structure is:

results/QWEN3_32B_isl4000_osl1000_ttft1000_tpot20_904495
├── agg
│   ├── best_config_topn.csv
│   ├── config.yaml
│   ├── pareto.csv
│   ├── top1
│   │   ├── agg
│   │   │   ├── agg_config.yaml
│   │   │   ├── k8s_deploy.yaml
│   │   │   └── node_0_run.sh 
│   │   └── generator_config.yaml
│   ...
├── disagg
│   ├── best_config_topn.csv
│   ├── config.yaml
│   ├── pareto.csv
│   ├── top1
│   │   ├── disagg
│   │   │   ├── decode_config.yaml
│   │   │   ├── k8s_deploy.yaml
│   │   │   ├── node_0_run.sh
│   │   │   └── prefill_config.yaml
│   │   └── generator_config.yaml
│   ...
└── pareto_frontier.png

Here, agg_config.yaml, prefill_config.yaml, and decode_config.yaml are TRTLLM engine configuration files, and node_x_run.sh are the executable scripts. k8s_deploy.yaml is for deployment in k8s. In this guide, we're not using k8s.

For multi-node setups, there will be multiple node_x_run.sh scripts (one per node), each invoking the same TRTLLM engine config file. By default, node_0_run.sh starts both the frontend service and the workers, assuming ETCD and NATS are already running on node0, while other nodes only start the workers. Therefore, in multi-node deployments, please specify --head_node_ip to indicate the IP address of node0.

FIXME, info below is all outdated. Need to be updated

Typically, the command is:

aiconfigurator cli default \
  --system h200_sxm \
  --model QWEN3_32B \
  --isl 5000 \
  --osl 1000 \
  --ttft 2000 \
  --tpot 50 \
  --save_dir results \
  --total_gpus 16 \
  --model_path /workspace/model_hub/Qwen3-32B-FP8 \
  --served_model_name Qwen3-32B-FP8 \
  --head_node_ip x.x.x.x

Since TRTLLM has many custom parameters, run:

aiconfigurator cli --help

to see all supported options. We haven’t listed all trtllm configurations here. Feel free to modify the generated configuration file to include any additional ones (trtllm args).

To customize parameters per worker type, prefix them with prefill_, decode_, or agg_. For example:

aiconfigurator cli default \
  --system h200_sxm \
  --model QWEN3_32B \
  --isl 5000 \
  --osl 1000 \
  --ttft 2000 \
  --tpot 50 \
  --save_dir results \
  --total_gpus 16 \
  --model_path /workspace/model_hub/Qwen3-32B-FP8 \
  --served_model_name Qwen3-32B-FP8 \
  --head_node_ip x.x.x.x \
  --free_gpu_memory_fraction 0.9 \        # default for all worker types
  --prefill_free_gpu_memory_fraction 0.8 \  # for prefill workers
  --decode_free_gpu_memory_fraction 0.5     # for decode workers

At runtime, copy the generated backend_configs directory to each node and execute the corresponding script:

# On node0
bash node_0_run.sh

# On other nodes
bash node_x_run.sh

Note: The generated configs are for deploying 1 replica instead of the cluster (defined as total_gpus). We'll bridge this gap in future.


Prerequisites

  • Docker with GPU support

1. Environment Setup

1.1 Install aiconfigurator

Use a minimal Ubuntu base image with python installed.

# Install Git LFS
apt-get update && apt-get install -y git-lfs

# Clone the repo
git clone https://github.com/ai-dynamo/aiconfigurator.git
cd aiconfigurator

# Install build tools and aiconfigurator
pip3 install "."

1.2 Build the Dynamo Container

In this example, we're using Dynamo 0.5.0, please switch to release/0.5.0 first.

# other version of trtllm can be used as well
# currently dynamo is at version 0.4.0, indicated in the tag
./container/build.sh \
  --framework TRTLLM \
  --tensorrtllm-pip-wheel tensorrt-llm==1.0.0rc6 \
  --tag dynamo:0.4.0-trtllm-1.0.0rc6

Please refer to Dynamo Getting Started for detailed dynamo installation

1.3 Download model checkpoint

HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download Qwen/Qwen3-32B-FP8 --local-dir /raid/hub/qwen3-32b-fp8

Please modify based on your own path '/raid/hub/qwen3-32b-fp8'


2. Running etcd and NATS

On Node 0, start etcd and NATS.io:

docker compose -f deploy/docker-compose.yml up -d

3. Single-Node Deployment

3.1 Generate Configuration with aiconfigurator

aiconfigurator cli default \
  --system h200_sxm \
  --isl 5000 \
  --osl 1000 \
  --ttft 1000 \
  --tpot 10 \
  --save_dir ./ \
  --model QWEN3_32B \
  --model_path /workspace/model_hub/qwen3-32b-fp8 \
  --served_model_name Qwen/Qwen3-32B-FP8 \
  --total_gpus 8 \
  --head_node_ip 0.0.0.0 \
  --generated_config_version 1.0.0rc4 \
  --prefill_free_gpu_memory_fraction 0.9 \
  --free_gpu_memory_fraction 0.7 \
  --decode_free_gpu_memory_fraction 0.5

We use 1.0.0rc3 (our latest data) for aiconfigurator and we can support generate configurations for running with trtllm 1.0.0rc4 worker.
--model is for aiconfigurator and --served_model_name is for dynamo deployment

For other supported configurations, please run aiconfigurator cli --help.

3.2 Verify Generated Configuration

engine configuration files and executable scripts are automatically generated under the --save_dir, in the backend_configs folder. The directory structure is:

backend_configs/
├── agg/
│   ├── agg_config.yaml
│   └── node_0_run.sh
└── disagg/
│   ├── decode_config.yaml
│   ├── prefill_config.yaml
│   ├── node_0_run.sh
│   ├── node_1_run.sh
│   └── ...
└──

3.3 Launch the Dynamo Container

cd ..
docker run --gpus all --net=host --ipc=host \
  -v $(pwd):/workspace/mount_dir \
  -v /raid/hub:/workspace/model_hub/ \
  --rm -it dynamo:0.4.0-trtllm-1.0.0rc4

3.4 Deploy the service

Inside the container:

cd /workspace/mount_dir/dynamo/components/backends/trtllm

# Copy generated configs from save_dir
cp -r ${your_save_dir}/QWEN3_32B_isl5000_osl1000_ttft1000_tpot50_*/backend_configs/* ./

# Launch dynamo
bash disagg/node_0_run.sh

Tip: If you see a Triton version mismatch error, reinstall Triton:

pip uninstall -y triton
pip install triton==3.3.1

3.5 Test the Service

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Accept: text/event-stream" \
  -d '{
    "model": "Qwen/Qwen3-32B-FP8",
    "messages": [
      { "role": "user", "content": "Introduce yourself" }
    ],
    "stream": true
  }'

3.6 Benchmark


4. Two-Node Deployment

4.1 Generate Configuration for Two Nodes

# For head_node_ip, ensure that the IP passed here corresponds to node 0, etcd and NATS.io have already been started on node 0 in Step 2
aiconfigurator cli default \
  --system h200_sxm \
  --isl 5000 \
  --osl 1000 \
  --ttft 200 \
  --tpot 8 \
  --save_dir ./ \
  --model QWEN3_32B \
  --model_path /workspace/model_hub/qwen3-32b-fp8 \
  --served_model_name Qwen/Qwen3-32B-FP8 \
  --total_gpus 16 \
  --head_node_ip NODE_0_IP \
  --generated_config_version 1.0.0rc4 \
  --prefill_free_gpu_memory_fraction 0.8 \
  --free_gpu_memory_fraction 0.7 \
  --decode_free_gpu_memory_fraction 0.5

Note that even if --total_gpus 16, the optimal configuration generated by aiconfigurator may not require 16 GPUs. If only 8 GPUs are needed, it may produce just a node_0_run.sh, which can then be executed on each node.

Refer to the single node example to run the container on both node 0 and node 1.

4.2 Deploy on Node 0

Inside the container:

cd /workspace/mount_dir/dynamo/components/backends/trtllm
cp -r QWEN3_32B_isl5000_osl1000_ttft200_tpot8_*/backend_configs/* ./
bash disagg/node_0_run.sh

4.3 Deploy on Node 1

Inside the container:

cd /workspace/mount_dir/dynamo/components/backends/trtllm
cp -r QWEN3_32B_isl5000_osl1000_ttft200_tpot8_*/backend_configs/* ./
bash disagg/node_1_run.sh

4.4 Test the Service

curl http://NODE_0_IP:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Accept: text/event-stream" \
  -d '{
    "model": "Qwen/Qwen3-32B-FP8",
    "messages": [
      { "role": "user", "content": "Introduce yourself" }
    ],
    "stream": true
  }'

5. Deploy on Kubernetes

The generator can also emit a Kubernetes CR (k8s_deploy.yaml) for the K8S deployment.

For deploying Dynamo on Kubernetes, please refer to this dynamo/deploy and make sure to install the CRDs and platform first.

5.1 Generate Configuration for K8S

This produces disagg/k8s_deploy.yaml (and for Agg, agg/k8s_deploy.yaml) under --save_dir.

# Example (Disagg)
aiconfigurator cli default \
  --system h200_sxm \
  --isl 5000 \
  --osl 1000 \
  --ttft 200 \
  --tpot 8 \
  --save_dir ./ \
  --model QWEN3_32B \
  --model_path Qwen/Qwen3-32B-FP8 \
  --served_model_name Qwen/Qwen3-32B-FP8 \
  --total_gpus 8 \
  --generated_config_version 1.0.0rc6 \
  --prefill_free_gpu_memory_fraction 0.8 \
  --cache_transceiver_backend default \
  --free_gpu_memory_fraction 0.7 \
  --decode_free_gpu_memory_fraction 0.5 \
  --k8s_image nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.5.0 \
  --k8s_engine_mode inline \
  --k8s_model_cache pvc:model-cache \
  --k8s_namespace dynamo-custom-ns \

Since different versions of TensorRT-LLM often have variations in configuration, please specify --generated_config_version to match the version used when generating configs. For the specific TensorRT-LLM version corresponding to a official dynamo image, you can refer to, for example, this pyproject, or check directly inside the container by running: python -c import tensorrt_llm; print(tensorrt_llm.__version__). In this case, since the image is nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.5.0, you should set --generated_config_version 1.0.0rc6.

Apply (inline mode - default)

--k8s_engine_mode inline

Inline mode embeds the engine configs into the Pod startup script; no ConfigMap is needed:

kubectl apply -f disagg/k8s_deploy.yaml
# or
kubectl apply -f agg/k8s_deploy.yaml

Apply (configmap mode - optional)

--k8s_engine_mode configmap

If you prefer a separate ConfigMap:

# Disagg
kubectl -n dynamo-custom-ns create configmap engine-configs \
  --from-file=prefill_config.yaml=disagg/prefill_config.yaml \
  --from-file=decode_config.yaml=disagg/decode_config.yaml \
  --dry-run=client -o yaml | kubectl apply -f -

# Agg
kubectl -n dynamo-custom-ns create configmap engine-configs \
  --from-file=agg_config.yaml=agg/agg_config.yaml \
  --dry-run=client -o yaml | kubectl apply -f -

# Then apply the CR
kubectl apply -f disagg/k8s_deploy.yaml   # or agg/k8s_deploy.yaml

Additional arguments specific to Kubernetes

All of the following are CLI flags (because they are referenced as dynamo_config.* inside the templates). Defaults shown in bold.

  • --k8s_engine_mode={inline|configmap} - engine config delivery. inline by default.
  • --k8s_model_cache={disabled|pvc:<claimName>} - model cache mount. Default pvc:model-cache (mounted at /workspace/model_cache). It only takes effect when you use a PVC to store the model; otherwise, if you directly specify something like --model_path Qwen/Qwen3-32B-FP8, the model will be downloaded from Hugging Face, and you don’t need to provide --k8s_model_cache.
  • --k8s_namespace=<ns> - target namespace. Default dynamo.
  • --k8s_image=<image> - runtime image. Default nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.5.0.
  • --k8s_image_pull_secret=<secret> - optional pull secret name.