Commit 2147c63

lancelly (Lanyu Liao) authored and committed
[TRTLLM-9513][docs] Qwen3 deployment guide (NVIDIA#9488)
Signed-off-by: Lanyu Liao <laliao@laliao-mlt.client.nvidia.com> Co-authored-by: Lanyu Liao <laliao@laliao-mlt.client.nvidia.com>
1 parent 7ecd542 commit 2147c63

File tree: 2 files changed (+257, −0)

Lines changed: 256 additions & 0 deletions
# Deployment Guide for Qwen3 on TensorRT LLM - Blackwell & Hopper Hardware

## Introduction

This is a functional quick-start guide for running the Qwen3 model on TensorRT LLM. It focuses on a working setup with recommended defaults. Additional performance optimizations and support will be rolled out in future updates.

## Prerequisites

* GPU: NVIDIA Blackwell or Hopper architecture
* OS: Linux
* Drivers: CUDA driver 575 or later
* Docker with the NVIDIA Container Toolkit installed
* Python 3 and python3-pip (optional; for accuracy evaluation only)

## Models

* [Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B)
* [Qwen3-235B-A22B](https://huggingface.co/Qwen/Qwen3-235B-A22B)
* [Qwen3-235B-A22B-FP8](https://huggingface.co/Qwen/Qwen3-235B-A22B-FP8)
* [Qwen3-30B-A3B-NVFP4](https://huggingface.co/nvidia/Qwen3-30B-A3B-NVFP4)
* [Qwen3-235B-A22B-NVFP4](https://huggingface.co/nvidia/Qwen3-235B-A22B-NVFP4)

## Deployment Steps

### Run Docker Container

Build and run the Docker container. See the [Docker guide](../../../docker/README.md) for details.

```shell
cd TensorRT-LLM

make -C docker release_build IMAGE_TAG=qwen3-local

make -C docker release_run IMAGE_NAME=tensorrt_llm IMAGE_TAG=qwen3-local LOCAL_USER=1
```

### Recommended Performance Settings

We maintain YAML configuration files with recommended performance settings in the [`examples/configs`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/configs) directory. These config files are present in the TensorRT LLM container at the path `/app/tensorrt_llm/examples/configs`. You can use them out of the box, or adjust them to your specific use case.

```shell
TRTLLM_DIR=/app/tensorrt_llm # change as needed to match your environment
EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/qwen3.yaml
```

Note: if you don't have access to the source code locally, you can manually create the YAML config file using the code in the dropdown below.

````{admonition} Show code
:class: dropdown

```{literalinclude} ../../../examples/configs/qwen3.yaml
---
language: shell
prepend: |
  EXTRA_LLM_API_FILE=/tmp/config.yml

  cat << EOF > ${EXTRA_LLM_API_FILE}
append: EOF
---
```
````

### Launch the TensorRT LLM Server

Below is an example command to launch the TensorRT LLM server with the Qwen3 model from within the container.

```shell
trtllm-serve Qwen/Qwen3-30B-A3B --host 0.0.0.0 --port 8000 --extra_llm_api_options ${EXTRA_LLM_API_FILE}
```

Once the server is up, clients can send prompt requests to it and receive results.

### LLM API Options (YAML Configuration)

<!-- TODO: this section is duplicated across the deployment guides; they should be consolidated to a central file and imported as needed, or we can remove this and link to LLM API reference -->

These options provide control over TensorRT LLM's behavior and are set within the YAML file passed to the `trtllm-serve` command via the `--extra_llm_api_options` argument.

#### `tensor_parallel_size`

* **Description:** Sets the **tensor-parallel size**. This should typically match the number of GPUs you intend to use for a single model instance.

#### `moe_expert_parallel_size`

* **Description:** Sets the **expert-parallel size** for Mixture-of-Experts (MoE) models. Like `tensor_parallel_size`, this should generally match the number of GPUs you're using. This setting has no effect on non-MoE models.

#### `kv_cache_free_gpu_memory_fraction`

* **Description:** A value between `0.0` and `1.0` that specifies the fraction of free GPU memory to reserve for the KV cache after the model is loaded. Since memory usage can fluctuate, this buffer helps prevent out-of-memory (OOM) errors.
* **Recommendation:** If you experience OOM errors, try reducing this value to `0.7` or lower.

#### `max_batch_size`

* **Description:** The maximum number of user requests that can be grouped into a single batch for processing. The actual maximum batch size that can be achieved also depends on total sequence length (input + output).

#### `max_num_tokens`

* **Description:** The maximum total number of tokens (across all requests) allowed inside a single scheduled batch.

#### `max_seq_len`

* **Description:** The maximum possible sequence length for a single request, including both input and generated output tokens. This guide does not set it explicitly; it is inferred from the model config.

#### `trust_remote_code`

* **Description:** Allows TensorRT LLM to download models and tokenizers from Hugging Face. This flag is passed directly to the Hugging Face API.

#### `cuda_graph_config`

* **Description**: A section for configuring CUDA graphs to optimize performance.
* **Options**:
  * `enable_padding`: If `true`, input batches are padded to the nearest `cuda_graph_batch_size`. This can significantly improve performance.

    **Default**: `false`
  * `batch_sizes`: A list of batch sizes for which CUDA graphs will be pre-captured.

    **Recommendation**: Set this to cover the range of batch sizes you expect in production.

#### `moe_config`

* **Description**: Configuration for Mixture-of-Experts (MoE) models.
* **Options**:
  * `backend`: The backend to use for MoE operations.

    **Default**: `CUTLASS`

See the [`TorchLlmArgs` class](https://nvidia.github.io/TensorRT-LLM/llm-api/reference.html#tensorrt_llm.llmapi.TorchLlmArgs) for the full list of options that can be used in `extra_llm_api_options`.
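For illustration, the options above can be combined into a single extra-options file. The sketch below writes one with placeholder values for a hypothetical 8-GPU deployment; these are assumptions, not tuned recommendations, so prefer the maintained `examples/configs/qwen3.yaml` when it is available.

```shell
# Hypothetical combined config; every value below is a placeholder
# chosen for an assumed 8-GPU node, not a tuned recommendation.
EXAMPLE_LLM_API_FILE=/tmp/qwen3_example.yml

cat << 'EOF' > ${EXAMPLE_LLM_API_FILE}
tensor_parallel_size: 8
moe_expert_parallel_size: 8
kv_cache_free_gpu_memory_fraction: 0.85
max_batch_size: 256
max_num_tokens: 8192
trust_remote_code: true
cuda_graph_config:
  enable_padding: true
  batch_sizes: [1, 2, 4, 8, 16, 32, 64, 128, 256]
moe_config:
  backend: CUTLASS
EOF
```

A file like this is passed to the server the same way as the shipped config, via `--extra_llm_api_options ${EXAMPLE_LLM_API_FILE}`.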
## Testing API Endpoint

### Basic Test

Start a new terminal on the host to test the TensorRT LLM server you just launched.

You can query the health/readiness of the server using:

```shell
curl -s -o /dev/null -w "Status: %{http_code}\n" "http://localhost:8000/health"
```

When `Status: 200` is returned, the server is ready for queries. Note that the very first query may take longer due to initialization and compilation.
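Rather than polling by hand, the health check above can be wrapped in a small wait loop. This is only a sketch; the URL, retry count, and delay are assumptions to adjust for your deployment.

```shell
# Sketch: poll the health endpoint until it returns HTTP 200 or the
# retry budget is exhausted. Defaults are assumptions, not requirements.
wait_for_server() {
  local url=${1:-http://localhost:8000/health}
  local tries=${2:-120}
  local delay=${3:-5}
  local code i
  for ((i = 0; i < tries; i++)); do
    # curl exits nonzero while the server is down; treat that as "not ready"
    code=$(curl -s -o /dev/null -w '%{http_code}' "$url" || true)
    if [ "$code" = "200" ]; then
      return 0
    fi
    sleep "$delay"
  done
  return 1
}
```

Example usage: `wait_for_server && echo "server ready"`.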
After the TensorRT LLM server is set up and shows `Application startup complete`, you can send requests to the server.

```shell
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "Qwen/Qwen3-30B-A3B",
  "messages": [
    {
      "role": "user",
      "content": "What is the capital of France?"
    }
  ],
  "max_tokens": 512,
  "temperature": 0.7,
  "top_p": 0.95
}' -w "\n"
```

Here is an example response:

```json
{
  "id": "chatcmpl-abc123def456",
  "object": "chat.completion",
  "created": 1759022940,
  "model": "Qwen/Qwen3-30B-A3B",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The capital of France is Paris. Paris is not only the capital but also the largest city in France, known for its rich history, culture, art, and iconic landmarks such as the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral."
      },
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 15,
    "completion_tokens": 58,
    "total_tokens": 73
  }
}
```

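A response like the one above can be post-processed without extra tooling. The sketch below uses `python3` (listed in the prerequisites) to pull out just the assistant text; it runs against a trimmed-down stand-in file rather than live server output, so the file path and its contents are purely illustrative.

```shell
# Illustrative stand-in for a saved response; with a live server you
# would redirect `curl -s ... -o /tmp/response.json` instead.
cat << 'EOF' > /tmp/response.json
{"choices": [{"message": {"role": "assistant", "content": "The capital of France is Paris."}}]}
EOF

# Extract only the assistant message content from the JSON body.
python3 -c 'import json; print(json.load(open("/tmp/response.json"))["choices"][0]["message"]["content"])'
# prints: The capital of France is Paris.
```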
### Troubleshooting Tips

* If you encounter CUDA out-of-memory errors, try reducing `max_batch_size`, `max_num_tokens`, or `kv_cache_free_gpu_memory_fraction`.
* Ensure your model checkpoints are compatible with the expected format.
* For performance issues, check GPU utilization with `nvidia-smi` while the server is running.
* If the container fails to start, verify that the NVIDIA Container Toolkit is properly installed.
* For connection issues, make sure the server port (`8000` in this guide) is not being used by another application.
* For MoE models (Qwen3-30B-A3B, Qwen3-235B-A22B), ensure `moe_expert_parallel_size` is properly configured.

## Benchmarking Performance

To benchmark the performance of your TensorRT LLM server, you can leverage the built-in `benchmark_serving.py` script. To do this, first create a wrapper `bench.sh` script.

```shell
cat <<'EOF' > bench.sh
#!/usr/bin/env bash
set -euo pipefail

# Adjust the model name based on which Qwen3 model you're benchmarking
MODEL_NAME="Qwen/Qwen3-30B-A3B"

concurrency_list="1 2 4 8 16 32 64 128"
multi_round=5
isl=1024
osl=1024
result_dir=/tmp/qwen3_output

for concurrency in ${concurrency_list}; do
    num_prompts=$((concurrency * multi_round))
    python -m tensorrt_llm.serve.scripts.benchmark_serving \
        --model ${MODEL_NAME} \
        --backend openai \
        --dataset-name "random" \
        --random-input-len ${isl} \
        --random-output-len ${osl} \
        --random-prefix-len 0 \
        --random-ids \
        --num-prompts ${num_prompts} \
        --max-concurrency ${concurrency} \
        --ignore-eos \
        --tokenize-on-client \
        --percentile-metrics "ttft,tpot,itl,e2el"
done
EOF
chmod +x bench.sh
```

To achieve maximum throughput with attention DP enabled, you need to sweep concurrency up to `max_batch_size * num_gpus`.
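As a sketch, such a sweep list can be generated rather than hard-coded; the `max_batch_size` and `num_gpus` values below are placeholders to replace with your deployment's actual settings.

```shell
# Placeholder values; substitute your deployment's actual settings.
max_batch_size=256
num_gpus=8

# Build a doubling concurrency sweep up to max_batch_size * num_gpus.
limit=$((max_batch_size * num_gpus))
concurrency_list=""
c=1
while [ "$c" -le "$limit" ]; do
  concurrency_list="${concurrency_list}${concurrency_list:+ }${c}"
  c=$((c * 2))
done

echo "$concurrency_list"
# prints: 1 2 4 8 16 32 64 128 256 512 1024 2048
```

The resulting string can replace the `concurrency_list` variable in `bench.sh`.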

If you want to save the results to a file, add the following options to the `benchmark_serving` command:

```shell
    --save-result \
    --result-dir "${result_dir}" \
    --result-filename "concurrency_${concurrency}.json"
```

For more benchmarking options, see [benchmark_serving.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/serve/scripts/benchmark_serving.py).

Run `bench.sh` to begin a serving benchmark. This will take a long time if you run all the concurrencies listed in the `bench.sh` script above.

```shell
./bench.sh
```

docs/source/deployment-guide/index.rst

Lines changed: 1 addition & 0 deletions

```diff
@@ -91,4 +91,5 @@ The deployment guides below provide more detailed instructions for serving speci
    deployment-guide-for-llama3.3-70b-on-trtllm.md
    deployment-guide-for-llama4-scout-on-trtllm.md
    deployment-guide-for-gpt-oss-on-trtllm.md
+   deployment-guide-for-qwen3-on-trtllm.md
    deployment-guide-for-qwen3-next-on-trtllm.md
```
