
Commit 6a63177

[TRTLLM-8680][doc] Add table with one-line deployment commands to docs (#8173)
Signed-off-by: Anish Shanbhag <[email protected]>
1 parent d0f107e commit 6a63177

21 files changed (+511, -349 lines)

docs/source/commands/trtllm-serve/run-benchmark-with-trtllm-serve.md

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@ A complete reference for the API is available in the [OpenAI API Reference](http

 This step-by-step tutorial covers the following topics for running online serving benchmarking with Llama 3.1 70B and Qwen2.5-VL-7B for multimodal models:
 * Methodology Introduction
-* Launch the OpenAI-Compatibale Server with NGC container
+* Launch the OpenAI-Compatible Server with NGC container
 * Run the performance benchmark
 * Using `extra_llm_api_options`
 * Multimodal Serving and Benchmarking

docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md renamed to docs/source/deployment-guide/deployment-guide-for-deepseek-r1-on-trtllm.md

Lines changed: 58 additions & 64 deletions
@@ -1,4 +1,4 @@
-# Quick Start Recipe for DeepSeek R1 on TensorRT LLM - Blackwell & Hopper Hardware
+# Deployment Guide for DeepSeek R1 on TensorRT LLM - Blackwell & Hopper Hardware

 ## Introduction

@@ -47,7 +47,7 @@ docker run --rm -it \
   -p 8000:8000 \
   -v ~/.cache:/root/.cache:rw \
   --name tensorrt_llm \
-  nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc6 \
+  nvcr.io/nvidia/tensorrt-llm/release:x.y.z \
   /bin/bash
 ```

@@ -60,108 +60,102 @@ Note:

 If you want to use latest main branch, you can choose to build from source to install TensorRT LLM, the steps refer to [https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html](https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html)

-### Creating the TensorRT LLM Server config
+### Recommended Performance Settings

-We create a YAML configuration file /tmp/config.yml for the TensorRT LLM Server and populate it with the following recommended performance settings.
+We maintain YAML configuration files with recommended performance settings in the [`examples/configs`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/configs) directory. These config files are present in the TensorRT LLM container at the path `/app/tensorrt_llm/examples/configs`. You can use these out-of-the-box, or adjust them to your specific use case.

 ```shell
-EXTRA_LLM_API_FILE=/tmp/config.yml
-
-cat << EOF > ${EXTRA_LLM_API_FILE}
-enable_attention_dp: true
-cuda_graph_config:
-  enable_padding: true
-  max_batch_size: 128
-kv_cache_config:
-  dtype: fp8
-stream_interval: 10
-speculative_config:
-  decoding_type: MTP
-  num_nextn_predict_layers: 1
-EOF
+TRTLLM_DIR=/app/tensorrt_llm # change as needed to match your environment
+EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/deepseek-r1-throughput.yaml
+```
+
+Note: if you don't have access to the source code locally, you can manually create the YAML config file using the code in the dropdown below.
+
+````{admonition} Show code
+:class: dropdown
+
+```{literalinclude} ../../../examples/configs/deepseek-r1-throughput.yaml
+---
+language: shell
+prepend: |
+  EXTRA_LLM_API_FILE=/tmp/config.yml
+
+  cat << EOF > ${EXTRA_LLM_API_FILE}
+append: EOF
+---
 ```
+````

-For FP8 model, we need extra `moe_config`:
+To use the `DeepGEMM` MOE backend on B200/GB200, use this config instead:

 ```shell
-EXTRA_LLM_API_FILE=/tmp/config.yml
-
-cat << EOF > ${EXTRA_LLM_API_FILE}
-enable_attention_dp: true
-cuda_graph_config:
-  enable_padding: true
-  max_batch_size: 128
-kv_cache_config:
-  dtype: fp8
-stream_interval: 10
-speculative_config:
-  decoding_type: MTP
-  num_nextn_predict_layers: 1
-moe_config:
-  backend: DEEPGEMM
-  max_num_tokens: 3200
-EOF
+TRTLLM_DIR=/app/tensorrt_llm # change as needed to match your environment
+EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/deepseek-r1-deepgemm.yaml
 ```

+Note: if you don't have access to the source code locally, you can manually create the YAML config file using the code in the dropdown below.
+
+````{admonition} Show code
+:class: dropdown
+
+```{literalinclude} ../../../examples/configs/deepseek-r1-deepgemm.yaml
+---
+language: shell
+prepend: |
+  EXTRA_LLM_API_FILE=/tmp/config.yml
+
+  cat << EOF > ${EXTRA_LLM_API_FILE}
+append: EOF
+---
+```
+````
+
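
For reference, the inline block removed above indicates what the referenced `deepseek-r1-throughput.yaml` likely contains. If the repository is not checked out locally, a hand-written stand-in based only on those removed lines would look roughly like the sketch below; the file shipped in `examples/configs` remains the authoritative version.

```shell
# Sketch only: recreate the throughput config by hand, mirroring the settings
# from the inline example removed in this commit. Prefer the file in
# examples/configs/deepseek-r1-throughput.yaml when available.
EXTRA_LLM_API_FILE=/tmp/config.yml

cat << EOF > ${EXTRA_LLM_API_FILE}
enable_attention_dp: true
cuda_graph_config:
  enable_padding: true
  max_batch_size: 128
kv_cache_config:
  dtype: fp8
stream_interval: 10
speculative_config:
  decoding_type: MTP
  num_nextn_predict_layers: 1
EOF
```
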
 ### Launch the TensorRT LLM Server

-Below is an example command to launch the TensorRT LLM server with the DeepSeek-R1 model from within the container. The command is specifically configured for the 1024/1024 Input/Output Sequence Length test. The explanation of each flag is shown in the “Configs and Parameters” section.
+Below is an example command to launch the TensorRT LLM server with the DeepSeek-R1 model from within the container. The command is specifically configured for the 1024/1024 Input/Output Sequence Length test. The explanation of each flag is shown in the “LLM API Options (YAML Configuration)” section.

 ```shell
-trtllm-serve deepseek-ai/DeepSeek-R1-0528 \
-  --host 0.0.0.0 \
-  --port 8000 \
-  --max_batch_size 1024 \
-  --max_num_tokens 3200 \
-  --max_seq_len 2048 \
-  --kv_cache_free_gpu_memory_fraction 0.8 \
-  --tp_size 8 \
-  --ep_size 8 \
-  --trust_remote_code \
-  --extra_llm_api_options ${EXTRA_LLM_API_FILE}
+trtllm-serve deepseek-ai/DeepSeek-R1-0528 --host 0.0.0.0 --port 8000 --extra_llm_api_options ${EXTRA_LLM_API_FILE}
 ```

 After the server is set up, the client can now send prompt requests to the server and receive results.
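
Since `trtllm-serve` exposes an OpenAI-compatible API, a quick way to confirm the server launched above is responding is a standard chat-completions request. The snippet below is a minimal sketch assuming the default host and port from the guide; the fields follow the usual OpenAI schema.

```shell
# Minimal smoke test against the OpenAI-compatible endpoint launched above.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-ai/DeepSeek-R1-0528",
        "messages": [{"role": "user", "content": "Summarize what TensorRT LLM does."}],
        "max_tokens": 64
      }'
```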

-### Configs and Parameters
+### LLM API Options (YAML Configuration)
+
+<!-- TODO: this section is duplicated across the deployment guides; they should be consolidated to a central file and imported as needed, or we can remove this and link to LLM API reference -->
+
+These options provide control over TensorRT LLM's behavior and are set within the YAML file passed to the `trtllm-serve` command via the `--extra_llm_api_options` argument.

-These options are used directly on the command line when you start the `trtllm-serve` process.

-#### `--tp_size`
+#### `tensor_parallel_size`

 * **Description:** Sets the **tensor-parallel size**. This should typically match the number of GPUs you intend to use for a single model instance.

-#### `--ep_size`
+#### `moe_expert_parallel_size`

-* **Description:** Sets the **expert-parallel size** for Mixture-of-Experts (MoE) models. Like `tp_size`, this should generally match the number of GPUs you're using. This setting has no effect on non-MoE models.
+* **Description:** Sets the **expert-parallel size** for Mixture-of-Experts (MoE) models. Like `tensor_parallel_size`, this should generally match the number of GPUs you're using. This setting has no effect on non-MoE models.

-#### `--kv_cache_free_gpu_memory_fraction`
+#### `kv_cache_free_gpu_memory_fraction`

 * **Description:** A value between `0.0` and `1.0` that specifies the fraction of free GPU memory to reserve for the KV cache after the model is loaded. Since memory usage can fluctuate, this buffer helps prevent out-of-memory (OOM) errors.
 * **Recommendation:** If you experience OOM errors, try reducing this value to `0.7` or lower.

-
-#### `--max_batch_size`
+#### `max_batch_size`

 * **Description:** The maximum number of user requests that can be grouped into a single batch for processing.

-#### `--max_num_tokens`
+#### `max_num_tokens`

 * **Description:** The maximum total number of tokens (across all requests) allowed inside a single scheduled batch.

-#### `--max_seq_len`
+#### `max_seq_len`

 * **Description:** The maximum possible sequence length for a single request, including both input and generated output tokens.

-#### `--trust_remote_code`
+#### `trust_remote_code`

 &emsp;**Description:** Allows TensorRT LLM to download models and tokenizers from Hugging Face. This flag is passed directly to the Hugging Face API.

-
-#### Extra LLM API Options (YAML Configuration)
-
-These options provide finer control over performance and are set within a YAML file passed to the `trtllm-serve` command via the `--extra_llm_api_options` argument.
-
 #### `kv_cache_config`

 * **Description**: A section for configuring the Key-Value (KV) cache.
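
The renamed section above reflects that these knobs now live in the YAML file rather than on the command line. As an illustration only (not taken from the repository), the values the removed multi-flag `trtllm-serve` command passed explicitly could be expressed in an extra-options file along these lines:

```shell
# Illustration: the options documented above, written as extra LLM API YAML
# using the values from the removed trtllm-serve command.
cat << EOF > /tmp/extra_llm_api_options.yml
tensor_parallel_size: 8
moe_expert_parallel_size: 8
max_batch_size: 1024
max_num_tokens: 3200
max_seq_len: 2048
kv_cache_free_gpu_memory_fraction: 0.8
trust_remote_code: true
EOF
```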

docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md renamed to docs/source/deployment-guide/deployment-guide-for-gpt-oss-on-trtllm.md

Lines changed: 59 additions & 61 deletions
@@ -1,4 +1,4 @@
-# Quick Start Recipe for GPT-OSS on TensorRT-LLM - Blackwell Hardware
+# Deployment Guide for GPT-OSS on TensorRT-LLM - Blackwell Hardware

 ## Introduction

@@ -43,7 +43,7 @@ docker run --rm -it \
   -p 8000:8000 \
   -v ~/.cache:/root/.cache:rw \
   --name tensorrt_llm \
-  nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc6 \
+  nvcr.io/nvidia/tensorrt-llm/release:x.y.z \
   /bin/bash
 ```

@@ -56,105 +56,103 @@ Note:

 If you want to use latest main branch, you can choose to build from source to install TensorRT-LLM, the steps refer to <https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html>.

-### Creating the TensorRT LLM Server config
+### Recommended Performance Settings

-We create a YAML configuration file `/tmp/config.yml` for the TensorRT-LLM Server and populate it with the following recommended performance settings.
+We maintain YAML configuration files with recommended performance settings in the [`examples/configs`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/configs) directory. These config files are present in the TensorRT LLM container at the path `/app/tensorrt_llm/examples/configs`. You can use these out-of-the-box, or adjust them to your specific use case.

-For low-latency with `TRTLLM` MOE backend:
+For low-latency use cases:

 ```shell
-EXTRA_LLM_API_FILE=/tmp/config.yml
-
-cat << EOF > ${EXTRA_LLM_API_FILE}
-enable_attention_dp: false
-cuda_graph_config:
-  enable_padding: true
-  max_batch_size: 720
-moe_config:
-  backend: TRTLLM
-stream_interval: 20
-num_postprocess_workers: 4
-EOF
+TRTLLM_DIR=/app/tensorrt_llm # change as needed to match your environment
+EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/gpt-oss-120b-latency.yaml
+```
+
+Note: if you don't have access to the source code locally, you can manually create the YAML config file using the code in the dropdown below.
+
+````{admonition} Show code
+:class: dropdown
+
+```{literalinclude} ../../../examples/configs/gpt-oss-120b-latency.yaml
+---
+language: shell
+prepend: |
+  EXTRA_LLM_API_FILE=/tmp/config.yml
+
+  cat << EOF > ${EXTRA_LLM_API_FILE}
+append: EOF
+---
 ```
+````

-For max-throughput with `CUTLASS` MOE backend:
+For max-throughput use cases:

 ```shell
-EXTRA_LLM_API_FILE=/tmp/config.yml
-
-cat << EOF > ${EXTRA_LLM_API_FILE}
-enable_attention_dp: true
-cuda_graph_config:
-  enable_padding: true
-  max_batch_size: 720
-moe_config:
-  backend: CUTLASS
-stream_interval: 20
-num_postprocess_workers: 4
-attention_dp_config:
-  enable_balance: true
-  batching_wait_iters: 50
-  timeout_iters: 1
-EOF
+TRTLLM_DIR=/app/tensorrt_llm # change as needed to match your environment
+EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/gpt-oss-120b-throughput.yaml
+```
+
+Note: if you don't have access to the source code locally, you can manually create the YAML config file using the code in the dropdown below.
+
+````{admonition} Show code
+:class: dropdown
+
+```{literalinclude} ../../../examples/configs/gpt-oss-120b-throughput.yaml
+---
+language: shell
+prepend: |
+  EXTRA_LLM_API_FILE=/tmp/config.yml
+
+  cat << EOF > ${EXTRA_LLM_API_FILE}
+append: EOF
+---
 ```
+````
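
As in the DeepSeek guide, the removed inline blocks indicate what the two referenced GPT-OSS configs roughly contain: the latency variant keeps attention DP off with the `TRTLLM` MOE backend, while the throughput variant enables attention DP with the `CUTLASS` backend. A hand-written stand-in for the throughput variant, based only on those removed lines, might look like this; the shipped `examples/configs/gpt-oss-120b-throughput.yaml` is authoritative.

```shell
# Sketch only: hand-written equivalent of the max-throughput GPT-OSS config,
# mirroring the inline example removed in this commit.
EXTRA_LLM_API_FILE=/tmp/config.yml

cat << EOF > ${EXTRA_LLM_API_FILE}
enable_attention_dp: true
cuda_graph_config:
  enable_padding: true
  max_batch_size: 720
moe_config:
  backend: CUTLASS
stream_interval: 20
num_postprocess_workers: 4
attention_dp_config:
  enable_balance: true
  batching_wait_iters: 50
  timeout_iters: 1
EOF
```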

 ### Launch the TensorRT LLM Server

-Below is an example command to launch the TensorRT LLM server with the GPT-OSS model from within the container. The command is specifically configured for the 1024/1024 Input/Output Sequence Length test. The explanation of each flag is shown in the “Configs and Parameters” section.
+Below is an example command to launch the TensorRT LLM server with the GPT-OSS model from within the container. The command is specifically configured for the 1024/1024 Input/Output Sequence Length test. The explanation of each flag is shown in the “LLM API Options (YAML Configuration)” section.

 ```shell
-trtllm-serve openai/gpt-oss-120b \
-  --host 0.0.0.0 \
-  --port 8000 \
-  --max_batch_size 720 \
-  --max_num_tokens 16384 \
-  --kv_cache_free_gpu_memory_fraction 0.9 \
-  --tp_size 8 \
-  --ep_size 8 \
-  --trust_remote_code \
-  --extra_llm_api_options ${EXTRA_LLM_API_FILE}
+trtllm-serve openai/gpt-oss-120b --host 0.0.0.0 --port 8000 --extra_llm_api_options ${EXTRA_LLM_API_FILE}
 ```

 After the server is set up, the client can now send prompt requests to the server and receive results.

-### Configs and Parameters
+### LLM API Options (YAML Configuration)

-These options are used directly on the command line when you start the `trtllm-serve` process.
+<!-- TODO: this section is duplicated across the deployment guides; they should be consolidated to a central file and imported as needed, or we can remove this and link to LLM API reference -->

-#### `--tp_size`
+These options provide control over TensorRT LLM's behavior and are set within the YAML file passed to the `trtllm-serve` command via the `--extra_llm_api_options` argument.
+
+#### `tensor_parallel_size`

 * **Description:** Sets the **tensor-parallel size**. This should typically match the number of GPUs you intend to use for a single model instance.

-#### `--ep_size`
+#### `moe_expert_parallel_size`

-* **Description:** Sets the **expert-parallel size** for Mixture-of-Experts (MoE) models. Like `tp_size`, this should generally match the number of GPUs you're using. This setting has no effect on non-MoE models.
+* **Description:** Sets the **expert-parallel size** for Mixture-of-Experts (MoE) models. Like `tensor_parallel_size`, this should generally match the number of GPUs you're using. This setting has no effect on non-MoE models.

-#### `--kv_cache_free_gpu_memory_fraction`
+#### `kv_cache_free_gpu_memory_fraction`

 * **Description:** A value between `0.0` and `1.0` that specifies the fraction of free GPU memory to reserve for the KV cache after the model is loaded. Since memory usage can fluctuate, this buffer helps prevent out-of-memory (OOM) errors.
 * **Recommendation:** If you experience OOM errors, try reducing this value to `0.7` or lower.

-#### `--max_batch_size`
+#### `max_batch_size`

 * **Description:** The maximum number of user requests that can be grouped into a single batch for processing. The actual max batch size that can be achieved depends on total sequence length (input + output).

-#### `--max_num_tokens`
+#### `max_num_tokens`

 * **Description:** The maximum total number of tokens (across all requests) allowed inside a single scheduled batch.

-#### `--max_seq_len`
+#### `max_seq_len`

-* **Description:** The maximum possible sequence length for a single request, including both input and generated output tokens. We won't specifically set it. It will be inferred from model config.
+* **Description:** The maximum possible sequence length for a single request, including both input and generated output tokens. If not set, it will be inferred from model config.

-#### `--trust_remote_code`
+#### `trust_remote_code`

 * **Description:** Allows TensorRT-LLM to download models and tokenizers from Hugging Face. This flag is passed directly to the Hugging Face API.

-
-#### Extra LLM API Options (YAML Configuration)
-
-These options provide finer control over performance and are set within a YAML file passed to the `trtllm-serve` command via the `--extra_llm_api_options` argument.
-
 #### `cuda_graph_config`

 * **Description**: A section for configuring CUDA graphs to optimize performance.
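
The commit view truncates the diff here, but the `cuda_graph_config` section it describes is already visible in the configuration examples above. For orientation, the removed inline configs set it as follows; the comments are a brief gloss, not text from the commit.

```shell
# cuda_graph_config as it appears in the extra LLM API YAML, copied from the
# inline examples removed earlier in this commit (GPT-OSS values shown).
cat << EOF > /tmp/cuda_graph_snippet.yml
cuda_graph_config:
  enable_padding: true   # pad batches so they match a captured CUDA graph size
  max_batch_size: 720    # largest batch size for which CUDA graphs are captured
EOF
```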
