
Commit a193867

[None][doc] Refine deployment guide by renaming TRT-LLM to TensorRT L… (#8214)
Signed-off-by: nv-guomingz <[email protected]>
1 parent 27677a3 commit a193867

File tree: 4 files changed (+22, -41 lines)

docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md

Lines changed: 6 additions & 10 deletions
@@ -22,7 +22,7 @@ The guide is intended for developers and practitioners seeking high-throughput o

 ## MoE Backend Support Matrix

-There are multiple MOE backends inside TRT-LLM, not all of them supporting every precision on every GPUs. Here are the support matrix of the MOE backends.
+There are multiple MOE backends inside TensorRT LLM, not all of them supporting every precision on every GPUs. Here are the support matrix of the MOE backends.

 | device | Checkpoint | Supported moe_backend |
 |----------|----------|----------|
@@ -60,7 +60,7 @@ Note:

 If you want to use latest main branch, you can choose to build from source to install TensorRT LLM, the steps refer to [https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html](https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html)

-### Creating the TRT-LLM Server config
+### Creating the TensorRT LLM Server config

 We create a YAML configuration file /tmp/config.yml for the TensorRT LLM Server and populate it with the following recommended performance settings.

@@ -103,15 +103,14 @@ moe_config:
 EOF
 ```

-### Launch the TRT-LLM Server
+### Launch the TensorRT LLM Server

-Below is an example command to launch the TRT-LLM server with the DeepSeek-R1 model from within the container. The command is specifically configured for the 1024/1024 Input/Output Sequence Length test. The explanation of each flag is shown in the “Configs and Parameters” section.
+Below is an example command to launch the TensorRT LLM server with the DeepSeek-R1 model from within the container. The command is specifically configured for the 1024/1024 Input/Output Sequence Length test. The explanation of each flag is shown in the “Configs and Parameters” section.

 ```shell
 trtllm-serve deepseek-ai/DeepSeek-R1-0528 \
 --host 0.0.0.0 \
 --port 8000 \
---backend pytorch \
 --max_batch_size 1024 \
 --max_num_tokens 3200 \
 --max_seq_len 2048 \
@@ -141,9 +140,6 @@ These options are used directly on the command line when you start the `trtllm-s
 * **Description:** A value between `0.0` and `1.0` that specifies the fraction of free GPU memory to reserve for the KV cache after the model is loaded. Since memory usage can fluctuate, this buffer helps prevent out-of-memory (OOM) errors.
 * **Recommendation:** If you experience OOM errors, try reducing this value to `0.7` or lower.

-#### `--backend pytorch`
-
-&emsp;**Description:** Tells TensorRT LLM to use the **pytorch** backend.

 #### `--max_batch_size`

@@ -240,7 +236,7 @@ curl -s -o /dev/null -w "Status: %{http_code}\n" "http://localhost:8000/health"

 When the `Status: 200` code is returned, the server is ready for queries. Note that the very first query may take longer due to initialization and compilation.

-After the TRT-LLM server is set up and shows Application startup complete, you can send requests to the server.
+After the TensorRT LLM server is set up and shows Application startup complete, you can send requests to the server.

 ```shell
 curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{
@@ -251,7 +247,7 @@ curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -
 }'
 ```

-Here is an example response, showing that the TRT-LLM server returns “New York is a state located in the northeastern United States. It is bordered by”, completing the input sequence.
+Here is an example response, showing that the TensorRT LLM server returns “New York is a state located in the northeastern United States. It is bordered by”, completing the input sequence.

 ```json
 {"id":"cmpl-e728f08114c042309efeae4df86a50ca","object":"text_completion","created":1754294810,"model":"deepseek-ai/DeepSeek-R1-0528","choices":[{"index":0,"text":" / by Megan Stine ; illustrated by John Hinderliter.\n\nBook | Gross","token_ids":null,"logprobs":null,"context_logits":null,"finish_reason":"length","stop_reason":null,"disaggregated_params":null}],"usage":{"prompt_tokens":6,"total_tokens":22,"completion_tokens":16},"prompt_token_ids":null}

docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md

Lines changed: 6 additions & 11 deletions
@@ -21,7 +21,7 @@ The guide is intended for developers and practitioners seeking high-throughput o

 ## MoE Backend Support Matrix

-There are multiple MOE backends inside TRT-LLM. Here are the support matrix of the MOE backends.
+There are multiple MOE backends inside TensorRT LLM. Here are the support matrix of the MOE backends.

 | Device | Activation Type | MoE Weights Type | MoE Backend | Use Case |
 |------------|------------------|------------------|-------------|----------------|
@@ -56,7 +56,7 @@ Note:

 If you want to use latest main branch, you can choose to build from source to install TensorRT-LLM, the steps refer to <https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html>.

-### Creating the TRT-LLM Server config
+### Creating the TensorRT LLM Server config

 We create a YAML configuration file `/tmp/config.yml` for the TensorRT-LLM Server and populate it with the following recommended performance settings.

@@ -98,15 +98,14 @@ attention_dp_config:
 EOF
 ```

-### Launch the TRT-LLM Server
+### Launch the TensorRT LLM Server

-Below is an example command to launch the TRT-LLM server with the GPT-OSS model from within the container. The command is specifically configured for the 1024/1024 Input/Output Sequence Length test. The explanation of each flag is shown in the “Configs and Parameters” section.
+Below is an example command to launch the TensorRT LLM server with the GPT-OSS model from within the container. The command is specifically configured for the 1024/1024 Input/Output Sequence Length test. The explanation of each flag is shown in the “Configs and Parameters” section.

 ```shell
 trtllm-serve openai/gpt-oss-120b \
 --host 0.0.0.0 \
 --port 8000 \
---backend pytorch \
 --max_batch_size 720 \
 --max_num_tokens 16384 \
 --kv_cache_free_gpu_memory_fraction 0.9 \
@@ -135,10 +134,6 @@ These options are used directly on the command line when you start the `trtllm-s
 * **Description:** A value between `0.0` and `1.0` that specifies the fraction of free GPU memory to reserve for the KV cache after the model is loaded. Since memory usage can fluctuate, this buffer helps prevent out-of-memory (OOM) errors.
 * **Recommendation:** If you experience OOM errors, try reducing this value to `0.7` or lower.

-#### `--backend pytorch`
-
-* **Description:** Tells TensorRT-LLM to use the **pytorch** backend.
-
 #### `--max_batch_size`

 * **Description:** The maximum number of user requests that can be grouped into a single batch for processing. The actual max batch size that can be achieved depends on total sequence length (input + output).
@@ -201,7 +196,7 @@ curl -s -o /dev/null -w "Status: %{http_code}\n" "http://localhost:8000/health"

 When the `Status: 200` code is returned, the server is ready for queries. Note that the very first query may take longer due to initialization and compilation.

-After the TRT-LLM server is set up and shows Application startup complete, you can send requests to the server.
+After the TensorRT LLM server is set up and shows Application startup complete, you can send requests to the server.

 ```shell
 curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
@@ -217,7 +212,7 @@ curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/jso
 }' -w "\n"
 ```

-Here is an example response, showing that the TRT-LLM server reasons and answers the questions.
+Here is an example response, showing that the TensorRT LLM server reasons and answers the questions.

 TODO: Use Chat Compeletions API / Responses API as the example after the PR is merged.
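The chat request above can likewise be sent from Python. A minimal sketch, assuming the `openai` package is installed; the base URL and model name follow the launch command in this guide, and the message content is a placeholder:

```python
# Minimal sketch: call the OpenAI-compatible /v1/chat/completions endpoint
# exposed by trtllm-serve. Assumes `pip install openai`.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[
        {"role": "user", "content": "Explain what the KV cache does in one sentence."}  # placeholder prompt
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)
```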

docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md

Lines changed: 5 additions & 10 deletions
@@ -52,7 +52,7 @@ Note:

 If you want to use latest main branch, you can choose to build from source to install TensorRT LLM, the steps refer to [https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html](https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html)

-### Creating the TRT-LLM Server config
+### Creating the TensorRT LLM Server config

 We create a YAML configuration file /tmp/config.yml for the TensorRT LLM Server and populate it with the following recommended performance settings.

@@ -69,15 +69,14 @@ kv_cache_config:
 EOF
 ```

-### Launch the TRT-LLM Server
+### Launch the TensorRT LLM Server

-Below is an example command to launch the TRT-LLM server with the Llama-3.3-70B-Instruct-FP8 model from within the container. The command is specifically configured for the 1024/1024 Input/Output Sequence Length test. The explanation of each flag is shown in the “Configs and Parameters” section.
+Below is an example command to launch the TensorRT LLM server with the Llama-3.3-70B-Instruct-FP8 model from within the container. The command is specifically configured for the 1024/1024 Input/Output Sequence Length test. The explanation of each flag is shown in the “Configs and Parameters” section.

 ```shell
 trtllm-serve nvidia/Llama-3.3-70B-Instruct-FP8 \
 --host 0.0.0.0 \
 --port 8000 \
---backend pytorch \
 --max_batch_size 1024 \
 --max_num_tokens 2048 \
 --max_seq_len 2048 \
@@ -107,10 +106,6 @@ These options are used directly on the command line when you start the `trtllm-s

 &emsp;**Recommendation:** If you experience OOM errors, try reducing this value to **0.8** or lower.

-#### `--backend pytorch`
-
-&emsp;**Description:** Tells TensorRT LLM to use the **pytorch** backend.
-
 #### `--max_batch_size`

 &emsp;**Description:** The maximum number of user requests that can be grouped into a single batch for processing.
@@ -194,7 +189,7 @@ curl -s -o /dev/null -w "Status: %{http_code}\n" "http://localhost:8000/health"

 When the `Status: 200` code is returned, the server is ready for queries. Note that the very first query may take longer due to initialization and compilation.

-After the TRT-LLM server is set up and shows Application startup complete, you can send requests to the server.
+After the TensorRT LLM server is set up and shows Application startup complete, you can send requests to the server.

 ```shell
 curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{
@@ -205,7 +200,7 @@ curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -
 }'
 ```

-Here is an example response, showing that the TRT-LLM server returns “New York is a state located in the northeastern United States. It is bordered by”, completing the input sequence.
+Here is an example response, showing that the TensorRT LLM server returns “New York is a state located in the northeastern United States. It is bordered by”, completing the input sequence.

 ```json
 {"id":"cmpl-bc1393d529ce485c961d9ffee5b25d72","object":"text_completion","created":1753843963,"model":"nvidia/Llama-3.3-70B-Instruct-FP8","choices":[{"index":0,"text":" New York is a state located in the northeastern United States. It is bordered by","token_ids":null,"logprobs":null,"context_logits":null,"finish_reason":"length","stop_reason":null,"disaggregated_params":null}],"usage":{"prompt_tokens":6,"total_tokens":22,"completion_tokens":16},"prompt_token_ids":null}

docs/source/deployment-guide/quick-start-recipe-for-llama4-scout-on-trtllm.md

Lines changed: 5 additions & 10 deletions
@@ -51,7 +51,7 @@ Note:

 If you want to use latest main branch, you can choose to build from source to install TensorRT LLM, the steps refer to [https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html](https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html)

-### Creating the TRT-LLM Server config
+### Creating the TensorRT LLM Server config

 We create a YAML configuration file /tmp/config.yml for the TensorRT LLM Server and populate it with the following recommended performance settings.

@@ -68,15 +68,14 @@ kv_cache_config:
 EOF
 ```

-### Launch the TRT-LLM Server
+### Launch the TensorRT LLM Server

-Below is an example command to launch the TRT-LLM server with the Llama-4-Scout-17B-16E-Instruct-FP8 model from within the container. The command is specifically configured for the 1024/1024 Input/Output Sequence Length test. The explanation of each flag is shown in the “Configs and Parameters” section.
+Below is an example command to launch the TensorRT LLM server with the Llama-4-Scout-17B-16E-Instruct-FP8 model from within the container. The command is specifically configured for the 1024/1024 Input/Output Sequence Length test. The explanation of each flag is shown in the “Configs and Parameters” section.

 ```shell
 trtllm-serve nvidia/Llama-4-Scout-17B-16E-Instruct-FP8 \
 --host 0.0.0.0 \
 --port 8000 \
---backend pytorch \
 --max_batch_size 1024 \
 --max_num_tokens 2048 \
 --max_seq_len 2048 \
@@ -106,10 +105,6 @@ These options are used directly on the command line when you start the `trtllm-s
 * **Description:** A value between `0.0` and `1.0` that specifies the fraction of free GPU memory to reserve for the KV cache after the model is loaded. Since memory usage can fluctuate, this buffer helps prevent out-of-memory (OOM) errors.
 * **Recommendation:** If you experience OOM errors, try reducing this value to `0.7` or lower.

-#### `--backend pytorch`
-
-&emsp;**Description:** Tells TensorRT LLM to use the **pytorch** backend.
-
 #### `--max_batch_size`

 * **Description:** The maximum number of user requests that can be grouped into a single batch for processing.
@@ -191,7 +186,7 @@ curl -s -o /dev/null -w "Status: %{http_code}\n" "http://localhost:8000/health"

 When the `Status: 200` code is returned, the server is ready for queries. Note that the very first query may take longer due to initialization and compilation.

-After the TRT-LLM server is set up and shows Application startup complete, you can send requests to the server.
+After the TensorRT LLM server is set up and shows Application startup complete, you can send requests to the server.

 ```shell
 curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{
@@ -202,7 +197,7 @@ curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -
 }'
 ```

-Here is an example response, showing that the TRT-LLM server returns “New York is a state located in the northeastern United States. It is bordered by”, completing the input sequence.
+Here is an example response, showing that the TensorRT LLM server returns “New York is a state located in the northeastern United States. It is bordered by”, completing the input sequence.

 ```json
 {"id":"cmpl-bc1393d529ce485c961d9ffee5b25d72","object":"text_completion","created":1753843963,"model":"$MODEL","choices":[{"index":0,"text":" New York is a state located in the northeastern United States. It is bordered by","token_ids":null,"logprobs":null,"context_logits":null,"finish_reason":"length","stop_reason":null,"disaggregated_params":null}],"usage":{"prompt_tokens":6,"total_tokens":22,"completion_tokens":16},"prompt_token_ids":null}
