`docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md` (6 additions, 10 deletions)
@@ -22,7 +22,7 @@ The guide is intended for developers and practitioners seeking high-throughput o
 
 ## MoE Backend Support Matrix
 
-There are multiple MOE backends inside TRT-LLM, not all of them supporting every precision on every GPUs. Here are the support matrix of the MOE backends.
+There are multiple MOE backends inside TensorRT LLM, not all of them supporting every precision on every GPUs. Here are the support matrix of the MOE backends.
 
 | device | Checkpoint | Supported moe_backend |
 |----------|----------|----------|
@@ -60,7 +60,7 @@ Note:
 
 If you want to use latest main branch, you can choose to build from source to install TensorRT LLM, the steps refer to [https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html](https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html)
 
-### Creating the TRT-LLM Server config
+### Creating the TensorRT LLM Server config
 
 We create a YAML configuration file /tmp/config.yml for the TensorRT LLM Server and populate it with the following recommended performance settings.
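The recommended settings themselves sit outside the captured hunks (the next hunk header shows a `moe_config:` section). As a rough sketch only, such a file is typically written with a heredoc; every key and value below other than `moe_config` is an assumption, not the guide's actual recommendation:

```shell
# Hypothetical sketch: the guide's real /tmp/config.yml holds the recommended
# performance settings, which this diff does not show.
cat <<EOF > /tmp/config.yml
cuda_graph_config:          # assumed key
  enable_padding: true
moe_config:                 # section name taken from the hunk header below
  backend: CUTLASS          # assumed value; see the MoE backend support matrix above
EOF
```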
@@ -103,15 +103,14 @@ moe_config:
 EOF
 ```
 
-### Launch the TRT-LLM Server
+### Launch the TensorRT LLM Server
 
-Below is an example command to launch the TRT-LLM server with the DeepSeek-R1 model from within the container. The command is specifically configured for the 1024/1024 Input/Output Sequence Length test. The explanation of each flag is shown in the “Configs and Parameters” section.
+Below is an example command to launch the TensorRT LLM server with the DeepSeek-R1 model from within the container. The command is specifically configured for the 1024/1024 Input/Output Sequence Length test. The explanation of each flag is shown in the “Configs and Parameters” section.
 
 ```shell
 trtllm-serve deepseek-ai/DeepSeek-R1-0528 \
   --host 0.0.0.0 \
   --port 8000 \
-  --backend pytorch \
   --max_batch_size 1024 \
   --max_num_tokens 3200 \
   --max_seq_len 2048 \
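The hunk cuts off mid-command. For orientation only, a sketch of how the full invocation can look once `--backend pytorch` is dropped; the final `--extra_llm_api_options` line (pointing at the config file created above) is an assumption, since it is not part of the captured hunk:

```shell
trtllm-serve deepseek-ai/DeepSeek-R1-0528 \
  --host 0.0.0.0 \
  --port 8000 \
  --max_batch_size 1024 \
  --max_num_tokens 3200 \
  --max_seq_len 2048 \
  --extra_llm_api_options /tmp/config.yml   # assumed continuation, not shown in the diff
```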
@@ -141,9 +140,6 @@ These options are used directly on the command line when you start the `trtllm-s
 ***Description:** A value between `0.0` and `1.0` that specifies the fraction of free GPU memory to reserve for the KV cache after the model is loaded. Since memory usage can fluctuate, this buffer helps prevent out-of-memory (OOM) errors.
 ***Recommendation:** If you experience OOM errors, try reducing this value to `0.7` or lower.
 
-#### `--backend pytorch`
-
- **Description:** Tells TensorRT LLM to use the **pytorch** backend.
 When the `Status: 200` code is returned, the server is ready for queries. Note that the very first query may take longer due to initialization and compilation.
 
-After the TRT-LLM server is set up and shows Application startup complete, you can send requests to the server.
+After the TensorRT LLM server is set up and shows Application startup complete, you can send requests to the server.

-Here is an example response, showing that the TRT-LLM server returns “New York is a state located in the northeastern United States. It is bordered by”, completing the input sequence.
+Here is an example response, showing that the TensorRT LLM server returns “New York is a state located in the northeastern United States. It is bordered by”, completing the input sequence.
 
 ```json
 {"id":"cmpl-e728f08114c042309efeae4df86a50ca","object":"text_completion","created":1754294810,"model":"deepseek-ai/DeepSeek-R1-0528","choices":[{"index":0,"text":" / by Megan Stine ; illustrated by John Hinderliter.\n\nBook | Gross","token_ids":null,"logprobs":null,"context_logits":null,"finish_reason":"length","stop_reason":null,"disaggregated_params":null}],"usage":{"prompt_tokens":6,"total_tokens":22,"completion_tokens":16},"prompt_token_ids":null}
 ```
(The hunks below belong to the next file in the diff; its header row was not captured in this view, but the content is the GPT-OSS quick-start recipe.)

 If you want to use latest main branch, you can choose to build from source to install TensorRT-LLM, the steps refer to <https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html>.
 
-### Creating the TRT-LLM Server config
+### Creating the TensorRT LLM Server config
 
 We create a YAML configuration file `/tmp/config.yml` for the TensorRT-LLM Server and populate it with the following recommended performance settings.
@@ -98,15 +98,14 @@ attention_dp_config:
 EOF
 ```
 
-### Launch the TRT-LLM Server
+### Launch the TensorRT LLM Server
 
-Below is an example command to launch the TRT-LLM server with the GPT-OSS model from within the container. The command is specifically configured for the 1024/1024 Input/Output Sequence Length test. The explanation of each flag is shown in the “Configs and Parameters” section.
+Below is an example command to launch the TensorRT LLM server with the GPT-OSS model from within the container. The command is specifically configured for the 1024/1024 Input/Output Sequence Length test. The explanation of each flag is shown in the “Configs and Parameters” section.
 
 ```shell
 trtllm-serve openai/gpt-oss-120b \
   --host 0.0.0.0 \
   --port 8000 \
-  --backend pytorch \
   --max_batch_size 720 \
   --max_num_tokens 16384 \
   --kv_cache_free_gpu_memory_fraction 0.9 \
@@ -135,10 +134,6 @@ These options are used directly on the command line when you start the `trtllm-s
 ***Description:** A value between `0.0` and `1.0` that specifies the fraction of free GPU memory to reserve for the KV cache after the model is loaded. Since memory usage can fluctuate, this buffer helps prevent out-of-memory (OOM) errors.
 ***Recommendation:** If you experience OOM errors, try reducing this value to `0.7` or lower.
 
-#### `--backend pytorch`
-
-***Description:** Tells TensorRT-LLM to use the **pytorch** backend.
-
 #### `--max_batch_size`
 
 ***Description:** The maximum number of user requests that can be grouped into a single batch for processing. The actual max batch size that can be achieved depends on total sequence length (input + output).
 When the `Status: 200` code is returned, the server is ready for queries. Note that the very first query may take longer due to initialization and compilation.
 
-After the TRT-LLM server is set up and shows Application startup complete, you can send requests to the server.
+After the TensorRT LLM server is set up and shows Application startup complete, you can send requests to the server.
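The `Status: 200` check referenced above sits outside the captured hunks; a minimal sketch of such a readiness probe, assuming the `/health` endpoint exposed by `trtllm-serve`:

```shell
# Prints "Status: 200" once the server is ready; the endpoint path is an assumption.
curl -s -o /dev/null -w "Status: %{http_code}\n" http://localhost:8000/health
```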
`docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md` (5 additions, 10 deletions)
@@ -52,7 +52,7 @@ Note:
 
 If you want to use latest main branch, you can choose to build from source to install TensorRT LLM, the steps refer to [https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html](https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html)
 
-### Creating the TRT-LLM Server config
+### Creating the TensorRT LLM Server config
 
 We create a YAML configuration file /tmp/config.yml for the TensorRT LLM Server and populate it with the following recommended performance settings.
@@ -69,15 +69,14 @@ kv_cache_config:
 EOF
 ```
 
-### Launch the TRT-LLM Server
+### Launch the TensorRT LLM Server
 
-Below is an example command to launch the TRT-LLM server with the Llama-3.3-70B-Instruct-FP8 model from within the container. The command is specifically configured for the 1024/1024 Input/Output Sequence Length test. The explanation of each flag is shown in the “Configs and Parameters” section.
+Below is an example command to launch the TensorRT LLM server with the Llama-3.3-70B-Instruct-FP8 model from within the container. The command is specifically configured for the 1024/1024 Input/Output Sequence Length test. The explanation of each flag is shown in the “Configs and Parameters” section.
 
 ```shell
 trtllm-serve nvidia/Llama-3.3-70B-Instruct-FP8 \
   --host 0.0.0.0 \
   --port 8000 \
-  --backend pytorch \
   --max_batch_size 1024 \
   --max_num_tokens 2048 \
   --max_seq_len 2048 \
@@ -107,10 +106,6 @@ These options are used directly on the command line when you start the `trtllm-s
  **Recommendation:** If you experience OOM errors, try reducing this value to **0.8** or lower.
 
-#### `--backend pytorch`
-
- **Description:** Tells TensorRT LLM to use the **pytorch** backend.
-
 #### `--max_batch_size`
 
  **Description:** The maximum number of user requests that can be grouped into a single batch for processing.
 When the `Status: 200` code is returned, the server is ready for queries. Note that the very first query may take longer due to initialization and compilation.
 
-After the TRT-LLM server is set up and shows Application startup complete, you can send requests to the server.
+After the TensorRT LLM server is set up and shows Application startup complete, you can send requests to the server.

-Here is an example response, showing that the TRT-LLM server returns “New York is a state located in the northeastern United States. It is bordered by”, completing the input sequence.
+Here is an example response, showing that the TensorRT LLM server returns “New York is a state located in the northeastern United States. It is bordered by”, completing the input sequence.
 
 ```json
 {"id":"cmpl-bc1393d529ce485c961d9ffee5b25d72","object":"text_completion","created":1753843963,"model":"nvidia/Llama-3.3-70B-Instruct-FP8","choices":[{"index":0,"text":" New York is a state located in the northeastern United States. It is bordered by","token_ids":null,"logprobs":null,"context_logits":null,"finish_reason":"length","stop_reason":null,"disaggregated_params":null}],"usage":{"prompt_tokens":6,"total_tokens":22,"completion_tokens":16},"prompt_token_ids":null}
 ```
`docs/source/deployment-guide/quick-start-recipe-for-llama4-scout-on-trtllm.md` (5 additions, 10 deletions)
@@ -51,7 +51,7 @@ Note:
 
 If you want to use latest main branch, you can choose to build from source to install TensorRT LLM, the steps refer to [https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html](https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html)
 
-### Creating the TRT-LLM Server config
+### Creating the TensorRT LLM Server config
 
 We create a YAML configuration file /tmp/config.yml for the TensorRT LLM Server and populate it with the following recommended performance settings.
@@ -68,15 +68,14 @@ kv_cache_config:
 EOF
 ```
 
-### Launch the TRT-LLM Server
+### Launch the TensorRT LLM Server
 
-Below is an example command to launch the TRT-LLM server with the Llama-4-Scout-17B-16E-Instruct-FP8 model from within the container. The command is specifically configured for the 1024/1024 Input/Output Sequence Length test. The explanation of each flag is shown in the “Configs and Parameters” section.
+Below is an example command to launch the TensorRT LLM server with the Llama-4-Scout-17B-16E-Instruct-FP8 model from within the container. The command is specifically configured for the 1024/1024 Input/Output Sequence Length test. The explanation of each flag is shown in the “Configs and Parameters” section.
@@ -106,10 +105,6 @@ These options are used directly on the command line when you start the `trtllm-s
 ***Description:** A value between `0.0` and `1.0` that specifies the fraction of free GPU memory to reserve for the KV cache after the model is loaded. Since memory usage can fluctuate, this buffer helps prevent out-of-memory (OOM) errors.
 ***Recommendation:** If you experience OOM errors, try reducing this value to `0.7` or lower.
 
-#### `--backend pytorch`
-
- **Description:** Tells TensorRT LLM to use the **pytorch** backend.
-
 #### `--max_batch_size`
 
 ***Description:** The maximum number of user requests that can be grouped into a single batch for processing.
 When the `Status: 200` code is returned, the server is ready for queries. Note that the very first query may take longer due to initialization and compilation.
 
-After the TRT-LLM server is set up and shows Application startup complete, you can send requests to the server.
+After the TensorRT LLM server is set up and shows Application startup complete, you can send requests to the server.

-Here is an example response, showing that the TRT-LLM server returns “New York is a state located in the northeastern United States. It is bordered by”, completing the input sequence.
+Here is an example response, showing that the TensorRT LLM server returns “New York is a state located in the northeastern United States. It is bordered by”, completing the input sequence.
 
 ```json
 {"id":"cmpl-bc1393d529ce485c961d9ffee5b25d72","object":"text_completion","created":1753843963,"model":"$MODEL","choices":[{"index":0,"text":" New York is a state located in the northeastern United States. It is bordered by","token_ids":null,"logprobs":null,"context_logits":null,"finish_reason":"length","stop_reason":null,"disaggregated_params":null}],"usage":{"prompt_tokens":6,"total_tokens":22,"completion_tokens":16},"prompt_token_ids":null}
 ```