docs/source/commands/trtllm-serve/run-benchmark-with-trtllm-serve.md (1 addition, 1 deletion)
@@ -5,7 +5,7 @@ A complete reference for the API is available in the [OpenAI API Reference](http
  This step-by-step tutorial covers the following topics for running online serving benchmarking with Llama 3.1 70B and Qwen2.5-VL-7B for multimodal models:
  * Methodology Introduction
- * Launch the OpenAI-Compatibale Server with NGC container
+ * Launch the OpenAI-Compatible Server with NGC container
docs/source/deployment-guide/deployment-guide-for-deepseek-r1-on-trtllm.md (58 additions, 64 deletions)
@@ -1,4 +1,4 @@
- # Quick Start Recipe for DeepSeek R1 on TensorRT LLM - Blackwell & Hopper Hardware
+ # Deployment Guide for DeepSeek R1 on TensorRT LLM - Blackwell & Hopper Hardware

  ## Introduction
@@ -47,7 +47,7 @@ docker run --rm -it \
    -p 8000:8000 \
    -v ~/.cache:/root/.cache:rw \
    --name tensorrt_llm \
-   nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc6 \
+   nvcr.io/nvidia/tensorrt-llm/release:x.y.z \
    /bin/bash
  ```
@@ -60,108 +60,102 @@ Note:
  If you want to use latest main branch, you can choose to build from source to install TensorRT LLM, the steps refer to [https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html](https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html)

- ### Creating the TensorRT LLM Server config
+ ### Recommended Performance Settings

- We create a YAML configuration file /tmp/config.yml for the TensorRT LLM Server and populate it with the following recommended performance settings.
+ We maintain YAML configuration files with recommended performance settings in the [`examples/configs`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/configs) directory. These config files are present in the TensorRT LLM container at the path `/app/tensorrt_llm/examples/configs`. You can use these out-of-the-box, or adjust them to your specific use case.

  ```shell
- EXTRA_LLM_API_FILE=/tmp/config.yml
-
- cat <<EOF > ${EXTRA_LLM_API_FILE}
- enable_attention_dp: true
- cuda_graph_config:
-   enable_padding: true
-   max_batch_size: 128
- kv_cache_config:
-   dtype: fp8
- stream_interval: 10
- speculative_config:
-   decoding_type: MTP
-   num_nextn_predict_layers: 1
- EOF
+ TRTLLM_DIR=/app/tensorrt_llm # change as needed to match your environment
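The shipped files under `${TRTLLM_DIR}/examples/configs` are plain LLM API YAML of the same shape as the inline example removed here. For orientation, a sketch of such a config using the settings from the deleted heredoc above (the actual files in `examples/configs` may differ):

```yaml
# Sketch of an extra_llm_api_options YAML for DeepSeek R1,
# mirroring the inline settings deleted above; not the shipped file.
enable_attention_dp: true
cuda_graph_config:
  enable_padding: true
  max_batch_size: 128
kv_cache_config:
  dtype: fp8                    # FP8 KV cache
stream_interval: 10
speculative_config:
  decoding_type: MTP            # Multi-Token Prediction speculative decoding
  num_nextn_predict_layers: 1
```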
- Below is an example command to launch the TensorRT LLM server with the DeepSeek-R1 model from within the container. The command is specifically configured for the 1024/1024 Input/Output Sequence Length test. The explanation of each flag is shown in the “Configs and Parameters” section.
+ Below is an example command to launch the TensorRT LLM server with the DeepSeek-R1 model from within the container. The command is specifically configured for the 1024/1024 Input/Output Sequence Length test. The explanation of each flag is shown in the “LLM API Options (YAML Configuration)” section.

  After the server is set up, the client can now send prompt requests to the server and receive results.
- ### Configs and Parameters
+ ### LLM API Options (YAML Configuration)
+
+ <!-- TODO: this section is duplicated across the deployment guides; they should be consolidated to a central file and imported as needed, or we can remove this and link to LLM API reference -->
+
+ These options provide control over TensorRT LLM's behavior and are set within the YAML file passed to the `trtllm-serve` command via the `--extra_llm_api_options` argument.

- These options are used directly on the command line when you start the `trtllm-serve` process.

- #### `--tp_size`
+ #### `tensor_parallel_size`

  ***Description:** Sets the **tensor-parallel size**. This should typically match the number of GPUs you intend to use for a single model instance.

- #### `--ep_size`
+ #### `moe_expert_parallel_size`

- ***Description:** Sets the **expert-parallel size** for Mixture-of-Experts (MoE) models. Like `tp_size`, this should generally match the number of GPUs you're using. This setting has no effect on non-MoE models.
+ ***Description:** Sets the **expert-parallel size** for Mixture-of-Experts (MoE) models. Like `tensor_parallel_size`, this should generally match the number of GPUs you're using. This setting has no effect on non-MoE models.

- #### `--kv_cache_free_gpu_memory_fraction`
+ #### `kv_cache_free_gpu_memory_fraction`

  ***Description:** A value between `0.0` and `1.0` that specifies the fraction of free GPU memory to reserve for the KV cache after the model is loaded. Since memory usage can fluctuate, this buffer helps prevent out-of-memory (OOM) errors.
  ***Recommendation:** If you experience OOM errors, try reducing this value to `0.7` or lower.

-
- #### `--max_batch_size`
+ #### `max_batch_size`

  ***Description:** The maximum number of user requests that can be grouped into a single batch for processing.

- #### `--max_num_tokens`
+ #### `max_num_tokens`

  ***Description:** The maximum total number of tokens (across all requests) allowed inside a single scheduled batch.

- #### `--max_seq_len`
+ #### `max_seq_len`

  ***Description:** The maximum possible sequence length for a single request, including both input and generated output tokens.

- #### `--trust_remote_code`
+ #### `trust_remote_code`

  **Description:** Allows TensorRT LLM to download models and tokenizers from Hugging Face. This flag is passed directly to the Hugging Face API.

-
- #### Extra LLM API Options (YAML Configuration)
-
- These options provide finer control over performance and are set within a YAML file passed to the `trtllm-serve` command via the `--extra_llm_api_options` argument.
-
  #### `kv_cache_config`

  ***Description**: A section for configuring the Key-Value (KV) cache.
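Since the flags that used to be passed on the command line (`--tp_size`, `--ep_size`, and so on) now map to the YAML keys documented above, a single `--extra_llm_api_options` file can carry all of them. A minimal illustrative sketch follows; the values are placeholders rather than tuned recommendations, the keys are taken from the headings above, and exact nesting should be checked against the LLM API reference:

```yaml
# Illustrative extra_llm_api_options YAML; placeholder values, adjust for your GPUs and workload.
tensor_parallel_size: 8                  # one model instance spread across 8 GPUs (assumed)
moe_expert_parallel_size: 8              # match GPU count for MoE models; ignored for dense models
kv_cache_free_gpu_memory_fraction: 0.9   # drop toward 0.7 if you hit OOM
max_batch_size: 128
max_num_tokens: 8192                     # placeholder token budget per scheduled batch
max_seq_len: 2048                        # input + output tokens per request
trust_remote_code: true
kv_cache_config:
  dtype: fp8
```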
docs/source/deployment-guide/deployment-guide-for-gpt-oss-on-trtllm.md (59 additions, 61 deletions)
@@ -1,4 +1,4 @@
- # Quick Start Recipe for GPT-OSS on TensorRT-LLM - Blackwell Hardware
+ # Deployment Guide for GPT-OSS on TensorRT-LLM - Blackwell Hardware

  ## Introduction
@@ -43,7 +43,7 @@ docker run --rm -it \
    -p 8000:8000 \
    -v ~/.cache:/root/.cache:rw \
    --name tensorrt_llm \
-   nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc6 \
+   nvcr.io/nvidia/tensorrt-llm/release:x.y.z \
    /bin/bash
  ```
@@ -56,105 +56,103 @@ Note:
  If you want to use latest main branch, you can choose to build from source to install TensorRT-LLM, the steps refer to <https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html>.

- ### Creating the TensorRT LLM Server config
+ ### Recommended Performance Settings

- We create a YAML configuration file `/tmp/config.yml` for the TensorRT-LLM Server and populate it with the following recommended performance settings.
+ We maintain YAML configuration files with recommended performance settings in the [`examples/configs`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/configs) directory. These config files are present in the TensorRTLLM container at the path `/app/tensorrt_llm/examples/configs`. You can use these out-of-the-box, or adjust them to your specific use case.

- For low-latency with `TRTLLM` MOE backend:
+ For low-latency use cases:

  ```shell
- EXTRA_LLM_API_FILE=/tmp/config.yml
-
- cat <<EOF > ${EXTRA_LLM_API_FILE}
- enable_attention_dp: false
- cuda_graph_config:
-   enable_padding: true
-   max_batch_size: 720
- moe_config:
-   backend: TRTLLM
- stream_interval: 20
- num_postprocess_workers: 4
- EOF
+ TRTLLM_DIR=/app/tensorrt_llm # change as needed to match your environment

- Below is an example command to launch the TensorRT LLM server with the GPT-OSS model from within the container. The command is specifically configured for the 1024/1024 Input/Output Sequence Length test. The explanation of each flag is shown in the “Configs and Parameters” section.
+ Below is an example command to launch the TensorRT LLM server with the GPT-OSS model from within the container. The command is specifically configured for the 1024/1024 Input/Output Sequence Length test. The explanation of each flag is shown in the “LLM API Options (YAML Configuration)” section.

  After the server is set up, the client can now send prompt requests to the server and receive results.

- ### Configs and Parameters
+ ### LLM API Options (YAML Configuration)

- These options are used directly on the command line when you start the `trtllm-serve` process.
+ <!-- TODO: this section is duplicated across the deployment guides; they should be consolidated to a central file and imported as needed, or we can remove this and link to LLM API reference -->

- #### `--tp_size`
+ These options provide control over TensorRT LLM's behavior and are set within the YAML file passed to the `trtllm-serve` command via the `--extra_llm_api_options` argument.
+
+ #### `tensor_parallel_size`

  ***Description:** Sets the **tensor-parallel size**. This should typically match the number of GPUs you intend to use for a single model instance.

- #### `--ep_size`
+ #### `moe_expert_parallel_size`

- ***Description:** Sets the **expert-parallel size** for Mixture-of-Experts (MoE) models. Like `tp_size`, this should generally match the number of GPUs you're using. This setting has no effect on non-MoE models.
+ ***Description:** Sets the **expert-parallel size** for Mixture-of-Experts (MoE) models. Like `tensor_parallel_size`, this should generally match the number of GPUs you're using. This setting has no effect on non-MoE models.

- #### `--kv_cache_free_gpu_memory_fraction`
+ #### `kv_cache_free_gpu_memory_fraction`

  ***Description:** A value between `0.0` and `1.0` that specifies the fraction of free GPU memory to reserve for the KV cache after the model is loaded. Since memory usage can fluctuate, this buffer helps prevent out-of-memory (OOM) errors.
  ***Recommendation:** If you experience OOM errors, try reducing this value to `0.7` or lower.

- #### `--max_batch_size`
+ #### `max_batch_size`

  ***Description:** The maximum number of user requests that can be grouped into a single batch for processing. The actual max batch size that can be achieved depends on total sequence length (input + output).

- #### `--max_num_tokens`
+ #### `max_num_tokens`

  ***Description:** The maximum total number of tokens (across all requests) allowed inside a single scheduled batch.

- #### `--max_seq_len`
+ #### `max_seq_len`

- ***Description:** The maximum possible sequence length for a single request, including both input and generated output tokens. We won't specifically set it. It will be inferred from model config.
+ ***Description:** The maximum possible sequence length for a single request, including both input and generated output tokens. If not set, it will be inferred from model config.

- #### `--trust_remote_code`
+ #### `trust_remote_code`

  ***Description:** Allows TensorRT-LLM to download models and tokenizers from Hugging Face. This flag is passed directly to the Hugging Face API.

-
- #### Extra LLM API Options (YAML Configuration)
-
- These options provide finer control over performance and are set within a YAML file passed to the `trtllm-serve` command via the `--extra_llm_api_options` argument.
-
  #### `cuda_graph_config`

  ***Description**: A section for configuring CUDA graphs to optimize performance.
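The `cuda_graph_config` section above, together with the other low-latency settings that the deleted inline example carried, can be expressed as a standalone YAML file. A sketch based on the heredoc removed earlier in this diff (the shipped `examples/configs` files may differ):

```yaml
# Low-latency GPT-OSS settings, mirroring the inline example deleted above; not the shipped file.
enable_attention_dp: false
cuda_graph_config:
  enable_padding: true       # pad batches so captured CUDA graphs can be reused
  max_batch_size: 720
moe_config:
  backend: TRTLLM            # TRTLLM MoE kernel backend
stream_interval: 20
num_postprocess_workers: 4
```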
0 commit comments