Commit dfa11d8

[TRTC-102][docs] --extra_llm_api_options->--config in docs/examples/tests (#10005)
1 parent 7b71ff6 commit dfa11d8

70 files changed (+625, -498 lines)


.gitignore

Lines changed: 1 addition & 1 deletion
@@ -56,7 +56,7 @@ tensorrt_llm/scripts
 docs/source/**/*.rst
 !docs/source/examples/index.rst
 !docs/source/deployment-guide/config_table.rst
-!docs/source/deployment-guide/note_sections.rst
+!docs/source/_includes/note_sections.rst
 *.swp
 
 # Testing

docs/source/deployment-guide/note_sections.rst renamed to docs/source/_includes/note_sections.rst

Lines changed: 11 additions & 2 deletions
@@ -1,11 +1,20 @@
 ..
-  Reusable note sections for deployment guides.
+  Reusable note sections for docs.
   Include specific notes using:
 
-  .. include:: note_sections.rst
+  .. include:: <path-to>/note_sections.rst
      :start-after: .. start-note-<name>
      :end-before: .. end-note-<name>
 
+.. start-note-config-flag-alias
+
+.. note::
+
+   **Non-breaking**: ``--config <file.yaml>`` is the preferred flag for passing a :ref:`YAML configuration file <configuring-with-yaml-files>`.
+   Existing workflows using ``--extra_llm_api_options <file.yaml>`` continue to work; it is an equivalent alias.
+
+.. end-note-config-flag-alias
+
 .. start-note-traffic-patterns
 
 .. note::
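The alias behavior described by the new note boils down to the quick sketch below; the model name and file path are placeholders, and the serve/bench commands elsewhere in this commit show the flag in real invocations:

```bash
# Both invocations are equivalent per the note above; the legacy flag remains an alias.
trtllm-serve <model> --config ./config.yml
trtllm-serve <model> --extra_llm_api_options ./config.yml   # still accepted
```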

docs/source/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md

Lines changed: 12 additions & 12 deletions
@@ -139,7 +139,7 @@ To do the benchmark, run the following command:
 ```bash
 YOUR_DATA_PATH=<your dataset file following the format>
 
-cat >./extra-llm-api-config.yml<<EOF
+cat >./config.yml<<EOF
 moe_config:
   backend: TRTLLM
 speculative_config:
@@ -157,7 +157,7 @@ trtllm-bench --model nvidia/DeepSeek-R1-FP4 \
     --max_batch_size 1 \
     --tp 8 \
     --ep 2 \
-    --extra_llm_api_options ./extra-llm-api-config.yml
+    --config ./config.yml
 ```
 
 Explanation:
@@ -168,7 +168,7 @@ Explanation:
 - `--max_batch_size`: Max batch size in each rank.
 - `--tp`: Tensor parallel size.
 - `--ep`: Expert parallel size.
-- `--extra_llm_api_options`: Used to specify some extra config. The content of the file is as follows:
+- `--config`: Used to specify extra YAML configuration. The content of the file is as follows:
 
 #### Expected Results
 The perf can be different when using different datasets and different machines.
@@ -195,7 +195,7 @@ We are seeing meaningful speedup using FP8 KV cache, thus refreshing the numbers
 
 #### Benchmark
 ```bash
-cat >./extra-llm-api-config.yml <<EOF
+cat >./config.yml <<EOF
 cuda_graph_config:
   enable_padding: true
   batch_sizes:
@@ -218,7 +218,7 @@ trtllm-bench --model nvidia/DeepSeek-R1-0528-FP4
 throughput
 --dataset ${YOUR_DATA_PATH}
 --tp 8 --ep 8
---extra_llm_api_options ./extra-llm-api-config.yml
+--config ./config.yml
 --max_batch_size 896
 --max_num_tokens 2048
 --kv_cache_free_gpu_mem_fraction 0.93
@@ -261,7 +261,7 @@ trtllm-bench --model nvidia/DeepSeek-R1-FP4 \
 
 YOUR_DATA_PATH=./dataset.txt
 
-cat >./extra-llm-api-config.yml <<EOF
+cat >./config.yml <<EOF
 cuda_graph_config:
   enable_padding: true
   batch_sizes:
@@ -290,7 +290,7 @@ trtllm-bench -m nvidia/DeepSeek-R1-FP4 \
     --num_requests 49152 \
     --concurrency 3072 \
     --kv_cache_free_gpu_mem_fraction 0.85 \
-    --extra_llm_api_options ./extra-llm-api-config.yml
+    --config ./config.yml
 ```
 
 #### Expected Result Format
@@ -315,7 +315,7 @@ To do the benchmark, run the following command:
 ```bash
 YOUR_DATA_PATH=<your dataset file following the format>
 
-cat >./extra-llm-api-config.yml<<EOF
+cat >./config.yml<<EOF
 speculative_config:
   decoding_type: MTP
   num_nextn_predict_layers: 3
@@ -329,7 +329,7 @@ trtllm-bench --model deepseek-ai/DeepSeek-R1 \
     --tp 8 \
     --ep 4 \
     --concurrency 1 \
-    --extra_llm_api_options ./extra-llm-api-config.yml
+    --config ./config.yml
 ```
 
 #### Expected Result Format
@@ -363,7 +363,7 @@ trtllm-bench --model nvidia/DeepSeek-R1-FP4 \
 
 YOUR_DATA_PATH=./dataset.txt
 
-cat >./extra-llm-api-config.yml<<EOF
+cat >./config.yml<<EOF
 cuda_graph_config:
   batch_sizes:
   - 128
@@ -384,7 +384,7 @@ trtllm-bench -m deepseek-ai/DeepSeek-R1 \
     --num_requests 5120 \
    --concurrency 1024 \
     --kv_cache_free_gpu_mem_fraction 0.8 \
-    --extra_llm_api_options ./extra-llm-api-config.yml
+    --config ./config.yml
 ```
 
 #### Expected Result Format
@@ -408,7 +408,7 @@ Average request latency (ms): 181540.5739
 To benchmark TensorRT LLM on DeepSeek models with more ISL/OSL combinations, you can use the `trtllm-bench prepare-dataset` subcommand to generate the dataset and use similar commands mentioned in the previous section. TensorRT LLM is working on enhancements that can make the benchmark process smoother.
 ### WIP: Enable more features by default
 
-Currently, there are some features that need to be enabled through a user-defined file `extra-llm-api-config.yml`, such as attention dp. We're working on to enable those features by default, so that users can get good out-of-the-box performance on DeepSeek models.
+Currently, there are some features that need to be enabled through a user-defined file `config.yml`, such as attention dp. We're working on to enable those features by default, so that users can get good out-of-the-box performance on DeepSeek models.
 
 Note that, `max_batch_size` and `max_num_tokens` can easily affect the performance. The default values for them are already carefully designed and should deliver good performance on overall cases, however, you may still need to tune it for peak performance.
 
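Taken together, the hunks above imply a workflow like the following sketch; the YAML keys and flags are the ones visible in this diff, while the concrete values are illustrative only:

```bash
# Illustrative sketch assembled from options visible in the hunks above.
# Values (parallelism, speculative depth) are examples, not recommendations.
cat > ./config.yml <<EOF
cuda_graph_config:
  enable_padding: true
speculative_config:
  decoding_type: MTP
  num_nextn_predict_layers: 3
EOF

trtllm-bench --model deepseek-ai/DeepSeek-R1 \
    throughput \
    --dataset ${YOUR_DATA_PATH} \
    --tp 8 \
    --ep 4 \
    --config ./config.yml
```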

docs/source/blogs/tech_blog/blog11_GPT_OSS_Eagle3.md

Lines changed: 1 addition & 1 deletion
@@ -105,7 +105,7 @@ Notes:
 Run the following command inside the container to start the endpoint:
 
 ```bash
-TRTLLM_ENABLE_PDL=1 trtllm-serve /config/models/gpt-oss-120b --host 0.0.0.0 --port 8000 --max_batch_size 10 --tp_size 8 --ep_size 4 --trust_remote_code --extra_llm_api_options /config/models/eagle/eagle.yaml --max_num_tokens 131072 --max_seq_len 131072
+TRTLLM_ENABLE_PDL=1 trtllm-serve /config/models/gpt-oss-120b --host 0.0.0.0 --port 8000 --max_batch_size 10 --tp_size 8 --ep_size 4 --trust_remote_code --config /config/models/eagle/eagle.yaml --max_num_tokens 131072 --max_seq_len 131072
 ```
 
 The server initializes, loads, and optimizes the models. After it is ready, it listens on port 8000.
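Once the endpoint reports ready, a quick smoke test could look like the sketch below; it assumes the standard OpenAI-compatible route exposed by `trtllm-serve`, and the model name and prompt are placeholders:

```bash
# Assumed smoke test against the OpenAI-compatible endpoint on port 8000.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-oss-120b", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 32}'
```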

docs/source/blogs/tech_blog/blog2_DeepSeek_R1_MTP_Implementation_and_Optimization.md

Lines changed: 4 additions & 4 deletions
@@ -122,7 +122,7 @@ To benchmark min-latency performance with MTP, you need to follow [this document
 ```bash
 YOUR_DATA_PATH=<your dataset file following the format>
 
-cat >./extra-llm-api-config.yml<<EOF
+cat >./config.yml<<EOF
 cuda_graph_config: {}
 moe_config:
   backend: TRTLLM
@@ -142,7 +142,7 @@ trtllm-bench --model nvidia/DeepSeek-R1-FP4 \
     --max_batch_size 1 \
     --tp 8 \
     --ep 2 \
-    --extra_llm_api_options ./extra-llm-api-config.yml
+    --config ./config.yml
 ```
 
 ## MTP optimization - Relaxed Acceptance
@@ -178,7 +178,7 @@ To benchmark min-latency performance with MTP Relaxed Acceptance, you need to fo
 ```bash
 YOUR_DATA_PATH=<your dataset file following the format>
 
-cat >./extra-llm-api-config.yml<<EOF
+cat >./config.yml<<EOF
 cuda_graph_config: {}
 moe_config:
   backend: TRTLLM
@@ -201,7 +201,7 @@ trtllm-bench --model nvidia/DeepSeek-R1-FP4 \
     --max_batch_size 1 \
     --tp 8 \
    --ep 2 \
-    --extra_llm_api_options ./extra-llm-api-config.yml
+    --config ./config.yml
 ```
 
 ## Evaluation
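For orientation, the part of the MTP configuration visible in these hunks re-assembles into a single file as sketched below; the Relaxed Acceptance keys are cut off by the hunk context and are deliberately not guessed here:

```bash
# Re-assembled from the hunks above: the shared config for both MTP benchmark runs.
cat > ./config.yml <<EOF
cuda_graph_config: {}
moe_config:
  backend: TRTLLM
EOF
# Pass it to the benchmark commands shown above via: --config ./config.yml
```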

docs/source/blogs/tech_blog/blog4_Scaling_Expert_Parallelism_in_TensorRT-LLM.md

Lines changed: 4 additions & 4 deletions
@@ -541,7 +541,7 @@ Prepare a dataset following the [benchmarking documentation](https://github.com/
 Run 32-way expert parallelism inference on the prepared dataset. Please refer to the [LLM API MGMN example](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/llm-api/llm_mgmn_trtllm_bench.sh) for details on running `trtllm-bench` on Slurm.
 
 ```bash
-cat > ./extra_llm_api_options.yaml <<EOF
+cat > ./config.yaml <<EOF
 enable_attention_dp: true
 EOF
 
@@ -551,7 +551,7 @@ trtllm-bench --model ${MODEL_NAME} \
     throughput \
     --tp 32 \
     --ep 32 \
-    --extra_llm_api_options ./extra_llm_api_options.yaml \
+    --config ./config.yaml \
     --kv_cache_free_gpu_mem_fraction 0.75 \
     --backend pytorch \
     --dataset ./dataset.json \
@@ -621,7 +621,7 @@ export EXPERT_STATISTIC_ITER_RANGE=100-200
 Run 36-way expert parallelism inference with the EPLB configuration incorporated:
 
 ```bash
-cat > ./extra_llm_api_options_eplb.yaml <<EOF
+cat > ./config_eplb.yaml <<EOF
 enable_attention_dp: true
 moe_config:
   load_balancer: ./moe_load_balancer.yaml
@@ -633,7 +633,7 @@ trtllm-bench --model ${MODEL_NAME} \
     throughput \
     --tp 36 \
     --ep 36 \
-    --extra_llm_api_options ./extra_llm_api_options_eplb.yaml \
+    --config ./config_eplb.yaml \
     --kv_cache_free_gpu_mem_fraction 0.75 \
     --backend pytorch \
     --dataset ./dataset.json \
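Condensed, the EPLB variant touched above amounts to the sketch below; everything shown is taken from the hunks, and the load-balancer YAML itself is left out because its contents are not part of this diff:

```bash
# Sketch of the 36-way EP + EPLB run re-assembled from the hunks above.
# ./moe_load_balancer.yaml must exist; its contents are not shown in this diff.
cat > ./config_eplb.yaml <<EOF
enable_attention_dp: true
moe_config:
  load_balancer: ./moe_load_balancer.yaml
EOF

trtllm-bench --model ${MODEL_NAME} \
    throughput \
    --tp 36 \
    --ep 36 \
    --config ./config_eplb.yaml \
    --kv_cache_free_gpu_mem_fraction 0.75 \
    --backend pytorch \
    --dataset ./dataset.json
```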

docs/source/blogs/tech_blog/blog6_Llama4_maverick_eagle_guide.md

Lines changed: 1 addition & 1 deletion
@@ -73,7 +73,7 @@ docker run -d --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
 trtllm-serve /config/models/maverick \
 --host 0.0.0.0 --port 8000 \
 --tp_size 8 --ep_size 1 \
---trust_remote_code --extra_llm_api_options c.yaml \
+--trust_remote_code --config c.yaml \
 --kv_cache_free_gpu_memory_fraction 0.75"
 ```

docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md

Lines changed: 7 additions & 7 deletions
@@ -86,7 +86,7 @@ trtllm-bench \
     --backend pytorch \
     --tp ${num_gpus} \
     --ep 1 \
-    --extra_llm_api_options low_latency.yaml \
+    --config low_latency.yaml \
     --dataset gpt-oss-120b-1k2k.txt \
     --max_batch_size ${max_batch_size} \
     --concurrency ${max_batch_size} \
@@ -149,7 +149,7 @@ trtllm-bench \
     --backend pytorch \
     --tp ${num_gpus} \
     --ep ${num_gpus} \
-    --extra_llm_api_options max_throughput.yaml \
+    --config max_throughput.yaml \
     --dataset gpt-oss-120b-1k2k.txt \
     --max_batch_size ${max_batch_size} \
     --concurrency $((max_batch_size * num_gpus)) \
@@ -171,7 +171,7 @@ Currently, the best throughput **19.5k tps/gpu** is achieved with DP4EP4 using 4
 
 ## Launch the TensorRT-LLM Server
 
-We can use `trtllm-serve` to serve the model by translating the benchmark commands above. For low-latency configuration, run:
+We can use `trtllm-serve` to serve the model by translating the benchmark commands above. For low-latency configuration, run:
 **Note:** You can also point to a local path containing the model weights instead of the HF repo (e.g., `${local_model_path}`).
 
 ```bash
@@ -184,7 +184,7 @@ trtllm-serve openai/gpt-oss-120b \
     --ep_size 8 \
     --max_batch_size 640 \
     --trust_remote_code \
-    --extra_llm_api_options max_throughput.yaml \
+    --config max_throughput.yaml \
     --kv_cache_free_gpu_memory_fraction 0.9
 ```
 </details>
@@ -201,7 +201,7 @@ trtllm-serve \
     --ep_size 4 \
     --max_batch_size 640 \
     --trust_remote_code \
-    --extra_llm_api_options max_throughput.yaml \
+    --config max_throughput.yaml \
     --kv_cache_free_gpu_memory_fraction 0.9
 ```
 </details>
@@ -223,7 +223,7 @@ OpenAI ships a set of Triton kernels optimized for its MoE models. TensorRT LLM
 
 ### Selecting Triton as the MoE backend
 
-To use the Triton MoE backend with **trtllm-serve** (or other similar commands) add this snippet to the YAML file passed via `--extra_llm_api_options`:
+To use the Triton MoE backend with **trtllm-serve** (or other similar commands) add this snippet to the YAML file passed via `--config`:
 
 ```yaml
 moe_config:
@@ -347,7 +347,7 @@ OpenAI ships a set of Triton kernels optimized for its MoE models. TensorRT-LLM
 
 ### Selecting Triton as the MoE backend
 
-To use the Triton MoE backend with **trtllm-serve** (or other commands), add this snippet to the YAML file passed via `--extra_llm_api_options`:
+To use the Triton MoE backend with **trtllm-serve** (or other commands), add this snippet to the YAML file passed via `--config`:
 
 ```yaml
 moe_config:
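The YAML snippet at the end of both hunks is cut off by the diff context, so only the `moe_config:` key is visible. A complete file would look roughly like the sketch below; the backend value is inferred from the section heading, not shown in this diff:

```bash
# Assumption: "TRITON" completes the truncated snippet above; it is inferred, not shown here.
cat > max_throughput.yaml <<EOF
moe_config:
  backend: TRITON
EOF
# Then pass it to trtllm-serve with: --config max_throughput.yaml
```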

docs/source/commands/trtllm-bench.rst

Lines changed: 13 additions & 7 deletions
@@ -3,9 +3,12 @@ trtllm-bench
 
 trtllm-bench is a comprehensive benchmarking tool for TensorRT LLM engines. It provides three main subcommands for different benchmarking scenarios:
 
-**Common Options for All Commands:**
+.. include:: ../_includes/note_sections.rst
+   :start-after: .. start-note-config-flag-alias
+   :end-before: .. end-note-config-flag-alias
 
-**Usage:**
+Syntax
+------
 
 .. click:: tensorrt_llm.commands.bench:main
    :prog: trtllm-bench
@@ -14,8 +17,11 @@ trtllm-bench is a comprehensive benchmarking tool for TensorRT LLM engines. It p
 
 
 
+Dataset preparation
+-------------------
+
 prepare_dataset.py
-===========================
+^^^^^^^^^^^^^^^^^^
 
 trtllm-bench is designed to work with the `prepare_dataset.py <https://github.com/NVIDIA/TensorRT-LLM/blob/main/benchmarks/cpp/prepare_dataset.py>`_ script, which generates benchmark datasets in the required format. The prepare_dataset script supports:
 
@@ -38,7 +44,7 @@ trtllm-bench is designed to work with the `prepare_dataset.py <https://github.co
 **Usage:**
 
 prepare_dataset
--------------------
+"""""""""""""""
 
 .. code-block:: bash
 
@@ -72,7 +78,7 @@ prepare_dataset
 - Logging level: info or debug (default: info)
 
 dataset
--------------------
+"""""""
 
 Process real datasets from various sources.
 
@@ -103,7 +109,7 @@ Process real datasets from various sources.
 
 
 token_norm_dist
--------------------
+"""""""""""""""
 
 Generate synthetic datasets with normal token distribution.
 
@@ -134,7 +140,7 @@ Generate synthetic datasets with normal token distribution.
 
 
 token_unif_dist
--------------------
+"""""""""""""""
 
 Generate synthetic datasets with uniform token distribution
 
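Since these hunks only restructure headings and add the alias note, the documented flow they describe is sketched below: generate a dataset with `prepare_dataset.py`, then benchmark it, passing tuning options through the (now aliased) `--config` flag. The prepare_dataset options are not shown in this diff and are left as a placeholder:

```bash
# Sketch only: prepare_dataset.py flags are not shown in this diff and are left
# as "..."; see the script's --help for the real options. Assumes the prepared
# dataset was written to ./dataset.txt.
python benchmarks/cpp/prepare_dataset.py ...

trtllm-bench --model <model-name-or-path> \
    throughput \
    --dataset ./dataset.txt \
    --config ./config.yml
```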

docs/source/commands/trtllm-eval.rst

Lines changed: 4 additions & 0 deletions
@@ -79,6 +79,10 @@ Alternatively, the ``--model`` argument also accepts a local path to pre-built T
 
 For more details, see ``trtllm-eval --help`` and ``trtllm-eval <task> --help``.
 
+.. include:: ../_includes/note_sections.rst
+   :start-after: .. start-note-config-flag-alias
+   :end-before: .. end-note-config-flag-alias
+
 
 
 Syntax
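With the alias note now included on the trtllm-eval page as well, an invocation using the preferred flag would look like the sketch below; the task name and paths are placeholders, and only the flag spelling comes from this commit:

```bash
# Placeholder task and paths; --config is the preferred spelling,
# --extra_llm_api_options remains accepted as an alias.
trtllm-eval --model <model-or-checkpoint-path> --config ./config.yml <task>
```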
