Commit dfa11d8

[TRTC-102][docs] --extra_llm_api_options->--config in docs/examples/tests (#10005)
1 parent 7b71ff6 commit dfa11d8

70 files changed (+625, -498 lines)


.gitignore

Lines changed: 1 addition & 1 deletion
@@ -56,7 +56,7 @@ tensorrt_llm/scripts
 docs/source/**/*.rst
 !docs/source/examples/index.rst
 !docs/source/deployment-guide/config_table.rst
-!docs/source/deployment-guide/note_sections.rst
+!docs/source/_includes/note_sections.rst
 *.swp
 
 # Testing

docs/source/deployment-guide/note_sections.rst renamed to docs/source/_includes/note_sections.rst

Lines changed: 11 additions & 2 deletions
@@ -1,11 +1,20 @@
 ..
-  Reusable note sections for deployment guides.
+  Reusable note sections for docs.
   Include specific notes using:
 
-  .. include:: note_sections.rst
+  .. include:: <path-to>/note_sections.rst
      :start-after: .. start-note-<name>
      :end-before: .. end-note-<name>
 
+.. start-note-config-flag-alias
+
+.. note::
+
+   **Non-breaking**: ``--config <file.yaml>`` is the preferred flag for passing a :ref:`YAML configuration file <configuring-with-yaml-files>`.
+   Existing workflows using ``--extra_llm_api_options <file.yaml>`` continue to work; it is an equivalent alias.
+
+.. end-note-config-flag-alias
+
 .. start-note-traffic-patterns
 
 .. note::
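The alias behavior described by the new note boils down to the quick sketch below; the model name and file path are placeholders, and the serve/bench commands elsewhere in this commit show the flag in real invocations:

```bash
# Both invocations are equivalent per the note above; the legacy flag remains an alias.
trtllm-serve <model> --config ./config.yml
trtllm-serve <model> --extra_llm_api_options ./config.yml   # still accepted
```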

docs/source/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md

Lines changed: 12 additions & 12 deletions
@@ -139,7 +139,7 @@ To do the benchmark, run the following command:
 ```bash
 YOUR_DATA_PATH=<your dataset file following the format>
 
-cat >./extra-llm-api-config.yml<<EOF
+cat >./config.yml<<EOF
 moe_config:
   backend: TRTLLM
 speculative_config:
@@ -157,7 +157,7 @@ trtllm-bench --model nvidia/DeepSeek-R1-FP4 \
     --max_batch_size 1 \
     --tp 8 \
     --ep 2 \
-    --extra_llm_api_options ./extra-llm-api-config.yml
+    --config ./config.yml
 ```
 
 Explanation:
@@ -168,7 +168,7 @@ Explanation:
 - `--max_batch_size`: Max batch size in each rank.
 - `--tp`: Tensor parallel size.
 - `--ep`: Expert parallel size.
-- `--extra_llm_api_options`: Used to specify some extra config. The content of the file is as follows:
+- `--config`: Used to specify extra YAML configuration. The content of the file is as follows:
 
 #### Expected Results
 The perf can be different when using different datasets and different machines.
@@ -195,7 +195,7 @@ We are seeing meaningful speedup using FP8 KV cache, thus refreshing the numbers
 
 #### Benchmark
 ```bash
-cat >./extra-llm-api-config.yml <<EOF
+cat >./config.yml <<EOF
 cuda_graph_config:
   enable_padding: true
   batch_sizes:
@@ -218,7 +218,7 @@ trtllm-bench --model nvidia/DeepSeek-R1-0528-FP4
 throughput
 --dataset ${YOUR_DATA_PATH}
 --tp 8 --ep 8
---extra_llm_api_options ./extra-llm-api-config.yml
+--config ./config.yml
 --max_batch_size 896
 --max_num_tokens 2048
 --kv_cache_free_gpu_mem_fraction 0.93
@@ -261,7 +261,7 @@ trtllm-bench --model nvidia/DeepSeek-R1-FP4 \
 
 YOUR_DATA_PATH=./dataset.txt
 
-cat >./extra-llm-api-config.yml <<EOF
+cat >./config.yml <<EOF
 cuda_graph_config:
   enable_padding: true
   batch_sizes:
@@ -290,7 +290,7 @@ trtllm-bench -m nvidia/DeepSeek-R1-FP4 \
     --num_requests 49152 \
     --concurrency 3072 \
     --kv_cache_free_gpu_mem_fraction 0.85 \
-    --extra_llm_api_options ./extra-llm-api-config.yml
+    --config ./config.yml
 ```
 
 #### Expected Result Format
@@ -315,7 +315,7 @@ To do the benchmark, run the following command:
 ```bash
 YOUR_DATA_PATH=<your dataset file following the format>
 
-cat >./extra-llm-api-config.yml<<EOF
+cat >./config.yml<<EOF
 speculative_config:
   decoding_type: MTP
   num_nextn_predict_layers: 3
@@ -329,7 +329,7 @@ trtllm-bench --model deepseek-ai/DeepSeek-R1 \
     --tp 8 \
     --ep 4 \
     --concurrency 1 \
-    --extra_llm_api_options ./extra-llm-api-config.yml
+    --config ./config.yml
 ```
 
 #### Expected Result Format
@@ -363,7 +363,7 @@ trtllm-bench --model nvidia/DeepSeek-R1-FP4 \
 
 YOUR_DATA_PATH=./dataset.txt
 
-cat >./extra-llm-api-config.yml<<EOF
+cat >./config.yml<<EOF
 cuda_graph_config:
   batch_sizes:
   - 128
@@ -384,7 +384,7 @@ trtllm-bench -m deepseek-ai/DeepSeek-R1 \
     --num_requests 5120 \
    --concurrency 1024 \
     --kv_cache_free_gpu_mem_fraction 0.8 \
-    --extra_llm_api_options ./extra-llm-api-config.yml
+    --config ./config.yml
 ```
 
 #### Expected Result Format
@@ -408,7 +408,7 @@ Average request latency (ms): 181540.5739
 To benchmark TensorRT LLM on DeepSeek models with more ISL/OSL combinations, you can use the `trtllm-bench prepare-dataset` subcommand to generate the dataset and use similar commands mentioned in the previous section. TensorRT LLM is working on enhancements that can make the benchmark process smoother.
 ### WIP: Enable more features by default
 
-Currently, there are some features that need to be enabled through a user-defined file `extra-llm-api-config.yml`, such as attention dp. We're working on to enable those features by default, so that users can get good out-of-the-box performance on DeepSeek models.
+Currently, there are some features that need to be enabled through a user-defined file `config.yml`, such as attention dp. We're working on to enable those features by default, so that users can get good out-of-the-box performance on DeepSeek models.
 
 Note that, `max_batch_size` and `max_num_tokens` can easily affect the performance. The default values for them are already carefully designed and should deliver good performance on overall cases, however, you may still need to tune it for peak performance.
 
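Taken together, the hunks above imply a workflow like the following sketch; the YAML keys and flags are the ones visible in this diff, while the concrete values are illustrative only:

```bash
# Illustrative sketch assembled from options visible in the hunks above.
# Values (parallelism, speculative depth) are examples, not recommendations.
cat > ./config.yml <<EOF
cuda_graph_config:
  enable_padding: true
speculative_config:
  decoding_type: MTP
  num_nextn_predict_layers: 3
EOF

trtllm-bench --model deepseek-ai/DeepSeek-R1 \
    throughput \
    --dataset ${YOUR_DATA_PATH} \
    --tp 8 \
    --ep 4 \
    --config ./config.yml
```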

docs/source/blogs/tech_blog/blog11_GPT_OSS_Eagle3.md

Lines changed: 1 addition & 1 deletion
@@ -105,7 +105,7 @@ Notes:
 Run the following command inside the container to start the endpoint:
 
 ```bash
-TRTLLM_ENABLE_PDL=1 trtllm-serve /config/models/gpt-oss-120b --host 0.0.0.0 --port 8000 --max_batch_size 10 --tp_size 8 --ep_size 4 --trust_remote_code --extra_llm_api_options /config/models/eagle/eagle.yaml --max_num_tokens 131072 --max_seq_len 131072
+TRTLLM_ENABLE_PDL=1 trtllm-serve /config/models/gpt-oss-120b --host 0.0.0.0 --port 8000 --max_batch_size 10 --tp_size 8 --ep_size 4 --trust_remote_code --config /config/models/eagle/eagle.yaml --max_num_tokens 131072 --max_seq_len 131072
 ```
 
 The server initializes, loads, and optimizes the models. After it is ready, it listens on port 8000.
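Once the endpoint reports ready, a quick smoke test could look like the sketch below; it assumes the standard OpenAI-compatible route exposed by `trtllm-serve`, and the model name and prompt are placeholders:

```bash
# Assumed smoke test against the OpenAI-compatible endpoint on port 8000.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-oss-120b", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 32}'
```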

docs/source/blogs/tech_blog/blog2_DeepSeek_R1_MTP_Implementation_and_Optimization.md

Lines changed: 4 additions & 4 deletions
@@ -122,7 +122,7 @@ To benchmark min-latency performance with MTP, you need to follow [this document
 ```bash
 YOUR_DATA_PATH=<your dataset file following the format>
 
-cat >./extra-llm-api-config.yml<<EOF
+cat >./config.yml<<EOF
 cuda_graph_config: {}
 moe_config:
   backend: TRTLLM
@@ -142,7 +142,7 @@ trtllm-bench --model nvidia/DeepSeek-R1-FP4 \
     --max_batch_size 1 \
     --tp 8 \
     --ep 2 \
-    --extra_llm_api_options ./extra-llm-api-config.yml
+    --config ./config.yml
 ```
 
 ## MTP optimization - Relaxed Acceptance
@@ -178,7 +178,7 @@ To benchmark min-latency performance with MTP Relaxed Acceptance, you need to fo
 ```bash
 YOUR_DATA_PATH=<your dataset file following the format>
 
-cat >./extra-llm-api-config.yml<<EOF
+cat >./config.yml<<EOF
 cuda_graph_config: {}
 moe_config:
   backend: TRTLLM
@@ -201,7 +201,7 @@ trtllm-bench --model nvidia/DeepSeek-R1-FP4 \
     --max_batch_size 1 \
     --tp 8 \
    --ep 2 \
-    --extra_llm_api_options ./extra-llm-api-config.yml
+    --config ./config.yml
 ```
 
 ## Evaluation
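For orientation, the part of the MTP configuration visible in these hunks re-assembles into a single file as sketched below; the Relaxed Acceptance keys are cut off by the hunk context and are deliberately not guessed here:

```bash
# Re-assembled from the hunks above: the shared config for both MTP benchmark runs.
cat > ./config.yml <<EOF
cuda_graph_config: {}
moe_config:
  backend: TRTLLM
EOF
# Pass it to the benchmark commands shown above via: --config ./config.yml
```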

docs/source/blogs/tech_blog/blog4_Scaling_Expert_Parallelism_in_TensorRT-LLM.md

Lines changed: 4 additions & 4 deletions
@@ -541,7 +541,7 @@ Prepare a dataset following the [benchmarking documentation](https://github.com/
 Run 32-way expert parallelism inference on the prepared dataset. Please refer to the [LLM API MGMN example](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/llm-api/llm_mgmn_trtllm_bench.sh) for details on running `trtllm-bench` on Slurm.
 
 ```bash
-cat > ./extra_llm_api_options.yaml <<EOF
+cat > ./config.yaml <<EOF
 enable_attention_dp: true
 EOF
 
@@ -551,7 +551,7 @@ trtllm-bench --model ${MODEL_NAME} \
     throughput \
     --tp 32 \
     --ep 32 \
-    --extra_llm_api_options ./extra_llm_api_options.yaml \
+    --config ./config.yaml \
     --kv_cache_free_gpu_mem_fraction 0.75 \
     --backend pytorch \
     --dataset ./dataset.json \
@@ -621,7 +621,7 @@ export EXPERT_STATISTIC_ITER_RANGE=100-200
 Run 36-way expert parallelism inference with the EPLB configuration incorporated:
 
 ```bash
-cat > ./extra_llm_api_options_eplb.yaml <<EOF
+cat > ./config_eplb.yaml <<EOF
 enable_attention_dp: true
 moe_config:
   load_balancer: ./moe_load_balancer.yaml
@@ -633,7 +633,7 @@ trtllm-bench --model ${MODEL_NAME} \
     throughput \
     --tp 36 \
     --ep 36 \
-    --extra_llm_api_options ./extra_llm_api_options_eplb.yaml \
+    --config ./config_eplb.yaml \
     --kv_cache_free_gpu_mem_fraction 0.75 \
     --backend pytorch \
     --dataset ./dataset.json \
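Condensed, the EPLB variant touched above amounts to the sketch below; everything shown is taken from the hunks, and the load-balancer YAML itself is left out because its contents are not part of this diff:

```bash
# Sketch of the 36-way EP + EPLB run re-assembled from the hunks above.
# ./moe_load_balancer.yaml must exist; its contents are not shown in this diff.
cat > ./config_eplb.yaml <<EOF
enable_attention_dp: true
moe_config:
  load_balancer: ./moe_load_balancer.yaml
EOF

trtllm-bench --model ${MODEL_NAME} \
    throughput \
    --tp 36 \
    --ep 36 \
    --config ./config_eplb.yaml \
    --kv_cache_free_gpu_mem_fraction 0.75 \
    --backend pytorch \
    --dataset ./dataset.json
```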

docs/source/blogs/tech_blog/blog6_Llama4_maverick_eagle_guide.md

Lines changed: 1 addition & 1 deletion
@@ -73,7 +73,7 @@ docker run -d --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
 trtllm-serve /config/models/maverick \
 --host 0.0.0.0 --port 8000 \
 --tp_size 8 --ep_size 1 \
---trust_remote_code --extra_llm_api_options c.yaml \
+--trust_remote_code --config c.yaml \
 --kv_cache_free_gpu_memory_fraction 0.75"
 ```

docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md

Lines changed: 7 additions & 7 deletions
@@ -86,7 +86,7 @@ trtllm-bench \
     --backend pytorch \
     --tp ${num_gpus} \
     --ep 1 \
-    --extra_llm_api_options low_latency.yaml \
+    --config low_latency.yaml \
     --dataset gpt-oss-120b-1k2k.txt \
     --max_batch_size ${max_batch_size} \
     --concurrency ${max_batch_size} \
@@ -149,7 +149,7 @@ trtllm-bench \
     --backend pytorch \
     --tp ${num_gpus} \
     --ep ${num_gpus} \
-    --extra_llm_api_options max_throughput.yaml \
+    --config max_throughput.yaml \
     --dataset gpt-oss-120b-1k2k.txt \
     --max_batch_size ${max_batch_size} \
     --concurrency $((max_batch_size * num_gpus)) \
@@ -171,7 +171,7 @@ Currently, the best throughput **19.5k tps/gpu** is achieved with DP4EP4 using 4
 
 ## Launch the TensorRT-LLM Server
 
-We can use `trtllm-serve` to serve the model by translating the benchmark commands above. For low-latency configuration, run:
+We can use `trtllm-serve` to serve the model by translating the benchmark commands above. For low-latency configuration, run:
 **Note:** You can also point to a local path containing the model weights instead of the HF repo (e.g., `${local_model_path}`).
 
 ```bash
@@ -184,7 +184,7 @@ trtllm-serve openai/gpt-oss-120b \
     --ep_size 8 \
     --max_batch_size 640 \
     --trust_remote_code \
-    --extra_llm_api_options max_throughput.yaml \
+    --config max_throughput.yaml \
     --kv_cache_free_gpu_memory_fraction 0.9
 ```
 </details>
@@ -201,7 +201,7 @@ trtllm-serve \
     --ep_size 4 \
     --max_batch_size 640 \
     --trust_remote_code \
-    --extra_llm_api_options max_throughput.yaml \
+    --config max_throughput.yaml \
     --kv_cache_free_gpu_memory_fraction 0.9
 ```
 </details>
@@ -223,7 +223,7 @@ OpenAI ships a set of Triton kernels optimized for its MoE models. TensorRT LLM
 
 ### Selecting Triton as the MoE backend
 
-To use the Triton MoE backend with **trtllm-serve** (or other similar commands) add this snippet to the YAML file passed via `--extra_llm_api_options`:
+To use the Triton MoE backend with **trtllm-serve** (or other similar commands) add this snippet to the YAML file passed via `--config`:
 
 ```yaml
 moe_config:
@@ -347,7 +347,7 @@ OpenAI ships a set of Triton kernels optimized for its MoE models. TensorRT-LLM
 
 ### Selecting Triton as the MoE backend
 
-To use the Triton MoE backend with **trtllm-serve** (or other commands), add this snippet to the YAML file passed via `--extra_llm_api_options`:
+To use the Triton MoE backend with **trtllm-serve** (or other commands), add this snippet to the YAML file passed via `--config`:
 
 ```yaml
 moe_config:
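The YAML snippet at the end of both hunks is cut off by the diff context, so only the `moe_config:` key is visible. A complete file would look roughly like the sketch below; the backend value is inferred from the section heading, not shown in this diff:

```bash
# Assumption: "TRITON" completes the truncated snippet above; it is inferred, not shown here.
cat > max_throughput.yaml <<EOF
moe_config:
  backend: TRITON
EOF
# Then pass it to trtllm-serve with: --config max_throughput.yaml
```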

docs/source/commands/trtllm-bench.rst

Lines changed: 13 additions & 7 deletions
@@ -3,9 +3,12 @@ trtllm-bench
 
 trtllm-bench is a comprehensive benchmarking tool for TensorRT LLM engines. It provides three main subcommands for different benchmarking scenarios:
 
-**Common Options for All Commands:**
+.. include:: ../_includes/note_sections.rst
+   :start-after: .. start-note-config-flag-alias
+   :end-before: .. end-note-config-flag-alias
 
-**Usage:**
+Syntax
+------
 
 .. click:: tensorrt_llm.commands.bench:main
    :prog: trtllm-bench
@@ -14,8 +17,11 @@ trtllm-bench is a comprehensive benchmarking tool for TensorRT LLM engines. It p
 
 
 
+Dataset preparation
+-------------------
+
 prepare_dataset.py
-===========================
+^^^^^^^^^^^^^^^^^^
 
 trtllm-bench is designed to work with the `prepare_dataset.py <https://github.com/NVIDIA/TensorRT-LLM/blob/main/benchmarks/cpp/prepare_dataset.py>`_ script, which generates benchmark datasets in the required format. The prepare_dataset script supports:
 
@@ -38,7 +44,7 @@ trtllm-bench is designed to work with the `prepare_dataset.py <https://github.co
 **Usage:**
 
 prepare_dataset
--------------------
+"""""""""""""""
 
 .. code-block:: bash
 
@@ -72,7 +78,7 @@ prepare_dataset
 - Logging level: info or debug (default: info)
 
 dataset
--------------------
+"""""""
 
 Process real datasets from various sources.
 
@@ -103,7 +109,7 @@ Process real datasets from various sources.
 
 
 token_norm_dist
--------------------
+"""""""""""""""
 
 Generate synthetic datasets with normal token distribution.
 
@@ -134,7 +140,7 @@ Generate synthetic datasets with normal token distribution.
 
 
 token_unif_dist
--------------------
+"""""""""""""""
 
 Generate synthetic datasets with uniform token distribution
 
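Since these hunks only restructure headings and add the alias note, the documented flow they describe is sketched below: generate a dataset with `prepare_dataset.py`, then benchmark it, passing tuning options through the (now aliased) `--config` flag. The prepare_dataset options are not shown in this diff and are left as a placeholder:

```bash
# Sketch only: prepare_dataset.py flags are not shown in this diff and are left
# as "..."; see the script's --help for the real options. Assumes the prepared
# dataset was written to ./dataset.txt.
python benchmarks/cpp/prepare_dataset.py ...

trtllm-bench --model <model-name-or-path> \
    throughput \
    --dataset ./dataset.txt \
    --config ./config.yml
```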

docs/source/commands/trtllm-eval.rst

Lines changed: 4 additions & 0 deletions
@@ -79,6 +79,10 @@ Alternatively, the ``--model`` argument also accepts a local path to pre-built T
 
 For more details, see ``trtllm-eval --help`` and ``trtllm-eval <task> --help``.
 
+.. include:: ../_includes/note_sections.rst
+   :start-after: .. start-note-config-flag-alias
+   :end-before: .. end-note-config-flag-alias
+
 
 
 Syntax
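With the alias note now included on the trtllm-eval page as well, an invocation using the preferred flag would look like the sketch below; the task name and paths are placeholders, and only the flag spelling comes from this commit:

```bash
# Placeholder task and paths; --config is the preferred spelling,
# --extra_llm_api_options remains accepted as an alias.
trtllm-eval --model <model-or-checkpoint-path> --config ./config.yml <task>
```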
