docs/source/commands/trtllm-serve/run-benchmark-with-trtllm-serve.md (1 addition, 1 deletion)
@@ -5,7 +5,7 @@ A complete reference for the API is available in the [OpenAI API Reference](http
  This step-by-step tutorial covers the following topics for running online serving benchmarking with Llama 3.1 70B and Qwen2.5-VL-7B for multimodal models:
  * Methodology Introduction
- * Launch the OpenAI-Compatibale Server with NGC container
+ * Launch the OpenAI-Compatible Server with NGC container
docs/source/deployment-guide/deployment-guide-for-deepseek-r1-on-trtllm.md (58 additions, 64 deletions)
@@ -1,4 +1,4 @@
- # Quick Start Recipe for DeepSeek R1 on TensorRT LLM - Blackwell & Hopper Hardware
+ # Deployment Guide for DeepSeek R1 on TensorRT LLM - Blackwell & Hopper Hardware

  ## Introduction
@@ -47,7 +47,7 @@ docker run --rm -it \
    -p 8000:8000 \
    -v ~/.cache:/root/.cache:rw \
    --name tensorrt_llm \
-   nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc6 \
+   nvcr.io/nvidia/tensorrt-llm/release:x.y.z \
    /bin/bash
  ```
@@ -60,108 +60,102 @@ Note:
  If you want to use latest main branch, you can choose to build from source to install TensorRT LLM, the steps refer to [https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html](https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html)

- ### Creating the TensorRT LLM Server config
+ ### Recommended Performance Settings

- We create a YAML configuration file /tmp/config.yml for the TensorRT LLM Server and populate it with the following recommended performance settings.
+ We maintain YAML configuration files with recommended performance settings in the [`examples/configs`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/configs) directory. These config files are present in the TensorRT LLM container at the path `/app/tensorrt_llm/examples/configs`. You can use these out-of-the-box, or adjust them to your specific use case.

  ```shell
- EXTRA_LLM_API_FILE=/tmp/config.yml
-
- cat <<EOF > ${EXTRA_LLM_API_FILE}
- enable_attention_dp: true
- cuda_graph_config:
-   enable_padding: true
-   max_batch_size: 128
- kv_cache_config:
-   dtype: fp8
- stream_interval: 10
- speculative_config:
-   decoding_type: MTP
-   num_nextn_predict_layers: 1
- EOF
+ TRTLLM_DIR=/app/tensorrt_llm # change as needed to match your environment
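The shipped files under `${TRTLLM_DIR}/examples/configs` are plain LLM API YAML of the same shape as the inline example removed here. For orientation, a sketch of such a config using the settings from the deleted heredoc above (the actual files in `examples/configs` may differ):

```yaml
# Sketch of an extra_llm_api_options YAML for DeepSeek R1,
# mirroring the inline settings deleted above; not the shipped file.
enable_attention_dp: true
cuda_graph_config:
  enable_padding: true
  max_batch_size: 128
kv_cache_config:
  dtype: fp8                    # FP8 KV cache
stream_interval: 10
speculative_config:
  decoding_type: MTP            # Multi-Token Prediction speculative decoding
  num_nextn_predict_layers: 1
```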
- Below is an example command to launch the TensorRT LLM server with the DeepSeek-R1 model from within the container. The command is specifically configured for the 1024/1024 Input/Output Sequence Length test. The explanation of each flag is shown in the “Configs and Parameters” section.
+ Below is an example command to launch the TensorRT LLM server with the DeepSeek-R1 model from within the container. The command is specifically configured for the 1024/1024 Input/Output Sequence Length test. The explanation of each flag is shown in the “LLM API Options (YAML Configuration)” section.

  After the server is set up, the client can now send prompt requests to the server and receive results.
- ### Configs and Parameters
+ ### LLM API Options (YAML Configuration)
+
+ <!-- TODO: this section is duplicated across the deployment guides; they should be consolidated to a central file and imported as needed, or we can remove this and link to LLM API reference -->
+
+ These options provide control over TensorRT LLM's behavior and are set within the YAML file passed to the `trtllm-serve` command via the `--extra_llm_api_options` argument.

- These options are used directly on the command line when you start the `trtllm-serve` process.

- #### `--tp_size`
+ #### `tensor_parallel_size`

  ***Description:** Sets the **tensor-parallel size**. This should typically match the number of GPUs you intend to use for a single model instance.

- #### `--ep_size`
+ #### `moe_expert_parallel_size`

- ***Description:** Sets the **expert-parallel size** for Mixture-of-Experts (MoE) models. Like `tp_size`, this should generally match the number of GPUs you're using. This setting has no effect on non-MoE models.
+ ***Description:** Sets the **expert-parallel size** for Mixture-of-Experts (MoE) models. Like `tensor_parallel_size`, this should generally match the number of GPUs you're using. This setting has no effect on non-MoE models.

- #### `--kv_cache_free_gpu_memory_fraction`
+ #### `kv_cache_free_gpu_memory_fraction`

  ***Description:** A value between `0.0` and `1.0` that specifies the fraction of free GPU memory to reserve for the KV cache after the model is loaded. Since memory usage can fluctuate, this buffer helps prevent out-of-memory (OOM) errors.
  ***Recommendation:** If you experience OOM errors, try reducing this value to `0.7` or lower.

-
- #### `--max_batch_size`
+ #### `max_batch_size`

  ***Description:** The maximum number of user requests that can be grouped into a single batch for processing.

- #### `--max_num_tokens`
+ #### `max_num_tokens`

  ***Description:** The maximum total number of tokens (across all requests) allowed inside a single scheduled batch.

- #### `--max_seq_len`
+ #### `max_seq_len`

  ***Description:** The maximum possible sequence length for a single request, including both input and generated output tokens.

- #### `--trust_remote_code`
+ #### `trust_remote_code`

  **Description:** Allows TensorRT LLM to download models and tokenizers from Hugging Face. This flag is passed directly to the Hugging Face API.

-
- #### Extra LLM API Options (YAML Configuration)
-
- These options provide finer control over performance and are set within a YAML file passed to the `trtllm-serve` command via the `--extra_llm_api_options` argument.
-
  #### `kv_cache_config`

  ***Description**: A section for configuring the Key-Value (KV) cache.
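Since the flags that used to be passed on the command line (`--tp_size`, `--ep_size`, and so on) now map to the YAML keys documented above, a single `--extra_llm_api_options` file can carry all of them. A minimal illustrative sketch follows; the values are placeholders rather than tuned recommendations, the keys are taken from the headings above, and exact nesting should be checked against the LLM API reference:

```yaml
# Illustrative extra_llm_api_options YAML; placeholder values, adjust for your GPUs and workload.
tensor_parallel_size: 8                  # one model instance spread across 8 GPUs (assumed)
moe_expert_parallel_size: 8              # match GPU count for MoE models; ignored for dense models
kv_cache_free_gpu_memory_fraction: 0.9   # drop toward 0.7 if you hit OOM
max_batch_size: 128
max_num_tokens: 8192                     # placeholder token budget per scheduled batch
max_seq_len: 2048                        # input + output tokens per request
trust_remote_code: true
kv_cache_config:
  dtype: fp8
```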
docs/source/deployment-guide/deployment-guide-for-gpt-oss-on-trtllm.md (59 additions, 61 deletions)
@@ -1,4 +1,4 @@
- # Quick Start Recipe for GPT-OSS on TensorRT-LLM - Blackwell Hardware
+ # Deployment Guide for GPT-OSS on TensorRT-LLM - Blackwell Hardware

  ## Introduction
@@ -43,7 +43,7 @@ docker run --rm -it \
    -p 8000:8000 \
    -v ~/.cache:/root/.cache:rw \
    --name tensorrt_llm \
-   nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc6 \
+   nvcr.io/nvidia/tensorrt-llm/release:x.y.z \
    /bin/bash
  ```
@@ -56,105 +56,103 @@ Note:
  If you want to use latest main branch, you can choose to build from source to install TensorRT-LLM, the steps refer to <https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html>.

- ### Creating the TensorRT LLM Server config
+ ### Recommended Performance Settings

- We create a YAML configuration file `/tmp/config.yml` for the TensorRT-LLM Server and populate it with the following recommended performance settings.
+ We maintain YAML configuration files with recommended performance settings in the [`examples/configs`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/configs) directory. These config files are present in the TensorRTLLM container at the path `/app/tensorrt_llm/examples/configs`. You can use these out-of-the-box, or adjust them to your specific use case.

- For low-latency with `TRTLLM` MOE backend:
+ For low-latency use cases:

  ```shell
- EXTRA_LLM_API_FILE=/tmp/config.yml
-
- cat <<EOF > ${EXTRA_LLM_API_FILE}
- enable_attention_dp: false
- cuda_graph_config:
-   enable_padding: true
-   max_batch_size: 720
- moe_config:
-   backend: TRTLLM
- stream_interval: 20
- num_postprocess_workers: 4
- EOF
+ TRTLLM_DIR=/app/tensorrt_llm # change as needed to match your environment

- Below is an example command to launch the TensorRT LLM server with the GPT-OSS model from within the container. The command is specifically configured for the 1024/1024 Input/Output Sequence Length test. The explanation of each flag is shown in the “Configs and Parameters” section.
+ Below is an example command to launch the TensorRT LLM server with the GPT-OSS model from within the container. The command is specifically configured for the 1024/1024 Input/Output Sequence Length test. The explanation of each flag is shown in the “LLM API Options (YAML Configuration)” section.

  After the server is set up, the client can now send prompt requests to the server and receive results.

- ### Configs and Parameters
+ ### LLM API Options (YAML Configuration)

- These options are used directly on the command line when you start the `trtllm-serve` process.
+ <!-- TODO: this section is duplicated across the deployment guides; they should be consolidated to a central file and imported as needed, or we can remove this and link to LLM API reference -->

- #### `--tp_size`
+ These options provide control over TensorRT LLM's behavior and are set within the YAML file passed to the `trtllm-serve` command via the `--extra_llm_api_options` argument.
+
+ #### `tensor_parallel_size`

  ***Description:** Sets the **tensor-parallel size**. This should typically match the number of GPUs you intend to use for a single model instance.

- #### `--ep_size`
+ #### `moe_expert_parallel_size`

- ***Description:** Sets the **expert-parallel size** for Mixture-of-Experts (MoE) models. Like `tp_size`, this should generally match the number of GPUs you're using. This setting has no effect on non-MoE models.
+ ***Description:** Sets the **expert-parallel size** for Mixture-of-Experts (MoE) models. Like `tensor_parallel_size`, this should generally match the number of GPUs you're using. This setting has no effect on non-MoE models.

- #### `--kv_cache_free_gpu_memory_fraction`
+ #### `kv_cache_free_gpu_memory_fraction`

  ***Description:** A value between `0.0` and `1.0` that specifies the fraction of free GPU memory to reserve for the KV cache after the model is loaded. Since memory usage can fluctuate, this buffer helps prevent out-of-memory (OOM) errors.
  ***Recommendation:** If you experience OOM errors, try reducing this value to `0.7` or lower.

- #### `--max_batch_size`
+ #### `max_batch_size`

  ***Description:** The maximum number of user requests that can be grouped into a single batch for processing. The actual max batch size that can be achieved depends on total sequence length (input + output).

- #### `--max_num_tokens`
+ #### `max_num_tokens`

  ***Description:** The maximum total number of tokens (across all requests) allowed inside a single scheduled batch.

- #### `--max_seq_len`
+ #### `max_seq_len`

- ***Description:** The maximum possible sequence length for a single request, including both input and generated output tokens. We won't specifically set it. It will be inferred from model config.
+ ***Description:** The maximum possible sequence length for a single request, including both input and generated output tokens. If not set, it will be inferred from model config.

- #### `--trust_remote_code`
+ #### `trust_remote_code`

  ***Description:** Allows TensorRT-LLM to download models and tokenizers from Hugging Face. This flag is passed directly to the Hugging Face API.

-
- #### Extra LLM API Options (YAML Configuration)
-
- These options provide finer control over performance and are set within a YAML file passed to the `trtllm-serve` command via the `--extra_llm_api_options` argument.
-
  #### `cuda_graph_config`

  ***Description**: A section for configuring CUDA graphs to optimize performance.
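The `cuda_graph_config` section above, together with the other low-latency settings that the deleted inline example carried, can be expressed as a standalone YAML file. A sketch based on the heredoc removed earlier in this diff (the shipped `examples/configs` files may differ):

```yaml
# Low-latency GPT-OSS settings, mirroring the inline example deleted above; not the shipped file.
enable_attention_dp: false
cuda_graph_config:
  enable_padding: true       # pad batches so captured CUDA graphs can be reused
  max_batch_size: 720
moe_config:
  backend: TRTLLM            # TRTLLM MoE kernel backend
stream_interval: 20
num_postprocess_workers: 4
```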
0 commit comments