`docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md` (6 additions, 10 deletions)
@@ -22,7 +22,7 @@ The guide is intended for developers and practitioners seeking high-throughput o
 
 ## MoE Backend Support Matrix
 
-There are multiple MOE backends inside TRT-LLM, not all of them supporting every precision on every GPUs. Here are the support matrix of the MOE backends.
+There are multiple MOE backends inside TensorRT LLM, not all of them supporting every precision on every GPUs. Here are the support matrix of the MOE backends.
 
 | device | Checkpoint | Supported moe_backend |
 |----------|----------|----------|
@@ -60,7 +60,7 @@ Note:
 
 If you want to use latest main branch, you can choose to build from source to install TensorRT LLM, the steps refer to [https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html](https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html)
 
-### Creating the TRT-LLM Server config
+### Creating the TensorRT LLM Server config
 
 We create a YAML configuration file /tmp/config.yml for the TensorRT LLM Server and populate it with the following recommended performance settings.
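The recommended settings themselves sit outside the captured hunks (the next hunk header shows a `moe_config:` section). As a rough sketch only, such a file is typically written with a heredoc; every key and value below other than `moe_config` is an assumption, not the guide's actual recommendation:

```shell
# Hypothetical sketch: the guide's real /tmp/config.yml holds the recommended
# performance settings, which this diff does not show.
cat <<EOF > /tmp/config.yml
cuda_graph_config:          # assumed key
  enable_padding: true
moe_config:                 # section name taken from the hunk header below
  backend: CUTLASS          # assumed value; see the MoE backend support matrix above
EOF
```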
@@ -103,15 +103,14 @@ moe_config:
 EOF
 ```
 
-### Launch the TRT-LLM Server
+### Launch the TensorRT LLM Server
 
-Below is an example command to launch the TRT-LLM server with the DeepSeek-R1 model from within the container. The command is specifically configured for the 1024/1024 Input/Output Sequence Length test. The explanation of each flag is shown in the “Configs and Parameters” section.
+Below is an example command to launch the TensorRT LLM server with the DeepSeek-R1 model from within the container. The command is specifically configured for the 1024/1024 Input/Output Sequence Length test. The explanation of each flag is shown in the “Configs and Parameters” section.
 
 ```shell
 trtllm-serve deepseek-ai/DeepSeek-R1-0528 \
   --host 0.0.0.0 \
   --port 8000 \
-  --backend pytorch \
   --max_batch_size 1024 \
   --max_num_tokens 3200 \
   --max_seq_len 2048 \
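The hunk cuts off mid-command. For orientation only, a sketch of how the full invocation can look once `--backend pytorch` is dropped; the final `--extra_llm_api_options` line (pointing at the config file created above) is an assumption, since it is not part of the captured hunk:

```shell
trtllm-serve deepseek-ai/DeepSeek-R1-0528 \
  --host 0.0.0.0 \
  --port 8000 \
  --max_batch_size 1024 \
  --max_num_tokens 3200 \
  --max_seq_len 2048 \
  --extra_llm_api_options /tmp/config.yml   # assumed continuation, not shown in the diff
```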
@@ -141,9 +140,6 @@ These options are used directly on the command line when you start the `trtllm-s
 ***Description:** A value between `0.0` and `1.0` that specifies the fraction of free GPU memory to reserve for the KV cache after the model is loaded. Since memory usage can fluctuate, this buffer helps prevent out-of-memory (OOM) errors.
 ***Recommendation:** If you experience OOM errors, try reducing this value to `0.7` or lower.
 
-#### `--backend pytorch`
-
- **Description:** Tells TensorRT LLM to use the **pytorch** backend.
 When the `Status: 200` code is returned, the server is ready for queries. Note that the very first query may take longer due to initialization and compilation.
 
-After the TRT-LLM server is set up and shows Application startup complete, you can send requests to the server.
+After the TensorRT LLM server is set up and shows Application startup complete, you can send requests to the server.

-Here is an example response, showing that the TRT-LLM server returns “New York is a state located in the northeastern United States. It is bordered by”, completing the input sequence.
+Here is an example response, showing that the TensorRT LLM server returns “New York is a state located in the northeastern United States. It is bordered by”, completing the input sequence.
 
 ```json
 {"id":"cmpl-e728f08114c042309efeae4df86a50ca","object":"text_completion","created":1754294810,"model":"deepseek-ai/DeepSeek-R1-0528","choices":[{"index":0,"text":" / by Megan Stine ; illustrated by John Hinderliter.\n\nBook | Gross","token_ids":null,"logprobs":null,"context_logits":null,"finish_reason":"length","stop_reason":null,"disaggregated_params":null}],"usage":{"prompt_tokens":6,"total_tokens":22,"completion_tokens":16},"prompt_token_ids":null}
 ```
(The hunks below belong to the next file in the diff; its header row was not captured in this view, but the content is the GPT-OSS quick-start recipe.)

 If you want to use latest main branch, you can choose to build from source to install TensorRT-LLM, the steps refer to <https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html>.
 
-### Creating the TRT-LLM Server config
+### Creating the TensorRT LLM Server config
 
 We create a YAML configuration file `/tmp/config.yml` for the TensorRT-LLM Server and populate it with the following recommended performance settings.
@@ -98,15 +98,14 @@ attention_dp_config:
 EOF
 ```
 
-### Launch the TRT-LLM Server
+### Launch the TensorRT LLM Server
 
-Below is an example command to launch the TRT-LLM server with the GPT-OSS model from within the container. The command is specifically configured for the 1024/1024 Input/Output Sequence Length test. The explanation of each flag is shown in the “Configs and Parameters” section.
+Below is an example command to launch the TensorRT LLM server with the GPT-OSS model from within the container. The command is specifically configured for the 1024/1024 Input/Output Sequence Length test. The explanation of each flag is shown in the “Configs and Parameters” section.
 
 ```shell
 trtllm-serve openai/gpt-oss-120b \
   --host 0.0.0.0 \
   --port 8000 \
-  --backend pytorch \
   --max_batch_size 720 \
   --max_num_tokens 16384 \
   --kv_cache_free_gpu_memory_fraction 0.9 \
@@ -135,10 +134,6 @@ These options are used directly on the command line when you start the `trtllm-s
 ***Description:** A value between `0.0` and `1.0` that specifies the fraction of free GPU memory to reserve for the KV cache after the model is loaded. Since memory usage can fluctuate, this buffer helps prevent out-of-memory (OOM) errors.
 ***Recommendation:** If you experience OOM errors, try reducing this value to `0.7` or lower.
 
-#### `--backend pytorch`
-
-***Description:** Tells TensorRT-LLM to use the **pytorch** backend.
-
 #### `--max_batch_size`
 
 ***Description:** The maximum number of user requests that can be grouped into a single batch for processing. The actual max batch size that can be achieved depends on total sequence length (input + output).
 When the `Status: 200` code is returned, the server is ready for queries. Note that the very first query may take longer due to initialization and compilation.
 
-After the TRT-LLM server is set up and shows Application startup complete, you can send requests to the server.
+After the TensorRT LLM server is set up and shows Application startup complete, you can send requests to the server.
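The `Status: 200` check referenced above sits outside the captured hunks; a minimal sketch of such a readiness probe, assuming the `/health` endpoint exposed by `trtllm-serve`:

```shell
# Prints "Status: 200" once the server is ready; the endpoint path is an assumption.
curl -s -o /dev/null -w "Status: %{http_code}\n" http://localhost:8000/health
```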
`docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md` (5 additions, 10 deletions)
@@ -52,7 +52,7 @@ Note:
 
 If you want to use latest main branch, you can choose to build from source to install TensorRT LLM, the steps refer to [https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html](https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html)
 
-### Creating the TRT-LLM Server config
+### Creating the TensorRT LLM Server config
 
 We create a YAML configuration file /tmp/config.yml for the TensorRT LLM Server and populate it with the following recommended performance settings.
@@ -69,15 +69,14 @@ kv_cache_config:
 EOF
 ```
 
-### Launch the TRT-LLM Server
+### Launch the TensorRT LLM Server
 
-Below is an example command to launch the TRT-LLM server with the Llama-3.3-70B-Instruct-FP8 model from within the container. The command is specifically configured for the 1024/1024 Input/Output Sequence Length test. The explanation of each flag is shown in the “Configs and Parameters” section.
+Below is an example command to launch the TensorRT LLM server with the Llama-3.3-70B-Instruct-FP8 model from within the container. The command is specifically configured for the 1024/1024 Input/Output Sequence Length test. The explanation of each flag is shown in the “Configs and Parameters” section.
 
 ```shell
 trtllm-serve nvidia/Llama-3.3-70B-Instruct-FP8 \
   --host 0.0.0.0 \
   --port 8000 \
-  --backend pytorch \
   --max_batch_size 1024 \
   --max_num_tokens 2048 \
   --max_seq_len 2048 \
@@ -107,10 +106,6 @@ These options are used directly on the command line when you start the `trtllm-s
  **Recommendation:** If you experience OOM errors, try reducing this value to **0.8** or lower.
 
-#### `--backend pytorch`
-
- **Description:** Tells TensorRT LLM to use the **pytorch** backend.
-
 #### `--max_batch_size`
 
  **Description:** The maximum number of user requests that can be grouped into a single batch for processing.
 When the `Status: 200` code is returned, the server is ready for queries. Note that the very first query may take longer due to initialization and compilation.
 
-After the TRT-LLM server is set up and shows Application startup complete, you can send requests to the server.
+After the TensorRT LLM server is set up and shows Application startup complete, you can send requests to the server.

-Here is an example response, showing that the TRT-LLM server returns “New York is a state located in the northeastern United States. It is bordered by”, completing the input sequence.
+Here is an example response, showing that the TensorRT LLM server returns “New York is a state located in the northeastern United States. It is bordered by”, completing the input sequence.
 
 ```json
 {"id":"cmpl-bc1393d529ce485c961d9ffee5b25d72","object":"text_completion","created":1753843963,"model":"nvidia/Llama-3.3-70B-Instruct-FP8","choices":[{"index":0,"text":" New York is a state located in the northeastern United States. It is bordered by","token_ids":null,"logprobs":null,"context_logits":null,"finish_reason":"length","stop_reason":null,"disaggregated_params":null}],"usage":{"prompt_tokens":6,"total_tokens":22,"completion_tokens":16},"prompt_token_ids":null}
 ```
`docs/source/deployment-guide/quick-start-recipe-for-llama4-scout-on-trtllm.md` (5 additions, 10 deletions)
@@ -51,7 +51,7 @@ Note:
 
 If you want to use latest main branch, you can choose to build from source to install TensorRT LLM, the steps refer to [https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html](https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html)
 
-### Creating the TRT-LLM Server config
+### Creating the TensorRT LLM Server config
 
 We create a YAML configuration file /tmp/config.yml for the TensorRT LLM Server and populate it with the following recommended performance settings.
@@ -68,15 +68,14 @@ kv_cache_config:
 EOF
 ```
 
-### Launch the TRT-LLM Server
+### Launch the TensorRT LLM Server
 
-Below is an example command to launch the TRT-LLM server with the Llama-4-Scout-17B-16E-Instruct-FP8 model from within the container. The command is specifically configured for the 1024/1024 Input/Output Sequence Length test. The explanation of each flag is shown in the “Configs and Parameters” section.
+Below is an example command to launch the TensorRT LLM server with the Llama-4-Scout-17B-16E-Instruct-FP8 model from within the container. The command is specifically configured for the 1024/1024 Input/Output Sequence Length test. The explanation of each flag is shown in the “Configs and Parameters” section.
@@ -106,10 +105,6 @@ These options are used directly on the command line when you start the `trtllm-s
 ***Description:** A value between `0.0` and `1.0` that specifies the fraction of free GPU memory to reserve for the KV cache after the model is loaded. Since memory usage can fluctuate, this buffer helps prevent out-of-memory (OOM) errors.
 ***Recommendation:** If you experience OOM errors, try reducing this value to `0.7` or lower.
 
-#### `--backend pytorch`
-
- **Description:** Tells TensorRT LLM to use the **pytorch** backend.
-
 #### `--max_batch_size`
 
 ***Description:** The maximum number of user requests that can be grouped into a single batch for processing.
 When the `Status: 200` code is returned, the server is ready for queries. Note that the very first query may take longer due to initialization and compilation.
 
-After the TRT-LLM server is set up and shows Application startup complete, you can send requests to the server.
+After the TensorRT LLM server is set up and shows Application startup complete, you can send requests to the server.

-Here is an example response, showing that the TRT-LLM server returns “New York is a state located in the northeastern United States. It is bordered by”, completing the input sequence.
+Here is an example response, showing that the TensorRT LLM server returns “New York is a state located in the northeastern United States. It is bordered by”, completing the input sequence.
 
 ```json
 {"id":"cmpl-bc1393d529ce485c961d9ffee5b25d72","object":"text_completion","created":1753843963,"model":"$MODEL","choices":[{"index":0,"text":" New York is a state located in the northeastern United States. It is bordered by","token_ids":null,"logprobs":null,"context_logits":null,"finish_reason":"length","stop_reason":null,"disaggregated_params":null}],"usage":{"prompt_tokens":6,"total_tokens":22,"completion_tokens":16},"prompt_token_ids":null}
 ```