docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md (38 additions, 10 deletions)
@@ -58,7 +58,7 @@ Note:
* The command also maps port `8000` from the container to your host, so you can reach the LLM API endpoint from the host machine
* See <https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags> for all available containers. Containers published weekly from the main branch carry an `rcN` suffix, while the monthly releases that go through QA testing have no `rcN` suffix. Use an `rc` release to get the latest model and feature support.
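The `docker run` command these notes describe sits above this hunk's context, so it is not shown here. A minimal sketch of such a launch, assuming the NGC `release` image from the catalog link above (the tag is a placeholder):

```shell
# Illustrative sketch only: start a TensorRT LLM release container and map port 8000
# so the OpenAI-compatible endpoint is reachable from the host.
# The tag is a placeholder; pick one from the NGC catalog page linked above.
docker run --rm -it \
  --gpus all \
  --ipc=host \
  -p 8000:8000 \
  nvcr.io/nvidia/tensorrt-llm/release:<tag>
```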
-If you want to use latest main branch, you can choose to build from source to install TensorRT LLM, the steps refer to [https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html](https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html)
+If you want to use the latest main branch, you can build TensorRT LLM from source; the steps are described at [https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html](https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html)
### Creating the TensorRT LLM Server config
@@ -226,7 +226,7 @@ Refer to the wide EP [examples](https://github.com/NVIDIA/TensorRT-LLM/tree/main
### Basic Test
-Start a new terminal on the host to test the TensorRT LLM server you just launched.
+Start a new terminal on the host to test the TensorRT LLM server you just launched.
You can query the health/readiness of the server using:
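The code block that follows this sentence in the source file is outside the diff context. A minimal sketch of such a check, assuming the server is reachable on the host's mapped port 8000 and exposes a `/health` route:

```shell
# Prints the HTTP status code; expect 200 once the server is up and ready.
# Host, port, and the /health path are assumptions based on this guide's setup.
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/health
```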
@@ -354,7 +354,7 @@ If you want to save the results to a file add the following options.
-For more benchmarking options see [benchmark_serving.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/serve/scripts/benchmark_serving.py)
+For more benchmarking options, see [benchmark_serving.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/serve/scripts/benchmark_serving.py)
Run `bench.sh` to begin a serving benchmark. This will take a long time if you run all the concurrencies mentioned in the above `bench.sh` script.
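The `bench.sh` script itself is not part of this hunk; a loop of roughly this shape is what the text refers to. The flag names passed to `benchmark_serving.py` below are assumptions and should be verified against the script's `--help` output:

```shell
# Illustrative only: sweep a few concurrency levels and write one result file per run.
# Flag names are assumptions; confirm them with `python benchmark_serving.py --help`.
for concurrency in 1 8 32 128; do
  python -m tensorrt_llm.serve.scripts.benchmark_serving \
    --model "${MODEL}" \
    --dataset-name random \
    --num-prompts $((concurrency * 10)) \
    --max-concurrency "${concurrency}" \
    --save-result \
    --result-filename "bench_concurrency_${concurrency}.json"
done
```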
@@ -395,13 +395,41 @@ P99 E2EL (ms): [result]
### Key Metrics
-* Median Time to First Token (TTFT)
+#### Time to First Token (TTFT)
* The typical time elapsed from when a request is sent until the first output token is generated.
-* Median Time Per Output Token (TPOT)
-* The typical time required to generate each token *after* the first one.
-* Median Inter-Token Latency (ITL)
-* The typical time delay between the completion of one token and the completion of the next.
-* Median End-to-End Latency (E2EL)
+
+#### Time Per Output Token (TPOT) and Inter-Token Latency (ITL)
+* TPOT is the typical time required to generate each token *after* the first one.
+* ITL is the typical time delay between the completion of one token and the completion of the next.
+* Both TPOT and ITL ignore TTFT.
+
+For a single request, ITLs are the time intervals between tokens, while TPOT is the average of those intervals:
+Across different requests, **average TPOT** is the mean of each request's TPOT (all requests weighted equally), while **average ITL** is token-weighted (all tokens weighted equally):
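The formulas these two sentences introduce fall outside the visible diff. One consistent way to write them, assuming request r produces n_r output tokens at times t_{r,1} < ... < t_{r,n_r} (the notation is ours, not the guide's):

```latex
% Per-request definitions: ITL is each inter-token gap, TPOT is their mean.
\mathrm{ITL}_{r,i} = t_{r,i} - t_{r,i-1}, \qquad
\mathrm{TPOT}_r = \frac{1}{n_r - 1}\sum_{i=2}^{n_r}\mathrm{ITL}_{r,i}
              = \frac{t_{r,n_r} - t_{r,1}}{n_r - 1}

% Averaging over R requests: TPOT weights requests equally, ITL weights token intervals equally.
\overline{\mathrm{TPOT}} = \frac{1}{R}\sum_{r=1}^{R}\mathrm{TPOT}_r, \qquad
\overline{\mathrm{ITL}} = \frac{\sum_{r=1}^{R}\sum_{i=2}^{n_r}\mathrm{ITL}_{r,i}}{\sum_{r=1}^{R}\left(n_r - 1\right)}
```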
docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md (37 additions, 9 deletions)
@@ -233,7 +233,7 @@ TODO: Use Chat Compeletions API / Responses API as the example after the PR is m
We use OpenAI's official evaluation tool to test the model's accuracy. For more information see [gpt-oss-eval](https://github.com/openai/gpt-oss/tree/main/gpt_oss/evals).
With the added support of Chat Completions and Responses API in `trtllm-serve`, `gpt_oss.evals` works directly without any modifications.
-You need to set `enable_attention_dp`, `tp_size`, `ep_size`, `max_batch_size` and `max_num_tokens` when launching the trtllm server and set `reasoning-effort` when launching evaluation in gpt-oss. Below are some reference configurations for accuracy evaluation on B200.
+You need to set `enable_attention_dp`, `tp_size`, `ep_size`, `max_batch_size`, and `max_num_tokens` when launching the trtllm server, and set `reasoning-effort` when launching the evaluation in gpt-oss. Below are some reference configurations for accuracy evaluation on B200.
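The reference configurations the sentence above points to are not visible in this hunk. A minimal sketch of how the pieces fit together, with placeholder values rather than the actual B200 reference settings; the exact flag names should be checked against `trtllm-serve --help` and the gpt-oss evals documentation:

```shell
# Illustrative only: values are placeholders, not the reference configuration.
# enable_attention_dp is assumed to live in the extra LLM API options YAML.
cat > extra_llm_api_options.yml <<'EOF'
enable_attention_dp: true
EOF

trtllm-serve <gpt-oss-model-path> \
  --tp_size 8 --ep_size 8 \
  --max_batch_size 128 --max_num_tokens 16384 \
  --extra_llm_api_options extra_llm_api_options.yml

# In a second terminal, run the evaluation with the matching reasoning effort
# (flag name assumed from the sentence above).
python -m gpt_oss.evals --reasoning-effort low
```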
-For more benchmarking options see [benchmark_serving.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/serve/scripts/benchmark_serving.py)
+For more benchmarking options, see [benchmark_serving.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/serve/scripts/benchmark_serving.py)
Run `bench.sh` to begin a serving benchmark. This will take a long time if you run all the concurrencies mentioned in the above `bench.sh` script.
@@ -341,13 +341,41 @@ P99 E2EL (ms): [result]
### Key Metrics
-* Median Time to First Token (TTFT)
+#### Time to First Token (TTFT)
* The typical time elapsed from when a request is sent until the first output token is generated.
-* Median Time Per Output Token (TPOT)
-* The typical time required to generate each token *after* the first one.
-* Median Inter-Token Latency (ITL)
-* The typical time delay between the completion of one token and the completion of the next.
-* Median End-to-End Latency (E2EL)
+
+#### Time Per Output Token (TPOT) and Inter-Token Latency (ITL)
+* TPOT is the typical time required to generate each token *after* the first one.
+* ITL is the typical time delay between the completion of one token and the completion of the next.
+* Both TPOT and ITL ignore TTFT.
+
+For a single request, ITLs are the time intervals between tokens, while TPOT is the average of those intervals:
+Across different requests, **average TPOT** is the mean of each request's TPOT (all requests weighted equally), while **average ITL** is token-weighted (all tokens weighted equally):