
Commit c822c11

[None] [docs] Update TPOT/ITL docs (#8378)
Signed-off-by: Kaiyu Xie <[email protected]>
1 parent 206a993 commit c822c11

File tree: 5 files changed (+223 / -83 lines)


docs/source/commands/trtllm-serve/run-benchmark-with-trtllm-serve.md

Lines changed: 54 additions & 26 deletions
@@ -151,16 +151,44 @@ P99 E2EL (ms): 1643.44
 
 ### Key Metrics
 
-* Median Time to First Token (TTFT)
+#### Time to First Token (TTFT)
 * The typical time elapsed from when a request is sent until the first output token is generated.
-* Median Time Per Output Token (TPOT)
-* The typical time required to generate each token *after* the first one.
-* Median Inter-Token Latency (ITL)
-* The typical time delay between the completion of one token and the completion of the next.
-* Median End-to-End Latency (E2EL)
+
+#### Time Per Output Token (TPOT) and Inter-Token Latency (ITL)
+* TPOT is the typical time required to generate each token *after* the first one.
+* ITL is the typical time delay between the completion of one token and the completion of the next.
+* Both TPOT and ITL ignore TTFT.
+
+For a single request, ITLs are the time intervals between tokens, while TPOT is the average of those intervals:
+
+```math
+\text{TPOT (1\ request)} = \text{Avg(ITL)} = \frac{\text{E2E\ latency} - \text{TTFT}}{\text{\#Output\ Tokens} - 1}
+```
+
+Across different requests, **average TPOT** is the mean of each request's TPOT (all requests weighted equally), while **average ITL** is token-weighted (all tokens weighted equally):
+
+```math
+\text{Avg TPOT (N requests)} = \frac{\text{TPOT}_1 + \text{TPOT}_2 + \cdots + \text{TPOT}_N}{N}
+```
+
+```math
+\text{Avg ITL (N requests)} = \frac{\text{Sum of all ITLs across requests}}{\text{\#Output Tokens across requests}}
+```
+
+#### End-to-End (E2E) Latency
 * The typical total time from when a request is submitted until the final token of the response is received.
-* Total Token Throughput
+
+#### Total Token Throughput
 * The combined rate at which the system processes both input (prompt) tokens and output (generated) tokens.
+```math
+\text{Total\ TPS} = \frac{\text{\#Input\ Tokens}+\text{\#Output\ Tokens}}{T_{last} - T_{first}}
+```
+
+#### Tokens Per Second (TPS) or Output Token Throughput
+* how many output tokens the system generates each second.
+```math
+\text{TPS} = \frac{\text{\#Output\ Tokens}}{T_{last} - T_{first}}
+```
 
 ## About `extra_llm_api_options`
 trtllm-serve provides `extra_llm_api_options` knob to **overwrite** the parameters specified by trtllm-serve.
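
To make the difference between the two averages concrete, here is a minimal standalone Python sketch with hypothetical per-request ITLs. It is illustrative only and not part of the commit above or of the benchmark scripts:

```python
# Hypothetical per-request inter-token latencies (ITLs) in milliseconds.
# Each inner list holds the gaps between consecutive output tokens of one
# request, so a request with N output tokens contributes N - 1 ITL values.
itls_per_request = [
    [5.0, 5.2, 5.1, 5.3],   # short request: 5 output tokens
    [9.8, 10.1, 10.0],      # slower short request: 4 output tokens
    [5.9] * 127,            # long request: 128 output tokens
]

# Per-request TPOT: the mean of that request's ITLs
# (equivalently, (E2E latency - TTFT) / (#output tokens - 1)).
tpot_per_request = [sum(itls) / len(itls) for itls in itls_per_request]

# Average TPOT weights every request equally.
avg_tpot = sum(tpot_per_request) / len(tpot_per_request)

# Average ITL weights every token gap equally, so long requests dominate.
all_itls = [itl for itls in itls_per_request for itl in itls]
avg_itl = sum(all_itls) / len(all_itls)

print(f"Avg TPOT (request-weighted): {avg_tpot:.2f} ms")
print(f"Avg ITL  (token-weighted):   {avg_itl:.2f} ms")
```

With these hypothetical numbers the request-weighted average TPOT comes out around 7.0 ms, while the token-weighted average ITL stays near 6.0 ms, because the long 128-token request dominates the token-weighted mean.
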
@@ -267,28 +295,28 @@ python -m tensorrt_llm.serve.scripts.benchmark_serving \
 Below is some example TensorRT-LLM serving benchmark output. Your actual results may vary.
 ```
 ============ Serving Benchmark Result ============
-Successful requests: 1
-Benchmark duration (s): 0.83
-Total input tokens: 128
-Total generated tokens: 128
-Request throughput (req/s): 1.20
-Output token throughput (tok/s): 153.92
-Total Token throughput (tok/s): 307.85
-User throughput (tok/s): 154.15
-Mean Request AR: 0.9845
-Median Request AR: 0.9845
+Successful requests: 1
+Benchmark duration (s): 0.83
+Total input tokens: 128
+Total generated tokens: 128
+Request throughput (req/s): 1.20
+Output token throughput (tok/s): 153.92
+Total Token throughput (tok/s): 307.85
+User throughput (tok/s): 154.15
+Mean Request AR: 0.9845
+Median Request AR: 0.9845
 ---------------Time to First Token----------------
-Mean TTFT (ms): 84.03
-Median TTFT (ms): 84.03
-P99 TTFT (ms): 84.03
+Mean TTFT (ms): 84.03
+Median TTFT (ms): 84.03
+P99 TTFT (ms): 84.03
 -----Time per Output Token (excl. 1st token)------
-Mean TPOT (ms): 5.88
-Median TPOT (ms): 5.88
-P99 TPOT (ms): 5.88
+Mean TPOT (ms): 5.88
+Median TPOT (ms): 5.88
+P99 TPOT (ms): 5.88
 ---------------Inter-token Latency----------------
-Mean ITL (ms): 5.83
-Median ITL (ms): 5.88
-P99 ITL (ms): 6.14
+Mean ITL (ms): 5.83
+Median ITL (ms): 5.88
+P99 ITL (ms): 6.14
 ==================================================
 ```
 
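
As a quick sanity check of the throughput definitions against this sample output (back-of-the-envelope arithmetic on the numbers shown above, nothing newly measured):

```math
\text{Total TPS} \approx \frac{128 + 128}{0.83\ \text{s}} \approx 308\ \text{tok/s}, \qquad \text{TPS} \approx \frac{128}{0.83\ \text{s}} \approx 154\ \text{tok/s}
```

which agrees with the reported 307.85 and 153.92 tok/s once rounding of the 0.83 s duration is taken into account.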

docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md

Lines changed: 38 additions & 10 deletions
@@ -58,7 +58,7 @@ Note:
 * The command also maps port `8000` from the container to your host so you can access the LLM API endpoint from your host
 * See the <https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags> for all the available containers. The containers published in the main branch weekly have `rcN` suffix, while the monthly release with QA tests has no `rcN` suffix. Use the `rc` release to get the latest model and feature support.
 
-If you want to use latest main branch, you can choose to build from source to install TensorRT LLM, the steps refer to [https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html](https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html)
+If you want to use latest main branch, you can choose to build from source to install TensorRT LLM, the steps refer to [https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html](https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html)
 
 ### Creating the TensorRT LLM Server config
 
@@ -226,7 +226,7 @@ Refer to the wide EP [examples](https://github.com/NVIDIA/TensorRT-LLM/tree/main
 
 ### Basic Test
 
-Start a new terminal on the host to test the TensorRT LLM server you just launched.
+Start a new terminal on the host to test the TensorRT LLM server you just launched.
 
 You can query the health/readiness of the server using:
 
@@ -354,7 +354,7 @@ If you want to save the results to a file add the following options.
 --result-filename "concurrency_${concurrency}.json"
 ```
 
-For more benchmarking options see [benchmark_serving.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/serve/scripts/benchmark_serving.py)
+For more benchmarking options see [benchmark_serving.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/serve/scripts/benchmark_serving.py)
 
 Run `bench.sh` to begin a serving benchmark. This will take a long time if you run all the concurrencies mentioned in the above `bench.sh` script.
 
@@ -395,13 +395,41 @@ P99 E2EL (ms): [result]
 
 ### Key Metrics
 
-* Median Time to First Token (TTFT)
+#### Time to First Token (TTFT)
 * The typical time elapsed from when a request is sent until the first output token is generated.
-* Median Time Per Output Token (TPOT)
-* The typical time required to generate each token *after* the first one.
-* Median Inter-Token Latency (ITL)
-* The typical time delay between the completion of one token and the completion of the next.
-* Median End-to-End Latency (E2EL)
+
+#### Time Per Output Token (TPOT) and Inter-Token Latency (ITL)
+* TPOT is the typical time required to generate each token *after* the first one.
+* ITL is the typical time delay between the completion of one token and the completion of the next.
+* Both TPOT and ITL ignore TTFT.
+
+For a single request, ITLs are the time intervals between tokens, while TPOT is the average of those intervals:
+
+```math
+\text{TPOT (1\ request)} = \text{Avg(ITL)} = \frac{\text{E2E\ latency} - \text{TTFT}}{\text{\#Output\ Tokens} - 1}
+```
+
+Across different requests, **average TPOT** is the mean of each request's TPOT (all requests weighted equally), while **average ITL** is token-weighted (all tokens weighted equally):
+
+```math
+\text{Avg TPOT (N requests)} = \frac{\text{TPOT}_1 + \text{TPOT}_2 + \cdots + \text{TPOT}_N}{N}
+```
+
+```math
+\text{Avg ITL (N requests)} = \frac{\text{Sum of all ITLs across requests}}{\text{\#Output Tokens across requests}}
+```
+
+#### End-to-End (E2E) Latency
 * The typical total time from when a request is submitted until the final token of the response is received.
-* Total Token Throughput
+
+#### Total Token Throughput
 * The combined rate at which the system processes both input (prompt) tokens and output (generated) tokens.
+```math
+\text{Total\ TPS} = \frac{\text{\#Input\ Tokens}+\text{\#Output\ Tokens}}{T_{last} - T_{first}}
+```
+
+#### Tokens Per Second (TPS) or Output Token Throughput
+* how many output tokens the system generates each second.
+```math
+\text{TPS} = \frac{\text{\#Output\ Tokens}}{T_{last} - T_{first}}
+```

docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md

Lines changed: 37 additions & 9 deletions
@@ -233,7 +233,7 @@ TODO: Use Chat Compeletions API / Responses API as the example after the PR is m
 We use OpenAI's official evaluation tool to test the model's accuracy. For more information see [https://github.com/openai/gpt-oss/tree/main/gpt_oss/evals](gpt-oss-eval).
 With the added support of Chat Completions and Responses API in `trtllm-serve,` `gpt_oss.evals` works directly without any modifications.
 
-You need to set `enable_attention_dp`, `tp_size`, `ep_size`, `max_batch_size` and `max_num_tokens` when launching the trtllm server and set `reasoning-effort` when launching evaluation in gpt-oss. Below are some reference configurations for accuracy evaluation on B200.
+You need to set `enable_attention_dp`, `tp_size`, `ep_size`, `max_batch_size` and `max_num_tokens` when launching the trtllm server and set `reasoning-effort` when launching evaluation in gpt-oss. Below are some reference configurations for accuracy evaluation on B200.
 
 | **reasoning-effort** | **parallel configuration** | **max_batch_size** | **max_num_tokens** |
 |:--------------------:|:--------------------------:|:------------------:|:------------------:|
@@ -300,7 +300,7 @@ If you want to save the results to a file add the following options.
 --result-filename "concurrency_${concurrency}.json"
 ```
 
-For more benchmarking options see [benchmark_serving.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/serve/scripts/benchmark_serving.py)
+For more benchmarking options see [benchmark_serving.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/serve/scripts/benchmark_serving.py)
 
 Run `bench.sh` to begin a serving benchmark. This will take a long time if you run all the concurrencies mentioned in the above `bench.sh` script.
 
@@ -341,13 +341,41 @@ P99 E2EL (ms): [result]
 
 ### Key Metrics
 
-* Median Time to First Token (TTFT)
+#### Time to First Token (TTFT)
 * The typical time elapsed from when a request is sent until the first output token is generated.
-* Median Time Per Output Token (TPOT)
-* The typical time required to generate each token *after* the first one.
-* Median Inter-Token Latency (ITL)
-* The typical time delay between the completion of one token and the completion of the next.
-* Median End-to-End Latency (E2EL)
+
+#### Time Per Output Token (TPOT) and Inter-Token Latency (ITL)
+* TPOT is the typical time required to generate each token *after* the first one.
+* ITL is the typical time delay between the completion of one token and the completion of the next.
+* Both TPOT and ITL ignore TTFT.
+
+For a single request, ITLs are the time intervals between tokens, while TPOT is the average of those intervals:
+
+```math
+\text{TPOT (1\ request)} = \text{Avg(ITL)} = \frac{\text{E2E\ latency} - \text{TTFT}}{\text{\#Output\ Tokens} - 1}
+```
+
+Across different requests, **average TPOT** is the mean of each request's TPOT (all requests weighted equally), while **average ITL** is token-weighted (all tokens weighted equally):
+
+```math
+\text{Avg TPOT (N requests)} = \frac{\text{TPOT}_1 + \text{TPOT}_2 + \cdots + \text{TPOT}_N}{N}
+```
+
+```math
+\text{Avg ITL (N requests)} = \frac{\text{Sum of all ITLs across requests}}{\text{\#Output Tokens across requests}}
+```
+
+#### End-to-End (E2E) Latency
 * The typical total time from when a request is submitted until the final token of the response is received.
-* Total Token Throughput
+
+#### Total Token Throughput
 * The combined rate at which the system processes both input (prompt) tokens and output (generated) tokens.
+```math
+\text{Total\ TPS} = \frac{\text{\#Input\ Tokens}+\text{\#Output\ Tokens}}{T_{last} - T_{first}}
+```
+
+#### Tokens Per Second (TPS) or Output Token Throughput
+* how many output tokens the system generates each second.
+```math
+\text{TPS} = \frac{\text{\#Output\ Tokens}}{T_{last} - T_{first}}
+```
