docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md (38 additions, 10 deletions)
@@ -58,7 +58,7 @@ Note:
* The command also maps port `8000` from the container to your host, so you can reach the LLM API endpoint from the host machine
* See <https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags> for all available containers. Containers published weekly from the main branch carry an `rcN` suffix, while the monthly releases that go through QA testing have no `rcN` suffix. Use an `rc` release to get the latest model and feature support.
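The `docker run` command these notes describe sits above this hunk's context, so it is not shown here. A minimal sketch of such a launch, assuming the NGC `release` image from the catalog link above (the tag is a placeholder):

```shell
# Illustrative sketch only: start a TensorRT LLM release container and map port 8000
# so the OpenAI-compatible endpoint is reachable from the host.
# The tag is a placeholder; pick one from the NGC catalog page linked above.
docker run --rm -it \
  --gpus all \
  --ipc=host \
  -p 8000:8000 \
  nvcr.io/nvidia/tensorrt-llm/release:<tag>
```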
-If you want to use latest main branch, you can choose to build from source to install TensorRT LLM, the steps refer to [https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html](https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html)
+If you want to use the latest main branch, you can build TensorRT LLM from source; the steps are described at [https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html](https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html)
### Creating the TensorRT LLM Server config
@@ -226,7 +226,7 @@ Refer to the wide EP [examples](https://github.com/NVIDIA/TensorRT-LLM/tree/main
### Basic Test
-Start a new terminal on the host to test the TensorRT LLM server you just launched.
+Start a new terminal on the host to test the TensorRT LLM server you just launched.
You can query the health/readiness of the server using:
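The code block that follows this sentence in the source file is outside the diff context. A minimal sketch of such a check, assuming the server is reachable on the host's mapped port 8000 and exposes a `/health` route:

```shell
# Prints the HTTP status code; expect 200 once the server is up and ready.
# Host, port, and the /health path are assumptions based on this guide's setup.
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/health
```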
@@ -354,7 +354,7 @@ If you want to save the results to a file add the following options.
-For more benchmarking options see [benchmark_serving.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/serve/scripts/benchmark_serving.py)
+For more benchmarking options, see [benchmark_serving.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/serve/scripts/benchmark_serving.py)
Run `bench.sh` to begin a serving benchmark. This will take a long time if you run all the concurrencies mentioned in the above `bench.sh` script.
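The `bench.sh` script itself is not part of this hunk; a loop of roughly this shape is what the text refers to. The flag names passed to `benchmark_serving.py` below are assumptions and should be verified against the script's `--help` output:

```shell
# Illustrative only: sweep a few concurrency levels and write one result file per run.
# Flag names are assumptions; confirm them with `python benchmark_serving.py --help`.
for concurrency in 1 8 32 128; do
  python -m tensorrt_llm.serve.scripts.benchmark_serving \
    --model "${MODEL}" \
    --dataset-name random \
    --num-prompts $((concurrency * 10)) \
    --max-concurrency "${concurrency}" \
    --save-result \
    --result-filename "bench_concurrency_${concurrency}.json"
done
```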
@@ -395,13 +395,41 @@ P99 E2EL (ms): [result]
### Key Metrics
-* Median Time to First Token (TTFT)
+#### Time to First Token (TTFT)
* The typical time elapsed from when a request is sent until the first output token is generated.
-* Median Time Per Output Token (TPOT)
-* The typical time required to generate each token *after* the first one.
-* Median Inter-Token Latency (ITL)
-* The typical time delay between the completion of one token and the completion of the next.
-* Median End-to-End Latency (E2EL)
+
+#### Time Per Output Token (TPOT) and Inter-Token Latency (ITL)
+* TPOT is the typical time required to generate each token *after* the first one.
+* ITL is the typical time delay between the completion of one token and the completion of the next.
+* Both TPOT and ITL ignore TTFT.
+
+For a single request, ITLs are the time intervals between tokens, while TPOT is the average of those intervals:
+Across different requests, **average TPOT** is the mean of each request's TPOT (all requests weighted equally), while **average ITL** is token-weighted (all tokens weighted equally):
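The formulas these two sentences introduce fall outside the visible diff. One consistent way to write them, assuming request r produces n_r output tokens at times t_{r,1} < ... < t_{r,n_r} (the notation is ours, not the guide's):

```latex
% Per-request definitions: ITL is each inter-token gap, TPOT is their mean.
\mathrm{ITL}_{r,i} = t_{r,i} - t_{r,i-1}, \qquad
\mathrm{TPOT}_r = \frac{1}{n_r - 1}\sum_{i=2}^{n_r}\mathrm{ITL}_{r,i}
              = \frac{t_{r,n_r} - t_{r,1}}{n_r - 1}

% Averaging over R requests: TPOT weights requests equally, ITL weights token intervals equally.
\overline{\mathrm{TPOT}} = \frac{1}{R}\sum_{r=1}^{R}\mathrm{TPOT}_r, \qquad
\overline{\mathrm{ITL}} = \frac{\sum_{r=1}^{R}\sum_{i=2}^{n_r}\mathrm{ITL}_{r,i}}{\sum_{r=1}^{R}\left(n_r - 1\right)}
```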
docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md (37 additions, 9 deletions)
@@ -233,7 +233,7 @@ TODO: Use Chat Compeletions API / Responses API as the example after the PR is m
We use OpenAI's official evaluation tool to test the model's accuracy. For more information see [gpt-oss-eval](https://github.com/openai/gpt-oss/tree/main/gpt_oss/evals).
With the added support of Chat Completions and Responses API in `trtllm-serve`, `gpt_oss.evals` works directly without any modifications.
-You need to set `enable_attention_dp`, `tp_size`, `ep_size`, `max_batch_size` and `max_num_tokens` when launching the trtllm server and set `reasoning-effort` when launching evaluation in gpt-oss. Below are some reference configurations for accuracy evaluation on B200.
+You need to set `enable_attention_dp`, `tp_size`, `ep_size`, `max_batch_size`, and `max_num_tokens` when launching the trtllm server, and set `reasoning-effort` when launching the evaluation in gpt-oss. Below are some reference configurations for accuracy evaluation on B200.
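The reference configurations the sentence above points to are not visible in this hunk. A minimal sketch of how the pieces fit together, with placeholder values rather than the actual B200 reference settings; the exact flag names should be checked against `trtllm-serve --help` and the gpt-oss evals documentation:

```shell
# Illustrative only: values are placeholders, not the reference configuration.
# enable_attention_dp is assumed to live in the extra LLM API options YAML.
cat > extra_llm_api_options.yml <<'EOF'
enable_attention_dp: true
EOF

trtllm-serve <gpt-oss-model-path> \
  --tp_size 8 --ep_size 8 \
  --max_batch_size 128 --max_num_tokens 16384 \
  --extra_llm_api_options extra_llm_api_options.yml

# In a second terminal, run the evaluation with the matching reasoning effort
# (flag name assumed from the sentence above).
python -m gpt_oss.evals --reasoning-effort low
```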
-For more benchmarking options see [benchmark_serving.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/serve/scripts/benchmark_serving.py)
+For more benchmarking options, see [benchmark_serving.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/serve/scripts/benchmark_serving.py)
Run `bench.sh` to begin a serving benchmark. This will take a long time if you run all the concurrencies mentioned in the above `bench.sh` script.
@@ -341,13 +341,41 @@ P99 E2EL (ms): [result]
### Key Metrics
-* Median Time to First Token (TTFT)
+#### Time to First Token (TTFT)
* The typical time elapsed from when a request is sent until the first output token is generated.
-* Median Time Per Output Token (TPOT)
-* The typical time required to generate each token *after* the first one.
-* Median Inter-Token Latency (ITL)
-* The typical time delay between the completion of one token and the completion of the next.
-* Median End-to-End Latency (E2EL)
+
+#### Time Per Output Token (TPOT) and Inter-Token Latency (ITL)
+* TPOT is the typical time required to generate each token *after* the first one.
+* ITL is the typical time delay between the completion of one token and the completion of the next.
+* Both TPOT and ITL ignore TTFT.
+
+For a single request, ITLs are the time intervals between tokens, while TPOT is the average of those intervals:
+Across different requests, **average TPOT** is the mean of each request's TPOT (all requests weighted equally), while **average ITL** is token-weighted (all tokens weighted equally):