Commit e666a70

[None][doc] add visualization of perf metrics in time breakdown tool doc (#8530)
Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com>
1 parent 6ee1c87 commit e666a70

4 files changed: +13 -1 lines changed

docs/source/commands/trtllm-serve/run-benchmark-with-trtllm-serve.md

Lines changed: 5 additions & 1 deletion
@@ -1,6 +1,6 @@
 # Run benchmarking with `trtllm-serve`
 
-TensorRT LLM provides the OpenAI-compatiable API via `trtllm-serve` command.
+TensorRT LLM provides the OpenAI-compatible API via `trtllm-serve` command.
 A complete reference for the API is available in the [OpenAI API Reference](https://platform.openai.com/docs/api-reference).
 
 This step-by-step tutorial covers the following topics for running online serving benchmarking with Llama 3.1 70B and Qwen2.5-VL-7B for multimodal models:
@@ -190,6 +190,10 @@ Across different requests, **average TPOT** is the mean of each request's TPOT (
 \text{TPS} = \frac{\text{\#Output\ Tokens}}{T_{last} - T_{first}}
 ```
 
+### Request Time Breakdown
+
+For more detailed metrics beyond the key metrics above, there is an [experimental tool](https://github.com/NVIDIA/TensorRT-LLM/tree/main/tensorrt_llm/serve/scripts/time_breakdown) for request time breakdown.
+
 ## About `extra_llm_api_options`
 trtllm-serve provides `extra_llm_api_options` knob to **overwrite** the parameters specified by trtllm-serve.
 Generally, We create a YAML file that contains various performance switches.
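The hunk above refers to `extra_llm_api_options` without showing a concrete file, so here is a minimal sketch of what such a YAML and the serve invocation might look like. The option names `kv_cache_config.free_gpu_memory_fraction` and `enable_chunked_prefill` are illustrative LLM API arguments rather than values taken from this doc, and the `--extra_llm_api_options` flag spelling is assumed from the knob name; `perf_metrics_max_requests` is the key this same commit documents in the time breakdown README below.

```bash
# Sketch: write an extra-llm-api-config.yaml and point trtllm-serve at it.
# Option names are illustrative; consult the LLM API reference for the exact
# set supported by your TensorRT LLM version.
cat > extra-llm-api-config.yaml <<'EOF'
kv_cache_config:
  free_gpu_memory_fraction: 0.9   # illustrative performance switch
enable_chunked_prefill: true      # illustrative performance switch
perf_metrics_max_requests: 1000   # knob documented in the time breakdown README
EOF

# Flag spelling assumed from the knob name; <model> is a placeholder.
trtllm-serve <model> --extra_llm_api_options extra-llm-api-config.yaml
```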

docs/source/developer-guide/perf-benchmarking.md

Lines changed: 2 additions & 0 deletions
@@ -18,6 +18,8 @@ easier for users to reproduce our officially published [performance overview](./
 the [in-flight batching section](../features/attention.md#inflight-batching) that describes the concept
 in further detail.
 
+To benchmark the OpenAI-compatible `trtllm-serve`, please refer to the [run benchmarking with `trtllm-serve`](../commands/trtllm-serve/run-benchmark-with-trtllm-serve.md) section.
+
 ## Before Benchmarking
 
 For rigorous benchmarking where consistent and reproducible results are critical, proper GPU configuration is essential. These settings help maximize GPU utilization, eliminate performance variability, and ensure optimal conditions for accurate measurements. While not strictly required for normal operation, we recommend applying these configurations when conducting performance comparisons or publishing benchmark results.
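The "proper GPU configuration" mentioned in that context usually comes down to enabling persistence mode and locking clocks so run-to-run variance is minimized. The commands below are a hedged sketch of that kind of setup rather than the doc's official checklist; the clock values are placeholders to be picked from your GPU's supported clocks.

```bash
# Illustrative GPU-pinning steps for reproducible benchmark runs.
sudo nvidia-smi -pm 1                         # enable persistence mode
nvidia-smi -q -d SUPPORTED_CLOCKS             # list clocks your GPU supports
sudo nvidia-smi -lgc <min_clock>,<max_clock>  # lock graphics clocks (placeholders)
# ... run the benchmark ...
sudo nvidia-smi -rgc                          # reset graphics clocks afterwards
```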

tensorrt_llm/serve/scripts/time_breakdown/README.md

Lines changed: 6 additions & 0 deletions
@@ -73,6 +73,11 @@ The tool aims to track detailed timing segments throughout the request lifecycle
 - **Time Period**: `gen_server_first_token_time` → `disagg_server_first_token_time`
 - **Description**: Routing overhead from generation server back through disagg server
 - **Includes**: Response forwarding, aggregation
+
+#### Visualization of Disaggregated Server Metrics
+The timepoints are recorded internally by TensorRT LLM's per-request performance metrics (also available via the LLM API) and the OpenAI-compatible server.
+![Visualization of Disaggregated Metrics](images/perf_metrics_timepoints.png)
+
 ## Input Format
 
 The tool expects a JSON file containing an array of request performance metrics (unit: seconds).
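As a quick sanity check on such a file, the snippet below sketches how one might pull out the disagg-routing overhead described in the hunk above. It assumes each array entry exposes the timepoints as top-level numeric fields named exactly as in this README (`gen_server_first_token_time`, `disagg_server_first_token_time`) and that the file is called `perf_metrics.json`; the real schema and filename are defined by the tool, so treat this purely as an illustration.

```bash
# Hypothetical: average routing overhead (seconds) between the generation
# server's first token and the disagg server's first token, given the JSON
# layout assumed above.
jq '[ .[]
      | select(.gen_server_first_token_time and .disagg_server_first_token_time)
      | .disagg_server_first_token_time - .gen_server_first_token_time ]
    | {count: length,
       avg_overhead_s: (if length > 0 then add / length else null end)}' perf_metrics.json
```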
@@ -139,6 +144,7 @@ Set
 perf_metrics_max_requests: <INTEGER>
 ```
 in the `extra-llm-api-config.yaml`. If you are running disaggregated serving, you should add configs for all servers (disagg, context and generation server).
+The server keeps at most `perf_metrics_max_requests` entries.
 
 Step 2:
 Add `--save-request-time-breakdown` when running `benchmark_serving.py`
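Putting the two steps together, an end-to-end sketch might look like the following. Only the `perf_metrics_max_requests` key and the `--save-request-time-breakdown` flag come from this diff; everything else is a placeholder for whatever arguments your benchmark run already uses.

```bash
# Step 1: every server involved (disagg, context, generation) should have
# something like this in its extra-llm-api-config.yaml:
#   perf_metrics_max_requests: 1000
#
# Step 2: ask the benchmark script to dump the time breakdown;
# <other args> stands in for your existing benchmark_serving.py arguments.
python benchmark_serving.py <other args> --save-request-time-breakdown
```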
tensorrt_llm/serve/scripts/time_breakdown/images/perf_metrics_timepoints.png

126 KB binary image (not rendered in the diff view)
