[None][feat] Add production-level Prometheus metrics (iteration stats, config info, token counters, phase histograms)#12545

Open
nvyutwu wants to merge 6 commits into NVIDIA:main from nvyutwu:yutwu/add-metrics

Conversation

@nvyutwu nvyutwu commented Mar 25, 2026

Summary

Add production-level Prometheus metrics to close the observability gap with vLLM/SGLang (ref: #9779).

New metrics (5 commits):

  • Iteration-level metrics (28 new): queue load, memory usage, KV cache blocks, inflight batching, speculative decoding stats — all previously JSON-only, now Prometheus-scrapable
  • Config info gauges (4 new): model config, parallel config, speculative decoding config, KV cache config — logged once at startup with config values as labels
  • Spec decode metric type fix: num_draft_tokens and num_accepted_tokens changed from Gauge to Counter to enable rate() queries
  • Per-request metrics (5 new):
    • trtllm_prompt_tokens_total (Counter): cumulative input tokens
    • trtllm_generation_tokens_total (Counter): cumulative output tokens
    • trtllm_request_prefill_time_seconds (Histogram): prefill phase duration
    • trtllm_request_decode_time_seconds (Histogram): decode phase duration
    • trtllm_request_inference_time_seconds (Histogram): total inference duration
  • Bug fixes: model_name/engine_type labels (were hardcoded "undefined"), REQUEST_QUEUE_TIME=0 silently dropped
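The REQUEST_QUEUE_TIME=0 fix above comes down to Python truthiness: a minimal, self-contained sketch (not the actual collector code) of the failure mode and the fix:

```python
# Hypothetical sketch of the zero-queue-time bug: a truthiness check on a
# walrus assignment silently drops a legitimate 0.0 observation.
metrics = {"REQUEST_QUEUE_TIME": 0.0}

# Buggy pattern: 0.0 is falsy, so the histogram never sees the sample.
observed_buggy = []
if (v := metrics.get("REQUEST_QUEUE_TIME")):
    observed_buggy.append(v)

# Fixed pattern: distinguish "missing" from "zero".
observed_fixed = []
if (v := metrics.get("REQUEST_QUEUE_TIME")) is not None:
    observed_fixed.append(v)

print(observed_buggy)  # []
print(observed_fixed)  # [0.0]
```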

Architecture:

  • Extracted _process_req_perf_metrics to tensorrt_llm/metrics/perf_utils.py (no GPU deps, fully unit-testable without TensorRT-LLM install)
  • Fine-grained prefill histogram buckets (1ms min) vs coarse decode/inference buckets
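The fine-vs-coarse bucket split can be illustrated with a small sketch; the boundary values and `bucket_index` helper below are illustrative assumptions, not the actual buckets defined in collector.py:

```python
# Illustrative bucket boundaries only (assumptions, not the real values):
# prefill buckets start at 1 ms, decode/inference buckets are much coarser.
PREFILL_BUCKETS = [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, 2.5]  # seconds
DECODE_BUCKETS = [0.5, 1.0, 2.5, 5.0, 10.0, 30.0, 60.0]           # seconds

def bucket_index(value, buckets):
    """Return the index of the first bucket upper bound >= value.

    Values above the last bound fall into the implicit +Inf bucket.
    """
    for i, upper_bound in enumerate(buckets):
        if value <= upper_bound:
            return i
    return len(buckets)  # +Inf bucket

# A 3 ms prefill lands in a fine-grained bucket; the same duration would
# collapse into the lowest decode bucket, losing all resolution.
print(bucket_index(0.003, PREFILL_BUCKETS))  # 1
print(bucket_index(0.003, DECODE_BUCKETS))   # 0
```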

Changes

| File | Lines | Change |
| --- | --- | --- |
| tensorrt_llm/metrics/collector.py | +427/-24 | New counters, histograms, gauges, config info, queue time fix |
| tensorrt_llm/metrics/perf_utils.py | +103 | New: GPU-free per-request metrics computation |
| tensorrt_llm/metrics/enums.py | +19 | New MetricNames + copyright header |
| tensorrt_llm/metrics/__init__.py | +4 | Re-export process_req_perf_metrics |
| tensorrt_llm/executor/result.py | +6/-27 | Replace inlined metrics logic with perf_utils import |
| tensorrt_llm/serve/openai_server.py | +79/-2 | Fix labels, add config info extraction |
| tests/unittest/metrics/test_collector.py | +575 | 41+ unit tests |

Total: +1188/-54 lines across 8 files

Test Plan

41+ unit tests in tests/unittest/metrics/test_collector.py covering:

  • Iteration stats (queue load, memory, batch size, KV cache, inflight batching, spec decode)
  • Config info gauges (model, parallel, speculative, cache configs)
  • Token counters (accumulation, missing/zero counts, multi-response suppression)
  • Phase histograms (prefill, decode, inference durations; missing times; no-finish-reason guard)
  • process_req_perf_metrics (timestamp computation, clock-skew filtering, zero queue time, edge cases)
pytest tests/unittest/metrics/test_collector.py -v

No GPU required. No model weights required.
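As a rough illustration of the phase timings under test, each phase is a difference of two timestamps, with negative clock-skew deltas filtered out. This is a standalone sketch, not the real process_req_perf_metrics, which lives in tensorrt_llm/metrics/perf_utils.py and reads RequestEventTiming keys rather than plain arguments:

```python
# Standalone sketch of the phase-time arithmetic the tests cover.
def phase_metrics(first_scheduled, first_token, last_token):
    raw = {
        "prefill_time": first_token - first_scheduled,   # prefill phase
        "decode_time": last_token - first_token,         # decode phase
        "inference_time": last_token - first_scheduled,  # total inference
    }
    # Clock skew between processes can produce negative deltas; drop them.
    return {k: v for k, v in raw.items() if v >= 0}

m = phase_metrics(first_scheduled=10.0, first_token=10.2, last_token=11.0)
# prefill ~0.2 s, decode ~0.8 s, inference ~1.0 s
skewed = phase_metrics(first_scheduled=10.0, first_token=9.9, last_token=11.0)
# negative prefill delta is filtered out; the valid phases survive
```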

Verification

  • Unit tests pass: pytest tests/unittest/metrics/test_collector.py -v
  • /metrics endpoint exposes new metrics when serving with trtllm-serve
  • Metric names use trtllm_ prefix; semantics match vLLM equivalents
  • No regression in existing per-request metrics (e2e, ttft, tpot, queue time)
  • trtllm_request_queue_time_seconds correctly records REQUEST_QUEUE_TIME=0

Pre-Landing Review

Eng Review: CLEAR (0 issues). Convention fixes applied (DCO sign-off, title format).

Related: #9779

Signed-off-by: Yuting Wu (DLAlgo) <yutwu@nvidia.com>

Summary by CodeRabbit

Release Notes

  • New Features

    • Expanded performance metrics collection to include prefill time, decode time, inference time, prompt token count, and generation token count.
    • Added configuration information logging for model, parallelism, speculative decoding, and KV cache settings.
    • Improved metrics initialization with real server configuration values.
  • Tests

    • Added comprehensive unit test coverage for metrics collection and performance metrics functionality.

nvyutwu added 6 commits March 24, 2026 21:21
Adds 28 new Prometheus metrics (27 gauges + 1 counter) to MetricsCollector,
exposing already-collected iteration stats that were previously only available
as JSON via the /metrics endpoint. This enables Prometheus/OTel scrapers
(including Dynamo's OTel bridge) to collect queue load, memory usage, KV cache
blocks, inflight batching, and speculative decoding stats.

Metric names use trtllm_ prefix and match vLLM/SGLang conventions after
Dynamo strips the prefix (e.g. trtllm_num_requests_running -> num_requests_running).

Signed-off-by: nvyutwu <yutwu@nvidia.com>
Adds 4 info-style Prometheus gauges (model_config_info, parallel_config_info,
speculative_config_info, cache_config_info) logged once at startup with
configuration values as labels. Matches the vLLM/SGLang config info pattern
to enable Dynamo/OTel visibility into model dtype, quantization, TP/PP sizes,
speculative decoding method, KV cache settings, and GPU type.

Signed-off-by: nvyutwu <yutwu@nvidia.com>
…e to Counter

Per-iteration numDraftTokens and numAcceptedTokens are cumulative values
that should be accumulated, not overwritten each iteration. This enables
PromQL rate() queries for computing acceptance rate:

  rate(trtllm_spec_decode_num_accepted_tokens_total[5m]) /
  rate(trtllm_spec_decode_num_draft_tokens_total[5m])

acceptance_length and draft_overhead remain as Gauges (they are ratios).
Follows the same pattern as num_requests_completed_total.
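The Gauge-vs-Counter distinction can be shown with a toy sketch (these stand-in classes are illustrative, not prometheus_client itself): a Gauge overwritten each iteration only ever exposes the latest per-iteration value, while a Counter accumulates, which is what makes PromQL rate() meaningful:

```python
# Toy stand-ins for Prometheus metric types (assumptions for illustration).
class ToyCounter:
    def __init__(self):
        self.value = 0.0
    def inc(self, amount):
        self.value += amount  # accumulates across iterations

class ToyGauge:
    def __init__(self):
        self.value = 0.0
    def set(self, amount):
        self.value = amount  # overwrites; history is lost

draft_counter, draft_gauge = ToyCounter(), ToyGauge()
for per_iteration_draft_tokens in [4, 4, 4]:
    draft_counter.inc(per_iteration_draft_tokens)
    draft_gauge.set(per_iteration_draft_tokens)

print(draft_counter.value)  # 12.0 — rate() over this yields tokens/sec
print(draft_gauge.value)    # 4   — only the last iteration is visible
```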

Signed-off-by: Yuting Wu (DLAlgo) <yutwu@nvidia.com>
Signed-off-by: nvyutwu <yutwu@nvidia.com>
…rometheus metrics

Add Step 1 metrics to close the gap with vLLM/SGLang:

New Prometheus metrics:
- trtllm_prompt_tokens_total (Counter): cumulative input tokens
- trtllm_generation_tokens_total (Counter): cumulative output tokens
- trtllm_request_prefill_time_seconds (Histogram): first_token_time - first_scheduled_time
- trtllm_request_decode_time_seconds (Histogram): last_token_time - first_token_time
- trtllm_request_inference_time_seconds (Histogram): last_token_time - first_scheduled_time

Other changes:
- Fix model_name/engine_type labels (were hardcoded "undefined")
- Move _process_req_perf_metrics to tensorrt_llm/metrics/perf_utils.py (no GPU deps,
  fully unit-testable without a TensorRT-LLM install)
- Guard REQUEST_QUEUE_TIME=0 as a valid observation; filter negatives (clock skew)
- Use fine-grained prefill buckets (1ms min) vs coarse decode/inference buckets
- 41 unit tests covering new metrics, edge cases (clock skew, zero queue time,
  single-token output, missing timestamps)

Related: NVIDIA#9779
Signed-off-by: Yuting Wu (DLAlgo) <yutwu@nvidia.com>
- Fix walrus operator silently dropping REQUEST_QUEUE_TIME=0 in
  log_request_metrics_dict (use 'is not None' instead of truthiness check)
- Remove duplicate stat: dict = {} dead code in perf_utils.py
- Strengthen return type annotation: dict[MetricNames, float | int]
- Fix docstring: note REQUEST_QUEUE_TIME=0 exception to the v>0 filter
- Add NVIDIA copyright header to enums.py (was missing on modified file)
- Move imports to module top in test_collector.py; remove unused
  CollectorRegistry import; update stale docstring
- Add test: queue_time=0 is recorded to Prometheus histogram at collector layer
- Strengthen clock-skew test: assert non-negative metrics still present
- Add test: negative REQUEST_QUEUE_TIME (arrival > first_scheduled) is dropped

Signed-off-by: Yuting Wu (DLAlgo) <yutwu@nvidia.com>
@nvyutwu nvyutwu requested a review from a team as a code owner March 25, 2026 16:23
@nvyutwu nvyutwu requested a review from zhenhuaw-me March 25, 2026 16:23
@svc-trtllm-gh-bot svc-trtllm-gh-bot added the Community want to contribute PRs initiated from Community label Mar 25, 2026

coderabbitai bot commented Mar 25, 2026

📝 Walkthrough

The pull request refactors and enhances the metrics collection system by extracting request performance metric processing into a reusable utility function, significantly expanding Prometheus instrumentation with iteration-level metrics and configuration logging capabilities, and integrating these enhancements into the OpenAI server.

Changes

  • Metrics Infrastructure (tensorrt_llm/metrics/enums.py, tensorrt_llm/metrics/__init__.py, tensorrt_llm/metrics/perf_utils.py): Added new metric name enum values (PREFILL_TIME, DECODE_TIME, INFERENCE_TIME, PROMPT_TOKENS, GENERATION_TOKENS) and created a process_req_perf_metrics() function that converts RequestEventTiming dictionaries into derived per-request metrics, computing latencies, token counts, and TPOT when timestamps are present and valid.
  • Metrics Collection Expansion (tensorrt_llm/metrics/collector.py): Extended MetricsCollector with new per-request histograms (prefill/decode/inference times), token counters, iteration-level gauges/counters (active/queued/completed requests, memory usage, batch sizes, KV cache blocks, inflight batching, speculative decoding), and added a log_config_info() method for registering configuration info metrics with Prometheus labels.
  • Result Recording Integration (tensorrt_llm/executor/result.py): Refactored GenerationResultBase.record_stats() to import and use the extracted process_req_perf_metrics(), removing the local implementation, and added prompt token count recording when applicable.
  • Server Configuration Logging (tensorrt_llm/serve/openai_server.py): Replaced placeholder MetricsCollector initialization values with actual server configuration, and added a _log_config_info_metrics() method to extract and log model, parallelism, speculative decoding, and KV cache configuration details to Prometheus.
  • Metrics Test Suite (tests/unittest/metrics/test_collector.py): Added comprehensive unit tests validating MetricsCollector.log_iteration_stats() (gauges/counters for requests, memory, batch sizes, KV cache, inflight batching, speculative decoding), log_config_info() metric registration, per-request metric handling with correct token/phase time recording, and process_req_perf_metrics() correctness across various timestamp and configuration scenarios.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — Docstring coverage is 42.62%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Title check ✅ Passed — The PR title clearly and specifically describes the main changes: adding production-level Prometheus metrics with four key additions (iteration stats, config info, token counters, phase histograms).
  • Description check ✅ Passed — The PR description is comprehensive and well-structured, covering summary, architecture, test plan with verification steps, and related issues, though some template sections (PR Checklist items, CODEOWNERS, documentation updates) are not explicitly addressed.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/executor/result.py (1)

566-574: ⚠️ Potential issue | 🟠 Major

Split request-scoped and candidate-scoped token metrics.

PROMPT_TOKENS is request-scoped, so skipping it when n > 1 undercounts multi-output requests. At the same time, process_req_perf_metrics(..., self.sampling_params.n > 1) still treats best_of > 1 / beam-search requests as single-output, so GENERATION_TOKENS and TPOT can end up reflecting whichever candidate was processed last rather than the response exposed to the caller.

Possible fix
+        has_multiple_candidates = (
+            self.sampling_params.n > 1
+            or self.sampling_params.best_of > 1
+            or self.sampling_params.use_beam_search
+        )
         processed_metrics_stat = _process_req_perf_metrics(
-            stats, len(output.token_ids), self.sampling_params.n > 1)
+            stats, len(output.token_ids), has_multiple_candidates)
         if processed_metrics_stat:
             metrics_stats.update(processed_metrics_stat)
-        if output.finish_reason and not (self.sampling_params.n > 1):
+        if output.finish_reason:
             prompt_token_ids = getattr(self, "prompt_token_ids", None)
             if prompt_token_ids is not None and len(prompt_token_ids) > 0:
                 metrics_stats[MetricNames.PROMPT_TOKENS] = len(
                     prompt_token_ids)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/executor/result.py` around lines 566 - 574, Split request-scoped
vs candidate-scoped metrics: always record PROMPT_TOKENS from
self.prompt_token_ids into metrics_stats (do not skip it when
self.sampling_params.n > 1), and change the call to
_process_req_perf_metrics(stats, len(output.token_ids), ...) to use the correct
multi-candidate flag (e.g. self.sampling_params.best_of > 1 or an explicit
best_of/beam flag) instead of self.sampling_params.n > 1 so candidate-level
metrics (GENERATION_TOKENS, TPOT) reflect per-candidate work; update the block
around _process_req_perf_metrics, metrics_stats, MetricNames.PROMPT_TOKENS, and
output.finish_reason accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 284f3bff-e2a5-4c3e-a051-082776596227

📥 Commits

Reviewing files that changed from the base of the PR and between ba7ff28 and dd4b4e5.

📒 Files selected for processing (8)
  • tensorrt_llm/executor/result.py
  • tensorrt_llm/metrics/__init__.py
  • tensorrt_llm/metrics/collector.py
  • tensorrt_llm/metrics/enums.py
  • tensorrt_llm/metrics/perf_utils.py
  • tensorrt_llm/serve/openai_server.py
  • tests/unittest/metrics/__init__.py
  • tests/unittest/metrics/test_collector.py

Comment on lines +87 to +91
    # TPOT = decode duration per output token. Requires at least 2 tokens
    # (denominator would be 0 for a single-token output).
    if output_length > 1 and not is_multiple_response:
        stat[MetricNames.TPOT] = (last_token - first_token) / (output_length -
                                                               1)

⚠️ Potential issue | 🟡 Minor

Gate TPOT on valid decode timestamps.

If LAST_TOKEN_TIME is present but FIRST_TOKEN_TIME is missing, the current 0 default yields a bogus positive TPOT that survives the final v > 0 filter. TPOT should use the same timestamp-presence check as DECODE_TIME.

Possible fix
-    if output_length > 1 and not is_multiple_response:
+    if (output_length > 1 and not is_multiple_response
+            and first_token > 0 and last_token > 0):
         stat[MetricNames.TPOT] = (last_token - first_token) / (output_length -
                                                                 1)
🧰 Tools
🪛 Flake8 (7.3.0)

[error] 91-91: continuation line over-indented for visual indent

(E127)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/metrics/perf_utils.py` around lines 87 - 91, TPOT calculation
can produce bogus positive values when LAST_TOKEN_TIME exists but
FIRST_TOKEN_TIME is missing; update the TPOT block (MetricNames.TPOT) to only
compute when both FIRST_TOKEN_TIME and LAST_TOKEN_TIME are present/valid using
the same timestamp-presence check used for DECODE_TIME (i.e., verify the
required timestamp keys or non-zero timestamps before computing (last_token -
first_token)/(output_length-1)), and keep the existing guards (output_length > 1
and not is_multiple_response) intact so TPOT is omitted when timestamps are
invalid.

Comment on lines +318 to +382
        quant_config = getattr(args, "quant_config", None)
        if quant_config is not None:
            model_config["quantization"] = str(
                quant_config.quant_algo) if quant_config.quant_algo else "none"
        else:
            model_config["quantization"] = "none"
        max_seq_len = getattr(args, "max_seq_len", None)
        if max_seq_len is not None:
            model_config["max_model_len"] = str(max_seq_len)
        try:
            import torch
            if torch.cuda.is_available():
                model_config["gpu_type"] = torch.cuda.get_device_name(0)
        except Exception:
            pass

        # Parallel config
        tp_size = getattr(args, "tensor_parallel_size", 1) or 1
        pp_size = getattr(args, "pipeline_parallel_size", 1) or 1
        parallel_config = {
            "tensor_parallel_size": str(tp_size),
            "pipeline_parallel_size": str(pp_size),
            "gpu_count": str(tp_size * pp_size),
        }
        ep_size = getattr(args, "moe_expert_parallel_size", None)
        if ep_size is not None:
            parallel_config["expert_parallel_size"] = str(ep_size)

        # Speculative decoding config
        spec_config_obj = getattr(args, "speculative_config", None) or getattr(
            args, "decoding_config", None)
        speculative_config = None
        if spec_config_obj is not None:
            speculative_config = {"spec_enabled": "true"}
            decoding_type = getattr(spec_config_obj, "decoding_type", None)
            if decoding_type is not None:
                speculative_config["spec_method"] = str(decoding_type)
            max_draft_len = getattr(spec_config_obj, "max_draft_len", None)
            if max_draft_len is not None:
                speculative_config["spec_num_tokens"] = str(max_draft_len)
            draft_model = getattr(spec_config_obj, "speculative_model", None)
            if draft_model is not None:
                speculative_config["spec_draft_model"] = str(draft_model)

        # KV cache config
        kv_cache_config = getattr(args, "kv_cache_config", None)
        cache_config = None
        if kv_cache_config is not None:
            cache_config = {}
            for field in ("page_size", "enable_block_reuse",
                          "enable_partial_reuse", "free_gpu_memory_fraction"):
                val = getattr(kv_cache_config, field, None)
                if val is not None:
                    cache_config[field] = str(val)
            kv_dtype = getattr(kv_cache_config, "dtype", None)
            if kv_dtype is not None:
                cache_config["cache_dtype"] = str(kv_dtype)

        self.metrics_collector.log_config_info(
            model_config=model_config,
            parallel_config=parallel_config,
            speculative_config=speculative_config,
            cache_config=cache_config if cache_config else None,
        )

⚠️ Potential issue | 🟠 Major

Normalize dict-backed configs before extracting labels.

These args are allowed to be plain dicts, not just model objects. quant_config.quant_algo will raise on dict input, and the later getattr(...) calls quietly drop speculative/KV-cache labels for dict-backed configs, so this path can either fail server startup or publish incomplete config metrics.

Possible fix
+        def _config_value(config: Any, field: str):
+            if isinstance(config, dict):
+                return config.get(field)
+            return getattr(config, field, None)
+
         quant_config = getattr(args, "quant_config", None)
         if quant_config is not None:
-            model_config["quantization"] = str(
-                quant_config.quant_algo) if quant_config.quant_algo else "none"
+            quant_algo = _config_value(quant_config, "quant_algo")
+            model_config["quantization"] = str(
+                quant_algo) if quant_algo else "none"
         else:
             model_config["quantization"] = "none"
@@
         if spec_config_obj is not None:
             speculative_config = {"spec_enabled": "true"}
-            decoding_type = getattr(spec_config_obj, "decoding_type", None)
+            decoding_type = _config_value(spec_config_obj, "decoding_type")
             if decoding_type is not None:
                 speculative_config["spec_method"] = str(decoding_type)
-            max_draft_len = getattr(spec_config_obj, "max_draft_len", None)
+            max_draft_len = _config_value(spec_config_obj, "max_draft_len")
             if max_draft_len is not None:
                 speculative_config["spec_num_tokens"] = str(max_draft_len)
-            draft_model = getattr(spec_config_obj, "speculative_model", None)
+            draft_model = _config_value(spec_config_obj, "speculative_model")
             if draft_model is not None:
                 speculative_config["spec_draft_model"] = str(draft_model)
@@
         if kv_cache_config is not None:
             cache_config = {}
             for field in ("page_size", "enable_block_reuse",
                           "enable_partial_reuse", "free_gpu_memory_fraction"):
-                val = getattr(kv_cache_config, field, None)
+                val = _config_value(kv_cache_config, field)
                 if val is not None:
                     cache_config[field] = str(val)
-            kv_dtype = getattr(kv_cache_config, "dtype", None)
+            kv_dtype = _config_value(kv_cache_config, "dtype")
             if kv_dtype is not None:
                 cache_config["cache_dtype"] = str(kv_dtype)
🧰 Tools
🪛 Ruff (0.15.6)

[error] 331-332: try-except-pass detected, consider logging the exception

(S110)


[warning] 331-331: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/serve/openai_server.py` around lines 318 - 382, The code assumes
quant_config, speculative_config (spec_config_obj/decoding_config) and
kv_cache_config are objects with attributes, which fails when callers pass plain
dicts; update the extraction to normalize dict-backed configs (e.g., implement a
small accessor that does value = obj.get(key) if isinstance(obj, dict) else
getattr(obj, key, None)) and use it for quant_config.quant_algo,
spec_config_obj.decoding_type/max_draft_len/speculative_model and the
kv_cache_config fields (page_size, enable_block_reuse, enable_partial_reuse,
free_gpu_memory_fraction, dtype) before building
model_config/parallel_config/speculative_config/cache_config so startup won’t
error and metrics_collector.log_config_info receives complete labels.
