[None][feat] Add production-level Prometheus metrics (iteration stats, config info, token counters, phase histograms) #12545
nvyutwu wants to merge 6 commits into NVIDIA:main from
Conversation
Adds 28 new Prometheus metrics (27 gauges + 1 counter) to MetricsCollector, exposing already-collected iteration stats that were previously only available as JSON via the /metrics endpoint. This enables Prometheus/OTel scrapers (including Dynamo's OTel bridge) to collect queue load, memory usage, KV cache blocks, inflight batching, and speculative decoding stats. Metric names use trtllm_ prefix and match vLLM/SGLang conventions after Dynamo strips the prefix (e.g. trtllm_num_requests_running -> num_requests_running). Signed-off-by: nvyutwu <yutwu@nvidia.com>
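The prefix-stripping convention described above can be sketched in a few lines. This is an illustration only, not the PR's or Dynamo's actual code; the non-`trtllm_` prefix strings and the helper name are assumptions:

```python
# Hypothetical sketch of the naming convention described above: each engine
# publishes metrics under its own prefix, and a scraper-side bridge (such as
# Dynamo's OTel bridge) strips it so dashboards can share metric names across
# backends. Prefixes other than trtllm_ are illustrative assumptions.
ENGINE_PREFIXES = ("trtllm_", "vllm:", "sglang:")

def strip_engine_prefix(metric_name: str) -> str:
    """Return the engine-neutral metric name."""
    for prefix in ENGINE_PREFIXES:
        if metric_name.startswith(prefix):
            return metric_name[len(prefix):]
    return metric_name

print(strip_engine_prefix("trtllm_num_requests_running"))  # num_requests_running
```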
Adds 4 info-style Prometheus gauges (model_config_info, parallel_config_info, speculative_config_info, cache_config_info) logged once at startup with configuration values as labels. Matches the vLLM/SGLang config info pattern to enable Dynamo/OTel visibility into model dtype, quantization, TP/PP sizes, speculative decoding method, KV cache settings, and GPU type. Signed-off-by: nvyutwu <yutwu@nvidia.com>
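The "info gauge" pattern used by these four metrics carries its payload in labels with a constant value of 1, so it is emitted once at startup rather than sampled. A minimal stdlib-only sketch of the exposition shape (the renderer and label values are illustrative, not the PR's API):

```python
# Minimal sketch of the Prometheus "info gauge" pattern: the gauge's value is
# always 1 and the configuration rides in the labels. The helper and the
# example label values here are illustrative assumptions.
def render_info_gauge(name: str, labels: dict[str, str]) -> str:
    """Render one Prometheus exposition-format line for an info-style gauge."""
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_str}}} 1"

line = render_info_gauge(
    "trtllm_model_config_info",
    {"dtype": "bfloat16", "max_model_len": "8192", "quantization": "fp8"},
)
print(line)
```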
…e to Counter

Per-iteration numDraftTokens and numAcceptedTokens are cumulative values that should be accumulated, not overwritten each iteration. This enables PromQL rate() queries for computing the acceptance rate:

rate(trtllm_spec_decode_num_accepted_tokens_total[5m]) / rate(trtllm_spec_decode_num_draft_tokens_total[5m])

acceptance_length and draft_overhead remain Gauges (they are ratios). Follows the same pattern as num_requests_completed_total. Signed-off-by: Yuting Wu (DLAlgo) <yutwu@nvidia.com> Signed-off-by: nvyutwu <yutwu@nvidia.com>
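Why the Gauge-to-Counter change matters can be shown without prometheus_client: a Gauge overwritten each iteration only remembers the last per-iteration value, while a Counter accumulates, so a scraper can take differences over any window. A stdlib-only sketch (the toy classes are assumptions standing in for the real metric types):

```python
# Sketch of Counter vs Gauge semantics with toy classes (not prometheus_client).
class Gauge:
    def __init__(self):
        self.value = 0.0
    def set(self, v):   # overwrite: per-iteration history is lost
        self.value = v

class Counter:
    def __init__(self):
        self.value = 0.0
    def inc(self, v):   # accumulate: window differences give rates
        self.value += v

draft, accepted = Counter(), Counter()
# Three iterations of speculative decoding stats: (drafted, accepted) tokens.
for num_draft, num_accepted in [(4, 3), (4, 2), (4, 4)]:
    draft.inc(num_draft)
    accepted.inc(num_accepted)

# Whole-window analogue of rate(accepted_total) / rate(draft_total):
acceptance_rate = accepted.value / draft.value
print(acceptance_rate)  # 9 accepted / 12 drafted = 0.75
```

With Gauges, only the final iteration's (4, 4) would survive, and no rate() over the window would be possible.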
…rometheus metrics

Add Step 1 metrics to close the gap with vLLM/SGLang.

New Prometheus metrics:
- trtllm_prompt_tokens_total (Counter): cumulative input tokens
- trtllm_generation_tokens_total (Counter): cumulative output tokens
- trtllm_request_prefill_time_seconds (Histogram): first_token_time - first_scheduled_time
- trtllm_request_decode_time_seconds (Histogram): last_token_time - first_token_time
- trtllm_request_inference_time_seconds (Histogram): last_token_time - first_scheduled_time

Other changes:
- Fix model_name/engine_type labels (were hardcoded "undefined")
- Move _process_req_perf_metrics to tensorrt_llm/metrics/perf_utils.py (no GPU deps, fully unit-testable without a TensorRT-LLM install)
- Guard REQUEST_QUEUE_TIME=0 as a valid observation; filter negatives (clock skew)
- Use fine-grained prefill buckets (1 ms minimum) vs coarse decode/inference buckets
- 41 unit tests covering new metrics and edge cases (clock skew, zero queue time, single-token output, missing timestamps)

Related: NVIDIA#9779 Signed-off-by: Yuting Wu (DLAlgo) <yutwu@nvidia.com>
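The three phase histograms are simple differences of per-request timestamps, following the formulas in the commit message. A sketch of the arithmetic (the helper name is an illustrative assumption; the timestamp field names come from the commit message):

```python
# Sketch of the phase arithmetic behind the three new histograms:
# prefill ends at the first token, decode runs from first to last token,
# and inference covers the whole span after scheduling.
def phase_durations(first_scheduled_time: float,
                    first_token_time: float,
                    last_token_time: float) -> dict[str, float]:
    return {
        "prefill_time": first_token_time - first_scheduled_time,
        "decode_time": last_token_time - first_token_time,
        "inference_time": last_token_time - first_scheduled_time,
    }

# A request scheduled at t=10.0s, first token at 10.2s, last token at 12.7s:
d = phase_durations(10.0, 10.2, 12.7)
print(d)
```

Note that prefill durations sit in the millisecond range while decode spans seconds, which is why the commit uses fine-grained prefill buckets and coarser decode/inference buckets.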
- Fix walrus operator silently dropping REQUEST_QUEUE_TIME=0 in
log_request_metrics_dict (use 'is not None' instead of truthiness check)
- Remove duplicate stat: dict = {} dead code in perf_utils.py
- Strengthen return type annotation: dict[MetricNames, float | int]
- Fix docstring: note REQUEST_QUEUE_TIME=0 exception to the v>0 filter
- Add NVIDIA copyright header to enums.py (was missing on modified file)
- Move imports to module top in test_collector.py; remove unused
CollectorRegistry import; update stale docstring
- Add test: queue_time=0 is recorded to Prometheus histogram at collector layer
- Strengthen clock-skew test: assert non-negative metrics still present
- Add test: negative REQUEST_QUEUE_TIME (arrival > first_scheduled) is dropped
Signed-off-by: Yuting Wu (DLAlgo) <yutwu@nvidia.com>
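The walrus-operator bug fixed in the first bullet above is easy to reproduce in isolation. A minimal sketch (the dict key mirrors the metric name; the surrounding structure is an assumption, not the actual log_request_metrics_dict code):

```python
# The truthiness bug in isolation: a queue time of 0.0 is a legitimate
# observation (the request was scheduled immediately), but a bare
# `if (v := ...)` treats 0.0 as falsy and silently drops it.
metrics = {"REQUEST_QUEUE_TIME": 0.0}

recorded_buggy = []
if (v := metrics.get("REQUEST_QUEUE_TIME")):              # 0.0 is falsy: skipped
    recorded_buggy.append(v)

recorded_fixed = []
if (v := metrics.get("REQUEST_QUEUE_TIME")) is not None:  # explicit None check
    recorded_fixed.append(v)

print(recorded_buggy, recorded_fixed)  # [] [0.0]
```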
📝 Walkthrough

The pull request refactors and enhances the metrics collection system by extracting request performance metric processing into a reusable utility function, significantly expanding Prometheus instrumentation with iteration-level metrics and configuration logging capabilities, and integrating these enhancements into the OpenAI server.

Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/executor/result.py (1)
566-574: ⚠️ Potential issue | 🟠 Major

Split request-scoped and candidate-scoped token metrics.

PROMPT_TOKENS is request-scoped, so skipping it when n > 1 undercounts multi-output requests. At the same time, process_req_perf_metrics(..., self.sampling_params.n > 1) still treats best_of > 1 / beam-search requests as single-output, so GENERATION_TOKENS and TPOT can end up reflecting whichever candidate was processed last rather than the response exposed to the caller.

Possible fix
```diff
+        has_multiple_candidates = (
+            self.sampling_params.n > 1
+            or self.sampling_params.best_of > 1
+            or self.sampling_params.use_beam_search
+        )
         processed_metrics_stat = _process_req_perf_metrics(
-            stats, len(output.token_ids), self.sampling_params.n > 1)
+            stats, len(output.token_ids), has_multiple_candidates)
         if processed_metrics_stat:
             metrics_stats.update(processed_metrics_stat)
-        if output.finish_reason and not (self.sampling_params.n > 1):
+        if output.finish_reason:
             prompt_token_ids = getattr(self, "prompt_token_ids", None)
             if prompt_token_ids is not None and len(prompt_token_ids) > 0:
                 metrics_stats[MetricNames.PROMPT_TOKENS] = len(
                     prompt_token_ids)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/executor/result.py` around lines 566 - 574, Split request-scoped vs candidate-scoped metrics: always record PROMPT_TOKENS from self.prompt_token_ids into metrics_stats (do not skip it when self.sampling_params.n > 1), and change the call to _process_req_perf_metrics(stats, len(output.token_ids), ...) to use the correct multi-candidate flag (e.g. self.sampling_params.best_of > 1 or an explicit best_of/beam flag) instead of self.sampling_params.n > 1 so candidate-level metrics (GENERATION_TOKENS, TPOT) reflect per-candidate work; update the block around _process_req_perf_metrics, metrics_stats, MetricNames.PROMPT_TOKENS, and output.finish_reason accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tensorrt_llm/metrics/perf_utils.py`:
- Around line 87-91: TPOT calculation can produce bogus positive values when
LAST_TOKEN_TIME exists but FIRST_TOKEN_TIME is missing; update the TPOT block
(MetricNames.TPOT) to only compute when both FIRST_TOKEN_TIME and
LAST_TOKEN_TIME are present/valid using the same timestamp-presence check used
for DECODE_TIME (i.e., verify the required timestamp keys or non-zero timestamps
before computing (last_token - first_token)/(output_length-1)), and keep the
existing guards (output_length > 1 and not is_multiple_response) intact so TPOT
is omitted when timestamps are invalid.
In `@tensorrt_llm/serve/openai_server.py`:
- Around line 318-382: The code assumes quant_config, speculative_config
(spec_config_obj/decoding_config) and kv_cache_config are objects with
attributes, which fails when callers pass plain dicts; update the extraction to
normalize dict-backed configs (e.g., implement a small accessor that does value
= obj.get(key) if isinstance(obj, dict) else getattr(obj, key, None)) and use it
for quant_config.quant_algo,
spec_config_obj.decoding_type/max_draft_len/speculative_model and the
kv_cache_config fields (page_size, enable_block_reuse, enable_partial_reuse,
free_gpu_memory_fraction, dtype) before building
model_config/parallel_config/speculative_config/cache_config so startup won’t
error and metrics_collector.log_config_info receives complete labels.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 284f3bff-e2a5-4c3e-a051-082776596227
📒 Files selected for processing (8)
tensorrt_llm/executor/result.py
tensorrt_llm/metrics/__init__.py
tensorrt_llm/metrics/collector.py
tensorrt_llm/metrics/enums.py
tensorrt_llm/metrics/perf_utils.py
tensorrt_llm/serve/openai_server.py
tests/unittest/metrics/__init__.py
tests/unittest/metrics/test_collector.py
```python
# TPOT = decode duration per output token. Requires at least 2 tokens
# (denominator would be 0 for a single-token output).
if output_length > 1 and not is_multiple_response:
    stat[MetricNames.TPOT] = (last_token - first_token) / (output_length -
                                                           1)
```
Gate TPOT on valid decode timestamps.
If LAST_TOKEN_TIME is present but FIRST_TOKEN_TIME is missing, the current 0 default yields a bogus positive TPOT that survives the final v > 0 filter. TPOT should use the same timestamp-presence check as DECODE_TIME.
Possible fix

```diff
-    if output_length > 1 and not is_multiple_response:
+    if (output_length > 1 and not is_multiple_response
+            and first_token > 0 and last_token > 0):
         stat[MetricNames.TPOT] = (last_token - first_token) / (output_length -
                                                                1)
```

🧰 Tools
🪛 Flake8 (7.3.0)
[error] 91-91: continuation line over-indented for visual indent
(E127)
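The gated computation the reviewer proposes can be exercised standalone. A sketch under the assumption that missing timestamps default to 0, as the comment describes (the function wrapper is illustrative; in the PR this logic lives inline in perf_utils.py):

```python
# Standalone sketch of the gated TPOT computation from the diff above:
# TPOT is only emitted when both timestamps are present (non-zero) and there
# are at least two output tokens for a single-response request.
def compute_tpot(first_token: float, last_token: float,
                 output_length: int, is_multiple_response: bool):
    if (output_length > 1 and not is_multiple_response
            and first_token > 0 and last_token > 0):
        return (last_token - first_token) / (output_length - 1)
    return None

print(compute_tpot(10.0, 12.0, 5, False))  # 0.5 seconds per output token
print(compute_tpot(0.0, 12.0, 5, False))   # None: FIRST_TOKEN_TIME missing
```

Without the non-zero guard, the second call would return 3.0, a bogus positive value that survives the final v > 0 filter.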
```python
quant_config = getattr(args, "quant_config", None)
if quant_config is not None:
    model_config["quantization"] = str(
        quant_config.quant_algo) if quant_config.quant_algo else "none"
else:
    model_config["quantization"] = "none"
max_seq_len = getattr(args, "max_seq_len", None)
if max_seq_len is not None:
    model_config["max_model_len"] = str(max_seq_len)
try:
    import torch
    if torch.cuda.is_available():
        model_config["gpu_type"] = torch.cuda.get_device_name(0)
except Exception:
    pass

# Parallel config
tp_size = getattr(args, "tensor_parallel_size", 1) or 1
pp_size = getattr(args, "pipeline_parallel_size", 1) or 1
parallel_config = {
    "tensor_parallel_size": str(tp_size),
    "pipeline_parallel_size": str(pp_size),
    "gpu_count": str(tp_size * pp_size),
}
ep_size = getattr(args, "moe_expert_parallel_size", None)
if ep_size is not None:
    parallel_config["expert_parallel_size"] = str(ep_size)

# Speculative decoding config
spec_config_obj = getattr(args, "speculative_config",
                          None) or getattr(args, "decoding_config", None)
speculative_config = None
if spec_config_obj is not None:
    speculative_config = {"spec_enabled": "true"}
    decoding_type = getattr(spec_config_obj, "decoding_type", None)
    if decoding_type is not None:
        speculative_config["spec_method"] = str(decoding_type)
    max_draft_len = getattr(spec_config_obj, "max_draft_len", None)
    if max_draft_len is not None:
        speculative_config["spec_num_tokens"] = str(max_draft_len)
    draft_model = getattr(spec_config_obj, "speculative_model", None)
    if draft_model is not None:
        speculative_config["spec_draft_model"] = str(draft_model)

# KV cache config
kv_cache_config = getattr(args, "kv_cache_config", None)
cache_config = None
if kv_cache_config is not None:
    cache_config = {}
    for field in ("page_size", "enable_block_reuse",
                  "enable_partial_reuse", "free_gpu_memory_fraction"):
        val = getattr(kv_cache_config, field, None)
        if val is not None:
            cache_config[field] = str(val)
    kv_dtype = getattr(kv_cache_config, "dtype", None)
    if kv_dtype is not None:
        cache_config["cache_dtype"] = str(kv_dtype)

self.metrics_collector.log_config_info(
    model_config=model_config,
    parallel_config=parallel_config,
    speculative_config=speculative_config,
    cache_config=cache_config if cache_config else None,
)
```
Normalize dict-backed configs before extracting labels.
These args are allowed to be plain dicts, not just model objects. quant_config.quant_algo will raise on dict input, and the later getattr(...) calls quietly drop speculative/KV-cache labels for dict-backed configs, so this path can either fail server startup or publish incomplete config metrics.
Possible fix

```diff
+    def _config_value(config: Any, field: str):
+        if isinstance(config, dict):
+            return config.get(field)
+        return getattr(config, field, None)
+
     quant_config = getattr(args, "quant_config", None)
     if quant_config is not None:
-        model_config["quantization"] = str(
-            quant_config.quant_algo) if quant_config.quant_algo else "none"
+        quant_algo = _config_value(quant_config, "quant_algo")
+        model_config["quantization"] = str(
+            quant_algo) if quant_algo else "none"
     else:
         model_config["quantization"] = "none"
@@
     if spec_config_obj is not None:
         speculative_config = {"spec_enabled": "true"}
-        decoding_type = getattr(spec_config_obj, "decoding_type", None)
+        decoding_type = _config_value(spec_config_obj, "decoding_type")
         if decoding_type is not None:
             speculative_config["spec_method"] = str(decoding_type)
-        max_draft_len = getattr(spec_config_obj, "max_draft_len", None)
+        max_draft_len = _config_value(spec_config_obj, "max_draft_len")
         if max_draft_len is not None:
             speculative_config["spec_num_tokens"] = str(max_draft_len)
-        draft_model = getattr(spec_config_obj, "speculative_model", None)
+        draft_model = _config_value(spec_config_obj, "speculative_model")
         if draft_model is not None:
             speculative_config["spec_draft_model"] = str(draft_model)
@@
     if kv_cache_config is not None:
         cache_config = {}
         for field in ("page_size", "enable_block_reuse",
                       "enable_partial_reuse", "free_gpu_memory_fraction"):
-            val = getattr(kv_cache_config, field, None)
+            val = _config_value(kv_cache_config, field)
             if val is not None:
                 cache_config[field] = str(val)
-        kv_dtype = getattr(kv_cache_config, "dtype", None)
+        kv_dtype = _config_value(kv_cache_config, "dtype")
         if kv_dtype is not None:
             cache_config["cache_dtype"] = str(kv_dtype)
```

🧰 Tools
🪛 Ruff (0.15.6)
[error] 331-332: try-except-pass detected, consider logging the exception
(S110)
[warning] 331-331: Do not catch blind exception: Exception
(BLE001)
Summary
Add production-level Prometheus metrics to close the observability gap with vLLM/SGLang (ref: #9779).
New metrics (5 commits):
- num_draft_tokens and num_accepted_tokens changed from Gauge to Counter to enable rate() queries
- trtllm_prompt_tokens_total (Counter): cumulative input tokens
- trtllm_generation_tokens_total (Counter): cumulative output tokens
- trtllm_request_prefill_time_seconds (Histogram): prefill phase duration
- trtllm_request_decode_time_seconds (Histogram): decode phase duration
- trtllm_request_inference_time_seconds (Histogram): total inference duration
- Fixed model_name/engine_type labels (were hardcoded "undefined"); REQUEST_QUEUE_TIME=0 no longer silently dropped

Architecture:
- Moved _process_req_perf_metrics to tensorrt_llm/metrics/perf_utils.py (no GPU deps, fully unit-testable without a TensorRT-LLM install)

Changes
- tensorrt_llm/metrics/collector.py
- tensorrt_llm/metrics/perf_utils.py
- tensorrt_llm/metrics/enums.py
- tensorrt_llm/metrics/__init__.py
- tensorrt_llm/executor/result.py
- tensorrt_llm/serve/openai_server.py
- tests/unittest/metrics/test_collector.py

Total: +1188/-54 lines across 8 files
Test Plan
41+ unit tests in tests/unittest/metrics/test_collector.py covering:
- process_req_perf_metrics (timestamp computation, clock-skew filtering, zero queue time, edge cases)

No GPU required. No model weights required.
Verification
- pytest tests/unittest/metrics/test_collector.py -v
- /metrics endpoint exposes new metrics when serving with trtllm-serve
- Metric names use the trtllm_ prefix; semantics match vLLM equivalents
- trtllm_request_queue_time_seconds correctly records REQUEST_QUEUE_TIME=0

Pre-Landing Review
Eng Review: CLEAR (0 issues). Convention fixes applied (DCO sign-off, title format).
Related: #9779
Signed-off-by: Yuting Wu (DLAlgo) yutwu@nvidia.com
Summary by CodeRabbit
Release Notes
New Features
Tests