
fix: handle unknown prometheus metrics #696

Open
rafiw wants to merge 1 commit into ai-dynamo:main from rafiw:unknown_metric

Conversation

@rafiw

@rafiw rafiw commented Feb 23, 2026

When asked to parse a Prometheus endpoint that publishes metrics of unknown type, the parser fails with an error. This commit treats unknown metrics as gauges and adds unit tests.

Summary by CodeRabbit

  • Bug Fixes

    • Standardized handling of metrics with unknown types, ensuring they are properly exported and computed consistently with gauge metrics across all export formats and computation processes.
  • Tests

    • Expanded test coverage for unknown metric types, including parameterized tests for various export and computation scenarios.
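The change described above amounts to adding UNKNOWN to the scalar branch of the metric-type dispatch. A minimal sketch of the idea, assuming simplified names (`series_class_for` and the string return values are illustrative stand-ins, not the actual aiperf API):

```python
from enum import Enum


class PrometheusMetricType(str, Enum):
    GAUGE = "gauge"
    COUNTER = "counter"
    HISTOGRAM = "histogram"
    UNKNOWN = "unknown"


def series_class_for(metric_type: PrometheusMetricType) -> str:
    """Pick the time-series container for a scraped metric family."""
    # Before this fix, UNKNOWN had no branch and parsing failed with an
    # error; treating it as a gauge-like scalar series makes the parser
    # tolerant of endpoints that publish untyped metrics.
    if metric_type in (
        PrometheusMetricType.GAUGE,
        PrometheusMetricType.COUNTER,
        PrometheusMetricType.UNKNOWN,  # new: handled like GAUGE
    ):
        return "ScalarTimeSeries"
    if metric_type is PrometheusMetricType.HISTOGRAM:
        return "HistogramTimeSeries"
    raise ValueError(f"unsupported metric type: {metric_type}")
```

The same routing decision is repeated in each layer the PR touches (storage, accumulation, statistics, and the three exporters).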

@github-actions

github-actions bot commented Feb 23, 2026

Try out this PR

Quick install:

pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@19e4b776998f9ee42bfb1a183e8144c718b47651

Recommended with virtual environment (using uv):

uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@19e4b776998f9ee42bfb1a183e8144c718b47651

Last updated for commit: 19e4b77

@github-actions github-actions bot added the fix label Feb 23, 2026
@coderabbitai

coderabbitai bot commented Feb 23, 2026

Warning

Rate limit exceeded

@rafiw has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 6 minutes and 36 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between 5de94a2a35eb54d6119c70a8ec0ea37cc7bcba81 and 19e4b77.

📒 Files selected for processing (12)
  • src/aiperf/server_metrics/accumulator.py
  • src/aiperf/server_metrics/csv_exporter.py
  • src/aiperf/server_metrics/export_stats.py
  • src/aiperf/server_metrics/json_exporter.py
  • src/aiperf/server_metrics/parquet_exporter.py
  • src/aiperf/server_metrics/storage.py
  • tests/server_metrics/test_storage.py
  • tests/unit/server_metrics/test_accumulator.py
  • tests/unit/server_metrics/test_csv_exporter.py
  • tests/unit/server_metrics/test_export_stats_basic.py
  • tests/unit/server_metrics/test_export_stats_edge_cases_aggressive.py
  • tests/unit/server_metrics/test_storage_edge_cases.py

Walkthrough

This PR adds comprehensive support for handling PrometheusMetricType.UNKNOWN metric types across the metrics system. UNKNOWN metrics are now consistently treated as scalar time series (GAUGE equivalents) throughout storage, accumulation, statistics computation, and export pipelines, with corresponding test coverage expansions and refactoring.

Changes

  • Storage Layer (src/aiperf/server_metrics/storage.py, tests/server_metrics/test_storage.py, tests/unit/server_metrics/test_storage_edge_cases.py):
    Maps PrometheusMetricType.UNKNOWN to ScalarTimeSeries in ServerMetricEntry.from_metric_family, aligning UNKNOWN with GAUGE and COUNTER. Test refactored to parameterized format covering all metric types.
  • Metrics Accumulation (src/aiperf/server_metrics/accumulator.py, tests/unit/server_metrics/test_accumulator.py):
    Treats UNKNOWN as GAUGE when creating per-endpoint metric data containers. Test updated with parametrization for dynamic metric type and name handling.
  • Statistics Computation (src/aiperf/server_metrics/export_stats.py, tests/unit/server_metrics/test_export_stats_basic.py, tests/unit/server_metrics/test_export_stats_edge_cases_aggressive.py):
    Routes UNKNOWN metrics to the gauge statistics path via compute_stats. Tests expanded with UNKNOWN coverage and consolidated empty-series handling into parameterized tests.
  • CSV Export Mapping (src/aiperf/server_metrics/csv_exporter.py, tests/unit/server_metrics/test_csv_exporter.py):
    Adds a STAT_KEYS_MAP entry mapping PrometheusMetricType.UNKNOWN to GAUGE_STAT_KEYS. New test validates unknown metric type mapping alignment.
  • JSON Exporter (src/aiperf/server_metrics/json_exporter.py):
    Expands _build_hybrid_metrics to treat UNKNOWN as GAUGE, instantiating UNKNOWN metrics as GaugeMetricData.
  • Parquet Exporter (src/aiperf/server_metrics/parquet_exporter.py):
    Includes UNKNOWN in gauge handling across four code paths: metadata building, row collection (both generator and non-generator), and scalar row exports, enabling a consistent scalar export pathway.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Poem

🐰 The UNKNOWN metric comes to play,
As GAUGE we route it every way!
Through CSV, JSON, Parquet too,
Consistency reigns—our job is through! ✨

🚥 Pre-merge checks | ✅ 3 passed
  • Description Check: ✅ Passed (check skipped; CodeRabbit's high-level summary is enabled)
  • Title Check: ✅ Passed (the title accurately describes the main change, handling unknown Prometheus metrics throughout the codebase, and is concise)
  • Docstring Coverage: ✅ Passed (docstring coverage is 100.00%, above the required 80.00% threshold)

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

When asked to parse a Prometheus endpoint that publishes metrics of unknown type, the parser fails with an error.
This commit treats unknown metrics as gauges and adds unit tests.

Signed-off-by: Rafi Wiener <rwiener@nvidia.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/unit/server_metrics/test_accumulator.py (1)

114-155: ⚠️ Potential issue | 🟡 Minor

Add assertions that verify UNKNOWN metrics produce gauge-equivalent output in endpoint_summaries.

The new parametrized case covers PrometheusMetricType.UNKNOWN, but the assertions only check structural properties (result type, endpoint presence, summary count). The PR's core claim — that UNKNOWN is treated identically to GAUGE — is not exercised at the assertion level: neither that metric_name exists in the per-endpoint metrics map, nor that scalar statistics (e.g. mean, min, max) are present as they would be for a GAUGE.

✅ Suggested additional assertions
     assert result.endpoint_summaries is not None
     assert len(result.endpoint_summaries) == 1
+
+    # Verify the metric is present and carries scalar (gauge-equivalent) stats
+    summary = list(result.endpoint_summaries.values())[0]
+    assert metric_name in summary.metrics
+    series = summary.metrics[metric_name].series
+    assert len(series) > 0
+    stats = series[0]
+    assert stats.mean is not None
+    assert stats.min is not None
+    assert stats.max is not None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/server_metrics/test_accumulator.py` around lines 114 - 155, The
test currently only checks structural properties but doesn't assert that UNKNOWN
metrics are converted to gauge-like scalars; update
test_export_results_with_data to verify that for the returned
ServerMetricsResults (from ServerMetricsAccumulator.export_results) the
endpoint_summaries contains the metric_name in the per-endpoint metrics map and
that the metric entry includes scalar statistics (e.g., mean, min, max, count)
matching the supplied samples; reference the
ServerMetricsResults.endpoint_summaries and the metric key (metric_name) to
assert presence and verify the scalar fields (mean/min/max/count) have the
expected numeric values for the generated samples so UNKNOWN behaves like GAUGE.
🧹 Nitpick comments (3)
src/aiperf/server_metrics/csv_exporter.py (1)

42-47: The STAT_KEYS_MAP entry for UNKNOWN is dead code and can be removed.

UNKNOWN metrics are correctly written to CSV because GaugeMetricData.type is hardcoded to PrometheusMetricType.GAUGE (not UNKNOWN). When the accumulator converts UNKNOWN metrics to GaugeMetricData, they are grouped under GAUGE in _group_metrics_by_type and exported in the GAUGE section. The STAT_KEYS_MAP entry for UNKNOWN will never be accessed since metrics are only grouped by GAUGE, COUNTER, or HISTOGRAM. Remove line 44 as unnecessary.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/aiperf/server_metrics/csv_exporter.py` around lines 42 - 47, Remove the
dead STAT_KEYS_MAP entry for PrometheusMetricType.UNKNOWN: delete the mapping
key PrometheusMetricType.UNKNOWN that points to GAUGE_STAT_KEYS inside the
STAT_KEYS_MAP definition so the map only contains GAUGE, COUNTER, and HISTOGRAM;
verify no other code refers to PrometheusMetricType.UNKNOWN in STAT_KEYS_MAP
(search for STAT_KEYS_MAP, PrometheusMetricType.UNKNOWN, GAUGE_STAT_KEYS) and
run tests to ensure CSV export behavior is unchanged.
tests/unit/server_metrics/test_export_stats_basic.py (1)

97-114: TestUnknownExportStats — logic is correct; consider also asserting the result type.

The expected values (avg=20.0, min=10.0, max=30.0) are arithmetically correct for [10.0, 20.0, 30.0]. result.stats is accessed directly after assert result is not None; if stats were None, this would surface as an AttributeError rather than a clear test-failure message. Adding assert result.stats is not None before the field assertions would give a clearer signal. Optionally, asserting isinstance(result, GaugeSeries) would also confirm the correct return type end-to-end.

♻️ Suggested improvement
         assert result is not None
+        assert result.stats is not None
+        assert isinstance(result, GaugeSeries)  # UNKNOWN routes to GaugeSeries
         assert result.stats.avg == 20.0
         assert result.stats.min == 10.0
         assert result.stats.max == 30.0
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/server_metrics/test_export_stats_basic.py` around lines 97 - 114,
The test TestUnknownExportStats currently asserts result is not None then
accesses result.stats directly; add an explicit assertion that result.stats is
not None to surface a clear failure if stats are missing and optionally assert
the concrete return type (e.g., isinstance(result, GaugeSeries)) to confirm
UNKNOWN is routed to gauge logic; update the test around compute_stats (called
with PrometheusMetricType.UNKNOWN, get_gauge, make_time_filter) to include
assert result.stats is not None and, if desired, assert isinstance(result,
GaugeSeries) before checking avg/min/max.
tests/unit/server_metrics/test_csv_exporter.py (1)

338-341: Prefer == over is to test value equality rather than object identity.

Using is asserts that both keys reference the same list object, which is an implementation detail. If the backing csv_exporter.py is ever refactored to create a copy (e.g., STAT_KEYS_MAP[UNKNOWN] = list(GAUGE_STAT_KEYS)), this test would silently break while the behavior remains correct. == expresses the actual intent — that the stat-key lists are identical in content.

♻️ Proposed fix
-        assert (
-            STAT_KEYS_MAP[PrometheusMetricType.UNKNOWN]
-            is STAT_KEYS_MAP[PrometheusMetricType.GAUGE]
-        )
+        assert (
+            STAT_KEYS_MAP[PrometheusMetricType.UNKNOWN]
+            == STAT_KEYS_MAP[PrometheusMetricType.GAUGE]
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/server_metrics/test_csv_exporter.py` around lines 338 - 341, The
assertion is using object identity instead of value equality; update the test to
compare contents by replacing the `is` check with an equality check so that
STAT_KEYS_MAP[PrometheusMetricType.UNKNOWN] ==
STAT_KEYS_MAP[PrometheusMetricType.GAUGE]; this keeps the intent clear and
resilient if the lists are copies (referencing STAT_KEYS_MAP and
PrometheusMetricType.UNKNOWN / PrometheusMetricType.GAUGE to locate the
assertion).
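The identity-versus-equality distinction this comment draws is standard Python list behavior (`gauge_stat_keys` here is an illustrative stand-in for GAUGE_STAT_KEYS):

```python
gauge_stat_keys = ["avg", "min", "max", "p50", "p99"]

alias = gauge_stat_keys          # same object: identity and equality both hold
copied = list(gauge_stat_keys)   # equal content, different object

assert alias is gauge_stat_keys
assert copied == gauge_stat_keys      # value equality survives a copy
assert copied is not gauge_stat_keys  # object identity does not
```

An `is` assertion would start failing the moment the map entry became a copy, even though the exported stat keys would be unchanged; `==` tests the behavior the suite actually cares about.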
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/unit/server_metrics/test_csv_exporter.py`:
- Around line 333-341: The CSV exporter currently omits UNKNOWN metrics because
section_order only lists GAUGE/COUNTER/HISTOGRAM and _group_metrics_by_type keys
metrics by their raw type; fix by either adding PrometheusMetricType.UNKNOWN to
the section_order used by _generate_content or by normalizing UNKNOWN to
PrometheusMetricType.GAUGE inside _group_metrics_by_type (or both for clarity),
and add an integration test that creates a metric with
type=PrometheusMetricType.UNKNOWN and asserts a CSV row is emitted; reference
STAT_KEYS_MAP, PrometheusMetricType.UNKNOWN, section_order, _generate_content,
and _group_metrics_by_type when updating the code and tests.

In `@tests/unit/server_metrics/test_export_stats_basic.py`:
- Around line 9-11: The imports in the test import block are out of alphabetical
order causing the ruff pre-commit failure: move the PrometheusMetricType import
before the TimeRangeFilter import so that aiperf.common.enums is imported prior
to aiperf.common.models (swap the order of PrometheusMetricType and
TimeRangeFilter in the import list).


ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e358413 and 5de94a2a35eb54d6119c70a8ec0ea37cc7bcba81.

📒 Files selected for processing (12)
  • src/aiperf/server_metrics/accumulator.py
  • src/aiperf/server_metrics/csv_exporter.py
  • src/aiperf/server_metrics/export_stats.py
  • src/aiperf/server_metrics/json_exporter.py
  • src/aiperf/server_metrics/parquet_exporter.py
  • src/aiperf/server_metrics/storage.py
  • tests/server_metrics/test_storage.py
  • tests/unit/server_metrics/test_accumulator.py
  • tests/unit/server_metrics/test_csv_exporter.py
  • tests/unit/server_metrics/test_export_stats_basic.py
  • tests/unit/server_metrics/test_export_stats_edge_cases_aggressive.py
  • tests/unit/server_metrics/test_storage_edge_cases.py

Comment on lines +333 to +341
class TestStatKeysMap:
    """Test STAT_KEYS_MAP covers all expected metric types."""

    def test_unknown_metric_type_uses_gauge_stat_keys(self):
        """Test UNKNOWN type maps to the same stat keys as GAUGE."""
        assert (
            STAT_KEYS_MAP[PrometheusMetricType.UNKNOWN]
            is STAT_KEYS_MAP[PrometheusMetricType.GAUGE]
        )

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

rg -n "section_order|UNKNOWN|PrometheusMetricType" --type py src/aiperf/server_metrics/csv_exporter.py

Repository: ai-dynamo/aiperf

Length of output: 1239


🏁 Script executed:

sed -n '170,200p' src/aiperf/server_metrics/csv_exporter.py

Repository: ai-dynamo/aiperf

Length of output: 1244


🏁 Script executed:

sed -n '213,250p' src/aiperf/server_metrics/csv_exporter.py

Repository: ai-dynamo/aiperf

Length of output: 1515


🏁 Script executed:

rg -n "UNKNOWN.*csv|test.*UNKNOWN.*metric|integration.*csv" --type py tests/

Repository: ai-dynamo/aiperf

Length of output: 42


🏁 Script executed:

rg -n "UNKNOWN|test_.*metric" tests/unit/server_metrics/test_csv_exporter.py | head -30

Repository: ai-dynamo/aiperf

Length of output: 937


Add UNKNOWN metrics to CSV output by including it in section_order or normalizing to GAUGE.

The STAT_KEYS_MAP correctly maps PrometheusMetricType.UNKNOWN to GAUGE_STAT_KEYS (line 44), but _generate_content does not process UNKNOWN metrics. The section_order list (lines 186-189) includes only GAUGE, COUNTER, and HISTOGRAM. Since _group_metrics_by_type stores metrics keyed by their raw type without normalizing UNKNOWN to GAUGE (line 237), any metric with type=UNKNOWN will be silently omitted from CSV output. Either add PrometheusMetricType.UNKNOWN to section_order or normalize UNKNOWN types to GAUGE during the grouping step. Additionally, add an integration test that verifies UNKNOWN-typed metrics produce rows in the CSV output.
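The normalization option this comment proposes can be sketched as a small adjustment during grouping (`group_metrics_by_type` below is a hypothetical simplification of the exporter's `_group_metrics_by_type`, not the actual implementation):

```python
from collections import defaultdict
from enum import Enum


class PrometheusMetricType(str, Enum):
    GAUGE = "gauge"
    COUNTER = "counter"
    HISTOGRAM = "histogram"
    UNKNOWN = "unknown"


def group_metrics_by_type(
    metrics: dict[str, PrometheusMetricType],
) -> dict[PrometheusMetricType, list[str]]:
    """Group metric names by type, folding UNKNOWN into GAUGE."""
    grouped: dict[PrometheusMetricType, list[str]] = defaultdict(list)
    for name, metric_type in metrics.items():
        # Normalize UNKNOWN to GAUGE so it lands in a section that
        # section_order actually iterates, instead of being silently dropped.
        if metric_type is PrometheusMetricType.UNKNOWN:
            metric_type = PrometheusMetricType.GAUGE
        grouped[metric_type].append(name)
    return dict(grouped)
```

As the author's reply below points out, the accumulator already performs an equivalent normalization upstream by instantiating UNKNOWN metrics as GaugeMetricData, so the exporter never actually sees a raw UNKNOWN type.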

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/server_metrics/test_csv_exporter.py` around lines 333 - 341, The
CSV exporter currently omits UNKNOWN metrics because section_order only lists
GAUGE/COUNTER/HISTOGRAM and _group_metrics_by_type keys metrics by their raw
type; fix by either adding PrometheusMetricType.UNKNOWN to the section_order
used by _generate_content or by normalizing UNKNOWN to
PrometheusMetricType.GAUGE inside _group_metrics_by_type (or both for clarity),
and add an integration test that creates a metric with
type=PrometheusMetricType.UNKNOWN and asserts a CSV row is emitted; reference
STAT_KEYS_MAP, PrometheusMetricType.UNKNOWN, section_order, _generate_content,
and _group_metrics_by_type when updating the code and tests.

Author

section_order will not have UNKNOWN in it, since _group_metrics_by_type in csv_exporter.py (line 211) iterates endpoint_summary.metrics.items(), and its type is defined as:

metrics: SerializeAsAny[
    dict[
        str,
        GaugeMetricData | CounterMetricData | HistogramMetricData,
    ]
]

That is fine, because accumulator.py (line 247) already converts UNKNOWN metrics to GaugeMetricData:

case PrometheusMetricType.GAUGE | PrometheusMetricType.UNKNOWN:
    metrics[base_name] = GaugeMetricData(
        description=metric_entry.description,
        series=[series_stats],
    )

@codecov

codecov bot commented Feb 23, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!
