
Conversation

@den-rgb (Contributor) commented Jul 31, 2025

Description

Collect accelerator vendor metrics

https://issues.redhat.com/browse/RHOAIENG-25597

How Has This Been Tested?

Screenshot or short clip

Merge criteria

  • You have read the contributors guide.
  • Commit messages are meaningful - have a clear and concise summary and detailed explanation of what was changed and why.
  • Pull Request contains a description of the solution, a link to the JIRA issue, and to any dependent or related Pull Request.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work.

Summary by CodeRabbit

  • New Features

    • Added support for collecting and aggregating accelerator (GPU) metrics (utilization, memory usage, temperature, power, count).
    • Conditional scraping of accelerator metrics when monitoring is enabled and configured.
    • New Prometheus recording rules for accelerator metrics with a short evaluation interval.
  • Documentation

    • Added detailed documentation for accelerator metrics collection, configuration, verification, and troubleshooting.
  • Tests

    • Added unit tests validating monitoring template data and accelerator metrics configuration.


coderabbitai bot commented Jul 31, 2025

Walkthrough

Adds conditional accelerator (GPU) metrics collection: documentation, Prometheus recording rules, an OpenTelemetry Collector scrape job with relabeling, a controller template flag, and unit tests. Scraping, normalization, and aggregation are enabled when monitoring metrics are configured.

Changes

Cohort / File(s), with change summary:

  • Documentation (docs/ACCELERATOR_METRICS.md): Adds documentation describing the accelerator metrics collection architecture, configuration prerequisites, metric normalization/mapping, verification steps, and troubleshooting.
  • Prometheus Rules (config/monitoring/base/rhods-prometheusrules.yaml): Reformats existing rules and adds a new rhods-accelerator-usage.rules group (30s interval) with recording rules aggregating GPU metrics (utilization, memory utilization, temperature, memory used, power, GPU count), labeled instance: rhoai-accelerators.
  • OpenTelemetry Collector Template (internal/controller/services/monitoring/resources/opentelemetry-collector.tmpl.yaml): Adds a conditional Prometheus scrape job dcgm-exporter-accelerator-metrics (enabled by .AcceleratorMetrics) targeting DCGM Exporter pods in nvidia-gpu-operator, scraping :9400/metrics with relabeling/metric renames, metric filtering, TLS config, and a 30s interval/10s timeout.
  • Controller Logic (internal/controller/services/monitoring/monitoring_controller_support.go): Adds an AcceleratorMetrics boolean to the template data, set when Monitoring.Spec.Metrics is present (mirrors the existing Metrics flag behavior).
  • Controller Tests (internal/controller/services/monitoring/monitoring_controller_support_test.go): New unit tests for getTemplateData validating the AcceleratorMetrics flag and the presence/typing of metrics-related template fields across managed/unmanaged and with/without-metrics scenarios.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant MonitoringController
    participant OpenTelemetryCollector
    participant DCGMExporter
    participant Prometheus

    User->>MonitoringController: Enable monitoring with metrics configured
    MonitoringController->>OpenTelemetryCollector: Render config with AcceleratorMetrics=true
    OpenTelemetryCollector->>DCGMExporter: Scrape /metrics on port 9400 (dcgm-exporter pods)
    OpenTelemetryCollector->>Prometheus: Expose/forward normalized accelerator metrics
    Prometheus-->>User: Aggregated accelerator metrics available for queries

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

lgtm, rhoai-2.23

Suggested reviewers

  • MarianMacik
  • zdtsw

Poem

🐇
I nibble logs in moonlit stacks,
I map the GPUs on their racks.
With scrape and rule and label spun,
The metrics dance — our work is done.
A little hop for observability!


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e827a22 and 7b9c927.

📒 Files selected for processing (5)
  • config/monitoring/base/rhods-prometheusrules.yaml (1 hunks)
  • docs/ACCELERATOR_METRICS.md (1 hunks)
  • internal/controller/services/monitoring/monitoring_controller_support.go (1 hunks)
  • internal/controller/services/monitoring/monitoring_controller_support_test.go (1 hunks)
  • internal/controller/services/monitoring/resources/opentelemetry-collector.tmpl.yaml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
  • internal/controller/services/monitoring/monitoring_controller_support.go
  • internal/controller/services/monitoring/monitoring_controller_support_test.go
  • docs/ACCELERATOR_METRICS.md
  • internal/controller/services/monitoring/resources/opentelemetry-collector.tmpl.yaml
  • config/monitoring/base/rhods-prometheusrules.yaml


@coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 58ba3a9 and a9b7f41.

📒 Files selected for processing (5)
  • config/monitoring/base/rhods-prometheusrules.yaml (1 hunks)
  • docs/GPU_METRICS.md (1 hunks)
  • internal/controller/services/monitoring/monitoring_controller_support.go (2 hunks)
  • internal/controller/services/monitoring/monitoring_controller_support_test.go (1 hunks)
  • internal/controller/services/monitoring/resources/opentelemetry-collector.tmpl.yaml (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
internal/controller/services/monitoring/resources/opentelemetry-collector.tmpl.yaml (1)

Learnt from: mlassak
PR: #2010
File: config/crd/kustomization.yaml:22-22
Timestamp: 2025-07-22T10:32:09.737Z
Learning: In the opendatahub-operator repository, when FeatureTrackers are being removed or deprecated, the FeatureTracker CRD reference in config/crd/kustomization.yaml should be kept for backward compatibility during the migration period, even if some components no longer use FeatureTrackers.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Build/push catalog image
  • GitHub Check: Run tests and collect coverage on internal and pkg
  • GitHub Check: golangci-lint
  • GitHub Check: Run tests and collect coverage on tests/integration
🔇 Additional comments (12)
internal/controller/services/monitoring/monitoring_controller_support.go (2)

10-10: LGTM! Import addition is appropriate.

The import of operatorv1 package is correctly added to support the new GPU metrics condition.


40-40: LGTM! GPU metrics flag logic is correct.

The condition properly checks both requirements: management state is Managed and metrics configuration is present. This aligns with the documented behavior in the GPU_METRICS.md file.

config/monitoring/base/rhods-prometheusrules.yaml (2)

17-38: LGTM! Formatting improvements enhance consistency.

The reformatting of existing rules improves YAML structure and readability without changing functionality.


39-71: Comprehensive GPU recording rules implementation.

The GPU recording rules provide excellent coverage of key GPU metrics:

  • Utilization and memory utilization averages
  • Maximum temperature monitoring
  • Total memory usage and power consumption
  • Active GPU count

The 30-second evaluation interval matches the scrape interval, and the rhoai-gpu-metrics job reference aligns with the OpenTelemetry collector configuration.
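
For readers who want a concrete picture, here is a minimal sketch of what a recording-rule group like this could look like. The group name, 30s interval, job label, and instance label come from the change summary; the record names and the normalized source metric names (nvidia_gpu_*) are illustrative assumptions, not the PR's exact rules.

groups:
  - name: rhods-accelerator-usage.rules
    interval: 30s
    rules:
      # hypothetical record names; each rule aggregates across all scraped GPUs
      - record: accelerator:gpu_utilization:avg
        expr: avg(nvidia_gpu_utilization{job="rhoai-gpu-metrics"})
        labels:
          instance: rhoai-accelerators
      - record: accelerator:gpu_temperature_celsius:max
        expr: max(nvidia_gpu_temperature_celsius{job="rhoai-gpu-metrics"})
        labels:
          instance: rhoai-accelerators
      - record: accelerator:gpu_count:sum
        expr: count(nvidia_gpu_utilization{job="rhoai-gpu-metrics"})
        labels:
          instance: rhoai-accelerators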

internal/controller/services/monitoring/resources/opentelemetry-collector.tmpl.yaml (2)

35-52: LGTM! GPU scrape job configuration is well-structured.

The conditional inclusion and targeting of DCGM exporter pods is correctly implemented. The namespace targeting and pod name regex pattern appropriately identify NVIDIA DCGM exporter instances.
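
As a rough illustration of the conditional scrape job being reviewed here: the job name, namespace, port, and intervals below follow the change summary, while the pod-name regex and the relabel steps are assumptions, and the PR's TLS settings are omitted.

{{- if .AcceleratorMetrics }}
- job_name: dcgm-exporter-accelerator-metrics
  scrape_interval: 30s
  scrape_timeout: 10s
  kubernetes_sd_configs:
    - role: pod
      namespaces:
        names:
          - nvidia-gpu-operator
  relabel_configs:
    # keep only DCGM exporter pods (regex is an assumption)
    - source_labels: [__meta_kubernetes_pod_name]
      regex: nvidia-dcgm-exporter.*
      action: keep
    # scrape each matched pod on the exporter port 9400
    # (inside an OTel Collector config, $ may need to be escaped as $$)
    - source_labels: [__meta_kubernetes_pod_ip]
      regex: (.+)
      replacement: $1:9400
      target_label: __address__
{{- end }}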


53-101: Comprehensive metric normalization with good OCP alignment.

The relabeling rules effectively transform DCGM metrics to normalized naming conventions:

  • Clear mapping from DCGM_FI_* to nvidia_gpu_* format
  • Appropriate addition of node, component, and job labels
  • Good alignment with OpenShift Container Platform standards

The normalization enhances metric discoverability and consistency.
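
To make the normalization concrete, here is a hedged sketch of the kind of metric relabeling described above. The DCGM source names are standard DCGM exporter metrics, but the normalized target names and the static component label are illustrative assumptions; node enrichment would typically happen at target-relabeling time (from __meta_kubernetes_pod_node_name), since meta labels are no longer available during metric relabeling.

metric_relabel_configs:
  # rename DCGM metrics to normalized nvidia_gpu_* names (target names are assumptions)
  - source_labels: [__name__]
    regex: DCGM_FI_DEV_GPU_UTIL
    target_label: __name__
    replacement: nvidia_gpu_utilization
  - source_labels: [__name__]
    regex: DCGM_FI_DEV_GPU_TEMP
    target_label: __name__
    replacement: nvidia_gpu_temperature_celsius
  - source_labels: [__name__]
    regex: DCGM_FI_DEV_FB_USED
    target_label: __name__
    replacement: nvidia_gpu_memory_used
  # add a static component label (value is an assumption)
  - target_label: component
    replacement: accelerator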

internal/controller/services/monitoring/monitoring_controller_support_test.go (3)

1-2: Appropriate nolint comment for testing unexported function.

The nolint:testpackage directive is correctly used since the test needs to access the unexported getTemplateData function.


21-116: Comprehensive test coverage for GPUMetrics flag logic.

The test cases thoroughly cover all combinations of management state and metrics configuration:

  • Managed + metrics → GPU metrics enabled
  • Unmanaged + metrics → GPU metrics disabled
  • Managed + no metrics → GPU metrics disabled
  • Unmanaged + no metrics → GPU metrics disabled

The test setup correctly uses fake clients and validates both the GPUMetrics flag and other template data fields.


118-189: Good additional test for full metrics configuration.

This test validates the GPUMetrics flag behavior when a complete metrics configuration is provided, ensuring the template data includes all expected metrics-related fields. The type assertions for boolean values add good validation rigor.

docs/GPU_METRICS.md (3)

1-24: Excellent documentation structure and configuration clarity.

The overview and configuration requirements section clearly explains when GPU metrics are enabled, matching the implementation logic in the code. The architecture description provides good context for users.


25-70: Comprehensive implementation details with accurate code references.

The implementation details section effectively documents:

  • Template data configuration with actual code snippet
  • OpenTelemetry collector configuration structure
  • Complete DCGM to OCP metric mapping table
  • Additional labels for enrichment

The metric name mappings accurately reflect the relabeling rules in the OpenTelemetry collector configuration.


71-168: Thorough operational guidance and troubleshooting.

The documentation provides excellent operational support:

  • Clear recording rules documentation
  • Practical verification steps with actual commands
  • Comprehensive troubleshooting section covering common issues
  • Useful debug commands for investigating problems
  • Complete list of related components

This will be valuable for operators and developers working with the GPU metrics feature.


codecov bot commented Jul 31, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (main@43df545). Learn more about missing BASE report.
⚠️ Report is 9 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2236   +/-   ##
=======================================
  Coverage        ?   40.63%           
=======================================
  Files           ?      148           
  Lines           ?    11909           
  Branches        ?        0           
=======================================
  Hits            ?     4839           
  Misses          ?     6666           
  Partials        ?      404           

☔ View full report in Codecov by Sentry.

@den-rgb (Contributor, Author) commented Jul 31, 2025

/cc @MarianMacik
/cc @CFSNM

@openshift-ci bot requested review from CFSNM and MarianMacik, July 31, 2025 13:13
@den-rgb force-pushed the RHOAIENG-25597 branch 2 times, most recently from 272834f to 440552d, July 31, 2025 13:48

@CFSNM (Member) left a comment

/approved /lgtm


openshift-ci bot commented Jul 31, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: CFSNM

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@CFSNM (Member) commented Jul 31, 2025

@MarianMacik we could consider adding a scenario with a GPU to test these metrics, but I am worried about the resources.

@openshift-ci bot removed the lgtm label, Jul 31, 2025
@den-rgb force-pushed the RHOAIENG-25597 branch 2 times, most recently from f2b594c to 297b2df, July 31, 2025 14:56

@zdtsw (Member) commented Jul 31, 2025

We added these in both places, so will the managed SRE get it from both the old Prometheus and the new MonitoringStack?

Another thing: will this later cover other accelerators (NPU, TPU) beyond NVIDIA GPUs only? If so, should it be named Accelerator rather than GPU?

@den-rgb (Contributor, Author) commented Aug 1, 2025

/test opendatahub-operator-e2e

@den-rgb (Contributor, Author) commented Aug 5, 2025

@zdtsw The GPU metrics collection is configured specifically in the new OpenTelemetry collector, which scrapes the NVIDIA DCGM exporter, so they will only get it from the new monitoring stack.
I'll rename it to accelerator :)
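
For context, here is a rough sketch of the data path den-rgb is describing (only the new OpenTelemetry collector scrapes the DCGM exporter and forwards the result to the new monitoring stack). The exporter type and the endpoint below are assumptions, not the PR's actual configuration.

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: dcgm-exporter-accelerator-metrics
          # ... DCGM exporter scrape and relabeling as sketched earlier ...
exporters:
  prometheusremotewrite:                                       # hypothetical exporter choice
    endpoint: https://monitoring-stack.example/api/v1/write    # hypothetical endpoint
service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [prometheusremotewrite]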

@den-rgb (Contributor, Author) commented Aug 5, 2025

/test opendatahub-operator-e2e

1 similar comment

@den-rgb (Contributor, Author) commented Aug 6, 2025

/test opendatahub-operator-e2e

@den-rgb (Contributor, Author) commented Aug 6, 2025

/test opendatahub-operator-e2e

2 similar comments

@den-rgb (Contributor, Author) commented Aug 7, 2025

/test opendatahub-operator-e2e

@den-rgb (Contributor, Author) commented Aug 7, 2025

/test opendatahub-operator-e2e

@den-rgb (Contributor, Author) commented Aug 7, 2025

/test opendatahub-operator-e2e

@den-rgb force-pushed the RHOAIENG-25597 branch 2 times, most recently from 628ce36 to e827a22, August 7, 2025 15:50

@den-rgb (Contributor, Author) commented Aug 8, 2025

/test opendatahub-operator-e2e

4 similar comments

@den-rgb (Contributor, Author) commented Aug 8, 2025

/test opendatahub-operator-e2e

@den-rgb (Contributor, Author) commented Aug 8, 2025

/test opendatahub-operator-e2e

@den-rgb (Contributor, Author) commented Aug 8, 2025

/test opendatahub-operator-e2e

@den-rgb (Contributor, Author) commented Aug 8, 2025

/test opendatahub-operator-e2e

@den-rgb (Contributor, Author) commented Aug 11, 2025

/test opendatahub-operator-e2e

1 similar comment

@den-rgb (Contributor, Author) commented Aug 12, 2025

/test opendatahub-operator-e2e

@StevenTobin (Contributor) commented:
/lgtm

@@ -36,8 +36,8 @@ func getTemplateData(ctx context.Context, rr *odhtypes.ReconciliationRequest) (m

templateData["Traces"] = monitoring.Spec.Traces != nil
templateData["Metrics"] = monitoring.Spec.Metrics != nil
templateData["AcceleratorMetrics"] = monitoring.Spec.Metrics != nil
A Member commented:

I might not fully understand why we need to introduce AcceleratorMetrics here; it looks like AcceleratorMetrics == Metrics, right?

Also, in the template, that {{- if .AcceleratorMetrics }} is within the {{- if .Metrics }}...{{- end}} block.

@den-rgb (Contributor, Author) replied:

It was implemented for better maintainability and more flexibility around accelerator metrics collection. If need be, I can change it to use Metrics if you don't think there will be a need to keep them separate in the future.

A Member replied:

Since the change has been merged already, we can keep it as-is. I was just not sure what the plan is for the future; as you said, maybe more things will be needed only for Accelerator 🤷

Another Member commented:

@zdtsw made a valid point. Since AcceleratorMetrics is identical to Metrics (both are monitoring.Spec.Metrics != nil), this adds unnecessary complexity without any benefit.

This violates YAGNI (You Aren't Gonna Need It). We're adding abstraction for future flexibility that doesn't exist yet. Since the accelerator metrics block is already inside {{- if .Metrics }}, the extra conditional serves no purpose.

We should create a follow-up to remove AcceleratorMetrics and just use {{- if .Metrics }} directly.
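
For reference, a sketch of the nesting the reviewers are pointing at (structure assumed from the discussion, not the exact template): because both flags are derived from monitoring.Spec.Metrics != nil, the inner conditional can never differ from the outer one today.

{{- if .Metrics }}
# ... existing metrics scrape jobs ...
{{- if .AcceleratorMetrics }}   # currently always true whenever .Metrics is true
- job_name: dcgm-exporter-accelerator-metrics
  # ...
{{- end }}
{{- end }}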

@zdtsw (Member) commented Aug 15, 2025

nit: for the docs, do we want to have a doc for all monitoring-related config and usage? It looks like this is the only PR we have docs on.

@openshift-merge-bot bot merged commit bf227a4 into opendatahub-io:main, Aug 15, 2025
16 checks passed
@github-project-automation bot moved this from Todo to Done in ODH Platform Planning, Aug 15, 2025

@den-rgb (Contributor, Author) commented Aug 19, 2025

/cherry-pick rhoai

@openshift-cherrypick-robot

@den-rgb: cannot checkout rhoa: error checking out "rhoa": exit status 1 error: pathspec 'rhoa' did not match any file(s) known to git

In response to this:

/cherry-pick rhoa


@openshift-cherrypick-robot

@den-rgb: new pull request created: #2321

In response to this:

/cherry-pick rhoai


sukumars321 pushed a commit to sukumars321/opendatahub-operator that referenced this pull request Aug 19, 2025
Projects: Status: Done

7 participants