
Conversation

@den-rgb (Contributor) commented Jul 31, 2025

Description

Collect accelerator vendor metrics

https://issues.redhat.com/browse/RHOAIENG-25597

How Has This Been Tested?

Screenshot or short clip

Merge criteria

  • You have read the contributors guide.
  • Commit messages are meaningful - have a clear and concise summary and detailed explanation of what was changed and why.
  • Pull Request contains a description of the solution, a link to the JIRA issue, and to any dependent or related Pull Request.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work.

Summary by CodeRabbit

  • New Features

    • Added support for collecting and aggregating accelerator (GPU) metrics (utilization, memory usage, temperature, power, count).
    • Conditional scraping of accelerator metrics when monitoring is enabled and configured.
    • New Prometheus recording rules for accelerator metrics with a short evaluation interval.
  • Documentation

    • Added detailed documentation for accelerator metrics collection, configuration, verification, and troubleshooting.
  • Tests

    • Added unit tests validating monitoring template data and accelerator metrics configuration.


coderabbitai bot commented Jul 31, 2025

Walkthrough

Adds conditional accelerator (GPU) metrics collection: documentation, Prometheus recording rules, an OpenTelemetry Collector scrape job with relabeling, a controller template flag, and unit tests. Scraping, normalization, and aggregation are enabled when monitoring metrics are configured.

Changes

Cohort / File(s), with change summary:

  • Documentation (docs/ACCELERATOR_METRICS.md): Adds documentation describing the accelerator metrics collection architecture, configuration prerequisites, metric normalization/mapping, verification steps, and troubleshooting.
  • Prometheus Rules (config/monitoring/base/rhods-prometheusrules.yaml): Reformats existing rules and adds a new rhods-accelerator-usage.rules group (30s interval) with recording rules aggregating GPU metrics (utilization, memory utilization, temperature, memory used, power, GPU count), labeled instance: rhoai-accelerators.
  • OpenTelemetry Collector Template (internal/controller/services/monitoring/resources/opentelemetry-collector.tmpl.yaml): Adds a conditional Prometheus scrape job dcgm-exporter-accelerator-metrics (enabled by .AcceleratorMetrics) targeting DCGM Exporter pods in nvidia-gpu-operator, scraping :9400/metrics with relabeling/metric renames, metric filtering, TLS config, and a 30s interval/10s timeout.
  • Controller Logic (internal/controller/services/monitoring/monitoring_controller_support.go): Adds an AcceleratorMetrics boolean to the template data, set when Monitoring.Spec.Metrics is present (mirrors the existing Metrics flag behavior).
  • Controller Tests (internal/controller/services/monitoring/monitoring_controller_support_test.go): New unit tests for getTemplateData validating the AcceleratorMetrics flag and the presence/typing of metrics-related template fields across managed/unmanaged and with/without-metrics scenarios.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant MonitoringController
    participant OpenTelemetryCollector
    participant DCGMExporter
    participant Prometheus

    User->>MonitoringController: Enable monitoring with metrics configured
    MonitoringController->>OpenTelemetryCollector: Render config with AcceleratorMetrics=true
    OpenTelemetryCollector->>DCGMExporter: Scrape /metrics on port 9400 (dcgm-exporter pods)
    OpenTelemetryCollector->>Prometheus: Expose/forward normalized accelerator metrics
    Prometheus-->>User: Aggregated accelerator metrics available for queries

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

lgtm, rhoai-2.23

Suggested reviewers

  • MarianMacik
  • zdtsw

Poem

🐇
I nibble logs in moonlit stacks,
I map the GPUs on their racks.
With scrape and rule and label spun,
The metrics dance — our work is done.
A little hop for observability!


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e827a22 and 7b9c927.

📒 Files selected for processing (5)
  • config/monitoring/base/rhods-prometheusrules.yaml (1 hunks)
  • docs/ACCELERATOR_METRICS.md (1 hunks)
  • internal/controller/services/monitoring/monitoring_controller_support.go (1 hunks)
  • internal/controller/services/monitoring/monitoring_controller_support_test.go (1 hunks)
  • internal/controller/services/monitoring/resources/opentelemetry-collector.tmpl.yaml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
  • internal/controller/services/monitoring/monitoring_controller_support.go
  • internal/controller/services/monitoring/monitoring_controller_support_test.go
  • docs/ACCELERATOR_METRICS.md
  • internal/controller/services/monitoring/resources/opentelemetry-collector.tmpl.yaml
  • config/monitoring/base/rhods-prometheusrules.yaml


@coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 58ba3a9 and a9b7f41.

📒 Files selected for processing (5)
  • config/monitoring/base/rhods-prometheusrules.yaml (1 hunks)
  • docs/GPU_METRICS.md (1 hunks)
  • internal/controller/services/monitoring/monitoring_controller_support.go (2 hunks)
  • internal/controller/services/monitoring/monitoring_controller_support_test.go (1 hunks)
  • internal/controller/services/monitoring/resources/opentelemetry-collector.tmpl.yaml (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
internal/controller/services/monitoring/resources/opentelemetry-collector.tmpl.yaml (1)

Learnt from: mlassak
PR: #2010
File: config/crd/kustomization.yaml:22-22
Timestamp: 2025-07-22T10:32:09.737Z
Learning: In the opendatahub-operator repository, when FeatureTrackers are being removed or deprecated, the FeatureTracker CRD reference in config/crd/kustomization.yaml should be kept for backward compatibility during the migration period, even if some components no longer use FeatureTrackers.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Build/push catalog image
  • GitHub Check: Run tests and collect coverage on internal and pkg
  • GitHub Check: golangci-lint
  • GitHub Check: Run tests and collect coverage on tests/integration
🔇 Additional comments (12)
internal/controller/services/monitoring/monitoring_controller_support.go (2)

10-10: LGTM! Import addition is appropriate.

The import of operatorv1 package is correctly added to support the new GPU metrics condition.


40-40: LGTM! GPU metrics flag logic is correct.

The condition properly checks both requirements: management state is Managed and metrics configuration is present. This aligns with the documented behavior in the GPU_METRICS.md file.

config/monitoring/base/rhods-prometheusrules.yaml (2)

17-38: LGTM! Formatting improvements enhance consistency.

The reformatting of existing rules improves YAML structure and readability without changing functionality.


39-71: Comprehensive GPU recording rules implementation.

The GPU recording rules provide excellent coverage of key GPU metrics:

  • Utilization and memory utilization averages
  • Maximum temperature monitoring
  • Total memory usage and power consumption
  • Active GPU count

The 30-second evaluation interval matches the scrape interval, and the rhoai-gpu-metrics job reference aligns with the OpenTelemetry collector configuration.
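
For readers who want a concrete picture, here is a minimal sketch of what a recording-rule group like this could look like. The group name, 30s interval, job label, and instance label come from the change summary; the record names and the normalized source metric names (nvidia_gpu_*) are illustrative assumptions, not the PR's exact rules.

groups:
  - name: rhods-accelerator-usage.rules
    interval: 30s
    rules:
      # hypothetical record names; each rule aggregates across all scraped GPUs
      - record: accelerator:gpu_utilization:avg
        expr: avg(nvidia_gpu_utilization{job="rhoai-gpu-metrics"})
        labels:
          instance: rhoai-accelerators
      - record: accelerator:gpu_temperature_celsius:max
        expr: max(nvidia_gpu_temperature_celsius{job="rhoai-gpu-metrics"})
        labels:
          instance: rhoai-accelerators
      - record: accelerator:gpu_count:sum
        expr: count(nvidia_gpu_utilization{job="rhoai-gpu-metrics"})
        labels:
          instance: rhoai-accelerators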

internal/controller/services/monitoring/resources/opentelemetry-collector.tmpl.yaml (2)

35-52: LGTM! GPU scrape job configuration is well-structured.

The conditional inclusion and targeting of DCGM exporter pods is correctly implemented. The namespace targeting and pod name regex pattern appropriately identify NVIDIA DCGM exporter instances.
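
As a rough illustration of the conditional scrape job being reviewed here: the job name, namespace, port, and intervals below follow the change summary, while the pod-name regex and the relabel steps are assumptions, and the PR's TLS settings are omitted.

{{- if .AcceleratorMetrics }}
- job_name: dcgm-exporter-accelerator-metrics
  scrape_interval: 30s
  scrape_timeout: 10s
  kubernetes_sd_configs:
    - role: pod
      namespaces:
        names:
          - nvidia-gpu-operator
  relabel_configs:
    # keep only DCGM exporter pods (regex is an assumption)
    - source_labels: [__meta_kubernetes_pod_name]
      regex: nvidia-dcgm-exporter.*
      action: keep
    # scrape each matched pod on the exporter port 9400
    # (inside an OTel Collector config, $ may need to be escaped as $$)
    - source_labels: [__meta_kubernetes_pod_ip]
      regex: (.+)
      replacement: $1:9400
      target_label: __address__
{{- end }}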


53-101: Comprehensive metric normalization with good OCP alignment.

The relabeling rules effectively transform DCGM metrics to normalized naming conventions:

  • Clear mapping from DCGM_FI_* to nvidia_gpu_* format
  • Appropriate addition of node, component, and job labels
  • Good alignment with OpenShift Container Platform standards

The normalization enhances metric discoverability and consistency.
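
To make the normalization concrete, here is a hedged sketch of the kind of metric relabeling described above. The DCGM source names are standard DCGM exporter metrics, but the normalized target names and the static component label are illustrative assumptions; node enrichment would typically happen at target-relabeling time (from __meta_kubernetes_pod_node_name), since meta labels are no longer available during metric relabeling.

metric_relabel_configs:
  # rename DCGM metrics to normalized nvidia_gpu_* names (target names are assumptions)
  - source_labels: [__name__]
    regex: DCGM_FI_DEV_GPU_UTIL
    target_label: __name__
    replacement: nvidia_gpu_utilization
  - source_labels: [__name__]
    regex: DCGM_FI_DEV_GPU_TEMP
    target_label: __name__
    replacement: nvidia_gpu_temperature_celsius
  - source_labels: [__name__]
    regex: DCGM_FI_DEV_FB_USED
    target_label: __name__
    replacement: nvidia_gpu_memory_used
  # add a static component label (value is an assumption)
  - target_label: component
    replacement: accelerator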

internal/controller/services/monitoring/monitoring_controller_support_test.go (3)

1-2: Appropriate nolint comment for testing unexported function.

The nolint:testpackage directive is correctly used since the test needs to access the unexported getTemplateData function.


21-116: Comprehensive test coverage for GPUMetrics flag logic.

The test cases thoroughly cover all combinations of management state and metrics configuration:

  • Managed + metrics → GPU metrics enabled
  • Unmanaged + metrics → GPU metrics disabled
  • Managed + no metrics → GPU metrics disabled
  • Unmanaged + no metrics → GPU metrics disabled

The test setup correctly uses fake clients and validates both the GPUMetrics flag and other template data fields.


118-189: Good additional test for full metrics configuration.

This test validates the GPUMetrics flag behavior when a complete metrics configuration is provided, ensuring the template data includes all expected metrics-related fields. The type assertions for boolean values add good validation rigor.

docs/GPU_METRICS.md (3)

1-24: Excellent documentation structure and configuration clarity.

The overview and configuration requirements section clearly explains when GPU metrics are enabled, matching the implementation logic in the code. The architecture description provides good context for users.


25-70: Comprehensive implementation details with accurate code references.

The implementation details section effectively documents:

  • Template data configuration with actual code snippet
  • OpenTelemetry collector configuration structure
  • Complete DCGM to OCP metric mapping table
  • Additional labels for enrichment

The metric name mappings accurately reflect the relabeling rules in the OpenTelemetry collector configuration.


71-168: Thorough operational guidance and troubleshooting.

The documentation provides excellent operational support:

  • Clear recording rules documentation
  • Practical verification steps with actual commands
  • Comprehensive troubleshooting section covering common issues
  • Useful debug commands for investigating problems
  • Complete list of related components

This will be valuable for operators and developers working with the GPU metrics feature.


codecov bot commented Jul 31, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (main@43df545). Learn more about missing BASE report.
⚠️ Report is 9 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2236   +/-   ##
=======================================
  Coverage        ?   40.63%           
=======================================
  Files           ?      148           
  Lines           ?    11909           
  Branches        ?        0           
=======================================
  Hits            ?     4839           
  Misses          ?     6666           
  Partials        ?      404           

☔ View full report in Codecov by Sentry.

@den-rgb (Contributor, Author) commented Jul 31, 2025

/cc @MarianMacik
/cc @CFSNM

@openshift-ci bot requested review from CFSNM and MarianMacik, July 31, 2025 13:13
@den-rgb force-pushed the RHOAIENG-25597 branch 2 times, most recently from 272834f to 440552d, July 31, 2025 13:48

@CFSNM (Member) left a comment

/approved /lgtm


openshift-ci bot commented Jul 31, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: CFSNM

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@CFSNM (Member) commented Jul 31, 2025

@MarianMacik we could consider adding a scenario with a GPU to test these metrics, but I am worried about the resources.

@openshift-ci bot removed the lgtm label, Jul 31, 2025
@den-rgb force-pushed the RHOAIENG-25597 branch 2 times, most recently from f2b594c to 297b2df, July 31, 2025 14:56

@zdtsw (Member) commented Jul 31, 2025

We added these in both places, so will the managed SRE get it from both the old Prometheus and the new MonitoringStack?

Another thing: will this later cover other accelerators (NPU, TPU) beyond NVIDIA GPUs only? If so, should it be named Accelerator rather than GPU?

@den-rgb (Contributor, Author) commented Aug 1, 2025

/test opendatahub-operator-e2e

@den-rgb (Contributor, Author) commented Aug 5, 2025

@zdtsw The GPU metrics collection is configured specifically in the new OpenTelemetry collector, which scrapes the NVIDIA DCGM exporter, so they will only get it from the new monitoring stack.
I'll rename it to accelerator :)
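
For context, here is a rough sketch of the data path den-rgb is describing (only the new OpenTelemetry collector scrapes the DCGM exporter and forwards the result to the new monitoring stack). The exporter type and the endpoint below are assumptions, not the PR's actual configuration.

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: dcgm-exporter-accelerator-metrics
          # ... DCGM exporter scrape and relabeling as sketched earlier ...
exporters:
  prometheusremotewrite:                                       # hypothetical exporter choice
    endpoint: https://monitoring-stack.example/api/v1/write    # hypothetical endpoint
service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [prometheusremotewrite]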

@den-rgb (Contributor, Author) commented Aug 5, 2025

/test opendatahub-operator-e2e

1 similar comment

@den-rgb (Contributor, Author) commented Aug 6, 2025

/test opendatahub-operator-e2e

@den-rgb (Contributor, Author) commented Aug 6, 2025

/test opendatahub-operator-e2e

2 similar comments

@den-rgb (Contributor, Author) commented Aug 7, 2025

/test opendatahub-operator-e2e

@den-rgb (Contributor, Author) commented Aug 7, 2025

/test opendatahub-operator-e2e

@den-rgb (Contributor, Author) commented Aug 7, 2025

/test opendatahub-operator-e2e

@den-rgb force-pushed the RHOAIENG-25597 branch 2 times, most recently from 628ce36 to e827a22, August 7, 2025 15:50

@den-rgb (Contributor, Author) commented Aug 8, 2025

/test opendatahub-operator-e2e

4 similar comments

@den-rgb (Contributor, Author) commented Aug 8, 2025

/test opendatahub-operator-e2e

@den-rgb (Contributor, Author) commented Aug 8, 2025

/test opendatahub-operator-e2e

@den-rgb (Contributor, Author) commented Aug 8, 2025

/test opendatahub-operator-e2e

@den-rgb (Contributor, Author) commented Aug 8, 2025

/test opendatahub-operator-e2e

@den-rgb (Contributor, Author) commented Aug 11, 2025

/test opendatahub-operator-e2e

1 similar comment

@den-rgb (Contributor, Author) commented Aug 12, 2025

/test opendatahub-operator-e2e

@StevenTobin (Contributor) commented:
/lgtm

@@ -36,8 +36,8 @@ func getTemplateData(ctx context.Context, rr *odhtypes.ReconciliationRequest) (m

templateData["Traces"] = monitoring.Spec.Traces != nil
templateData["Metrics"] = monitoring.Spec.Metrics != nil
templateData["AcceleratorMetrics"] = monitoring.Spec.Metrics != nil
A Member commented:

I might not fully understand why we need to introduce AcceleratorMetrics here; it looks like AcceleratorMetrics == Metrics, right?

Also, in the template, that {{- if .AcceleratorMetrics }} is within the {{- if .Metrics }}...{{- end}} block.

@den-rgb (Contributor, Author) replied:

It was implemented for better maintainability and more flexibility around accelerator metrics collection. If need be, I can change it to use Metrics if you don't think there will be a need to keep them separate in the future.

A Member replied:

Since the change has been merged already, we can keep it as-is. I was just not sure what the plan is for the future; as you said, maybe more things will be needed only for Accelerator 🤷

Another Member commented:

@zdtsw made a valid point. Since AcceleratorMetrics is identical to Metrics (both are monitoring.Spec.Metrics != nil), this adds unnecessary complexity without any benefit.

This violates YAGNI (You Aren't Gonna Need It). We're adding abstraction for future flexibility that doesn't exist yet. Since the accelerator metrics block is already inside {{- if .Metrics }}, the extra conditional serves no purpose.

We should create a follow-up to remove AcceleratorMetrics and just use {{- if .Metrics }} directly.
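
For reference, a sketch of the nesting the reviewers are pointing at (structure assumed from the discussion, not the exact template): because both flags are derived from monitoring.Spec.Metrics != nil, the inner conditional can never differ from the outer one today.

{{- if .Metrics }}
# ... existing metrics scrape jobs ...
{{- if .AcceleratorMetrics }}   # currently always true whenever .Metrics is true
- job_name: dcgm-exporter-accelerator-metrics
  # ...
{{- end }}
{{- end }}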

@zdtsw (Member) commented Aug 15, 2025

nit: for the docs, do we want to have a doc for all monitoring-related config and usage? It looks like this is the only PR we have docs on.

@openshift-merge-bot bot merged commit bf227a4 into opendatahub-io:main, Aug 15, 2025
16 checks passed
@github-project-automation bot moved this from Todo to Done in ODH Platform Planning, Aug 15, 2025

@den-rgb (Contributor, Author) commented Aug 19, 2025

/cherry-pick rhoai

@openshift-cherrypick-robot

@den-rgb: cannot checkout rhoa: error checking out "rhoa": exit status 1 error: pathspec 'rhoa' did not match any file(s) known to git

In response to this:

/cherry-pick rhoa


@openshift-cherrypick-robot

@den-rgb: new pull request created: #2321

In response to this:

/cherry-pick rhoai


sukumars321 pushed a commit to sukumars321/opendatahub-operator that referenced this pull request Aug 19, 2025
Projects: Status: Done

7 participants