RHOAIENG-25597 | feat: Adding GPU metrics #2236

Conversation
Walkthrough
Adds conditional accelerator (GPU) metrics collection: documentation, Prometheus recording rules, an OpenTelemetry Collector scrape job with relabeling, a controller template flag, and unit tests; scraping, normalization, and aggregation are enabled when monitoring metrics are configured.
Sequence Diagram(s)
sequenceDiagram
participant User
participant MonitoringController
participant OpenTelemetryCollector
participant DCGMExporter
participant Prometheus
User->>MonitoringController: Enable monitoring with metrics configured
MonitoringController->>OpenTelemetryCollector: Render config with AcceleratorMetrics=true
OpenTelemetryCollector->>DCGMExporter: Scrape /metrics on port 9400 (dcgm-exporter pods)
OpenTelemetryCollector->>Prometheus: Expose/forward normalized accelerator metrics
Prometheus-->>User: Aggregated accelerator metrics available for queries
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- config/monitoring/base/rhods-prometheusrules.yaml (1 hunks)
- docs/GPU_METRICS.md (1 hunks)
- internal/controller/services/monitoring/monitoring_controller_support.go (2 hunks)
- internal/controller/services/monitoring/monitoring_controller_support_test.go (1 hunks)
- internal/controller/services/monitoring/resources/opentelemetry-collector.tmpl.yaml (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
internal/controller/services/monitoring/resources/opentelemetry-collector.tmpl.yaml (1)
Learnt from: mlassak
PR: #2010
File: config/crd/kustomization.yaml:22-22
Timestamp: 2025-07-22T10:32:09.737Z
Learning: In the opendatahub-operator repository, when FeatureTrackers are being removed or deprecated, the FeatureTracker CRD reference in config/crd/kustomization.yaml should be kept for backward compatibility during the migration period, even if some components no longer use FeatureTrackers.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Build/push catalog image
- GitHub Check: Run tests and collect coverage on internal and pkg
- GitHub Check: golangci-lint
- GitHub Check: Run tests and collect coverage on tests/integration
🔇 Additional comments (12)
internal/controller/services/monitoring/monitoring_controller_support.go (2)
10-10: LGTM! Import addition is appropriate. The import of the operatorv1 package is correctly added to support the new GPU metrics condition.
40-40: LGTM! GPU metrics flag logic is correct. The condition properly checks both requirements: the management state is Managed and a metrics configuration is present. This aligns with the documented behavior in GPU_METRICS.md.
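For illustration, a minimal Go sketch of the condition this comment describes. The type, variable, and field names below are assumptions for the sketch, not the operator's actual implementation in monitoring_controller_support.go:

```go
package main

import (
	"fmt"

	operatorv1 "github.com/openshift/api/operator/v1"
)

// monitoringSpec is a simplified stand-in for the operator's Monitoring spec;
// the real types live in the opendatahub-operator API packages.
type monitoringSpec struct {
	Metrics *struct{} // non-nil when a metrics configuration is present
}

func main() {
	managementState := operatorv1.Managed // management state of the monitoring service
	spec := monitoringSpec{Metrics: &struct{}{}}

	templateData := map[string]any{
		"Metrics": spec.Metrics != nil,
		// Accelerator (GPU) metrics require both a Managed state and a metrics config.
		"AcceleratorMetrics": managementState == operatorv1.Managed && spec.Metrics != nil,
	}
	fmt.Println(templateData)
}
```

Note that the diff shown later in this thread derives the flag from monitoring.Spec.Metrics alone, so treat the Managed check above as the behavior this comment describes rather than a literal excerpt.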
config/monitoring/base/rhods-prometheusrules.yaml (2)
17-38: LGTM! Formatting improvements enhance consistency. The reformatting of existing rules improves YAML structure and readability without changing functionality.
39-71: Comprehensive GPU recording rules implementation. The GPU recording rules provide excellent coverage of key GPU metrics:
- Utilization and memory utilization averages
- Maximum temperature monitoring
- Total memory usage and power consumption
- Active GPU count
The 30-second evaluation interval matches the scrape interval, and the rhoai-gpu-metrics job reference aligns with the OpenTelemetry collector configuration.
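To make the aggregation shapes concrete, here is a hedged Go sketch that validates a few PromQL expressions in the spirit of these recording rules. The rule names and expressions are assumptions for illustration, not the rules shipped in rhods-prometheusrules.yaml:

```go
package main

import (
	"fmt"

	"github.com/prometheus/prometheus/promql/parser"
)

func main() {
	// Illustrative aggregations in the spirit of the recording rules described
	// above; names and expressions are assumptions, not the shipped rules.
	rules := map[string]string{
		"rhoai:gpu_utilization:avg": `avg(nvidia_gpu_utilization{job="rhoai-gpu-metrics"})`,
		"rhoai:gpu_temperature:max": `max(nvidia_gpu_temperature{job="rhoai-gpu-metrics"})`,
		"rhoai:gpu_count":           `count(nvidia_gpu_utilization{job="rhoai-gpu-metrics"})`,
	}
	for name, expr := range rules {
		// parser.ParseExpr confirms each expression is valid PromQL.
		if _, err := parser.ParseExpr(expr); err != nil {
			panic(fmt.Errorf("%s: %w", name, err))
		}
		fmt.Printf("record: %s\nexpr:   %s\n", name, expr)
	}
}
```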
internal/controller/services/monitoring/resources/opentelemetry-collector.tmpl.yaml (2)
35-52: LGTM! GPU scrape job configuration is well-structured. The conditional inclusion and targeting of DCGM exporter pods are correctly implemented. The namespace targeting and pod-name regex pattern appropriately identify NVIDIA DCGM exporter instances.
53-101: Comprehensive metric normalization with good OCP alignment. The relabeling rules effectively transform DCGM metrics to normalized naming conventions:
- Clear mapping from DCGM_FI_* to nvidia_gpu_* format
- Appropriate addition of node, component, and job labels
- Good alignment with OpenShift Container Platform standards
The normalization enhances metric discoverability and consistency.
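As a rough illustration of the rename the relabeling performs, the following Go sketch maps DCGM-style metric names onto an nvidia_gpu_* form. The exact target names are defined by the relabeling rules in opentelemetry-collector.tmpl.yaml; the function below is only an approximation:

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeDCGMName approximates the rename performed by the relabeling rules:
// DCGM_FI_DEV_GPU_UTIL -> nvidia_gpu_gpu_util. The real target names are set
// per metric in the collector template and may differ from this.
func normalizeDCGMName(name string) string {
	const prefix = "DCGM_FI_DEV_"
	if !strings.HasPrefix(name, prefix) {
		return name // leave non-DCGM metrics untouched
	}
	return "nvidia_gpu_" + strings.ToLower(strings.TrimPrefix(name, prefix))
}

func main() {
	for _, m := range []string{"DCGM_FI_DEV_GPU_UTIL", "DCGM_FI_DEV_GPU_TEMP", "DCGM_FI_DEV_POWER_USAGE"} {
		fmt.Printf("%s -> %s\n", m, normalizeDCGMName(m))
	}
}
```

Normalized names keep dashboards and recording rules decoupled from vendor-specific DCGM identifiers.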
internal/controller/services/monitoring/monitoring_controller_support_test.go (3)
1-2: Appropriate nolint comment for testing an unexported function. The nolint:testpackage directive is correctly used since the test needs to access the unexported getTemplateData function.
21-116: Comprehensive test coverage for GPUMetrics flag logic. The test cases thoroughly cover all combinations of management state and metrics configuration:
- Managed + metrics → GPU metrics enabled
- Unmanaged + metrics → GPU metrics disabled
- Managed + no metrics → GPU metrics disabled
- Unmanaged + no metrics → GPU metrics disabled
The test setup correctly uses fake clients and validates both the GPUMetrics flag and other template data fields.
118-189: Good additional test for full metrics configuration. This test validates the GPUMetrics flag behavior when a complete metrics configuration is provided, ensuring the template data includes all expected metrics-related fields. The type assertions for boolean values add good validation rigor.
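The truth table exercised by these tests can be summarized in a small, self-contained sketch. The helper below is a hypothetical stand-in for the flag computation done by getTemplateData; the real tests build fake clients and use the actual controller types:

```go
package monitoring

import "testing"

// acceleratorMetricsEnabled is a stand-in for the flag computation performed by
// getTemplateData; it only exists to make the truth table below executable.
func acceleratorMetricsEnabled(managed, hasMetrics bool) bool {
	return managed && hasMetrics
}

func TestAcceleratorMetricsTruthTable(t *testing.T) {
	cases := []struct {
		name       string
		managed    bool
		hasMetrics bool
		want       bool
	}{
		{"managed with metrics", true, true, true},
		{"unmanaged with metrics", false, true, false},
		{"managed without metrics", true, false, false},
		{"unmanaged without metrics", false, false, false},
	}
	for _, tc := range cases {
		t.Run(tc.name, func(t *testing.T) {
			if got := acceleratorMetricsEnabled(tc.managed, tc.hasMetrics); got != tc.want {
				t.Fatalf("AcceleratorMetrics = %v, want %v", got, tc.want)
			}
		})
	}
}
```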
docs/GPU_METRICS.md (3)
1-24: Excellent documentation structure and configuration clarity. The overview and configuration-requirements section clearly explains when GPU metrics are enabled, matching the implementation logic in the code. The architecture description provides good context for users.
25-70: Comprehensive implementation details with accurate code references. The implementation details section effectively documents:
- Template data configuration with an actual code snippet
- OpenTelemetry collector configuration structure
- Complete DCGM-to-OCP metric mapping table
- Additional labels for enrichment
The metric name mappings accurately reflect the relabeling rules in the OpenTelemetry collector configuration.
71-168: Thorough operational guidance and troubleshooting. The documentation provides excellent operational support:
- Clear recording-rules documentation
- Practical verification steps with actual commands
- A comprehensive troubleshooting section covering common issues
- Useful debug commands for investigating problems
- A complete list of related components
This will be valuable for operators and developers working with the GPU metrics feature.
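Beyond the documented verification commands, one hedged way to confirm that normalized metrics are queryable is a small Go program using the Prometheus client. The endpoint address and the metric name nvidia_gpu_utilization are assumptions for the example, not values taken from the docs:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Hypothetical Prometheus endpoint and metric name; adjust to your cluster's
	// route and to the recording rules actually shipped in rhods-prometheusrules.yaml.
	client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := v1.NewAPI(client)

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	result, warnings, err := promAPI.Query(ctx, `nvidia_gpu_utilization`, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println(result)
}
```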
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files:
@@ Coverage Diff @@
## main #2236 +/- ##
=======================================
Coverage   ?   40.63%
=======================================
Files      ?      148
Lines      ?    11909
Branches   ?        0
=======================================
Hits       ?     4839
Misses     ?     6666
Partials   ?      404
☔ View full report in Codecov by Sentry.
/cc @MarianMacik
Force-pushed from 272834f to 440552d
/approved /lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: CFSNM
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
@MarianMacik we could consider adding a scenario with a GPU to test these metrics, but I am worried about the resources.
Force-pushed from f2b594c to 297b2df
We added these in both places, so will the managed SRE get it from both the old Prometheus and the new MonitoringStack? Another thing: will this later apply to accelerators other than NVIDIA GPUs (NPU, TPU)? If so, should it be named Accelerator rather than GPU?
/test opendatahub-operator-e2e
@zdtsw The GPU metrics collection is specifically configured in the new OpenTelemetry collector that scrapes the NVIDIA DCGM exporter, so they will only get it from the new monitoring stack.
/test opendatahub-operator-e2e
1 similar comment
/test opendatahub-operator-e2e
/test opendatahub-operator-e2e
2 similar comments
/test opendatahub-operator-e2e
/test opendatahub-operator-e2e
/test opendatahub-operator-e2e
Force-pushed from 628ce36 to e827a22
/test opendatahub-operator-e2e
4 similar comments
/test opendatahub-operator-e2e
/test opendatahub-operator-e2e
/test opendatahub-operator-e2e
/test opendatahub-operator-e2e
Force-pushed from e827a22 to 75d392a
Force-pushed from 75d392a to 7b9c927
/test opendatahub-operator-e2e
1 similar comment
/test opendatahub-operator-e2e
/lgtm
@@ -36,8 +36,8 @@ func getTemplateData(ctx context.Context, rr *odhtypes.ReconciliationRequest) (m
templateData["Traces"] = monitoring.Spec.Traces != nil
templateData["Metrics"] = monitoring.Spec.Metrics != nil
templateData["AcceleratorMetrics"] = monitoring.Spec.Metrics != nil
I might not fully understand why we need to introduce AcceleratorMetrics here; it looks like AcceleratorMetrics == Metrics, right?
Also, in the template, that {{- if .AcceleratorMetrics }} is within the {{- if .Metrics }}...{{- end }} block.
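To illustrate the nesting being discussed, here is a hedged Go text/template sketch with a simplified template body (not the actual opentelemetry-collector.tmpl.yaml). With both flags derived from the same condition, the inner block renders whenever the outer one does:

```go
package main

import (
	"os"
	"text/template"
)

// Simplified stand-in for the collector template: the accelerator scrape job is
// only rendered when both flags are true, but since both flags are currently
// derived from the same condition they always match.
const tmpl = `
{{- if .Metrics }}
receivers:
  prometheus:
    config:
      scrape_configs:
{{- if .AcceleratorMetrics }}
        - job_name: rhoai-gpu-metrics
{{- end }}
{{- end }}
`

func main() {
	data := map[string]bool{"Metrics": true, "AcceleratorMetrics": true}
	t := template.Must(template.New("otel").Parse(tmpl))
	if err := t.Execute(os.Stdout, data); err != nil {
		panic(err)
	}
}
```

Given that both flags come from the same condition today, dropping the inner conditional would not change the rendered output.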
It was implemented for better maintainability and more flexibility around accelerator metrics collection. If need be, I can change it to use Metrics, if you don't think there will be a need to keep them separate in the future.
Since the change has already been merged, we can keep it as-is. I was just not sure what the plan is for the future; as you said, maybe more things will be needed only for accelerators 🤷
@zdtsw made a valid point. Since AcceleratorMetrics is identical to Metrics (both are monitoring.Spec.Metrics != nil), this adds unnecessary complexity without any benefit.
This violates YAGNI (You Aren't Gonna Need It). We're adding abstraction for future flexibility that doesn't exist yet. Since the accelerator metrics block is already inside {{- if .Metrics }}, the extra conditional serves no purpose.
We should create a follow-up to remove AcceleratorMetrics and just use {{- if .Metrics }} directly.
nit: for the docs, do we want a doc covering all monitoring-related config and usage? It looks like this is the only PR where we have docs for it.
Merged bf227a4 into opendatahub-io:main
/cherry-pick rhoai
@den-rgb: cannot checkout. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
@den-rgb: new pull request created: #2321. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Co-authored-by: den-rgb <[email protected]>
Description
Collect accelerator vendor metrics
https://issues.redhat.com/browse/RHOAIENG-25597
How Has This Been Tested?
Screenshot or short clip
Merge criteria
Summary by CodeRabbit
New Features
- Conditional collection of GPU (accelerator) metrics when monitoring metrics are configured, with Prometheus recording rules for aggregation.
Documentation
- New GPU metrics documentation (docs/GPU_METRICS.md).
Tests
- Unit tests covering the accelerator metrics template flag.