add built in alerting configuration for components #2261
Conversation
Walkthrough

Adds an Alerting API type and a monitoring.alerting field with cross-field validation; extends CRDs, RBAC, and CSV for PrometheusRule; embeds per-component PrometheusRule templates; implements deployAlerting plus add/cleanup helpers in the monitoring reconciler; propagates alerting from DSCInitialization; adds e2e tests and deepcopy/utility updates.
Sequence Diagram(s)

```mermaid
sequenceDiagram
participant User
participant DSCI as DSCInitialization
participant Mon as Monitoring CR
participant Recon as Monitoring Reconciler
participant API as K8s API
participant PR as PrometheusRule CRD
User->>DSCI: set spec.monitoring.alerting = {}
DSCI->>Mon: create Monitoring CR (alerting propagated)
Recon->>API: Reconcile(Monitoring)
Recon->>PR: Check PrometheusRule CRD exists
alt alerting not configured or CRD missing
Recon->>API: Set ConditionAlertingAvailable=False (NotConfigured/CRDNotFound)
else CRD present and alerting enabled
Recon->>API: Set ConditionAlertingAvailable=True
Recon->>API: Apply operator-prometheusrules
loop for each component
alt component enabled and ready
Recon->>API: Apply <component>-prometheusrules (addPrometheusRules)
else component disabled
Recon->>API: Delete <component>-prometheusrules (cleanupPrometheusRules)
end
end
end
```

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Actionable comments posted: 8
🔭 Outside diff range comments (3)
internal/controller/components/llamastackoperator/monitoring/llamastackoperator-prometheusrules.tmpl.yaml (1)
Lines 1-99: Fix YAML formatting issues. The PrometheusRule logic is sound, with appropriate SLO-based alerting and recording rules; however, there are several YAML formatting issues that need to be addressed.

Apply this diff to fix the formatting issues:

```diff
 groups:
-      - name: SLOs-probe_success_llama_stack_k8s_operator
+    - name: SLOs-probe_success_llama_stack_k8s_operator
     rules:
     - alert: Llama Stack K8s Operator Probe Success Burn Rate
       # ... alert content ...
       labels:
         severity: warning
         instance: llama-stack-k8s-operator-controller-manager
-      
+
     - alert: Llama Stack K8s Operator Probe Success Burn Rate
       # ... alert content ...
       labels:
         severity: warning
         instance: llama-stack-k8s-operator-controller-manager
-      
+
     - alert: Llama Stack K8s Operator Probe Success Burn Rate
       # ... alert content ...
       labels:
         severity: warning
         instance: llama-stack-k8s-operator-controller-manager
-      
+
   # RecordingRules for Llama Stack
-      - name: SLOs - Llama Stack K8s Operator
+    - name: SLOs - Llama Stack K8s Operator
     rules:
-        - expr: |
+          - expr: |
           absent(up{job="llama-stack-k8s-operator"}) * 0 or vector(1)
         labels:
           instance: llama-stack-k8s-operator-controller-manager
         record: probe_success
-        
+
-        - expr: |
+          - expr: |
           1 - min(avg_over_time(probe_success{instance="llama-stack-k8s-operator-controller-manager"}[1d]))
         labels:
           instance: llama-stack-k8s-operator-controller-manager
         record: probe_success:burnrate1d
-        
+
 # Apply similar fixes to remove trailing spaces from other recording rules
```

internal/controller/components/modelmeshserving/monitoring/modelmeshservice-prometheusrules.tmpl.yaml (1)
Lines 1-90: Fix YAML formatting and indentation issues. The PrometheusRule content is well-structured with appropriate alerting thresholds and recording rules; however, there are formatting issues that need to be addressed.
Please fix the following YAML formatting issues:
- Line 8: Change indentation from 6 to 4 spaces for consistency
- Line 52: Change indentation from 8 to 10 spaces for the recording rules
- Remove trailing spaces from lines: 22, 35, 57, 63, 69, 75, 81, 87, 93
The alerting logic using different severity levels (critical for shorter time windows, warning for longer ones) is appropriate for SLO monitoring.
internal/controller/components/feastoperator/monitoring/feastoperator-prometheusrules.tmpl.yaml (1)
Lines 1-99: Fix YAML formatting issues. The PrometheusRule configuration is correct with appropriate SLO-based alerting; however, there are YAML formatting issues that need to be addressed:
Please fix the following formatting issues:
- Line 8: Change indentation from 6 to 4 spaces
- Line 52: Change indentation from 8 to 10 spaces
- Remove trailing spaces from lines: 22, 35, 57, 63, 69, 75, 81, 87, 93
The alerting configuration with critical severity for short-term burn rates and warning for long-term ones follows best practices for SLO alerting.
🧹 Nitpick comments (7)
internal/controller/components/components.go (1)

Lines 7-20: Optional: collapse the repetitive `//go:embed` directives for readability. Go 1.17+ allows a single `//go:embed` line to accept multiple space-separated patterns, so the 14 directives can be collapsed into one or two lines, reducing visual noise without changing behaviour. (Note that `//go:embed` patterns do not support brace expansion, so a `{a,b}/monitoring` form would not compile; space-separated patterns must be used instead.)

```diff
-//go:embed kueue/monitoring
-//go:embed trainingoperator/monitoring
-//go:embed trustyai/monitoring
-//go:embed workbenches/monitoring
-//go:embed codeflare/monitoring
-//go:embed dashboard/monitoring
-//go:embed datasciencepipelines/monitoring
-//go:embed feastoperator/monitoring
-//go:embed kserve/monitoring
-//go:embed llamastackoperator/monitoring
-//go:embed modelcontroller/monitoring
-//go:embed modelmeshserving/monitoring
-//go:embed modelregistry/monitoring
-//go:embed ray/monitoring
+//go:embed kueue/monitoring trainingoperator/monitoring trustyai/monitoring workbenches/monitoring codeflare/monitoring dashboard/monitoring datasciencepipelines/monitoring
+//go:embed feastoperator/monitoring kserve/monitoring llamastackoperator/monitoring modelcontroller/monitoring modelmeshserving/monitoring modelregistry/monitoring ray/monitoring
```

docs/api-overview.md (1)
Lines 2863-2868: Document the “metrics-required” validation for `alerting`. The CRD sets a CEL rule that `alerting` can only be enabled when `metrics` is configured, yet this important behavioural contract is not mentioned here. Add a short note such as: “Alerting requires `.metrics` to be non-null; the operator will reject the CR if metrics are disabled.” That prevents user confusion and failed reconciliations.
internal/controller/components/dashboard/monitoring/dashboard-prometheusrules.tmpl.yaml (1)

Lines 58-107: Maintain consistency in label usage across components. This file groups by the `name` label while kserve groups by `instance`. Consider standardizing the label usage across all component PrometheusRules for consistency. The same alert naming and namespace flexibility issues mentioned above also apply here.
internal/controller/services/monitoring/monitoring_controller_actions.go (1)

Lines 415-417: Enhance the final error message with more context. The generic error message doesn't indicate the scope of the issues; consider including counts or component names.

```diff
 if len(addErrors) > 0 || len(cleanupErrors) > 0 {
-	return fmt.Errorf("errors occurred while adding or cleaning up prometheus rules for components")
+	return fmt.Errorf("prometheus rules management failed: %d add errors, %d cleanup errors", len(addErrors), len(cleanupErrors))
 }
```

internal/controller/components/workbenches/monitoring/workbenches-prometheusrules.tmpl.yaml (3)
Lines 8-178: YAML-lint flags multiple indentation / trailing-space issues. Several lines (e.g. lines 8, 10, 31, 53, 105, 143, 174) have 6- or 8-space indentation where 4 is expected, and lines 50 and 171 contain trailing spaces. While Kubernetes is lenient, maintaining a consistent two-space increment prevents future merge conflicts and avoids tooling problems (helm-lint, kube-lint, etc.). No functional change, but consider running `yamllint -d relaxed` or `prettier --write` on the template before committing.
Lines 19-27: The `severity` level for 100 % PVC usage is only warning. At 100 % utilisation the PVC becomes read-only and notebooks will fail. A `critical` severity (or at least `error`) is normally expected so on-call engineers are paged.

```diff
-          severity: warning
+          severity: critical
```

(Adjust to your org’s routing policy.)
Lines 172-178: The `rhods_aggregate_availability` expression is hard to read. The nested `or`/`label_replace` construction is valid but quite opaque. Consider extracting intermediate recording rules (e.g. `rhods_dashboard_avail`, `notebook_spawner_avail`) and then taking the `min()` of those. This keeps alert rules readable and easier to maintain.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (31)
- api/services/v1alpha1/monitoring_types.go (3 hunks)
- api/services/v1alpha1/zz_generated.deepcopy.go (2 hunks)
- config/crd/bases/dscinitialization.opendatahub.io_dscinitializations.yaml (2 hunks)
- config/crd/bases/services.platform.opendatahub.io_monitorings.yaml (2 hunks)
- config/rbac/role.yaml (2 hunks)
- docs/api-overview.md (4 hunks)
- internal/controller/components/codeflare/monitoring/codeflare-prometheusrules.tmpl.yaml (1 hunks)
- internal/controller/components/components.go (1 hunks)
- internal/controller/components/dashboard/monitoring/dashboard-prometheusrules.tmpl.yaml (1 hunks)
- internal/controller/components/datasciencepipelines/monitoring/datasciencepipelines-prometheusrules.tmpl.yaml (1 hunks)
- internal/controller/components/feastoperator/monitoring/feastoperator-prometheusrules.tmpl.yaml (1 hunks)
- internal/controller/components/kserve/monitoring/kserve-prometheusrules.tmpl.yaml (1 hunks)
- internal/controller/components/kueue/monitoring/kueue-prometheusrules.tmpl.yaml (1 hunks)
- internal/controller/components/llamastackoperator/monitoring/llamastackoperator-prometheusrules.tmpl.yaml (1 hunks)
- internal/controller/components/modelcontroller/monitoring/modelcontroller-prometheusrules.tmpl.yaml (1 hunks)
- internal/controller/components/modelmeshserving/monitoring/modelmeshservice-prometheusrules.tmpl.yaml (1 hunks)
- internal/controller/components/modelregistry/monitoring/modelregistry-prometheusrules.tmpl.yaml (1 hunks)
- internal/controller/components/ray/monitoring/ray-prometheusrules.tmpl.yaml (1 hunks)
- internal/controller/components/trainingoperator/monitoring/trainingoperator-prometheusrules.tmpl.yaml (1 hunks)
- internal/controller/components/trustyai/monitoring/trustyai-prometheusrules.tmpl.yaml (1 hunks)
- internal/controller/components/workbenches/monitoring/workbenches-prometheusrules.tmpl.yaml (1 hunks)
- internal/controller/dscinitialization/dscinitialization_controller.go (1 hunks)
- internal/controller/dscinitialization/kubebuilder_rbac.go (1 hunks)
- internal/controller/services/monitoring/monitoring/operator-prometheusrules.tmpl.yaml (1 hunks)
- internal/controller/services/monitoring/monitoring_controller.go (2 hunks)
- internal/controller/services/monitoring/monitoring_controller_actions.go (3 hunks)
- internal/controller/services/monitoring/monitoring_controller_support.go (2 hunks)
- internal/controller/status/status.go (2 hunks)
- pkg/cluster/gvk/gvk.go (1 hunks)
- pkg/common/common.go (2 hunks)
- tests/e2e/monitoring_test.go (2 hunks)
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: in the opendatahub-operator repository, crd files under config/crd.rhoai/bases/ are generated files ...
Learnt from: kahowell
PR: opendatahub-io/opendatahub-operator#2220
File: config/crd.rhoai/bases/dscinitialization.opendatahub.io_dscinitializations.yaml:90-107
Timestamp: 2025-07-29T18:45:24.518Z
Learning: In the opendatahub-operator repository, CRD files under config/crd.rhoai/bases/ are generated files (using controller-gen from Go structs), and inconsistencies between enum and pattern validation should not be flagged as they require upstream source code changes rather than direct YAML modifications.
Applied to files:
internal/controller/services/monitoring/monitoring/operator-prometheusrules.tmpl.yaml
config/crd/bases/services.platform.opendatahub.io_monitorings.yaml
config/crd/bases/dscinitialization.opendatahub.io_dscinitializations.yaml
internal/controller/services/monitoring/monitoring_controller_support.go
📚 Learning: in the opendatahub-operator repository, when featuretrackers are being removed or deprecated, the fe...
Learnt from: mlassak
PR: opendatahub-io/opendatahub-operator#2010
File: config/crd/kustomization.yaml:22-22
Timestamp: 2025-07-22T10:32:09.737Z
Learning: In the opendatahub-operator repository, when FeatureTrackers are being removed or deprecated, the FeatureTracker CRD reference in config/crd/kustomization.yaml should be kept for backward compatibility during the migration period, even if some components no longer use FeatureTrackers.
Applied to files:
internal/controller/services/monitoring/monitoring/operator-prometheusrules.tmpl.yaml
config/crd/bases/services.platform.opendatahub.io_monitorings.yaml
internal/controller/components/trainingoperator/monitoring/trainingoperator-prometheusrules.tmpl.yaml
📚 Learning: in rhoai patch files, there are two naming patterns for dscinitialization crd patches: files with `....
Learnt from: kahowell
PR: opendatahub-io/opendatahub-operator#2220
File: config/crd.rhoai/patches/cainjection_in_dscinitialization.opendatahub.io_dscinitializations.yaml:5-7
Timestamp: 2025-07-29T18:43:10.232Z
Learning: In RHOAI patch files, there are two naming patterns for DSCInitialization CRD patches: files with `.opendatahub.io` in their filename use the duplicated pattern `dscinitializations.dscinitialization.opendatahub.io.opendatahub.io` in the CRD name, while files without the group in filename use the standard pattern `dscinitializations.dscinitialization.opendatahub.io`. This duplication appears to be intentional and exists in both regular and RHOAI branches.
Applied to files:
config/crd/bases/dscinitialization.opendatahub.io_dscinitializations.yaml
🧬 Code Graph Analysis (4)

internal/controller/services/monitoring/monitoring_controller.go (2)
- pkg/cluster/gvk/gvk.go (1): PrometheusRule (468-472)
- pkg/controller/reconciler/reconciler_support.go (2): Dynamic (69-74), CrdExists (77-85)

api/services/v1alpha1/zz_generated.deepcopy.go (1)
- api/services/v1alpha1/monitoring_types.go (1): Alerting (117-118)

internal/controller/dscinitialization/dscinitialization_controller.go (2)
- api/services/v1alpha1/monitoring_types.go (2): Alerting (117-118), Monitoring (129-135)
- pkg/cluster/gvk/gvk.go (1): Monitoring (288-292)

tests/e2e/monitoring_test.go (5)
- tests/e2e/resource_options_test.go (3): WithMinimalObject (140-155), WithMutateFunc (194-198), WithCondition (203-207)
- pkg/cluster/gvk/gvk.go (5): DSCInitialization (66-70), DataScienceCluster (60-64), Namespace (24-28), Dashboard (210-214), PrometheusRule (468-472)
- pkg/utils/test/testf/testf_support.go (2): TransformPipeline (50-61), Transform (78-106)
- pkg/utils/test/matchers/jq/jq_matcher.go (1): Match (11-15)
- internal/controller/status/status.go (2): ConditionTypeReady (90-90), ConditionTypeProvisioningSucceeded (95-95)
🪛 YAMLlint (1.37.1)
internal/controller/components/llamastackoperator/monitoring/llamastackoperator-prometheusrules.tmpl.yaml
[warning] 8-8: wrong indentation: expected 4 but found 6
(indentation)
[error] 22-22: trailing spaces
(trailing-spaces)
[error] 35-35: trailing spaces
(trailing-spaces)
[error] 48-48: trailing spaces
(trailing-spaces)
[error] 49-49: trailing spaces
(trailing-spaces)
[warning] 52-52: wrong indentation: expected 10 but found 8
(indentation)
[error] 57-57: trailing spaces
(trailing-spaces)
[error] 63-63: trailing spaces
(trailing-spaces)
[error] 69-69: trailing spaces
(trailing-spaces)
[error] 75-75: trailing spaces
(trailing-spaces)
[error] 81-81: trailing spaces
(trailing-spaces)
[error] 87-87: trailing spaces
(trailing-spaces)
[error] 93-93: trailing spaces
(trailing-spaces)
internal/controller/components/workbenches/monitoring/workbenches-prometheusrules.tmpl.yaml
[warning] 8-8: wrong indentation: expected 4 but found 6
(indentation)
[warning] 10-10: wrong indentation: expected 10 but found 8
(indentation)
[warning] 31-31: wrong indentation: expected 10 but found 8
(indentation)
[error] 50-50: trailing spaces
(trailing-spaces)
[warning] 53-53: wrong indentation: expected 10 but found 8
(indentation)
[warning] 105-105: wrong indentation: expected 10 but found 8
(indentation)
[warning] 143-143: wrong indentation: expected 10 but found 8
(indentation)
[error] 171-171: trailing spaces
(trailing-spaces)
[warning] 174-174: wrong indentation: expected 10 but found 8
(indentation)
internal/controller/components/codeflare/monitoring/codeflare-prometheusrules.tmpl.yaml
[warning] 8-8: wrong indentation: expected 4 but found 6
(indentation)
[warning] 10-10: wrong indentation: expected 10 but found 8
(indentation)
[error] 68-68: trailing spaces
(trailing-spaces)
[warning] 72-72: wrong indentation: expected 10 but found 8
(indentation)
[warning] 110-110: wrong indentation: expected 10 but found 8
(indentation)
internal/controller/components/feastoperator/monitoring/feastoperator-prometheusrules.tmpl.yaml
[warning] 8-8: wrong indentation: expected 4 but found 6
(indentation)
[error] 22-22: trailing spaces
(trailing-spaces)
[error] 35-35: trailing spaces
(trailing-spaces)
[warning] 52-52: wrong indentation: expected 10 but found 8
(indentation)
[error] 57-57: trailing spaces
(trailing-spaces)
[error] 63-63: trailing spaces
(trailing-spaces)
[error] 69-69: trailing spaces
(trailing-spaces)
[error] 75-75: trailing spaces
(trailing-spaces)
[error] 81-81: trailing spaces
(trailing-spaces)
[error] 87-87: trailing spaces
(trailing-spaces)
[error] 93-93: trailing spaces
(trailing-spaces)
🔇 Additional comments (41)
pkg/common/common.go (2)

Lines 23-26: LGTM! Clean utility function implementation. The new imports and `FileExists` function are well-implemented. The function correctly uses `fs.Stat` to check file existence in embedded filesystems, which is the standard Go pattern for this operation.

Lines 124-128: LGTM! Simple and effective file existence check. The function implementation is correct and follows Go conventions for checking file existence in embedded filesystems.
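For reference, a minimal sketch of what such a helper looks like — the name and behaviour come from the review above, while the exact signature is an assumption:

```go
import "io/fs"

// FileExists reports whether path exists in the given filesystem.
// fs.Stat works uniformly on embed.FS and any other fs.FS, which is
// why this helper can be used against the embedded template trees.
func FileExists(fsys fs.FS, path string) bool {
	_, err := fs.Stat(fsys, path)
	return err == nil
}
```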
internal/controller/services/monitoring/monitoring_controller.go (2)

Line 101: LGTM! Consistent dynamic resource watching. The PrometheusRule resource watching follows the same conditional pattern as other monitoring resources, ensuring the controller only manages these resources when the corresponding CRD exists.

Line 131: LGTM! Well-positioned alerting deployment action. The `deployAlerting` action is appropriately placed in the reconciliation chain after the monitoring stack deployment, which ensures the required infrastructure is available before deploying alerting rules.

internal/controller/dscinitialization/dscinitialization_controller.go (1)
Line 454: LGTM! Consistent field assignment pattern. The alerting configuration assignment follows the same pattern as the existing metrics and traces fields, ensuring proper propagation of the alerting configuration from DSCInitialization to the Monitoring CR.
pkg/cluster/gvk/gvk.go (1)

Lines 468-472: LGTM! Consistent GVK declaration for PrometheusRule. The PrometheusRule GVK follows the same pattern as other declarations and correctly uses the `monitoring.rhobs` API group with the stable v1 version, consistent with the existing ServiceMonitor GVK.
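Concretely, the declaration is presumably along these lines — the group, version, and kind are confirmed above, while the variable shape is assumed to follow the file's existing pattern:

```go
import "k8s.io/apimachinery/pkg/runtime/schema"

// PrometheusRule identifies the monitoring.rhobs PrometheusRule resource,
// mirroring the existing ServiceMonitor declaration in this file.
var PrometheusRule = schema.GroupVersionKind{
	Group:   "monitoring.rhobs",
	Version: "v1",
	Kind:    "PrometheusRule",
}
```

internal/controller/dscinitialization/kubebuilder_rbac.go (1)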
Lines 60-62: LGTM! Comprehensive RBAC permissions for PrometheusRule management. The RBAC annotations correctly grant full permissions on `prometheusrules` resources and their subresources, following the same pattern as other `monitoring.rhobs` resources. These permissions are essential for the operator to manage alerting rules as described in the PR objectives.
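An illustrative kubebuilder marker for such a grant — the verb list matches the generated rule shown later in config/rbac/role.yaml, but the exact marker text here is an assumption:

```go
// +kubebuilder:rbac:groups=monitoring.rhobs,resources=prometheusrules,verbs=get;list;watch;create;update;patch;delete
```

api/services/v1alpha1/zz_generated.deepcopy.go (2)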
Lines 27-40: LGTM! The autogenerated deepcopy functions for the Alerting type are correct. They follow the standard Kubernetes pattern and are consistent with other deepcopy implementations in this file.

Lines 259-263: LGTM! Proper handling of the optional Alerting field. The deepcopy logic correctly handles the optional `Alerting` field with proper nil checking and pointer dereferencing, consistent with how other optional fields are handled in this function.
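The nil-check pattern in question is the standard one controller-gen emits for optional pointer fields — shown here as a representative excerpt rather than the exact generated code:

```go
// Inside the spec's DeepCopyInto: copy the optional pointer field only
// when it is set, allocating a fresh Alerting for the destination.
if in.Alerting != nil {
	in, out := &in.Alerting, &out.Alerting
	*out = new(Alerting)
	**out = **in
}
```

internal/controller/components/ray/monitoring/ray-prometheusrules.tmpl.yaml (1)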
Lines 1-18: Alert job label pattern validated successfully. The regex `job=~'KubeRay Operator'` correctly matches the Prometheus scrape configuration and unit tests:
- In config/monitoring/prometheus/apps/prometheus-configs.yaml: `job_name: 'KubeRay Operator'`
- In tests/prometheus_unit_tests/ray-alerting.unit-tests.yaml: series and expected alerts use `up{job="KubeRay Operator"}`

No further changes required.
internal/controller/components/kueue/monitoring/kueue-prometheusrules.tmpl.yaml (1)

Lines 1-18: PrometheusRule job label regex validated. The `job=~'Kueue Operator'` expression correctly matches the `job_name: 'Kueue Operator'` in the Prometheus scrape configuration and the unit tests’ `up{job="Kueue Operator"}` series. No changes required.

internal/controller/status/status.go (2)
Line 108: LGTM! The new alerting condition constant follows the established naming pattern. `ConditionAlertingAvailable` is consistently named and appropriately placed with the other condition type constants.

Lines 189-190: LGTM! The alerting configuration constants follow established patterns. `AlertingNotConfiguredReason` and `AlertingNotConfiguredMessage` are consistent with other "NotConfigured" patterns in the codebase and provide clear messaging.
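A sketch of what these additions plausibly look like — the identifier names are confirmed above, while the string values are illustrative assumptions:

```go
// Condition type reported for alerting on the Monitoring CR.
const ConditionAlertingAvailable = "AlertingAvailable"

// Reason/message used when spec.alerting is not set.
const (
	AlertingNotConfiguredReason  = "AlertingNotConfigured"
	AlertingNotConfiguredMessage = "Alerting is not configured"
)
```

internal/controller/components/trainingoperator/monitoring/trainingoperator-prometheusrules.tmpl.yaml (1)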
Lines 1-18: Confirmed the PrometheusRule job label matches configuration and tests. The regex matcher `job=~'KubeFlow Training Operator'` exactly matches the `job_name: 'KubeFlow Training Operator'` defined in config/monitoring/prometheus/apps/prometheus-configs.yaml and is exercised as `up{job="KubeFlow Training Operator"}` in the unit tests under tests/prometheus_unit_tests/trainingoperator-alerting.unit-tests.yaml. LGTM.
config/rbac/role.yaml (1)

Lines 758-770: Consider adding the `deletecollection` verb for parity with the `monitoring.coreos.com` rules. Your controller already requests `deletecollection` for `monitoring.coreos.com/prometheusrules` (lines 746-755). If the same bulk-deletion logic is ever exercised against the `monitoring.rhobs/prometheusrules` resource, the absence of this verb will trigger an RBAC “forbidden” error and leave orphaned rules behind.

```diff
   - prometheusrules
+  verbs:
+  - create
+  - delete
+  - deletecollection # ← add for symmetry & future-proofing
+  - get
+  - list
+  - patch
+  - update
+  - watch
-  verbs:
-  - create
-  - delete
-  - get
-  - list
-  - patch
-  - update
-  - watch
```

Do you rely on bulk cleanup via `DeleteCollection` anywhere in the controller? If yes, the change is mandatory; otherwise it is a harmless hardening.

config/crd/bases/dscinitialization.opendatahub.io_dscinitializations.yaml (2)
Lines 93-96: LGTM – new `alerting` field added. The schema addition matches the API changes and is guarded by a cross-field validation rule further below. No issues spotted.

Lines 242-245: Validation rule looks correct. `has(self.alerting) ? has(self.metrics) : true` enforces the intended dependency without over-constraining the object.

config/crd/bases/services.platform.opendatahub.io_monitorings.yaml (2)
Lines 55-57: LGTM – `alerting` property introduced. The schema update aligns with controller behaviour.

Lines 190-193: Cross-field validation properly enforces the Metrics ↔ Alerting dependency. Same rule as in the DSCInitialization CRD; looks good.
docs/api-overview.md (1)
Lines 2959-2966: Keep the spec sections consistent. `MonitoringCommonSpec` and `MonitoringSpec` now list the new `alerting` field, but unlike `metrics` and `traces` they do not link back to a completed `Alerting` schema (see earlier comment) nor mention default behaviour (e.g. “nil ⇒ disabled”). Once the `Alerting` table is added, ensure these links resolve and that defaults are stated for parity with the other subsections.

Also applies to: 3000-3005
api/services/v1alpha1/monitoring_types.go (2)
Lines 116-118: Well-designed empty struct for future extensibility. The empty `Alerting` struct follows Go conventions and provides a foundation for future alerting configuration fields while maintaining backward compatibility.
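A sketch of the shape being described — the type name and the fact that it is empty are confirmed above; the field tag on the spec side is an assumption based on the `monitoring.alerting` field name:

```go
// Alerting is intentionally empty for now; fields can be added later
// without breaking existing CRs.
type Alerting struct{}

// In the monitoring spec: a pointer, so a nil value means "disabled".
// Alerting *Alerting `json:"alerting,omitempty"`
```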
Line 138: Excellent validation rule ensuring proper dependency management. The XValidation rule correctly enforces that alerting configuration requires metrics to be enabled, preventing invalid configurations. The conditional logic `has(self.alerting) ? has(self.metrics) : true` is well-structured and the error message is clear and actionable.

Also applies to: 151-152
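For context, this kind of CEL rule is attached with a kubebuilder marker on the spec type; the rule string is confirmed above, while the message text here is an assumption:

```go
// +kubebuilder:validation:XValidation:rule="has(self.alerting) ? has(self.metrics) : true",message="alerting requires metrics to be configured"
```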
internal/controller/components/modelregistry/monitoring/modelregistry-prometheusrules.tmpl.yaml (2)
Lines 1-46: Well-structured SLO-based alerting rules. The alerting rules follow Prometheus SLO best practices with:
- Multiple time window combinations for burn rate detection
- Appropriate severity levels (critical/warning) based on burn rate thresholds
- Clear annotations with actionable messages
- Proper error budget calculations using 98% SLI
The alert structure effectively balances sensitivity and noise reduction.
Lines 47-90: Comprehensive recording rules for efficient alerting. The recording rules are well-designed:
- Cover all necessary time windows (5m to 3d) for burn rate calculations
- Use the consistent naming convention `probe_success:burnrate<duration>`
- Include fallback logic with `absent()` and `vector(1)` for missing metrics
- Properly labeled for reuse in alerting rules

This approach optimizes query performance for alert evaluation.
internal/controller/services/monitoring/monitoring/operator-prometheusrules.tmpl.yaml (1)

Lines 1-21: Appropriate operator-level monitoring configuration. The PrometheusRule effectively monitors operator health by:
- Tracking reconciliation failure rates for key controllers
- Using 15-minute intervals suitable for controller monitoring
- Filtering by specific job label for proper metric scoping
- Recording metrics for efficient downstream alerting

The focus on `result!="success"` provides clear visibility into controller operational issues.

internal/controller/components/trustyai/monitoring/trustyai-prometheusrules.tmpl.yaml (1)
Lines 1-90: Excellent consistency with other component monitoring rules. The TrustyAI PrometheusRule maintains consistent patterns with other components:
- Same SLO-based alerting approach with 98% error budget
- Identical time windows and burn rate thresholds
- Consistent alert severity mapping and annotation structure
- Proper instance labeling for the controller manager
This standardization simplifies operational procedures and ensures uniform monitoring coverage across components.
internal/controller/components/codeflare/monitoring/codeflare-prometheusrules.tmpl.yaml (3)
Lines 14-17: Consider consistency in SLI thresholds across components. CodeFlare uses a 99.950% SLI (error budget of 0.0005) while other components use 98% (error budget of 0.02). This creates significantly different alerting sensitivities across the platform.
Should all components use consistent SLI thresholds for operational consistency, or does CodeFlare require stricter monitoring due to its critical role in distributed workloads?
Also applies to: 26-29, 38-41
Lines 50-67: Excellent operational guidance in alert annotations. The availability alerts include valuable operational context:
- Clear descriptions of alert conditions
- Direct links to triage documentation
- Appropriate severity levels for different scenarios
This enhances the operational experience and reduces mean time to resolution.
Lines 69-145: Comprehensive recording rules covering multiple components. The recording rules effectively support both MCAD Controller and CodeFlare Operator monitoring with:
- Consistent burn rate calculations across time windows
- Proper job filtering for metric scoping
- Clear instance labeling for component identification
This dual-component approach provides complete coverage for the CodeFlare ecosystem.
tests/e2e/monitoring_test.go (3)
Lines 52-53: LGTM: Well-structured test case additions. The new test cases for PrometheusRule creation and deletion are properly integrated into the test suite and follow the existing pattern.

Lines 603-648: LGTM: Comprehensive PrometheusRule creation test. The test correctly:
- Enables alerting configuration in DSCI
- Sets dashboard component to managed state
- Validates dashboard resource readiness with proper condition checks
- Verifies existence of both component-specific and operator PrometheusRule resources
Lines 650-665: LGTM: Proper cleanup validation. The deletion test correctly verifies that PrometheusRule resources are cleaned up when a component is disabled by setting its management state to "Removed".
internal/controller/services/monitoring/monitoring_controller_support.go (2)
Lines 188-201: LGTM: Well-implemented PrometheusRule template management. The function properly (see the sketch after this list):
- Constructs the expected file path for component PrometheusRule templates
- Validates file existence in the embedded filesystem before adding to templates
- Provides clear error messaging for missing files
- Follows the established pattern for template management
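A runnable sketch of that shape under stated assumptions — the helper and variable names here are hypothetical, and `fstest.MapFS` stands in for the operator's embedded filesystem:

```go
package main

import (
	"fmt"
	"io/fs"
	"path"
	"testing/fstest"
)

// fileExists mirrors the pkg/common helper: fs.Stat against any fs.FS.
func fileExists(fsys fs.FS, p string) bool {
	_, err := fs.Stat(fsys, p)
	return err == nil
}

// addPrometheusRules builds the conventional template path for a component
// and registers it only if the file is present in the embedded tree,
// returning a clear error otherwise.
func addPrometheusRules(fsys fs.FS, component string, templates *[]string) error {
	p := path.Join(component, "monitoring", component+"-prometheusrules.tmpl.yaml")
	if !fileExists(fsys, p) {
		return fmt.Errorf("prometheus rules template %q not found for component %q", p, component)
	}
	*templates = append(*templates, p)
	return nil
}

func main() {
	fsys := fstest.MapFS{
		"dashboard/monitoring/dashboard-prometheusrules.tmpl.yaml": &fstest.MapFile{Data: []byte("kind: PrometheusRule")},
	}
	var templates []string
	fmt.Println(addPrometheusRules(fsys, "dashboard", &templates)) // <nil>
	fmt.Println(addPrometheusRules(fsys, "ray", &templates))       // error: template not found
}
```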
Lines 203-219: LGTM: Proper PrometheusRule cleanup implementation. The cleanup function correctly (a sketch follows this list):
- Creates an unstructured object with the proper PrometheusRule GVK
- Constructs the expected resource name and namespace
- Handles NotFound errors gracefully (resource already deleted)
- Provides clear error messages for deletion failures
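A minimal sketch under the same assumptions — the function and parameter names are hypothetical, while the `<component>-prometheusrules` naming comes from the walkthrough's sequence diagram:

```go
import (
	"context"
	"fmt"

	k8serr "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// cleanupPrometheusRules deletes a component's PrometheusRule, treating
// NotFound as success since the resource may already be gone.
func cleanupPrometheusRules(ctx context.Context, cli client.Client, component, namespace string) error {
	obj := &unstructured.Unstructured{}
	obj.SetGroupVersionKind(schema.GroupVersionKind{
		Group: "monitoring.rhobs", Version: "v1", Kind: "PrometheusRule",
	})
	obj.SetName(component + "-prometheusrules")
	obj.SetNamespace(namespace)

	if err := cli.Delete(ctx, obj); err != nil && !k8serr.IsNotFound(err) {
		return fmt.Errorf("failed to delete PrometheusRule %s/%s: %w", namespace, obj.GetName(), err)
	}
	return nil
}
```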
internal/controller/components/kserve/monitoring/kserve-prometheusrules.tmpl.yaml (2)
Lines 1-7: LGTM! The PrometheusRule metadata and structure are correctly defined with proper API version and templated namespace.
Lines 48-51: No action required: the job selector matches the Prometheus config. The `job_name: 'Kserve Controller Manager'` used in the alerting rule exactly matches the scrape configuration in config/monitoring/prometheus/apps/prometheus-configs.yaml:

- In config/monitoring/prometheus/apps/prometheus-configs.yaml, you have:

```yaml
- job_name: 'Kserve Controller Manager'
  honor_labels: true
  metrics_path: /metrics
```

- The alert rule in internal/controller/components/kserve/monitoring/kserve-prometheusrules.tmpl.yaml correctly uses `absent(up{job=~'Kserve Controller Manager'}) * 0 or vector(1)`.

Since the selector aligns with your Prometheus scrape job, no changes are needed here.
internal/controller/components/dashboard/monitoring/dashboard-prometheusrules.tmpl.yaml (1)
Lines 109-196: Recording rules are well-structured. The recording rules correctly calculate burn rates for both HAProxy HTTP responses and probe success metrics over multiple time windows.
internal/controller/components/modelcontroller/monitoring/modelcontroller-prometheusrules.tmpl.yaml (1)
Line 51: Verify the job name with spaces. Similar to kserve, the job name 'ODH Model Controller' contains spaces; verify this matches your actual Prometheus job configuration.
internal/controller/services/monitoring/monitoring_controller_actions.go (2)
Lines 311-339: Well-structured validation and CRD existence check. The function properly validates the instance type, checks for alerting configuration, and verifies CRD availability before proceeding. Good defensive programming.

Lines 359-392: Excellent error collection pattern. The approach of collecting errors without failing fast allows all components to be processed, providing better visibility into multiple issues. The separation of add and cleanup errors aids in debugging.
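In outline, the pattern reads like this — continuing the hypothetical helpers sketched above, with all surrounding names assumed, and using the error message suggested earlier in this review:

```go
// Process every component, collecting failures instead of returning early,
// so one broken component cannot block alerting for the rest.
var addErrors, cleanupErrors []error
for _, c := range componentNames {
	if enabled[c] {
		if err := addPrometheusRules(fsys, c, &templates); err != nil {
			addErrors = append(addErrors, err)
		}
	} else if err := cleanupPrometheusRules(ctx, cli, c, namespace); err != nil {
		cleanupErrors = append(cleanupErrors, err)
	}
}
if len(addErrors) > 0 || len(cleanupErrors) > 0 {
	return fmt.Errorf("prometheus rules management failed: %d add errors, %d cleanup errors",
		len(addErrors), len(cleanupErrors))
}
```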
internal/controller/components/workbenches/monitoring/workbenches-prometheusrules.tmpl.yaml (1)
Lines 14-24: Regex mismatch between numerator & denominator. The numerator regex is `.*jupyterhub-nb-.*` (leading `.*`), the denominator is `jupyterhub-nb-.*` (no leading wildcard). This asymmetry is probably unintentional and can hide mis-matched series, slightly skewing the ratio.

```diff
- expr: kubelet_volume_stats_used_bytes{persistentvolumeclaim=~".*jupyterhub-nb-.*"} / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"jupyterhub-nb-.*"} > 0.9 ...
+ expr: kubelet_volume_stats_used_bytes{persistentvolumeclaim=~".*jupyterhub-nb-.*"} / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~".*jupyterhub-nb-.*"} > 0.9 ...
```

Same for the 100 % rule on line 23.
From the docs/api-overview.md diff, the section being discussed below adds only a header and description:

#### Alerting

Alerting configuration for Prometheus
🛠️ Refactor suggestion

The `Alerting` section is incomplete – no schema table provided. The new header correctly introduces the `Alerting` type, but the usual field-by-field Markdown table is missing. Consumers of this generated API reference have no way to know which keys are accepted (e.g. `disabled`, `receivers`, etc.) or their validation constraints. Please re-run `controller-gen`/`make manifests` (or whichever doc generator you use) after adding proper godoc comments to the `Alerting` struct so the table is rendered.
🤖 Prompt for AI Agents
In docs/api-overview.md around lines 2756 to 2764, the Alerting section lacks
the detailed schema table showing its fields and validation constraints. To fix
this, add comprehensive godoc comments to the Alerting struct in the source
code, then re-run the documentation generator such as controller-gen or make
manifests to regenerate the API reference with the complete field-by-field
Markdown table included.
Force-pushed from 23cb177 to ec962a0.
@zdtsw Partly yes, but we need to work out the migration path first. For now this is just to enable the alerting.
Actionable comments posted: 4
♻️ Duplicate comments (4)
internal/controller/components/codeflare/monitoring/codeflare-prometheusrules.tmpl.yaml (1)
Lines 8-10: Fix YAML indentation and formatting issues. The static analysis correctly identifies formatting problems that were previously flagged; the indentation issues and trailing spaces still need to be addressed.
Also applies to: 68-68, 72-72, 110-110
internal/controller/components/datasciencepipelines/monitoring/datasciencepipelines-prometheusrules.tmpl.yaml (2)
Line 20: Reconsider using 'info' severity for error burn rate alerts. Error burn rate alerts typically warrant 'warning' or 'critical' severity; using 'info' may cause these alerts to be ignored in production monitoring systems.
Also applies to: 32-32, 44-44, 56-56
Lines 101-102: Fix the inconsistency between the alert message and actual timing. The alert messages claim components have been NotReady for more than 5 minutes, but the expressions use `max_over_time(...[3m])` with `for: 2m`, totaling approximately 3-5 minutes.

Also applies to: 110-111, 119-120, 128-129
internal/controller/components/workbenches/monitoring/workbenches-prometheusrules.tmpl.yaml (1)

Lines 53-100: Duplicate alert names break the whole rule file. Prometheus rejects any rule group that contains duplicate `alert:` names, causing all rules in `SLOs-probe_success_workbench` to be ignored. The four alerts here are all named `RHODS Jupyter Probe Success Burn Rate`.

```diff
-    - alert: RHODS Jupyter Probe Success Burn Rate
+    - alert: RHODS Jupyter Probe Success Burn Rate 5m_1h
       ...
-    - alert: RHODS Jupyter Probe Success Burn Rate
+    - alert: RHODS Jupyter Probe Success Burn Rate 30m_6h
       ...
-    - alert: RHODS Jupyter Probe Success Burn Rate
+    - alert: RHODS Jupyter Probe Success Burn Rate 2h_1d
       ...
-    - alert: RHODS Jupyter Probe Success Burn Rate
+    - alert: RHODS Jupyter Probe Success Burn Rate 6h_3d
```

Make each alert name unique (or move them into separate groups) so the rule file loads.
🧹 Nitpick comments (2)
internal/controller/components/feastoperator/monitoring/feastoperator-prometheusrules.tmpl.yaml (1)
Lines 52-56: Review the probe success initialization logic. The expression `absent(up{job="feast-operator"}) * 0 or vector(1)` sets the probe_success metric to 1 when the up metric is absent, defaulting to "success". This could mask actual failures during operator startup or unavailability periods. Consider whether defaulting to success when metrics are absent aligns with your monitoring strategy.
internal/controller/components/workbenches/monitoring/workbenches-prometheusrules.tmpl.yaml (1)
Lines 141-171: Redundant recording rule & trailing whitespace. `rhods_active_users` (line 147) and `cluster:usage:consumption:rhods:active_users` (line 160) are derived from the same selector, producing two identical series. Consider dropping one or changing the expression to capture a different aspect (e.g., burst vs steady-state users). Also, line 171 contains trailing spaces; cleaning these up avoids noisy diffs.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (31)
The same 31 files as in the previous review.
✅ Files skipped from review due to trivial changes (3)
- pkg/cluster/gvk/gvk.go
- internal/controller/components/components.go
- docs/api-overview.md
🚧 Files skipped from review as they are similar to previous changes (22)
- pkg/common/common.go
- config/rbac/role.yaml
- internal/controller/dscinitialization/kubebuilder_rbac.go
- internal/controller/dscinitialization/dscinitialization_controller.go
- internal/controller/components/modelmeshserving/monitoring/modelmeshservice-prometheusrules.tmpl.yaml
- internal/controller/status/status.go
- config/crd/bases/services.platform.opendatahub.io_monitorings.yaml
- api/services/v1alpha1/monitoring_types.go
- internal/controller/components/modelregistry/monitoring/modelregistry-prometheusrules.tmpl.yaml
- api/services/v1alpha1/zz_generated.deepcopy.go
- tests/e2e/monitoring_test.go
- config/crd/bases/dscinitialization.opendatahub.io_dscinitializations.yaml
- internal/controller/services/monitoring/monitoring_controller.go
- internal/controller/components/kserve/monitoring/kserve-prometheusrules.tmpl.yaml
- internal/controller/components/modelcontroller/monitoring/modelcontroller-prometheusrules.tmpl.yaml
- internal/controller/components/ray/monitoring/ray-prometheusrules.tmpl.yaml
- internal/controller/components/kueue/monitoring/kueue-prometheusrules.tmpl.yaml
- internal/controller/components/trainingoperator/monitoring/trainingoperator-prometheusrules.tmpl.yaml
- internal/controller/components/trustyai/monitoring/trustyai-prometheusrules.tmpl.yaml
- internal/controller/services/monitoring/monitoring/operator-prometheusrules.tmpl.yaml
- internal/controller/services/monitoring/monitoring_controller_support.go
- internal/controller/components/dashboard/monitoring/dashboard-prometheusrules.tmpl.yaml
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: in rhoai patch files, there are two naming patterns for dscinitialization crd patches: files with `....
Learnt from: kahowell
PR: opendatahub-io/opendatahub-operator#2220
File: config/crd.rhoai/patches/cainjection_in_dscinitialization.opendatahub.io_dscinitializations.yaml:5-7
Timestamp: 2025-07-29T18:43:10.232Z
Learning: In RHOAI patch files, there are two naming patterns for DSCInitialization CRD patches: files with `.opendatahub.io` in their filename use the duplicated pattern `dscinitializations.dscinitialization.opendatahub.io.opendatahub.io` in the CRD name, while files without the group in filename use the standard pattern `dscinitializations.dscinitialization.opendatahub.io`. This duplication appears to be intentional and exists in both regular and RHOAI branches.
Applied to files:
internal/controller/components/workbenches/monitoring/workbenches-prometheusrules.tmpl.yaml
🪛 YAMLlint (1.37.1)
internal/controller/components/codeflare/monitoring/codeflare-prometheusrules.tmpl.yaml
[warning] 8-8: wrong indentation: expected 4 but found 6
(indentation)
[warning] 10-10: wrong indentation: expected 10 but found 8
(indentation)
[error] 68-68: trailing spaces
(trailing-spaces)
[warning] 72-72: wrong indentation: expected 10 but found 8
(indentation)
[warning] 110-110: wrong indentation: expected 10 but found 8
(indentation)
internal/controller/components/feastoperator/monitoring/feastoperator-prometheusrules.tmpl.yaml
[warning] 8-8: wrong indentation: expected 4 but found 6
(indentation)
[error] 22-22: trailing spaces
(trailing-spaces)
[error] 35-35: trailing spaces
(trailing-spaces)
[warning] 52-52: wrong indentation: expected 10 but found 8
(indentation)
[error] 57-57: trailing spaces
(trailing-spaces)
[error] 63-63: trailing spaces
(trailing-spaces)
[error] 69-69: trailing spaces
(trailing-spaces)
[error] 75-75: trailing spaces
(trailing-spaces)
[error] 81-81: trailing spaces
(trailing-spaces)
[error] 87-87: trailing spaces
(trailing-spaces)
[error] 93-93: trailing spaces
(trailing-spaces)
internal/controller/components/llamastackoperator/monitoring/llamastackoperator-prometheusrules.tmpl.yaml
[warning] 8-8: wrong indentation: expected 4 but found 6
(indentation)
[error] 22-22: trailing spaces
(trailing-spaces)
[error] 35-35: trailing spaces
(trailing-spaces)
[error] 48-48: trailing spaces
(trailing-spaces)
[error] 49-49: trailing spaces
(trailing-spaces)
[warning] 52-52: wrong indentation: expected 10 but found 8
(indentation)
[error] 57-57: trailing spaces
(trailing-spaces)
[error] 63-63: trailing spaces
(trailing-spaces)
[error] 69-69: trailing spaces
(trailing-spaces)
[error] 75-75: trailing spaces
(trailing-spaces)
[error] 81-81: trailing spaces
(trailing-spaces)
[error] 87-87: trailing spaces
(trailing-spaces)
[error] 93-93: trailing spaces
(trailing-spaces)
internal/controller/components/workbenches/monitoring/workbenches-prometheusrules.tmpl.yaml
[warning] 8-8: wrong indentation: expected 4 but found 6
(indentation)
[warning] 10-10: wrong indentation: expected 10 but found 8
(indentation)
[warning] 31-31: wrong indentation: expected 10 but found 8
(indentation)
[error] 50-50: trailing spaces
(trailing-spaces)
[warning] 53-53: wrong indentation: expected 10 but found 8
(indentation)
[warning] 105-105: wrong indentation: expected 10 but found 8
(indentation)
[warning] 143-143: wrong indentation: expected 10 but found 8
(indentation)
[error] 171-171: trailing spaces
(trailing-spaces)
[warning] 174-174: wrong indentation: expected 10 but found 8
(indentation)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Run tests and collect coverage on internal and pkg
- GitHub Check: Run tests and collect coverage on tests/integration
- GitHub Check: golangci-lint
🔇 Additional comments (11)
internal/controller/components/codeflare/monitoring/codeflare-prometheusrules.tmpl.yaml (2)
Lines 14-17: Verify the burn rate threshold calculations. The burn rate thresholds use `(14.40 * (1-0.99950))`, which equals `14.40 * 0.0005 = 0.0072`. This represents a very strict SLO (99.95% success rate) with high sensitivity. Ensure this aligns with your actual service level objectives.
Lines 72-106: LGTM on recording rules structure. The recording rules correctly calculate burn rates using the `1 - avg_over_time(probe_success{...}[duration])` pattern across multiple time windows. The labeling with `instance: mcad-controller` is consistent.

internal/controller/components/feastoperator/monitoring/feastoperator-prometheusrules.tmpl.yaml (1)
Line 15: Different SLO threshold compared to other components. This component uses a 98% success rate threshold `(1-0.98000)` while CodeFlare uses 99.95%. Ensure this difference is intentional and aligns with the service requirements for Feast Operator.

Also applies to: 17-17
internal/controller/components/llamastackoperator/monitoring/llamastackoperator-prometheusrules.tmpl.yaml (2)
Line 15: Consistent SLO threshold with Feast Operator. Uses the same 98% success rate threshold as Feast Operator, which is good for consistency across similar operator components.

Also applies to: 17-17

Lines 52-56: Probe success initialization follows the same pattern. Uses the same `absent(up{job="llama-stack-k8s-operator"}) * 0 or vector(1)` pattern as Feast Operator for consistent behavior across components.

internal/controller/services/monitoring/monitoring_controller_actions.go (4)
Lines 311-324: Well-structured function initialization and early validation. The function correctly validates the instance type and checks for alerting configuration, with appropriate condition marking when alerting is not configured.

Lines 326-348: Proper CRD existence check and template addition. Good practice to check for PrometheusRule CRD availability before proceeding. The operator-level rules are correctly added regardless of component states.

Lines 359-392: Excellent error collection strategy for component processing. The approach of collecting errors for each component while continuing to process others is robust. This prevents one failing component from blocking alerting setup for all others.

Lines 400-420: Comprehensive error reporting and handling. The function logs individual component errors while still returning an aggregate error if any issues occurred. This provides good visibility into which specific components had problems.
internal/controller/components/datasciencepipelines/monitoring/datasciencepipelines-prometheusrules.tmpl.yaml (2)
Lines 135-172: Recording rules structure is well-implemented. The probe success burn rate recording rules correctly calculate burn rates across multiple time windows with proper labeling and consistent instance naming.
Lines 174-224: HTTP response burn rate calculations look correct. The HAProxy backend HTTP response burn rate recording rules properly calculate 5xx error rates across different time windows using rate functions and division for percentage calculation.
do we need it for ODH 2.33.0 release?

It would be good to have this in 2.33 so we have all the foundational pieces of observability in, but I don't think it's a blocker if not.

/test opendatahub-operator-e2e

what i am thinking is, if we need it for 2.23.0 RHOAI we will need it for 2.33.0 ODH, then we need to focus on getting this PR into main, and we can leave the sync to rhoai for the next 2 weeks.

The target for this is 2.24 RHOAI.

/test opendatahub-operator-e2e
Force-pushed from ec962a0 to 2376986.
/test opendatahub-operator-e2e
Force-pushed from a9907bb to c041af2.
Actionable comments posted: 8
♻️ Duplicate comments (13)
internal/controller/components/dashboard/monitoring/dashboard-prometheusrules.tmpl.yaml (2)
Lines 10-57: Rename alerts per window and template the namespace label in alert labels. All four alerts in this group share the same name and hardcode the namespace label. Unique names help triage and routing; templating the namespace improves portability.

Apply the following changes:

```diff
-      - alert: RHODS Dashboard Route Error Burn Rate
+      - alert: RHODS Dashboard Route Error Burn Rate (5m/1h)
         annotations:
           message: 'High error budget burn for {{`{{`}}$labels.route{{`}}`}} (current value: {{`{{`}}$value{{`}}`}} ).'
           summary: RHODS Dashboard Route Error Burn Rate
         expr: |
-          sum(haproxy_backend_http_responses_total:burnrate5m{route=~"rhods-dashboard"}) by (route) > (14.40 * (1-0.99950))
+          sum(haproxy_backend_http_responses_total:burnrate5m{route="rhods-dashboard"}) by (route) > (14.40 * (1-0.99950))
           and
-          sum(haproxy_backend_http_responses_total:burnrate1h{route=~"rhods-dashboard"}) by (route) > (14.40 * (1-0.99950))
+          sum(haproxy_backend_http_responses_total:burnrate1h{route="rhods-dashboard"}) by (route) > (14.40 * (1-0.99950))
         for: 2m
         labels:
           severity: critical
-          namespace: redhat-ods-applications
+          namespace: {{.Namespace}}
-      - alert: RHODS Dashboard Route Error Burn Rate
+      - alert: RHODS Dashboard Route Error Burn Rate (30m/6h)
         annotations:
           message: 'High error budget burn for {{`{{`}}$labels.route{{`}}`}} (current value: {{`{{`}}$value{{`}}`}} ).'
           summary: RHODS Dashboard Route Error Burn Rate
         expr: |
-          sum(haproxy_backend_http_responses_total:burnrate30m{route=~"rhods-dashboard"}) by (route) > (6.00 * (1-0.99950))
+          sum(haproxy_backend_http_responses_total:burnrate30m{route="rhods-dashboard"}) by (route) > (6.00 * (1-0.99950))
           and
-          sum(haproxy_backend_http_responses_total:burnrate6h{route=~"rhods-dashboard"}) by (route) > (6.00 * (1-0.99950))
+          sum(haproxy_backend_http_responses_total:burnrate6h{route="rhods-dashboard"}) by (route) > (6.00 * (1-0.99950))
         for: 15m
         labels:
           severity: critical
-          namespace: redhat-ods-applications
+          namespace: {{.Namespace}}
-      - alert: RHODS Dashboard Route Error Burn Rate
+      - alert: RHODS Dashboard Route Error Burn Rate (2h/1d)
         annotations:
           message: 'High error budget burn for {{`{{`}}$labels.route{{`}}`}} (current value: {{`{{`}}$value{{`}}`}} ).'
           summary: RHODS Dashboard Route Error Burn Rate
         expr: |
-          sum(haproxy_backend_http_responses_total:burnrate2h{route=~"rhods-dashboard"}) by (route) > (3.00 * (1-0.99950))
+          sum(haproxy_backend_http_responses_total:burnrate2h{route="rhods-dashboard"}) by (route) > (3.00 * (1-0.99950))
           and
-          sum(haproxy_backend_http_responses_total:burnrate1d{route=~"rhods-dashboard"}) by (route) > (3.00 * (1-0.99950))
+          sum(haproxy_backend_http_responses_total:burnrate1d{route="rhods-dashboard"}) by (route) > (3.00 * (1-0.99950))
         for: 1h
         labels:
           severity: warning
-          namespace: redhat-ods-applications
+          namespace: {{.Namespace}}
-      - alert: RHODS Dashboard Route Error Burn Rate
+      - alert: RHODS Dashboard Route Error Burn Rate (6h/3d)
         annotations:
           message: 'High error budget burn for {{`{{`}}$labels.route{{`}}`}} (current value: {{`{{`}}$value{{`}}`}} ).'
           summary: RHODS Dashboard Route Error Burn Rate
         expr: |
-          sum(haproxy_backend_http_responses_total:burnrate6h{route=~"rhods-dashboard"}) by (route) > (1.00 * (1-0.99950))
+          sum(haproxy_backend_http_responses_total:burnrate6h{route="rhods-dashboard"}) by (route) > (1.00 * (1-0.99950))
           and
-          sum(haproxy_backend_http_responses_total:burnrate3d{route=~"rhods-dashboard"}) by (route) > (1.00 * (1-0.99950))
+          sum(haproxy_backend_http_responses_total:burnrate3d{route="rhods-dashboard"}) by (route) > (1.00 * (1-0.99950))
         for: 3h
         labels:
           severity: warning
-          namespace: redhat-ods-applications
+          namespace: {{.Namespace}}
```
Lines 58-107: Apply the same naming and namespace-label fixes to the probe alerts. Make probe alert names unique per window and template the namespace label. Also prefer exact-match selectors to avoid regex overhead when the value is static.

```diff
-      - alert: RHODS Dashboard Probe Success Burn Rate
+      - alert: RHODS Dashboard Probe Success Burn Rate (5m/1h)
         annotations:
           message: 'High error budget burn for {{`{{`}}$labels.name{{`}}`}} (current value: {{`{{`}}$value{{`}}`}} ).'
           summary: RHODS Dashboard Probe Success Burn Rate
         expr: |
-          sum(probe_success:burnrate5m{name=~"rhods-dashboard"}) by (name) > (14.40 * (1-0.98))
+          sum(probe_success:burnrate5m{name="rhods-dashboard"}) by (name) > (14.40 * (1-0.98))
           and
-          sum(probe_success:burnrate1h{name=~"rhods-dashboard"}) by (name) > (14.40 * (1-0.98))
+          sum(probe_success:burnrate1h{name="rhods-dashboard"}) by (name) > (14.40 * (1-0.98))
         for: 2m
         labels:
           severity: critical
-          namespace: redhat-ods-applications
+          namespace: {{.Namespace}}
-      - alert: RHODS Dashboard Probe Success Burn Rate
+      - alert: RHODS Dashboard Probe Success Burn Rate (30m/6h)
         annotations:
           message: 'High error budget burn for {{`{{`}}$labels.name{{`}}`}} (current value: {{`{{`}}$value{{`}}`}} ).'
           summary: RHODS Dashboard Probe Success Burn Rate
         expr: |
-          sum(probe_success:burnrate30m{name=~"rhods-dashboard"}) by (name) > (6.00 * (1-0.98))
+          sum(probe_success:burnrate30m{name="rhods-dashboard"}) by (name) > (6.00 * (1-0.98))
           and
-          sum(probe_success:burnrate6h{name=~"rhods-dashboard"}) by (name) > (6.00 * (1-0.98))
+          sum(probe_success:burnrate6h{name="rhods-dashboard"}) by (name) > (6.00 * (1-0.98))
         for: 15m
         labels:
           severity: critical
-          namespace: redhat-ods-applications
+          namespace: {{.Namespace}}
-      - alert: RHODS Dashboard Probe Success Burn Rate
+      - alert: RHODS Dashboard Probe Success Burn Rate (2h/1d)
         annotations:
           message: 'High error budget burn for {{`{{`}}$labels.name{{`}}`}} (current value: {{`{{`}}$value{{`}}`}} ).'
           summary: RHODS Dashboard Probe Success Burn Rate
         expr: |
-          sum(probe_success:burnrate2h{name=~"rhods-dashboard"}) by (name) > (3.00 * (1-0.98))
+          sum(probe_success:burnrate2h{name="rhods-dashboard"}) by (name) > (3.00 * (1-0.98))
           and
-          sum(probe_success:burnrate1d{name=~"rhods-dashboard"}) by (name) > (3.00 * (1-0.98))
+          sum(probe_success:burnrate1d{name="rhods-dashboard"}) by (name) > (3.00 * (1-0.98))
         for: 1h
         labels:
           severity: warning
-          namespace: redhat-ods-applications
+          namespace: {{.Namespace}}
-      - alert: RHODS Dashboard Probe Success Burn Rate
+      - alert: RHODS Dashboard Probe Success Burn Rate (6h/3d)
         annotations:
           message: 'High error budget burn for {{`{{`}}$labels.name{{`}}`}} (current value: {{`{{`}}$value{{`}}`}} ).'
           summary: RHODS Dashboard Probe Success Burn Rate
         expr: |
-          sum(probe_success:burnrate6h{name=~"rhods-dashboard"}) by (name) > (1.00 * (1-0.98))
+          sum(probe_success:burnrate6h{name="rhods-dashboard"}) by (name) > (1.00 * (1-0.98))
           and
-          sum(probe_success:burnrate3d{name=~"rhods-dashboard"}) by (name) > (1.00 * (1-0.98))
+          sum(probe_success:burnrate3d{name="rhods-dashboard"}) by (name) > (1.00 * (1-0.98))
         for: 3h
         labels:
           severity: warning
-          namespace: redhat-ods-applications
+          namespace: {{.Namespace}}
```

internal/controller/components/trustyai/monitoring/trustyai-prometheusrules.tmpl.yaml (1)
10-46
: Duplicate alert names within a group make the alerts indistinguishable
Give each alert in the group a distinct name. Prometheus will load same-name rules, but alerts that also share a label set collide in Alertmanager and are flagged as duplicates by promtool, so differentiate them per window.
-      - alert: TrustyAI Controller Probe Success Burn Rate
+      - alert: TrustyAI Controller Probe Success Burn Rate 5m_1h
 ...
-      - alert: TrustyAI Controller Probe Success Burn Rate
+      - alert: TrustyAI Controller Probe Success Burn Rate 30m_6h
 ...
-      - alert: TrustyAI Controller Probe Success Burn Rate
+      - alert: TrustyAI Controller Probe Success Burn Rate 2h_1d
internal/controller/components/workbenches/monitoring/workbenches-prometheusrules.tmpl.yaml (2)
10-24
: Unify PVC regex patterns and simplify the expression
The used and capacity selectors use different regexes, which risks mismatched series. Use identical patterns on both sides, and parenthesize each ratio so the expression reads clearly.
-        expr: kubelet_volume_stats_used_bytes{persistentvolumeclaim=~".*jupyterhub-nb-.*"} / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"jupyterhub-nb-.*"} > 0.9 and kubelet_volume_stats_used_bytes{persistentvolumeclaim=~".*jupyterhub-nb-.*"} / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"jupyterhub-nb-.*"} < 0.99
+        expr: (kubelet_volume_stats_used_bytes{persistentvolumeclaim=~"jupyterhub-nb-.*"} / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"jupyterhub-nb-.*"}) > 0.9
+          and
+          (kubelet_volume_stats_used_bytes{persistentvolumeclaim=~"jupyterhub-nb-.*"} / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"jupyterhub-nb-.*"}) < 0.99
53-100
: Duplicate alert names within the group make the alerts hard to tell apart
All four alerts in SLOs-probe_success_workbench share one name. Prometheus loads them, but same-name alerts that also share a label set collide downstream; make the names unique (suffix with the window pair) or split them into separate groups.
-      - alert: RHODS Jupyter Probe Success Burn Rate
+      - alert: RHODS Jupyter Probe Success Burn Rate 5m_1h
 ...
-      - alert: RHODS Jupyter Probe Success Burn Rate
+      - alert: RHODS Jupyter Probe Success Burn Rate 30m_6h
 ...
-      - alert: RHODS Jupyter Probe Success Burn Rate
+      - alert: RHODS Jupyter Probe Success Burn Rate 2h_1d
 ...
-      - alert: RHODS Jupyter Probe Success Burn Rate
+      - alert: RHODS Jupyter Probe Success Burn Rate 6h_3d
internal/controller/components/llamastackoperator/monitoring/llamastackoperator-prometheusrules.tmpl.yaml (2)
8-8
: Fix list indentation under groups
YAMLlint flags wrong indentation: the list item under spec.groups should be indented consistently (2 spaces under groups).
Apply this diff:
-      - name: SLOs-probe_success_llama_stack_k8s_operator
+    - name: SLOs-probe_success_llama_stack_k8s_operator
52-52
: Correct indentation for the first recording rule entry
YAMLlint expects 10 spaces before the list entry under rules, not 8.
Apply this diff:
-        - expr: |
+          - expr: |
internal/controller/components/feastoperator/monitoring/feastoperator-prometheusrules.tmpl.yaml (2)
52-52
: Correct indentation for the first recording rule entry
YAMLlint expects 10 spaces before the list entry under rules, not 8.
Apply this diff:
-        - expr: |
+          - expr: |
8-8
: Fix list indentation under groups
YAMLlint flags wrong indentation: the list item under spec.groups should be indented consistently (2 spaces under groups).
Apply this diff:
-      - name: SLOs-probe_success_feast_operator
+    - name: SLOs-probe_success_feast_operator
internal/controller/components/codeflare/monitoring/codeflare-prometheusrules.tmpl.yaml (4)
51-61
: Standardize the matching; the job selector is brittle
PromQL accepts single-, double-, and backtick-quoted strings, so the quoting itself is legal, but regex-matching a space-containing job name is brittle and likely wrong for up. Prefer matching the operator pod in the target namespace rather than relying on scrape job names.
Proposed, more robust expressions:
-        expr: absent(up{job=~'CodeFlare Operator'}) or up{job=~'CodeFlare Operator'} != 1
+        expr: |
+          absent(up{namespace="{{.Namespace}}", pod=~"codeflare-operator-.*"})
+          or
+          up{namespace="{{.Namespace}}", pod=~"codeflare-operator-.*"} != 1
 ...
-        expr: absent_over_time(up{job="CodeFlare Operator"}[2m]) == 1
+        expr: absent_over_time(up{namespace="{{.Namespace}}", pod=~"codeflare-operator-.*"}[2m]) == 1
If you do need to match on job, confirm the actual job label value first and use a sensible selector, e.g. up{job=~"codeflare-operator.*"} rather than up{job=~'CodeFlare Operator'}.
70-76
: Normalize indentation for rules entries (YAMLlint: expected 10 spaces)
Child list items under rules should be indented two spaces deeper than the rules: key.
       - name: SLOs - MCAD Controller
-        rules:
-        - expr: |
+        rules:
+          - expr: |
             1 - avg_over_time(probe_success{instance=~"mcad-controller-.*", job="user_facing_endpoints_status_codeflare"}[1d])
108-114
: Normalize indentation for rules entries in “SLOs - CodeFlare Operator”
Same indentation problem as above.
       - name: SLOs - CodeFlare Operator
-        rules:
-        - expr: |
+        rules:
+          - expr: |
             1 - avg_over_time(probe_success{instance=~"codeflare-operator-.*", job="user_facing_endpoints_status_codeflare"}[1d])
7-15
: Fix invalid YAML indentation under spec.groups (will fail to parse)
The list item under groups and its nested rules/items are over-indented. Align them with 2-space steps.
Apply:
 spec:
   groups:
-      - name: SLOs-probe_success_codeflare
-        rules:
-        - alert: CodeFlare Operator Probe Success Burn Rate
+    - name: SLOs-probe_success_codeflare
+      rules:
+        - alert: CodeFlare Operator Probe Success Burn Rate
🧹 Nitpick comments (14)
internal/controller/components/dashboard/monitoring/dashboard-prometheusrules.tmpl.yaml (2)
8-112
: Optional: place recording rules before alerts to reduce cold-start gaps
Prometheus doesn’t guarantee cross-group evaluation order, so alerts that depend on recording rules in later groups can see transient “empty” results on first runs and reloads.
Consider grouping recordings ahead of alerts, or combine dependent alerts and recordings into the same group (recordings first, then alerts) to ensure data availability within the same evaluation cycle.
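A minimal sketch of the combined-group shape, assuming the existing rule names (the group name and the single-window pairing are illustrative): within one group, rules evaluate top to bottom, so the recording is populated before the alert that reads it in the same cycle.

```yaml
groups:
  - name: SLOs-rhods-dashboard   # illustrative group name
    rules:
      # recording first, so it exists when the alert below is evaluated
      - record: probe_success:burnrate5m
        expr: 1 - avg_over_time(probe_success{name="rhods-dashboard"}[5m])
      - alert: RHODS Dashboard Probe Success Burn Rate (5m/1h)
        expr: probe_success:burnrate5m{name="rhods-dashboard"} > (14.40 * (1 - 0.98))
        for: 2m
        labels:
          severity: critical
```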
15-17
: Prefer exact-match selectors instead of regex when static
Use route="rhods-dashboard" and name="rhods-dashboard" instead of regex matches to reduce query overhead and noise.
You already see this reflected in the diffs above; apply consistently across all occurrences.
Also applies to: 27-29, 39-41, 51-53, 65-67, 77-79, 89-91, 101-103
pkg/common/common.go (1)
23-27
: Remove the now-unnecessary embed import after generalizing FileExists
If you adopt the fs.FS interface in FileExists, the embed import becomes unused and should be removed.
-import (
-	"crypto/sha256"
-	"embed"
+import (
+	"crypto/sha256"
 	b64 "encoding/base64"
 	"fmt"
 	"io/fs"
 	"os"
 	"regexp"
internal/controller/components/kueue/monitoring/kueue-prometheusrules.tmpl.yaml (2)
11-17
: Use an exact-match job selector
PromQL regex matchers are implicitly anchored, so =~'Kueue Operator' already matches only the full value; the regex form just adds overhead. Since the value is static, use an equality matcher (and optionally scope to the target namespace).
-        expr: absent(up{job=~'Kueue Operator'}) or up{job=~'Kueue Operator'} != 1
+        expr: absent(up{job="Kueue Operator"}) or up{job="Kueue Operator"} != 1
Follow-up:
- Verify the actual job label value produced by the ServiceMonitor/PodMonitor for Kueue. If it differs, adjust the selector accordingly (e.g., service, namespace, or pod labels). I can help update the rule once the canonical labels are confirmed.
3-5
: Consider aligning the resource name with other component templates
Other components use the “-prometheusrules” pattern (e.g., trustyai-prometheusrules, workbenches-prometheusrules). For consistency and discoverability, consider renaming the resource to kueue-prometheusrules.
internal/controller/components/trustyai/monitoring/trustyai-prometheusrules.tmpl.yaml (1)
10-46
: Optional: align thresholds/labels with a runbook URL and consistent summaries
Consider adding runbook_url and making summaries concise (“TrustyAI Controller probe SLO burn rate high (5m/1h)”) for better triage.
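For example, a minimal sketch of the suggested annotations (the runbook URL is a placeholder, and the alert name assumes the per-window suffix proposed above):

```yaml
- alert: TrustyAI Controller Probe Success Burn Rate 5m_1h
  annotations:
    summary: TrustyAI Controller probe SLO burn rate high (5m/1h)
    runbook_url: https://example.com/runbooks/trustyai-probe-burn-rate  # placeholder URL
```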
internal/controller/components/workbenches/monitoring/workbenches-prometheusrules.tmpl.yaml (2)
8-8
: YAML lint issues: indentation and trailing spaces
There are multiple indentation warnings and trailing spaces flagged by YAMLlint. Please fix them to avoid CI/tooling noise and potential parsing issues.
- Wrong indentation at lines: 8, 10, 31, 53, 105, 143, 174 (expected 4/10 spaces per linter hints).
- Trailing spaces at lines: 50, 171.
Also applies to: 10-10, 31-31, 50-50, 53-53, 105-105, 143-143, 171-171, 174-174
35-48
: Use exact-match controller job selectors
These job values are static, and PromQL regex matchers are implicitly anchored, so the regex form adds nothing. Prefer equality matchers, and optionally scope by namespace to reduce false positives from similarly named jobs.
-        expr: absent(up{job=~'Kubeflow Notebook Controller Service Metrics'})
+        expr: absent(up{job="Kubeflow Notebook Controller Service Metrics"})
 ...
-        expr: absent(up{job=~'ODH Notebook Controller Service Metrics'})
+        expr: absent(up{job="ODH Notebook Controller Service Metrics"})
Follow-up:
- Verify the actual job label values produced by the ServiceMonitors for these controllers. I can adjust selectors once the canonical labels are confirmed.
internal/controller/components/feastoperator/monitoring/feastoperator-prometheusrules.tmpl.yaml (1)
10-21
: Severity consistency across components
Two shorter-window alerts are severity: critical here, while similar templates (e.g., llama stack) use warning. If this disparity is intentional, ignore it; otherwise, consider aligning severities across components.
Do you want me to standardize severities across all component PrometheusRule templates?
Also applies to: 23-34, 36-47
internal/controller/components/codeflare/monitoring/codeflare-prometheusrules.tmpl.yaml (2)
68-68
: Remove trailing whitespace on the blank line
YAMLlint flags trailing spaces here; delete them so the line is truly empty.
47-67
: Optional: add a component label to aid Alertmanager routing
Consider adding labels:
- component: codeflare
- product: rhods or odh (as applicable)
This helps downstream routing/ownership without parsing alert names.
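A minimal sketch of what such a label block could look like (the label values are illustrative, not a fixed convention from this PR):

```yaml
labels:
  severity: warning
  component: codeflare        # static ownership/routing label
  product: odh                # or rhods, per distribution
  namespace: {{.Namespace}}
```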
internal/controller/components/modelmeshserving/monitoring/modelmeshservice-prometheusrules.tmpl.yaml (3)
15-18
: Simplify expressions: remove the unnecessary sum by (instance) and use equality matchers
Recording rules already produce a single series per instance, so sum(...) by (instance) is redundant. An equality matcher is also sufficient and marginally cheaper than regex.
- sum(probe_success:burnrate5m{instance=~"modelmesh-controller"}) by (instance) > (14.40 * (1-0.98000)) + probe_success:burnrate5m{instance="modelmesh-controller"} > (14.40 * (1-0.98000)) and - sum(probe_success:burnrate1h{instance=~"modelmesh-controller"}) by (instance) > (14.40 * (1-0.98000)) + probe_success:burnrate1h{instance="modelmesh-controller"} > (14.40 * (1-0.98000))- sum(probe_success:burnrate30m{instance=~"modelmesh-controller"}) by (instance) > (6.00 * (1-0.98000)) + probe_success:burnrate30m{instance="modelmesh-controller"} > (6.00 * (1-0.98000)) and - sum(probe_success:burnrate6h{instance=~"modelmesh-controller"}) by (instance) > (6.00 * (1-0.98000)) + probe_success:burnrate6h{instance="modelmesh-controller"} > (6.00 * (1-0.98000))- sum(probe_success:burnrate2h{instance=~"modelmesh-controller"}) by (instance) > (3.00 * (1-0.98000)) + probe_success:burnrate2h{instance="modelmesh-controller"} > (3.00 * (1-0.98000)) and - sum(probe_success:burnrate1d{instance=~"modelmesh-controller"}) by (instance) > (3.00 * (1-0.98000)) + probe_success:burnrate1d{instance="modelmesh-controller"} > (3.00 * (1-0.98000))Also applies to: 27-30, 39-42
10-14
: Differentiate the three alerts with a label or unique name for easier triageAll three alerts share the same name and differ only by expr and severity. Add a label (e.g., window=short|medium|long) or suffix the alert name to simplify routing, silencing, and dashboards.
Example:
-      - alert: Modelmesh Controller Probe Success Burn Rate
+      - alert: Modelmesh Controller Probe Success Burn Rate (short)
+        labels:
+          window: short
 ...
-      - alert: Modelmesh Controller Probe Success Burn Rate
+      - alert: Modelmesh Controller Probe Success Burn Rate (medium)
+        labels:
+          window: medium
 ...
-      - alert: Modelmesh Controller Probe Success Burn Rate
+      - alert: Modelmesh Controller Probe Success Burn Rate (long)
+        labels:
+          window: long
47-90
: Sanity-check the base signal used for SLOs
You’re deriving SLOs from “up” via a synthetic probe_success. If modelmesh exposes a dedicated readiness/health metric, consider using it for better signal quality (up only measures scrape reachability).
If a better health metric exists (e.g., controller reconciliation errors, queue depth, or a health gauge), I can help propose an alternative set of recording rules and thresholds.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (34)
api/services/v1alpha1/monitoring_types.go (3 hunks)
api/services/v1alpha1/zz_generated.deepcopy.go (2 hunks)
bundle/manifests/dscinitialization.opendatahub.io_dscinitializations.yaml (2 hunks)
bundle/manifests/opendatahub-operator.clusterserviceversion.yaml (3 hunks)
bundle/manifests/services.platform.opendatahub.io_monitorings.yaml (2 hunks)
config/crd/bases/dscinitialization.opendatahub.io_dscinitializations.yaml (2 hunks)
config/crd/bases/services.platform.opendatahub.io_monitorings.yaml (2 hunks)
config/rbac/role.yaml (2 hunks)
docs/api-overview.md (4 hunks)
internal/controller/components/codeflare/monitoring/codeflare-prometheusrules.tmpl.yaml (1 hunks)
internal/controller/components/components.go (1 hunks)
internal/controller/components/dashboard/monitoring/dashboard-prometheusrules.tmpl.yaml (1 hunks)
internal/controller/components/datasciencepipelines/monitoring/datasciencepipelines-prometheusrules.tmpl.yaml (1 hunks)
internal/controller/components/feastoperator/monitoring/feastoperator-prometheusrules.tmpl.yaml (1 hunks)
internal/controller/components/kserve/monitoring/kserve-prometheusrules.tmpl.yaml (1 hunks)
internal/controller/components/kueue/monitoring/kueue-prometheusrules.tmpl.yaml (1 hunks)
internal/controller/components/llamastackoperator/monitoring/llamastackoperator-prometheusrules.tmpl.yaml (1 hunks)
internal/controller/components/modelcontroller/monitoring/modelcontroller-prometheusrules.tmpl.yaml (1 hunks)
internal/controller/components/modelmeshserving/monitoring/modelmeshservice-prometheusrules.tmpl.yaml (1 hunks)
internal/controller/components/modelregistry/monitoring/modelregistry-prometheusrules.tmpl.yaml (1 hunks)
internal/controller/components/ray/monitoring/ray-prometheusrules.tmpl.yaml (1 hunks)
internal/controller/components/trainingoperator/monitoring/trainingoperator-prometheusrules.tmpl.yaml (1 hunks)
internal/controller/components/trustyai/monitoring/trustyai-prometheusrules.tmpl.yaml (1 hunks)
internal/controller/components/workbenches/monitoring/workbenches-prometheusrules.tmpl.yaml (1 hunks)
internal/controller/dscinitialization/dscinitialization_controller.go (1 hunks)
internal/controller/dscinitialization/kubebuilder_rbac.go (1 hunks)
internal/controller/services/monitoring/monitoring/operator-prometheusrules.tmpl.yaml (1 hunks)
internal/controller/services/monitoring/monitoring_controller.go (2 hunks)
internal/controller/services/monitoring/monitoring_controller_actions.go (3 hunks)
internal/controller/services/monitoring/monitoring_controller_support.go (2 hunks)
internal/controller/status/status.go (2 hunks)
pkg/cluster/gvk/gvk.go (1 hunks)
pkg/common/common.go (2 hunks)
tests/e2e/monitoring_test.go (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (21)
- internal/controller/dscinitialization/dscinitialization_controller.go
- internal/controller/components/components.go
- internal/controller/components/trainingoperator/monitoring/trainingoperator-prometheusrules.tmpl.yaml
- internal/controller/components/ray/monitoring/ray-prometheusrules.tmpl.yaml
- pkg/cluster/gvk/gvk.go
- internal/controller/services/monitoring/monitoring_controller_support.go
- config/crd/bases/services.platform.opendatahub.io_monitorings.yaml
- internal/controller/components/modelcontroller/monitoring/modelcontroller-prometheusrules.tmpl.yaml
- internal/controller/components/kserve/monitoring/kserve-prometheusrules.tmpl.yaml
- internal/controller/services/monitoring/monitoring/operator-prometheusrules.tmpl.yaml
- internal/controller/services/monitoring/monitoring_controller.go
- tests/e2e/monitoring_test.go
- api/services/v1alpha1/zz_generated.deepcopy.go
- internal/controller/dscinitialization/kubebuilder_rbac.go
- internal/controller/services/monitoring/monitoring_controller_actions.go
- api/services/v1alpha1/monitoring_types.go
- docs/api-overview.md
- config/rbac/role.yaml
- config/crd/bases/dscinitialization.opendatahub.io_dscinitializations.yaml
- internal/controller/components/modelregistry/monitoring/modelregistry-prometheusrules.tmpl.yaml
- internal/controller/components/datasciencepipelines/monitoring/datasciencepipelines-prometheusrules.tmpl.yaml
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-07-29T18:44:42.749Z
Learnt from: kahowell
PR: opendatahub-io/opendatahub-operator#2220
File: bundle.rhoai/manifests/rhods-operator.clusterserviceversion.yaml:1548-1550
Timestamp: 2025-07-29T18:44:42.749Z
Learning: In the opendatahub-operator repository's rhoai branch, the CSV manifest intentionally uses `quay.io/opendatahub/opendatahub-operator:latest` as the image reference even though it's for the rhods-operator, and this configuration difference from the main branch is by design.
Applied to files:
bundle/manifests/opendatahub-operator.clusterserviceversion.yaml
🪛 YAMLlint (1.37.1)
internal/controller/components/feastoperator/monitoring/feastoperator-prometheusrules.tmpl.yaml
[warning] 8-8: wrong indentation: expected 4 but found 6
(indentation)
[error] 22-22: trailing spaces
(trailing-spaces)
[error] 35-35: trailing spaces
(trailing-spaces)
[warning] 52-52: wrong indentation: expected 10 but found 8
(indentation)
[error] 57-57: trailing spaces
(trailing-spaces)
[error] 63-63: trailing spaces
(trailing-spaces)
[error] 69-69: trailing spaces
(trailing-spaces)
[error] 75-75: trailing spaces
(trailing-spaces)
[error] 81-81: trailing spaces
(trailing-spaces)
[error] 87-87: trailing spaces
(trailing-spaces)
[error] 93-93: trailing spaces
(trailing-spaces)
internal/controller/components/codeflare/monitoring/codeflare-prometheusrules.tmpl.yaml
[warning] 8-8: wrong indentation: expected 4 but found 6
(indentation)
[warning] 10-10: wrong indentation: expected 10 but found 8
(indentation)
[error] 68-68: trailing spaces
(trailing-spaces)
[warning] 72-72: wrong indentation: expected 10 but found 8
(indentation)
[warning] 110-110: wrong indentation: expected 10 but found 8
(indentation)
internal/controller/components/llamastackoperator/monitoring/llamastackoperator-prometheusrules.tmpl.yaml
[warning] 8-8: wrong indentation: expected 4 but found 6
(indentation)
[error] 22-22: trailing spaces
(trailing-spaces)
[error] 35-35: trailing spaces
(trailing-spaces)
[error] 48-48: trailing spaces
(trailing-spaces)
[error] 49-49: trailing spaces
(trailing-spaces)
[warning] 52-52: wrong indentation: expected 10 but found 8
(indentation)
[error] 57-57: trailing spaces
(trailing-spaces)
[error] 63-63: trailing spaces
(trailing-spaces)
[error] 69-69: trailing spaces
(trailing-spaces)
[error] 75-75: trailing spaces
(trailing-spaces)
[error] 81-81: trailing spaces
(trailing-spaces)
[error] 87-87: trailing spaces
(trailing-spaces)
[error] 93-93: trailing spaces
(trailing-spaces)
internal/controller/components/workbenches/monitoring/workbenches-prometheusrules.tmpl.yaml
[warning] 8-8: wrong indentation: expected 4 but found 6
(indentation)
[warning] 10-10: wrong indentation: expected 10 but found 8
(indentation)
[warning] 31-31: wrong indentation: expected 10 but found 8
(indentation)
[error] 50-50: trailing spaces
(trailing-spaces)
[warning] 53-53: wrong indentation: expected 10 but found 8
(indentation)
[warning] 105-105: wrong indentation: expected 10 but found 8
(indentation)
[warning] 143-143: wrong indentation: expected 10 but found 8
(indentation)
[error] 171-171: trailing spaces
(trailing-spaces)
[warning] 174-174: wrong indentation: expected 10 but found 8
(indentation)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Build/push catalog image
- GitHub Check: Run tests and collect coverage on tests/integration
- GitHub Check: golangci-lint
- GitHub Check: Run tests and collect coverage on internal and pkg
🔇 Additional comments (11)
internal/controller/components/dashboard/monitoring/dashboard-prometheusrules.tmpl.yaml (1)
1-2
: Confirm PrometheusRule API group monitoring.rhobs/v1 is supported in target clusters
I’ve verified the repository references and RBAC rules for both API groups:
• 14 PrometheusRule templates under internal/controller/.../monitoring/*-prometheusrules.tmpl.yaml use apiVersion: monitoring.rhobs/v1
• Base config and some ServiceMonitor/PodMonitor templates use apiVersion: monitoring.coreos.com/v1
• RBAC in config/rbac/role.yaml and the CSV manifest grants permissions for both monitoring.rhobs and monitoring.coreos.com
• No PrometheusRule CRDs are vendored—the cluster must provide the CRD for whichever group is in use
Please confirm that RHOAI 2.24/ODH 2.33 target clusters have the PrometheusRule CRD installed for monitoring.rhobs/v1, or adjust these templates to use monitoring.coreos.com/v1 if that’s the supported API in your environment.
internal/controller/status/status.go (2)
108-109
: LGTM: new ConditionAlertingAvailable
The new condition type aligns with existing condition naming and will help surface alerting availability clearly.
189-191
: LGTM: “AlertingNotConfigured” reason/message
Consistent with the Metrics/Traces constants; this will simplify status reporting in the monitoring controller.
bundle/manifests/services.platform.opendatahub.io_monitorings.yaml (2)
55-57
: CRD: Add alerting under spec — LGTM
The new alerting object is correctly added under spec with a description. This keeps the schema forward-compatible.
195-197
: CRD cross-field validation — LGTM
Validation ensuring alerting implies metrics is correct and well-scoped at the spec level.
bundle/manifests/dscinitialization.opendatahub.io_dscinitializations.yaml (2)
93-95
: DSCI CRD: Add monitoring.alerting — LGTM
The alerting object is properly introduced under monitoring with a clear description.
247-249
: DSCI CRD cross-field validation — LGTM
Validation mirrors the Monitoring CRD: alerting requires metrics. Consistent and correct.
bundle/manifests/opendatahub-operator.clusterserviceversion.yaml (2)
1018-1041
: RBAC: Add PrometheusRule permissions for monitoring.rhobs — LGTM
Create/update/list and finalizers/status for monitoring.rhobs PrometheusRule are correctly included, aligning with the new alerting flow.
1498-1498
: Confirm image tag strategy for CSV deployment
The manager image is set to quay.io/opendatahub/opendatahub-operator:latest. Please confirm this matches the release process for the ODH CSV on main (some branches intentionally use latest; just validating this is intended here).
Would you like me to switch this to a pinned version or a CI-replaced placeholder (e.g., REPLACE_IMAGE) if that’s the repository standard?
internal/controller/components/codeflare/monitoring/codeflare-prometheusrules.tmpl.yaml (1)
1-1
: Confirm CRD group apiVersion “monitoring.rhobs/v1” availability in target clusters
Many clusters expose PrometheusRule at monitoring.coreos.com/v1. If monitoring.rhobs/v1 is not registered, these resources won’t reconcile.
Would you like me to scan the repo/manifests for the installed PrometheusRule GVKs and generate a compatibility matrix for ODH 2.33 and RHOAI 2.24?
internal/controller/components/modelmeshserving/monitoring/modelmeshservice-prometheusrules.tmpl.yaml (1)
1-2
: PrometheusRule API group monitoring.rhobs/v1 is correct and consistent
All PrometheusRule templates and GVK registrations in this repo use monitoring.rhobs/v1:
- pkg/cluster/gvk/gvk.go (lines 480–483) defines PrometheusRule = schema.GroupVersionKind{Group: "monitoring.rhobs", Version: "v1", Kind: "PrometheusRule"}
- Only monitoring.rhobs/v1 appears in *.tmpl.yaml for PrometheusRule
- A runtime guard via cluster.HasCRD(ctx, rr.Client, gvk.PrometheusRule) ensures these resources are only created when the CRD is installed
No change is required. If you need to support clusters exposing PrometheusRule under monitoring.coreos.com/v1, you’d have to update the GVK in pkg/cluster/gvk/gvk.go, the YAML templates, and any affected tests.
// FileExists checks if a file exists in the embedded filesystem.
func FileExists(fsys embed.FS, path string) bool {
	_, err := fs.Stat(fsys, path)
	return err == nil
}
🛠️ Refactor suggestion
Generalize FS type and avoid treating directories as files in FileExists
Accept fs.FS (interface) instead of embed.FS for broader applicability, and ensure directories don’t return true. This avoids false positives when a directory matches the path and eliminates tight coupling to embed.FS.
-func FileExists(fsys embed.FS, path string) bool {
- _, err := fs.Stat(fsys, path)
- return err == nil
-}
+func FileExists(fsys fs.FS, path string) bool {
+ info, err := fs.Stat(fsys, path)
+ if err != nil {
+ return false
+ }
+ return !info.IsDir()
+}
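For reference, a minimal caller sketch showing that the widened signature stays source-compatible — embed.FS satisfies fs.FS, so existing call sites compile unchanged. The embed pattern, file path, and import path below are assumptions for illustration:

```go
package monitoring

import (
	"embed"

	"github.com/opendatahub-io/opendatahub-operator/v2/pkg/common" // assumed module path
)

//go:embed monitoring/*.tmpl.yaml
var monitoringFS embed.FS // hypothetical embedded template directory

func hasComponentRules(component string) bool {
	// embed.FS implements fs.FS, so this call is unchanged after the refactor.
	return common.FileExists(monitoringFS, "monitoring/"+component+"-prometheusrules.tmpl.yaml")
}
```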
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
// FileExists checks if a file exists in the embedded filesystem.
func FileExists(fsys fs.FS, path string) bool {
	info, err := fs.Stat(fsys, path)
	if err != nil {
		return false
	}
	return !info.IsDir()
}
🤖 Prompt for AI Agents
In pkg/common/common.go around lines 124 to 128, change FileExists to accept the
fs.FS interface instead of embed.FS and ensure it returns false for directories:
call fs.Stat(fsys, path), if err != nil return false, otherwise check the
returned FileInfo and return !fi.IsDir(). Update any callers to pass an fs.FS
(embed.FS still implements fs.FS) and adjust imports if necessary.
/test opendatahub-operator-e2e
Actionable comments posted: 2
🧹 Nitpick comments (3)
tests/e2e/monitoring_test.go (1)
642-664
: Optional: assert rule ownership or basic content to increase signal
You currently assert only existence. Consider also asserting at least one group/label expected by your templates, or the ownerReferences to the Monitoring CR, to catch template regressions.
If desired, I can propose concrete jq matchers for your current templates.
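As a rough illustration of the kind of content check meant here — a hedged sketch using kubectl and jq outside the test framework (the namespace and resource name are illustrative):

```bash
# Fail unless the rendered rule has at least one group and is owned by the Monitoring CR.
ns=redhat-ods-applications   # substitute tc.AppsNamespace
kubectl get prometheusrule dashboard-prometheusrules -n "$ns" -o json \
  | jq -e '(.spec.groups | length) > 0
           and any(.metadata.ownerReferences[]?; .kind == "Monitoring")'
```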
bundle/manifests/opendatahub-operator.clusterserviceversion.yaml (2)
1015-1027
: RBAC for monitoring.rhobs/prometheusrules: consider adding deletecollection
You added create/read/update/delete for rhobs PrometheusRules. For parity with monitoring.coreos.com and to enable label-selective cleanup patterns, add deletecollection.
Apply this diff:
 - apiGroups:
   - monitoring.rhobs
   resources:
   - monitoringstacks
   - prometheusrules
   - servicemonitors
   verbs:
   - create
   - delete
+  - deletecollection
   - get
   - list
   - patch
   - update
   - watch
92-97
: Expose alerting in the ALM example for discoverability
Including "alerting": {} in the DSCInitialization example helps users discover the new feature via the OLM UI.
Apply this diff:
"monitoring": { "managementState": "Managed", - "metrics": {}, + "metrics": {}, + "alerting": {}, "namespace": "opendatahub" },
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these settings in your CodeRabbit configuration.
📒 Files selected for processing (4)
bundle/manifests/dscinitialization.opendatahub.io_dscinitializations.yaml (2 hunks)
bundle/manifests/opendatahub-operator.clusterserviceversion.yaml (2 hunks)
bundle/manifests/services.platform.opendatahub.io_monitorings.yaml (2 hunks)
tests/e2e/monitoring_test.go (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- bundle/manifests/services.platform.opendatahub.io_monitorings.yaml
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-13T21:04:32.696Z
Learnt from: kahowell
PR: opendatahub-io/opendatahub-operator#2220
File: .github/workflows/prometheus-unit-tests.yaml:5-5
Timestamp: 2025-08-13T21:04:32.696Z
Learning: In the opendatahub-operator repository, prometheus monitoring configurations are shared between ODH and RHOAI platforms. The prometheus directory only exists in odh-config/monitoring/prometheus/, not in rhoai-config. Both platforms reference the same prometheus manifests from the odh-config location.
Applied to files:
bundle/manifests/opendatahub-operator.clusterserviceversion.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Build/push catalog image
- GitHub Check: Run tests and collect coverage on internal and pkg
- GitHub Check: Run tests and collect coverage on tests/integration
🔇 Additional comments (3)
bundle/manifests/dscinitialization.opendatahub.io_dscinitializations.yaml (1)
93-96
: Adding monitoring.alerting looks correct and well-scoped
Good addition and placement under spec.monitoring. Keeping it as a free-form object allows forward compatibility with future fields.
tests/e2e/monitoring_test.go (2)
51-53
: New alerting test cases: good coverage and placement
Adding creation and deletion tests for PrometheusRule improves behavioral coverage and aligns with the PR objective to enable built-in alerting.
625-632
: Ensure metrics is configured when enabling alerting (reduce flakiness)
CRD cross-field validation and controller behavior can reject alerting = {} or make it flaky if .spec.monitoring.metrics isn't set. Make the test self-sufficient by configuring metrics in the same mutate pipeline.
Apply this diff:
tc.EventuallyResourceCreatedOrUpdated( WithMinimalObject(gvk.DSCInitialization, tc.DSCInitializationNamespacedName), WithMutateFunc(testf.TransformPipeline( testf.Transform(`.spec.monitoring.managementState = "%s"`, operatorv1.Managed), + // Ensure metrics is configured so alerting is valid and stack is present + setMonitoringMetrics(), testf.Transform(`.spec.monitoring.alerting = {}`), )), )
tc.EnsureResourceGone(
	WithMinimalObject(gvk.PrometheusRule, types.NamespacedName{Name: "dashboard-prometheusrules", Namespace: tc.AppsNamespace}),
)
}
💡 Verification agent
🧩 Analysis chain
Use existing plural helper (and verify availability) + assert operator rules persist
Elsewhere in this file you use EnsureResourcesGone
. For consistency and to avoid potential missing helper issues, prefer the plural form. Also, after removing the dashboard component, it’s valuable to assert that operator-level rules are still present when alerting remains enabled.
Apply this diff:
-// Ensure the dashboard-prometheusrules is deleted
-tc.EnsureResourceGone(
- WithMinimalObject(gvk.PrometheusRule, types.NamespacedName{Name: "dashboard-prometheusrules", Namespace: tc.AppsNamespace}),
-)
+// Ensure the dashboard-specific PrometheusRule is deleted
+tc.EnsureResourcesGone(
+ WithMinimalObject(gvk.PrometheusRule, types.NamespacedName{Name: "dashboard-prometheusrules", Namespace: tc.AppsNamespace}),
+)
+
+// Operator-level rules should remain when alerting is still enabled
+tc.EnsureResourceExists(
+ WithMinimalObject(gvk.PrometheusRule, types.NamespacedName{Name: "operator-prometheusrules", Namespace: tc.AppsNamespace}),
+)
Run this script to verify whether EnsureResourceGone exists in the test helpers; if not, the change above avoids a build failure:
🏁 Script executed:
#!/bin/bash
# Check for EnsureResourceGone helper presence
rg -n --glob '!vendor' $'func \\(.*\\) EnsureResourceGone\\(' -A 2 || true
rg -n --glob '!vendor' $'EnsureResourceGone\\(' -A 2 || true
Length of output: 2409
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "Searching for EnsureResourcesGone definition and usages..."
rg -n --glob '!vendor' $'func \\(.*\\) EnsureResourcesGone\\(' -A 2 || true
rg -n --glob '!vendor' $'EnsureResourcesGone\\(' -A 2 || true
echo
echo "Searching for EnsureResourceExists / related helpers..."
rg -n --glob '!vendor' $'func \\(.*\\) EnsureResourceExists\\(' -A 2 || true
rg -n --glob '!vendor' $'EnsureResourceExists\\(' -A 2 || true
rg -n --glob '!vendor' $'func \\(.*\\) EnsureResourcesExist\\(' -A 2 || true
rg -n --glob '!vendor' $'EnsureResourcesExist\\(' -A 2 || true
rg -n --glob '!vendor' $'func \\(.*\\) EnsureResourcesExists\\(' -A 2 || true
rg -n --glob '!vendor' $'EnsureResourcesExists\\(' -A 2 || true
echo
echo "Searching for PrometheusRule usages and any operator-prometheusrules literal..."
rg -n --glob '!vendor' -S $'PrometheusRule' -g '!**/vendor/**' || true
rg -n --glob '!vendor' -S $'operator-prometheusrules' -g '!**/vendor/**' || true
Length of output: 37728
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "---- tests/e2e/monitoring_test.go: lines 600-700 ----"
sed -n '600,700p' tests/e2e/monitoring_test.go || true
echo
echo "---- tests/e2e/test_context_test.go: lines 320-400 ----"
sed -n '320,400p' tests/e2e/test_context_test.go || true
Length of output: 6842
Use plural helper and assert operator PrometheusRule remains
EnsureResourcesGone and EnsureResourceExists are present in tests/e2e/test_context_test.go, so switch to the plural helper for consistency and add an assertion that operator-prometheusrules still exists after removing the dashboard.
- File: tests/e2e/monitoring_test.go — update ValidatePrometheusRuleDeletion (around lines 678–681).
Apply this diff:
-// Ensure the dashboard-prometheusrules is deleted
-tc.EnsureResourceGone(
- WithMinimalObject(gvk.PrometheusRule, types.NamespacedName{Name: "dashboard-prometheusrules", Namespace: tc.AppsNamespace}),
-)
+// Ensure the dashboard-specific PrometheusRule is deleted
+tc.EnsureResourcesGone(
+ WithMinimalObject(gvk.PrometheusRule, types.NamespacedName{Name: "dashboard-prometheusrules", Namespace: tc.AppsNamespace}),
+)
+
+// Operator-level rules should remain when alerting is still enabled
+tc.EnsureResourceExists(
+ WithMinimalObject(gvk.PrometheusRule, types.NamespacedName{Name: "operator-prometheusrules", Namespace: tc.AppsNamespace}),
+)
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
// Ensure the dashboard-specific PrometheusRule is deleted
tc.EnsureResourcesGone(
	WithMinimalObject(gvk.PrometheusRule, types.NamespacedName{Name: "dashboard-prometheusrules", Namespace: tc.AppsNamespace}),
)

// Operator-level rules should remain when alerting is still enabled
tc.EnsureResourceExists(
	WithMinimalObject(gvk.PrometheusRule, types.NamespacedName{Name: "operator-prometheusrules", Namespace: tc.AppsNamespace}),
)
}
🤖 Prompt for AI Agents
In tests/e2e/monitoring_test.go around lines 678 to 681 in
ValidatePrometheusRuleDeletion, switch the single-resource call to the plural
helper: replace tc.EnsureResourceGone(WithMinimalObject(gvk.PrometheusRule,
types.NamespacedName{Name: "dashboard-prometheusrules", Namespace:
tc.AppsNamespace})) with
tc.EnsureResourcesGone(WithMinimalObject(gvk.PrometheusRule,
types.NamespacedName{Name: "dashboard-prometheusrules", Namespace:
tc.AppsNamespace})), and then add a call to assert the operator rule still
exists: tc.EnsureResourceExists(WithMinimalObject(gvk.PrometheusRule,
types.NamespacedName{Name: "operator-prometheusrules", Namespace:
tc.AppsNamespace})).
Actionable comments posted: 6
🧹 Nitpick comments (2)
internal/controller/components/modelmeshserving/monitoring/modelmeshserving-prometheusrules.tmpl.yaml (2)
15-18
: Use an exact-match label selector instead of regex for a fixed value
instance=~"modelmesh-controller" is a regex match; use instance="modelmesh-controller" for precision and performance.
Apply this diff in each alert expression:
- sum(probe_success:burnrate5m{instance=~"modelmesh-controller"}) by (instance) > (14.40 * (1-0.98000)) + sum(probe_success:burnrate5m{instance="modelmesh-controller"}) by (instance) > (14.40 * (1-0.98000)) - sum(probe_success:burnrate1h{instance=~"modelmesh-controller"}) by (instance) > (14.40 * (1-0.98000)) + sum(probe_success:burnrate1h{instance="modelmesh-controller"}) by (instance) > (14.40 * (1-0.98000)) - sum(probe_success:burnrate30m{instance=~"modelmesh-controller"}) by (instance) > (6.00 * (1-0.98000)) + sum(probe_success:burnrate30m{instance="modelmesh-controller"}) by (instance) > (6.00 * (1-0.98000)) - sum(probe_success:burnrate6h{instance=~"modelmesh-controller"}) by (instance) > (6.00 * (1-0.98000)) + sum(probe_success:burnrate6h{instance="modelmesh-controller"}) by (instance) > (6.00 * (1-0.98000)) - sum(probe_success:burnrate2h{instance=~"modelmesh-controller"}) by (instance) > (3.00 * (1-0.98000)) + sum(probe_success:burnrate2h{instance="modelmesh-controller"}) by (instance) > (3.00 * (1-0.98000)) - sum(probe_success:burnrate1d{instance=~"modelmesh-controller"}) by (instance) > (3.00 * (1-0.98000)) + sum(probe_success:burnrate1d{instance="modelmesh-controller"}) by (instance) > (3.00 * (1-0.98000))Also applies to: 27-30, 39-42
6-9
: Consider adding standard labels/annotations for discoverability and routing
Adding labels like app.kubernetes.io/component, app.kubernetes.io/part-of, and odh.component=modelmeshserving helps selection, ownership, and cleanup across components.
I can draft a consistent labeling scheme used by other component PrometheusRules in this PR so the monitoring controller can manage them uniformly.
Also applies to: 47-49
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these settings in your CodeRabbit configuration.
📒 Files selected for processing (1)
internal/controller/components/modelmeshserving/monitoring/modelmeshserving-prometheusrules.tmpl.yaml (1 hunks)
🔇 Additional comments (2)
internal/controller/components/modelmeshserving/monitoring/modelmeshserving-prometheusrules.tmpl.yaml (2)
10-14
: Alert rule naming/structure LGTM; multi-window burn-rate thresholds align with the 98% SLO
Once the base series and selectors are fixed, these three windows and durations should behave as expected.
Also applies to: 22-26, 34-38
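For reference, the fast-burn threshold under the 98% SLO works out as:

```latex
\text{threshold} = 14.40 \times (1 - 0.98) = 14.40 \times 0.02 = 0.288
```

So the 5m/1h pair fires only when more than 28.8% of probes fail across both windows; the 6.00 and 3.00 factors give 0.12 and 0.06 for the slower pairs.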
1-2
: PrometheusRule API group — the repo uses monitoring.rhobs/v1; the original concern does not apply
Quick verification: most PrometheusRule manifests in the repo use monitoring.rhobs/v1; only one file uses monitoring.coreos.com/v1. The file under review is consistent with the repo standard.
Files found (apiVersion -> value):
- config/monitoring/base/rhods-prometheusrules.yaml -> monitoring.coreos.com/v1
- internal/controller/components/feastoperator/monitoring/feastoperator-prometheusrules.tmpl.yaml -> monitoring.rhobs/v1
- internal/controller/components/codeflare/monitoring/codeflare-prometheusrules.tmpl.yaml -> monitoring.rhobs/v1
- internal/controller/components/workbenches/monitoring/workbenches-prometheusrules.tmpl.yaml -> monitoring.rhobs/v1
- internal/controller/components/ray/monitoring/ray-prometheusrules.tmpl.yaml -> monitoring.rhobs/v1
- internal/controller/components/modelregistry/monitoring/modelregistry-prometheusrules.tmpl.yaml -> monitoring.rhobs/v1
- internal/controller/components/trainingoperator/monitoring/trainingoperator-prometheusrules.tmpl.yaml -> monitoring.rhobs/v1
- internal/controller/components/trustyai/monitoring/trustyai-prometheusrules.tmpl.yaml -> monitoring.rhobs/v1
- internal/controller/components/modelmeshserving/monitoring/modelmeshserving-prometheusrules.tmpl.yaml -> monitoring.rhobs/v1
- internal/controller/components/modelcontroller/monitoring/modelcontroller-prometheusrules.tmpl.yaml -> monitoring.rhobs/v1
- internal/controller/components/datasciencepipelines/monitoring/datasciencepipelines-prometheusrules.tmpl.yaml -> monitoring.rhobs/v1
- internal/controller/components/kueue/monitoring/kueue-prometheusrules.tmpl.yaml -> monitoring.rhobs/v1
- internal/controller/components/kserve/monitoring/kserve-prometheusrules.tmpl.yaml -> monitoring.rhobs/v1
- internal/controller/components/dashboard/monitoring/dashboard-prometheusrules.tmpl.yaml -> monitoring.rhobs/v1
- internal/controller/components/llamastackoperator/monitoring/llamastackoperator-prometheusrules.tmpl.yaml -> monitoring.rhobs/v1
- internal/controller/services/monitoring/monitoring/operator-prometheusrules.tmpl.yaml -> monitoring.rhobs/v1
Recommendation: If the intended target is the Prometheus Operator CRD (monitoring.coreos.com/v1), pick that as the canonical API and update the monitoring.rhobs.v1 files; otherwise, keep monitoring.rhobs/v1 and ignore the original suggestion.
Likely an incorrect or invalid review comment.
/test opendatahub-operator-e2e
Actionable comments posted: 7
♻️ Duplicate comments (12)
internal/controller/components/workbenches/monitoring/workbenches-prometheusrules.tmpl.yaml (3)
14-14
: Align PVC regex selectors between numerator and denominator to avoid mismatched series
The used/capacity selectors use different regexes, which can silently drop series during division. Make them identical.
Apply:
-        expr: kubelet_volume_stats_used_bytes{persistentvolumeclaim=~".*jupyterhub-nb-.*"} / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"jupyterhub-nb-.*"} > 0.9 and kubelet_volume_stats_used_bytes{persistentvolumeclaim=~".*jupyterhub-nb-.*"} / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"jupyterhub-nb-.*"} < 0.99
+        expr: kubelet_volume_stats_used_bytes{persistentvolumeclaim=~"jupyterhub-nb-.*"} / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"jupyterhub-nb-.*"} > 0.9 and kubelet_volume_stats_used_bytes{persistentvolumeclaim=~"jupyterhub-nb-.*"} / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"jupyterhub-nb-.*"} < 0.99
53-100
: Duplicate alert names in one group make the alerts indistinguishable—make them unique
All four alerts share the same name in the group SLOs-probe_success_workbench. Prometheus loads such rules, but same-name alerts that also share a label set collide in Alertmanager and are reported as duplicates by promtool.
Suffix the alert names (and optionally summaries) with the window pair:
-      - alert: RHODS Jupyter Probe Success Burn Rate
+      - alert: RHODS Jupyter Probe Success Burn Rate 5m_1h
 ...
-      - alert: RHODS Jupyter Probe Success Burn Rate
+      - alert: RHODS Jupyter Probe Success Burn Rate 30m_6h
 ...
-      - alert: RHODS Jupyter Probe Success Burn Rate
+      - alert: RHODS Jupyter Probe Success Burn Rate 2h_1d
 ...
-      - alert: RHODS Jupyter Probe Success Burn Rate
+      - alert: RHODS Jupyter Probe Success Burn Rate 6h_3d
Optionally, mirror the suffix in summary: to aid triage.
To check for remaining duplicates (and for single-quoted selectors to normalize to double quotes), run:
#!/bin/bash
set -euo pipefail
f='internal/controller/components/workbenches/monitoring/workbenches-prometheusrules.tmpl.yaml'

echo "Duplicate alert names within SLO group:"
# the awk state variable is named inblock because `in` is a reserved word in awk
awk '
  $0 ~ /- name: SLOs-probe_success_workbench/ { inblock=1; next }
  inblock && $0 ~ /- name:/ { inblock=0 }
  inblock && $0 ~ /^\s*- alert:\s*(.*)$/ { name=$0; gsub(/.*- alert:\s*/,"",name); c[name]++ }
  END { for (k in c) if (c[k]>1) printf "%s -> %d\n", k, c[k] }
' "$f" || true

echo "Single-quoted label selectors (prefer double quotes for consistency):"
rg -n --color=never "=[~!]?'[^']+'" "$f" || true
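In addition, promtool can lint the rendered rules, including duplicate-rule reporting — a hedged sketch, assuming the template has been rendered to plain YAML and that yq is available to unwrap the CR (paths are illustrative):

```bash
# PrometheusRule CRs nest the rule file under .spec, so unwrap before checking.
yq '.spec' /tmp/rendered/workbenches-prometheusrules.yaml > /tmp/rules.yaml
promtool check rules /tmp/rules.yaml   # flags duplicates among other issues
```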
174-177
: PromQL type error: label_replace applied to a scalar; use the vector form with aggregation
label_replace(min(probe_success{...}), ...) applies label_replace to a scalar (since min(...) without by() collapses to a scalar). This fails at evaluation time. Keep it as a single-element vector using min by () and then apply label_replace. Also keep the per-name vector via min by (name).
-      - expr: ((min(probe_success{name=~"rhods-dashboard|notebook-spawner"}) by (name) or on() vector(0)) or label_replace(min(probe_success{name=~"rhods-dashboard|notebook-spawner"}), "name", "combined", "name", ".*"))
+      - expr: |
+          min by (name) (probe_success{name=~"rhods-dashboard|notebook-spawner"})
+          or
+          label_replace(
+            min by () (probe_success{name=~"rhods-dashboard|notebook-spawner"}),
+            "name", "combined", "", ""
+          )
Optional: if these probes are scraped under a dedicated job, constrain with job="user_facing_endpoints_status_workbenches" to avoid mixing unrelated series.
internal/controller/components/codeflare/monitoring/codeflare-prometheusrules.tmpl.yaml (2)
internal/controller/components/codeflare/monitoring/codeflare-prometheusrules.tmpl.yaml (2)
20-21
: Template the namespace label instead of hardcoding. [duplicate]Hardcoding the alert label namespace breaks multi-namespace installs. Use the templated namespace to mirror metadata.namespace.
Apply this diff at each label block:
labels: severity: info - namespace: redhat-ods-applications + namespace: {{.Namespace}}And similarly for the other occurrences on Lines 32-33, 44-45, 52-55, and 61-64.
Also applies to: 32-33, 44-45, 52-55, 61-64
8-10
: Correct YAML indentation and remove trailing spaces. [duplicate]YAMLlint errors indicate mis-indentation and trailing whitespace that will cause parsing or lint failures. Align list items and keys to standard PrometheusRule structure.
Apply these diffs:
- - name: SLOs-probe_success_codeflare + - name: SLOs-probe_success_codeflare- rules: + rules:- - alert: CodeFlare Operator Probe Success Burn Rate + - alert: CodeFlare Operator Probe Success Burn Rate- +- - expr: | + - expr: |- - expr: | + - expr: |Note: The intent is:
- spec
- groups
- name
- rules
- alert/expr/etc.
Please re-run yamllint after changes.
Also applies to: 68-68, 72-72, 110-110
internal/controller/components/llamastackoperator/monitoring/llamastackoperator-prometheusrules.tmpl.yaml (2)
49-49
: Remove trailing space in comment
-  # RecordingRules for Llama Stack 
+  # RecordingRules for Llama Stack
52-56
: PromQL: compute probe_success from up; the current expression can mask outages
absent(up)*0 or vector(1) returns 1 whenever the series merely exists, even when up==0 (down), hiding outages. Record probe_success from the actual up metric, scoped to the namespace, and coalesce missing to 0.
-      - expr: |
-          absent(up{job="llama-stack-k8s-operator"}) * 0 or vector(1)
+      - expr: |
+          avg without (pod, endpoint, instance) (
+            up{job="llama-stack-k8s-operator", namespace="{{.Namespace}}"}
+          ) or on() vector(0)
         labels:
           instance: llama-stack-k8s-operator-controller-manager
         record: probe_success
internal/controller/components/feastoperator/monitoring/feastoperator-prometheusrules.tmpl.yaml (3)
8-8
: Fix indentation under spec.groups
YAMLlint flags the list item under spec.groups as over-indented (6 spaces). It should be indented by 4 spaces to be a peer of subsequent groups.
Apply this diff:
-      - name: SLOs-probe_success_feast_operator
+    - name: SLOs-probe_success_feast_operator
52-52
: Correct indentation for rule items under rules
The list item under rules should be indented by 10 spaces (currently 8). This is a common source of YAML parsing errors.
Apply this diff:
-        - expr: |
+          - expr: |
22-22
: Remove trailing whitespace to satisfy YAML linters
Multiple blank lines contain trailing spaces. Clean them up to prevent lint/pipeline failures.
Apply this diff:
Delete the trailing spaces on each flagged blank line.
Also applies to: 35-35, 57-57, 63-63, 69-69, 75-75, 81-81, 87-87, 93-93
internal/controller/components/datasciencepipelines/monitoring/datasciencepipelines-prometheusrules.tmpl.yaml (2)
20-21
: Use an appropriate severity (warning/critical) for burn-rate alerts, not info
These are SLO violations; “info” is too low for actionable burn-rate alerts and will likely be ignored by routing/pagers. Align with other components (page on fast-burn, ticket on slow-burn).
Example mapping:
- 5m/1h: critical
- 30m/6h: critical or warning (choose per policy)
- 2h/1d: warning
- 6h/3d: warning
Do you want me to generate an exact diff with your preferred mapping?
Also applies to: 32-33, 44-45, 56-57
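A minimal sketch of that mapping applied to the label blocks (window pairs shown as comments; adjust per your paging policy):

```yaml
# fast-burn pairs (5m/1h, 30m/6h)
labels:
  severity: critical

# slow-burn pairs (2h/1d, 6h/3d)
labels:
  severity: warning
```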
98-133
: Correct PromQL syntax in readiness alerts and address the message vs timing mismatch
- PromQL: the aggregator must include the by clause: min by (...) (expr).
- The message says “more than 5 minutes” but the condition is effectively a 3m window plus a 2m for.
Align the text or the timings. Apply the syntax fix:
-        expr: min(max_over_time(data_science_pipelines_application_ready[3m])) by (dspa_name, dspa_namespace) == 0
+        expr: min by (dspa_name, dspa_namespace) (max_over_time(data_science_pipelines_application_ready[3m])) == 0
For the timing, either:
- Update “5 minutes” → “3 minutes”, or
- Use a 5m range and adjust
for
to match.Example to truly cover 5 minutes:
- expr: min by (dspa_name, dspa_namespace) (max_over_time(data_science_pipelines_application_ready[3m])) == 0 - for: 2m + expr: min by (dspa_name, dspa_namespace) (max_over_time(data_science_pipelines_application_ready[5m])) == 0 + for: 1m
🧹 Nitpick comments (12)
internal/controller/components/workbenches/monitoring/workbenches-prometheusrules.tmpl.yaml (2)
23-23
: PromQL: normalize regex and spacing in the 100% PVC usage alertKeep the same anchored regex in both operands and add spacing for readability.
- expr: kubelet_volume_stats_used_bytes{persistentvolumeclaim=~".*jupyterhub-nb-.*"}/kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"jupyterhub-nb-.*"} > 0.99 + expr: kubelet_volume_stats_used_bytes{persistentvolumeclaim=~"jupyterhub-nb-.*"} / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"jupyterhub-nb-.*"} > 0.99
8-10
: YAML lint: indentation and trailing spacesYAMLlint flags indentation style and trailing spaces. Not functionally breaking, but it may fail CI/style checks. Recommend normalizing indent to 2 spaces under
groups:
and removing trailing spaces.Minimal trailing-space cleanup:
- +And:
- +Consider running your repo’s YAML formatter/linter to auto-fix the remaining indentation warnings.
Also applies to: 31-31, 50-50, 53-53, 105-105, 143-143, 171-171, 174-174
internal/controller/components/codeflare/monitoring/codeflare-prometheusrules.tmpl.yaml (2)
69-71
: Fix mismatched comment header for recording rules.The comment says “RecordingRules for CodeFlare Operator” but the following group is “SLOs - MCAD Controller”. Update the comment to avoid confusion.
- # RecordingRules for CodeFlare Operator + # RecordingRules for MCAD Controller
47-59
: Optional: Aggregate multiple targets and use explicit down condition.If multiple CodeFlare Operator targets exist, the current rule may produce multiple alerts or miss partial-down scenarios. Consider aggregating and alerting when all targets are down.
Example (optional):
- Use sum(up{job=~"CodeFlare Operator"}) == 0
- Or min_over_time(sum by (job)(up{job=~"CodeFlare Operator"})[2m]) == 0 for stability
Not blocking; offered for robustness.
internal/controller/components/llamastackoperator/monitoring/llamastackoperator-prometheusrules.tmpl.yaml (2)
15-21
: Prefer exact label match over regex and keep selectors minimal.Use
instance="..."
instead ofinstance=~"..."
since the value is fixed; regex adds overhead and potential surprises.- sum(probe_success:burnrate5m{instance=~"llama-stack-k8s-operator-controller-manager"}) by (instance) > (14.40 * (1-0.98000)) + sum(probe_success:burnrate5m{instance="llama-stack-k8s-operator-controller-manager"}) by (instance) > (14.40 * (1-0.98000)) @@ - sum(probe_success:burnrate1h{instance=~"llama-stack-k8s-operator-controller-manager"}) by (instance) > (14.40 * (1-0.98000)) + sum(probe_success:burnrate1h{instance="llama-stack-k8s-operator-controller-manager"}) by (instance) > (14.40 * (1-0.98000)) @@ - sum(probe_success:burnrate30m{instance=~"llama-stack-k8s-operator-controller-manager"}) by (instance) > (6.00 * (1-0.98000)) + sum(probe_success:burnrate30m{instance="llama-stack-k8s-operator-controller-manager"}) by (instance) > (6.00 * (1-0.98000)) @@ - sum(probe_success:burnrate6h{instance=~"llama-stack-k8s-operator-controller-manager"}) by (instance) > (6.00 * (1-0.98000)) + sum(probe_success:burnrate6h{instance="llama-stack-k8s-operator-controller-manager"}) by (instance) > (6.00 * (1-0.98000)) @@ - sum(probe_success:burnrate2h{instance=~"llama-stack-k8s-operator-controller-manager"}) by (instance) > (3.00 * (1-0.98000)) + sum(probe_success:burnrate2h{instance="llama-stack-k8s-operator-controller-manager"}) by (instance) > (3.00 * (1-0.98000)) @@ - sum(probe_success:burnrate1d{instance=~"llama-stack-k8s-operator-controller-manager"}) by (instance) > (3.00 * (1-0.98000)) + sum(probe_success:burnrate1d{instance="llama-stack-k8s-operator-controller-manager"}) by (instance) > (3.00 * (1-0.98000))Also applies to: 28-34, 41-47
8-8
: Unify group naming style for readability
Group names mix hyphenated and spaced styles:
SLOs-probe_success_llama_stack_k8s_operator
SLOs - Llama Stack K8s Operator
Pick one convention (suggest the spaced form) for consistency.
-    - name: SLOs-probe_success_llama_stack_k8s_operator
+    - name: SLOs - Llama Stack K8s Operator (Alerts)
@@
-    - name: SLOs - Llama Stack K8s Operator
+    - name: SLOs - Llama Stack K8s Operator (RecordingRules)
internal/controller/components/feastoperator/monitoring/feastoperator-prometheusrules.tmpl.yaml (2)
15-17
: Simplify PromQL by removing unnecessary sum/by and regex equality

These selectors return a single series per instance; sum by (instance) and =~ can be dropped for clarity and minor eval savings. Apply this diff (shown for the 5m/1h pair; the 30m/6h and 2h/1d pairs follow the same pattern):

- sum(probe_success:burnrate5m{instance=~"feast-operator-controller-manager"}) by (instance) > (14.40 * (1-0.98000))
+ probe_success:burnrate5m{instance="feast-operator-controller-manager"} > (14.40 * (1-0.98000))
  and
- sum(probe_success:burnrate1h{instance=~"feast-operator-controller-manager"}) by (instance) > (14.40 * (1-0.98000))
+ probe_success:burnrate1h{instance="feast-operator-controller-manager"} > (14.40 * (1-0.98000))

Also applies to: 28-30, 41-43
59-62
: Drop redundant min() around avg_over_time()

avg_over_time() already returns a single series for this selector; min() is unnecessary noise. Apply this diff to each window (1d, 1h, 2h, 30m, 3d, 5m, 6h):

- 1 - min(avg_over_time(probe_success{instance="feast-operator-controller-manager"}[1d]))
+ 1 - avg_over_time(probe_success{instance="feast-operator-controller-manager"}[1d])

Also applies to: 65-68, 71-74, 77-80, 83-86, 89-92, 95-98
internal/controller/components/datasciencepipelines/monitoring/datasciencepipelines-prometheusrules.tmpl.yaml (4)
98-129
: Make alert names consistent: use "Data Science Pipelines …" (plural) throughout

Alert names in this section mix singular and plural. Use "Pipelines" consistently:

- - alert: Data Science Pipeline Application Unavailable
+ - alert: Data Science Pipelines Application Unavailable
- - alert: Data Science Pipeline APIServer Unavailable
+ - alert: Data Science Pipelines APIServer Unavailable
- - alert: Data Science Pipeline PersistenceAgent Unavailable
+ - alert: Data Science Pipelines PersistenceAgent Unavailable
- - alert: Data Science Pipeline ScheduledWorkflows Unavailable
+ - alert: Data Science Pipelines ScheduledWorkflows Unavailable
19-33
: Avoid hard-coding the namespace label; template it to the target namespace

Hard-coding 'redhat-ods-applications' risks misrouting alerts when the operator is installed elsewhere. Use the templated namespace to keep routing consistent with the deployment, replacing namespace: redhat-ods-applications with namespace: {{ .Namespace }} in the rule labels. Repeat within this group for the other two rules (lines 44-45 and 56-57).
69-95
: Template the namespace label here as well

Apply the same namespace: redhat-ods-applications to namespace: {{ .Namespace }} change on the severity: critical and severity: warning rules in this group.
: Template thenamespace
label for readiness alerts.labels: severity: info - namespace: redhat-ods-applications + namespace: {{ .Namespace }} ... labels: severity: info - namespace: redhat-ods-applications + namespace: {{ .Namespace }} ... labels: severity: info - namespace: redhat-ods-applications + namespace: {{ .Namespace }} ... labels: severity: info - namespace: redhat-ods-applications + namespace: {{ .Namespace }}
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (34)
- api/services/v1alpha1/monitoring_types.go (3 hunks)
- api/services/v1alpha1/zz_generated.deepcopy.go (2 hunks)
- bundle/manifests/dscinitialization.opendatahub.io_dscinitializations.yaml (2 hunks)
- bundle/manifests/opendatahub-operator.clusterserviceversion.yaml (2 hunks)
- bundle/manifests/services.platform.opendatahub.io_monitorings.yaml (2 hunks)
- config/crd/bases/dscinitialization.opendatahub.io_dscinitializations.yaml (2 hunks)
- config/crd/bases/services.platform.opendatahub.io_monitorings.yaml (2 hunks)
- config/rbac/role.yaml (2 hunks)
- docs/api-overview.md (4 hunks)
- internal/controller/components/codeflare/monitoring/codeflare-prometheusrules.tmpl.yaml (1 hunks)
- internal/controller/components/components.go (1 hunks)
- internal/controller/components/dashboard/monitoring/dashboard-prometheusrules.tmpl.yaml (1 hunks)
- internal/controller/components/datasciencepipelines/monitoring/datasciencepipelines-prometheusrules.tmpl.yaml (1 hunks)
- internal/controller/components/feastoperator/monitoring/feastoperator-prometheusrules.tmpl.yaml (1 hunks)
- internal/controller/components/kserve/monitoring/kserve-prometheusrules.tmpl.yaml (1 hunks)
- internal/controller/components/kueue/monitoring/kueue-prometheusrules.tmpl.yaml (1 hunks)
- internal/controller/components/llamastackoperator/monitoring/llamastackoperator-prometheusrules.tmpl.yaml (1 hunks)
- internal/controller/components/modelcontroller/monitoring/modelcontroller-prometheusrules.tmpl.yaml (1 hunks)
- internal/controller/components/modelmeshserving/monitoring/modelmeshserving-prometheusrules.tmpl.yaml (1 hunks)
- internal/controller/components/modelregistry/monitoring/modelregistry-prometheusrules.tmpl.yaml (1 hunks)
- internal/controller/components/ray/monitoring/ray-prometheusrules.tmpl.yaml (1 hunks)
- internal/controller/components/trainingoperator/monitoring/trainingoperator-prometheusrules.tmpl.yaml (1 hunks)
- internal/controller/components/trustyai/monitoring/trustyai-prometheusrules.tmpl.yaml (1 hunks)
- internal/controller/components/workbenches/monitoring/workbenches-prometheusrules.tmpl.yaml (1 hunks)
- internal/controller/dscinitialization/dscinitialization_controller.go (1 hunks)
- internal/controller/dscinitialization/kubebuilder_rbac.go (1 hunks)
- internal/controller/services/monitoring/monitoring/operator-prometheusrules.tmpl.yaml (1 hunks)
- internal/controller/services/monitoring/monitoring_controller.go (2 hunks)
- internal/controller/services/monitoring/monitoring_controller_actions.go (3 hunks)
- internal/controller/services/monitoring/monitoring_controller_support.go (2 hunks)
- internal/controller/status/status.go (2 hunks)
- pkg/cluster/gvk/gvk.go (2 hunks)
- pkg/common/common.go (2 hunks)
- tests/e2e/monitoring_test.go (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (29)
- internal/controller/dscinitialization/dscinitialization_controller.go
- internal/controller/services/monitoring/monitoring_controller_support.go
- internal/controller/services/monitoring/monitoring_controller.go
- internal/controller/components/ray/monitoring/ray-prometheusrules.tmpl.yaml
- pkg/common/common.go
- pkg/cluster/gvk/gvk.go
- internal/controller/components/modelcontroller/monitoring/modelcontroller-prometheusrules.tmpl.yaml
- internal/controller/components/components.go
- internal/controller/components/kserve/monitoring/kserve-prometheusrules.tmpl.yaml
- internal/controller/components/trainingoperator/monitoring/trainingoperator-prometheusrules.tmpl.yaml
- internal/controller/components/modelregistry/monitoring/modelregistry-prometheusrules.tmpl.yaml
- internal/controller/components/trustyai/monitoring/trustyai-prometheusrules.tmpl.yaml
- api/services/v1alpha1/zz_generated.deepcopy.go
- config/crd/bases/services.platform.opendatahub.io_monitorings.yaml
- internal/controller/services/monitoring/monitoring_controller_actions.go
- bundle/manifests/opendatahub-operator.clusterserviceversion.yaml
- internal/controller/status/status.go
- internal/controller/services/monitoring/monitoring/operator-prometheusrules.tmpl.yaml
- config/crd/bases/dscinitialization.opendatahub.io_dscinitializations.yaml
- bundle/manifests/dscinitialization.opendatahub.io_dscinitializations.yaml
- tests/e2e/monitoring_test.go
- bundle/manifests/services.platform.opendatahub.io_monitorings.yaml
- config/rbac/role.yaml
- internal/controller/components/modelmeshserving/monitoring/modelmeshserving-prometheusrules.tmpl.yaml
- internal/controller/dscinitialization/kubebuilder_rbac.go
- api/services/v1alpha1/monitoring_types.go
- internal/controller/components/kueue/monitoring/kueue-prometheusrules.tmpl.yaml
- docs/api-overview.md
- internal/controller/components/dashboard/monitoring/dashboard-prometheusrules.tmpl.yaml
🧰 Additional context used
🪛 YAMLlint (1.37.1)
internal/controller/components/codeflare/monitoring/codeflare-prometheusrules.tmpl.yaml
[warning] 8-8: wrong indentation: expected 4 but found 6
(indentation)
[warning] 10-10: wrong indentation: expected 10 but found 8
(indentation)
[error] 68-68: trailing spaces
(trailing-spaces)
[warning] 72-72: wrong indentation: expected 10 but found 8
(indentation)
[warning] 110-110: wrong indentation: expected 10 but found 8
(indentation)
internal/controller/components/feastoperator/monitoring/feastoperator-prometheusrules.tmpl.yaml
[warning] 8-8: wrong indentation: expected 4 but found 6
(indentation)
[error] 22-22: trailing spaces
(trailing-spaces)
[error] 35-35: trailing spaces
(trailing-spaces)
[warning] 52-52: wrong indentation: expected 10 but found 8
(indentation)
[error] 57-57: trailing spaces
(trailing-spaces)
[error] 63-63: trailing spaces
(trailing-spaces)
[error] 69-69: trailing spaces
(trailing-spaces)
[error] 75-75: trailing spaces
(trailing-spaces)
[error] 81-81: trailing spaces
(trailing-spaces)
[error] 87-87: trailing spaces
(trailing-spaces)
[error] 93-93: trailing spaces
(trailing-spaces)
internal/controller/components/llamastackoperator/monitoring/llamastackoperator-prometheusrules.tmpl.yaml
[warning] 8-8: wrong indentation: expected 4 but found 6
(indentation)
[error] 22-22: trailing spaces
(trailing-spaces)
[error] 35-35: trailing spaces
(trailing-spaces)
[error] 48-48: trailing spaces
(trailing-spaces)
[error] 49-49: trailing spaces
(trailing-spaces)
[warning] 52-52: wrong indentation: expected 10 but found 8
(indentation)
[error] 57-57: trailing spaces
(trailing-spaces)
[error] 63-63: trailing spaces
(trailing-spaces)
[error] 69-69: trailing spaces
(trailing-spaces)
[error] 75-75: trailing spaces
(trailing-spaces)
[error] 81-81: trailing spaces
(trailing-spaces)
[error] 87-87: trailing spaces
(trailing-spaces)
[error] 93-93: trailing spaces
(trailing-spaces)
internal/controller/components/workbenches/monitoring/workbenches-prometheusrules.tmpl.yaml
[warning] 8-8: wrong indentation: expected 4 but found 6
(indentation)
[warning] 10-10: wrong indentation: expected 10 but found 8
(indentation)
[warning] 31-31: wrong indentation: expected 10 but found 8
(indentation)
[error] 50-50: trailing spaces
(trailing-spaces)
[warning] 53-53: wrong indentation: expected 10 but found 8
(indentation)
[warning] 105-105: wrong indentation: expected 10 but found 8
(indentation)
[warning] 143-143: wrong indentation: expected 10 but found 8
(indentation)
[error] 171-171: trailing spaces
(trailing-spaces)
[warning] 174-174: wrong indentation: expected 10 but found 8
(indentation)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Build/push catalog image
- GitHub Check: golangci-lint
- GitHub Check: Run tests and collect coverage on internal and pkg
🔇 Additional comments (3)
internal/controller/components/workbenches/monitoring/workbenches-prometheusrules.tmpl.yaml (1)
102-139
: Recording rules set looks consistent and follows burn-rate conventions

Windows, record names, and labels align with the burn-rate alerts above. No issues spotted.
internal/controller/components/codeflare/monitoring/codeflare-prometheusrules.tmpl.yaml (1)
51-51
: Fix PromQL quoting and matcher operators in CodeFlare Operator alerts

File: internal/controller/components/codeflare/monitoring/codeflare-prometheusrules.tmpl.yaml, lines 51 and 60. Replace the single-quoted label matchers with valid double-quoted ones and standardize on the =~ operator for both alerts:

- expr: absent(up{job=~'CodeFlare Operator'}) or up{job=~'CodeFlare Operator'} != 1
+ expr: absent(up{job=~"CodeFlare Operator"}) or up{job=~"CodeFlare Operator"} != 1

- expr: absent_over_time(up{job="CodeFlare Operator"}[2m]) == 1
+ expr: absent_over_time(up{job=~"CodeFlare Operator"}[2m]) == 1

Please confirm what the job label actually is in your ServiceMonitor/scrape config (if you haven't overridden jobLabel, it defaults to <namespace>/<servicemonitor-name>). If the label value has no spaces (e.g. codeflare-operator), switch to a regex like expr: absent(up{job=~"codeflare-operator.*"}) or up{job=~"codeflare-operator.*"} != 1.

internal/controller/components/feastoperator/monitoring/feastoperator-prometheusrules.tmpl.yaml (1)
1-7
: Overall: solid SLO burn-rate pattern and templating

The rule set follows the standard multi-window, multi-burn SLO approach, uses a clear instance scoping label, and integrates cleanly with the templated namespace. Once the YAML formatting nits are fixed, this looks good to ship.
Also applies to: 10-21, 23-47, 49-51, 58-99
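For reviewers unfamiliar with the threshold arithmetic, the numbers decompose as follows (an assumption based on the standard multi-window, multi-burn-rate method; the template itself does not state the budget window): the error budget for a 98% target is 1 - 0.98 = 0.02, so the 5m/1h pair fires above an error ratio of 14.40 * 0.02 = 0.288, the 30m/6h pair above 6.00 * 0.02 = 0.12, and the 2h/1d pair above 3.00 * 0.02 = 0.06. At a 14.4x burn rate, one hour consumes 14.4/720, roughly 2% of a 30-day budget, which is why the shortest window pair carries the highest multiplier.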
run make bundle, fix_rules_file_name, fix_error_in_dsp_monitoring, fix e2e test, run make lint (3fdf7b8 to 93a65af)
Actionable comments posted: 3
🔭 Outside diff range comments (1)
internal/controller/services/monitoring/monitoring_controller_actions.go (1)
89-104
: Guard against missing componentRules entries before updating Prometheus config

Indexing componentRules[ch.GetName()] without checking presence will pass an empty prefix for unknown components, which can lead to incorrect config updates. Guard the lookup.
Apply this diff:
 return cr.ForEach(func(ch cr.ComponentHandler) error {
   ci := ch.NewCRObject(dsc)
+  prefix, ok := componentRules[ch.GetName()]
+  if !ok {
+    // No prom rules mapping for this component; skip.
+    return nil
+  }
   if ch.IsEnabled(dsc) {
     ready, err := isComponentReady(ctx, rr.Client, ci)
     if err != nil {
       return fmt.Errorf("failed to get component status %w", err)
     }
     if !ready { // not ready, skip change on prom rules
       return nil
     }
     // add
-    return updatePrometheusConfig(ctx, true, componentRules[ch.GetName()])
+    return updatePrometheusConfig(ctx, true, prefix)
   } else {
-    return updatePrometheusConfig(ctx, false, componentRules[ch.GetName()])
+    return updatePrometheusConfig(ctx, false, prefix)
   }
 })
♻️ Duplicate comments (13)
tests/e2e/monitoring_test.go (1)
666-681
: Use plural helper and assert operator PrometheusRule remains

For consistency with other helpers and to validate that operator-level rules persist when a component is removed, apply:
-// Ensure the dashboard-prometheusrules is deleted
-tc.EnsureResourceGone(
-  WithMinimalObject(gvk.PrometheusRule, types.NamespacedName{Name: "dashboard-prometheusrules", Namespace: tc.AppsNamespace}),
-)
+// Ensure the dashboard-specific PrometheusRule is deleted
+tc.EnsureResourcesGone(
+  WithMinimalObject(gvk.PrometheusRule, types.NamespacedName{Name: "dashboard-prometheusrules", Namespace: tc.AppsNamespace}),
+)
+
+// Operator-level rules should remain when alerting is still enabled
+tc.EnsureResourceExists(
+  WithMinimalObject(gvk.PrometheusRule, types.NamespacedName{Name: "operator-prometheusrules", Namespace: tc.AppsNamespace}),
+)

internal/controller/components/feastoperator/monitoring/feastoperator-prometheusrules.tmpl.yaml (1)
8-9
: Fix YAML indentation under groups

YAMLlint flags this list item as over-indented. Align the list item to two spaces under groups:.
The change is indentation-only: de-indent the - name: SLOs-probe_success_feast_operator list item to two spaces under groups:.

internal/controller/components/kserve/monitoring/kserve-prometheusrules.tmpl.yaml (1)
10-42
: Differentiate alert names for clarity

- - alert: Kserve Controller Probe Success Burn Rate
+ - alert: Kserve Controller Probe Success Burn Rate (5m/1h)
...
- - alert: Kserve Controller Probe Success Burn Rate
+ - alert: Kserve Controller Probe Success Burn Rate (30m/6h)
...
- - alert: Kserve Controller Probe Success Burn Rate
+ - alert: Kserve Controller Probe Success Burn Rate (2h/1d)

internal/controller/components/codeflare/monitoring/codeflare-prometheusrules.tmpl.yaml (1)
50-61
: PromQL: use double quotes and a consistent selector

PromQL requires double quotes for label values. Also standardize job label matching across alerts.
Apply this diff:
- - alert: CodeFlare Operator is not running
-   expr: absent(up{job=~'CodeFlare Operator'}) or up{job=~'CodeFlare Operator'} != 1
+ - alert: CodeFlare Operator is not running
+   expr: absent(up{job=~"CodeFlare Operator"}) or up{job=~"CodeFlare Operator"} != 1
@@
- - alert: CodeFlare Operator taking too long to be up
-   expr: absent_over_time(up{job="CodeFlare Operator"}[2m]) == 1
+ - alert: CodeFlare Operator taking too long to be up
+   expr: absent_over_time(up{job=~"CodeFlare Operator"}[2m]) == 1

internal/controller/components/llamastackoperator/monitoring/llamastackoperator-prometheusrules.tmpl.yaml (2)
22-22
: Remove trailing spaces to satisfy linters

Multiple lines have trailing spaces; please strip them file-wide.
Also applies to: 35-35, 48-49, 57-57, 63-63, 69-69, 75-75, 81-81, 87-87, 93-93
52-56
: PromQL: compute probe_success from the up metric; the current expression always returns 1

The existing rule uses series existence, which masks outages. Compute probe_success from actual up values and coalesce missing series to 0.
Apply this diff:
- - expr: |
-     absent(up{job="llama-stack-k8s-operator"}) * 0 or vector(1)
-   labels:
-     instance: llama-stack-k8s-operator-controller-manager
-   record: probe_success
+ - expr: |
+     avg without (pod, endpoint, instance) (
+       up{job="llama-stack-k8s-operator", namespace="{{.Namespace}}"}
+     ) or on() vector(0)
+   labels:
+     instance: llama-stack-k8s-operator-controller-manager
+   record: probe_success

internal/controller/components/workbenches/monitoring/workbenches-prometheusrules.tmpl.yaml (4)
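Worth noting for posterity: the or on() vector(0) fallback in the suggested expression matters. Without it, the recording rule would simply produce no samples once the job's series disappear entirely, and the downstream burn-rate rules would go silent instead of registering a full outage as probe_success = 0.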
14-14
: Unify PVC regexes to avoid mismatched series

The used and capacity selectors should use identical regexes.
Apply this diff:
- expr: kubelet_volume_stats_used_bytes{persistentvolumeclaim=~".*jupyterhub-nb-.*"} / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"jupyterhub-nb-.*"} > 0.9 and kubelet_volume_stats_used_bytes{persistentvolumeclaim=~".*jupyterhub-nb-.*"} / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"jupyterhub-nb-.*"} < 0.99
+ expr: |
+   kubelet_volume_stats_used_bytes{persistentvolumeclaim=~"jupyterhub-nb-.*"}
+   /
+   kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"jupyterhub-nb-.*"} > 0.9
+   and
+   kubelet_volume_stats_used_bytes{persistentvolumeclaim=~"jupyterhub-nb-.*"}
+   /
+   kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"jupyterhub-nb-.*"} < 0.99
53-100
: Make alert names unique within the group for clarity and triage

Repeated alert names hinder triage and can confuse downstream routing. Add window suffixes.
Apply this diff:
- - alert: RHODS Jupyter Probe Success Burn Rate
+ - alert: RHODS Jupyter Probe Success Burn Rate (5m/1h)
@@
- - alert: RHODS Jupyter Probe Success Burn Rate
+ - alert: RHODS Jupyter Probe Success Burn Rate (30m/6h)
@@
- - alert: RHODS Jupyter Probe Success Burn Rate
+ - alert: RHODS Jupyter Probe Success Burn Rate (2h/1d)
@@
- - alert: RHODS Jupyter Probe Success Burn Rate
+ - alert: RHODS Jupyter Probe Success Burn Rate (6h/3d)
35-35
: PromQL strings must be double-quoted

Single quotes are invalid in PromQL label values.
Apply this diff:
- expr: absent(up{job=~'Kubeflow Notebook Controller Service Metrics'})
+ expr: absent(up{job=~"Kubeflow Notebook Controller Service Metrics"})
@@
- expr: absent(up{job=~'ODH Notebook Controller Service Metrics'})
+ expr: absent(up{job=~"ODH Notebook Controller Service Metrics"})

Also applies to: 44-44
174-177
: PromQL: label_replace on a scalar; fix by preserving vector context

label_replace(min(...)) collapses to a scalar; apply min by () to keep a single-element vector, then set "name" for the combined series.
Apply this diff:
- - expr: ((min(probe_success{name=~"rhods-dashboard|notebook-spawner"}) by (name) or on() vector(0)) or label_replace(min(probe_success{name=~"rhods-dashboard|notebook-spawner"}), "name", "combined", "name", ".*"))
+ - expr: |
+     min(probe_success{name=~"rhods-dashboard|notebook-spawner"}) by (name)
+     or
+     label_replace(
+       min by () (probe_success{name=~"rhods-dashboard|notebook-spawner"}),
+       "name", "combined", "", ""
+     )

internal/controller/components/dashboard/monitoring/dashboard-prometheusrules.tmpl.yaml (3)
21-21
: Template namespace label instead of hardcoding

Replace the hardcoded label with the template namespace to match metadata.namespace.
Apply this diff:
- namespace: redhat-ods-applications
+ namespace: {{.Namespace}}

Also applies to: 33-33, 45-45, 57-57, 71-71, 83-83, 95-95, 107-107
10-57
: Make route burn-rate alert names unique per window

Disambiguate alerts by embedding their time windows; this improves triage and dashboards.
Apply this diff:
- - alert: RHODS Dashboard Route Error Burn Rate
+ - alert: RHODS Dashboard Route Error Burn Rate (5m/1h)
@@
- - alert: RHODS Dashboard Route Error Burn Rate
+ - alert: RHODS Dashboard Route Error Burn Rate (30m/6h)
@@
- - alert: RHODS Dashboard Route Error Burn Rate
+ - alert: RHODS Dashboard Route Error Burn Rate (2h/1d)
@@
- - alert: RHODS Dashboard Route Error Burn Rate
+ - alert: RHODS Dashboard Route Error Burn Rate (6h/3d)
162-196
: Recording rules risk "duplicate series for the same label set"; align selectors and aggregate

Each expression returns multiple series (e.g., per instance) but then force-sets labels.instance: rhods-dashboard, causing collisions at record time. Also, the selectors use instance=~ while the corresponding alerts use name=~, creating inconsistency.
Apply this diff to aggregate by name and drop the explicit labels override:
- - expr: |
-     1 - avg_over_time(probe_success{instance=~"rhods-dashboard-.*", job="user_facing_endpoints_status_rhods_dashboard"}[1d])
-   labels:
-     instance: rhods-dashboard
-   record: probe_success:burnrate1d
+ - expr: |
+     sum by (name) (1 - avg_over_time(probe_success{job="user_facing_endpoints_status_rhods_dashboard", name="rhods-dashboard"}[1d]))
+   record: probe_success:burnrate1d

(Repeat the same rewrite for the 1h, 2h, 30m, 3d, 5m, and 6h recording rules.)

This aligns recordings with alert selectors (which use name) and prevents duplicate-series errors.
🧹 Nitpick comments (13)
internal/controller/status/status.go (1)
108-109
: New component condition type looks consistent; consider documenting intent

ConditionAlertingAvailable aligns with existing condition naming (e.g., MonitoringAvailable). For parity and discoverability, add a short comment above it describing when controllers should toggle it (e.g., set True when the alerting PrometheusRules are applied and the CRD is present).
- ConditionAlertingAvailable = "AlertingAvailable"
+ // ConditionAlertingAvailable indicates whether alerting (PrometheusRule deployment) is available.
+ ConditionAlertingAvailable = "AlertingAvailable"

config/crd/bases/dscinitialization.opendatahub.io_dscinitializations.yaml (1)
247-250
: Tighten CEL: disallow null metrics when alerting is set

The current rule passes when metrics is present but explicitly null. To clearly require a non-null metrics stanza when alerting is configured, tighten the CEL check.
- - message: Alerting configuration requires metrics to be configured
-   rule: 'has(self.alerting) ? has(self.metrics) : true'
+ - message: Alerting configuration requires metrics to be configured
+   rule: 'has(self.alerting) ? (has(self.metrics) && self.metrics != null) : true'

Note: This still allows an empty object ({}), which you appear to rely on elsewhere, but prevents alerting with metrics: null or omitted.
tests/e2e/monitoring_test.go (1)
620-664
: Make PrometheusRule creation test resilient to preconditions

Looks good. To make it hermetic, consider verifying/setting .spec.monitoring.metrics beforehand (the CRD enforces metrics when alerting is set). In this suite, previous tests do set metrics, but an explicit precondition here would reduce coupling.
Example (optional) to prepend before enabling alerting:
+ // Ensure metrics stanza exists (required by CRD when alerting is set)
+ tc.EventuallyResourceCreatedOrUpdated(
+   WithMinimalObject(gvk.DSCInitialization, tc.DSCInitializationNamespacedName),
+   WithMutateFunc(testf.TransformPipeline(
+     testf.Transform(`.spec.monitoring.managementState = "%s"`, operatorv1.Managed),
+     setMonitoringMetrics(),
+   )),
+ )

internal/controller/components/feastoperator/monitoring/feastoperator-prometheusrules.tmpl.yaml (2)
22-35
: Remove trailing whitespace to satisfy linters/pipelines

Multiple lines end with trailing spaces. Clean them up to avoid YAML lint errors.
The fix is simply stripping the end-of-line whitespace on each flagged line.

Also applies to: 57-58, 63-64, 69-70, 75-76, 81-82, 87-88, 93-94
10-47
: Differentiate alert names to avoid deduplication collisions

All three alerts share the same name. Including the time windows improves triage and avoids alert dedup confusion.
- - alert: Feast Operator Probe Success Burn Rate
+ - alert: Feast Operator Probe Success Burn Rate (5m/1h)
...
- - alert: Feast Operator Probe Success Burn Rate
+ - alert: Feast Operator Probe Success Burn Rate (30m/6h)
...
- - alert: Feast Operator Probe Success Burn Rate
+ - alert: Feast Operator Probe Success Burn Rate (2h/1d)

internal/controller/components/kserve/monitoring/kserve-prometheusrules.tmpl.yaml (1)
8-9
: Fix YAML indentation under groups

Align list items to two spaces beneath groups: for consistency and linter compliance.
The change is indentation-only: de-indent the - name: SLOs-probe_success_kserve list item to two spaces under groups:.

internal/controller/services/monitoring/monitoring_controller_actions.go (1)
400-421
: Don't fail reconcile on per-component rule add/cleanup errors

You log individual component errors with the intent of not failing reconciliation, but then return a generic error that fails the reconcile anyway. This causes noisy requeues and can mask partial success.
Apply this diff to keep reconciliation non-fatal while preserving logs:
 // If we fail to add prometheus rules for a component.
 if len(addErrors) > 0 {
   // Log errors but don't fail the reconciliation
   for _, addErr := range addErrors {
     logf.FromContext(ctx).Error(addErr, "Failed to add prometheus rules for component")
   }
 }
 // If we fail to clean up prometheus rules for a component.
 if len(cleanupErrors) > 0 {
   // Log errors but don't fail the reconciliation
   for _, cleanupErr := range cleanupErrors {
     logf.FromContext(ctx).Error(cleanupErr, "Failed to cleanup prometheus rules for component")
   }
 }
-if len(addErrors) > 0 || len(cleanupErrors) > 0 {
-  return errors.New("errors occurred while adding or cleaning up prometheus rules for components")
-}
-
 return nil

If you prefer to surface degraded state, consider marking ConditionAlertingAvailable false (with a message) when any component rule operation fails, but still avoid throwing hard errors here.
internal/controller/components/codeflare/monitoring/codeflare-prometheusrules.tmpl.yaml (3)
8-10
: Fix YAML indentation under spec.groups

The list items under groups are over-indented; this trips linters/YAML parsers.
Apply this diff:
The change is indentation-only: under spec.groups, de-indent each group entry (SLOs-probe_success_codeflare with its rules:, SLOs - MCAD Controller, and SLOs - CodeFlare Operator) by two spaces.

Also applies to: 72-72, 110-110
20-21
: Template namespace label instead of hardcoding

Hardcoding the namespace breaks non-default installs. Align the label with metadata.namespace.
Apply this diff:
 labels:
   severity: info
-  namespace: redhat-ods-applications
+  namespace: {{.Namespace}}

(Repeat for each of the five rules in this file.)

Also applies to: 32-33, 44-45, 53-54, 62-63
68-68
: Strip trailing whitespace

Trailing spaces can fail linters and clutter diffs. Please remove the trailing spaces on this line and run a quick pass to strip EOL whitespace across the file.
internal/controller/components/llamastackoperator/monitoring/llamastackoperator-prometheusrules.tmpl.yaml (1)
8-10
: Fix indentation under groups and rules

Over-indentation under groups and a mis-indented rule item will cause YAMLlint warnings and can confuse reviewers.
Apply this diff:
The change is indentation-only: de-indent the group entry and its rules: key by two spaces under spec.groups, and align the - expr: | rule item accordingly.

Also applies to: 52-52
internal/controller/components/workbenches/monitoring/workbenches-prometheusrules.tmpl.yaml (1)
39-40
: Template namespace label for portability

Use the template namespace in labels.
Apply this diff:
 labels:
   severity: warning
-  namespace: redhat-ods-applications
+  namespace: {{.Namespace}}

(Apply to both warning rules.)

Also applies to: 48-49
internal/controller/components/dashboard/monitoring/dashboard-prometheusrules.tmpl.yaml (1)
60-107
: Make probe burn-rate alert names unique per window

Same rationale as above; add window suffixes.
Apply this diff:
- - alert: RHODS Dashboard Probe Success Burn Rate
+ - alert: RHODS Dashboard Probe Success Burn Rate (5m/1h)
@@
- - alert: RHODS Dashboard Probe Success Burn Rate
+ - alert: RHODS Dashboard Probe Success Burn Rate (30m/6h)
@@
- - alert: RHODS Dashboard Probe Success Burn Rate
+ - alert: RHODS Dashboard Probe Success Burn Rate (2h/1d)
@@
- - alert: RHODS Dashboard Probe Success Burn Rate
+ - alert: RHODS Dashboard Probe Success Burn Rate (6h/3d)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (34)
- api/services/v1alpha1/monitoring_types.go (3 hunks)
- api/services/v1alpha1/zz_generated.deepcopy.go (2 hunks)
- bundle/manifests/dscinitialization.opendatahub.io_dscinitializations.yaml (2 hunks)
- bundle/manifests/opendatahub-operator.clusterserviceversion.yaml (2 hunks)
- bundle/manifests/services.platform.opendatahub.io_monitorings.yaml (2 hunks)
- config/crd/bases/dscinitialization.opendatahub.io_dscinitializations.yaml (2 hunks)
- config/crd/bases/services.platform.opendatahub.io_monitorings.yaml (2 hunks)
- config/rbac/role.yaml (2 hunks)
- docs/api-overview.md (4 hunks)
- internal/controller/components/codeflare/monitoring/codeflare-prometheusrules.tmpl.yaml (1 hunks)
- internal/controller/components/components.go (1 hunks)
- internal/controller/components/dashboard/monitoring/dashboard-prometheusrules.tmpl.yaml (1 hunks)
- internal/controller/components/datasciencepipelines/monitoring/datasciencepipelines-prometheusrules.tmpl.yaml (1 hunks)
- internal/controller/components/feastoperator/monitoring/feastoperator-prometheusrules.tmpl.yaml (1 hunks)
- internal/controller/components/kserve/monitoring/kserve-prometheusrules.tmpl.yaml (1 hunks)
- internal/controller/components/kueue/monitoring/kueue-prometheusrules.tmpl.yaml (1 hunks)
- internal/controller/components/llamastackoperator/monitoring/llamastackoperator-prometheusrules.tmpl.yaml (1 hunks)
- internal/controller/components/modelcontroller/monitoring/modelcontroller-prometheusrules.tmpl.yaml (1 hunks)
- internal/controller/components/modelmeshserving/monitoring/modelmeshserving-prometheusrules.tmpl.yaml (1 hunks)
- internal/controller/components/modelregistry/monitoring/modelregistry-prometheusrules.tmpl.yaml (1 hunks)
- internal/controller/components/ray/monitoring/ray-prometheusrules.tmpl.yaml (1 hunks)
- internal/controller/components/trainingoperator/monitoring/trainingoperator-prometheusrules.tmpl.yaml (1 hunks)
- internal/controller/components/trustyai/monitoring/trustyai-prometheusrules.tmpl.yaml (1 hunks)
- internal/controller/components/workbenches/monitoring/workbenches-prometheusrules.tmpl.yaml (1 hunks)
- internal/controller/dscinitialization/dscinitialization_controller.go (1 hunks)
- internal/controller/dscinitialization/kubebuilder_rbac.go (1 hunks)
- internal/controller/services/monitoring/monitoring/operator-prometheusrules.tmpl.yaml (1 hunks)
- internal/controller/services/monitoring/monitoring_controller.go (2 hunks)
- internal/controller/services/monitoring/monitoring_controller_actions.go (3 hunks)
- internal/controller/services/monitoring/monitoring_controller_support.go (2 hunks)
- internal/controller/status/status.go (2 hunks)
- pkg/cluster/gvk/gvk.go (2 hunks)
- pkg/common/common.go (2 hunks)
- tests/e2e/monitoring_test.go (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (24)
- internal/controller/dscinitialization/dscinitialization_controller.go
- internal/controller/components/trainingoperator/monitoring/trainingoperator-prometheusrules.tmpl.yaml
- internal/controller/components/ray/monitoring/ray-prometheusrules.tmpl.yaml
- internal/controller/dscinitialization/kubebuilder_rbac.go
- internal/controller/services/monitoring/monitoring_controller.go
- internal/controller/components/components.go
- internal/controller/services/monitoring/monitoring_controller_support.go
- bundle/manifests/services.platform.opendatahub.io_monitorings.yaml
- api/services/v1alpha1/zz_generated.deepcopy.go
- pkg/common/common.go
- internal/controller/services/monitoring/monitoring/operator-prometheusrules.tmpl.yaml
- internal/controller/components/modelcontroller/monitoring/modelcontroller-prometheusrules.tmpl.yaml
- bundle/manifests/opendatahub-operator.clusterserviceversion.yaml
- pkg/cluster/gvk/gvk.go
- docs/api-overview.md
- config/crd/bases/services.platform.opendatahub.io_monitorings.yaml
- internal/controller/components/datasciencepipelines/monitoring/datasciencepipelines-prometheusrules.tmpl.yaml
- config/rbac/role.yaml
- bundle/manifests/dscinitialization.opendatahub.io_dscinitializations.yaml
- internal/controller/components/modelregistry/monitoring/modelregistry-prometheusrules.tmpl.yaml
- internal/controller/components/trustyai/monitoring/trustyai-prometheusrules.tmpl.yaml
- api/services/v1alpha1/monitoring_types.go
- internal/controller/components/kueue/monitoring/kueue-prometheusrules.tmpl.yaml
- internal/controller/components/modelmeshserving/monitoring/modelmeshserving-prometheusrules.tmpl.yaml
🧰 Additional context used
🧬 Code Graph Analysis (2)
internal/controller/services/monitoring/monitoring_controller_actions.go (6)
- pkg/controller/types/types.go (2): ReconciliationRequest (75-112), TemplateInfo (67-73)
- api/services/v1alpha1/monitoring_types.go (2): Monitoring (133-139), Alerting (121-122)
- pkg/cluster/gvk/gvk.go (2): Monitoring (288-292), PrometheusRule (480-484)
- pkg/controller/conditions/conditions.go (2): WithReason (19-23), WithMessage (25-34)
- pkg/cluster/resources.go (2): HasCRD (310-312), GetDSC (94-114)
- internal/controller/components/registry/registry.go (2): ForEach (71-73), ComponentHandler (17-29)

tests/e2e/monitoring_test.go (5)
- tests/e2e/resource_options_test.go (3): WithMutateFunc (194-198), WithMinimalObject (140-155), WithCondition (203-207)
- pkg/utils/test/testf/testf_support.go (2): Transform (78-106), TransformPipeline (50-61)
- pkg/cluster/gvk/gvk.go (5): DSCInitialization (66-70), DataScienceCluster (60-64), Namespace (24-28), Dashboard (210-214), PrometheusRule (480-484)
- pkg/utils/test/matchers/jq/jq_matcher.go (1): Match (11-15)
- internal/controller/status/status.go (2): ConditionTypeReady (90-90), ConditionTypeProvisioningSucceeded (95-95)
🪛 YAMLlint (1.37.1)
internal/controller/components/workbenches/monitoring/workbenches-prometheusrules.tmpl.yaml
[warning] 8-8: wrong indentation: expected 4 but found 6
(indentation)
[warning] 10-10: wrong indentation: expected 10 but found 8
(indentation)
[warning] 31-31: wrong indentation: expected 10 but found 8
(indentation)
[error] 50-50: trailing spaces
(trailing-spaces)
[warning] 53-53: wrong indentation: expected 10 but found 8
(indentation)
[warning] 105-105: wrong indentation: expected 10 but found 8
(indentation)
[warning] 143-143: wrong indentation: expected 10 but found 8
(indentation)
[error] 171-171: trailing spaces
(trailing-spaces)
[warning] 174-174: wrong indentation: expected 10 but found 8
(indentation)
internal/controller/components/codeflare/monitoring/codeflare-prometheusrules.tmpl.yaml
[warning] 8-8: wrong indentation: expected 4 but found 6
(indentation)
[warning] 10-10: wrong indentation: expected 10 but found 8
(indentation)
[error] 68-68: trailing spaces
(trailing-spaces)
[warning] 72-72: wrong indentation: expected 10 but found 8
(indentation)
[warning] 110-110: wrong indentation: expected 10 but found 8
(indentation)
internal/controller/components/feastoperator/monitoring/feastoperator-prometheusrules.tmpl.yaml
[warning] 8-8: wrong indentation: expected 4 but found 6
(indentation)
[error] 22-22: trailing spaces
(trailing-spaces)
[error] 35-35: trailing spaces
(trailing-spaces)
[warning] 52-52: wrong indentation: expected 10 but found 8
(indentation)
[error] 57-57: trailing spaces
(trailing-spaces)
[error] 63-63: trailing spaces
(trailing-spaces)
[error] 69-69: trailing spaces
(trailing-spaces)
[error] 75-75: trailing spaces
(trailing-spaces)
[error] 81-81: trailing spaces
(trailing-spaces)
[error] 87-87: trailing spaces
(trailing-spaces)
[error] 93-93: trailing spaces
(trailing-spaces)
internal/controller/components/llamastackoperator/monitoring/llamastackoperator-prometheusrules.tmpl.yaml
[warning] 8-8: wrong indentation: expected 4 but found 6
(indentation)
[error] 22-22: trailing spaces
(trailing-spaces)
[error] 35-35: trailing spaces
(trailing-spaces)
[error] 48-48: trailing spaces
(trailing-spaces)
[error] 49-49: trailing spaces
(trailing-spaces)
[warning] 52-52: wrong indentation: expected 10 but found 8
(indentation)
[error] 57-57: trailing spaces
(trailing-spaces)
[error] 63-63: trailing spaces
(trailing-spaces)
[error] 69-69: trailing spaces
(trailing-spaces)
[error] 75-75: trailing spaces
(trailing-spaces)
[error] 81-81: trailing spaces
(trailing-spaces)
[error] 87-87: trailing spaces
(trailing-spaces)
[error] 93-93: trailing spaces
(trailing-spaces)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Build/push catalog image
- GitHub Check: Run tests and collect coverage on internal and pkg
- GitHub Check: golangci-lint
- GitHub Check: Run tests and collect coverage on tests/integration
🔇 Additional comments (6)
internal/controller/status/status.go (1)
189-195
: LGTM: Alerting not-configured reason/messages added

AlertingNotConfiguredReason/Message mirror the metrics/traces pattern and will help surface user feedback when alerting is disabled or missing in the DSCI.
config/crd/bases/dscinitialization.opendatahub.io_dscinitializations.yaml (1)
93-96
: Alerting field added under spec.monitoring: OK

Adding an optional object for alerting keeps it future-proof for additional knobs. No further schema constraints are necessary right now.
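For illustration, a minimal DSCI stanza that exercises the new field (a sketch only; the monitoring namespace value is illustrative, apiVersion shown for the v1 API, and metrics is the empty object the CEL rule permits):

```yaml
apiVersion: dscinitialization.opendatahub.io/v1
kind: DSCInitialization
metadata:
  name: default-dsci
spec:
  monitoring:
    managementState: Managed
    namespace: redhat-ods-monitoring  # illustrative
    metrics: {}   # must be present when alerting is set
    alerting: {}  # enables PrometheusRule deployment
```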
tests/e2e/monitoring_test.go (2)
51-53
: Good coverage: creation/deletion tests for PrometheusRules

Including both creation and deletion flows early in the suite is helpful to validate the end-to-end alerting wiring.
208-209
: Using whole-monitoring mutate is fine; assert behavior elsewhere remains stable

Switching to mutate the entire .spec.monitoring map is OK. Be mindful that this also removes any alerting/traces stanzas previously set. Your subsequent tests reconfigure traces later; that's fine for serial execution but makes the tests order-dependent.
Do you want this test to be order-independent? If yes, I can propose localized mutations (only .spec.monitoring.metrics) and add preconditions in the PrometheusRule tests to self-contain setup.
internal/controller/services/monitoring/monitoring_controller_actions.go (2)
54-55
: LGTM: embedding operator rule templates

Stacking two //go:embed directives to include both resources/ and monitoring/ in a single embed.FS is valid and keeps packaging simple.
341-349
: LGTM: gating alerting on the PrometheusRule CRD and adding operator rules

Properly checks CRD availability and emits operator-level rules regardless of component readiness. Good guardrails for partial installations.
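For reference, a sketch of how the degraded state could surface on the Monitoring CR when the CRD is absent (the reason and message strings here are assumptions modeled on the status constants referenced above, not verified values):

```yaml
status:
  conditions:
  - type: AlertingAvailable
    status: "False"
    reason: PrometheusRuleCRDNotFound   # assumed reason string
    message: Alerting is not configured  # assumed message
```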
Actionable comments posted: 6
♻️ Duplicate comments (5)
internal/controller/components/modelcontroller/monitoring/modelcontroller-prometheusrules.tmpl.yaml (1)
10-45
: Make alert names unique per burn-rate window to avoid dedupe/collisions

All three alerts share the same name. This hampers triage and can lead to deduplication issues in Alertmanager. Suffix them with the evaluated windows.
- - alert: ODH Model Controller Probe Success Burn Rate
+ - alert: ODH Model Controller Probe Success Burn Rate 5m_1h
...
- - alert: ODH Model Controller Probe Success Burn Rate
+ - alert: ODH Model Controller Probe Success Burn Rate 30m_6h
...
- - alert: ODH Model Controller Probe Success Burn Rate
+ - alert: ODH Model Controller Probe Success Burn Rate 2h_1d

internal/controller/components/workbenches/monitoring/workbenches-prometheusrules.tmpl.yaml (4)
14-14
: Regex inconsistency between numerator and denominator; align to avoid mismatched series

The used selector allows any prefix, while the capacity selector requires the prefix "jupyterhub-nb-". Use identical patterns.
- expr: kubelet_volume_stats_used_bytes{persistentvolumeclaim=~".*jupyterhub-nb-.*"} / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"jupyterhub-nb-.*"} > 0.9 and kubelet_volume_stats_used_bytes{persistentvolumeclaim=~".*jupyterhub-nb-.*"} / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"jupyterhub-nb-.*"} < 0.99
+ expr: |
+   kubelet_volume_stats_used_bytes{persistentvolumeclaim=~"jupyterhub-nb-.*"}
+   /
+   kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"jupyterhub-nb-.*"} > 0.9
+   and
+   kubelet_volume_stats_used_bytes{persistentvolumeclaim=~"jupyterhub-nb-.*"}
+   /
+   kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"jupyterhub-nb-.*"} < 0.99
53-101
: Duplicate alert names in one group; make them unique

All four alerts share the same name; this will cause dedupe issues and can break loading in some setups. Suffix the names with the window pairs.
- - alert: RHODS Jupyter Probe Success Burn Rate
+ - alert: RHODS Jupyter Probe Success Burn Rate 5m_1h
...
- - alert: RHODS Jupyter Probe Success Burn Rate
+ - alert: RHODS Jupyter Probe Success Burn Rate 30m_6h
...
- - alert: RHODS Jupyter Probe Success Burn Rate
+ - alert: RHODS Jupyter Probe Success Burn Rate 2h_1d
...
- - alert: RHODS Jupyter Probe Success Burn Rate
+ - alert: RHODS Jupyter Probe Success Burn Rate 6h_3d
174-177
: PromQL type error: label_replace applied to a scalar; use min by () to keep a vector

label_replace() expects a vector. min(...) without by () collapses to a scalar and will fail evaluation. Fix the expression.
- - expr: ((min(probe_success{name=~"rhods-dashboard|notebook-spawner"}) by (name) or on() vector(0)) or label_replace(min(probe_success{name=~"rhods-dashboard|notebook-spawner"}), "name", "combined", "name", ".*"))
+ - expr: |
+     min(probe_success{name=~"rhods-dashboard|notebook-spawner"}) by (name)
+     or
+     label_replace(
+       min by () (probe_success{name=~"rhods-dashboard|notebook-spawner"}),
+       "name", "combined", "", ""
+     )
35-45
: PromQL strings must be double-quoted

Single-quoted label values are invalid. Fix both absent() expressions.
- expr: absent(up{job=~'Kubeflow Notebook Controller Service Metrics'})
+ expr: absent(up{job=~"Kubeflow Notebook Controller Service Metrics"})
@@
- expr: absent(up{job=~'ODH Notebook Controller Service Metrics'})
+ expr: absent(up{job=~"ODH Notebook Controller Service Metrics"})
🧹 Nitpick comments (3)
internal/controller/components/modelregistry/monitoring/modelregistry-prometheusrules.tmpl.yaml (1)
8-9
: Group name copy/paste: use a Model Registry-specific group name

The group is named "model_controller" in the Model Registry template. Rename it for clarity.
- - name: SLOs-probe_success_model_controller
+ - name: SLOs-probe_success_model_registry

internal/controller/components/workbenches/monitoring/workbenches-prometheusrules.tmpl.yaml (1)
23-23
: Consistent PVC selector regex

Mirror the same prefix regex as above for consistency.
- expr: kubelet_volume_stats_used_bytes{persistentvolumeclaim=~".*jupyterhub-nb-.*"}/kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"jupyterhub-nb-.*"} > 0.99
+ expr: kubelet_volume_stats_used_bytes{persistentvolumeclaim=~"jupyterhub-nb-.*"} / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"jupyterhub-nb-.*"} > 0.99

internal/controller/components/codeflare/monitoring/codeflare-prometheusrules.tmpl.yaml (1)
8-8
: Fix YAML indentation/trailing space per the linter

Indentation is off under groups/rules, and there is a trailing space at line 68. Align to standard 2-space, consistent nesting to avoid parse issues.
The changes are indentation-only (de-indent the group and rule entries by two spaces under groups:) plus removal of the trailing space.

Also applies to: 10-10, 72-72, 110-110, 68-68
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (8)
- internal/controller/components/codeflare/monitoring/codeflare-prometheusrules.tmpl.yaml (1 hunks)
- internal/controller/components/dashboard/monitoring/dashboard-prometheusrules.tmpl.yaml (1 hunks)
- internal/controller/components/datasciencepipelines/monitoring/datasciencepipelines-prometheusrules.tmpl.yaml (1 hunks)
- internal/controller/components/modelcontroller/monitoring/modelcontroller-prometheusrules.tmpl.yaml (1 hunks)
- internal/controller/components/modelmeshserving/monitoring/modelmeshserving-prometheusrules.tmpl.yaml (1 hunks)
- internal/controller/components/modelregistry/monitoring/modelregistry-prometheusrules.tmpl.yaml (1 hunks)
- internal/controller/components/workbenches/monitoring/workbenches-prometheusrules.tmpl.yaml (1 hunks)
- internal/controller/services/monitoring/monitoring_controller_support.go (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
- internal/controller/services/monitoring/monitoring_controller_support.go
- internal/controller/components/datasciencepipelines/monitoring/datasciencepipelines-prometheusrules.tmpl.yaml
- internal/controller/components/modelmeshserving/monitoring/modelmeshserving-prometheusrules.tmpl.yaml
- internal/controller/components/dashboard/monitoring/dashboard-prometheusrules.tmpl.yaml
🧰 Additional context used
🪛 YAMLlint (1.37.1)
internal/controller/components/codeflare/monitoring/codeflare-prometheusrules.tmpl.yaml
[warning] 8-8: wrong indentation: expected 4 but found 6
(indentation)
[warning] 10-10: wrong indentation: expected 10 but found 8
(indentation)
[error] 68-68: trailing spaces
(trailing-spaces)
[warning] 72-72: wrong indentation: expected 10 but found 8
(indentation)
[warning] 110-110: wrong indentation: expected 10 but found 8
(indentation)
internal/controller/components/workbenches/monitoring/workbenches-prometheusrules.tmpl.yaml
[warning] 8-8: wrong indentation: expected 4 but found 6
(indentation)
[warning] 10-10: wrong indentation: expected 10 but found 8
(indentation)
[warning] 31-31: wrong indentation: expected 10 but found 8
(indentation)
[error] 50-50: trailing spaces
(trailing-spaces)
[warning] 53-53: wrong indentation: expected 10 but found 8
(indentation)
[warning] 105-105: wrong indentation: expected 10 but found 8
(indentation)
[warning] 143-143: wrong indentation: expected 10 but found 8
(indentation)
[error] 171-171: trailing spaces
(trailing-spaces)
[warning] 174-174: wrong indentation: expected 10 but found 8
(indentation)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Build/push catalog image
- GitHub Check: Run tests and collect coverage on tests/integration
- GitHub Check: Run tests and collect coverage on internal and pkg
- GitHub Check: golangci-lint
/approved /lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: CFSNM, zdtsw

The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Merged commit 18aa86d into opendatahub-io:main
/cherrypick rhoai
@StevenTobin: #2261 failed to apply on top of branch "rhoai":
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
* add built in alerting configuration for components; run make bundle, fix_rules_file_name, fix_error_in_dsp_monitoring, fix e2e test, run make lint
* reference application namespace in alerts that include namespace
Description
RHOAIENG-25366
How Has This Been Tested?
Screenshot or short clip
Merge criteria
Summary by CodeRabbit
New Features
Documentation
Tests
Chores