@StevenTobin StevenTobin commented Aug 6, 2025

Description

RHOAIENG-25366

  • Add alerting struct in DSCI to enable alerting.
  • Add set of alerts for each component.
  • Ensure alerts are only added for enabled and ready components.

How Has This Been Tested?

  • Install Operator
  • Create DSCI
  • Enable metrics and alerting
  • Verify the operator prometheusrules are deployed
  • Set a component to managed
  • Verify the prometheusrules for that component are deployed
  • Set the component to removed
  • Verify the prometheusrules for that component are cleaned up

Screenshot or short clip

Merge criteria

  • You have read the contributors guide.
  • Commit messages are meaningful - have a clear and concise summary and detailed explanation of what was changed and why.
  • Pull Request contains a description of the solution, a link to the JIRA issue, and to any dependent or related Pull Request.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that they work.

Summary by CodeRabbit

  • New Features

    • Optional Alerting configuration added to Monitoring CRs with validation requiring Metrics when Alerting is set; operator surfaces Alerting availability status.
    • Operator now deploys and manages Prometheus alerting rules per enabled component plus operator-level rules; many component rule sets and templates added.
  • Documentation

    • API docs updated to document the new Alerting schema and fields.
  • Tests

    • E2E tests for PrometheusRule creation and deletion added.
  • Chores

    • RBAC/CSV permissions expanded to include PrometheusRule.

coderabbitai bot commented Aug 6, 2025

Walkthrough

Adds an Alerting API type and monitoring.alerting field with cross-field validation; extends CRDs, RBAC, CSV for PrometheusRule; embeds per-component PrometheusRule templates; implements deployAlerting plus add/cleanup helpers in the monitoring reconciler; propagates alerting from DSCInitialization; adds e2e tests and deepcopy/utility updates.

Changes

  • API types
    Files: api/services/v1alpha1/monitoring_types.go, api/services/v1alpha1/zz_generated.deepcopy.go
    Summary: Add Alerting type and Alerting *Alerting on MonitoringCommonSpec with XValidation; generate deepcopy methods and include Alerting in deep copy.
  • CRD schemas (config & bundle)
    Files: config/crd/bases/...dscinitialization*.yaml, config/crd/bases/...monitorings.yaml, bundle/manifests/...dscinitialization*.yaml, bundle/manifests/...monitorings.yaml
    Summary: Add monitoring.alerting object to CRD schemas and an x-kubernetes-validations rule has(self.alerting) ? has(self.metrics) : true, with a message requiring metrics when alerting is set.
  • RBAC / kubebuilder annotations / CSV
    Files: config/rbac/role.yaml, internal/controller/dscinitialization/kubebuilder_rbac.go, bundle/manifests/opendatahub-operator.clusterserviceversion.yaml
    Summary: Grant/promote the prometheusrules resource (resources, finalizers, status) in the ClusterRole, add kubebuilder RBAC for prometheusrules, and include prometheusrules in CSV permissions.
  • Monitoring reconciler (logic + support)
    Files: internal/controller/services/monitoring/monitoring_controller.go, .../monitoring_controller_actions.go, .../monitoring_controller_support.go, internal/controller/status/status.go
    Summary: Add deployAlerting action, condition ConditionAlertingAvailable and constants, check for the PrometheusRule CRD, embed monitoring templates, add addPrometheusRules and cleanupPrometheusRules, and aggregate/log per-component errors.
  • DSCI propagation
    Files: internal/controller/dscinitialization/dscinitialization_controller.go
    Summary: Propagate dsci.Spec.Monitoring.Alerting into created Monitoring CRs.
  • Embedded component rules FS
    Files: internal/controller/components/components.go, pkg/common/common.go
    Summary: Add embed usage and the ComponentRulesFS variable; add a FileExists(embed.FS, string) bool helper.
  • Component & operator PrometheusRule templates
    Files: internal/controller/components/*/monitoring/*-prometheusrules.tmpl.yaml, internal/controller/services/monitoring/monitoring/operator-prometheusrules.tmpl.yaml
    Summary: Add many per-component and operator-level PrometheusRule templates (alert and recording rules for SLOs, availability, operator presence, burn-rate recordings).
  • Controller RBAC/ownership & GVK
    Files: pkg/cluster/gvk/gvk.go, internal/controller/dscinitialization/kubebuilder_rbac.go
    Summary: Add exported PrometheusRule GVK and kubebuilder RBAC annotations for prometheusrules.
  • E2E tests
    Files: tests/e2e/monitoring_test.go
    Summary: Add tests validating PrometheusRule creation and deletion when alerting is enabled and components are managed/removed.

Sequence Diagram(s)

sequenceDiagram
  participant User
  participant DSCI as DSCInitialization
  participant Mon as Monitoring CR
  participant Recon as Monitoring Reconciler
  participant API as K8s API
  participant PR as PrometheusRule CRD

  User->>DSCI: set spec.monitoring.alerting = {}
  DSCI->>Mon: create Monitoring CR (alerting propagated)
  Recon->>API: Reconcile(Monitoring)
  Recon->>PR: Check PrometheusRule CRD exists
  alt alerting not configured or CRD missing
    Recon->>API: Set ConditionAlertingAvailable=False (NotConfigured/CRDNotFound)
  else CRD present and alerting enabled
    Recon->>API: Set ConditionAlertingAvailable=True
    Recon->>API: Apply operator-prometheusrules
    loop for each component
      alt component enabled and ready
        Recon->>API: Apply <component>-prometheusrules (addPrometheusRules)
      else component disabled
        Recon->>API: Delete <component>-prometheusrules (cleanupPrometheusRules)
      end
    end
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

I thump my paw—alerts awake!
Templates nestled in embedded hay,
Metrics must come when sirens play.
I hop through rules and tidy each name,
Prometheus hums — the rabbit claims the blame. 🐇📢

@StevenTobin StevenTobin requested a review from kahowell August 6, 2025 14:39

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 8

🔭 Outside diff range comments (3)
internal/controller/components/llamastackoperator/monitoring/llamastackoperator-prometheusrules.tmpl.yaml (1)

1-99: Fix YAML formatting issues.

The PrometheusRule logic is sound with appropriate SLO-based alerting and recording rules. However, there are several YAML formatting issues that need to be addressed:

Apply this diff to fix the formatting issues:

   groups:
-      - name: SLOs-probe_success_llama_stack_k8s_operator
+    - name: SLOs-probe_success_llama_stack_k8s_operator
       rules:
         - alert: Llama Stack K8s Operator Probe Success Burn Rate
           # ... alert content ...
           labels:
             severity: warning
             instance: llama-stack-k8s-operator-controller-manager
-    
+
         - alert: Llama Stack K8s Operator Probe Success Burn Rate
           # ... alert content ...
           labels:
             severity: warning
             instance: llama-stack-k8s-operator-controller-manager
-    
+
         - alert: Llama Stack K8s Operator Probe Success Burn Rate
           # ... alert content ...
           labels:
             severity: warning
             instance: llama-stack-k8s-operator-controller-manager
-      
+
       # RecordingRules for Llama Stack
-      - name: SLOs - Llama Stack K8s Operator
+    - name: SLOs - Llama Stack K8s Operator
       rules:
-        - expr: |
+      - expr: |
           absent(up{job="llama-stack-k8s-operator"}) * 0 or vector(1)
         labels:
           instance: llama-stack-k8s-operator-controller-manager
         record: probe_success
-    
-        - expr: |
+
+      - expr: |
           1 - min(avg_over_time(probe_success{instance="llama-stack-k8s-operator-controller-manager"}[1d]))
         labels:
           instance: llama-stack-k8s-operator-controller-manager
         record: probe_success:burnrate1d
-    
+
       # Apply similar fixes to remove trailing spaces from other recording rules
internal/controller/components/modelmeshserving/monitoring/modelmeshservice-prometheusrules.tmpl.yaml (1)

1-90: Fix YAML formatting and indentation issues.

The PrometheusRule content is well-structured with appropriate alerting thresholds and recording rules. However, there are formatting issues that need to be addressed:

Please fix the following YAML formatting issues:

  1. Line 8: Change indentation from 6 to 4 spaces for consistency
  2. Line 52: Change indentation from 8 to 10 spaces for the recording rules
  3. Remove trailing spaces from lines: 22, 35, 57, 63, 69, 75, 81, 87, 93

The alerting logic using different severity levels (critical for shorter time windows, warning for longer ones) is appropriate for SLO monitoring.

internal/controller/components/feastoperator/monitoring/feastoperator-prometheusrules.tmpl.yaml (1)

1-99: Fix YAML formatting issues.

The PrometheusRule configuration is correct with appropriate SLO-based alerting. However, there are YAML formatting issues that need to be addressed:

Please fix the following formatting issues:

  1. Line 8: Change indentation from 6 to 4 spaces
  2. Line 52: Change indentation from 8 to 10 spaces
  3. Remove trailing spaces from lines: 22, 35, 57, 63, 69, 75, 81, 87, 93

The alerting configuration with critical severity for short-term burn rates and warning for long-term ones follows best practices for SLO alerting.

🧹 Nitpick comments (7)
internal/controller/components/components.go (1)

7-20: Optional: collapse repetitive //go:embed directives for readability

Go 1.17+ allows a single //go:embed directive to accept multiple space-separated patterns. Collapsing 14 directives into one reduces visual noise without changing behaviour.

-//go:embed kueue/monitoring
-//go:embed trainingoperator/monitoring
-//go:embed trustyai/monitoring
-//go:embed workbenches/monitoring
-//go:embed codeflare/monitoring
-//go:embed dashboard/monitoring
-//go:embed datasciencepipelines/monitoring
-//go:embed feastoperator/monitoring
-//go:embed kserve/monitoring
-//go:embed llamastackoperator/monitoring
-//go:embed modelcontroller/monitoring
-//go:embed modelmeshserving/monitoring
-//go:embed modelregistry/monitoring
-//go:embed ray/monitoring
+//go:embed kueue/monitoring trainingoperator/monitoring trustyai/monitoring workbenches/monitoring codeflare/monitoring dashboard/monitoring datasciencepipelines/monitoring feastoperator/monitoring kserve/monitoring llamastackoperator/monitoring modelcontroller/monitoring modelmeshserving/monitoring modelregistry/monitoring ray/monitoring
docs/api-overview.md (1)

2863-2868: Document the “metrics-required” validation for alerting

The CRD sets a CEL rule that alerting can only be enabled when metrics is configured, yet this important behavioural contract is not mentioned here.
Add a short note such as:

Alerting requires .metrics to be non-null; the operator will reject the CR if metrics are disabled.

That prevents user confusion and failed reconciliations.

internal/controller/components/dashboard/monitoring/dashboard-prometheusrules.tmpl.yaml (1)

58-107: Maintain consistency in label usage across components.

This file groups by name label while kserve groups by instance. Consider standardizing the label usage across all component PrometheusRules for consistency.

The same alert-naming and namespace-flexibility issues mentioned above also apply here.

internal/controller/services/monitoring/monitoring_controller_actions.go (1)

415-417: Enhance the final error message with more context.

The generic error message doesn't indicate the scope of the issues. Consider including counts or component names.

 	if len(addErrors) > 0 || len(cleanupErrors) > 0 {
-		return fmt.Errorf("errors occurred while adding or cleaning up prometheus rules for components")
+		return fmt.Errorf("prometheus rules management failed: %d add errors, %d cleanup errors", len(addErrors), len(cleanupErrors))
 	}
internal/controller/components/workbenches/monitoring/workbenches-prometheusrules.tmpl.yaml (3)

8-178: YAML-lint flags multiple indentation / trailing-space issues

Several lines (e.g. Lines 8, 10, 31, 53, 105, 143, 174) have 6- or 8-space indentation where 4 is expected, and Lines 50 & 171 contain trailing spaces.
While Kubernetes is lenient, maintaining a consistent two-space increment prevents future merge conflicts and avoids tooling problems (helm-lint, kube-lint, etc.).

No functional change, but consider running yamllint -d relaxed or prettier --write on the template before committing.


19-27: severity level for 100 % PVC usage is only warning

At 100 % utilisation the PVC becomes read-only and notebooks will fail. A critical severity (or at least error) is normally expected so on-call engineers are paged.

-            severity: warning
+            severity: critical

(Adjust to your org’s routing policy.)


172-178: rhods_aggregate_availability expression is hard to read

The nested or/label_replace construction is valid but quite opaque. Consider extracting intermediate recording rules (e.g. rhods_dashboard_avail, notebook_spawner_avail) and then taking the min() of those. This keeps alert rules readable and easier to maintain.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7872157 and 23cb177.

📒 Files selected for processing (31)
  • api/services/v1alpha1/monitoring_types.go (3 hunks)
  • api/services/v1alpha1/zz_generated.deepcopy.go (2 hunks)
  • config/crd/bases/dscinitialization.opendatahub.io_dscinitializations.yaml (2 hunks)
  • config/crd/bases/services.platform.opendatahub.io_monitorings.yaml (2 hunks)
  • config/rbac/role.yaml (2 hunks)
  • docs/api-overview.md (4 hunks)
  • internal/controller/components/codeflare/monitoring/codeflare-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/components.go (1 hunks)
  • internal/controller/components/dashboard/monitoring/dashboard-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/datasciencepipelines/monitoring/datasciencepipelines-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/feastoperator/monitoring/feastoperator-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/kserve/monitoring/kserve-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/kueue/monitoring/kueue-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/llamastackoperator/monitoring/llamastackoperator-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/modelcontroller/monitoring/modelcontroller-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/modelmeshserving/monitoring/modelmeshservice-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/modelregistry/monitoring/modelregistry-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/ray/monitoring/ray-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/trainingoperator/monitoring/trainingoperator-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/trustyai/monitoring/trustyai-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/workbenches/monitoring/workbenches-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/dscinitialization/dscinitialization_controller.go (1 hunks)
  • internal/controller/dscinitialization/kubebuilder_rbac.go (1 hunks)
  • internal/controller/services/monitoring/monitoring/operator-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/services/monitoring/monitoring_controller.go (2 hunks)
  • internal/controller/services/monitoring/monitoring_controller_actions.go (3 hunks)
  • internal/controller/services/monitoring/monitoring_controller_support.go (2 hunks)
  • internal/controller/status/status.go (2 hunks)
  • pkg/cluster/gvk/gvk.go (1 hunks)
  • pkg/common/common.go (2 hunks)
  • tests/e2e/monitoring_test.go (2 hunks)
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: in the opendatahub-operator repository, crd files under config/crd.rhoai/bases/ are generated files ...
Learnt from: kahowell
PR: opendatahub-io/opendatahub-operator#2220
File: config/crd.rhoai/bases/dscinitialization.opendatahub.io_dscinitializations.yaml:90-107
Timestamp: 2025-07-29T18:45:24.518Z
Learning: In the opendatahub-operator repository, CRD files under config/crd.rhoai/bases/ are generated files (using controller-gen from Go structs), and inconsistencies between enum and pattern validation should not be flagged as they require upstream source code changes rather than direct YAML modifications.

Applied to files:

  • internal/controller/services/monitoring/monitoring/operator-prometheusrules.tmpl.yaml
  • config/crd/bases/services.platform.opendatahub.io_monitorings.yaml
  • config/crd/bases/dscinitialization.opendatahub.io_dscinitializations.yaml
  • internal/controller/services/monitoring/monitoring_controller_support.go
📚 Learning: in the opendatahub-operator repository, when featuretrackers are being removed or deprecated, the fe...
Learnt from: mlassak
PR: opendatahub-io/opendatahub-operator#2010
File: config/crd/kustomization.yaml:22-22
Timestamp: 2025-07-22T10:32:09.737Z
Learning: In the opendatahub-operator repository, when FeatureTrackers are being removed or deprecated, the FeatureTracker CRD reference in config/crd/kustomization.yaml should be kept for backward compatibility during the migration period, even if some components no longer use FeatureTrackers.

Applied to files:

  • internal/controller/services/monitoring/monitoring/operator-prometheusrules.tmpl.yaml
  • config/crd/bases/services.platform.opendatahub.io_monitorings.yaml
  • internal/controller/components/trainingoperator/monitoring/trainingoperator-prometheusrules.tmpl.yaml
📚 Learning: in rhoai patch files, there are two naming patterns for dscinitialization crd patches: files with `....
Learnt from: kahowell
PR: opendatahub-io/opendatahub-operator#2220
File: config/crd.rhoai/patches/cainjection_in_dscinitialization.opendatahub.io_dscinitializations.yaml:5-7
Timestamp: 2025-07-29T18:43:10.232Z
Learning: In RHOAI patch files, there are two naming patterns for DSCInitialization CRD patches: files with `.opendatahub.io` in their filename use the duplicated pattern `dscinitializations.dscinitialization.opendatahub.io.opendatahub.io` in the CRD name, while files without the group in filename use the standard pattern `dscinitializations.dscinitialization.opendatahub.io`. This duplication appears to be intentional and exists in both regular and RHOAI branches.

Applied to files:

  • config/crd/bases/dscinitialization.opendatahub.io_dscinitializations.yaml
🧬 Code Graph Analysis (4)
internal/controller/services/monitoring/monitoring_controller.go (2)
pkg/cluster/gvk/gvk.go (1)
  • PrometheusRule (468-472)
pkg/controller/reconciler/reconciler_support.go (2)
  • Dynamic (69-74)
  • CrdExists (77-85)
api/services/v1alpha1/zz_generated.deepcopy.go (1)
api/services/v1alpha1/monitoring_types.go (1)
  • Alerting (117-118)
internal/controller/dscinitialization/dscinitialization_controller.go (2)
api/services/v1alpha1/monitoring_types.go (2)
  • Alerting (117-118)
  • Monitoring (129-135)
pkg/cluster/gvk/gvk.go (1)
  • Monitoring (288-292)
tests/e2e/monitoring_test.go (5)
tests/e2e/resource_options_test.go (3)
  • WithMinimalObject (140-155)
  • WithMutateFunc (194-198)
  • WithCondition (203-207)
pkg/cluster/gvk/gvk.go (5)
  • DSCInitialization (66-70)
  • DataScienceCluster (60-64)
  • Namespace (24-28)
  • Dashboard (210-214)
  • PrometheusRule (468-472)
pkg/utils/test/testf/testf_support.go (2)
  • TransformPipeline (50-61)
  • Transform (78-106)
pkg/utils/test/matchers/jq/jq_matcher.go (1)
  • Match (11-15)
internal/controller/status/status.go (2)
  • ConditionTypeReady (90-90)
  • ConditionTypeProvisioningSucceeded (95-95)
🪛 YAMLlint (1.37.1)
internal/controller/components/llamastackoperator/monitoring/llamastackoperator-prometheusrules.tmpl.yaml

[warning] 8-8: wrong indentation: expected 4 but found 6

(indentation)


[error] 22-22: trailing spaces

(trailing-spaces)


[error] 35-35: trailing spaces

(trailing-spaces)


[error] 48-48: trailing spaces

(trailing-spaces)


[error] 49-49: trailing spaces

(trailing-spaces)


[warning] 52-52: wrong indentation: expected 10 but found 8

(indentation)


[error] 57-57: trailing spaces

(trailing-spaces)


[error] 63-63: trailing spaces

(trailing-spaces)


[error] 69-69: trailing spaces

(trailing-spaces)


[error] 75-75: trailing spaces

(trailing-spaces)


[error] 81-81: trailing spaces

(trailing-spaces)


[error] 87-87: trailing spaces

(trailing-spaces)


[error] 93-93: trailing spaces

(trailing-spaces)

internal/controller/components/workbenches/monitoring/workbenches-prometheusrules.tmpl.yaml

[warning] 8-8: wrong indentation: expected 4 but found 6

(indentation)


[warning] 10-10: wrong indentation: expected 10 but found 8

(indentation)


[warning] 31-31: wrong indentation: expected 10 but found 8

(indentation)


[error] 50-50: trailing spaces

(trailing-spaces)


[warning] 53-53: wrong indentation: expected 10 but found 8

(indentation)


[warning] 105-105: wrong indentation: expected 10 but found 8

(indentation)


[warning] 143-143: wrong indentation: expected 10 but found 8

(indentation)


[error] 171-171: trailing spaces

(trailing-spaces)


[warning] 174-174: wrong indentation: expected 10 but found 8

(indentation)

internal/controller/components/codeflare/monitoring/codeflare-prometheusrules.tmpl.yaml

[warning] 8-8: wrong indentation: expected 4 but found 6

(indentation)


[warning] 10-10: wrong indentation: expected 10 but found 8

(indentation)


[error] 68-68: trailing spaces

(trailing-spaces)


[warning] 72-72: wrong indentation: expected 10 but found 8

(indentation)


[warning] 110-110: wrong indentation: expected 10 but found 8

(indentation)

internal/controller/components/feastoperator/monitoring/feastoperator-prometheusrules.tmpl.yaml

[warning] 8-8: wrong indentation: expected 4 but found 6

(indentation)


[error] 22-22: trailing spaces

(trailing-spaces)


[error] 35-35: trailing spaces

(trailing-spaces)


[warning] 52-52: wrong indentation: expected 10 but found 8

(indentation)


[error] 57-57: trailing spaces

(trailing-spaces)


[error] 63-63: trailing spaces

(trailing-spaces)


[error] 69-69: trailing spaces

(trailing-spaces)


[error] 75-75: trailing spaces

(trailing-spaces)


[error] 81-81: trailing spaces

(trailing-spaces)


[error] 87-87: trailing spaces

(trailing-spaces)


[error] 93-93: trailing spaces

(trailing-spaces)

🔇 Additional comments (41)
pkg/common/common.go (2)

23-26: LGTM! Clean utility function implementation.

The new imports and FileExists function are well-implemented. The function correctly uses fs.Stat to check file existence in embedded filesystems, which is the standard Go pattern for this operation.


124-128: LGTM! Simple and effective file existence check.

The function implementation is correct and follows Go conventions for checking file existence in embedded filesystems.
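
For reference, a minimal sketch of such a helper, assuming the FileExists(embed.FS, string) bool signature from the change summary (the body shown is illustrative rather than the exact implementation):

package common

import (
    "embed"
    "io/fs"
)

// FileExists reports whether a file exists at the given path inside an embedded filesystem.
// fs.Stat returns an error for missing entries, so a nil error means the file is present.
func FileExists(fsys embed.FS, path string) bool {
    _, err := fs.Stat(fsys, path)
    return err == nil
}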

internal/controller/services/monitoring/monitoring_controller.go (2)

101-101: LGTM! Consistent dynamic resource watching.

The PrometheusRule resource watching follows the same conditional pattern as other monitoring resources, ensuring the controller only manages these resources when the corresponding CRD exists.


131-131: LGTM! Well-positioned alerting deployment action.

The deployAlerting action is appropriately placed in the reconciliation chain after the monitoring stack deployment, which ensures the required infrastructure is available before deploying alerting rules.

internal/controller/dscinitialization/dscinitialization_controller.go (1)

454-454: LGTM! Consistent field assignment pattern.

The alerting configuration assignment follows the same pattern as the existing metrics and traces fields, ensuring proper propagation of the alerting configuration from DSCInitialization to the Monitoring CR.

pkg/cluster/gvk/gvk.go (1)

468-472: LGTM! Consistent GVK declaration for PrometheusRule.

The PrometheusRule GVK follows the same pattern as other declarations and correctly uses the monitoring.rhobs API group with stable v1 version, consistent with the existing ServiceMonitor GVK.
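
For context, the declaration presumably mirrors the ServiceMonitor pattern mentioned above; an illustrative sketch (the real definition lives in pkg/cluster/gvk/gvk.go):

package gvk

import "k8s.io/apimachinery/pkg/runtime/schema"

// PrometheusRule identifies the monitoring.rhobs/v1 PrometheusRule kind used by the alerting templates.
var PrometheusRule = schema.GroupVersionKind{
    Group:   "monitoring.rhobs",
    Version: "v1",
    Kind:    "PrometheusRule",
}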

internal/controller/dscinitialization/kubebuilder_rbac.go (1)

60-62: LGTM! Comprehensive RBAC permissions for PrometheusRule management.

The RBAC annotations correctly grant full permissions on prometheusrules resources and their subresources, following the same pattern as other monitoring.rhobs resources. These permissions are essential for the operator to manage alerting rules as described in the PR objectives.
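
The markers presumably follow the usual kubebuilder form; a hypothetical sketch (the exact verb list is an assumption based on the ClusterRole excerpt discussed further down):

// +kubebuilder:rbac:groups=monitoring.rhobs,resources=prometheusrules,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups=monitoring.rhobs,resources=prometheusrules/finalizers,verbs=get;update;patch
// +kubebuilder:rbac:groups=monitoring.rhobs,resources=prometheusrules/status,verbs=get;update;patch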

api/services/v1alpha1/zz_generated.deepcopy.go (2)

27-40: LGTM! Autogenerated deepcopy functions for Alerting type are correct.

The autogenerated deepcopy functions follow the standard Kubernetes pattern and are consistent with other deepcopy implementations in this file.


259-263: LGTM! Proper handling of optional Alerting field.

The deepcopy logic correctly handles the optional Alerting field with proper nil checking and pointer dereferencing, consistent with how other optional fields are handled in this function.

internal/controller/components/ray/monitoring/ray-prometheusrules.tmpl.yaml (1)

1-18: Alert job label pattern validated successfully

The regex job=~'KubeRay Operator' correctly matches the Prometheus scrape configuration and unit tests:

  • In config/monitoring/prometheus/apps/prometheus-configs.yaml:
    - job_name: 'KubeRay Operator'
  • In tests/prometheus_unit_tests/ray-alerting.unit-tests.yaml:
    Series and expected alerts use up{job="KubeRay Operator"}

No further changes required.

internal/controller/components/kueue/monitoring/kueue-prometheusrules.tmpl.yaml (1)

1-18: PrometheusRule job label regex validated

The job=~'Kueue Operator' expression correctly matches the job_name: 'Kueue Operator' in the Prometheus scrape configuration and the unit tests’ up{job="Kueue Operator"} series. No changes required.

internal/controller/status/status.go (2)

108-108: LGTM! New alerting condition constant follows established naming pattern.

The ConditionAlertingAvailable constant is consistently named and appropriately placed with other condition type constants.


189-190: LGTM! Alerting configuration constants follow established patterns.

The AlertingNotConfiguredReason and AlertingNotConfiguredMessage constants are consistent with other "NotConfigured" patterns in the codebase and provide clear messaging.

internal/controller/components/trainingoperator/monitoring/trainingoperator-prometheusrules.tmpl.yaml (1)

1-18: Confirmed PrometheusRule job label matches configuration and tests

Verified that the regex matcher job=~'KubeFlow Training Operator' exactly matches the job_name: 'KubeFlow Training Operator' defined in

  • config/monitoring/prometheus/apps/prometheus-configs.yaml
    and is exercised as up{job="KubeFlow Training Operator"} in the unit tests under
  • tests/prometheus_unit_tests/trainingoperator-alerting.unit-tests.yaml

LGTM.

config/rbac/role.yaml (1)

758-770: Consider adding the deletecollection verb for parity with monitoring.coreos.com rules

Your controller already requests deletecollection for monitoring.coreos.com/prometheusrules (Lines 746-755).
If the same bulk-deletion logic is ever exercised against the monitoring.rhobs/prometheusrules resource, the absence of this verb will trigger an RBAC “forbidden” error and leave orphaned rules behind.

   - prometheusrules
+  verbs:
+  - create
+  - delete
+  - deletecollection   # ← add for symmetry & future-proofing
+  - get
+  - list
+  - patch
+  - update
+  - watch
-  verbs:
-  - create
-  - delete
-  - get
-  - list
-  - patch
-  - update
-  - watch

Do you rely on bulk cleanup via DeleteCollection anywhere in the controller? If yes, the change is mandatory; otherwise it is a harmless hardening.

config/crd/bases/dscinitialization.opendatahub.io_dscinitializations.yaml (2)

93-96: LGTM – new alerting field added

The schema addition matches the API changes and is guarded by a cross-field validation rule further below.
No issues spotted.


242-245: Validation rule looks correct

has(self.alerting) ? has(self.metrics) : true enforces the intended dependency without over-constraining the object.

config/crd/bases/services.platform.opendatahub.io_monitorings.yaml (2)

55-57: LGTM – alerting property introduced

Schema update aligns with controller behaviour.


190-193: Cross-field validation properly enforces Metrics ↔ Alerting dependency

Same rule as in DSCInitialization CRD; looks good.

docs/api-overview.md (1)

2959-2966: Keep the spec sections consistent

MonitoringCommonSpec and MonitoringSpec now list the new alerting field, but unlike metrics and traces they do not link back to a completed Alerting schema (see earlier comment) nor mention default behaviour (e.g. “nil ⇒ disabled”).

Once the Alerting table is added, ensure these links resolve and that defaults are stated for parity with the other subsections.

Also applies to: 3000-3005

api/services/v1alpha1/monitoring_types.go (2)

116-118: Well-designed empty struct for future extensibility.

The empty Alerting struct follows Go conventions and provides a foundation for future alerting configuration fields while maintaining backward compatibility.


138-138: Excellent validation rule ensuring proper dependency management.

The XValidation rule correctly enforces that alerting configuration requires metrics to be enabled, preventing invalid configurations. The conditional logic has(self.alerting) ? has(self.metrics) : true is well-structured and the error message is clear and actionable.

Also applies to: 151-152
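
Putting the two points together, the API change is roughly the following hedged sketch (only the empty Alerting struct and the CEL rule come from the summary; the Metrics placeholder, field layout, and marker wording are assumptions):

package v1alpha1

// Alerting enables deployment of PrometheusRule resources for enabled components.
// It is intentionally empty for now, so fields can be added later without breaking the API.
type Alerting struct{}

// Metrics is only a placeholder here; the real type defines its own fields.
type Metrics struct{}

// MonitoringCommonSpec excerpt: alerting is optional and may only be set when metrics is configured.
// +kubebuilder:validation:XValidation:rule="has(self.alerting) ? has(self.metrics) : true",message="metrics must be configured when alerting is set"
type MonitoringCommonSpec struct {
    // Metrics configuration for the monitoring stack.
    Metrics *Metrics `json:"metrics,omitempty"`

    // Alerting configuration for Prometheus.
    Alerting *Alerting `json:"alerting,omitempty"`
}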

internal/controller/components/modelregistry/monitoring/modelregistry-prometheusrules.tmpl.yaml (2)

1-46: Well-structured SLO-based alerting rules.

The alerting rules follow Prometheus SLO best practices with:

  • Multiple time window combinations for burn rate detection
  • Appropriate severity levels (critical/warning) based on burn rate thresholds
  • Clear annotations with actionable messages
  • Proper error budget calculations using 98% SLI

The alert structure effectively balances sensitivity and noise reduction.


47-90: Comprehensive recording rules for efficient alerting.

The recording rules are well-designed:

  • Cover all necessary time windows (5m to 3d) for burn rate calculations
  • Use consistent naming convention probe_success:burnrate<duration>
  • Include fallback logic with absent() and vector(1) for missing metrics
  • Properly labeled for reuse in alerting rules

This approach optimizes query performance for alert evaluation.

internal/controller/services/monitoring/monitoring/operator-prometheusrules.tmpl.yaml (1)

1-21: Appropriate operator-level monitoring configuration.

The PrometheusRule effectively monitors operator health by:

  • Tracking reconciliation failure rates for key controllers
  • Using 15-minute intervals suitable for controller monitoring
  • Filtering by specific job label for proper metric scoping
  • Recording metrics for efficient downstream alerting

The focus on result!="success" provides clear visibility into controller operational issues.

internal/controller/components/trustyai/monitoring/trustyai-prometheusrules.tmpl.yaml (1)

1-90: Excellent consistency with other component monitoring rules.

The TrustyAI PrometheusRule maintains consistent patterns with other components:

  • Same SLO-based alerting approach with 98% error budget
  • Identical time windows and burn rate thresholds
  • Consistent alert severity mapping and annotation structure
  • Proper instance labeling for the controller manager

This standardization simplifies operational procedures and ensures uniform monitoring coverage across components.

internal/controller/components/codeflare/monitoring/codeflare-prometheusrules.tmpl.yaml (3)

14-17: Consider consistency in SLI thresholds across components.

CodeFlare uses a 99.950% SLI (error budget of 0.0005) while other components use 98% (error budget of 0.02). This creates significantly different alerting sensitivities across the platform.

Should all components use consistent SLI thresholds for operational consistency, or does CodeFlare require stricter monitoring due to its critical role in distributed workloads?

Also applies to: 26-29, 38-41


50-67: Excellent operational guidance in alert annotations.

The availability alerts include valuable operational context:

  • Clear descriptions of alert conditions
  • Direct links to triage documentation
  • Appropriate severity levels for different scenarios

This enhances the operational experience and reduces mean time to resolution.


69-145: Comprehensive recording rules covering multiple components.

The recording rules effectively support both MCAD Controller and CodeFlare Operator monitoring with:

  • Consistent burn rate calculations across time windows
  • Proper job filtering for metric scoping
  • Clear instance labeling for component identification

This dual-component approach provides complete coverage for the CodeFlare ecosystem.

tests/e2e/monitoring_test.go (3)

52-53: LGTM: Well-structured test case additions.

The new test cases for PrometheusRule creation and deletion are properly integrated into the test suite and follow the existing pattern.


603-648: LGTM: Comprehensive PrometheusRule creation test.

The test correctly:

  • Enables alerting configuration in DSCI
  • Sets dashboard component to managed state
  • Validates dashboard resource readiness with proper condition checks
  • Verifies existence of both component-specific and operator PrometheusRule resources

650-665: LGTM: Proper cleanup validation.

The deletion test correctly verifies that PrometheusRule resources are cleaned up when a component is disabled by setting its management state to "Removed".

internal/controller/services/monitoring/monitoring_controller_support.go (2)

188-201: LGTM: Well-implemented PrometheusRule template management.

The function properly:

  • Constructs the expected file path for component PrometheusRule templates
  • Validates file existence in the embedded filesystem before adding to templates
  • Provides clear error messaging for missing files
  • Follows the established pattern for template management
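
A hedged sketch of that flow (package and import paths are assumed from the repository layout, and the template-registration step is abbreviated to a comment because the reconciler's internal types are not shown in this review):

package monitoring

import (
    "fmt"

    "github.com/opendatahub-io/opendatahub-operator/v2/internal/controller/components"
    "github.com/opendatahub-io/opendatahub-operator/v2/pkg/common"
)

// addPrometheusRules registers a component's PrometheusRule template when it exists in the
// embedded filesystem, following the layout <component>/monitoring/<component>-prometheusrules.tmpl.yaml.
func addPrometheusRules(componentName string) error {
    path := fmt.Sprintf("%s/monitoring/%s-prometheusrules.tmpl.yaml", componentName, componentName)

    if !common.FileExists(components.ComponentRulesFS, path) {
        return fmt.Errorf("prometheus rules template not found for component %s: %s", componentName, path)
    }

    // ...append a template entry pointing at components.ComponentRulesFS and path to the
    // reconciliation request so the generic apply action renders and applies it.
    return nil
}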

203-219: LGTM: Proper PrometheusRule cleanup implementation.

The cleanup function correctly:

  • Creates an unstructured object with the proper PrometheusRule GVK
  • Constructs the expected resource name and namespace
  • Handles NotFound errors gracefully (resource already deleted)
  • Provides clear error messages for deletion failures
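
A corresponding hedged sketch of the deletion path (the resource-name suffix and parameter list are illustrative; the unstructured-object-plus-IgnoreNotFound pattern is standard controller-runtime usage):

package monitoring

import (
    "context"
    "fmt"

    "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
    "sigs.k8s.io/controller-runtime/pkg/client"

    "github.com/opendatahub-io/opendatahub-operator/v2/pkg/cluster/gvk"
)

// cleanupPrometheusRules deletes a component's PrometheusRule and treats NotFound as success,
// so repeated reconciles of a removed component stay idempotent.
func cleanupPrometheusRules(ctx context.Context, cli client.Client, componentName, namespace string) error {
    rule := &unstructured.Unstructured{}
    rule.SetGroupVersionKind(gvk.PrometheusRule)
    rule.SetName(componentName + "-prometheusrules") // hypothetical naming convention
    rule.SetNamespace(namespace)

    if err := cli.Delete(ctx, rule); client.IgnoreNotFound(err) != nil {
        return fmt.Errorf("failed to delete prometheus rules for component %s: %w", componentName, err)
    }
    return nil
}
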
internal/controller/components/kserve/monitoring/kserve-prometheusrules.tmpl.yaml (2)

1-7: LGTM!

The PrometheusRule metadata and structure are correctly defined with proper API version and templated namespace.


48-51: No action required: job selector matches Prometheus config

The job_name: 'Kserve Controller Manager' used in the alerting rule exactly matches the scrape configuration in config/monitoring/prometheus/apps/prometheus-configs.yaml:

  • In config/monitoring/prometheus/apps/prometheus-configs.yaml, you have
      - job_name: 'Kserve Controller Manager'
        honor_labels: true
        metrics_path: /metrics
  • The alert rule in
    internal/controller/components/kserve/monitoring/kserve-prometheusrules.tmpl.yaml
    correctly uses
    absent(up{job=~'Kserve Controller Manager'}) * 0 or vector(1)
    

Since the selector aligns with your Prometheus scrape job, no changes are needed here.

internal/controller/components/dashboard/monitoring/dashboard-prometheusrules.tmpl.yaml (1)

109-196: Recording rules are well-structured.

The recording rules correctly calculate burn rates for both HAProxy HTTP responses and probe success metrics over multiple time windows.

internal/controller/components/modelcontroller/monitoring/modelcontroller-prometheusrules.tmpl.yaml (1)

51-51: Verify job name with spaces.

Similar to kserve, the job name 'ODH Model Controller' contains spaces. Verify this matches your actual Prometheus job configuration.

internal/controller/services/monitoring/monitoring_controller_actions.go (2)

311-339: Well-structured validation and CRD existence check.

The function properly validates the instance type, checks for alerting configuration, and verifies CRD availability before proceeding. Good defensive programming.


359-392: Excellent error collection pattern.

The approach of collecting errors without failing fast allows all components to be processed, providing better visibility into multiple issues. The separation of add and cleanup errors aids in debugging.

internal/controller/components/workbenches/monitoring/workbenches-prometheusrules.tmpl.yaml (1)

14-24: Regex mismatch between numerator & denominator

The numerator regex is .*jupyterhub-nb-.* (leading .*), the denominator is jupyterhub-nb-.* (no leading wildcard).
This asymmetry is probably unintentional and can hide mis-matched series, slightly skewing the ratio.

-          expr: kubelet_volume_stats_used_bytes{persistentvolumeclaim=~".*jupyterhub-nb-.*"} / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"jupyterhub-nb-.*"} > 0.9 ...
+          expr: kubelet_volume_stats_used_bytes{persistentvolumeclaim=~".*jupyterhub-nb-.*"} / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~".*jupyterhub-nb-.*"} > 0.9 ...

Same for the 100 % rule on Line 23.

Comment on lines 2756 to +2764


#### Alerting



Alerting configuration for Prometheus


🛠️ Refactor suggestion

Alerting section is incomplete – no schema table provided

The new header correctly introduces the Alerting type but the usual field-by-field Markdown table is missing.
Consumers of this generated API reference have no way to know which keys are accepted (e.g. disabled, receivers, etc.) or their validation constraints.

Please re-run controller-gen/make manifests (or whichever doc generator you use) after adding proper godoc comments to the Alerting struct so the table is rendered.

🤖 Prompt for AI Agents
In docs/api-overview.md around lines 2756 to 2764, the Alerting section lacks
the detailed schema table showing its fields and validation constraints. To fix
this, add comprehensive godoc comments to the Alerting struct in the source
code, then re-run the documentation generator such as controller-gen or make
manifests to regenerate the API reference with the complete field-by-field
Markdown table included.

@zdtsw
Member

zdtsw commented Aug 6, 2025

@StevenTobin StevenTobin force-pushed the RHOAIENG-25366_enable_built_in_alerting branch from 23cb177 to ec962a0 on August 6, 2025 at 15:00
@StevenTobin
Contributor Author

@zdtsw Partly yes, but we need to work out the migration path first. For now this is just to enable the alerting.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

♻️ Duplicate comments (4)
internal/controller/components/codeflare/monitoring/codeflare-prometheusrules.tmpl.yaml (1)

8-10: Fix YAML indentation and formatting issues.

The static analysis correctly identifies formatting problems that were previously flagged. The indentation issues and trailing spaces still need to be addressed.

Also applies to: 68-68, 72-72, 110-110

internal/controller/components/datasciencepipelines/monitoring/datasciencepipelines-prometheusrules.tmpl.yaml (2)

20-20: Reconsider using 'info' severity for error burn rate alerts.

Error burn rate alerts typically warrant 'warning' or 'critical' severity. Using 'info' severity may cause these alerts to be ignored in production monitoring systems.

Also applies to: 32-32, 44-44, 56-56


101-102: Fix inconsistency between alert message and actual timing.

The alert messages claim components have been NotReady for more than 5 minutes, but the expressions use max_over_time(...[3m]) with for: 2m, totaling approximately 3-5 minutes.

Also applies to: 110-111, 119-120, 128-129

internal/controller/components/workbenches/monitoring/workbenches-prometheusrules.tmpl.yaml (1)

53-100: Duplicate alert names break the whole rule file

Prometheus rejects any rule group that contains duplicate alert: names, causing all rules in SLOs-probe_success_workbench to be ignored.
The four alerts here are all named RHODS Jupyter Probe Success Burn Rate.

-        - alert: RHODS Jupyter Probe Success Burn Rate
+        - alert: RHODS Jupyter Probe Success Burn Rate 5m_1h
...
-        - alert: RHODS Jupyter Probe Success Burn Rate
+        - alert: RHODS Jupyter Probe Success Burn Rate 30m_6h
...
-        - alert: RHODS Jupyter Probe Success Burn Rate
+        - alert: RHODS Jupyter Probe Success Burn Rate 2h_1d
...
-        - alert: RHODS Jupyter Probe Success Burn Rate
+        - alert: RHODS Jupyter Probe Success Burn Rate 6h_3d

Make each alert name unique (or move them into separate groups) so the rule file loads.

🧹 Nitpick comments (2)
internal/controller/components/feastoperator/monitoring/feastoperator-prometheusrules.tmpl.yaml (1)

52-56: Review the probe success initialization logic.

The expression absent(up{job="feast-operator"}) * 0 or vector(1) sets the probe_success metric to 1 when the up metric is absent, defaulting to "success". This could mask actual failures during operator startup or unavailability periods.

Consider whether defaulting to success when metrics are absent aligns with your monitoring strategy.

internal/controller/components/workbenches/monitoring/workbenches-prometheusrules.tmpl.yaml (1)

141-171: Redundant recording rule & trailing whitespace

rhods_active_users (Line 147) and cluster:usage:consumption:rhods:active_users (Line 160) are derived from the same selector, producing two identical series. Consider dropping one or changing the expression to capture a different aspect (e.g., burst vs steady-state users).

Also, Line 171 contains trailing spaces—cleaning these up avoids noisy diffs.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 23cb177 and ec962a0.

📒 Files selected for processing (31)
  • api/services/v1alpha1/monitoring_types.go (3 hunks)
  • api/services/v1alpha1/zz_generated.deepcopy.go (2 hunks)
  • config/crd/bases/dscinitialization.opendatahub.io_dscinitializations.yaml (2 hunks)
  • config/crd/bases/services.platform.opendatahub.io_monitorings.yaml (2 hunks)
  • config/rbac/role.yaml (2 hunks)
  • docs/api-overview.md (4 hunks)
  • internal/controller/components/codeflare/monitoring/codeflare-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/components.go (1 hunks)
  • internal/controller/components/dashboard/monitoring/dashboard-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/datasciencepipelines/monitoring/datasciencepipelines-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/feastoperator/monitoring/feastoperator-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/kserve/monitoring/kserve-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/kueue/monitoring/kueue-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/llamastackoperator/monitoring/llamastackoperator-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/modelcontroller/monitoring/modelcontroller-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/modelmeshserving/monitoring/modelmeshservice-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/modelregistry/monitoring/modelregistry-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/ray/monitoring/ray-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/trainingoperator/monitoring/trainingoperator-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/trustyai/monitoring/trustyai-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/workbenches/monitoring/workbenches-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/dscinitialization/dscinitialization_controller.go (1 hunks)
  • internal/controller/dscinitialization/kubebuilder_rbac.go (1 hunks)
  • internal/controller/services/monitoring/monitoring/operator-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/services/monitoring/monitoring_controller.go (2 hunks)
  • internal/controller/services/monitoring/monitoring_controller_actions.go (3 hunks)
  • internal/controller/services/monitoring/monitoring_controller_support.go (2 hunks)
  • internal/controller/status/status.go (2 hunks)
  • pkg/cluster/gvk/gvk.go (1 hunks)
  • pkg/common/common.go (2 hunks)
  • tests/e2e/monitoring_test.go (2 hunks)
✅ Files skipped from review due to trivial changes (3)
  • pkg/cluster/gvk/gvk.go
  • internal/controller/components/components.go
  • docs/api-overview.md
🚧 Files skipped from review as they are similar to previous changes (22)
  • pkg/common/common.go
  • config/rbac/role.yaml
  • internal/controller/dscinitialization/kubebuilder_rbac.go
  • internal/controller/dscinitialization/dscinitialization_controller.go
  • internal/controller/components/modelmeshserving/monitoring/modelmeshservice-prometheusrules.tmpl.yaml
  • internal/controller/status/status.go
  • config/crd/bases/services.platform.opendatahub.io_monitorings.yaml
  • api/services/v1alpha1/monitoring_types.go
  • internal/controller/components/modelregistry/monitoring/modelregistry-prometheusrules.tmpl.yaml
  • api/services/v1alpha1/zz_generated.deepcopy.go
  • tests/e2e/monitoring_test.go
  • config/crd/bases/dscinitialization.opendatahub.io_dscinitializations.yaml
  • internal/controller/services/monitoring/monitoring_controller.go
  • internal/controller/components/kserve/monitoring/kserve-prometheusrules.tmpl.yaml
  • internal/controller/components/modelcontroller/monitoring/modelcontroller-prometheusrules.tmpl.yaml
  • internal/controller/components/ray/monitoring/ray-prometheusrules.tmpl.yaml
  • internal/controller/components/kueue/monitoring/kueue-prometheusrules.tmpl.yaml
  • internal/controller/components/trainingoperator/monitoring/trainingoperator-prometheusrules.tmpl.yaml
  • internal/controller/components/trustyai/monitoring/trustyai-prometheusrules.tmpl.yaml
  • internal/controller/services/monitoring/monitoring/operator-prometheusrules.tmpl.yaml
  • internal/controller/services/monitoring/monitoring_controller_support.go
  • internal/controller/components/dashboard/monitoring/dashboard-prometheusrules.tmpl.yaml
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: in rhoai patch files, there are two naming patterns for dscinitialization crd patches: files with `....
Learnt from: kahowell
PR: opendatahub-io/opendatahub-operator#2220
File: config/crd.rhoai/patches/cainjection_in_dscinitialization.opendatahub.io_dscinitializations.yaml:5-7
Timestamp: 2025-07-29T18:43:10.232Z
Learning: In RHOAI patch files, there are two naming patterns for DSCInitialization CRD patches: files with `.opendatahub.io` in their filename use the duplicated pattern `dscinitializations.dscinitialization.opendatahub.io.opendatahub.io` in the CRD name, while files without the group in filename use the standard pattern `dscinitializations.dscinitialization.opendatahub.io`. This duplication appears to be intentional and exists in both regular and RHOAI branches.

Applied to files:

  • internal/controller/components/workbenches/monitoring/workbenches-prometheusrules.tmpl.yaml
🪛 YAMLlint (1.37.1)
internal/controller/components/codeflare/monitoring/codeflare-prometheusrules.tmpl.yaml

[warning] 8-8: wrong indentation: expected 4 but found 6

(indentation)


[warning] 10-10: wrong indentation: expected 10 but found 8

(indentation)


[error] 68-68: trailing spaces

(trailing-spaces)


[warning] 72-72: wrong indentation: expected 10 but found 8

(indentation)


[warning] 110-110: wrong indentation: expected 10 but found 8

(indentation)

internal/controller/components/feastoperator/monitoring/feastoperator-prometheusrules.tmpl.yaml

[warning] 8-8: wrong indentation: expected 4 but found 6

(indentation)


[error] 22-22: trailing spaces

(trailing-spaces)


[error] 35-35: trailing spaces

(trailing-spaces)


[warning] 52-52: wrong indentation: expected 10 but found 8

(indentation)


[error] 57-57: trailing spaces

(trailing-spaces)


[error] 63-63: trailing spaces

(trailing-spaces)


[error] 69-69: trailing spaces

(trailing-spaces)


[error] 75-75: trailing spaces

(trailing-spaces)


[error] 81-81: trailing spaces

(trailing-spaces)


[error] 87-87: trailing spaces

(trailing-spaces)


[error] 93-93: trailing spaces

(trailing-spaces)

internal/controller/components/llamastackoperator/monitoring/llamastackoperator-prometheusrules.tmpl.yaml

[warning] 8-8: wrong indentation: expected 4 but found 6

(indentation)


[error] 22-22: trailing spaces

(trailing-spaces)


[error] 35-35: trailing spaces

(trailing-spaces)


[error] 48-48: trailing spaces

(trailing-spaces)


[error] 49-49: trailing spaces

(trailing-spaces)


[warning] 52-52: wrong indentation: expected 10 but found 8

(indentation)


[error] 57-57: trailing spaces

(trailing-spaces)


[error] 63-63: trailing spaces

(trailing-spaces)


[error] 69-69: trailing spaces

(trailing-spaces)


[error] 75-75: trailing spaces

(trailing-spaces)


[error] 81-81: trailing spaces

(trailing-spaces)


[error] 87-87: trailing spaces

(trailing-spaces)


[error] 93-93: trailing spaces

(trailing-spaces)

internal/controller/components/workbenches/monitoring/workbenches-prometheusrules.tmpl.yaml

[warning] 8-8: wrong indentation: expected 4 but found 6

(indentation)


[warning] 10-10: wrong indentation: expected 10 but found 8

(indentation)


[warning] 31-31: wrong indentation: expected 10 but found 8

(indentation)


[error] 50-50: trailing spaces

(trailing-spaces)


[warning] 53-53: wrong indentation: expected 10 but found 8

(indentation)


[warning] 105-105: wrong indentation: expected 10 but found 8

(indentation)


[warning] 143-143: wrong indentation: expected 10 but found 8

(indentation)


[error] 171-171: trailing spaces

(trailing-spaces)


[warning] 174-174: wrong indentation: expected 10 but found 8

(indentation)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Run tests and collect coverage on internal and pkg
  • GitHub Check: Run tests and collect coverage on tests/integration
  • GitHub Check: golangci-lint
🔇 Additional comments (11)
internal/controller/components/codeflare/monitoring/codeflare-prometheusrules.tmpl.yaml (2)

14-17: Verify burn rate threshold calculations.

The burn rate thresholds use (14.40 * (1-0.99950)) which equals 14.40 * 0.0005 = 0.0072. This represents a very strict SLO (99.95% success rate) with high sensitivity. Ensure this aligns with your actual service level objectives.


72-106: LGTM on recording rules structure.

The recording rules correctly calculate burn rates using 1 - avg_over_time(probe_success{...}[duration]) pattern across multiple time windows. The labeling with instance: mcad-controller is consistent.

internal/controller/components/feastoperator/monitoring/feastoperator-prometheusrules.tmpl.yaml (1)

15-15: Different SLO threshold compared to other components.

This component uses a 98% success rate threshold (1-0.98000) while CodeFlare uses 99.95%. Ensure this difference is intentional and aligns with the service requirements for Feast Operator.

Also applies to: 17-17

internal/controller/components/llamastackoperator/monitoring/llamastackoperator-prometheusrules.tmpl.yaml (2)

15-15: Consistent SLO threshold with Feast Operator.

Uses the same 98% success rate threshold as Feast Operator, which is good for consistency across similar operator components.

Also applies to: 17-17


52-56: Probe success initialization follows same pattern.

Uses the same absent(up{job="llama-stack-k8s-operator"}) * 0 or vector(1) pattern as Feast Operator for consistent behavior across components.

internal/controller/services/monitoring/monitoring_controller_actions.go (4)

311-324: Well-structured function initialization and early validation.

The function correctly validates the instance type and checks for alerting configuration, with appropriate condition marking when alerting is not configured.


326-348: Proper CRD existence check and template addition.

Good practice to check for PrometheusRule CRD availability before proceeding. The operator-level rules are correctly added regardless of component states.


359-392: Excellent error collection strategy for component processing.

The approach of collecting errors for each component while continuing to process others is robust. This prevents one failing component from blocking alerting setup for all others.


400-420: Comprehensive error reporting and handling.

The function logs individual component errors while still returning an aggregate error if any issues occurred. This provides good visibility into which specific components had problems.

internal/controller/components/datasciencepipelines/monitoring/datasciencepipelines-prometheusrules.tmpl.yaml (2)

135-172: Recording rules structure is well-implemented.

The probe success burn rate recording rules correctly calculate burn rates across multiple time windows with proper labeling and consistent instance naming.


174-224: HTTP response burn rate calculations look correct.

The HAProxy backend HTTP response burn rate recording rules properly calculate 5xx error rates across different time windows using rate functions and division for percentage calculation.
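
In other words, for each window w the recording rule computes

haproxy_backend_http_responses_total:burnrate<w> = rate(5xx responses[w]) / rate(all responses[w])

so the stored value is the fraction of requests failing in that window, directly comparable to the (factor * (1 - SLO)) thresholds used by the alerts.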

@zdtsw
Copy link
Member

zdtsw commented Aug 7, 2025

Do we need it for the ODH 2.33.0 release?

@StevenTobin
Copy link
Contributor Author

It would be good to have this in 2.33 so we have all the foundational pieces of observability in place, but I don't think it's a blocker if not.

@StevenTobin
Copy link
Contributor Author

/test opendatahub-operator-e2e

@zdtsw
Copy link
Member

zdtsw commented Aug 7, 2025

It would be good to have this in 2.33 so we have all the foundational pieces of observability in place, but I don't think it's a blocker if not.

What I am thinking is: if we need it for RHOAI 2.23.0, we will need it for ODH 2.33.0, so we should focus on getting this PR into main and can leave the sync to rhoai for the next two weeks.
But if that is not the case, we will need to get other PRs into main as the higher priority.

@StevenTobin
Copy link
Contributor Author

The target for this is 2.24 RHOAI.

@StevenTobin
Copy link
Contributor Author

/test opendatahub-operator-e2e

@StevenTobin StevenTobin force-pushed the RHOAIENG-25366_enable_built_in_alerting branch from ec962a0 to 2376986 Compare August 7, 2025 12:58
@StevenTobin
Copy link
Contributor Author

/test opendatahub-operator-e2e

@StevenTobin StevenTobin force-pushed the RHOAIENG-25366_enable_built_in_alerting branch from a9907bb to c041af2 Compare August 13, 2025 09:07
Copy link

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 8

♻️ Duplicate comments (13)
internal/controller/components/dashboard/monitoring/dashboard-prometheusrules.tmpl.yaml (2)

10-57: Rename alerts per window and template the namespace label in alert labels

All four alerts in this group share the same name and hardcode the namespace label. Unique names help triage and routing; templating the namespace improves portability.

Apply the following changes:

-        - alert: RHODS Dashboard Route Error Burn Rate
+        - alert: RHODS Dashboard Route Error Burn Rate (5m/1h)
           annotations:
             message: 'High error budget burn for {{`{{`}}$labels.route{{`}}`}} (current value: {{`{{`}}$value{{`}}`}} ).'
             summary: RHODS Dashboard Route Error Burn Rate
           expr: |
-            sum(haproxy_backend_http_responses_total:burnrate5m{route=~"rhods-dashboard"}) by (route) > (14.40 * (1-0.99950))
+            sum(haproxy_backend_http_responses_total:burnrate5m{route="rhods-dashboard"}) by (route) > (14.40 * (1-0.99950))
             and
-            sum(haproxy_backend_http_responses_total:burnrate1h{route=~"rhods-dashboard"}) by (route) > (14.40 * (1-0.99950))
+            sum(haproxy_backend_http_responses_total:burnrate1h{route="rhods-dashboard"}) by (route) > (14.40 * (1-0.99950))
           for: 2m
           labels:
             severity: critical
-            namespace: redhat-ods-applications
+            namespace: {{.Namespace}}
-        - alert: RHODS Dashboard Route Error Burn Rate
+        - alert: RHODS Dashboard Route Error Burn Rate (30m/6h)
           annotations:
             message: 'High error budget burn for {{`{{`}}$labels.route{{`}}`}} (current value: {{`{{`}}$value{{`}}`}} ).'
             summary: RHODS Dashboard Route Error Burn Rate
           expr: |
-            sum(haproxy_backend_http_responses_total:burnrate30m{route=~"rhods-dashboard"}) by (route) > (6.00 * (1-0.99950))
+            sum(haproxy_backend_http_responses_total:burnrate30m{route="rhods-dashboard"}) by (route) > (6.00 * (1-0.99950))
             and
-            sum(haproxy_backend_http_responses_total:burnrate6h{route=~"rhods-dashboard"}) by (route) > (6.00 * (1-0.99950))
+            sum(haproxy_backend_http_responses_total:burnrate6h{route="rhods-dashboard"}) by (route) > (6.00 * (1-0.99950))
           for: 15m
           labels:
             severity: critical
-            namespace: redhat-ods-applications
+            namespace: {{.Namespace}}
-        - alert: RHODS Dashboard Route Error Burn Rate
+        - alert: RHODS Dashboard Route Error Burn Rate (2h/1d)
           annotations:
             message: 'High error budget burn for {{`{{`}}$labels.route{{`}}`}} (current value: {{`{{`}}$value{{`}}`}} ).'
             summary: RHODS Dashboard Route Error Burn Rate
           expr: |
-            sum(haproxy_backend_http_responses_total:burnrate2h{route=~"rhods-dashboard"}) by (route) > (3.00 * (1-0.99950))
+            sum(haproxy_backend_http_responses_total:burnrate2h{route="rhods-dashboard"}) by (route) > (3.00 * (1-0.99950))
             and
-            sum(haproxy_backend_http_responses_total:burnrate1d{route=~"rhods-dashboard"}) by (route) > (3.00 * (1-0.99950))
+            sum(haproxy_backend_http_responses_total:burnrate1d{route="rhods-dashboard"}) by (route) > (3.00 * (1-0.99950))
           for: 1h
           labels:
             severity: warning
-            namespace: redhat-ods-applications
+            namespace: {{.Namespace}}
-        - alert: RHODS Dashboard Route Error Burn Rate
+        - alert: RHODS Dashboard Route Error Burn Rate (6h/3d)
           annotations:
             message: 'High error budget burn for {{`{{`}}$labels.route{{`}}`}} (current value: {{`{{`}}$value{{`}}`}} ).'
             summary: RHODS Dashboard Route Error Burn Rate
           expr: |
-            sum(haproxy_backend_http_responses_total:burnrate6h{route=~"rhods-dashboard"}) by (route) > (1.00 * (1-0.99950))
+            sum(haproxy_backend_http_responses_total:burnrate6h{route="rhods-dashboard"}) by (route) > (1.00 * (1-0.99950))
             and
-            sum(haproxy_backend_http_responses_total:burnrate3d{route=~"rhods-dashboard"}) by (route) > (1.00 * (1-0.99950))
+            sum(haproxy_backend_http_responses_total:burnrate3d{route="rhods-dashboard"}) by (route) > (1.00 * (1-0.99950))
           for: 3h
           labels:
             severity: warning
-            namespace: redhat-ods-applications
+            namespace: {{.Namespace}}

58-107: Apply the same naming and namespace-label fixes to probe alerts

Make probe alert names unique per window and template the namespace label. Also prefer exact-match selectors to avoid regex overhead when the value is static.

-        - alert: RHODS Dashboard Probe Success Burn Rate
+        - alert: RHODS Dashboard Probe Success Burn Rate (5m/1h)
           annotations:
             message: 'High error budget burn for {{`{{`}}$labels.name{{`}}`}} (current value: {{`{{`}}$value{{`}}`}} ).'
             summary: RHODS Dashboard Probe Success Burn Rate
           expr: |
-            sum(probe_success:burnrate5m{name=~"rhods-dashboard"}) by (name) > (14.40 * (1-0.98))
+            sum(probe_success:burnrate5m{name="rhods-dashboard"}) by (name) > (14.40 * (1-0.98))
             and
-            sum(probe_success:burnrate1h{name=~"rhods-dashboard"}) by (name) > (14.40 * (1-0.98))
+            sum(probe_success:burnrate1h{name="rhods-dashboard"}) by (name) > (14.40 * (1-0.98))
           for: 2m
           labels:
             severity: critical
-            namespace: redhat-ods-applications
+            namespace: {{.Namespace}}
-        - alert: RHODS Dashboard Probe Success Burn Rate
+        - alert: RHODS Dashboard Probe Success Burn Rate (30m/6h)
           annotations:
             message: 'High error budget burn for {{`{{`}}$labels.name{{`}}`}} (current value: {{`{{`}}$value{{`}}`}} ).'
             summary: RHODS Dashboard Probe Success Burn Rate
           expr: |
-            sum(probe_success:burnrate30m{name=~"rhods-dashboard"}) by (name) > (6.00 * (1-0.98))
+            sum(probe_success:burnrate30m{name="rhods-dashboard"}) by (name) > (6.00 * (1-0.98))
             and
-            sum(probe_success:burnrate6h{name=~"rhods-dashboard"}) by (name) > (6.00 * (1-0.98))
+            sum(probe_success:burnrate6h{name="rhods-dashboard"}) by (name) > (6.00 * (1-0.98))
           for: 15m
           labels:
             severity: critical
-            namespace: redhat-ods-applications
+            namespace: {{.Namespace}}
-        - alert: RHODS Dashboard Probe Success Burn Rate
+        - alert: RHODS Dashboard Probe Success Burn Rate (2h/1d)
           annotations:
             message: 'High error budget burn for {{`{{`}}$labels.name{{`}}`}} (current value: {{`{{`}}$value{{`}}`}} ).'
             summary: RHODS Dashboard Probe Success Burn Rate
           expr: |
-            sum(probe_success:burnrate2h{name=~"rhods-dashboard"}) by (name) > (3.00 * (1-0.98))
+            sum(probe_success:burnrate2h{name="rhods-dashboard"}) by (name) > (3.00 * (1-0.98))
             and
-            sum(probe_success:burnrate1d{name=~"rhods-dashboard"}) by (name) > (3.00 * (1-0.98))
+            sum(probe_success:burnrate1d{name="rhods-dashboard"}) by (name) > (3.00 * (1-0.98))
           for: 1h
           labels:
             severity: warning
-            namespace: redhat-ods-applications
+            namespace: {{.Namespace}}
-        - alert: RHODS Dashboard Probe Success Burn Rate
+        - alert: RHODS Dashboard Probe Success Burn Rate (6h/3d)
           annotations:
             message: 'High error budget burn for {{`{{`}}$labels.name{{`}}`}} (current value: {{`{{`}}$value{{`}}`}} ).'
             summary: RHODS Dashboard Probe Success Burn Rate
           expr: |
-            sum(probe_success:burnrate6h{name=~"rhods-dashboard"}) by (name) > (1.00 * (1-0.98))
+            sum(probe_success:burnrate6h{name="rhods-dashboard"}) by (name) > (1.00 * (1-0.98))
             and
-            sum(probe_success:burnrate3d{name=~"rhods-dashboard"}) by (name) > (1.00 * (1-0.98))
+            sum(probe_success:burnrate3d{name="rhods-dashboard"}) by (name) > (1.00 * (1-0.98))
           for: 3h
           labels:
             severity: warning
-            namespace: redhat-ods-applications
+            namespace: {{.Namespace}}
internal/controller/components/trustyai/monitoring/trustyai-prometheusrules.tmpl.yaml (1)

10-46: Duplicate alert names within a group will break rule loading

Each alert in a group must have a unique name. These three alerts share the same name, which will cause the Prometheus rule group to be rejected.

-        - alert: TrustyAI Controller Probe Success Burn Rate
+        - alert: TrustyAI Controller Probe Success Burn Rate 5m_1h
...
-        - alert: TrustyAI Controller Probe Success Burn Rate
+        - alert: TrustyAI Controller Probe Success Burn Rate 30m_6h
...
-        - alert: TrustyAI Controller Probe Success Burn Rate
+        - alert: TrustyAI Controller Probe Success Burn Rate 2h_1d
internal/controller/components/workbenches/monitoring/workbenches-prometheusrules.tmpl.yaml (2)

10-24: Unify PVC regex patterns and simplify the expression

The used vs capacity selectors use different regexes, which risks mismatched series. Use identical patterns on both sides. Also, parenthesize once to avoid recomputing the ratio visually.

-          expr: kubelet_volume_stats_used_bytes{persistentvolumeclaim=~".*jupyterhub-nb-.*"} / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"jupyterhub-nb-.*"} > 0.9 and kubelet_volume_stats_used_bytes{persistentvolumeclaim=~".*jupyterhub-nb-.*"} / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"jupyterhub-nb-.*"} < 0.99
+          expr: (kubelet_volume_stats_used_bytes{persistentvolumeclaim=~"jupyterhub-nb-.*"} / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"jupyterhub-nb-.*"}) > 0.9
+            and
+            (kubelet_volume_stats_used_bytes{persistentvolumeclaim=~"jupyterhub-nb-.*"} / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"jupyterhub-nb-.*"}) < 0.99

53-100: Duplicate alert names within group will break rule loading

All four alerts in SLOs-probe_success_workbench have the same name. Make them unique (suffix with window pair) or split into separate groups.

-        - alert: RHODS Jupyter Probe Success Burn Rate
+        - alert: RHODS Jupyter Probe Success Burn Rate 5m_1h
...
-        - alert: RHODS Jupyter Probe Success Burn Rate
+        - alert: RHODS Jupyter Probe Success Burn Rate 30m_6h
...
-        - alert: RHODS Jupyter Probe Success Burn Rate
+        - alert: RHODS Jupyter Probe Success Burn Rate 2h_1d
...
-        - alert: RHODS Jupyter Probe Success Burn Rate
+        - alert: RHODS Jupyter Probe Success Burn Rate 6h_3d
internal/controller/components/llamastackoperator/monitoring/llamastackoperator-prometheusrules.tmpl.yaml (2)

8-8: Fix list indentation under groups

YAMLlint flags wrong indentation. The list item under spec.groups should be indented consistently (2 spaces under groups).

Apply this diff:

-      - name: SLOs-probe_success_llama_stack_k8s_operator
+    - name: SLOs-probe_success_llama_stack_k8s_operator

52-52: Correct indentation for first recording rule entry

YAMLlint expects 10 spaces before the list entry under rules, not 8.

Apply this diff:

-        - expr: |
+          - expr: |
internal/controller/components/feastoperator/monitoring/feastoperator-prometheusrules.tmpl.yaml (2)

52-52: Correct indentation for first recording rule entry

YAMLlint expects 10 spaces before the list entry under rules, not 8.

Apply this diff:

-        - expr: |
+          - expr: |

8-8: Fix list indentation under groups

YAMLlint flags wrong indentation. The list item under spec.groups should be indented consistently (2 spaces under groups).

Apply this diff:

-      - name: SLOs-probe_success_feast_operator
+    - name: SLOs-probe_success_feast_operator
internal/controller/components/codeflare/monitoring/codeflare-prometheusrules.tmpl.yaml (4)

51-61: Standardize label matching; the current job selector is brittle and likely matches nothing.

PromQL accepts single-quoted strings, but double quotes are the convention used elsewhere in these templates. More importantly, a space-containing job value is brittle and likely incorrect for up. Prefer matching the operator pod in the target namespace to avoid reliance on scrape job names.

Proposed, more robust expressions:

-            expr: absent(up{job=~'CodeFlare Operator'}) or up{job=~'CodeFlare Operator'} != 1
+            expr: |
+              absent(up{namespace="{{.Namespace}}", pod=~"codeflare-operator-.*"})
+              OR
+              up{namespace="{{.Namespace}}", pod=~"codeflare-operator-.*"} != 1
...
-            expr: absent_over_time(up{job="CodeFlare Operator"}[2m]) == 1
+            expr: absent_over_time(up{namespace="{{.Namespace}}", pod=~"codeflare-operator-.*"}[2m]) == 1

If you do need to match on job, use a regex matcher with a sane, anchored pattern (and confirm the actual label value):

  • up{job=~"codeflare-operator.*"} rather than up{job='CodeFlare Operator'}

70-76: Normalize indentation for rules entries (YAMLlint: expected 10 spaces).

Child list items under rules should be indented two spaces deeper than rules:.

       - name: SLOs - MCAD Controller
         rules:
-        - expr: |
+          - expr: |
             1 - avg_over_time(probe_success{instance=~"mcad-controller-.*", job="user_facing_endpoints_status_codeflare"}[1d])

108-114: Normalize indentation for rules entries in “SLOs - CodeFlare Operator”.

Same indentation problem as above.

       - name: SLOs - CodeFlare Operator
         rules:
-        - expr: |
+          - expr: |
             1 - avg_over_time(probe_success{instance=~"codeflare-operator-.*", job="user_facing_endpoints_status_codeflare"}[1d])

7-15: Fix invalid YAML indentation under spec.groups (will fail to parse).

The list item under groups and its nested rules/items are over-indented. Align them with 2-space steps.

Apply:

 spec:
   groups:
-      - name: SLOs-probe_success_codeflare
-        rules:
-        - alert: CodeFlare Operator Probe Success Burn Rate
+    - name: SLOs-probe_success_codeflare
+      rules:
+      - alert: CodeFlare Operator Probe Success Burn Rate
🧹 Nitpick comments (14)
internal/controller/components/dashboard/monitoring/dashboard-prometheusrules.tmpl.yaml (2)

8-112: Optional: place recording rules before alerts to reduce cold-start gaps

Prometheus doesn’t guarantee cross-group evaluation order. Having alerts depend on recording rules in later groups can cause transient “empty” results on first runs/reloads.

Consider grouping recordings ahead of alerts, or combine dependent alerts and recordings into the same group (recordings first, then alerts) to ensure data availability within the same evaluation cycle.


15-17: Prefer exact-match selectors instead of regex when static

Use route="rhods-dashboard" and name="rhods-dashboard" instead of regex matches to reduce query overhead and noise.

You already see this reflected in the diffs above; apply consistently across all occurrences.

Also applies to: 27-29, 39-41, 51-53, 65-67, 77-79, 89-91, 101-103

pkg/common/common.go (1)

23-27: Remove unnecessary embed import after generalizing FileExists

If you adopt the fs.FS interface in FileExists, the embed import becomes unused and should be removed.

-import (
-	"crypto/sha256"
-	"embed"
+import (
+	"crypto/sha256"
 	b64 "encoding/base64"
 	"fmt"
 	"io/fs"
 	"os"
 	"regexp"
internal/controller/components/kueue/monitoring/kueue-prometheusrules.tmpl.yaml (2)

11-17: Tighten job matcher to avoid unintended matches

Anchor the job regex to avoid accidental partial matches (e.g., jobs containing “Kueue Operator” as a substring). Optionally, also scope to the target namespace if appropriate.

-          expr: absent(up{job=~'Kueue Operator'}) or up{job=~'Kueue Operator'} != 1
+          expr: absent(up{job=~'^Kueue Operator$'}) or up{job=~'^Kueue Operator$'} != 1

Follow-up:

  • Verify the actual job label value produced by the ServiceMonitor/PodMonitor for Kueue. If it differs, adjust the selector accordingly (e.g., service, namespace, or pod labels). I can help update the rule once the canonical labels are confirmed.

3-5: Consider aligning resource name with other component templates

Other components use the “-prometheusrules” pattern (e.g., trustyai-prometheusrules, workbenches-prometheusrules). For consistency and discoverability, consider renaming the resource to kueue-prometheusrules.

internal/controller/components/trustyai/monitoring/trustyai-prometheusrules.tmpl.yaml (1)

10-46: Optional: Align thresholds/labels with a runbook URL and consistent summaries

Consider adding runbook_url and making summaries concise (“TrustyAI Controller probe SLO burn rate high (5m/1h)”) for better triage.

internal/controller/components/workbenches/monitoring/workbenches-prometheusrules.tmpl.yaml (2)

8-8: YAML lint issues: indentation and trailing spaces

There are multiple indentation warnings and trailing spaces flagged by YAMLlint. Please fix to avoid CI/tooling noise and potential parsing issues.

  • Wrong indentation at lines: 8, 10, 31, 53, 105, 143, 174 (expected 4/10 spaces per linter hints).
  • Trailing spaces at lines: 50, 171.

Also applies to: 10-10, 31-31, 50-50, 53-53, 105-105, 143-143, 171-171, 174-174


35-48: Tighten controller job matchers

As with other rules, consider anchoring the job regex and optionally scoping by namespace. This reduces false positives if similarly named jobs exist.

-          expr: absent(up{job=~'Kubeflow Notebook Controller Service Metrics'})
+          expr: absent(up{job=~'^Kubeflow Notebook Controller Service Metrics$'})
...
-          expr: absent(up{job=~'ODH Notebook Controller Service Metrics'})
+          expr: absent(up{job=~'^ODH Notebook Controller Service Metrics$'})

Follow-up:

  • Verify the actual job label values produced by the ServiceMonitors for these controllers. I can adjust selectors once the canonical labels are confirmed.
internal/controller/components/feastoperator/monitoring/feastoperator-prometheusrules.tmpl.yaml (1)

10-21: Severity consistency across components

Two shorter-window alerts are severity: critical here, while similar templates (e.g., llama stack) use warning. If this disparity is intentional, ignore. Otherwise, consider aligning severities for consistency across components.

Do you want me to standardize severities across all component PrometheusRule templates?

Also applies to: 23-34, 36-47

internal/controller/components/codeflare/monitoring/codeflare-prometheusrules.tmpl.yaml (2)

68-68: Remove trailing whitespace on blank line.

YAMLlint flags trailing spaces here.

-      
+

47-67: Optional: Add a component label to aid Alertmanager routing.

Consider adding labels:

  • component: codeflare
  • product: rhods or odh (as applicable)

This helps downstream routing/ownership without parsing alert names.

internal/controller/components/modelmeshserving/monitoring/modelmeshservice-prometheusrules.tmpl.yaml (3)

15-18: Simplify expressions: remove unnecessary sum by(instance) and use equality matchers

Recording rules already produce a single series per instance. Using sum(...) by (instance) is redundant. Equality matcher is also sufficient and marginally cheaper than regex.

-            sum(probe_success:burnrate5m{instance=~"modelmesh-controller"}) by (instance) > (14.40 * (1-0.98000))
+            probe_success:burnrate5m{instance="modelmesh-controller"} > (14.40 * (1-0.98000))
             and
-            sum(probe_success:burnrate1h{instance=~"modelmesh-controller"}) by (instance) > (14.40 * (1-0.98000))
+            probe_success:burnrate1h{instance="modelmesh-controller"} > (14.40 * (1-0.98000))
-            sum(probe_success:burnrate30m{instance=~"modelmesh-controller"}) by (instance) > (6.00 * (1-0.98000))
+            probe_success:burnrate30m{instance="modelmesh-controller"} > (6.00 * (1-0.98000))
             and
-            sum(probe_success:burnrate6h{instance=~"modelmesh-controller"}) by (instance) > (6.00 * (1-0.98000))
+            probe_success:burnrate6h{instance="modelmesh-controller"} > (6.00 * (1-0.98000))
-            sum(probe_success:burnrate2h{instance=~"modelmesh-controller"}) by (instance) > (3.00 * (1-0.98000))
+            probe_success:burnrate2h{instance="modelmesh-controller"} > (3.00 * (1-0.98000))
             and
-            sum(probe_success:burnrate1d{instance=~"modelmesh-controller"}) by (instance) > (3.00 * (1-0.98000))
+            probe_success:burnrate1d{instance="modelmesh-controller"} > (3.00 * (1-0.98000))

Also applies to: 27-30, 39-42


10-14: Differentiate the three alerts with a label or unique name for easier triage

All three alerts share the same name and differ only by expr and severity. Add a label (e.g., window=short|medium|long) or suffix the alert name to simplify routing, silencing, and dashboards.

Example:

-        - alert: Modelmesh Controller Probe Success Burn Rate
+        - alert: Modelmesh Controller Probe Success Burn Rate (short)
+          labels:
+            window: short
...
-        - alert: Modelmesh Controller Probe Success Burn Rate
+        - alert: Modelmesh Controller Probe Success Burn Rate (medium)
+          labels:
+            window: medium
...
-        - alert: Modelmesh Controller Probe Success Burn Rate
+        - alert: Modelmesh Controller Probe Success Burn Rate (long)
+          labels:
+            window: long

Also applies to: 22-26, 34-38


47-90: Sanity-check the base signal used for SLOs

You’re deriving SLOs from “up” via a synthetic probe_success. If modelmesh exposes a dedicated readiness/health metric, consider using it for better signal quality (up only measures scrape reachability).

If a better health metric exists (e.g., controller reconciliation errors, queue depth, or a health gauge), I can help propose an alternative set of recording rules and thresholds.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2376986 and c041af2.

📒 Files selected for processing (34)
  • api/services/v1alpha1/monitoring_types.go (3 hunks)
  • api/services/v1alpha1/zz_generated.deepcopy.go (2 hunks)
  • bundle/manifests/dscinitialization.opendatahub.io_dscinitializations.yaml (2 hunks)
  • bundle/manifests/opendatahub-operator.clusterserviceversion.yaml (3 hunks)
  • bundle/manifests/services.platform.opendatahub.io_monitorings.yaml (2 hunks)
  • config/crd/bases/dscinitialization.opendatahub.io_dscinitializations.yaml (2 hunks)
  • config/crd/bases/services.platform.opendatahub.io_monitorings.yaml (2 hunks)
  • config/rbac/role.yaml (2 hunks)
  • docs/api-overview.md (4 hunks)
  • internal/controller/components/codeflare/monitoring/codeflare-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/components.go (1 hunks)
  • internal/controller/components/dashboard/monitoring/dashboard-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/datasciencepipelines/monitoring/datasciencepipelines-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/feastoperator/monitoring/feastoperator-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/kserve/monitoring/kserve-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/kueue/monitoring/kueue-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/llamastackoperator/monitoring/llamastackoperator-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/modelcontroller/monitoring/modelcontroller-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/modelmeshserving/monitoring/modelmeshservice-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/modelregistry/monitoring/modelregistry-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/ray/monitoring/ray-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/trainingoperator/monitoring/trainingoperator-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/trustyai/monitoring/trustyai-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/workbenches/monitoring/workbenches-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/dscinitialization/dscinitialization_controller.go (1 hunks)
  • internal/controller/dscinitialization/kubebuilder_rbac.go (1 hunks)
  • internal/controller/services/monitoring/monitoring/operator-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/services/monitoring/monitoring_controller.go (2 hunks)
  • internal/controller/services/monitoring/monitoring_controller_actions.go (3 hunks)
  • internal/controller/services/monitoring/monitoring_controller_support.go (2 hunks)
  • internal/controller/status/status.go (2 hunks)
  • pkg/cluster/gvk/gvk.go (1 hunks)
  • pkg/common/common.go (2 hunks)
  • tests/e2e/monitoring_test.go (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (21)
  • internal/controller/dscinitialization/dscinitialization_controller.go
  • internal/controller/components/components.go
  • internal/controller/components/trainingoperator/monitoring/trainingoperator-prometheusrules.tmpl.yaml
  • internal/controller/components/ray/monitoring/ray-prometheusrules.tmpl.yaml
  • pkg/cluster/gvk/gvk.go
  • internal/controller/services/monitoring/monitoring_controller_support.go
  • config/crd/bases/services.platform.opendatahub.io_monitorings.yaml
  • internal/controller/components/modelcontroller/monitoring/modelcontroller-prometheusrules.tmpl.yaml
  • internal/controller/components/kserve/monitoring/kserve-prometheusrules.tmpl.yaml
  • internal/controller/services/monitoring/monitoring/operator-prometheusrules.tmpl.yaml
  • internal/controller/services/monitoring/monitoring_controller.go
  • tests/e2e/monitoring_test.go
  • api/services/v1alpha1/zz_generated.deepcopy.go
  • internal/controller/dscinitialization/kubebuilder_rbac.go
  • internal/controller/services/monitoring/monitoring_controller_actions.go
  • api/services/v1alpha1/monitoring_types.go
  • docs/api-overview.md
  • config/rbac/role.yaml
  • config/crd/bases/dscinitialization.opendatahub.io_dscinitializations.yaml
  • internal/controller/components/modelregistry/monitoring/modelregistry-prometheusrules.tmpl.yaml
  • internal/controller/components/datasciencepipelines/monitoring/datasciencepipelines-prometheusrules.tmpl.yaml
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-07-29T18:44:42.749Z
Learnt from: kahowell
PR: opendatahub-io/opendatahub-operator#2220
File: bundle.rhoai/manifests/rhods-operator.clusterserviceversion.yaml:1548-1550
Timestamp: 2025-07-29T18:44:42.749Z
Learning: In the opendatahub-operator repository's rhoai branch, the CSV manifest intentionally uses `quay.io/opendatahub/opendatahub-operator:latest` as the image reference even though it's for the rhods-operator, and this configuration difference from the main branch is by design.

Applied to files:

  • bundle/manifests/opendatahub-operator.clusterserviceversion.yaml
🪛 YAMLlint (1.37.1)
internal/controller/components/feastoperator/monitoring/feastoperator-prometheusrules.tmpl.yaml

[warning] 8-8: wrong indentation: expected 4 but found 6

(indentation)


[error] 22-22: trailing spaces

(trailing-spaces)


[error] 35-35: trailing spaces

(trailing-spaces)


[warning] 52-52: wrong indentation: expected 10 but found 8

(indentation)


[error] 57-57: trailing spaces

(trailing-spaces)


[error] 63-63: trailing spaces

(trailing-spaces)


[error] 69-69: trailing spaces

(trailing-spaces)


[error] 75-75: trailing spaces

(trailing-spaces)


[error] 81-81: trailing spaces

(trailing-spaces)


[error] 87-87: trailing spaces

(trailing-spaces)


[error] 93-93: trailing spaces

(trailing-spaces)

internal/controller/components/codeflare/monitoring/codeflare-prometheusrules.tmpl.yaml

[warning] 8-8: wrong indentation: expected 4 but found 6

(indentation)


[warning] 10-10: wrong indentation: expected 10 but found 8

(indentation)


[error] 68-68: trailing spaces

(trailing-spaces)


[warning] 72-72: wrong indentation: expected 10 but found 8

(indentation)


[warning] 110-110: wrong indentation: expected 10 but found 8

(indentation)

internal/controller/components/llamastackoperator/monitoring/llamastackoperator-prometheusrules.tmpl.yaml

[warning] 8-8: wrong indentation: expected 4 but found 6

(indentation)


[error] 22-22: trailing spaces

(trailing-spaces)


[error] 35-35: trailing spaces

(trailing-spaces)


[error] 48-48: trailing spaces

(trailing-spaces)


[error] 49-49: trailing spaces

(trailing-spaces)


[warning] 52-52: wrong indentation: expected 10 but found 8

(indentation)


[error] 57-57: trailing spaces

(trailing-spaces)


[error] 63-63: trailing spaces

(trailing-spaces)


[error] 69-69: trailing spaces

(trailing-spaces)


[error] 75-75: trailing spaces

(trailing-spaces)


[error] 81-81: trailing spaces

(trailing-spaces)


[error] 87-87: trailing spaces

(trailing-spaces)


[error] 93-93: trailing spaces

(trailing-spaces)

internal/controller/components/workbenches/monitoring/workbenches-prometheusrules.tmpl.yaml

[warning] 8-8: wrong indentation: expected 4 but found 6

(indentation)


[warning] 10-10: wrong indentation: expected 10 but found 8

(indentation)


[warning] 31-31: wrong indentation: expected 10 but found 8

(indentation)


[error] 50-50: trailing spaces

(trailing-spaces)


[warning] 53-53: wrong indentation: expected 10 but found 8

(indentation)


[warning] 105-105: wrong indentation: expected 10 but found 8

(indentation)


[warning] 143-143: wrong indentation: expected 10 but found 8

(indentation)


[error] 171-171: trailing spaces

(trailing-spaces)


[warning] 174-174: wrong indentation: expected 10 but found 8

(indentation)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Build/push catalog image
  • GitHub Check: Run tests and collect coverage on tests/integration
  • GitHub Check: golangci-lint
  • GitHub Check: Run tests and collect coverage on internal and pkg
🔇 Additional comments (11)
internal/controller/components/dashboard/monitoring/dashboard-prometheusrules.tmpl.yaml (1)

1-2: Confirm PrometheusRule API group monitoring.rhobs/v1 is supported in target clusters

I’ve verified the repository references and RBAC rules for both API groups:

• 14 PrometheusRule templates under internal/controller/.../monitoring/*-prometheusrules.tmpl.yaml use
apiVersion: monitoring.rhobs/v1
• Base config and some ServiceMonitor/PodMonitor templates use
apiVersion: monitoring.coreos.com/v1
• RBAC in config/rbac/role.yaml and the CSV manifest grants permissions for both monitoring.rhobs and monitoring.coreos.com
• No PrometheusRule CRDs are vendored—cluster must provide the CRD for whichever group is in use

Please confirm that RHOAI 2.24/ODH 2.33 target clusters have the PrometheusRule CRD installed for monitoring.rhobs/v1, or adjust these templates to use monitoring.coreos.com/v1 if that’s the supported API in your environment.

internal/controller/status/status.go (2)

108-109: LGTM: new ConditionAlertingAvailable

The new condition type aligns with existing condition naming and will help surface alerting availability clearly.


189-191: LGTM: “AlertingNotConfigured” reason/message

Consistent with Metrics/Traces constants. This will simplify status reporting in the monitoring controller.

bundle/manifests/services.platform.opendatahub.io_monitorings.yaml (2)

55-57: CRD: Add alerting under spec — LGTM

The new alerting object is correctly added under spec with description. This keeps the schema forward‑compatible.


195-197: CRD cross-field validation — LGTM

Validation ensuring alerting implies metrics is correct and well-scoped at the spec level.

bundle/manifests/dscinitialization.opendatahub.io_dscinitializations.yaml (2)

93-95: DSCI CRD: Add monitoring.alerting — LGTM

The alerting object is properly introduced under monitoring with a clear description.


247-249: DSCI CRD cross-field validation — LGTM

Validation mirrors Monitoring CRD: alerting requires metrics. Consistent and correct.

bundle/manifests/opendatahub-operator.clusterserviceversion.yaml (2)

1018-1041: RBAC: Add PrometheusRule permissions for monitoring.rhobs — LGTM

Create/update/list and finalizers/status for monitoring.rhobs PrometheusRule are correctly included, aligning with the new alerting flow.


1498-1498: Confirm image tag strategy for CSV deployment

The manager image is set to quay.io/opendatahub/opendatahub-operator:latest. Please confirm this matches the release process for the ODH CSV on main (some branches intentionally use latest; just validating this is intended here).

Would you like me to switch this to a pinned version or a CI-replaced placeholder (e.g., REPLACE_IMAGE) if that’s the repository standard?

internal/controller/components/codeflare/monitoring/codeflare-prometheusrules.tmpl.yaml (1)

1-1: Confirm CRD group apiVersion “monitoring.rhobs/v1” availability in target clusters.

Many clusters expose PrometheusRule at monitoring.coreos.com/v1. If monitoring.rhobs/v1 is not registered, these resources won’t reconcile.

Would you like me to scan the repo/manifests for the installed PrometheusRule GVKs and generate a compatibility matrix for ODH 2.33 and RHOAI 2.24?

internal/controller/components/modelmeshserving/monitoring/modelmeshservice-prometheusrules.tmpl.yaml (1)

1-2: PrometheusRule API group monitoring.rhobs/v1 is correct and consistent

All PrometheusRule templates and GVK registrations in this repo use monitoring.rhobs/v1:

  • pkg/cluster/gvk/gvk.go (lines 480–483) defines
    PrometheusRule = schema.GroupVersionKind{Group: "monitoring.rhobs", Version: "v1", Kind: "PrometheusRule"}
  • Only monitoring.rhobs/v1 appears in *.tmpl.yaml for PrometheusRule
  • Runtime guard via cluster.HasCRD(ctx, rr.Client, gvk.PrometheusRule) ensures we only create these resources if the CRD is installed

No change is required. If you need to support clusters exposing PrometheusRule under monitoring.coreos.com/v1, you’d have to update the GVK in pkg/cluster/gvk/gvk.go, the YAML templates, and any affected tests.

Comment on lines +124 to +128
// FileExists checks if a file exists in the embedded filesystem.
func FileExists(fsys embed.FS, path string) bool {
_, err := fs.Stat(fsys, path)
return err == nil
}
Copy link


🛠️ Refactor suggestion

Generalize FS type and avoid treating directories as files in FileExists

Accept fs.FS (interface) instead of embed.FS for broader applicability, and ensure directories don’t return true. This avoids false positives when a directory matches the path and eliminates tight coupling to embed.FS.

-func FileExists(fsys embed.FS, path string) bool {
-	_, err := fs.Stat(fsys, path)
-	return err == nil
-}
+func FileExists(fsys fs.FS, path string) bool {
+	info, err := fs.Stat(fsys, path)
+	if err != nil {
+		return false
+	}
+	return !info.IsDir()
+}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// FileExists checks if a file exists in the embedded filesystem.
func FileExists(fsys embed.FS, path string) bool {
_, err := fs.Stat(fsys, path)
return err == nil
}
// FileExists checks if a file exists in the embedded filesystem.
func FileExists(fsys fs.FS, path string) bool {
info, err := fs.Stat(fsys, path)
if err != nil {
return false
}
return !info.IsDir()
}
🤖 Prompt for AI Agents
In pkg/common/common.go around lines 124 to 128, change FileExists to accept the
fs.FS interface instead of embed.FS and ensure it returns false for directories:
call fs.Stat(fsys, path), if err != nil return false, otherwise check the
returned FileInfo and return !fi.IsDir(). Update any callers to pass an fs.FS
(embed.FS still implements fs.FS) and adjust imports if necessary.
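
A quick, self-contained check of the proposed signature (using testing/fstest rather than a real embedded tree; the template path is illustrative only):

package main

import (
	"fmt"
	"io/fs"
	"testing/fstest"
)

// fileExists mirrors the suggested helper: accepts any fs.FS and rejects directories.
func fileExists(fsys fs.FS, path string) bool {
	info, err := fs.Stat(fsys, path)
	if err != nil {
		return false
	}
	return !info.IsDir()
}

func main() {
	fsys := fstest.MapFS{
		"monitoring/operator-prometheusrules.tmpl.yaml": &fstest.MapFile{Data: []byte("apiVersion: monitoring.rhobs/v1")},
	}
	fmt.Println(fileExists(fsys, "monitoring/operator-prometheusrules.tmpl.yaml")) // true
	fmt.Println(fileExists(fsys, "monitoring"))                                    // false: a directory, not a file
}

Since embed.FS implements fs.FS, existing callers that pass an embedded filesystem keep compiling unchanged.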

@StevenTobin
Copy link
Contributor Author

/test opendatahub-operator-e2e

@StevenTobin StevenTobin force-pushed the RHOAIENG-25366_enable_built_in_alerting branch 2 times, most recently from c3ef7cf to 8ce120c Compare August 14, 2025 08:36
Copy link

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (3)
tests/e2e/monitoring_test.go (1)

642-664: Optional: assert rule ownership or basic content to increase signal

You currently assert existence. Consider asserting at least one group/label expected by your templates or ownerReferences to the Monitoring CR to catch template regressions.

If desired, I can propose concrete jq matchers for your current templates.

bundle/manifests/opendatahub-operator.clusterserviceversion.yaml (2)

1015-1027: RBAC for monitoring.rhobs/prometheusrules: consider adding deletecollection

You added create/read/update/delete for rhobs PrometheusRules. For parity with monitoring.coreos.com and to enable label-selective cleanup patterns, add deletecollection.

Apply this diff:

         - apiGroups:
           - monitoring.rhobs
           resources:
           - monitoringstacks
           - prometheusrules
           - servicemonitors
           verbs:
           - create
           - delete
+          - deletecollection
           - get
           - list
           - patch
           - update
           - watch

92-97: Expose alerting in ALM example for discoverability

Including "alerting": {} in the DSCInitialization example helps users discover the new feature via OLM UI.

Apply this diff:

           "monitoring": {
             "managementState": "Managed",
-            "metrics": {},
+            "metrics": {},
+            "alerting": {},
             "namespace": "opendatahub"
           },
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these settings in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between c3ef7cf and 8ce120c.

📒 Files selected for processing (4)
  • bundle/manifests/dscinitialization.opendatahub.io_dscinitializations.yaml (2 hunks)
  • bundle/manifests/opendatahub-operator.clusterserviceversion.yaml (2 hunks)
  • bundle/manifests/services.platform.opendatahub.io_monitorings.yaml (2 hunks)
  • tests/e2e/monitoring_test.go (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • bundle/manifests/services.platform.opendatahub.io_monitorings.yaml
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-13T21:04:32.696Z
Learnt from: kahowell
PR: opendatahub-io/opendatahub-operator#2220
File: .github/workflows/prometheus-unit-tests.yaml:5-5
Timestamp: 2025-08-13T21:04:32.696Z
Learning: In the opendatahub-operator repository, prometheus monitoring configurations are shared between ODH and RHOAI platforms. The prometheus directory only exists in odh-config/monitoring/prometheus/, not in rhoai-config. Both platforms reference the same prometheus manifests from the odh-config location.

Applied to files:

  • bundle/manifests/opendatahub-operator.clusterserviceversion.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Build/push catalog image
  • GitHub Check: Run tests and collect coverage on internal and pkg
  • GitHub Check: Run tests and collect coverage on tests/integration
🔇 Additional comments (3)
bundle/manifests/dscinitialization.opendatahub.io_dscinitializations.yaml (1)

93-96: Adding monitoring.alerting looks correct and well-scoped

Good addition and placement under spec.monitoring. Keeping it as a free-form object allows forward compatibility with future fields.

tests/e2e/monitoring_test.go (2)

51-53: New alerting test cases: good coverage and placement

Adding creation and deletion tests for PrometheusRule improves behavioral coverage and aligns with the PR objective to enable built-in alerting.


625-632: Ensure metrics is configured when enabling alerting (reduce flakiness)

CRD cross-field validation and controller behavior can reject or make alerting = {} flaky if .spec.monitoring.metrics isn't set. Make the test self-sufficient by configuring metrics in the same mutate pipeline.

  • File: tests/e2e/monitoring_test.go — around lines 625–632 (the EventuallyResourceCreatedOrUpdated call)

Apply this diff:

 tc.EventuallyResourceCreatedOrUpdated(
   WithMinimalObject(gvk.DSCInitialization, tc.DSCInitializationNamespacedName),
   WithMutateFunc(testf.TransformPipeline(
     testf.Transform(`.spec.monitoring.managementState = "%s"`, operatorv1.Managed),
+    // Ensure metrics is configured so alerting is valid and stack is present
+    setMonitoringMetrics(),
     testf.Transform(`.spec.monitoring.alerting = {}`),
   )),
 )

Comment on lines +678 to +681
tc.EnsureResourceGone(
WithMinimalObject(gvk.PrometheusRule, types.NamespacedName{Name: "dashboard-prometheusrules", Namespace: tc.AppsNamespace}),
)
}
Copy link


💡 Verification agent

🧩 Analysis chain

Use existing plural helper (and verify availability) + assert operator rules persist

Elsewhere in this file you use EnsureResourcesGone. For consistency and to avoid potential missing helper issues, prefer the plural form. Also, after removing the dashboard component, it’s valuable to assert that operator-level rules are still present when alerting remains enabled.

Apply this diff:

-// Ensure the dashboard-prometheusrules is deleted
-tc.EnsureResourceGone(
-  WithMinimalObject(gvk.PrometheusRule, types.NamespacedName{Name: "dashboard-prometheusrules", Namespace: tc.AppsNamespace}),
-)
+// Ensure the dashboard-specific PrometheusRule is deleted
+tc.EnsureResourcesGone(
+  WithMinimalObject(gvk.PrometheusRule, types.NamespacedName{Name: "dashboard-prometheusrules", Namespace: tc.AppsNamespace}),
+)
+
+// Operator-level rules should remain when alerting is still enabled
+tc.EnsureResourceExists(
+  WithMinimalObject(gvk.PrometheusRule, types.NamespacedName{Name: "operator-prometheusrules", Namespace: tc.AppsNamespace}),
+)

Run this script to verify whether EnsureResourceGone exists in the test helpers; if not, the change above avoids a build failure:


🏁 Script executed:

#!/bin/bash
# Check for EnsureResourceGone helper presence
rg -n --glob '!vendor' $'func \\(.*\\) EnsureResourceGone\\(' -A 2 || true
rg -n --glob '!vendor' $'EnsureResourceGone\\(' -A 2 || true

Length of output: 2409


🏁 Script executed:

#!/bin/bash
set -euo pipefail
echo "Searching for EnsureResourcesGone definition and usages..."
rg -n --glob '!vendor' $'func \\(.*\\) EnsureResourcesGone\\(' -A 2 || true
rg -n --glob '!vendor' $'EnsureResourcesGone\\(' -A 2 || true
echo
echo "Searching for EnsureResourceExists / related helpers..."
rg -n --glob '!vendor' $'func \\(.*\\) EnsureResourceExists\\(' -A 2 || true
rg -n --glob '!vendor' $'EnsureResourceExists\\(' -A 2 || true
rg -n --glob '!vendor' $'func \\(.*\\) EnsureResourcesExist\\(' -A 2 || true
rg -n --glob '!vendor' $'EnsureResourcesExist\\(' -A 2 || true
rg -n --glob '!vendor' $'func \\(.*\\) EnsureResourcesExists\\(' -A 2 || true
rg -n --glob '!vendor' $'EnsureResourcesExists\\(' -A 2 || true
echo
echo "Searching for PrometheusRule usages and any operator-prometheusrules literal..."
rg -n --glob '!vendor' -S $'PrometheusRule' -g '!**/vendor/**' || true
rg -n --glob '!vendor' -S $'operator-prometheusrules' -g '!**/vendor/**' || true

Length of output: 37728


🏁 Script executed:

#!/bin/bash
set -euo pipefail
echo "---- tests/e2e/monitoring_test.go: lines 600-700 ----"
sed -n '600,700p' tests/e2e/monitoring_test.go || true
echo
echo "---- tests/e2e/test_context_test.go: lines 320-400 ----"
sed -n '320,400p' tests/e2e/test_context_test.go || true

Length of output: 6842


Use plural helper and assert operator PrometheusRule remains

EnsureResourcesGone and EnsureResourceExists are present in tests/e2e/test_context_test.go, so switch to the plural helper for consistency and add an assertion that operator-prometheusrules still exists after removing the dashboard.

  • File: tests/e2e/monitoring_test.go — update ValidatePrometheusRuleDeletion (around lines 678–681).

Apply this diff:

-// Ensure the dashboard-prometheusrules is deleted
-tc.EnsureResourceGone(
-	WithMinimalObject(gvk.PrometheusRule, types.NamespacedName{Name: "dashboard-prometheusrules", Namespace: tc.AppsNamespace}),
-)
+// Ensure the dashboard-specific PrometheusRule is deleted
+tc.EnsureResourcesGone(
+	WithMinimalObject(gvk.PrometheusRule, types.NamespacedName{Name: "dashboard-prometheusrules", Namespace: tc.AppsNamespace}),
+)
+
+// Operator-level rules should remain when alerting is still enabled
+tc.EnsureResourceExists(
+	WithMinimalObject(gvk.PrometheusRule, types.NamespacedName{Name: "operator-prometheusrules", Namespace: tc.AppsNamespace}),
+)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
tc.EnsureResourceGone(
WithMinimalObject(gvk.PrometheusRule, types.NamespacedName{Name: "dashboard-prometheusrules", Namespace: tc.AppsNamespace}),
)
}
// Ensure the dashboard-specific PrometheusRule is deleted
tc.EnsureResourcesGone(
WithMinimalObject(gvk.PrometheusRule, types.NamespacedName{Name: "dashboard-prometheusrules", Namespace: tc.AppsNamespace}),
)
// Operator-level rules should remain when alerting is still enabled
tc.EnsureResourceExists(
WithMinimalObject(gvk.PrometheusRule, types.NamespacedName{Name: "operator-prometheusrules", Namespace: tc.AppsNamespace}),
)
}
🤖 Prompt for AI Agents
In tests/e2e/monitoring_test.go around lines 678 to 681 in
ValidatePrometheusRuleDeletion, switch the single-resource call to the plural
helper: replace tc.EnsureResourceGone(WithMinimalObject(gvk.PrometheusRule,
types.NamespacedName{Name: "dashboard-prometheusrules", Namespace:
tc.AppsNamespace})) with
tc.EnsureResourcesGone(WithMinimalObjects(gvk.PrometheusRule,
types.NamespacedName{Name: "dashboard-prometheusrules", Namespace:
tc.AppsNamespace})), and then add a call to assert the operator rule still
exists: tc.EnsureResourceExists(WithMinimalObject(gvk.PrometheusRule,
types.NamespacedName{Name: "operator-prometheusrules", Namespace:
tc.AppsNamespace})).

Copy link

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🧹 Nitpick comments (2)
internal/controller/components/modelmeshserving/monitoring/modelmeshserving-prometheusrules.tmpl.yaml (2)

15-18: Use exact-match label selector instead of regex for a fixed value.

instance=~"modelmesh-controller" is a regex match; use instance="modelmesh-controller" for precision and performance.

Apply this diff in each alert expression:

-            sum(probe_success:burnrate5m{instance=~"modelmesh-controller"}) by (instance) > (14.40 * (1-0.98000))
+            sum(probe_success:burnrate5m{instance="modelmesh-controller"}) by (instance) > (14.40 * (1-0.98000))
-            sum(probe_success:burnrate1h{instance=~"modelmesh-controller"}) by (instance) > (14.40 * (1-0.98000))
+            sum(probe_success:burnrate1h{instance="modelmesh-controller"}) by (instance) > (14.40 * (1-0.98000))

-            sum(probe_success:burnrate30m{instance=~"modelmesh-controller"}) by (instance) > (6.00 * (1-0.98000))
+            sum(probe_success:burnrate30m{instance="modelmesh-controller"}) by (instance) > (6.00 * (1-0.98000))
-            sum(probe_success:burnrate6h{instance=~"modelmesh-controller"}) by (instance) > (6.00 * (1-0.98000))
+            sum(probe_success:burnrate6h{instance="modelmesh-controller"}) by (instance) > (6.00 * (1-0.98000))

-            sum(probe_success:burnrate2h{instance=~"modelmesh-controller"}) by (instance) > (3.00 * (1-0.98000))
+            sum(probe_success:burnrate2h{instance="modelmesh-controller"}) by (instance) > (3.00 * (1-0.98000))
-            sum(probe_success:burnrate1d{instance=~"modelmesh-controller"}) by (instance) > (3.00 * (1-0.98000))
+            sum(probe_success:burnrate1d{instance="modelmesh-controller"}) by (instance) > (3.00 * (1-0.98000))

Also applies to: 27-30, 39-42


6-9: Consider adding standard labels/annotations for discoverability and routing.

Adding labels like app.kubernetes.io/component, app.kubernetes.io/part-of, and odh.component=modelmeshserving helps selection, ownership, and cleanup across components.

I can draft a consistent labeling scheme used by other component PrometheusRules in this PR so the monitoring controller can manage them uniformly.

Also applies to: 47-49

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these settings in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 8ce120c and 217dd60.

📒 Files selected for processing (1)
  • internal/controller/components/modelmeshserving/monitoring/modelmeshserving-prometheusrules.tmpl.yaml (1 hunks)
🔇 Additional comments (2)
internal/controller/components/modelmeshserving/monitoring/modelmeshserving-prometheusrules.tmpl.yaml (2)

10-14: Alert rule naming/structure LGTM; multi-window burn-rate thresholds align with 98% SLO.

Once the base series and selectors are fixed, these three windows and durations should behave as expected.

Also applies to: 22-26, 34-38


1-2: PrometheusRule API group — repo uses monitoring.rhobs/v1; original concern not applicable

Quick verification: most PrometheusRule manifests in the repo use monitoring.rhobs/v1; only one file uses monitoring.coreos.com/v1. The file under review is consistent with the repo standard.

Files found (apiVersion -> value):

  • config/monitoring/base/rhods-prometheusrules.yaml -> monitoring.coreos.com/v1
  • internal/controller/components/feastoperator/monitoring/feastoperator-prometheusrules.tmpl.yaml -> monitoring.rhobs/v1
  • internal/controller/components/codeflare/monitoring/codeflare-prometheusrules.tmpl.yaml -> monitoring.rhobs/v1
  • internal/controller/components/workbenches/monitoring/workbenches-prometheusrules.tmpl.yaml -> monitoring.rhobs/v1
  • internal/controller/components/ray/monitoring/ray-prometheusrules.tmpl.yaml -> monitoring.rhobs/v1
  • internal/controller/components/modelregistry/monitoring/modelregistry-prometheusrules.tmpl.yaml -> monitoring.rhobs/v1
  • internal/controller/components/trainingoperator/monitoring/trainingoperator-prometheusrules.tmpl.yaml -> monitoring.rhobs/v1
  • internal/controller/components/trustyai/monitoring/trustyai-prometheusrules.tmpl.yaml -> monitoring.rhobs/v1
  • internal/controller/components/modelmeshserving/monitoring/modelmeshserving-prometheusrules.tmpl.yaml -> monitoring.rhobs/v1
  • internal/controller/components/modelcontroller/monitoring/modelcontroller-prometheusrules.tmpl.yaml -> monitoring.rhobs/v1
  • internal/controller/components/datasciencepipelines/monitoring/datasciencepipelines-prometheusrules.tmpl.yaml -> monitoring.rhobs/v1
  • internal/controller/components/kueue/monitoring/kueue-prometheusrules.tmpl.yaml -> monitoring.rhobs/v1
  • internal/controller/components/kserve/monitoring/kserve-prometheusrules.tmpl.yaml -> monitoring.rhobs/v1
  • internal/controller/components/dashboard/monitoring/dashboard-prometheusrules.tmpl.yaml -> monitoring.rhobs/v1
  • internal/controller/components/llamastackoperator/monitoring/llamastackoperator-prometheusrules.tmpl.yaml -> monitoring.rhobs/v1
  • internal/controller/services/monitoring/monitoring/operator-prometheusrules.tmpl.yaml -> monitoring.rhobs/v1

Recommendation: If the intended target is the Prometheus Operator CRD (monitoring.coreos.com/v1), pick that as the canonical API and update the monitoring.rhobs/v1 files; otherwise, keep monitoring.rhobs/v1 and ignore the original suggestion.

Likely an incorrect or invalid review comment.

@StevenTobin
Copy link
Contributor Author

/test opendatahub-operator-e2e

1 similar comment
@StevenTobin
Copy link
Contributor Author

/test opendatahub-operator-e2e

Copy link

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

♻️ Duplicate comments (12)
internal/controller/components/workbenches/monitoring/workbenches-prometheusrules.tmpl.yaml (3)

14-14: Align PVC regex selectors between numerator and denominator to avoid mismatched series

The used/capacity selectors use different regexes, which can silently drop series during division. Make them identical.

Apply:

-          expr: kubelet_volume_stats_used_bytes{persistentvolumeclaim=~".*jupyterhub-nb-.*"} / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"jupyterhub-nb-.*"} > 0.9 and kubelet_volume_stats_used_bytes{persistentvolumeclaim=~".*jupyterhub-nb-.*"} / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"jupyterhub-nb-.*"} < 0.99
+          expr: kubelet_volume_stats_used_bytes{persistentvolumeclaim=~"jupyterhub-nb-.*"} / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"jupyterhub-nb-.*"} > 0.9 and kubelet_volume_stats_used_bytes{persistentvolumeclaim=~"jupyterhub-nb-.*"} / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"jupyterhub-nb-.*"} < 0.99

53-100: Duplicate alert names in one group will prevent the entire rule file from loading—make them unique

All four alerts share the same name in the group SLOs-probe_success_workbench. Prometheus rejects duplicate alert names within a group; the whole group (and often file) will fail to load.

Suffix the alert names (and optionally summaries) with the window pair:

-        - alert: RHODS Jupyter Probe Success Burn Rate
+        - alert: RHODS Jupyter Probe Success Burn Rate 5m_1h
...
-        - alert: RHODS Jupyter Probe Success Burn Rate
+        - alert: RHODS Jupyter Probe Success Burn Rate 30m_6h
...
-        - alert: RHODS Jupyter Probe Success Burn Rate
+        - alert: RHODS Jupyter Probe Success Burn Rate 2h_1d
...
-        - alert: RHODS Jupyter Probe Success Burn Rate
+        - alert: RHODS Jupyter Probe Success Burn Rate 6h_3d

Optional, mirror the suffix in summary: to aid triage.
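
For instance, the first rule could be renamed like this (sketch; assumes the summary lives under annotations; expr, for, and labels stay unchanged):

        - alert: RHODS Jupyter Probe Success Burn Rate 5m_1h
          annotations:
            summary: RHODS Jupyter Probe Success Burn Rate (5m/1h windows)
          # expr, for, and labels unchanged from the original 5m/1h alert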

To validate there are no remaining duplicates and no single-quoted label selectors, run:

#!/bin/bash
set -euo pipefail
f='internal/controller/components/workbenches/monitoring/workbenches-prometheusrules.tmpl.yaml'

echo "Duplicate alert names within SLO group:"
awk '
  $0 ~ /- name: SLOs-probe_success_workbench/ { ingrp=1; next }
  ingrp && $0 ~ /- name:/ { ingrp=0 }
  ingrp && $0 ~ /^[[:space:]]*- alert:/ { name=$0; gsub(/.*- alert:[[:space:]]*/,"",name); c[name]++ }
  END { for (k in c) if (c[k]>1) printf "%s -> %d\n", k, c[k] }
' "$f" || true

echo "Single-quoted label selectors (invalid in PromQL):"
rg -n --color=never "=[~!]?'[^']+'" "$f" || true

174-177: PromQL type error: label_replace applied to a scalar; use vector form with aggregation

label_replace(min(probe_success{...}), ...) applies label_replace to a scalar (since min(...) without by() collapses to a scalar). This fails at evaluation time. Keep it as a single-element vector using min by () and then apply label_replace. Also keep the per-name vector via min by (name).

-        - expr: ((min(probe_success{name=~"rhods-dashboard|notebook-spawner"}) by (name) or on() vector(0)) or label_replace(min(probe_success{name=~"rhods-dashboard|notebook-spawner"}), "name", "combined", "name", ".*"))
+        - expr: |
+            min by (name) (probe_success{name=~"rhods-dashboard|notebook-spawner"})
+              or
+            label_replace(
+              min by () (probe_success{name=~"rhods-dashboard|notebook-spawner"}),
+              "name", "combined", "", ""
+            )

Optional: if these probes are scraped under a dedicated job, constrain with job="user_facing_endpoints_status_workbenches" to avoid mixing unrelated series.
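
For example, the combined recording could be scoped like this (sketch under that assumption):

        - expr: |
            min by (name) (probe_success{job="user_facing_endpoints_status_workbenches", name=~"rhods-dashboard|notebook-spawner"})
              or
            label_replace(
              min by () (probe_success{job="user_facing_endpoints_status_workbenches", name=~"rhods-dashboard|notebook-spawner"}),
              "name", "combined", "", ""
            )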

internal/controller/components/codeflare/monitoring/codeflare-prometheusrules.tmpl.yaml (2)

20-21: Template the namespace label instead of hardcoding. [duplicate]

Hardcoding the alert label namespace breaks multi-namespace installs. Use the templated namespace to mirror metadata.namespace.

Apply this diff at each label block:

           labels:
             severity: info
-            namespace: redhat-ods-applications
+            namespace: {{.Namespace}}

And similarly for the other occurrences on Lines 32-33, 44-45, 52-55, and 61-64.

Also applies to: 32-33, 44-45, 52-55, 61-64


8-10: Correct YAML indentation and remove trailing spaces. [duplicate]

YAMLlint errors indicate mis-indentation and trailing whitespace that will cause parsing or lint failures. Align list items and keys to standard PrometheusRule structure.

Apply these diffs:

-      - name: SLOs-probe_success_codeflare
+    - name: SLOs-probe_success_codeflare
-        rules:
+    rules:
-        - alert: CodeFlare Operator Probe Success Burn Rate
+      - alert: CodeFlare Operator Probe Success Burn Rate
-      
+
-        - expr: |
+      - expr: |
-        - expr: |
+      - expr: |

Note: The intent is:

  • spec
    • groups
      • name
      • rules
        • alert/expr/etc.

Please re-run yamllint after changes.

Also applies to: 68-68, 72-72, 110-110
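
For reference, a minimal skeleton with the indentation yamllint expects (alert name and expression are placeholders):

spec:
  groups:
    - name: SLOs-probe_success_codeflare
      rules:
        - alert: ExampleAlert
          expr: vector(1)
          for: 2m
          labels:
            severity: info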

internal/controller/components/llamastackoperator/monitoring/llamastackoperator-prometheusrules.tmpl.yaml (2)

49-49: Remove trailing space in comment.

-      # RecordingRules for Llama Stack 
+      # RecordingRules for Llama Stack

52-56: PromQL: compute probe_success from up; current expression can mask outages.

absent(up)*0 or vector(1) returns 1 if the series merely exists, even when up==0 (down), hiding outages. Record probe_success from the actual up metric, scoped to the namespace, and coalesce missing to 0.

-      - expr: |
-          absent(up{job="llama-stack-k8s-operator"}) * 0 or vector(1)
+      - expr: |
+          avg without (pod, endpoint, instance) (
+            up{job="llama-stack-k8s-operator", namespace="{{.Namespace}}"}
+          ) or on() vector(0)
         labels:
           instance: llama-stack-k8s-operator-controller-manager
         record: probe_success
internal/controller/components/feastoperator/monitoring/feastoperator-prometheusrules.tmpl.yaml (3)

8-8: Fix indentation under spec.groups

YAMLlint flags the list item under spec.groups as over-indented (6 spaces). It should be indented by 4 spaces to be a peer of subsequent groups.

Apply this diff:

-      - name: SLOs-probe_success_feast_operator
+    - name: SLOs-probe_success_feast_operator

52-52: Correct indentation for rule items under rules

The list item under rules should be indented by 10 spaces (currently 8). This is a common source of YAML parsing errors.

Apply this diff:

-        - expr: |
+          - expr: |

22-22: Remove trailing whitespace to satisfy YAML linters

Multiple blank lines contain trailing spaces. Clean them up to prevent lint/pipeline failures.

Apply this diff:

-    
+
-    
+
-    
+
-    
+
-    
+
-    
+
-    
+
-    
+
-    
+

Also applies to: 35-35, 57-57, 63-63, 69-69, 75-75, 81-81, 87-87, 93-93

internal/controller/components/datasciencepipelines/monitoring/datasciencepipelines-prometheusrules.tmpl.yaml (2)

20-21: Use appropriate severity (warning/critical) for burn-rate alerts (not info).

These are SLO violations; “info” is too low for actionable burn-rate alerts and will likely be ignored by routing/pagers. Align with other components (page on fast-burn, ticket on slow-burn).

Example mapping:

  • 5m/1h: critical
  • 30m/6h: critical or warning (choose per policy)
  • 2h/1d: warning
  • 6h/3d: warning

Do you want me to generate an exact diff with your preferred mapping?

Also applies to: 32-33, 44-45, 56-57
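
A minimal sketch of that mapping on the labels blocks (alert names elided; pick severities per your paging policy):

          labels:
            severity: critical    # 5m/1h fast burn
          labels:
            severity: critical    # 30m/6h
          labels:
            severity: warning     # 2h/1d
          labels:
            severity: warning     # 6h/3d slow burn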


98-133: Correct PromQL syntax in readiness alerts and address message vs timing mismatch.

  • PromQL: Aggregator must include the by clause: min by (...) (expr).
  • The message says “more than 5 minutes” but the condition is effectively 3m window + 2m for. Align text or timings.

Apply the syntax fix:

-          expr: min(max_over_time(data_science_pipelines_application_ready[3m])) by (dspa_name, dspa_namespace) == 0
+          expr: min by (dspa_name, dspa_namespace) (max_over_time(data_science_pipelines_application_ready[3m])) == 0

Repeat the same change for APIServer, PersistenceAgent, and ScheduledWorkflows expressions in this group.

For the timing, either:

  • Update “5 minutes” → “3 minutes”, or
  • Use a 5m range and adjust for to match.

Example to truly cover 5 minutes:

-          expr: min by (dspa_name, dspa_namespace) (max_over_time(data_science_pipelines_application_ready[3m])) == 0
-          for: 2m
+          expr: min by (dspa_name, dspa_namespace) (max_over_time(data_science_pipelines_application_ready[5m])) == 0
+          for: 1m
🧹 Nitpick comments (12)
internal/controller/components/workbenches/monitoring/workbenches-prometheusrules.tmpl.yaml (2)

23-23: PromQL: normalize regex and spacing in the 100% PVC usage alert

Keep the same anchored regex in both operands and add spacing for readability.

-          expr: kubelet_volume_stats_used_bytes{persistentvolumeclaim=~".*jupyterhub-nb-.*"}/kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"jupyterhub-nb-.*"} > 0.99
+          expr: kubelet_volume_stats_used_bytes{persistentvolumeclaim=~"jupyterhub-nb-.*"} / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"jupyterhub-nb-.*"} > 0.99

8-10: YAML lint: indentation and trailing spaces

YAMLlint flags indentation style and trailing spaces. Not functionally breaking, but it may fail CI/style checks. Recommend normalizing indent to 2 spaces under groups: and removing trailing spaces.

Minimal trailing-space cleanup:

-      
+

And:

-            
+

Consider running your repo’s YAML formatter/linter to auto-fix the remaining indentation warnings.

Also applies to: 31-31, 50-50, 53-53, 105-105, 143-143, 171-171, 174-174

internal/controller/components/codeflare/monitoring/codeflare-prometheusrules.tmpl.yaml (2)

69-71: Fix mismatched comment header for recording rules.

The comment says “RecordingRules for CodeFlare Operator” but the following group is “SLOs - MCAD Controller”. Update the comment to avoid confusion.

-      # RecordingRules for CodeFlare Operator
+      # RecordingRules for MCAD Controller

47-59: Optional: Aggregate multiple targets and use explicit down condition.

If multiple CodeFlare Operator targets exist, the current rule may produce multiple alerts or miss partial-down scenarios. Consider aggregating and alerting when all targets are down.

Example (optional):

  • Use sum(up{job=~"CodeFlare Operator"}) == 0
  • Or min_over_time(sum by (job)(up{job=~"CodeFlare Operator"})[2m:]) == 0 for stability (note the subquery colon)

Not blocking; offered for robustness.
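
A rough sketch of the aggregated variant (assumes the job label really is "CodeFlare Operator"; the for duration and labels are illustrative):

          - alert: CodeFlare Operator is not running
            expr: sum(up{job=~"CodeFlare Operator"}) == 0 or absent(up{job=~"CodeFlare Operator"})
            for: 2m
            labels:
              severity: info
              namespace: {{.Namespace}}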

internal/controller/components/llamastackoperator/monitoring/llamastackoperator-prometheusrules.tmpl.yaml (2)

15-21: Prefer exact label match over regex and keep selectors minimal.

Use instance="..." instead of instance=~"..." since the value is fixed; regex adds overhead and potential surprises.

-              sum(probe_success:burnrate5m{instance=~"llama-stack-k8s-operator-controller-manager"}) by (instance) > (14.40 * (1-0.98000))
+              sum(probe_success:burnrate5m{instance="llama-stack-k8s-operator-controller-manager"}) by (instance) > (14.40 * (1-0.98000))
@@
-              sum(probe_success:burnrate1h{instance=~"llama-stack-k8s-operator-controller-manager"}) by (instance) > (14.40 * (1-0.98000))
+              sum(probe_success:burnrate1h{instance="llama-stack-k8s-operator-controller-manager"}) by (instance) > (14.40 * (1-0.98000))
@@
-              sum(probe_success:burnrate30m{instance=~"llama-stack-k8s-operator-controller-manager"}) by (instance) > (6.00 * (1-0.98000))
+              sum(probe_success:burnrate30m{instance="llama-stack-k8s-operator-controller-manager"}) by (instance) > (6.00 * (1-0.98000))
@@
-              sum(probe_success:burnrate6h{instance=~"llama-stack-k8s-operator-controller-manager"}) by (instance) > (6.00 * (1-0.98000))
+              sum(probe_success:burnrate6h{instance="llama-stack-k8s-operator-controller-manager"}) by (instance) > (6.00 * (1-0.98000))
@@
-              sum(probe_success:burnrate2h{instance=~"llama-stack-k8s-operator-controller-manager"}) by (instance) > (3.00 * (1-0.98000))
+              sum(probe_success:burnrate2h{instance="llama-stack-k8s-operator-controller-manager"}) by (instance) > (3.00 * (1-0.98000))
@@
-              sum(probe_success:burnrate1d{instance=~"llama-stack-k8s-operator-controller-manager"}) by (instance) > (3.00 * (1-0.98000))
+              sum(probe_success:burnrate1d{instance="llama-stack-k8s-operator-controller-manager"}) by (instance) > (3.00 * (1-0.98000))

Also applies to: 28-34, 41-47


8-8: Unify group naming style for readability.

Group names mix hyphenated and spaced styles:

  • SLOs-probe_success_llama_stack_k8s_operator
  • SLOs - Llama Stack K8s Operator

Pick one convention (suggest the spaced form) for consistency.

-    - name: SLOs-probe_success_llama_stack_k8s_operator
+    - name: SLOs - Llama Stack K8s Operator (Alerts)
@@
-  - name: SLOs - Llama Stack K8s Operator
+  - name: SLOs - Llama Stack K8s Operator (RecordingRules)

Also applies to: 50-50

internal/controller/components/feastoperator/monitoring/feastoperator-prometheusrules.tmpl.yaml (2)

15-17: Simplify PromQL by removing unnecessary sum/by and regex equality

These selectors return a single series per instance; sum by (instance) and =~ can be dropped for clarity and minor eval savings.

Apply this diff:

-              sum(probe_success:burnrate5m{instance=~"feast-operator-controller-manager"}) by (instance) > (14.40 * (1-0.98000))
+              probe_success:burnrate5m{instance="feast-operator-controller-manager"} > (14.40 * (1-0.98000))
               and
-              sum(probe_success:burnrate1h{instance=~"feast-operator-controller-manager"}) by (instance) > (14.40 * (1-0.98000))
+              probe_success:burnrate1h{instance="feast-operator-controller-manager"} > (14.40 * (1-0.98000))
-              sum(probe_success:burnrate30m{instance=~"feast-operator-controller-manager"}) by (instance) > (6.00 * (1-0.98000))
+              probe_success:burnrate30m{instance="feast-operator-controller-manager"} > (6.00 * (1-0.98000))
               and
-              sum(probe_success:burnrate6h{instance=~"feast-operator-controller-manager"}) by (instance) > (6.00 * (1-0.98000))
+              probe_success:burnrate6h{instance="feast-operator-controller-manager"} > (6.00 * (1-0.98000))
-              sum(probe_success:burnrate2h{instance=~"feast-operator-controller-manager"}) by (instance) > (3.00 * (1-0.98000))
+              probe_success:burnrate2h{instance="feast-operator-controller-manager"} > (3.00 * (1-0.98000))
               and
-              sum(probe_success:burnrate1d{instance=~"feast-operator-controller-manager"}) by (instance) > (3.00 * (1-0.98000))
+              probe_success:burnrate1d{instance="feast-operator-controller-manager"} > (3.00 * (1-0.98000))

Also applies to: 28-30, 41-43


59-62: Drop redundant min() around avg_over_time()

avg_over_time() already returns a single series for this selector; min() is unnecessary noise.

Apply this diff:

-            1 - min(avg_over_time(probe_success{instance="feast-operator-controller-manager"}[1d]))
+            1 - avg_over_time(probe_success{instance="feast-operator-controller-manager"}[1d])
-            1 - min(avg_over_time(probe_success{instance="feast-operator-controller-manager"}[1h]))
+            1 - avg_over_time(probe_success{instance="feast-operator-controller-manager"}[1h])
-            1 - min(avg_over_time(probe_success{instance="feast-operator-controller-manager"}[2h]))
+            1 - avg_over_time(probe_success{instance="feast-operator-controller-manager"}[2h])
-            1 - min(avg_over_time(probe_success{instance="feast-operator-controller-manager"}[30m]))
+            1 - avg_over_time(probe_success{instance="feast-operator-controller-manager"}[30m])
-            1 - min(avg_over_time(probe_success{instance="feast-operator-controller-manager"}[3d]))
+            1 - avg_over_time(probe_success{instance="feast-operator-controller-manager"}[3d])
-            1 - min(avg_over_time(probe_success{instance="feast-operator-controller-manager"}[5m]))
+            1 - avg_over_time(probe_success{instance="feast-operator-controller-manager"}[5m])
-            1 - min(avg_over_time(probe_success{instance="feast-operator-controller-manager"}[6h]))
+            1 - avg_over_time(probe_success{instance="feast-operator-controller-manager"}[6h])

Also applies to: 65-68, 71-74, 77-80, 83-86, 89-92, 95-98

internal/controller/components/datasciencepipelines/monitoring/datasciencepipelines-prometheusrules.tmpl.yaml (4)

98-129: Make alert names consistent: use “Data Science Pipelines …” (plural) throughout.

Alert names in this section mix singular/plural. Use “Pipelines” consistently.

-        - alert: Data Science Pipeline Application Unavailable
+        - alert: Data Science Pipelines Application Unavailable
...
-        - alert: Data Science Pipeline APIServer Unavailable
+        - alert: Data Science Pipelines APIServer Unavailable
...
-        - alert: Data Science Pipeline PersistenceAgent Unavailable
+        - alert: Data Science Pipelines PersistenceAgent Unavailable
...
-        - alert: Data Science Pipeline ScheduledWorkflows Unavailable
+        - alert: Data Science Pipelines ScheduledWorkflows Unavailable

19-33: Avoid hard-coding namespace label; template it to the target namespace.

Hard-coding ‘redhat-ods-applications’ risks misrouting alerts when the operator is installed elsewhere. Use the templated namespace to keep routing consistent with deployment.

           labels:
             severity: info
-            namespace: redhat-ods-applications
+            namespace: {{ .Namespace }}
...
           labels:
             severity: info
-            namespace: redhat-ods-applications
+            namespace: {{ .Namespace }}

Repeat within this group for the other two rules (lines 44-45 and 56-57).


69-95: Template the namespace label here as well.

           labels:
             severity: critical
-            namespace: redhat-ods-applications
+            namespace: {{ .Namespace }}
...
           labels:
             severity: critical
-            namespace: redhat-ods-applications
+            namespace: {{ .Namespace }}
...
           labels:
             severity: warning
-            namespace: redhat-ods-applications
+            namespace: {{ .Namespace }}

104-133: Template the namespace label for readiness alerts.

           labels:
             severity: info
-            namespace: redhat-ods-applications
+            namespace: {{ .Namespace }}
...
           labels:
             severity: info
-            namespace: redhat-ods-applications
+            namespace: {{ .Namespace }}
...
           labels:
             severity: info
-            namespace: redhat-ods-applications
+            namespace: {{ .Namespace }}
...
           labels:
             severity: info
-            namespace: redhat-ods-applications
+            namespace: {{ .Namespace }}
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 2d5ac29 and 3fdf7b8.

📒 Files selected for processing (34)
  • api/services/v1alpha1/monitoring_types.go (3 hunks)
  • api/services/v1alpha1/zz_generated.deepcopy.go (2 hunks)
  • bundle/manifests/dscinitialization.opendatahub.io_dscinitializations.yaml (2 hunks)
  • bundle/manifests/opendatahub-operator.clusterserviceversion.yaml (2 hunks)
  • bundle/manifests/services.platform.opendatahub.io_monitorings.yaml (2 hunks)
  • config/crd/bases/dscinitialization.opendatahub.io_dscinitializations.yaml (2 hunks)
  • config/crd/bases/services.platform.opendatahub.io_monitorings.yaml (2 hunks)
  • config/rbac/role.yaml (2 hunks)
  • docs/api-overview.md (4 hunks)
  • internal/controller/components/codeflare/monitoring/codeflare-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/components.go (1 hunks)
  • internal/controller/components/dashboard/monitoring/dashboard-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/datasciencepipelines/monitoring/datasciencepipelines-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/feastoperator/monitoring/feastoperator-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/kserve/monitoring/kserve-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/kueue/monitoring/kueue-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/llamastackoperator/monitoring/llamastackoperator-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/modelcontroller/monitoring/modelcontroller-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/modelmeshserving/monitoring/modelmeshserving-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/modelregistry/monitoring/modelregistry-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/ray/monitoring/ray-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/trainingoperator/monitoring/trainingoperator-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/trustyai/monitoring/trustyai-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/workbenches/monitoring/workbenches-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/dscinitialization/dscinitialization_controller.go (1 hunks)
  • internal/controller/dscinitialization/kubebuilder_rbac.go (1 hunks)
  • internal/controller/services/monitoring/monitoring/operator-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/services/monitoring/monitoring_controller.go (2 hunks)
  • internal/controller/services/monitoring/monitoring_controller_actions.go (3 hunks)
  • internal/controller/services/monitoring/monitoring_controller_support.go (2 hunks)
  • internal/controller/status/status.go (2 hunks)
  • pkg/cluster/gvk/gvk.go (1 hunks)
  • pkg/common/common.go (2 hunks)
  • tests/e2e/monitoring_test.go (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (29)
  • internal/controller/dscinitialization/dscinitialization_controller.go
  • internal/controller/services/monitoring/monitoring_controller_support.go
  • internal/controller/services/monitoring/monitoring_controller.go
  • internal/controller/components/ray/monitoring/ray-prometheusrules.tmpl.yaml
  • pkg/common/common.go
  • pkg/cluster/gvk/gvk.go
  • internal/controller/components/modelcontroller/monitoring/modelcontroller-prometheusrules.tmpl.yaml
  • internal/controller/components/components.go
  • internal/controller/components/kserve/monitoring/kserve-prometheusrules.tmpl.yaml
  • internal/controller/components/trainingoperator/monitoring/trainingoperator-prometheusrules.tmpl.yaml
  • internal/controller/components/modelregistry/monitoring/modelregistry-prometheusrules.tmpl.yaml
  • internal/controller/components/trustyai/monitoring/trustyai-prometheusrules.tmpl.yaml
  • api/services/v1alpha1/zz_generated.deepcopy.go
  • config/crd/bases/services.platform.opendatahub.io_monitorings.yaml
  • internal/controller/services/monitoring/monitoring_controller_actions.go
  • bundle/manifests/opendatahub-operator.clusterserviceversion.yaml
  • internal/controller/status/status.go
  • internal/controller/services/monitoring/monitoring/operator-prometheusrules.tmpl.yaml
  • config/crd/bases/dscinitialization.opendatahub.io_dscinitializations.yaml
  • bundle/manifests/dscinitialization.opendatahub.io_dscinitializations.yaml
  • tests/e2e/monitoring_test.go
  • bundle/manifests/services.platform.opendatahub.io_monitorings.yaml
  • config/rbac/role.yaml
  • internal/controller/components/modelmeshserving/monitoring/modelmeshserving-prometheusrules.tmpl.yaml
  • internal/controller/dscinitialization/kubebuilder_rbac.go
  • api/services/v1alpha1/monitoring_types.go
  • internal/controller/components/kueue/monitoring/kueue-prometheusrules.tmpl.yaml
  • docs/api-overview.md
  • internal/controller/components/dashboard/monitoring/dashboard-prometheusrules.tmpl.yaml
🧰 Additional context used
🪛 YAMLlint (1.37.1)
internal/controller/components/codeflare/monitoring/codeflare-prometheusrules.tmpl.yaml

[warning] 8-8: wrong indentation: expected 4 but found 6

(indentation)


[warning] 10-10: wrong indentation: expected 10 but found 8

(indentation)


[error] 68-68: trailing spaces

(trailing-spaces)


[warning] 72-72: wrong indentation: expected 10 but found 8

(indentation)


[warning] 110-110: wrong indentation: expected 10 but found 8

(indentation)

internal/controller/components/feastoperator/monitoring/feastoperator-prometheusrules.tmpl.yaml

[warning] 8-8: wrong indentation: expected 4 but found 6

(indentation)


[error] 22-22: trailing spaces

(trailing-spaces)


[error] 35-35: trailing spaces

(trailing-spaces)


[warning] 52-52: wrong indentation: expected 10 but found 8

(indentation)


[error] 57-57: trailing spaces

(trailing-spaces)


[error] 63-63: trailing spaces

(trailing-spaces)


[error] 69-69: trailing spaces

(trailing-spaces)


[error] 75-75: trailing spaces

(trailing-spaces)


[error] 81-81: trailing spaces

(trailing-spaces)


[error] 87-87: trailing spaces

(trailing-spaces)


[error] 93-93: trailing spaces

(trailing-spaces)

internal/controller/components/llamastackoperator/monitoring/llamastackoperator-prometheusrules.tmpl.yaml

[warning] 8-8: wrong indentation: expected 4 but found 6

(indentation)


[error] 22-22: trailing spaces

(trailing-spaces)


[error] 35-35: trailing spaces

(trailing-spaces)


[error] 48-48: trailing spaces

(trailing-spaces)


[error] 49-49: trailing spaces

(trailing-spaces)


[warning] 52-52: wrong indentation: expected 10 but found 8

(indentation)


[error] 57-57: trailing spaces

(trailing-spaces)


[error] 63-63: trailing spaces

(trailing-spaces)


[error] 69-69: trailing spaces

(trailing-spaces)


[error] 75-75: trailing spaces

(trailing-spaces)


[error] 81-81: trailing spaces

(trailing-spaces)


[error] 87-87: trailing spaces

(trailing-spaces)


[error] 93-93: trailing spaces

(trailing-spaces)

internal/controller/components/workbenches/monitoring/workbenches-prometheusrules.tmpl.yaml

[warning] 8-8: wrong indentation: expected 4 but found 6

(indentation)


[warning] 10-10: wrong indentation: expected 10 but found 8

(indentation)


[warning] 31-31: wrong indentation: expected 10 but found 8

(indentation)


[error] 50-50: trailing spaces

(trailing-spaces)


[warning] 53-53: wrong indentation: expected 10 but found 8

(indentation)


[warning] 105-105: wrong indentation: expected 10 but found 8

(indentation)


[warning] 143-143: wrong indentation: expected 10 but found 8

(indentation)


[error] 171-171: trailing spaces

(trailing-spaces)


[warning] 174-174: wrong indentation: expected 10 but found 8

(indentation)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Build/push catalog image
  • GitHub Check: golangci-lint
  • GitHub Check: Run tests and collect coverage on internal and pkg
🔇 Additional comments (3)
internal/controller/components/workbenches/monitoring/workbenches-prometheusrules.tmpl.yaml (1)

102-139: Recording rules set looks consistent and follows burn-rate conventions

Windows, record names, and labels align with the burn-rate alerts above. No issues spotted.

internal/controller/components/codeflare/monitoring/codeflare-prometheusrules.tmpl.yaml (1)

51-51: Fix PromQL quoting and matcher operators in CodeFlare Operator alerts

File: internal/controller/components/codeflare/monitoring/codeflare-prometheusrules.tmpl.yaml
Lines: 51, 60

Replace the single-quoted label matchers with double-quoted ones for consistency and standardize the operator to =~ for both alerts:

--- a/internal/controller/components/codeflare/monitoring/codeflare-prometheusrules.tmpl.yaml
+++ b/internal/controller/components/codeflare/monitoring/codeflare-prometheusrules.tmpl.yaml
@@ -51,7 +51,7 @@
-            expr: absent(up{job=~'CodeFlare Operator'}) or up{job=~'CodeFlare Operator'} != 1
+            expr: absent(up{job=~"CodeFlare Operator"}) or up{job=~"CodeFlare Operator"} != 1

@@ -60,7 +60,7 @@
-            expr: absent_over_time(up{job="CodeFlare Operator"}[2m]) == 1
+            expr: absent_over_time(up{job=~"CodeFlare Operator"}[2m]) == 1

Please confirm what the job label actually is in your ServiceMonitor/scrape config (if you haven’t set jobLabel, the job label defaults to the target Service’s name, and the generated scrape job is named serviceMonitor/<namespace>/<servicemonitor-name>/<endpoint>). If the label value has no spaces (e.g. codeflare-operator), switch to a regex like:

expr: absent(up{job=~"codeflare-operator.*"}) or up{job=~"codeflare-operator.*"} != 1
internal/controller/components/feastoperator/monitoring/feastoperator-prometheusrules.tmpl.yaml (1)

1-7: Overall: solid SLO burn-rate pattern and templating

The rule set follows the standard multi-window, multi-burn SLO approach, uses a clear instance scoping label, and integrates cleanly with the templated namespace. Once the YAML formatting nits are fixed, this looks good to ship.

Also applies to: 10-21, 23-47, 49-51, 58-99

run make bundle

fix_rules_file_name

fix_error_in_dsp_monitoring

fix e2e test

run make lint
@StevenTobin StevenTobin force-pushed the RHOAIENG-25366_enable_built_in_alerting branch from 3fdf7b8 to 93a65af Compare August 18, 2025 07:54
Copy link

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🔭 Outside diff range comments (1)
internal/controller/services/monitoring/monitoring_controller_actions.go (1)

89-104: Guard against missing componentRules entries before updating Prometheus config

Indexing componentRules[ch.GetName()] without checking presence will pass an empty prefix for unknown components, which can lead to incorrect config updates. Guard the lookup.

Apply this diff:

 return cr.ForEach(func(ch cr.ComponentHandler) error {
   ci := ch.NewCRObject(dsc)
+  prefix, ok := componentRules[ch.GetName()]
+  if !ok {
+    // No prom rules mapping for this component; skip.
+    return nil
+  }
   if ch.IsEnabled(dsc) {
     ready, err := isComponentReady(ctx, rr.Client, ci)
     if err != nil {
       return fmt.Errorf("failed to get component status %w", err)
     }
     if !ready { // not ready, skip change on prom rules
       return nil
     }
     // add
-    return updatePrometheusConfig(ctx, true, componentRules[ch.GetName()])
+    return updatePrometheusConfig(ctx, true, prefix)
   } else {
-    return updatePrometheusConfig(ctx, false, componentRules[ch.GetName()])
+    return updatePrometheusConfig(ctx, false, prefix)
   }
 })
♻️ Duplicate comments (13)
tests/e2e/monitoring_test.go (1)

666-681: Use plural helper and assert operator PrometheusRule remains

For consistency with other helpers and to validate operator-level rules persist when a component is removed, apply:

-// Ensure the dashboard-prometheusrules is deleted
-tc.EnsureResourceGone(
-  WithMinimalObject(gvk.PrometheusRule, types.NamespacedName{Name: "dashboard-prometheusrules", Namespace: tc.AppsNamespace}),
-)
+// Ensure the dashboard-specific PrometheusRule is deleted
+tc.EnsureResourcesGone(
+  WithMinimalObject(gvk.PrometheusRule, types.NamespacedName{Name: "dashboard-prometheusrules", Namespace: tc.AppsNamespace}),
+)
+
+// Operator-level rules should remain when alerting is still enabled
+tc.EnsureResourceExists(
+  WithMinimalObject(gvk.PrometheusRule, types.NamespacedName{Name: "operator-prometheusrules", Namespace: tc.AppsNamespace}),
+)
internal/controller/components/feastoperator/monitoring/feastoperator-prometheusrules.tmpl.yaml (1)

8-9: Fix YAML indentation under groups

YAMLlint flags this list item as over-indented. Align list item to two spaces under groups:.

-      - name: SLOs-probe_success_feast_operator
+    - name: SLOs-probe_success_feast_operator
internal/controller/components/kserve/monitoring/kserve-prometheusrules.tmpl.yaml (1)

10-42: Differentiate alert names for clarity

-        - alert: Kserve Controller Probe Success Burn Rate
+        - alert: Kserve Controller Probe Success Burn Rate (5m/1h)
...
-        - alert: Kserve Controller Probe Success Burn Rate
+        - alert: Kserve Controller Probe Success Burn Rate (30m/6h)
...
-        - alert: Kserve Controller Probe Success Burn Rate
+        - alert: Kserve Controller Probe Success Burn Rate (2h/1d)
internal/controller/components/codeflare/monitoring/codeflare-prometheusrules.tmpl.yaml (1)

50-61: PromQL: standardize quoting and job selector

PromQL accepts single-quoted label values, but the rest of the rules use double quotes; standardize the quoting and the job label matching across both alerts.

Apply this diff:

-          - alert: CodeFlare Operator is not running
-            expr: absent(up{job=~'CodeFlare Operator'}) or up{job=~'CodeFlare Operator'} != 1
+          - alert: CodeFlare Operator is not running
+            expr: absent(up{job=~"CodeFlare Operator"}) or up{job=~"CodeFlare Operator"} != 1
@@
-          - alert: CodeFlare Operator taking too long to be up
-            expr: absent_over_time(up{job="CodeFlare Operator"}[2m]) == 1
+          - alert: CodeFlare Operator taking too long to be up
+            expr: absent_over_time(up{job=~"CodeFlare Operator"}[2m]) == 1
internal/controller/components/llamastackoperator/monitoring/llamastackoperator-prometheusrules.tmpl.yaml (2)

22-22: Remove trailing spaces to satisfy linters

Multiple lines have trailing spaces; please strip them file-wide.

Also applies to: 35-35, 48-49, 57-57, 63-63, 69-69, 75-75, 81-81, 87-87, 93-93


52-56: PromQL: compute probe_success from the up metric; the current expression returns 1 whenever the series exists

The existing rule uses series existence, which masks outages. Compute probe_success from actual up values and coalesce missing to 0.

Apply this diff:

-        - expr: |
-            absent(up{job="llama-stack-k8s-operator"}) * 0 or vector(1)
-          labels:
-            instance: llama-stack-k8s-operator-controller-manager
-          record: probe_success
+      - expr: |
+          avg without (pod, endpoint, instance) (
+            up{job="llama-stack-k8s-operator", namespace="{{.Namespace}}"}
+          ) or on() vector(0)
+        labels:
+          instance: llama-stack-k8s-operator-controller-manager
+        record: probe_success
internal/controller/components/workbenches/monitoring/workbenches-prometheusrules.tmpl.yaml (4)

14-14: Unify PVC regexes to avoid mismatched series

The used and capacity selectors should use identical regexes.

Apply this diff:

-          expr: kubelet_volume_stats_used_bytes{persistentvolumeclaim=~".*jupyterhub-nb-.*"} / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"jupyterhub-nb-.*"} > 0.9 and kubelet_volume_stats_used_bytes{persistentvolumeclaim=~".*jupyterhub-nb-.*"} / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"jupyterhub-nb-.*"} < 0.99
+          expr: |
+            kubelet_volume_stats_used_bytes{persistentvolumeclaim=~"jupyterhub-nb-.*"}
+              /
+            kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"jupyterhub-nb-.*"} > 0.9
+              and
+            kubelet_volume_stats_used_bytes{persistentvolumeclaim=~"jupyterhub-nb-.*"}
+              /
+            kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"jupyterhub-nb-.*"} < 0.99

53-100: Make alert names unique within the group for clarity and triage

Repeated alert names hinder triage and can confuse downstream routing. Add window suffixes.

Apply this diff:

-        - alert: RHODS Jupyter Probe Success Burn Rate
+        - alert: RHODS Jupyter Probe Success Burn Rate (5m/1h)
@@
-        - alert: RHODS Jupyter Probe Success Burn Rate
+        - alert: RHODS Jupyter Probe Success Burn Rate (30m/6h)
@@
-        - alert: RHODS Jupyter Probe Success Burn Rate
+        - alert: RHODS Jupyter Probe Success Burn Rate (2h/1d)
@@
-        - alert: RHODS Jupyter Probe Success Burn Rate
+        - alert: RHODS Jupyter Probe Success Burn Rate (6h/3d)

35-35: Standardize on double-quoted PromQL strings

Single-quoted label values parse, but every other rule uses double quotes; standardize.

Apply this diff:

-          expr: absent(up{job=~'Kubeflow Notebook Controller Service Metrics'})
+          expr: absent(up{job=~"Kubeflow Notebook Controller Service Metrics"})
@@
-          expr: absent(up{job=~'ODH Notebook Controller Service Metrics'})
+          expr: absent(up{job=~"ODH Notebook Controller Service Metrics"})

Also applies to: 44-44


174-177: PromQL: label_replace on a scalar; fix by preserving vector context

label_replace(min(...)) collapses to a scalar; apply min by () to keep a single-element vector, then set “name” for the combined series.

Apply this diff:

-        - expr: ((min(probe_success{name=~"rhods-dashboard|notebook-spawner"}) by (name) or on() vector(0)) or label_replace(min(probe_success{name=~"rhods-dashboard|notebook-spawner"}), "name", "combined", "name", ".*"))
+        - expr: |
+            min(probe_success{name=~"rhods-dashboard|notebook-spawner"}) by (name)
+              or
+            label_replace(
+              min by () (probe_success{name=~"rhods-dashboard|notebook-spawner"}),
+              "name", "combined", "", ""
+            )
internal/controller/components/dashboard/monitoring/dashboard-prometheusrules.tmpl.yaml (3)

21-21: Template namespace label instead of hardcoding

Replace hardcoded label with template namespace to match metadata.namespace.

Apply this diff:

-            namespace: redhat-ods-applications
+            namespace: {{.Namespace}}

Also applies to: 33-33, 45-45, 57-57, 71-71, 83-83, 95-95, 107-107


10-57: Make route burn-rate alert names unique per window

Disambiguate alerts by embedding their time windows; improves triage and dashboards.

Apply this diff:

-        - alert: RHODS Dashboard Route Error Burn Rate
+        - alert: RHODS Dashboard Route Error Burn Rate (5m/1h)
@@
-        - alert: RHODS Dashboard Route Error Burn Rate
+        - alert: RHODS Dashboard Route Error Burn Rate (30m/6h)
@@
-        - alert: RHODS Dashboard Route Error Burn Rate
+        - alert: RHODS Dashboard Route Error Burn Rate (2h/1d)
@@
-        - alert: RHODS Dashboard Route Error Burn Rate
+        - alert: RHODS Dashboard Route Error Burn Rate (6h/3d)

162-196: Recording rules risk “duplicate series for the same label set”; align selectors and aggregate

Each expression returns multiple series (e.g., per instance) but then force-sets labels.instance: rhods-dashboard, causing collisions at record time. Also, selectors use instance=~ while corresponding alerts use name=~, creating inconsistency.

Apply this diff to aggregate by name and drop the explicit labels override:

-        - expr: |
-            1 - avg_over_time(probe_success{instance=~"rhods-dashboard-.*", job="user_facing_endpoints_status_rhods_dashboard"}[1d])
-          labels:
-            instance: rhods-dashboard
-          record: probe_success:burnrate1d
+        - expr: |
+            sum by (name) (1 - avg_over_time(probe_success{job="user_facing_endpoints_status_rhods_dashboard", name="rhods-dashboard"}[1d]))
+          record: probe_success:burnrate1d
-        - expr: |
-            1 - avg_over_time(probe_success{instance=~"rhods-dashboard-.*", job="user_facing_endpoints_status_rhods_dashboard"}[1h])
-          labels:
-            instance: rhods-dashboard
-          record: probe_success:burnrate1h
+        - expr: |
+            sum by (name) (1 - avg_over_time(probe_success{job="user_facing_endpoints_status_rhods_dashboard", name="rhods-dashboard"}[1h]))
+          record: probe_success:burnrate1h
-        - expr: |
-            1 - avg_over_time(probe_success{instance=~"rhods-dashboard-.*", job="user_facing_endpoints_status_rhods_dashboard"}[2h])
-          labels:
-            instance: rhods-dashboard
-          record: probe_success:burnrate2h
+        - expr: |
+            sum by (name) (1 - avg_over_time(probe_success{job="user_facing_endpoints_status_rhods_dashboard", name="rhods-dashboard"}[2h]))
+          record: probe_success:burnrate2h
-        - expr: |
-            1 - avg_over_time(probe_success{instance=~"rhods-dashboard-.*", job="user_facing_endpoints_status_rhods_dashboard"}[30m])
-          labels:
-            instance: rhods-dashboard
-          record: probe_success:burnrate30m
+        - expr: |
+            sum by (name) (1 - avg_over_time(probe_success{job="user_facing_endpoints_status_rhods_dashboard", name="rhods-dashboard"}[30m]))
+          record: probe_success:burnrate30m
-        - expr: |
-            1 - avg_over_time(probe_success{instance=~"rhods-dashboard-.*", job="user_facing_endpoints_status_rhods_dashboard"}[3d])
-          labels:
-            instance: rhods-dashboard
-          record: probe_success:burnrate3d
+        - expr: |
+            sum by (name) (1 - avg_over_time(probe_success{job="user_facing_endpoints_status_rhods_dashboard", name="rhods-dashboard"}[3d]))
+          record: probe_success:burnrate3d
-        - expr: |
-            1 - avg_over_time(probe_success{instance=~"rhods-dashboard-.*", job="user_facing_endpoints_status_rhods_dashboard"}[5m])
-          labels:
-            instance: rhods-dashboard
-          record: probe_success:burnrate5m
+        - expr: |
+            sum by (name) (1 - avg_over_time(probe_success{job="user_facing_endpoints_status_rhods_dashboard", name="rhods-dashboard"}[5m]))
+          record: probe_success:burnrate5m
-        - expr: |
-            1 - avg_over_time(probe_success{instance=~"rhods-dashboard-.*", job="user_facing_endpoints_status_rhods_dashboard"}[6h])
-          labels:
-            instance: rhods-dashboard
-          record: probe_success:burnrate6h
+        - expr: |
+            sum by (name) (1 - avg_over_time(probe_success{job="user_facing_endpoints_status_rhods_dashboard", name="rhods-dashboard"}[6h]))
+          record: probe_success:burnrate6h

This aligns recordings with alert selectors (which use name) and prevents duplicate-series errors.

🧹 Nitpick comments (13)
internal/controller/status/status.go (1)

108-109: New component condition type looks consistent; consider documenting intent

ConditionAlertingAvailable aligns with existing condition naming (e.g., MonitoringAvailable). For parity and discoverability, add a short comment above it describing when controllers should toggle it (e.g., set True when alerting PrometheusRules applied and CRD present).

-    ConditionAlertingAvailable               = "AlertingAvailable"
+    // ConditionAlertingAvailable indicates whether alerting (PrometheusRule deployment) is available.
+    ConditionAlertingAvailable               = "AlertingAvailable"
config/crd/bases/dscinitialization.opendatahub.io_dscinitializations.yaml (1)

247-250: Tighten CEL: disallow null metrics when alerting is set

The current rule passes when metrics is present but explicitly null. To clearly require a non-null metrics stanza when alerting is configured, tighten the CEL check.

-                - message: Alerting configuration requires metrics to be configured
-                  rule: 'has(self.alerting) ? has(self.metrics) : true'
+                - message: Alerting configuration requires metrics to be configured
+                  rule: 'has(self.alerting) ? (has(self.metrics) && self.metrics != null) : true'

Note: This still allows an empty object ({}), which you appear to rely on elsewhere, but prevents alerting with metrics: null or omitted.
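
For illustration, a monitoring stanza the tightened rule should reject versus one it should accept (field names as in the DSCI schema):

# rejected: alerting set without metrics
monitoring:
  managementState: Managed
  alerting: {}

# accepted: metrics present alongside alerting
monitoring:
  managementState: Managed
  metrics: {}
  alerting: {}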

tests/e2e/monitoring_test.go (1)

620-664: Make PrometheusRule creation test resilient to preconditions

Looks good. To make it hermetic, consider verifying/setting .spec.monitoring.metrics beforehand (CRD enforces metrics when alerting is set). In this suite, previous tests do set metrics, but an explicit precondition here would reduce coupling.

Example (optional) to prepend before enabling alerting:

+ // Ensure metrics stanza exists (required by CRD when alerting is set)
+ tc.EventuallyResourceCreatedOrUpdated(
+   WithMinimalObject(gvk.DSCInitialization, tc.DSCInitializationNamespacedName),
+   WithMutateFunc(testf.TransformPipeline(
+     testf.Transform(`.spec.monitoring.managementState = "%s"`, operatorv1.Managed),
+     setMonitoringMetrics(),
+   )),
+ )
internal/controller/components/feastoperator/monitoring/feastoperator-prometheusrules.tmpl.yaml (2)

22-35: Remove trailing whitespace to satisfy linters/pipelines

Multiple lines end with trailing spaces. Clean them up to avoid YAML lint errors.

Example (apply to listed lines):

-              
+

Also applies to: 57-58, 63-64, 69-70, 75-76, 81-82, 87-88, 93-94


10-47: Differentiate alert names to avoid deduplication collisions

All three alerts share the same name. Including the time windows improves triage and avoids alert dedup confusion.

-          - alert: Feast Operator Probe Success Burn Rate
+          - alert: Feast Operator Probe Success Burn Rate (5m/1h)
...
-          - alert: Feast Operator Probe Success Burn Rate
+          - alert: Feast Operator Probe Success Burn Rate (30m/6h)
...
-          - alert: Feast Operator Probe Success Burn Rate
+          - alert: Feast Operator Probe Success Burn Rate (2h/1d)
internal/controller/components/kserve/monitoring/kserve-prometheusrules.tmpl.yaml (1)

8-9: Fix YAML indentation under groups

Align list items to two spaces beneath groups: for consistency and linter compliance.

-      - name: SLOs-probe_success_kserve
+    - name: SLOs-probe_success_kserve
internal/controller/services/monitoring/monitoring_controller_actions.go (1)

400-421: Don’t fail reconcile on per-component rule add/cleanup errors

You log individual component errors with the intent to not fail reconciliation, but then return a generic error which fails the reconcile anyway. This causes noisy requeues and can mask partial success.

Apply this diff to keep reconciliation non-fatal while preserving logs:

-// If we fail to add prometheus rules for a component.
+// If we fail to add prometheus rules for a component.
 if len(addErrors) > 0 {
   // Log errors but don't fail the reconciliation
   for _, addErr := range addErrors {
     logf.FromContext(ctx).Error(addErr, "Failed to add prometheus rules for component")
   }
 }
 
 // If we fail to clean up prometheus rules for a component.
 if len(cleanupErrors) > 0 {
   // Log errors but don't fail the reconciliation
   for _, cleanupErr := range cleanupErrors {
     logf.FromContext(ctx).Error(cleanupErr, "Failed to cleanup prometheus rules for component")
   }
 }
 
-if len(addErrors) > 0 || len(cleanupErrors) > 0 {
-  return errors.New("errors occurred while adding or cleaning up prometheus rules for components")
-}
-
 return nil

If you prefer to surface degraded state, consider marking ConditionAlertingAvailable false (with message) when any component rule op fails, but still avoid throwing hard errors here.

internal/controller/components/codeflare/monitoring/codeflare-prometheusrules.tmpl.yaml (3)

8-10: Fix YAML indentation under spec.groups

The list items under groups are over-indented; this trips linters/YAML parsers.

Apply this diff:

 spec:
   groups:
-      - name: SLOs-probe_success_codeflare
-        rules:
+    - name: SLOs-probe_success_codeflare
+      rules:
@@
-      
+
@@
-      - name: SLOs - MCAD Controller
+    - name: SLOs - MCAD Controller
@@
-      - name: SLOs - CodeFlare Operator
+    - name: SLOs - CodeFlare Operator

Also applies to: 72-72, 110-110


20-21: Template namespace label instead of hardcoding

Hardcoding namespace breaks non-default installs. Align label with metadata.namespace.

Apply this diff:

           labels:
             severity: info
-            namespace: redhat-ods-applications
+            namespace: {{.Namespace}}
@@
           labels:
             severity: info
-            namespace: redhat-ods-applications
+            namespace: {{.Namespace}}
@@
           labels:
             severity: info
-            namespace: redhat-ods-applications
+            namespace: {{.Namespace}}
@@
             labels:
               severity: info
-              namespace: redhat-ods-applications
+              namespace: {{.Namespace}}
@@
             labels:
               severity: info
-              namespace: redhat-ods-applications
+              namespace: {{.Namespace}}

Also applies to: 32-33, 44-45, 53-54, 62-63


68-68: Strip trailing whitespace

Trailing spaces can fail linters and clutter diffs. Please remove trailing spaces on this line and run a quick pass to strip EOL whitespace across the file.

internal/controller/components/llamastackoperator/monitoring/llamastackoperator-prometheusrules.tmpl.yaml (1)

8-10: Fix indentation under groups and rules

Over-indentation under groups and mis-indented rule item will cause YAMLlint warnings and can confuse reviewers.

Apply this diff:

 spec:
   groups:
-      - name: SLOs-probe_success_llama_stack_k8s_operator
-        rules:
+    - name: SLOs-probe_success_llama_stack_k8s_operator
+      rules:
@@
-        rules:
-        - expr: |
+      rules:
+      - expr: |

Also applies to: 52-52

internal/controller/components/workbenches/monitoring/workbenches-prometheusrules.tmpl.yaml (1)

39-40: Template namespace label for portability

Use the template namespace in labels.

Apply this diff:

           labels:
             severity: warning
-            namespace: redhat-ods-applications
+            namespace: {{.Namespace}}
@@
           labels:
             severity: warning
-            namespace: redhat-ods-applications
+            namespace: {{.Namespace}}

Also applies to: 48-49

internal/controller/components/dashboard/monitoring/dashboard-prometheusrules.tmpl.yaml (1)

60-107: Make probe burn-rate alert names unique per window

Same rationale as above; add window suffixes.

Apply this diff:

-        - alert: RHODS Dashboard Probe Success Burn Rate
+        - alert: RHODS Dashboard Probe Success Burn Rate (5m/1h)
@@
-        - alert: RHODS Dashboard Probe Success Burn Rate
+        - alert: RHODS Dashboard Probe Success Burn Rate (30m/6h)
@@
-        - alert: RHODS Dashboard Probe Success Burn Rate
+        - alert: RHODS Dashboard Probe Success Burn Rate (2h/1d)
@@
-        - alert: RHODS Dashboard Probe Success Burn Rate
+        - alert: RHODS Dashboard Probe Success Burn Rate (6h/3d)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 3fdf7b8 and 93a65af.

📒 Files selected for processing (34)
  • api/services/v1alpha1/monitoring_types.go (3 hunks)
  • api/services/v1alpha1/zz_generated.deepcopy.go (2 hunks)
  • bundle/manifests/dscinitialization.opendatahub.io_dscinitializations.yaml (2 hunks)
  • bundle/manifests/opendatahub-operator.clusterserviceversion.yaml (2 hunks)
  • bundle/manifests/services.platform.opendatahub.io_monitorings.yaml (2 hunks)
  • config/crd/bases/dscinitialization.opendatahub.io_dscinitializations.yaml (2 hunks)
  • config/crd/bases/services.platform.opendatahub.io_monitorings.yaml (2 hunks)
  • config/rbac/role.yaml (2 hunks)
  • docs/api-overview.md (4 hunks)
  • internal/controller/components/codeflare/monitoring/codeflare-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/components.go (1 hunks)
  • internal/controller/components/dashboard/monitoring/dashboard-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/datasciencepipelines/monitoring/datasciencepipelines-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/feastoperator/monitoring/feastoperator-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/kserve/monitoring/kserve-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/kueue/monitoring/kueue-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/llamastackoperator/monitoring/llamastackoperator-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/modelcontroller/monitoring/modelcontroller-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/modelmeshserving/monitoring/modelmeshserving-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/modelregistry/monitoring/modelregistry-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/ray/monitoring/ray-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/trainingoperator/monitoring/trainingoperator-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/trustyai/monitoring/trustyai-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/workbenches/monitoring/workbenches-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/dscinitialization/dscinitialization_controller.go (1 hunks)
  • internal/controller/dscinitialization/kubebuilder_rbac.go (1 hunks)
  • internal/controller/services/monitoring/monitoring/operator-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/services/monitoring/monitoring_controller.go (2 hunks)
  • internal/controller/services/monitoring/monitoring_controller_actions.go (3 hunks)
  • internal/controller/services/monitoring/monitoring_controller_support.go (2 hunks)
  • internal/controller/status/status.go (2 hunks)
  • pkg/cluster/gvk/gvk.go (1 hunks)
  • pkg/common/common.go (2 hunks)
  • tests/e2e/monitoring_test.go (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (24)
  • internal/controller/dscinitialization/dscinitialization_controller.go
  • internal/controller/components/trainingoperator/monitoring/trainingoperator-prometheusrules.tmpl.yaml
  • internal/controller/components/ray/monitoring/ray-prometheusrules.tmpl.yaml
  • internal/controller/dscinitialization/kubebuilder_rbac.go
  • internal/controller/services/monitoring/monitoring_controller.go
  • internal/controller/components/components.go
  • internal/controller/services/monitoring/monitoring_controller_support.go
  • bundle/manifests/services.platform.opendatahub.io_monitorings.yaml
  • api/services/v1alpha1/zz_generated.deepcopy.go
  • pkg/common/common.go
  • internal/controller/services/monitoring/monitoring/operator-prometheusrules.tmpl.yaml
  • internal/controller/components/modelcontroller/monitoring/modelcontroller-prometheusrules.tmpl.yaml
  • bundle/manifests/opendatahub-operator.clusterserviceversion.yaml
  • pkg/cluster/gvk/gvk.go
  • docs/api-overview.md
  • config/crd/bases/services.platform.opendatahub.io_monitorings.yaml
  • internal/controller/components/datasciencepipelines/monitoring/datasciencepipelines-prometheusrules.tmpl.yaml
  • config/rbac/role.yaml
  • bundle/manifests/dscinitialization.opendatahub.io_dscinitializations.yaml
  • internal/controller/components/modelregistry/monitoring/modelregistry-prometheusrules.tmpl.yaml
  • internal/controller/components/trustyai/monitoring/trustyai-prometheusrules.tmpl.yaml
  • api/services/v1alpha1/monitoring_types.go
  • internal/controller/components/kueue/monitoring/kueue-prometheusrules.tmpl.yaml
  • internal/controller/components/modelmeshserving/monitoring/modelmeshserving-prometheusrules.tmpl.yaml
🧰 Additional context used
🧬 Code Graph Analysis (2)
internal/controller/services/monitoring/monitoring_controller_actions.go (6)
pkg/controller/types/types.go (2)
  • ReconciliationRequest (75-112)
  • TemplateInfo (67-73)
api/services/v1alpha1/monitoring_types.go (2)
  • Monitoring (133-139)
  • Alerting (121-122)
pkg/cluster/gvk/gvk.go (2)
  • Monitoring (288-292)
  • PrometheusRule (480-484)
pkg/controller/conditions/conditions.go (2)
  • WithReason (19-23)
  • WithMessage (25-34)
pkg/cluster/resources.go (2)
  • HasCRD (310-312)
  • GetDSC (94-114)
internal/controller/components/registry/registry.go (2)
  • ForEach (71-73)
  • ComponentHandler (17-29)
tests/e2e/monitoring_test.go (5)
tests/e2e/resource_options_test.go (3)
  • WithMutateFunc (194-198)
  • WithMinimalObject (140-155)
  • WithCondition (203-207)
pkg/utils/test/testf/testf_support.go (2)
  • Transform (78-106)
  • TransformPipeline (50-61)
pkg/cluster/gvk/gvk.go (5)
  • DSCInitialization (66-70)
  • DataScienceCluster (60-64)
  • Namespace (24-28)
  • Dashboard (210-214)
  • PrometheusRule (480-484)
pkg/utils/test/matchers/jq/jq_matcher.go (1)
  • Match (11-15)
internal/controller/status/status.go (2)
  • ConditionTypeReady (90-90)
  • ConditionTypeProvisioningSucceeded (95-95)
🪛 YAMLlint (1.37.1)
internal/controller/components/workbenches/monitoring/workbenches-prometheusrules.tmpl.yaml

[warning] 8-8: wrong indentation: expected 4 but found 6

(indentation)


[warning] 10-10: wrong indentation: expected 10 but found 8

(indentation)


[warning] 31-31: wrong indentation: expected 10 but found 8

(indentation)


[error] 50-50: trailing spaces

(trailing-spaces)


[warning] 53-53: wrong indentation: expected 10 but found 8

(indentation)


[warning] 105-105: wrong indentation: expected 10 but found 8

(indentation)


[warning] 143-143: wrong indentation: expected 10 but found 8

(indentation)


[error] 171-171: trailing spaces

(trailing-spaces)


[warning] 174-174: wrong indentation: expected 10 but found 8

(indentation)

internal/controller/components/codeflare/monitoring/codeflare-prometheusrules.tmpl.yaml

[warning] 8-8: wrong indentation: expected 4 but found 6 (indentation)
[warning] 10-10: wrong indentation: expected 10 but found 8 (indentation)
[error] 68-68: trailing spaces (trailing-spaces)
[warning] 72-72: wrong indentation: expected 10 but found 8 (indentation)
[warning] 110-110: wrong indentation: expected 10 but found 8 (indentation)

internal/controller/components/feastoperator/monitoring/feastoperator-prometheusrules.tmpl.yaml

[warning] 8-8: wrong indentation: expected 4 but found 6 (indentation)
[error] 22-22: trailing spaces (trailing-spaces)
[error] 35-35: trailing spaces (trailing-spaces)
[warning] 52-52: wrong indentation: expected 10 but found 8 (indentation)
[error] 57-57: trailing spaces (trailing-spaces)
[error] 63-63: trailing spaces (trailing-spaces)
[error] 69-69: trailing spaces (trailing-spaces)
[error] 75-75: trailing spaces (trailing-spaces)
[error] 81-81: trailing spaces (trailing-spaces)
[error] 87-87: trailing spaces (trailing-spaces)
[error] 93-93: trailing spaces (trailing-spaces)

internal/controller/components/llamastackoperator/monitoring/llamastackoperator-prometheusrules.tmpl.yaml

[warning] 8-8: wrong indentation: expected 4 but found 6 (indentation)
[error] 22-22: trailing spaces (trailing-spaces)
[error] 35-35: trailing spaces (trailing-spaces)
[error] 48-48: trailing spaces (trailing-spaces)
[error] 49-49: trailing spaces (trailing-spaces)
[warning] 52-52: wrong indentation: expected 10 but found 8 (indentation)
[error] 57-57: trailing spaces (trailing-spaces)
[error] 63-63: trailing spaces (trailing-spaces)
[error] 69-69: trailing spaces (trailing-spaces)
[error] 75-75: trailing spaces (trailing-spaces)
[error] 81-81: trailing spaces (trailing-spaces)
[error] 87-87: trailing spaces (trailing-spaces)
[error] 93-93: trailing spaces (trailing-spaces)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Build/push catalog image
  • GitHub Check: Run tests and collect coverage on internal and pkg
  • GitHub Check: golangci-lint
  • GitHub Check: Run tests and collect coverage on tests/integration
🔇 Additional comments (6)
internal/controller/status/status.go (1)

189-195: LGTM: Alerting not-configured reason/messages added

AlertingNotConfiguredReason/Message mirror the metrics/traces pattern and will help surface user feedback when alerting is disabled/missing in DSCI.
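
For orientation only, constants of this kind usually take the shape sketched below; the identifiers come from the comment above, but the string values are assumptions rather than the actual contents of status.go:

// Sketch: string values are illustrative, not the real ones in status.go.
const (
	AlertingNotConfiguredReason  = "AlertingNotConfigured"
	AlertingNotConfiguredMessage = "Alerting is not configured: set spec.monitoring.alerting in the DSCInitialization"
)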

config/crd/bases/dscinitialization.opendatahub.io_dscinitializations.yaml (1)

93-96: Alerting field added under spec.monitoring: OK

Adding an optional object for alerting keeps it future-proof for additional knobs. No further schema constraints are necessary right now.
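
As a point of reference, the Go API that this schema is generated from roughly takes the shape below (field names follow the PR summary; the kubebuilder marker and surrounding fields are assumptions):

// Alerting is currently an opt-in marker struct; its presence enables alerting.
type Alerting struct{}

type MonitoringCommonSpec struct {
	// ... existing fields such as Metrics and Traces ...

	// +optional
	Alerting *Alerting `json:"alerting,omitempty"`
}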

tests/e2e/monitoring_test.go (2)

51-53: Good coverage: creation/deletion tests for PrometheusRules

Including both creation and deletion flows early in the suite is helpful to validate the end-to-end alerting wiring.


208-209: Using a whole-monitoring mutate is fine; asserted behavior elsewhere remains stable

Switching to mutating the entire .spec.monitoring map is OK, but be aware that it also removes any alerting/traces stanzas previously set. Subsequent tests reconfigure traces later; that works for serial execution but makes the tests order-dependent.

Do you want this test to be order-independent? If so, I can propose localized mutations (touching only .spec.monitoring.metrics) and add preconditions to the PrometheusRule tests so their setup is self-contained.
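
If order-independence is preferred, a minimal sketch using plain apimachinery helpers (rather than the suite's testf wrappers, whose exact signatures are not reproduced here) would mutate only the metrics stanza:

package e2e_test

import (
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// setMetricsOnly overwrites .spec.monitoring.metrics and leaves any
// alerting/traces stanzas untouched, so later tests do not depend on ordering.
func setMetricsOnly(obj *unstructured.Unstructured) error {
	return unstructured.SetNestedMap(
		obj.Object,
		map[string]interface{}{}, // fill in the desired metrics settings here
		"spec", "monitoring", "metrics",
	)
}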

internal/controller/services/monitoring/monitoring_controller_actions.go (2)

54-55: LGTM: embedding operator rule templates

Stacking two //go:embed directives to include both resources/ and monitoring/ into a single embed.FS is valid and keeps packaging simple.
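
For readers unfamiliar with the pattern, stacked embed directives over a single variable look like this minimal sketch (the variable name and doc comment are illustrative):

package monitoring

import "embed"

// Both directory trees end up in one read-only filesystem, so templates can be
// addressed as "resources/<file>" or "monitoring/<file>".
//
//go:embed resources
//go:embed monitoring
var resourcesFS embed.FS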


341-349: LGTM: gating alerting on PrometheusRule CRD and adding operator rules

Properly checks CRD availability and emits operator-level rules regardless of component readiness. Good guardrails for partial installations.
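
To make the flow concrete, here is a minimal sketch of that gating, assuming a helper shaped like cluster.HasCRD(ctx, client, gvk), a ReconciliationRequest with Client and Templates fields, and a TemplateInfo with FS/Path fields; the real deployAlerting in monitoring_controller_actions.go differs in its details:

package monitoring

import (
	"context"
	"fmt"

	"github.com/opendatahub-io/opendatahub-operator/v2/pkg/cluster"
	"github.com/opendatahub-io/opendatahub-operator/v2/pkg/cluster/gvk"
	odhtypes "github.com/opendatahub-io/opendatahub-operator/v2/pkg/controller/types"
)

// deployAlertingSketch: skip alerting entirely when the PrometheusRule CRD is
// absent; otherwise queue the operator-level rules regardless of component
// readiness (per-component rules are handled separately).
func deployAlertingSketch(ctx context.Context, rr *odhtypes.ReconciliationRequest) error {
	hasCRD, err := cluster.HasCRD(ctx, rr.Client, gvk.PrometheusRule)
	if err != nil {
		return fmt.Errorf("checking for PrometheusRule CRD: %w", err)
	}
	if !hasCRD {
		return nil // surface a condition instead of failing the reconcile
	}

	rr.Templates = append(rr.Templates, odhtypes.TemplateInfo{
		FS:   resourcesFS, // embedded filesystem from the sketch above
		Path: "monitoring/operator-prometheusrules.tmpl.yaml",
	})
	return nil
}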

@openshift-ci openshift-ci bot added the lgtm label Aug 18, 2025
@openshift-ci openshift-ci bot removed the lgtm label Aug 18, 2025
Copy link

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

♻️ Duplicate comments (5)
internal/controller/components/modelcontroller/monitoring/modelcontroller-prometheusrules.tmpl.yaml (1)

10-45: Make alert names unique per burn-rate window to avoid dedupe/collisions

All three alerts share the same name, which hampers triage and can lead to deduplication issues in Alertmanager. Suffix each name with its evaluated windows.

-        - alert: ODH Model Controller Probe Success Burn Rate
+        - alert: ODH Model Controller Probe Success Burn Rate 5m_1h
...
-        - alert: ODH Model Controller Probe Success Burn Rate
+        - alert: ODH Model Controller Probe Success Burn Rate 30m_6h
...
-        - alert: ODH Model Controller Probe Success Burn Rate
+        - alert: ODH Model Controller Probe Success Burn Rate 2h_1d
internal/controller/components/workbenches/monitoring/workbenches-prometheusrules.tmpl.yaml (4)

14-14: Regex inconsistency between numerator/denominator; align to avoid mismatched series

The usage selector allows any prefix, while the capacity selector requires the "jupyterhub-nb-" prefix. Use identical patterns on both sides of the ratio.

-          expr: kubelet_volume_stats_used_bytes{persistentvolumeclaim=~".*jupyterhub-nb-.*"} / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"jupyterhub-nb-.*"} > 0.9 and kubelet_volume_stats_used_bytes{persistentvolumeclaim=~".*jupyterhub-nb-.*"} / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"jupyterhub-nb-.*"} < 0.99
+          expr: kubelet_volume_stats_used_bytes{persistentvolumeclaim=~"jupyterhub-nb-.*"} / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"jupyterhub-nb-.*"} > 0.9
+            and
+            kubelet_volume_stats_used_bytes{persistentvolumeclaim=~"jupyterhub-nb-.*"} / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"jupyterhub-nb-.*"} < 0.99

53-101: Duplicate alert names in one group — make them unique

All four alerts share the same name; this will cause dedupe issues and can break loading in some setups. Suffix names with the window pairs.

-        - alert: RHODS Jupyter Probe Success Burn Rate
+        - alert: RHODS Jupyter Probe Success Burn Rate 5m_1h
...
-        - alert: RHODS Jupyter Probe Success Burn Rate
+        - alert: RHODS Jupyter Probe Success Burn Rate 30m_6h
...
-        - alert: RHODS Jupyter Probe Success Burn Rate
+        - alert: RHODS Jupyter Probe Success Burn Rate 2h_1d
...
-        - alert: RHODS Jupyter Probe Success Burn Rate
+        - alert: RHODS Jupyter Probe Success Burn Rate 6h_3d

174-177: PromQL type error: label_replace applied to a scalar; use min by() to keep vector

label_replace() expects a vector. min(...) without by() collapses to a scalar and will fail evaluation. Fix the expression.

-        - expr: ((min(probe_success{name=~"rhods-dashboard|notebook-spawner"}) by (name) or on() vector(0)) or label_replace(min(probe_success{name=~"rhods-dashboard|notebook-spawner"}), "name", "combined", "name", ".*"))
+        - expr: |
+            min(probe_success{name=~"rhods-dashboard|notebook-spawner"}) by (name)
+              or
+            label_replace(
+              min by () (probe_success{name=~"rhods-dashboard|notebook-spawner"}),
+              "name", "combined", "", ""
+            )

35-45: PromQL strings must be double-quoted

Single-quoted label values are invalid. Fix both absent() expressions.

-          expr: absent(up{job=~'Kubeflow Notebook Controller Service Metrics'})
+          expr: absent(up{job=~"Kubeflow Notebook Controller Service Metrics"})
...
-          expr: absent(up{job=~'ODH Notebook Controller Service Metrics'})
+          expr: absent(up{job=~"ODH Notebook Controller Service Metrics"})
🧹 Nitpick comments (3)
internal/controller/components/modelregistry/monitoring/modelregistry-prometheusrules.tmpl.yaml (1)

8-9: Group name copy/paste: use a Model Registry-specific group name

Group is named “model_controller” in the Model Registry template. Rename for clarity.

-      - name: SLOs-probe_success_model_controller
+      - name: SLOs-probe_success_model_registry
internal/controller/components/workbenches/monitoring/workbenches-prometheusrules.tmpl.yaml (1)

23-23: Consistent PVC selector regex

Mirror the same prefix regex as above for consistency.

-          expr: kubelet_volume_stats_used_bytes{persistentvolumeclaim=~".*jupyterhub-nb-.*"}/kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"jupyterhub-nb-.*"} > 0.99
+          expr: kubelet_volume_stats_used_bytes{persistentvolumeclaim=~"jupyterhub-nb-.*"} / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"jupyterhub-nb-.*"} > 0.99
internal/controller/components/codeflare/monitoring/codeflare-prometheusrules.tmpl.yaml (1)

8-8: Fix YAML indentation/trailing space as per linter

Indentation is off under groups/rules; trailing space at Line 68. Align to standard 2-space/consistent nesting to avoid parse issues.

   groups:
-      - name: SLOs-probe_success_codeflare
-        rules:
-        - alert: CodeFlare Operator Probe Success Burn Rate
+    - name: SLOs-probe_success_codeflare
+      rules:
+      - alert: CodeFlare Operator Probe Success Burn Rate
...
-      
+

Also applies to: 10-10, 72-72, 110-110, 68-68

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 93a65af and adf8745.

📒 Files selected for processing (8)
  • internal/controller/components/codeflare/monitoring/codeflare-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/dashboard/monitoring/dashboard-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/datasciencepipelines/monitoring/datasciencepipelines-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/modelcontroller/monitoring/modelcontroller-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/modelmeshserving/monitoring/modelmeshserving-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/modelregistry/monitoring/modelregistry-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/components/workbenches/monitoring/workbenches-prometheusrules.tmpl.yaml (1 hunks)
  • internal/controller/services/monitoring/monitoring_controller_support.go (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • internal/controller/services/monitoring/monitoring_controller_support.go
  • internal/controller/components/datasciencepipelines/monitoring/datasciencepipelines-prometheusrules.tmpl.yaml
  • internal/controller/components/modelmeshserving/monitoring/modelmeshserving-prometheusrules.tmpl.yaml
  • internal/controller/components/dashboard/monitoring/dashboard-prometheusrules.tmpl.yaml
🧰 Additional context used
🪛 YAMLlint (1.37.1)
internal/controller/components/codeflare/monitoring/codeflare-prometheusrules.tmpl.yaml

[warning] 8-8: wrong indentation: expected 4 but found 6 (indentation)
[warning] 10-10: wrong indentation: expected 10 but found 8 (indentation)
[error] 68-68: trailing spaces (trailing-spaces)
[warning] 72-72: wrong indentation: expected 10 but found 8 (indentation)
[warning] 110-110: wrong indentation: expected 10 but found 8 (indentation)

internal/controller/components/workbenches/monitoring/workbenches-prometheusrules.tmpl.yaml

[warning] 8-8: wrong indentation: expected 4 but found 6 (indentation)
[warning] 10-10: wrong indentation: expected 10 but found 8 (indentation)
[warning] 31-31: wrong indentation: expected 10 but found 8 (indentation)
[error] 50-50: trailing spaces (trailing-spaces)
[warning] 53-53: wrong indentation: expected 10 but found 8 (indentation)
[warning] 105-105: wrong indentation: expected 10 but found 8 (indentation)
[warning] 143-143: wrong indentation: expected 10 but found 8 (indentation)
[error] 171-171: trailing spaces (trailing-spaces)
[warning] 174-174: wrong indentation: expected 10 but found 8 (indentation)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Build/push catalog image
  • GitHub Check: Run tests and collect coverage on tests/integration
  • GitHub Check: Run tests and collect coverage on internal and pkg
  • GitHub Check: golangci-lint

@StevenTobin StevenTobin requested a review from CFSNM August 18, 2025 11:43
Copy link
Member

@CFSNM CFSNM left a comment


/approved /lgtm

@openshift-ci openshift-ci bot added the lgtm label Aug 18, 2025
Copy link

openshift-ci bot commented Aug 18, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: CFSNM, zdtsw

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit 18aa86d into opendatahub-io:main Aug 18, 2025
16 of 17 checks passed
@github-project-automation github-project-automation bot moved this from Todo to Done in ODH Platform Planning Aug 18, 2025
@StevenTobin
Copy link
Contributor Author

/cherrypick rhoai

@openshift-cherrypick-robot

@StevenTobin: #2261 failed to apply on top of branch "rhoai":

Applying: add built in alerting configuration for components
.git/rebase-apply/patch:401: trailing whitespace.
      
.git/rebase-apply/patch:969: trailing whitespace.
    
.git/rebase-apply/patch:982: trailing whitespace.
    
.git/rebase-apply/patch:1004: trailing whitespace.
    
.git/rebase-apply/patch:1010: trailing whitespace.
    
warning: squelched 18 whitespace errors
warning: 23 lines add whitespace errors.
Using index info to reconstruct a base tree...
M	api/services/v1alpha1/monitoring_types.go
M	bundle/manifests/dscinitialization.opendatahub.io_dscinitializations.yaml
A	bundle/manifests/opendatahub-operator.clusterserviceversion.yaml
M	bundle/manifests/services.platform.opendatahub.io_monitorings.yaml
M	config/crd/bases/dscinitialization.opendatahub.io_dscinitializations.yaml
M	config/crd/bases/services.platform.opendatahub.io_monitorings.yaml
M	config/rbac/role.yaml
M	docs/api-overview.md
M	internal/controller/dscinitialization/dscinitialization_controller.go
M	internal/controller/services/monitoring/monitoring_controller_support.go
M	internal/controller/status/status.go
M	pkg/cluster/gvk/gvk.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/cluster/gvk/gvk.go
Auto-merging internal/controller/status/status.go
Auto-merging internal/controller/services/monitoring/monitoring_controller_support.go
Auto-merging internal/controller/dscinitialization/dscinitialization_controller.go
Auto-merging docs/api-overview.md
Auto-merging config/rbac/role.yaml
Auto-merging config/crd/bases/services.platform.opendatahub.io_monitorings.yaml
Auto-merging config/crd/bases/dscinitialization.opendatahub.io_dscinitializations.yaml
Auto-merging bundle/manifests/services.platform.opendatahub.io_monitorings.yaml
Auto-merging bundle/manifests/rhods-operator.clusterserviceversion.yaml
Auto-merging bundle/manifests/dscinitialization.opendatahub.io_dscinitializations.yaml
Auto-merging api/services/v1alpha1/monitoring_types.go
Applying: reference application namespace in alerts that include namespace
Using index info to reconstruct a base tree...
M	internal/controller/services/monitoring/monitoring_controller_support.go
Falling back to patching base and 3-way merge...
Auto-merging internal/controller/services/monitoring/monitoring_controller_support.go
CONFLICT (content): Merge conflict in internal/controller/services/monitoring/monitoring_controller_support.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Patch failed at 0002 reference application namespace in alerts that include namespace

In response to this:

/cherrypick rhoai

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

sukumars321 pushed a commit to sukumars321/opendatahub-operator that referenced this pull request Aug 19, 2025
* add built in alerting configuration for components

run make bundle

fix_rules_file_name

fix_error_in_dsp_monitoring

fix e2e test

run make lint

* reference application namespace in alerts that include namespace
Projects
Status: Done

6 participants