
Conversation


@sradco sradco commented Jul 24, 2025

This PR includes the enhancement proposal for a new Alerts Management UI.

@openshift-ci openshift-ci bot requested review from jan--f and simonpasquier July 24, 2025 15:55

openshift-ci bot commented Jul 24, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign jan--f for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sradco sradco force-pushed the alert_managment_enhancment_proposal branch 2 times, most recently from c7c9d56 to 019bb16 on July 24, 2025 17:55

sradco commented Jul 24, 2025

@machadovilaca @avlitman @nunnatsa please review this enhancement proposal.

@sradco sradco changed the title from "Enhancment proposal for alerts managment" to "Enhancement proposal for alerts management" on Jul 27, 2025
@sradco sradco force-pushed the alert_managment_enhancment_proposal branch from 019bb16 to 1f7975e on July 27, 2025 07:39
@simonpasquier simonpasquier left a comment

Reading the proposal, it's not clear to me:

  • whether we need a new supported API or only UI improvements are required.
  • what are the gaps in the existing implementation that we want to fix (e.g. the AlertingRule & AlertRelabelConfig CRDs already exist, but what makes them hard to use?).


## Goals
1. CRUD operations on user‑defined alerts via UI and API.
2. Clone platform or user alerts to customize thresholds or scopes.
Contributor

Do we need to support customization for user-defined alerting rules? Right now, alert customization via AlertingRule is only meant for platform alerting rules.

Author

We need to support user-defined alerting rules and create the PrometheusRules, the same as is already possible today in the CLI.
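For reference, a minimal sketch of the kind of PrometheusRule manifest that is applied with the CLI today to create a user-defined alerting rule; the names, namespace and expression are illustrative, not part of the proposal:

```yaml
# Hypothetical user-defined alerting rule; all names and the expression
# are placeholders for illustration only.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-app-alerts
  namespace: my-app
spec:
  groups:
  - name: example-app.rules
    rules:
    - alert: ExampleAppHighErrorRate
      expr: |
        sum(rate(http_requests_total{job="example-app", code=~"5.."}[5m]))
          / sum(rate(http_requests_total{job="example-app"}[5m])) > 0.05
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: Example app error rate is above 5% for more than 10 minutes.
```

Applying such a manifest with `oc apply -f` in a project where user workload monitoring is enabled is the existing flow that the proposed API/UI would wrap.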

Author

But by support I mean the creation and lifecycle management, not catching faulty expressions/summaries/descriptions or any other user-defined detail.

Contributor

We need to support user-defined alerting rules and create the PrometheusRules, the same as is already possible today in the CLI.

you mean applying manifests with oc?

Author

We would wrap the existing API (oc) that allows creating the alert rules, yes.
We want to keep the APIs under the same prefix as the new planned APIs, for a clear definition for the UI.

Contributor

I'm not following your last comment, sorry. Which APIs are we talking about? For me CRDs are also APIs.

Author

I mean we would wrap only the API for creating the new user-defined alerts.
Other APIs require more complicated logic, since they will handle both user alerts and alert rules and platform alerts.

1. CRUD operations on user‑defined alerts via UI and API.
2. Clone platform or user alerts to customize thresholds or scopes.
3. Disable/enable alerts by creating/updating entries in the `AlertRelabelConfig` CR.
4. Create/manage silences in Alertmanager and reflect this in the UI.
Contributor

What are the gaps in the current product? It already supports silence management in the console. The same remark applies to the following bullet points.

@sradco sradco Jul 28, 2025

This effort will integrate silences as part of the overall feature and will use the existing silences APIs; the UI may require some updates for a unified view and management.

Contributor

See my comment above, I still don't understand what's missing in the console wrt silences.

@sradco sradco force-pushed the alert_managment_enhancment_proposal branch from 1f7975e to e67b7c0 on July 28, 2025 15:11

sradco commented Jul 28, 2025

Reading the proposal, it's not clear to me:

whether we need a new supported API or only UI improvements are required.
what are the gaps in the existing implementation that we want to fix (e.g. the AlertingRule & AlertRelabelConfig CRDs already exist, but what makes them hard to use?).

@simonpasquier We need REST APIs to support adding a UI for managing alerts.
There are no gaps for doing this with the CLI; the gap is the missing REST APIs to do this through the UI.

@sradco sradco force-pushed the alert_managment_enhancment_proposal branch 2 times, most recently from 6ef5ca0 to 31dbb17 on July 29, 2025 08:23
@sradco sradco force-pushed the alert_managment_enhancment_proposal branch from 31dbb17 to e2704cf on July 29, 2025 12:16
@simonpasquier

@simonpasquier We need REST APIs to support adding a UI for managing alerts.
There are no gaps for doing this with the CLI; the gap is the missing REST APIs to do this through the UI.

I feel that we're taking the problem backwards and starting from the implementation rather than the user experience. As an OpenShift user/cluster admin, do I need a REST API? Or is the main problem that the UI currently displays information which is inconsistent?

@sradco sradco force-pushed the alert_managment_enhancment_proposal branch 4 times, most recently from 08d669d to bb3093a on July 30, 2025 11:55
@sradco sradco changed the title from "Enhancement proposal for alerts management" to "Enhancement proposal for add API to support alerts management UI" on Jul 30, 2025
@sradco sradco changed the title from "Enhancement proposal for add API to support alerts management UI" to "Enhancement proposal for new APIs to support alerts management UI" on Jul 30, 2025
@sradco sradco force-pushed the alert_managment_enhancment_proposal branch 4 times, most recently from ae4f850 to 9c285f2 on July 31, 2025 09:17
so that I can quickly mute a noisy alert or restore it once the underlying issue is resolved.

3. **Clone and customize platform alerts**
As a cluster admin, I want to clone an existing built‑in alert, adjust its threshold and severity, and save it as a user‑defined rule,
Contributor

and save it as a user‑defined rule

IMHO this part isn't needed. The use case here is that you want to "replace" the definition of a built-in alerting rule because it doesn't fit your environment, and the new rule behaves like any other platform alerting rule.

I would also recommend concrete examples (e.g. "by default the NodeNetworkReceiveErrs alert fires when there's more than 1% of received packets in error but I want the threshold to be 2%" and "I want the SamplesDegraded alerting rule's severity to be info rather than warning").
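To make the first example concrete, a sketch of how such an override could be expressed with the existing `AlertingRule` CRD; the expression is only an approximation of the node-exporter mixin rule and the manifest is illustrative, not part of the proposal:

```yaml
# Hypothetical AlertingRule recreating NodeNetworkReceiveErrs with a 2% threshold.
# The original platform rule would typically be excluded via an AlertRelabelConfig
# so that only this definition fires (assumption, not prescribed here).
apiVersion: monitoring.openshift.io/v1
kind: AlertingRule
metadata:
  name: node-network-receive-errs-2pct
  namespace: openshift-monitoring
spec:
  groups:
  - name: custom-node-network
    rules:
    - alert: NodeNetworkReceiveErrs
      expr: |
        rate(node_network_receive_errs_total[2m])
          / rate(node_network_receive_packets_total[2m]) > 0.02
      for: 1h
      labels:
        severity: warning
      annotations:
        summary: Network interface is reporting many receive errors (above 2%).
```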

@sradco sradco Aug 7, 2025

Not necessarily. Sometimes users will clone and add another similar alert and keep the existing alert.
For example, if you want to fire one alert as warning at 70% and another as critical at 90%.
Or just use the source alert's expression and details as a reference for your new alert when you want to inspect the expression.
We will not disable the source alert as part of the clone process; users are expected to do that.

Contributor

Agreed but I would separate the use cases for clarity.

  1. Replace an existing platform alerting rule by a custom definition.
  2. Create a new platform alerting rule from an existing platform alerting rule.
  3. Create a new user-defined alerting rule from a platform or user-defined alerting rule.

Author

@simonpasquier How can a user create a new "Platform alerting rule"? (line 2)

Contributor

(The comment here was a documentation link; see the reply below.)

@sradco sradco Aug 20, 2025

  1. Create a new platform alerting rule from an existing platform alerting rule.

The link you provided is about creating alerting rules out of platform metrics, not alerts.

And below it, regarding updating platform alerts, it says:

  • "As a cluster administrator, you can modify core platform alerts before Alertmanager routes them to a receiver. For example, you can change the severity label of an alert, add a custom label, or exclude an alert from being sent to Alertmanager."
  • "You must create the AlertRelabelConfig object in the openshift-monitoring namespace. Otherwise, the alert label will not change."
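Following that documented pattern, a minimal sketch of an `AlertRelabelConfig` that changes the SamplesDegraded severity from warning to info (the object name is illustrative):

```yaml
# Hypothetical relabel config; assumes SamplesDegraded carries
# severity=warning by default.
apiVersion: monitoring.openshift.io/v1
kind: AlertRelabelConfig
metadata:
  name: samples-degraded-severity
  namespace: openshift-monitoring  # must be created in this namespace
spec:
  configs:
  - sourceLabels: [alertname, severity]
    regex: "SamplesDegraded;warning"
    targetLabel: severity
    replacement: info
    action: Replace
```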

so that I can tailor default monitoring to suit my team’s SLAs without modifying upstream operators.

4. **Create alert from a metric query**
As a developer, I want to write or paste a PromQL expression in a “Create Alert” form, specify duration and severity, and save it,
Contributor

What if you could create an alerting rule from the Metrics > Observe page?

Author

Not in scope of this effort, but possible later on I assume.

Contributor

I think that it would simplify the workflow a lot: you use existing data to tune the expression and then you turn it into a rule.

Author

@simonpasquier We can simplify the process a bit by not allowing any updates to platform alerts and always assuming that an update means creating a new alert and disabling the platform one.
This could be a valid alternative that we can consider.

So we would only use the AlertRelabelConfig to disable/enable alerts and to set Group and Component labels. WDYT?
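For the disable/enable part, a sketch of how disabling could look with the same CRD, assuming the `Drop` action excludes matching alerts before they reach Alertmanager (the alert name is a placeholder):

```yaml
# Hypothetical relabel config that "disables" a platform alert by dropping it.
apiVersion: monitoring.openshift.io/v1
kind: AlertRelabelConfig
metadata:
  name: disable-example-alert
  namespace: openshift-monitoring
spec:
  configs:
  - sourceLabels: [alertname]
    regex: "ExampleNoisyAlert"
    action: Drop
```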



@sradco sradco force-pushed the alert_managment_enhancment_proposal branch from 9c285f2 to 5ecd10a on August 18, 2025 14:11
- "@jan--f"
- "@jgbernalp"
api-approvers:
- TBD
Author

@simonpasquier who can be the api approvers?

Contributor

The api-approvers field is meant for custom resource definitions, which we're not going to add/modify IIUC.
If we implement a new REST API, I assume that the approvers depend on where it's going to live.

Author

We plan to add the new REST APIs to CMO

@simonpasquier simonpasquier Aug 27, 2025

CMO is a single replica deployment so not a good fit for an API server since it won't be highly available.

Contributor

Additionally that intention should perhaps be mentioned in this proposal.

@sradco sradco Aug 28, 2025

CMO is a single replica deployment so not a good fit for an API server since it won't be highly available.

@simonpasquier, @jan--f What do you propose?
When we discussed this with @jgbernalp I believe that we agreed this would be part of CMO.

Additionally that intention should perhaps be mentioned in this proposal.
@jan--f I updated the proposal. Please see if this is better.

@sradco sradco force-pushed the alert_managment_enhancment_proposal branch 2 times, most recently from f086b26 to e141f54 on August 20, 2025 11:34

sradco commented Aug 20, 2025

Hi @jan--f, @jgbernalp, I would appreciate your review of this proposal

@jan--f jan--f left a comment

I think improving how we deal with alerts in the UI is a great effort. We could do with a more unified and bulk friendly approach there. Some mockups for this would be helpful to illustrate what is being proposed (and where the current implementation falls short).

The proposed REST API doesn't make sense to me. IIUC we already have the k8s API that covers all the proposed functionality. The motivation section mentions "While it's possible to customize built-in alerting rules with the AlertingRule+AlertRelabelConfig CRDs, the process is cumbersome and error-prone. It requires creating YAML manifests manually and there's no practical way to verify the correctness of the configuration." However, this proposal doesn't demonstrate how to pass data to the new proposed API, nor does it outline how correctness should be verified a priori.

# New APIs to support Alerts UI Management in OpenShift

## Summary
Provide a user‑friendly UI and REST API for defining, viewing, editing, disabling/enabling and silencing Prometheus alerts without manual YAML edits, reducing alert fatigue and improving operational efficiency.
Contributor

IMHO the summary needs to be revised, it mixes up the what and the how.

Suggested change
Provide a user‑friendly UI and REST API for defining, viewing, editing, disabling/enabling and silencing Prometheus alerts without manual YAML edits, reducing alert fatigue and improving operational efficiency.
Improve alert management in the OCP console to expose a unified view of the alerting rules (and associated alerts), provide flexible filtering capabilities and allow rules management without manual YAML edits.


## Summary
Provide a user‑friendly UI and REST API for defining, viewing, editing, disabling/enabling and silencing Prometheus alerts without manual YAML edits, reducing alert fatigue and improving operational efficiency.
Platform alerts will be overriden leveraging the `AlertRelabelConfig` CR (in `cluster-monitoring-config`) rather than editing their original AlertingRules in the `PrometheusRule`.
Contributor

IMHO it doesn't fit in the summary; it belongs later. Also, there is no such resource as "AlertingRules".

Platform alerts will be overriden leveraging the `AlertRelabelConfig` CR (in `cluster-monitoring-config`) rather than editing their original AlertingRules in the `PrometheusRule`.

## Motivation
- While it's possible to customize built-in alerting rules with the `AlertingRule` + `AlertRelabelConfig` CRDs, the process is cumbersome and error-prone. It requires creating YAML manifests manually and there's no practical way to verify the correctness of the configuration.
Contributor

Suggested change
- While it's possible to customize built-in alerting rules with the `AlertingRule` + `AlertRelabelConfig` CRDs, the process is cumbersome and error-prone. It requires creating YAML manifests manually and there's no practical way to verify the correctness of the configuration.
- While it's possible to customize built-in alerting rules with the `AlertingRule` + `AlertRelabelConfig` CRDs, the process is cumbersome and error-prone. It requires creating YAML manifests manually and there's no practical way to verify the correctness of the configuration. Built-in alerting rules and alerts are still visible in the OCP console after they've been overridden .

Author

Updated


## Motivation
- While it's possible to customize built-in alerting rules with the `AlertingRule` + `AlertRelabelConfig` CRDs, the process is cumbersome and error-prone. It requires creating YAML manifests manually and there's no practical way to verify the correctness of the configuration.
- Some operational teams prefer an interactive console and API to manage alerts safely, guided by best practices.
Contributor

Suggested change
- Some operational teams prefer an interactive console and API to manage alerts safely, guided by best practices.
- Some operational teams prefer an interactive console and API to manage alerts.

Author

Updated. Thanks.


### User Stories

1. **Bulk disable during maintenance**
Contributor

I think that I commented before but "disable during maintenance" is a wrong use case. Otherwise it's similar to the next user story.

Author

True. I thought I updated it. Sorry.

- **Alerts View**: show current firing/pending instances, silence status, relabel context
- **Silencing Panel**: define matchers, duration, comment - Keep

### 3. Data Model
Contributor

this part doesn't make sense to me.

Author

Sorry, this was already updated in the design but not reflected here, only in the UX doc.


Rule identity

- A rule is uniquely identified by the tuple: `(namespace, prometheusrule, ruleName, severity)`.
Contributor

this assumption is wrong IMHO: I can find examples where 2 different rules would have the same tuple.

Author

Can you please expand on this? I don't think there should be the same alert rule with the same severity.

Note: The namespace and PrometheusRule were added and must be specified, so they will identify where it is saved to.

@simonpasquier simonpasquier Aug 28, 2025

One example is KubeAPIErrorBudgetBurn which has 2 alert definitions for each severity level (the distinction is based on the long and short labels).

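For illustration, roughly what two of those KubeAPIErrorBudgetBurn definitions look like in the kubernetes-mixin (paraphrased; the shipped thresholds and windows may differ):

```yaml
# Two alert definitions with the same name and severity, distinguished only
# by their long/short labels (expressions and values paraphrased).
- alert: KubeAPIErrorBudgetBurn
  expr: |
    sum(apiserver_request:burnrate1h) > (14.40 * 0.01)
    and sum(apiserver_request:burnrate5m) > (14.40 * 0.01)
  for: 2m
  labels:
    severity: critical
    long: 1h
    short: 5m
- alert: KubeAPIErrorBudgetBurn
  expr: |
    sum(apiserver_request:burnrate6h) > (6.00 * 0.01)
    and sum(apiserver_request:burnrate30m) > (6.00 * 0.01)
  for: 15m
  labels:
    severity: critical
    long: 6h
    short: 30m
```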

Author

@simonpasquier oh, this is bad. Thank you for pointing this out!
I was not aware this was possible.
I'll go back and review this.

Contributor

the primary key for a rule is composed of

  • the namespace & name of the PrometheusRule resource
  • the rule group
  • the alert or record name
  • the label name/value pairs

Author

Hi @simonpasquier, I updated based on your suggestion.
I think the group is not needed, but if it's a must I think we can add it to the path.

- `namespace`
- `component`
- `severity`
- `state` (one of: `enabled`, `disabled`, `silenced`)
Contributor

(nit) only alerts can be silenced, rules are not.

Author

Today we show in the alerting rules page whether the alert rule is silenced.
For example, when you create a silence that silences all info alerts,
you will see that the info alert rules are silenced.

Author

@simonpasquier do you know where this is calculated today? On the UI side?

Contributor

I suppose that it's computed on the frontend side. @jgbernalp would know :)

you will see that the info alert rules are silenced.

Sorry for being nit-picky but the UI tells how many alerts associated with the alerting rule are silenced.



## Proposal

### 1. API Endpoints
Contributor

I still fail to see the reason for a new API on top of what exists today:

  • Prometheus/Thanos APIs (/api/v1/rules and /api/v1/alerts)
  • Alertmanager API
  • Kubernetes API (for CRDs)

@sradco sradco force-pushed the alert_managment_enhancment_proposal branch 6 times, most recently from 2a192fc to 7fcde81 on September 1, 2025 14:55
## Summary
Provide a user‑friendly UI in the OpenShift Console to manage Prometheus alerts. Including defining, viewing, editing, disabling/enabling and silencing alerts without manual YAML edits, and provide alert grouping in order to reduce alerts fatigue and improve operational efficiency.

This includes adding a new REST APIs, Platform alerts will be overriden leveraging the `AlertRelabelConfig` CR (in `cluster-monitoring-config`) rather than editing their original AlertingRules in the `PrometheusRule`.
Contributor


Suggested change
This includes adding a new REST APIs, Platform alerts will be overriden leveraging the `AlertRelabelConfig` CR (in `cluster-monitoring-config`) rather than editing their original AlertingRules in the `PrometheusRule`.
As described in the existing [alert overrides](https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/alert-overrides.md) proposal, platform alerts will be overriden leveraging the `AlertRelabelConfig` CR (in `cluster-monitoring-config`) rather than editing their original AlertingRules in the `PrometheusRule`.


1. **Bulk disable that are not required**
As a cluster administrator, I want to select and disable multiple alerts in one action,
so that I can permanently suppress non‑critical notifications and reduce noise.
Contributor

I'm not sure if "disable alerts" means "silence active alerts" or "disable an existing alerting rule". If the latter, it overlaps with the next story.

Author

One is for a bulk update and the second is for a single alert. It's "disable".

Contributor

We've talked about it before and my preference would be to limit this capability to platform alerting rules. Otherwise we need to expand on how it will function for user-defined rules.

4. **Create a custom alerting rule**
As a cluster admin/developer, I want to create an alerting rule using a form which allows me to specify mandatory fields (PromQL expression, name) and recommended fields (for duration, well-known labels/annotations such as severity, summary).

5. **Clone an alert base on platform or user-defined alerting rule**
Contributor

Suggested change
5. **Clone an alert base on platform or user-defined alerting rule**
5. **Clone an alert based on an existing alerting rule (platform or user-defined)**

after I used it to tune the expression that I need.

## Goals
1. Add a Console UI for managing alerts.
Contributor

not really adding since there's already an Alerting page, right?

Author

There is no UI for managing alerts today.

Contributor

Sorry for being nit-picky but we need to be precise in the document and avoid confusing alerts and alerting rules. I feel that sometimes one name is used for the other.
I could argue that we can (at least partially) manage alerts from the UI since we can silence them. IIUC what we're mostly after is being able to edit an alerting rule without editing the YAML manifest of the PrometheusRule resource holding that rule.


## Goals
1. Add a Console UI for managing alerts.
2. Standardize `group` and `component` labels on alert rules to clearly surface priority and impact, to help administrators to understand what to address first.
Contributor

while related to the UI improvements, it could be a separate enhancement proposal since it will impact all component owners (OCP core as well as layered products).

Author

This is part of the effort. The UI uses these and it's part of the MVP.

Author

The UI will allow showing aggregated alerts based on labels at the top, in the "summary" of the alerts page.

Contributor

Then there's no detail in this enhancement about what it entails for existing operators.

Contributor

To be more specific, we have https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/alerting-consistency.md which describes the requirements and guidelines for OCP alerting rules (including layered products). It's fine to add new recommended labels but it should be documented there since it's the primary source of information for RH dev engineers.


Key flows:
- Web clients authenticate and access the OpenShift Console UI.
- Console calls the Alerting API to list/manage alerts and rules.
Contributor

@jan--f @jgbernalp this is where I'm not clear if the Alerting API component is absolutely needed. What prevents the aggregation from happening in the console (client-side)?

If we deploy this intermediary API service then it needs to have access to the user's credentials to ensure that it gets only the resources allowed for the user.

## Open Questions
1. **Per‑Alert vs. Single‑File**: Should each user‑defined alert reside in its own `PrometheusRule` file, or group all into one? A customer noted per‑alert files may simplify GitOps/Argo CD maintenance—does that hold true at scale?
2. **Read‑Only Detection**: Which annotations, labels or ownerReferences reliably indicate GitOps‑managed resources to render them read‑only in our UI?
3. **Concurrent Operator Updates**: How should we handle cases where upstream operators update their own `PrometheusRule` CRs—should we reconcile `AlertRelabelConfig` entries periodically?
Contributor

It's handled by CMO. If the question is "what happens when an operator changes the definition and the related AlertRelabelConfig no longer matches?" then (as of now) the onus is on the cluster admin, who needs to detect the drift and update their customization.

- Announce deprecation and support policy of the existing feature
- Deprecate the feature

## Upgrade / Downgrade Strategy
Contributor

An important upgrade consideration is that existing URLs should remain operational because we know that users depend on them.

- **GA**: best‑practice guidance in UI; multi‑namespace filtering; full test coverage; complete documentation

## Open Questions
1. **Per‑Alert vs. Single‑File**: Should each user‑defined alert reside in its own `PrometheusRule` file, or group all into one? A customer noted per‑alert files may simplify GitOps/Argo CD maintenance—does that hold true at scale?
Contributor

IMHO the flexibility should be left to the users: when they go through the creation wizard, the final step should ask whether they want to create a new PrometheusRule (and possibly in which namespace) or if they want to append to an existing PrometheusRule (and to which group).

@sradco sradco force-pushed the alert_managment_enhancment_proposal branch from 7fcde81 to ec558d0 on September 16, 2025 16:43
@sradco sradco force-pushed the alert_managment_enhancment_proposal branch from ec558d0 to 68975c4 on September 16, 2025 16:57

openshift-ci bot commented Sep 16, 2025

@sradco: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/markdownlint | 68975c4 | link | true | /test markdownlint |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

5. Disable/enable alerts by creating/updating entries in the `AlertRelabelConfig` CR.
6. Create/manage silences in Alertmanager and reflect this in the UI. - Already exists today.
7. Aggregate view of all alerting rules, showing definitions plus relabel context.
8. Aggregate view of all alerts, showing status (Pending, Firing, Silenced) and relabel context.
Contributor

Do these two points assume some kind of mapping between the alert/alerting rule and its corresponding relabel context? There's no explicit connection or contract right now if I am not mistaken.

- This is not in the current scope.

**Sub-tab: Manage Groups**
- Default groups will include **Impact Group: Cluster** and **Impact Group: Namespace**.
Contributor

Would it make sense for the group values to reflect the Source values (platform and user)?

@tremes tremes Sep 19, 2025

Nah, it probably doesn't make sense. The source and impact are two different things. Feel free to close this comment.

- `specHash` is server‑generated from the rule spec to ensure uniqueness and stability. It is computed as SHA‑256 (hex, lowercase) of the normalized fields:
- `expr`: trimmed with consecutive whitespace collapsed
- `for`: normalized to seconds
- `labels`: static labels only; drop empty values; sort by key ascending; join as `key=value` lines separated by `\n`
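As an illustration of the normalization described above, a hypothetical rule and the canonical fields that would be fed to SHA-256; the per-field rules come from the proposal, while the exact concatenation order is an assumption:

```yaml
# A hypothetical rule before normalization (illustrative values):
expr: '  rate(http_requests_total{code=~"5.."}[5m])   >  0.05 '
for: 5m
labels:
  severity: warning
  team: ""          # empty value, dropped during normalization
  app: shop         # "static" presumably means a literal, non-templated label

# Normalized fields, assumed here to be concatenated in this order and then
# hashed with SHA-256 (hex, lowercase):
#   expr   -> rate(http_requests_total{code=~"5.."}[5m]) > 0.05
#   for    -> 300
#   labels -> app=shop\nseverity=warning
```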
Contributor

What is meant by static labels please?
