Enhancement proposal for new APIs to support alerts management UI #1822
@machadovilaca @avlitman @nunnatsa please review this enhancement proposal.
Reading the proposal, it's not clear to me:
- whether we need a new supported API or whether only UI improvements are required.
- what are the gaps in the existing implementation that we want to fix (e.g. the `AlertingRule` and `AlertRelabelConfig` CRDs already exist, but what makes them hard to use?).
> ## Goals
> 1. CRUD operations on user‑defined alerts via UI and API.
> 2. Clone platform or user alerts to customize thresholds or scopes.
Do we need to support customization for user-defined alerting rules? Right now, alert customization via `AlertingRule` is only meant for platform alerting rules.
We need to support user-defined alerting rules and create the `PrometheusRule` resources, the same as is already possible today via the CLI.
But by support I mean creation and lifecycle management, not catching faulty expressions, summaries, descriptions, or any other user-defined detail.
> We need to support user-defined alerting rules and create the `PrometheusRule` resources, the same as is already possible today via the CLI.

You mean applying manifests with `oc`?
We would wrap the existing API (the one `oc` uses) that allows creating the alert rules, yes.
We want to keep these APIs under the same prefix as the new planned APIs, to give the UI a clear definition.
I'm not following your last comment, sorry. Which APIs are we talking about? For me CRDs are also APIs.
I mean we would wrap only the API for creating the new user-defined alerts (see the sketch below).
Other APIs require more complicated logic, since they will handle both user-defined alerting rules and platform alerts.
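For context, this is roughly what creating a user-defined alerting rule via the CLI looks like today; the resource name, namespace, metric, and thresholds below are illustrative only:

```yaml
# Illustrative user-defined alerting rule; names, namespace and expression are examples only.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-app-alerts
  namespace: my-app            # an example user workload namespace
spec:
  groups:
  - name: example-app.rules
    rules:
    - alert: ExampleAppHighErrorRate
      expr: |
        sum(rate(http_requests_total{job="example-app",code=~"5.."}[5m]))
          / sum(rate(http_requests_total{job="example-app"}[5m])) > 0.05
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: Example app is serving more than 5% errors.
```

Applied with `oc apply -f example-app-alerts.yaml`; as described above, the new API would wrap this creation flow for the console.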
> 1. CRUD operations on user‑defined alerts via UI and API.
> 2. Clone platform or user alerts to customize thresholds or scopes.
> 3. Disable/enable alerts by creating/updating entries in the `AlertRelabelConfig` CR.
> 4. Create/manage silences in Alertmanager and reflect this in the UI.
What are the gaps in the current product? It already supports silence management in the console. The same remark applies to the following bullet points.
This effort will integrate silences as part of the overall feature and will use the existing silences APIs; the UI may require some updates for a unified view and management.
See my comment above, I still don't understand what's missing in the console wrt silences.
@simonpasquier We need REST APIs to support adding a UI for managing alerts.
I feel that we're taking the problem backwards and starting from the implementation rather than the user experience. As an OpenShift user/cluster admin, do I need a REST API? Or is the main problem that the UI currently displays inconsistent information?
> so that I can quickly mute a noisy alert or restore it once the underlying issue is resolved.
>
> 3. **Clone and customize platform alerts**
>    As a cluster admin, I want to clone an existing built‑in alert, adjust its threshold and severity, and save it as a user‑defined rule,
> and save it as a user‑defined rule

IMHO this part isn't needed. The use case here is that you want to "replace" the definition of a built-in alerting rule because it doesn't fit your environment, and the new rule behaves as another platform alerting rule.
I would also recommend concrete examples (e.g. "by default the NodeNetworkReceiveErrs alert fires when there's more than 1% of received packets in error but I want the threshold to be 2%" and "I want the SamplesDegraded alerting rule's severity to be info rather than warning").
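To make the first example concrete, a hedged sketch of how a custom copy of that rule with a 2% threshold can already be added today through the `AlertingRule` CRD; the rule name and expression below are approximations for illustration, not the exact upstream definition:

```yaml
# Illustrative only: a platform-style copy of NodeNetworkReceiveErrs with a 2% threshold.
apiVersion: monitoring.openshift.io/v1
kind: AlertingRule
metadata:
  name: node-network-receive-errs-2pct
  namespace: openshift-monitoring   # AlertingRule objects must be created in this namespace
spec:
  groups:
  - name: custom-node-network-rules
    rules:
    - alert: NodeNetworkReceiveErrs2Pct
      expr: |
        rate(node_network_receive_errs_total[2m])
          / rate(node_network_receive_packets_total[2m]) > 0.02
      for: 1h
      labels:
        severity: warning
      annotations:
        summary: Network interface is reporting more than 2% receive errors.
```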
Not necessarily. Sometimes users will clone and add another similar alert while keeping the existing one.
For example, you may want to fire one alert with severity warning at 70% and another with severity critical at 90% (see the sketch below).
Or you may just use the source alert's expression and details as a reference for your new alert and want to inspect the expression.
We will not disable the source alert as part of the clone process; users are expected to do that themselves.
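A minimal sketch of that two-threshold outcome, assuming an illustrative filesystem-usage metric; both rules coexist and neither replaces the other:

```yaml
# Illustrative only: a cloned rule kept alongside the original, at a higher threshold/severity.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: filesystem-usage-alerts
  namespace: my-app
spec:
  groups:
  - name: filesystem-usage
    rules:
    - alert: FilesystemUsageHigh
      expr: 1 - node_filesystem_avail_bytes / node_filesystem_size_bytes > 0.70
      for: 30m
      labels:
        severity: warning
    - alert: FilesystemUsageCritical
      expr: 1 - node_filesystem_avail_bytes / node_filesystem_size_bytes > 0.90
      for: 10m
      labels:
        severity: critical
```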
Agreed, but I would separate the use cases for clarity:
- Replace an existing platform alerting rule by a custom definition.
- Create a new platform alerting rule from an existing platform alerting rule.
- Create a new user-defined alerting rule from a platform or user-defined alerting rule.
@simonpasquier How can a user create a new "platform alerting rule"? (your second bullet)
> - Create a new platform alerting rule from an existing platform alerting rule.

The link you provided is about creating alerting rules out of platform metrics, not alerts.
And below it, it says this about updating platform alerts (a sketch of the corresponding `AlertRelabelConfig` follows these quotes):
- "As a cluster administrator, you can modify core platform alerts before Alertmanager routes them to a receiver. For example, you can change the severity label of an alert, add a custom label, or exclude an alert from being sent to Alertmanager."
- "You must create the AlertRelabelConfig object in the openshift-monitoring namespace. Otherwise, the alert label will not change."
> so that I can tailor default monitoring to suit my team’s SLAs without modifying upstream operators.
>
> 4. **Create alert from a metric query**
>    As a developer, I want to write or paste a PromQL expression in a “Create Alert” form, specify duration and severity, and save it,
What if you could create an alerting rule from the Metrics > Observe page?
Not in scope of this effort, but possible later on I assume.
I think that it would simplify the workflow a lot: you use existing data to tune the expression and then you turn it into a rule.
@simonpasquier We can simplify the process a bit by not allowing any updates to platform alerts and always assuming that updates require creating a new alert and disabling the platform one. This could be a valid alternative that we can consider.
That way we would only use the `AlertRelabelConfig` to disable/enable alerts and to set Group and Component labels (see the sketch below). WDYT?
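A minimal sketch of the label-setting half of that idea, assuming the `group` label name from this proposal; the alert name and value are illustrative:

```yaml
# Illustrative only: stamp a "group" label onto a platform alert via relabeling.
apiVersion: monitoring.openshift.io/v1
kind: AlertRelabelConfig
metadata:
  name: example-impact-group
  namespace: openshift-monitoring
spec:
  configs:
  - sourceLabels: [alertname]
    regex: KubeNodeNotReady       # example alert to tag
    targetLabel: group
    replacement: cluster
    action: Replace
```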
- "@jan--f" | ||
- "@jgbernalp" | ||
api-approvers: | ||
- TBD |
@simonpasquier who can be the api approvers?
The `api-approvers` field is meant for custom resource definitions, which we're not going to add/modify IIUC.
If we implement a new REST API, I assume that the approvers depend on where it's going to live.
We plan to add the new REST APIs to CMO
CMO is a single replica deployment so not a good fit for an API server since it won't be highly available.
Additionally that intention should perhaps be mentioned in this proposal.
> CMO is a single replica deployment so not a good fit for an API server since it won't be highly available.

@simonpasquier, @jan--f What do you propose? When we discussed this with @jgbernalp, I believe we agreed this would be part of CMO.

> Additionally that intention should perhaps be mentioned in this proposal.

@jan--f I updated the proposal. Please see if this is better.
Hi @jan--f, @jgbernalp, I would appreciate your review of this proposal.
I think improving how we deal with alerts in the UI is a great effort. We could do with a more unified and bulk-friendly approach there. Some mockups for this would be helpful to illustrate what is being proposed (and where the current implementation falls short).
The proposed REST API doesn't make sense to me. IIUC we already have the k8s API that covers all the proposed functionality. The motivation section mentions:

> While it's possible to customize built-in alerting rules with the `AlertingRule` + `AlertRelabelConfig` CRDs, the process is cumbersome and error-prone. It requires creating YAML manifests manually and there's no practical way to verify the correctness of the configuration.

However, this proposal doesn't demonstrate how to pass data to the new proposed API, nor does it outline how correctness should be verified a priori.
> # New APIs to support Alerts UI Management in OpenShift
>
> ## Summary
> Provide a user‑friendly UI and REST API for defining, viewing, editing, disabling/enabling and silencing Prometheus alerts without manual YAML edits, reducing alert fatigue and improving operational efficiency.
IMHO the summary needs to be revised; it mixes up the what and the how. Suggested wording:

> Improve alert management in the OCP console to expose a unified view of the alerting rules (and associated alerts), provide flexible filtering capabilities and allow rules management without manual YAML edits.
> ## Summary
> Provide a user‑friendly UI and REST API for defining, viewing, editing, disabling/enabling and silencing Prometheus alerts without manual YAML edits, reducing alert fatigue and improving operational efficiency.
> Platform alerts will be overriden leveraging the `AlertRelabelConfig` CR (in `cluster-monitoring-config`) rather than editing their original AlertingRules in the `PrometheusRule`.
IMHO it doesn't fit in the summary but belongs later. Also, there's no such resource as `AlertingRules`.
> Platform alerts will be overriden leveraging the `AlertRelabelConfig` CR (in `cluster-monitoring-config`) rather than editing their original AlertingRules in the `PrometheusRule`.
>
> ## Motivation
> - While it's possible to customize built-in alerting rules with the `AlertingRule` + `AlertRelabelConfig` CRDs, the process is cumbersome and error-prone. It requires creating YAML manifests manually and there's no practical way to verify the correctness of the configuration.
Suggested change:

> - While it's possible to customize built-in alerting rules with the `AlertingRule` + `AlertRelabelConfig` CRDs, the process is cumbersome and error-prone. It requires creating YAML manifests manually and there's no practical way to verify the correctness of the configuration. Built-in alerting rules and alerts are still visible in the OCP console after they've been overridden.
Updated
> ## Motivation
> - While it's possible to customize built-in alerting rules with the `AlertingRule` + `AlertRelabelConfig` CRDs, the process is cumbersome and error-prone. It requires creating YAML manifests manually and there's no practical way to verify the correctness of the configuration.
> - Some operational teams prefer an interactive console and API to manage alerts safely, guided by best practices.
Suggested change:

> - Some operational teams prefer an interactive console and API to manage alerts.
Updated. Thanks.
> ### User Stories
>
> 1. **Bulk disable during maintenance**
I think that I commented before but "disable during maintenance" is a wrong use case. Otherwise it's similar to the next user story.
True. I thought I updated it. Sorry.
> - **Alerts View**: show current firing/pending instances, silence status, relabel context
> - **Silencing Panel**: define matchers, duration, comment - Keep
>
> ### 3. Data Model
this part doesn't make sense to me.
Sorry, this was already updated in the design but is only reflected in the UX doc, not here.
> Rule identity
>
> - A rule is uniquely identified by the tuple: `(namespace, prometheusrule, ruleName, severity)`.
This assumption is wrong IMHO: I can find examples where 2 different rules would have the same tuple.
Can you please expand on this? I don't think there should be the same alert rule with the same severity.
Note: the namespace and PrometheusRule were added and must be specified, so they identify where the rule is saved.
@simonpasquier oh, this is bad. Thank you for pointing this out!
I was not aware this was possible. I'll go back and review this.
The primary key for a rule is composed of:
- the namespace & name of the PrometheusRule resource
- the rule group
- the alert or record name
- the label name/value pairs
Hi @simonpasquier, I updated based on your suggestion.
I think the group is not needed, but if it's a must, I think we can add it to the path.
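To illustrate why the label name/value pairs matter for identity, a hypothetical example where the original tuple `(namespace, prometheusrule, ruleName, severity)` would be ambiguous; the metric and routes are invented for the sake of the example:

```yaml
# Hypothetical: two distinct rules share namespace, PrometheusRule name,
# alert name and severity; only their labels (and expressions) differ.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: api-latency
  namespace: my-app
spec:
  groups:
  - name: latency
    rules:
    - alert: APILatencyHigh
      expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{route="/login"}[5m])) > 1
      labels:
        severity: warning
        route: /login
    - alert: APILatencyHigh
      expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{route="/search"}[5m])) > 2
      labels:
        severity: warning
        route: /search
```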
> - `namespace`
> - `component`
> - `severity`
> - `state` (one of: `enabled`, `disabled`, `silenced`)
(nit) only alerts can be silenced, rules are not.
Today we show on the alerting rules page whether the alert rule is silenced.
When you create a silence, for example one that silences all info alerts, you will see that the info alerting rules are marked as silenced.
@simonpasquier do you know where this is calculated today? On the UI side?
I suppose that it's computed on the frontend side. @jgbernalp would know :)

> You will see that the info alert rules are silenced.

Sorry for being nit-picky, but the UI tells you how many alerts associated with the alerting rule are silenced.

> ## Proposal
>
> ### 1. API Endpoints
I still fail to see the reason for a new API on top of what exists today:
- Prometheus/Thanos APIs (/api/v1/rules and /api/v1/alerts)
- Alertmanager API
- Kubernetes API (for CRDs)
> ## Summary
> Provide a user‑friendly UI in the OpenShift Console to manage Prometheus alerts. Including defining, viewing, editing, disabling/enabling and silencing alerts without manual YAML edits, and provide alert grouping in order to reduce alerts fatigue and improve operational efficiency.
>
> This includes adding a new REST APIs, Platform alerts will be overriden leveraging the `AlertRelabelConfig` CR (in `cluster-monitoring-config`) rather than editing their original AlertingRules in the `PrometheusRule`.
Suggested change:

> As described in the existing [alert overrides](https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/alert-overrides.md) proposal, platform alerts will be overriden leveraging the `AlertRelabelConfig` CR (in `cluster-monitoring-config`) rather than editing their original AlertingRules in the `PrometheusRule`.
> 1. **Bulk disable that are not required**
>    As a cluster administrator, I want to select and disable multiple alerts in one action,
>    so that I can permanently suppress non‑critical notifications and reduce noise.
I'm not sure if "disable alerts" means "silence active alerts" or "disable an existing alerting rule". If the latter, it overlaps with the next story.
One is for bulk updates and the second is for a single alert. It's "disable".
We've talked about it before and my preference would be to limit this capability to platform alerting rules. Otherwise we need to expand how it will function for user-defined rules.
> 4. **Create a custom alerting rule**
>    As a cluster admin/developer, I want to create an alerting rule using a form which allows me to specify mandatory fields (PromQL expression, name) and recommended fields (for duration, well-known labels/annotations such as severity, summary).
>
> 5. **Clone an alert base on platform or user-defined alerting rule**
Suggested change:

> 5. **Clone an alert based on an existing alerting rule (platform or user-defined)**
> after I used it to tune the expression that I need.
>
> ## Goals
> 1. Add a Console UI for managing alerts.
not really adding since there's already an Alerting page, right?
No UI for managing alerts today.
Sorry for being nit-picky, but we need to be precise in the document and avoid confusing alerts and alerting rules. I feel that sometimes one name is used for the other.
I could argue that we can (at least partially) manage alerts from the UI since we can silence them. IIUC what we're mostly after is being able to edit an alerting rule without editing the YAML manifest of the PrometheusRule resource holding that rule.
> ## Goals
> 1. Add a Console UI for managing alerts.
> 2. Standardize `group` and `component` labels on alert rules to clearly surface priority and impact, to help administrators to understand what to address first.
while related to the UI improvements, it could be a separate enhancement proposal since it will impact all component owners (OCP core as well as layered products).
This is part of the effort. The UI uses these labels and it's part of the MVP.
The UI will allow showing aggregated alerts based on these labels at the top, in the summary of the alerts page.
Then there's no detail in this enhancement about what it entails for existing operators.
To be more specific, we have https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/alerting-consistency.md which describes the requirements and guidelines for OCP alerting rules (including layered products). It's fine to add new recommended labels but it should be documented there since it's the primary source of information for RH dev engineers.
> Key flows:
> - Web clients authenticate and access the OpenShift Console UI.
> - Console calls the Alerting API to list/manage alerts and rules.
@jan--f @jgbernalp this is where I'm not clear whether the Alerting API component is absolutely needed. What prevents the aggregation from happening in the console (client-side)?
If we deploy this intermediary API service, then it needs to have access to the user's credentials to ensure that it only gets the resources allowed for the user.
> ## Open Questions
> 1. **Per‑Alert vs. Single‑File**: Should each user‑defined alert reside in its own `PrometheusRule` file, or group all into one? A customer noted per‑alert files may simplify GitOps/Argo CD maintenance—does that hold true at scale?
> 2. **Read‑Only Detection**: Which annotations, labels or ownerReferences reliably indicate GitOps‑managed resources to render them read‑only in our UI?
> 3. **Concurrent Operator Updates**: How should we handle cases where upstream operators update their own `PrometheusRule` CRs—should we reconcile `AlertRelabelConfig` entries periodically?
It's handled by CMO. If the question is "what happens when an operator changes the definition and the related AlertRelabelConfig no longer matches?", then (as of now) the onus is on the cluster admin, who needs to detect the drift and update their customization.
> - Announce deprecation and support policy of the existing feature
> - Deprecate the feature
>
> ## Upgrade / Downgrade Strategy
An important upgrade consideration is that existing URLs should remain operational because we know that users depend on them.
> - **GA**: best‑practice guidance in UI; multi‑namespace filtering; full test coverage; complete documentation
>
> ## Open Questions
> 1. **Per‑Alert vs. Single‑File**: Should each user‑defined alert reside in its own `PrometheusRule` file, or group all into one? A customer noted per‑alert files may simplify GitOps/Argo CD maintenance—does that hold true at scale?
IMHO the flexibility should be left to the users: when they go through the creation wizard, the final step should ask whether they want to create a new PrometheusRule (and possibly in which namespace) or if they want to append to an existing PrometheusRule (and to which group).
> 5. Disable/enable alerts by creating/updating entries in the `AlertRelabelConfig` CR.
> 6. Create/manage silences in Alertmanager and reflect this in the UI. - Already exists today.
> 7. Aggregate view of all alerting rules, showing definitions plus relabel context.
> 8. Aggregate view of all alerts, showing status (Pending, Firing, Silenced) and relabel context.
Do these two points assume some kind of mapping between the alert/alerting rule and its corresponding relabel context? There's no explicit connection or contract right now if I am not mistaken.
> - This is not in the current scope.
>
> **Sub-tab: Manage Groups**
> - Default groups will include **Impact Group: Cluster** and **Impact Group: Namespace**.
Would it make sense for the group values to reflect the `Source` values, `platform` and `user`?
Nah, it probably doesn't make sense. The source and impact are two different things. Feel free to close this comment.
> - `specHash` is server‑generated from the rule spec to ensure uniqueness and stability. It is computed as SHA‑256 (hex, lowercase) of the normalized fields:
>   - `expr`: trimmed with consecutive whitespace collapsed
>   - `for`: normalized to seconds
>   - `labels`: static labels only; drop empty values; sort by key ascending; join as `key=value` lines separated by `\n`
What is meant by static labels please?
This PR includes the enhancement proposal for a new Alerts Management UI.