
Enhancement proposal for multi-cluster alerts management #1921

Open
sradco wants to merge 1 commit into openshift:master from sradco:multi_cluster_alert_managment_enhancment

Conversation


@sradco sradco commented Jan 12, 2026

This PR includes the enhancement proposal for a new Multi-Cluster Alerts Management UI.

@openshift-ci openshift-ci bot requested review from jan--f and moadz January 12, 2026 19:39

openshift-ci bot commented Jan 12, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign jan--f for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


sradco commented Jan 12, 2026

@coleenquadros @jacobbaungard @simonpasquier @jgbernalp @jan--f @moadz I would appreciate your review of this proposal for a multi-cluster alerting management UI.

@sradco sradco force-pushed the multi_cluster_alert_managment_enhancment branch 3 times, most recently from 7e23e3d to 4749f64 Compare January 18, 2026 10:44
@openshift-bot

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 16, 2026
@openshift-bot

Stale enhancement proposals rot after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Rotten proposals close after an additional 7d of inactivity.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 23, 2026

sradco commented Feb 25, 2026

/remove-lifecycle rotten

@openshift-ci openshift-ci bot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Feb 25, 2026
@sradco sradco force-pushed the multi_cluster_alert_managment_enhancment branch 2 times, most recently from 90d1528 to 53326b6 Compare March 4, 2026 15:23

sradco commented Mar 4, 2026

Hi @jacobbaungard , Please review this enhancement proposal.
It builds on top of #1822 and #1917.

@sradco sradco force-pushed the multi_cluster_alert_managment_enhancment branch 2 times, most recently from 4e9725c to 61e639b Compare March 11, 2026 13:56

@simonpasquier simonpasquier left a comment


Did a quick pass, but the proposal would benefit from being split into different parts because it's quite impossible to review in its current state. I'd recommend focusing on one part at a time, such as visualization of spoke/hub alerts in the console.
I'd also expect some input from the ACM observability folks about the recommended approach for alert silencing.


**Hub Alertmanager (transient, in-memory)**

Hub AM holds active alert instances in memory, backed by a local PVC for silences and notification state. It is not a persistent store — resolved alerts are dropped after a configurable grace period (`resolve_timeout: 5m`). Hub AM can answer "what is firing right now?" but cannot answer "what was firing yesterday?" Once an alert resolves and the grace period passes, it is gone from hub AM.

(nit) resolve_timeout has nothing to do with storage. It defines the duration that should be added to alerts without an end timestamp. In practice alerts emitted by Prometheus/Thanos always have a non-zero end.

- **No ARC-applied labels**: The `ALERTS` metric is produced by Prometheus rule evaluation, before ARCs are applied. It lacks `openshift_io_alert_rule_id`, `openshift_io_alert_rule_component`, and `openshift_io_alert_rule_layer`.
- **No silence awareness**: Silenced alerts still appear as `alertstate="firing"` in the `ALERTS` metric — Prometheus does not know about Alertmanager silences.
- **`managed_cluster` is stripped**: The metrics-collector strips the `managed_cluster` label during federation. Only the `cluster` label (added by MCOA addon write relabel configs) is available on hub Thanos.
- **No disabled alert awareness**: ARC-dropped alerts never fire, so they are absent from `ALERTS`, but there is no way to distinguish "never fired" from "disabled by ARC."
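
To make the first limitation concrete, here is a minimal sketch of the kind of spoke-side AlertRelabelConfig referenced above. All names and label values are hypothetical; the point is that labels added this way appear only on alerts sent to Alertmanager, never on the `ALERTS` metric produced by rule evaluation:

```yaml
# Hypothetical sketch of a spoke-side AlertRelabelConfig (monitoring.openshift.io/v1).
# Labels applied here reach Alertmanager only; the ALERTS metric is emitted by
# rule evaluation before this pipeline runs, which is the gap described above.
apiVersion: monitoring.openshift.io/v1
kind: AlertRelabelConfig
metadata:
  name: example-rule-metadata        # hypothetical name
  namespace: openshift-monitoring
spec:
  configs:
    - sourceLabels: [alertname]
      regex: KubePodCrashLooping     # hypothetical match
      targetLabel: openshift_io_alert_rule_layer
      replacement: workload
      action: Replace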

Hmm, RelabelConfigs don't change the alerting rules evaluated by Prometheus, only the alerts sent to Alertmanager.


**Future iteration — `HubAlertingRule` CRD as single source of truth:**

To address MVP limitations, introduce a `HubAlertingRule` CRD in `open-cluster-management-observability`. All hub rules — both operator defaults and user-created custom rules — are represented as CRDs. A reconciler watches these CRDs and generates the ConfigMaps that ThanosRuler reads.
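
A possible shape for such a CRD is sketched below. Every field name here is illustrative, not a settled API; the group/version is a guess at where such a resource might live:

```yaml
# Illustrative only: one possible HubAlertingRule shape, not a settled API.
apiVersion: observability.open-cluster-management.io/v1alpha1   # hypothetical group/version
kind: HubAlertingRule
metadata:
  name: fleet-node-down              # hypothetical rule name
  namespace: open-cluster-management-observability
spec:
  enabled: true                      # reconciler would drop disabled rules from the generated ConfigMap
  groups:
    - name: fleet.rules
      rules:
        - alert: FleetNodeDown       # illustrative rule content
          expr: up{job="node-exporter"} == 0
          for: 5m
          labels:
            severity: warning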

Why introduce yet another rule CRD? We already have the PrometheusRule CRD.

The controller periodically polls each spoke Alertmanager (`GET /api/v2/silences` via ManagedClusterProxy) and reconciles the state on hub AM:

- **Create**: when a new active silence is found on a spoke, the controller creates a replica on hub AM. The replica includes all original matchers plus an additional `managed_cluster=<cluster-name>` matcher to scope it to that spoke's alerts. A label or annotation `sync.source=<cluster-name>/<silence-id>` is added to the hub silence comment for traceability and to prevent conflicts with user-created hub silences.
- **Update**: if a spoke silence's `endsAt` is extended or matchers change, the controller expires the old hub replica and creates a new one.
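
For illustration, the hub replica of a spoke silence could look like the following Alertmanager v2 silence object (shown as YAML; the cluster name, timestamps, and silence ID are hypothetical):

```yaml
# Hypothetical hub-side replica of a spoke silence (body for POST /api/v2/silences).
matchers:
  - name: alertname
    value: KubePodCrashLooping       # copied from the original spoke silence
    isRegex: false
    isEqual: true
  - name: managed_cluster            # added by the controller to scope the replica
    value: spoke-cluster-1
    isRegex: false
    isEqual: true
startsAt: "2026-03-11T13:00:00Z"
endsAt: "2026-03-11T17:00:00Z"
createdBy: silence-sync-controller
comment: "sync.source=spoke-cluster-1/9f3c2a10-0000-4000-8000-000000000000"   # traceability marker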

Note that a request to update a silence means "expire the silence" then "create a new silence". Similarly, deleting a silence = expiring it.


For MVP, the UI focuses on the real-time alerts page (hub AM). Historical alert views are a future enhancement that depends on the `alerts_effective_*` metric being deployed and federated.

### Silence Sync Controller

I've got concerns about the whole approach around silences. Silences in a "single Alertmanager cluster" situation are replicated using an approach which favors availability over consistency (using Conflict-free replicated data types under the hood). It means that we have no real guarantee that 2 Alertmanager instances in the same spoke have a consistent state for silences.


### Hub Rule Management

Hub alerting rules are evaluated by MCOA ThanosRuler over federated data from hub Thanos. ThanosRuler uses ConfigMap-based rule files, not PrometheusRule CRDs.

IMHO the first action should be to move away from configmap rules and adopt PrometheusRule to have a consistent approach between single cluster and multi cluster.

Hub ThanosRuler has no ARC (AlertRelabelConfig) pipeline. The disable mechanism differs from single-cluster:
- On spokes: ARC `action: Drop` prevents the alert from firing while keeping the rule definition visible in the API.
- On hub (MVP): no per-rule disable. Removing a rule from `thanos-ruler-custom-rules` deletes it. Users can silence individual alerts via the hub Alertmanager as a workaround. Default hub rules cannot be modified — use silences.
- On hub (future CRD): setting `spec.enabled: false` on the CRD removes the rule from the generated ConfigMap. ThanosRuler stops evaluating it. The rule definition remains visible in the API via the CRD.
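
For context, a custom hub rule in the MVP lives in the `thanos-ruler-custom-rules` ConfigMap, roughly as sketched below. The rule content is illustrative; the data key name follows ACM's documented convention, but verify it against your ACM version:

```yaml
# Sketch of a custom hub rule ConfigMap; key name per ACM convention (verify for your version).
apiVersion: v1
kind: ConfigMap
metadata:
  name: thanos-ruler-custom-rules
  namespace: open-cluster-management-observability
data:
  custom_rules.yaml: |
    groups:
      - name: cluster-health
        rules:
          - alert: ClusterMemoryLow          # illustrative rule
            expr: sum(node_memory_MemAvailable_bytes) by (cluster) / sum(node_memory_MemTotal_bytes) by (cluster) < 0.1
            for: 10m
            labels:
              severity: warning
```

Deleting the `cluster-health` group from this ConfigMap is the only MVP way to stop the rule, which is the per-rule-disable gap described above.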

There's no need for a spec field. If there's a need for rule disablement (and that's a big "if" for me), it should work from the resource's labels.


Hub Alertmanager receives alerts from all spoke clusters (via `additionalAlertmanagerConfigs`) and from ThanosRuler. By default, hub AM is configured with a `null` receiver — it accepts and stores alert state but does not send notifications. However, users can customize the hub AM configuration to add real notification receivers (Slack, PagerDuty, email, webhooks, etc.) and routing rules.

This makes hub AM a potential **centralized notification hub** for the fleet: instead of configuring receivers on each individual spoke cluster, users can configure them once on hub AM and receive notifications about alerts from all managed clusters in a single place. The `managed_cluster` label on spoke alerts enables routing by cluster (e.g., production clusters to PagerDuty, dev clusters to Slack).
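
The routing described above maps onto standard Alertmanager configuration. A minimal sketch, with hypothetical receiver names and cluster-name patterns:

```yaml
# Minimal sketch of hub Alertmanager routing by managed_cluster (names are hypothetical).
route:
  receiver: default-null             # default: accept and store state, send nothing
  routes:
    - matchers:
        - managed_cluster =~ "prod-.*"
      receiver: pagerduty-fleet
    - matchers:
        - managed_cluster =~ "dev-.*"
      receiver: slack-dev
receivers:
  - name: default-null
  - name: pagerduty-fleet
    pagerduty_configs:
      - routing_key: <pagerduty-integration-key>   # placeholder
  - name: slack-dev
    slack_configs:
      - api_url: https://hooks.slack.com/services/EXAMPLE   # placeholder webhook
        channel: "#fleet-alerts"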

Why is it highlighted as "potential"? Isn't this a configuration already supported today for ACM users?

- Hub AM can serve as a centralized notification hub for spoke alerts. Users can configure receivers (Slack, PagerDuty, email) on hub AM and route notifications by `managed_cluster` label — enabling fleet-wide notification management from a single configuration point instead of configuring receivers on each spoke individually.
- The hub AM config Secret uses `skip-creation-if-exist: "true"`, so user customizations are preserved across operator reconciliation.
- Future UI improvements could include managing hub AM receivers and routes from the console, multi‑cluster routing by cluster labels (region, team), notifications by impact group and component, and team‑scoped subscriptions honoring RBAC.
- The silence sync controller is essential for notification consistency: spoke silences must be replicated to hub AM so that both spoke-local and hub-centralized notifications are suppressed for silenced alerts.

This assumes that users configure both Alertmanagers. For context, we added the possibility to disable Alertmanager in the spoke clusters at the request of ACM a long time ago, so that alert notifications would be managed only at the hub level.

Signed-off-by: Shirly Radco <sradco@redhat.com>
@sradco sradco force-pushed the multi_cluster_alert_managment_enhancment branch from 61e639b to 35ff2f6 Compare March 18, 2026 11:40

openshift-ci bot commented Mar 18, 2026

@sradco: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/markdownlint
Commit: 35ff2f6
Required: true
Rerun command: /test markdownlint

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
