
Enhancement proposal for multi-cluster alerts management #1921

Open
sradco wants to merge 1 commit into openshift:master from sradco:multi_cluster_alert_managment_enhancment

Conversation


@sradco sradco commented Jan 12, 2026

This PR includes the enhancement proposal for a new Multi-Cluster Alerts Management UI.

@openshift-ci openshift-ci bot requested review from jan--f and moadz January 12, 2026 19:39

openshift-ci bot commented Jan 12, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign jan--f for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


sradco commented Jan 12, 2026

@coleenquadros @jacobbaungard @simonpasquier @jgbernalp @jan--f @moadz I would appreciate your review of this proposal for a multi-cluster alerting management UI.

@sradco sradco force-pushed the multi_cluster_alert_managment_enhancment branch 3 times, most recently from 7e23e3d to 4749f64 Compare January 18, 2026 10:44
@openshift-bot

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 16, 2026
@openshift-bot

Stale enhancement proposals rot after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Rotten proposals close after an additional 7d of inactivity.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 23, 2026

sradco commented Feb 25, 2026

/remove-lifecycle rotten

@openshift-ci openshift-ci bot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Feb 25, 2026
@sradco sradco force-pushed the multi_cluster_alert_managment_enhancment branch 2 times, most recently from 90d1528 to 53326b6 Compare March 4, 2026 15:23

sradco commented Mar 4, 2026

Hi @jacobbaungard , Please review this enhancement proposal.
It builds on top of #1822 and #1917.

@sradco sradco force-pushed the multi_cluster_alert_managment_enhancment branch 2 times, most recently from 4e9725c to 61e639b Compare March 11, 2026 13:56

@simonpasquier simonpasquier left a comment


Did a quick pass, but the proposal would benefit from being split into different parts because it's quite impossible to review in its current state. I'd recommend focusing on one part at a time, such as visualization of spoke/hub alerts in the console.
I'd also expect some input from the ACM observability folks about the recommended approach for alert silencing.


**Hub Alertmanager (transient, in-memory)**

Hub AM holds active alert instances in memory, backed by a local PVC for silences and notification state. It is not a persistent store — resolved alerts are dropped after a configurable grace period (`resolve_timeout: 5m`). Hub AM can answer "what is firing right now?" but cannot answer "what was firing yesterday?" Once an alert resolves and the grace period passes, it is gone from hub AM.

(nit) resolve_timeout has nothing to do with storage. It defines the duration that should be added to alerts without an end timestamp. In practice alerts emitted by Prometheus/Thanos always have a non-zero end.

- **No ARC-applied labels**: The `ALERTS` metric is produced by Prometheus rule evaluation, before ARCs are applied. It lacks `openshift_io_alert_rule_id`, `openshift_io_alert_rule_component`, and `openshift_io_alert_rule_layer`.
- **No silence awareness**: Silenced alerts still appear as `alertstate="firing"` in the `ALERTS` metric — Prometheus does not know about Alertmanager silences.
- **`managed_cluster` is stripped**: The metrics-collector strips the `managed_cluster` label during federation. Only the `cluster` label (added by MCOA addon write relabel configs) is available on hub Thanos.
- **No disabled alert awareness**: ARC-dropped alerts never fire, so they are absent from `ALERTS`, but there is no way to distinguish "never fired" from "disabled by ARC."
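
To make the first limitation concrete, here is a minimal sketch of the kind of spoke-side AlertRelabelConfig referenced above. All names and label values are hypothetical; the point is that labels added this way appear only on alerts sent to Alertmanager, never on the `ALERTS` metric produced by rule evaluation:

```yaml
# Hypothetical sketch of a spoke-side AlertRelabelConfig (monitoring.openshift.io/v1).
# Labels applied here reach Alertmanager only; the ALERTS metric is emitted by
# rule evaluation before this pipeline runs, which is the gap described above.
apiVersion: monitoring.openshift.io/v1
kind: AlertRelabelConfig
metadata:
  name: example-rule-metadata        # hypothetical name
  namespace: openshift-monitoring
spec:
  configs:
    - sourceLabels: [alertname]
      regex: KubePodCrashLooping     # hypothetical match
      targetLabel: openshift_io_alert_rule_layer
      replacement: workload
      action: Replace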

Hmm, RelabelConfigs don't change the alerting rules evaluated by Prometheus, only the alerts sent to Alertmanager.


**Future iteration — `HubAlertingRule` CRD as single source of truth:**

To address MVP limitations, introduce a `HubAlertingRule` CRD in `open-cluster-management-observability`. All hub rules — both operator defaults and user-created custom rules — are represented as CRDs. A reconciler watches these CRDs and generates the ConfigMaps that ThanosRuler reads.
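
A possible shape for such a CRD is sketched below. Every field name here is illustrative, not a settled API; the group/version is a guess at where such a resource might live:

```yaml
# Illustrative only: one possible HubAlertingRule shape, not a settled API.
apiVersion: observability.open-cluster-management.io/v1alpha1   # hypothetical group/version
kind: HubAlertingRule
metadata:
  name: fleet-node-down              # hypothetical rule name
  namespace: open-cluster-management-observability
spec:
  enabled: true                      # reconciler would drop disabled rules from the generated ConfigMap
  groups:
    - name: fleet.rules
      rules:
        - alert: FleetNodeDown       # illustrative rule content
          expr: up{job="node-exporter"} == 0
          for: 5m
          labels:
            severity: warning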

Why introduce yet another rule CRD? We already have the PrometheusRule CRD.

The controller periodically polls each spoke Alertmanager (`GET /api/v2/silences` via ManagedClusterProxy) and reconciles the state on hub AM:

- **Create**: when a new active silence is found on a spoke, the controller creates a replica on hub AM. The replica includes all original matchers plus an additional `managed_cluster=<cluster-name>` matcher to scope it to that spoke's alerts. A label or annotation `sync.source=<cluster-name>/<silence-id>` is added to the hub silence comment for traceability and to prevent conflicts with user-created hub silences.
- **Update**: if a spoke silence's `endsAt` is extended or matchers change, the controller expires the old hub replica and creates a new one.
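
For illustration, the hub replica of a spoke silence could look like the following Alertmanager v2 silence object (shown as YAML; the cluster name, timestamps, and silence ID are hypothetical):

```yaml
# Hypothetical hub-side replica of a spoke silence (body for POST /api/v2/silences).
matchers:
  - name: alertname
    value: KubePodCrashLooping       # copied from the original spoke silence
    isRegex: false
    isEqual: true
  - name: managed_cluster            # added by the controller to scope the replica
    value: spoke-cluster-1
    isRegex: false
    isEqual: true
startsAt: "2026-03-11T13:00:00Z"
endsAt: "2026-03-11T17:00:00Z"
createdBy: silence-sync-controller
comment: "sync.source=spoke-cluster-1/9f3c2a10-0000-4000-8000-000000000000"   # traceability marker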

Note that a request to update a silence means "expire the silence" then "create a new silence". Similarly, deleting a silence = expiring it.


For MVP, the UI focuses on the real-time alerts page (hub AM). Historical alert views are a future enhancement that depends on the `alerts_effective_*` metric being deployed and federated.

### Silence Sync Controller

I've got concerns about the whole approach around silences. Silences in a "single Alertmanager cluster" situation are replicated using an approach which favors availability over consistency (using Conflict-free replicated data types under the hood). It means that we have no real guarantee that 2 Alertmanager instances in the same spoke have a consistent state for silences.


### Hub Rule Management

Hub alerting rules are evaluated by MCOA ThanosRuler over federated data from hub Thanos. ThanosRuler uses ConfigMap-based rule files, not PrometheusRule CRDs.

IMHO the first action should be to move away from configmap rules and adopt PrometheusRule to have a consistent approach between single cluster and multi cluster.

Hub ThanosRuler has no ARC (AlertRelabelConfig) pipeline. The disable mechanism differs from single-cluster:
- On spokes: ARC `action: Drop` prevents the alert from firing while keeping the rule definition visible in the API.
- On hub (MVP): no per-rule disable. Removing a rule from `thanos-ruler-custom-rules` deletes it. Users can silence individual alerts via the hub Alertmanager as a workaround. Default hub rules cannot be modified — use silences.
- On hub (future CRD): setting `spec.enabled: false` on the CRD removes the rule from the generated ConfigMap. ThanosRuler stops evaluating it. The rule definition remains visible in the API via the CRD.
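
For context, a custom hub rule in the MVP lives in the `thanos-ruler-custom-rules` ConfigMap, roughly as sketched below. The rule content is illustrative; the data key name follows ACM's documented convention, but verify it against your ACM version:

```yaml
# Sketch of a custom hub rule ConfigMap; key name per ACM convention (verify for your version).
apiVersion: v1
kind: ConfigMap
metadata:
  name: thanos-ruler-custom-rules
  namespace: open-cluster-management-observability
data:
  custom_rules.yaml: |
    groups:
      - name: cluster-health
        rules:
          - alert: ClusterMemoryLow          # illustrative rule
            expr: sum(node_memory_MemAvailable_bytes) by (cluster) / sum(node_memory_MemTotal_bytes) by (cluster) < 0.1
            for: 10m
            labels:
              severity: warning
```

Deleting the `cluster-health` group from this ConfigMap is the only MVP way to stop the rule, which is the per-rule-disable gap described above.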

There's no need for a spec field. If there's a need for rule disablement (and that's a big "if" for me), it should work from the resource's labels.


Hub Alertmanager receives alerts from all spoke clusters (via `additionalAlertmanagerConfigs`) and from ThanosRuler. By default, hub AM is configured with a `null` receiver — it accepts and stores alert state but does not send notifications. However, users can customize the hub AM configuration to add real notification receivers (Slack, PagerDuty, email, webhooks, etc.) and routing rules.

This makes hub AM a potential **centralized notification hub** for the fleet: instead of configuring receivers on each individual spoke cluster, users can configure them once on hub AM and receive notifications about alerts from all managed clusters in a single place. The `managed_cluster` label on spoke alerts enables routing by cluster (e.g., production clusters to PagerDuty, dev clusters to Slack).
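
The routing described above maps onto standard Alertmanager configuration. A minimal sketch, with hypothetical receiver names and cluster-name patterns:

```yaml
# Minimal sketch of hub Alertmanager routing by managed_cluster (names are hypothetical).
route:
  receiver: default-null             # default: accept and store state, send nothing
  routes:
    - matchers:
        - managed_cluster =~ "prod-.*"
      receiver: pagerduty-fleet
    - matchers:
        - managed_cluster =~ "dev-.*"
      receiver: slack-dev
receivers:
  - name: default-null
  - name: pagerduty-fleet
    pagerduty_configs:
      - routing_key: <pagerduty-integration-key>   # placeholder
  - name: slack-dev
    slack_configs:
      - api_url: https://hooks.slack.com/services/EXAMPLE   # placeholder webhook
        channel: "#fleet-alerts"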

Why is it highlighted as "potential"? Isn't this a configuration already supported today for ACM users?

- Hub AM can serve as a centralized notification hub for spoke alerts. Users can configure receivers (Slack, PagerDuty, email) on hub AM and route notifications by `managed_cluster` label — enabling fleet-wide notification management from a single configuration point instead of configuring receivers on each spoke individually.
- The hub AM config Secret uses `skip-creation-if-exist: "true"`, so user customizations are preserved across operator reconciliation.
- Future UI improvements could include managing hub AM receivers and routes from the console, multi‑cluster routing by cluster labels (region, team), notifications by impact group and component, and team‑scoped subscriptions honoring RBAC.
- The silence sync controller is essential for notification consistency: spoke silences must be replicated to hub AM so that both spoke-local and hub-centralized notifications are suppressed for silenced alerts.

This assumes that users configure both Alertmanagers. For context, we added the possibility to disable Alertmanager in the spoke clusters at the request of ACM a long time ago, so that alert notifications would be managed only at the hub level.

Signed-off-by: Shirly Radco <sradco@redhat.com>
@sradco sradco force-pushed the multi_cluster_alert_managment_enhancment branch from 61e639b to 35ff2f6 Compare March 18, 2026 11:40

openshift-ci bot commented Mar 18, 2026

@sradco: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/markdownlint
Commit: 35ff2f6
Required: true
Rerun command: /test markdownlint

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
