Enhancement proposal for multi-cluster alerts management #1921

sradco wants to merge 1 commit into openshift:master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has not been approved by any approvers. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
@coleenquadros @jacobbaungard @simonpasquier @jgbernalp @jan--f @moadz I would appreciate your review of this proposal for the multi-cluster alerting management UI.
Force-pushed from 7e23e3d to 4749f64.
Inactive enhancement proposals go stale after 28d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Mark the proposal as fresh by commenting `/remove-lifecycle stale`. If this proposal is safe to close now please do so with `/close`.

/lifecycle stale
Stale enhancement proposals rot after 7d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Mark the proposal as fresh by commenting `/remove-lifecycle rotten`. If this proposal is safe to close now please do so with `/close`.

/lifecycle rotten
/remove-lifecycle rotten
Force-pushed from 90d1528 to 53326b6.
Hi @jacobbaungard, please review this enhancement proposal.
Force-pushed from 4e9725c to 61e639b.
simonpasquier left a comment:
Did a quick pass, but the proposal would benefit from being split into different parts because it's nearly impossible to review in its current state. I'd recommend focusing on one part at a time, such as visualization of spoke/hub alerts in the console.

I'd also expect some input from the ACM observability folks about the recommended approach for alert silencing.
> **Hub Alertmanager (transient, in-memory)**
>
> Hub AM holds active alert instances in memory, backed by a local PVC for silences and notification state. It is not a persistent store — resolved alerts are dropped after a configurable grace period (`resolve_timeout: 5m`). Hub AM can answer "what is firing right now?" but cannot answer "what was firing yesterday?" Once an alert resolves and the grace period passes, it is gone from hub AM.
(nit) `resolve_timeout` has nothing to do with storage. It defines the duration added to alerts that arrive without an end timestamp. In practice, alerts emitted by Prometheus/Thanos always have a non-zero end timestamp.
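For reference, `resolve_timeout` lives in Alertmanager's `global` configuration block. A minimal sketch (the `5m` value mirrors the proposal's example; the `null` receiver mirrors the hub AM default described elsewhere in the proposal):

```yaml
# alertmanager.yaml (sketch): resolve_timeout applies only to alerts
# that arrive without an endsAt timestamp; it is not a retention setting.
global:
  resolve_timeout: 5m
route:
  receiver: "null"   # accept and store alert state, send no notifications
receivers:
  - name: "null"
```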
> - **No ARC-applied labels**: The `ALERTS` metric is produced by Prometheus rule evaluation, before ARCs are applied. It lacks `openshift_io_alert_rule_id`, `openshift_io_alert_rule_component`, and `openshift_io_alert_rule_layer`.
> - **No silence awareness**: Silenced alerts still appear as `alertstate="firing"` in the `ALERTS` metric — Prometheus does not know about Alertmanager silences.
> - **`managed_cluster` is stripped**: The metrics-collector strips the `managed_cluster` label during federation. Only the `cluster` label (added by MCOA addon write relabel configs) is available on hub Thanos.
> - **No disabled alert awareness**: ARC-dropped alerts never fire, so they are absent from `ALERTS`, but there is no way to distinguish "never fired" from "disabled by ARC."
Hmm, RelabelConfigs don't change the alerting rules evaluated by Prometheus, only the alerts sent to Alertmanager.
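To illustrate the distinction: an AlertRelabelConfig like the sketch below drops the matching alert from what is forwarded to Alertmanager, while the rule continues to evaluate and the alert still appears in the `ALERTS` metric. The field names follow OpenShift's `monitoring.openshift.io/v1` CRD; the alert name is a placeholder.

```yaml
apiVersion: monitoring.openshift.io/v1
kind: AlertRelabelConfig
metadata:
  name: drop-example-alert
  namespace: openshift-monitoring
spec:
  configs:
    - sourceLabels: [alertname]
      regex: ExampleAlert   # placeholder alert name
      action: Drop          # removes the alert before it reaches Alertmanager;
                            # rule evaluation and the ALERTS metric are unaffected
```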
> **Future iteration — `HubAlertingRule` CRD as single source of truth:**
>
> To address MVP limitations, introduce a `HubAlertingRule` CRD in `open-cluster-management-observability`. All hub rules — both operator defaults and user-created custom rules — are represented as CRDs. A reconciler watches these CRDs and generates the ConfigMaps that ThanosRuler reads.
Why introduce yet another rule CRD? We already have the PrometheusRule CRD.
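For comparison, a hub rule could be expressed with the existing `monitoring.coreos.com/v1` PrometheusRule CRD. This is a sketch: the resource name, label, alert name, and expression are all placeholders.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: hub-custom-rules               # placeholder name
  namespace: open-cluster-management-observability
  labels:
    role: thanos-rule                  # hypothetical label for a ruleSelector
spec:
  groups:
    - name: hub.rules
      rules:
        - alert: ExampleFleetAlert     # placeholder alert
          expr: sum by (cluster) (up == 0) > 0
          for: 10m
          labels:
            severity: warning
```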
> The controller periodically polls each spoke Alertmanager (`GET /api/v2/silences` via ManagedClusterProxy) and reconciles the state on hub AM:
>
> - **Create**: when a new active silence is found on a spoke, the controller creates a replica on hub AM. The replica includes all original matchers plus an additional `managed_cluster=<cluster-name>` matcher to scope it to that spoke's alerts. A label or annotation `sync.source=<cluster-name>/<silence-id>` is added to the hub silence comment for traceability and to prevent conflicts with user-created hub silences.
> - **Update**: if a spoke silence's `endsAt` is extended or matchers change, the controller expires the old hub replica and creates a new one.
Note that a request to update a silence means "expire the silence" then "create a new silence". Similarly, deleting a silence = expiring it.
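The replication step described above is essentially a pure transform on Alertmanager v2 silence objects. A hypothetical sketch in Python; `replicate_silence` and its exact comment format are illustrative, not part of any existing controller:

```python
def replicate_silence(spoke_silence: dict, cluster_name: str) -> dict:
    """Build the hub-AM replica of a spoke silence.

    Adds a managed_cluster matcher scoping the silence to the spoke's
    alerts, and tags the comment with sync.source for traceability.
    """
    return {
        # all original matchers, plus the cluster-scoping matcher
        "matchers": list(spoke_silence["matchers"]) + [
            {"name": "managed_cluster", "value": cluster_name,
             "isRegex": False, "isEqual": True}
        ],
        "startsAt": spoke_silence["startsAt"],
        "endsAt": spoke_silence["endsAt"],
        "createdBy": "silence-sync-controller",
        "comment": (f"sync.source={cluster_name}/{spoke_silence['id']} | "
                    + spoke_silence.get("comment", "")),
    }
```

Per the reviewer's note, an update on the spoke (e.g. an extended `endsAt`) would be handled by expiring the old hub replica and POSTing a fresh replica built by this transform.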
> For MVP, the UI focuses on the real-time alerts page (hub AM). Historical alert views are a future enhancement that depends on the `alerts_effective_*` metric being deployed and federated.
> ### Silence Sync Controller
I've got concerns about the whole approach around silences. Silences in a "single Alertmanager cluster" situation are replicated using an approach that favors availability over consistency (using conflict-free replicated data types under the hood). This means we have no real guarantee that two Alertmanager instances in the same spoke have a consistent state for silences.
> ### Hub Rule Management
> Hub alerting rules are evaluated by MCOA ThanosRuler over federated data from hub Thanos. ThanosRuler uses ConfigMap-based rule files, not PrometheusRule CRDs.
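The ConfigMap-based format referenced above looks roughly like this. A sketch based on ACM's `thanos-ruler-custom-rules` convention; the data key and rule contents are illustrative:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: thanos-ruler-custom-rules
  namespace: open-cluster-management-observability
data:
  custom_rules.yaml: |       # plain Prometheus rule-file syntax, not a CRD
    groups:
      - name: hub-custom-rules
        rules:
          - alert: ClustersOfflineExample   # placeholder alert
            expr: count(up{job="node-exporter"} == 0) by (cluster) > 0
            for: 15m
            labels:
              severity: critical
```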
IMHO the first action should be to move away from ConfigMap rules and adopt PrometheusRule to have a consistent approach between single-cluster and multi-cluster.
> Hub ThanosRuler has no ARC (AlertRelabelConfig) pipeline. The disable mechanism differs from single-cluster:
> - On spokes: ARC `action: Drop` prevents the alert from firing while keeping the rule definition visible in the API.
> - On hub (MVP): no per-rule disable. Removing a rule from `thanos-ruler-custom-rules` deletes it. Users can silence individual alerts via the hub Alertmanager as a workaround. Default hub rules cannot be modified — use silences.
> - On hub (future CRD): setting `spec.enabled: false` on the CRD removes the rule from the generated ConfigMap. ThanosRuler stops evaluating it. The rule definition remains visible in the API via the CRD.
There's no need for a spec field. If there's a need for rule disablement (and that's a big "if" for me), it should work from the resource's labels.
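Label-based disablement, as suggested, could be driven by the ThanosRuler `ruleSelector` (a Prometheus Operator field that selects PrometheusRule objects by label). A sketch; the label key is hypothetical:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ThanosRuler
metadata:
  name: observability-thanos-ruler
  namespace: open-cluster-management-observability
spec:
  ruleSelector:
    matchExpressions:
      # hypothetical label: PrometheusRules carrying it are not loaded
      - key: observability.open-cluster-management.io/disabled
        operator: DoesNotExist
```

Note that a selector like this excludes whole PrometheusRule objects; per-rule disablement would still need a finer-grained convention.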
> Hub Alertmanager receives alerts from all spoke clusters (via `additionalAlertmanagerConfigs`) and from ThanosRuler. By default, hub AM is configured with a `null` receiver — it accepts and stores alert state but does not send notifications. However, users can customize the hub AM configuration to add real notification receivers (Slack, PagerDuty, email, webhooks, etc.) and routing rules.
> This makes hub AM a potential **centralized notification hub** for the fleet: instead of configuring receivers on each individual spoke cluster, users can configure them once on hub AM and receive notifications about alerts from all managed clusters in a single place. The `managed_cluster` label on spoke alerts enables routing by cluster (e.g., production clusters to PagerDuty, dev clusters to Slack).
Why is it highlighted as "potential"? Isn't this a configuration already supported today for ACM users?
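The per-cluster routing under discussion is standard Alertmanager configuration. A minimal sketch; the receiver names and cluster-name patterns are placeholders:

```yaml
route:
  receiver: default-slack                 # placeholder fallback receiver
  routes:
    - matchers:
        - managed_cluster =~ "prod-.*"    # placeholder naming convention
      receiver: pagerduty-prod
    - matchers:
        - managed_cluster =~ "dev-.*"
      receiver: dev-slack
receivers:                                # receiver configs omitted in this sketch
  - name: default-slack
  - name: pagerduty-prod
  - name: dev-slack
```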
> - Hub AM can serve as a centralized notification hub for spoke alerts. Users can configure receivers (Slack, PagerDuty, email) on hub AM and route notifications by `managed_cluster` label — enabling fleet-wide notification management from a single configuration point instead of configuring receivers on each spoke individually.
> - The hub AM config Secret uses `skip-creation-if-exist: "true"`, so user customizations are preserved across operator reconciliation.
> - Future UI improvements could include managing hub AM receivers and routes from the console, multi-cluster routing by cluster labels (region, team), notifications by impact group and component, and team-scoped subscriptions honoring RBAC.
> - The silence sync controller is essential for notification consistency: spoke silences must be replicated to hub AM so that both spoke-local and hub-centralized notifications are suppressed for silenced alerts.
This assumes that users configure both Alertmanagers. For context, we added the possibility to disable Alertmanager in the spoke clusters at ACM's request a long time ago, so that alert notifications would be managed only at the hub level.
Signed-off-by: Shirly Radco <sradco@redhat.com>
Force-pushed from 61e639b to 35ff2f6.
@sradco: The following test failed, say `/retest` to rerun all failed tests.

Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
This PR includes the enhancement proposal for a new Multi-Cluster Alerts Management UI.