@@ -21,7 +21,7 @@ spec:
             description: >-
               Controller {{ $labels.controller }} has had a non-zero reconcile
               error rate for the last 5 minutes (current: {{ $value | humanize }}/s).
-            runbook_url: "https://github.com/numtide/multigres-operator/blob/main/docs/monitoring/runbooks/MultigresClusterReconcileErrors.md"
+            runbook_url: "https://github.com/multigres/multigres-operator/blob/main/docs/monitoring/runbooks/MultigresClusterReconcileErrors.md"

         - alert: MultigresClusterDegraded
           expr: multigres_operator_cluster_info{phase!="Healthy"} == 1
@@ -33,7 +33,7 @@ spec:
             description: >-
               Cluster {{ $labels.name }} in namespace {{ $labels.namespace }}
               has been in phase "{{ $labels.phase }}" for more than 10 minutes.
-            runbook_url: "https://github.com/numtide/multigres-operator/blob/main/docs/monitoring/runbooks/MultigresClusterDegraded.md"
+            runbook_url: "https://github.com/multigres/multigres-operator/blob/main/docs/monitoring/runbooks/MultigresClusterDegraded.md"

         - alert: MultigresCellGatewayUnavailable
           expr: multigres_operator_cell_gateway_replicas{state="ready"} == 0
@@ -46,7 +46,7 @@ spec:
               Cell {{ $labels.cell }} in namespace {{ $labels.namespace }} has
               had zero ready MultiGateway replicas for 5 minutes. Traffic
               cannot be routed to this cell.
-            runbook_url: "https://github.com/numtide/multigres-operator/blob/main/docs/monitoring/runbooks/MultigresCellGatewayUnavailable.md"
+            runbook_url: "https://github.com/multigres/multigres-operator/blob/main/docs/monitoring/runbooks/MultigresCellGatewayUnavailable.md"

         - alert: MultigresShardPoolDegraded
           expr: >-
@@ -62,7 +62,7 @@ spec:
               Pool {{ $labels.pool }} of shard {{ $labels.shard }} in namespace
               {{ $labels.namespace }} has had fewer ready replicas than desired
               for more than 10 minutes.
-            runbook_url: "https://github.com/numtide/multigres-operator/blob/main/docs/monitoring/runbooks/MultigresShardPoolDegraded.md"
+            runbook_url: "https://github.com/multigres/multigres-operator/blob/main/docs/monitoring/runbooks/MultigresShardPoolDegraded.md"

         - alert: MultigresWebhookErrors
           expr: rate(multigres_operator_webhook_request_total{result="error"}[5m]) > 0
@@ -75,7 +75,7 @@ spec:
               The {{ $labels.operation }} webhook for {{ $labels.resource }}
               has had a non-zero error rate for the last 5 minutes
               (current: {{ $value | humanize }}/s).
-            runbook_url: "https://github.com/numtide/multigres-operator/blob/main/docs/monitoring/runbooks/MultigresWebhookErrors.md"
+            runbook_url: "https://github.com/multigres/multigres-operator/blob/main/docs/monitoring/runbooks/MultigresWebhookErrors.md"

         # ── Backup & Drain ────────────────────────────────────────

@@ -90,7 +90,7 @@ spec:
               The most recent backup for shard {{ $labels.shard }} in cluster
               {{ $labels.cluster }} (namespace {{ $labels.namespace }}) is
               {{ $value | humanizeDuration }} old. Check backup job health.
-            runbook_url: "https://github.com/numtide/multigres-operator/blob/main/docs/monitoring/runbooks/MultigresBackupStale.md"
+            runbook_url: "https://github.com/multigres/multigres-operator/blob/main/docs/monitoring/runbooks/MultigresBackupStale.md"

         - alert: MultigresRollingUpdateStuck
           expr: multigres_operator_rolling_update_in_progress == 1
@@ -104,7 +104,7 @@ spec:
               {{ $labels.cell }} (namespace {{ $labels.namespace }}) has been
               performing a rolling update for more than 30 minutes. Check for
               pods stuck in pending or crash-looping state.
-            runbook_url: "https://github.com/numtide/multigres-operator/blob/main/docs/monitoring/runbooks/MultigresRollingUpdateStuck.md"
+            runbook_url: "https://github.com/multigres/multigres-operator/blob/main/docs/monitoring/runbooks/MultigresRollingUpdateStuck.md"

         - alert: MultigresDrainTimeout
           expr: rate(multigres_operator_drain_operations_total{result="timeout"}[5m]) > 0
@@ -117,7 +117,7 @@ spec:
               Shard {{ $labels.shard }} in cluster {{ $labels.cluster }} has
               had drain timeouts for the last 10 minutes. Pods may be stuck
               in a drain state. Check pod logs and topology server connectivity.
-            runbook_url: "https://github.com/numtide/multigres-operator/blob/main/docs/monitoring/runbooks/MultigresDrainTimeout.md"
+            runbook_url: "https://github.com/multigres/multigres-operator/blob/main/docs/monitoring/runbooks/MultigresDrainTimeout.md"

         # ── Latency ───────────────────────────────────────────────

@@ -132,7 +132,7 @@ spec:
               Controller {{ $labels.controller }} p99 reconcile duration has
               exceeded 30 seconds for the last 5 minutes
               (current: {{ $value | humanize }}s).
-            runbook_url: "https://github.com/numtide/multigres-operator/blob/main/docs/monitoring/runbooks/MultigresReconcileSlow.md"
+            runbook_url: "https://github.com/multigres/multigres-operator/blob/main/docs/monitoring/runbooks/MultigresReconcileSlow.md"

         # ── Saturation ────────────────────────────────────────────

@@ -147,4 +147,4 @@ spec:
               The {{ $labels.name }} controller work queue has been deeper than
               50 items for more than 10 minutes. The operator cannot keep up
               with incoming events. Consider increasing MaxConcurrentReconciles.
-            runbook_url: "https://github.com/numtide/multigres-operator/blob/main/docs/monitoring/runbooks/MultigresControllerSaturated.md"
+            runbook_url: "https://github.com/multigres/multigres-operator/blob/main/docs/monitoring/runbooks/MultigresControllerSaturated.md"
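
Beyond fixing the runbook URLs, the alert expressions themselves can be exercised with `promtool test rules`. The following is a minimal sketch of such a unit test for the `MultigresClusterDegraded` alert; the file names, the 10-minute `for:` duration, and the label set are assumptions for illustration, not taken from this PR:

```yaml
# alerts_test.yaml — hypothetical promtool unit test (names are assumptions).
# Assumes the rules above live in multigres-operator-alerts.yaml and that
# MultigresClusterDegraded uses `for: 10m`.
rule_files:
  - multigres-operator-alerts.yaml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Cluster reports a non-Healthy phase continuously for 20 minutes.
      - series: 'multigres_operator_cluster_info{name="demo", namespace="default", phase="Degraded"}'
        values: '1x20'
    alert_rule_test:
      - eval_time: 15m
        alertname: MultigresClusterDegraded
        exp_alerts:
          # exp_labels must list every label on the fired alert; if the rule
          # also sets e.g. a severity label, add it here.
          - exp_labels:
              name: demo
              namespace: default
              phase: Degraded
```

Running `promtool check rules multigres-operator-alerts.yaml` first catches syntax errors; `promtool test rules alerts_test.yaml` then verifies that the alert fires after the `for:` window elapses.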