
Commit 177c558

feat: multiple Patroni primaries alert (#1460)
* feat: multiple Patroni primaries alert
  Signed-off-by: Deezzir <yurii.kondrakov@canonical.com>
* fix: add job_name to patroni scrape config
  Signed-off-by: Deezzir <yurii.kondrakov@canonical.com>
* feat: add PatroniPrimaryAndStandbyLeader alert
  Signed-off-by: Deezzir <yurii.kondrakov@canonical.com>
* Revert "fix: add job_name to patroni scrape config"
  This reverts commit 457fca9.
* fix: use juju topology for patroni alerts
  Signed-off-by: Deezzir <yurii.kondrakov@canonical.com>

---------

Signed-off-by: Deezzir <yurii.kondrakov@canonical.com>
1 parent 818e37c commit 177c558

3 files changed: +150 -2 lines changed


docs/reference/alert-rules.md

Lines changed: 2 additions & 0 deletions
@@ -46,6 +46,8 @@ This page contains a markdown version of the alert rules described in the `postg
 | Alert | Severity | Notes |
 |------|----------|-------|
 | PatroniPostgresqlDown | ![critical] | Patroni PostgreSQL instance is down.<br>Check for errors in the Loki logs. |
+| PatroniMultipleLeaders | ![critical] | Patroni cluster has multiple leader nodes.<br>More than one leader node (primary or standby) is detected inside a cluster.<br>This may indicate split-brain; check Patroni/Loki logs and network/quorum state. |
+| PatroniPrimaryAndStandbyLeader | ![critical] | Patroni cluster has both primary and standby leaders.<br>A primary leader and a standby leader are simultaneously detected inside a cluster.<br>Check for errors in the Loki logs. |
 | PatroniHasNoLeader | ![critical] | Patroni instance has no leader node.<br>A leader node (neither primary nor standby) cannot be found inside a cluster.<br>Check for errors in the Loki logs. |
 
 ## `PgbackrestExporter`

src/prometheus_alert_rules/patroni_rules.yaml

Lines changed: 26 additions & 2 deletions
@@ -17,14 +17,38 @@ groups:
             Check for errors in the Loki logs.
             LABELS = {{ $labels }}
 
+      - alert: PatroniMultipleLeaders
+        expr: 'sum by (juju_model,juju_application,juju_model_uuid,scope) (patroni_master) > 1 or sum by (juju_model,juju_application,juju_model_uuid,scope) (patroni_standby_leader) > 1'
+        for: 0m
+        labels:
+          severity: critical
+        annotations:
+          summary: Patroni cluster {{ $labels.scope }} has multiple leader nodes.
+          description: |
+            More than one leader node (primary or standby) is detected inside the cluster {{ $labels.scope }}.
+            Check for errors in the Loki logs.
+            LABELS = {{ $labels }}
+
+      - alert: PatroniPrimaryAndStandbyLeader
+        expr: 'sum by (juju_model,juju_application,juju_model_uuid,scope) (patroni_master) == 1 and sum by (juju_model,juju_application,juju_model_uuid,scope) (patroni_standby_leader) == 1'
+        for: 0m
+        labels:
+          severity: critical
+        annotations:
+          summary: Patroni cluster {{ $labels.scope }} has both primary and standby leaders.
+          description: |
+            A primary leader and a standby leader are simultaneously detected inside the cluster {{ $labels.scope }}.
+            Check for errors in the Loki logs.
+            LABELS = {{ $labels }}
+
       # 2.4.1
       - alert: PatroniHasNoLeader
-        expr: '(max by (scope) (patroni_master) < 1) and (max by (scope) (patroni_standby_leader) < 1)'
+        expr: '(max by (juju_model,juju_application,juju_model_uuid,scope) (patroni_master) < 1) and (max by (juju_model,juju_application,juju_model_uuid,scope) (patroni_standby_leader) < 1)'
         for: 0m
         labels:
           severity: critical
         annotations:
-          summary: Patroni instance {{ $labels.instance }} has no leader node.
+          summary: Patroni instance {{ $labels.instance }} has no leader node.
           description: |
             A leader node (neither primary nor standby) cannot be found inside the cluster {{ $labels.scope }}.
             Check for errors in the Loki logs.
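
Reviewer note: to make the three leader expressions easier to reason about, here is a minimal Python sketch of how they partition the per-cluster states. It is not part of the commit and is not how Prometheus evaluates rules. The helper names `aggregate` and `evaluate` are hypothetical, and the model assumes that a label missing from a series behaves like an empty string when grouping, that `or` acts as a union of the two filtered results, and that `and` keeps only groups present on both sides.

```python
from collections import defaultdict

# Labels the rules in this commit aggregate by.
GROUP_BY = ("juju_model", "juju_application", "juju_model_uuid", "scope")


def aggregate(samples, agg):
    """Group (labels, value) samples by GROUP_BY and reduce each group with agg."""
    groups = defaultdict(list)
    for labels, value in samples:
        # Assumption: a missing label is treated as an empty string.
        key = tuple(labels.get(name, "") for name in GROUP_BY)
        groups[key].append(value)
    return {key: agg(values) for key, values in groups.items()}


def evaluate(master, standby_leader):
    """Return {group_key: set of alert names} for the three leader alerts."""
    sum_m, sum_s = aggregate(master, sum), aggregate(standby_leader, sum)
    max_m, max_s = aggregate(master, max), aggregate(standby_leader, max)
    firing = defaultdict(set)

    # PatroniMultipleLeaders: "or" is modeled as a union of both filtered sides.
    for key in sum_m.keys() | sum_s.keys():
        if sum_m.get(key, 0) > 1 or sum_s.get(key, 0) > 1:
            firing[key].add("PatroniMultipleLeaders")

    # "and" is modeled as an intersection: the group must exist on both sides.
    for key in sum_m.keys() & sum_s.keys():
        if sum_m[key] == 1 and sum_s[key] == 1:
            firing[key].add("PatroniPrimaryAndStandbyLeader")
        if max_m[key] < 1 and max_s[key] < 1:
            firing[key].add("PatroniHasNoLeader")

    return dict(firing)


# Two primaries in one scope, mirroring the "two masters" test case in this commit.
two_primaries = [
    ({"scope": "cluster1", "instance": "pg1"}, 1),
    ({"scope": "cluster1", "instance": "pg2"}, 1),
]
no_standby_leader = [
    ({"scope": "cluster1", "instance": "pg1"}, 0),
    ({"scope": "cluster1", "instance": "pg2"}, 0),
]
print(evaluate(two_primaries, no_standby_leader))
```

Under these assumptions the sample input reports only `PatroniMultipleLeaders` for `cluster1`. One primary plus one standby leader is picked up by `PatroniPrimaryAndStandbyLeader` instead, which matches the test case where `PatroniMultipleLeaders` does not fire for master=1 and standby_leader=1.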

tests/alerts/test_patroni_rules.yaml

Lines changed: 122 additions & 0 deletions
@@ -78,3 +78,125 @@ tests:
       - alertname: PatroniHasNoLeader
         eval_time: 1m
         exp_alerts: []
+
+  - name: PatroniMultipleLeaders does not fire if master=1 and standby_leader=0
+    interval: 1m
+    input_series:
+      - series: 'patroni_master{scope="cluster1"}'
+        values: '1'
+      - series: 'patroni_standby_leader{scope="cluster1"}'
+        values: '0'
+    alert_rule_test:
+      - alertname: PatroniMultipleLeaders
+        eval_time: 1m
+        exp_alerts: []
+
+  - name: PatroniMultipleLeaders does not fire if master=0 and standby_leader=1
+    interval: 1m
+    input_series:
+      - series: 'patroni_master{scope="cluster1"}'
+        values: '0'
+      - series: 'patroni_standby_leader{scope="cluster1"}'
+        values: '1'
+    alert_rule_test:
+      - alertname: PatroniMultipleLeaders
+        eval_time: 1m
+        exp_alerts: []
+
+  - name: PatroniMultipleLeaders does not fire if master=1 and standby_leader=1
+    interval: 1m
+    input_series:
+      - series: 'patroni_master{scope="cluster1"}'
+        values: '1'
+      - series: 'patroni_standby_leader{scope="cluster1"}'
+        values: '1'
+    alert_rule_test:
+      - alertname: PatroniMultipleLeaders
+        eval_time: 1m
+        exp_alerts: []
+
+  - name: PatroniMultipleLeaders fires if two masters exist in one scope
+    interval: 1m
+    input_series:
+      - series: 'patroni_master{scope="cluster1",instance="pg1"}'
+        values: '1'
+      - series: 'patroni_master{scope="cluster1",instance="pg2"}'
+        values: '1'
+      - series: 'patroni_standby_leader{scope="cluster1",instance="pg1"}'
+        values: '0'
+      - series: 'patroni_standby_leader{scope="cluster1",instance="pg2"}'
+        values: '0'
+    alert_rule_test:
+      - alertname: PatroniMultipleLeaders
+        eval_time: 0m
+        exp_alerts:
+          - exp_labels:
+              alertname: PatroniMultipleLeaders
+              severity: critical
+              scope: cluster1
+            exp_annotations:
+              summary: Patroni cluster cluster1 has multiple leader nodes.
+              description: |
+                More than one leader node (primary or standby) is detected inside the cluster cluster1.
+                Check for errors in the Loki logs.
+                LABELS = map[scope:cluster1]
+
+  - name: PatroniMultipleLeaders fires if two standby leaders exist in one scope
+    interval: 1m
+    input_series:
+      - series: 'patroni_master{scope="cluster1",instance="pg1"}'
+        values: '0'
+      - series: 'patroni_master{scope="cluster1",instance="pg2"}'
+        values: '0'
+      - series: 'patroni_standby_leader{scope="cluster1",instance="pg1"}'
+        values: '1'
+      - series: 'patroni_standby_leader{scope="cluster1",instance="pg2"}'
+        values: '1'
+    alert_rule_test:
+      - alertname: PatroniMultipleLeaders
+        eval_time: 0m
+        exp_alerts:
+          - exp_labels:
+              alertname: PatroniMultipleLeaders
+              severity: critical
+              scope: cluster1
+            exp_annotations:
+              summary: Patroni cluster cluster1 has multiple leader nodes.
+              description: |
+                More than one leader node (primary or standby) is detected inside the cluster cluster1.
+                Check for errors in the Loki logs.
+                LABELS = map[scope:cluster1]
+
+  - name: PatroniPrimaryAndStandbyLeader does not fire if master=1 and standby_leader=0
+    interval: 1m
+    input_series:
+      - series: 'patroni_master{scope="cluster1"}'
+        values: '1'
+      - series: 'patroni_standby_leader{scope="cluster1"}'
+        values: '0'
+    alert_rule_test:
+      - alertname: PatroniPrimaryAndStandbyLeader
+        eval_time: 1m
+        exp_alerts: []
+
+  - name: PatroniPrimaryAndStandbyLeader fires if master=1 and standby_leader=1
+    interval: 1m
+    input_series:
+      - series: 'patroni_master{scope="cluster1"}'
+        values: '1'
+      - series: 'patroni_standby_leader{scope="cluster1"}'
+        values: '1'
+    alert_rule_test:
+      - alertname: PatroniPrimaryAndStandbyLeader
+        eval_time: 0m
+        exp_alerts:
+          - exp_labels:
+              alertname: PatroniPrimaryAndStandbyLeader
+              severity: critical
+              scope: cluster1
+            exp_annotations:
+              summary: Patroni cluster cluster1 has both primary and standby leaders.
+              description: |
+                A primary leader and a standby leader are simultaneously detected inside the cluster cluster1.
+                Check for errors in the Loki logs.
+                LABELS = map[scope:cluster1]
