Commit 00f982e
Merge pull request #187 from vshn/cnpg-runbooks
Add CNPG runbooks
= Alert rule: CNPGClusterHAWarning

== icon:glasses[] Overview

This alert triggers when a CNPG PostgreSQL HA cluster has fewer than 2 streaming replicas connected to the primary.
The cluster is still operational but redundancy is reduced - if the primary fails, recovery depends on fewer standbys than expected.

This alert only fires for clusters that have replicas configured. Standalone single-instance deployments are excluded.

== icon:bug[] Steps for Debugging

Step one:: Identify the affected namespace from the alert. Set it as a variable.
+
[source,bash]
----
INSTANCE_NAMESPACE='<instance-namespace>'
----

Step two:: Check the overall cluster status and how many instances are ready.
+
[source,bash]
----
kubectl get cluster postgresql -n $INSTANCE_NAMESPACE
----

Step three:: Check which pods are running and their roles.
+
[source,bash]
----
kubectl get pods -n $INSTANCE_NAMESPACE -l cnpg.io/cluster=postgresql -L cnpg.io/instanceRole
----

Step four:: Resolve the primary pod name and check the replication status.
+
[source,bash]
----
PRIMARY=$(kubectl get cluster postgresql -n $INSTANCE_NAMESPACE -o jsonpath='{.status.currentPrimary}')
kubectl exec -n $INSTANCE_NAMESPACE $PRIMARY -- psql -U postgres -c "SELECT application_name, state, sync_state FROM pg_stat_replication;"
----

Step five:: Describe and check logs for all replica pods to identify scheduling or crash issues.
+
[source,bash]
----
kubectl describe pods -n $INSTANCE_NAMESPACE -l cnpg.io/cluster=postgresql,cnpg.io/instanceRole=replica
kubectl logs -n $INSTANCE_NAMESPACE -l cnpg.io/cluster=postgresql,cnpg.io/instanceRole=replica --prefix | grep -i "error\|fatal\|replication\|receiver"
----
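
After replicas reconnect, a quick sanity check is to compare the number of streaming replicas against the configured instance count (a sketch; assumes the `PRIMARY` variable from step four is still set):

[source,bash]
----
# Expected standbys = configured instances minus the primary
EXPECTED=$(kubectl get cluster postgresql -n $INSTANCE_NAMESPACE -o jsonpath='{.spec.instances}')
STREAMING=$(kubectl exec -n $INSTANCE_NAMESPACE $PRIMARY -- psql -U postgres -tAc "SELECT count(*) FROM pg_stat_replication WHERE state = 'streaming';")
echo "streaming replicas: $STREAMING (expected: $((EXPECTED - 1)))"
----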

= Alert rule: CNPGClusterInstancesOnSameNode

== icon:glasses[] Overview

This alert triggers when multiple PostgreSQL pods from the same cluster are scheduled on the same Kubernetes node.
This defeats node-level redundancy: if the node fails, multiple instances go down simultaneously.

== icon:bug[] Steps for Debugging

Step one:: Identify the affected namespace from the alert. Set it as a variable.
+
[source,bash]
----
INSTANCE_NAMESPACE='<instance-namespace>'
----

Step two:: List all PostgreSQL pods and their nodes.
+
[source,bash]
----
kubectl get pods -n $INSTANCE_NAMESPACE -l cnpg.io/cluster=postgresql -o wide
----

Step three:: Check whether the nodes have issues (taints, resource pressure) that prevent pods from being scheduled elsewhere. Use the node name from the output above.
+
[source,bash]
----
NODE_NAME='<node-name>'
kubectl describe node $NODE_NAME | grep -E -A5 "Taints|Conditions|Allocatable"
----

Step four:: Check if a pod anti-affinity rule is configured on the cluster.
+
[source,bash]
----
kubectl get cluster postgresql -n $INSTANCE_NAMESPACE -o jsonpath='{.spec.affinity}'
----

Step five:: If the cluster is undersized (fewer nodes than instances), the scheduler has no choice but to co-locate pods.
Check the number of available nodes vs. the number of PostgreSQL instances.
+
[source,bash]
----
kubectl get nodes
kubectl get cluster postgresql -n $INSTANCE_NAMESPACE -o jsonpath='{.spec.instances}'
----

NOTE: CNPG sets pod anti-affinity by default but it is a soft preference (`preferredDuringSchedulingIgnoredDuringExecution`). Co-location can still occur when the cluster has fewer nodes than instances.
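
If co-location must be prevented outright and enough nodes exist, the anti-affinity can be tightened to a hard requirement. A sketch (verify the field names against your CNPG version first, and note that pods stay `Pending` when no compliant node is available):

[source,bash]
----
# Switch the cluster from preferred to required pod anti-affinity
kubectl patch cluster postgresql -n $INSTANCE_NAMESPACE --type merge \
  -p '{"spec":{"affinity":{"enablePodAntiAffinity":true,"podAntiAffinityType":"required"}}}'
----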

= Alert rule: CNPGClusterOffline

== icon:glasses[] Overview

This alert triggers when all CNPG collectors in a namespace report `cnpg_collector_up == 0`, indicating that the cluster is unhealthy or the monitoring exporter has failed on all instances.

WARNING: This alert relies on `cnpg_collector_up` being present. If all pods are killed and the metric is absent entirely, this alert will not fire. In that case, investigate manually.

== icon:bug[] Steps for Debugging

Step one:: Identify the affected namespace from the alert. Set it as a variable.
+
[source,bash]
----
INSTANCE_NAMESPACE='<instance-namespace>'
----

Step two:: Check if any PostgreSQL pods are running.
+
[source,bash]
----
kubectl get pods -n $INSTANCE_NAMESPACE
----

Step three:: Check the CNPG cluster resource for status conditions.
+
[source,bash]
----
kubectl describe cluster postgresql -n $INSTANCE_NAMESPACE
----

Step four:: If pods exist but the collector is not healthy, check the pod logs for errors.
+
[source,bash]
----
kubectl logs -n $INSTANCE_NAMESPACE -l cnpg.io/cluster=postgresql --prefix | grep -i "collector\|error\|fatal"
----

Step five:: If no pods are running, check recent events and node pressure.
+
[source,bash]
----
kubectl get events -n $INSTANCE_NAMESPACE --sort-by='.lastTimestamp'
kubectl describe nodes | grep -A5 "Conditions:"
----

Step six:: If the cluster is hibernated, check for the hibernation annotation.
+
[source,bash]
----
kubectl get cluster postgresql -n $INSTANCE_NAMESPACE -o jsonpath='{.metadata.annotations}'
----
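
The hibernation state can also be read directly (a sketch, assuming the standard `cnpg.io/hibernation` annotation key used by declarative hibernation; empty output means the annotation is not set):

[source,bash]
----
kubectl get cluster postgresql -n $INSTANCE_NAMESPACE -o jsonpath='{.metadata.annotations.cnpg\.io/hibernation}'
----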

= Alert rule: CNPGClusterZoneSpreadWarning

== icon:glasses[] Overview

This alert triggers when PostgreSQL pods are not evenly spread across availability zones - specifically when the number of pods exceeds the number of distinct zones hosting them.
This means at least one zone contains multiple instances, reducing availability zone redundancy.

NOTE: This alert can fire at the same time as `CNPGClusterInstancesOnSameNode`. Zone co-location is a broader condition: pods may be on different nodes but still in the same zone.

== icon:bug[] Steps for Debugging

Step one:: Identify the affected namespace from the alert. Set it as a variable.
+
[source,bash]
----
INSTANCE_NAMESPACE='<instance-namespace>'
----

Step two:: List all PostgreSQL pods and the node they are running on.
+
[source,bash]
----
kubectl get pods -n $INSTANCE_NAMESPACE -l cnpg.io/cluster=postgresql -o wide
----

Step three:: Check which availability zone each node belongs to.
+
[source,bash]
----
kubectl get nodes -L topology.kubernetes.io/zone
----

Step four:: Cross-reference the pod nodes from step two with the zone output from step three to identify which zone has multiple instances.
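+
The cross-reference can be scripted (a sketch; assumes nodes carry the standard `topology.kubernetes.io/zone` label):
+
[source,bash]
----
# Print pod, node, and zone, sorted by zone; duplicate zones indicate co-location
kubectl get pods -n $INSTANCE_NAMESPACE -l cnpg.io/cluster=postgresql \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.nodeName}{"\n"}{end}' |
while read -r POD NODE; do
  ZONE=$(kubectl get node "$NODE" -o jsonpath='{.metadata.labels.topology\.kubernetes\.io/zone}')
  printf '%s\t%s\t%s\n' "$POD" "$NODE" "$ZONE"
done | sort -k3
----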

Step five:: Check how many nodes are available per zone.
+
[source,bash]
----
kubectl get nodes -L topology.kubernetes.io/zone --no-headers | awk '{print $NF}' | sort | uniq -c
----

Step six:: If a zone has only one node and the cluster has more instances than zones, the scheduler cannot spread pods evenly.
Check the number of instances configured.
+
[source,bash]
----
kubectl get cluster postgresql -n $INSTANCE_NAMESPACE -o jsonpath='{.spec.instances}'
----

NOTE: CNPG uses a soft topology spread constraint by default. Uneven spread occurs when zone capacity is insufficient to host one instance per zone.

= Alert rule: CNPGPostgreSQLArchiveFailing

== icon:glasses[] Overview

This alert triggers when WAL archiving has been failing - specifically when the last failed archive attempt is more recent than the last successful one.
Failing archiving puts point-in-time recovery (PITR) and backups at risk.
CNPG will not recycle WAL segments that have not been successfully archived - instead they accumulate in `pg_wal/`, which can fill up the disk and crash the instance.

== icon:bug[] Steps for Debugging

Step one:: Identify the affected namespace from the alert. Set it as a variable.
+
[source,bash]
----
INSTANCE_NAMESPACE='<instance-namespace>'
----

Step two:: Find the primary pod and check the archive status.
+
[source,bash]
----
PRIMARY=$(kubectl get cluster postgresql -n $INSTANCE_NAMESPACE -o jsonpath='{.status.currentPrimary}')
kubectl exec -n $INSTANCE_NAMESPACE $PRIMARY -- psql -U postgres -c "SELECT archived_count, last_archived_wal, last_archived_time, failed_count, last_failed_wal, last_failed_time FROM pg_stat_archiver;"
----

Step three:: Check the primary pod logs for archiving errors. The Barman plugin logs the archive command output there.
+
[source,bash]
----
kubectl logs $PRIMARY -n $INSTANCE_NAMESPACE | grep -i "archive\|barman\|error\|fatal"
----

Step four:: Check the object store secret and the `ScheduledBackup` configuration.
+
[source,bash]
----
kubectl get secret -n $INSTANCE_NAMESPACE | grep backup
kubectl get scheduledbackup -n $INSTANCE_NAMESPACE
kubectl describe scheduledbackup -n $INSTANCE_NAMESPACE
----

Step five:: Check the `Backup` resources for recent status.
+
[source,bash]
----
kubectl get backup.postgresql.cnpg.io -n $INSTANCE_NAMESPACE --sort-by='.metadata.creationTimestamp'
----
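
Because unarchived WAL segments accumulate, it's worth checking disk pressure on the primary while archiving is failing (a sketch; the paths assume the CNPG default data volume layout and may need adjusting):

[source,bash]
----
# WAL size and free space on the data volume
kubectl exec -n $INSTANCE_NAMESPACE $PRIMARY -- du -sh /var/lib/postgresql/data/pgdata/pg_wal
kubectl exec -n $INSTANCE_NAMESPACE $PRIMARY -- df -h /var/lib/postgresql/data
----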

= Alert rule: CNPGPostgreSQLFencingOn

== icon:glasses[] Overview

This alert triggers when a PostgreSQL instance has fencing enabled (`cnpg_collector_fencing_on == 1`).
A fenced instance is forcibly shut down and isolated - the operator will not restart it until fencing is removed.
Fencing is always applied manually and always requires manual removal.

== icon:bug[] Steps for Debugging

Step one:: Identify the affected namespace from the alert. Set it as a variable.
+
[source,bash]
----
INSTANCE_NAMESPACE='<instance-namespace>'
----

Step two:: Check which instance is fenced.
+
[source,bash]
----
kubectl get cluster postgresql -n $INSTANCE_NAMESPACE -o jsonpath='{.metadata.annotations}'
----
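+
The fenced instances can also be read directly (a sketch, assuming the standard `cnpg.io/fencedInstances` annotation key; `["*"]` means the whole cluster is fenced):
+
[source,bash]
----
kubectl get cluster postgresql -n $INSTANCE_NAMESPACE -o jsonpath='{.metadata.annotations.cnpg\.io/fencedInstances}'
----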

Step three:: If fencing was applied intentionally (for maintenance or inspection), complete the required work and then unfence.

Step four:: Remove fencing using the CNPG kubectl plugin.
+
[source,bash]
----
# Unfence a specific instance
kubectl cnpg fencing off postgresql <instance-name> -n $INSTANCE_NAMESPACE

# Unfence all instances in the cluster
kubectl cnpg fencing off postgresql "*" -n $INSTANCE_NAMESPACE
----

Step five:: Verify the cluster recovers and all instances are running.
+
[source,bash]
----
kubectl get pods -n $INSTANCE_NAMESPACE -l cnpg.io/cluster=postgresql
kubectl get cluster postgresql -n $INSTANCE_NAMESPACE
----

= Alert rule: CNPGPostgreSQLManualSwitchoverRequired

== icon:glasses[] Overview

This alert triggers when the CNPG operator has set `cnpg_collector_manual_switchover_required == 1`, indicating that a primary switchover is needed but cannot proceed automatically.
This typically occurs after a maintenance operation or when the `primaryUpdateStrategy` is set to `supervised`.

== icon:bug[] Steps for Debugging

Step one:: Identify the affected namespace from the alert. Set it as a variable.
+
[source,bash]
----
INSTANCE_NAMESPACE='<instance-namespace>'
----

Step two:: Check the cluster status and which instance is the current primary.
+
[source,bash]
----
kubectl get cluster postgresql -n $INSTANCE_NAMESPACE
----

Step three:: Check the cluster conditions for the reason a manual switchover is required.
+
[source,bash]
----
kubectl describe cluster postgresql -n $INSTANCE_NAMESPACE | grep -A10 "Conditions:"
----

Step four:: Perform the switchover using the CNPG kubectl plugin.
+
[source,bash]
----
kubectl cnpg promote postgresql <target-instance> -n $INSTANCE_NAMESPACE
----

NOTE: Replace `<target-instance>` with the name of the replica pod to promote (for example, `postgresql-2`).
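
After the promotion, the plugin's status view gives a consolidated summary of instance roles and replication (a sketch; requires the `cnpg` kubectl plugin already used above):

[source,bash]
----
kubectl cnpg status postgresql -n $INSTANCE_NAMESPACE
----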

= Alert rule: CNPGPostgreSQLNoStreamingReplicas

== icon:glasses[] Overview

This alert triggers when the primary has zero streaming replicas connected, while replicas are expected to exist.
HA is completely broken - if the primary fails now, there is no replica to promote.

== icon:bug[] Steps for Debugging

Step one:: Identify the affected namespace from the alert. Set it as a variable.
+
[source,bash]
----
INSTANCE_NAMESPACE='<instance-namespace>'
----

Step two:: Check which pods are running and their roles.
+
[source,bash]
----
kubectl get pods -n $INSTANCE_NAMESPACE -l cnpg.io/cluster=postgresql -L cnpg.io/instanceRole
----

Step three:: Resolve the primary pod name and check replication status.
+
[source,bash]
----
PRIMARY=$(kubectl get cluster postgresql -n $INSTANCE_NAMESPACE -o jsonpath='{.status.currentPrimary}')
kubectl exec -n $INSTANCE_NAMESPACE $PRIMARY -- psql -U postgres -c "SELECT application_name, state, sync_state FROM pg_stat_replication;"
----

Step four:: Check logs for all replica pods for connection errors.
+
[source,bash]
----
kubectl logs -n $INSTANCE_NAMESPACE -l cnpg.io/cluster=postgresql,cnpg.io/instanceRole=replica --prefix | grep -i "error\|fatal\|replication\|receiver"
----

Step five:: Check the overall cluster status for conditions or events.
+
[source,bash]
----
kubectl describe cluster postgresql -n $INSTANCE_NAMESPACE
kubectl get events -n $INSTANCE_NAMESPACE --sort-by='.lastTimestamp'
----
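
If replicas are up but never reconnect, the replication slots on the primary are worth inspecting (a sketch; assumes the `PRIMARY` variable from step three is still set - an inactive slot for a replica suggests it cannot establish the streaming connection):

[source,bash]
----
kubectl exec -n $INSTANCE_NAMESPACE $PRIMARY -- psql -U postgres -c "SELECT slot_name, slot_type, active, restart_lsn FROM pg_replication_slots;"
----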
