Commit 00f982e
Merge pull request #187 from vshn/cnpg-runbooks
Add CNPG runbooks
= Alert rule: CNPGClusterHAWarning

== icon:glasses[] Overview

This alert triggers when a CNPG PostgreSQL HA cluster has fewer than 2 streaming replicas connected to the primary.
The cluster is still operational but redundancy is reduced - if the primary fails, recovery depends on fewer standbys than expected.

This alert only fires for clusters that have replicas configured. Standalone single-instance deployments are excluded.

== icon:bug[] Steps for Debugging

Step one:: Identify the affected namespace from the alert. Set it as a variable.
+
[source,bash]
----
INSTANCE_NAMESPACE='<instance-namespace>'
----

Step two:: Check the overall cluster status and how many instances are ready.
+
[source,bash]
----
kubectl get cluster postgresql -n $INSTANCE_NAMESPACE
----

Step three:: Check which pods are running and their roles.
+
[source,bash]
----
kubectl get pods -n $INSTANCE_NAMESPACE -l cnpg.io/cluster=postgresql -L cnpg.io/instanceRole
----

Step four:: Resolve the primary pod name and check the replication status.
+
[source,bash]
----
PRIMARY=$(kubectl get cluster postgresql -n $INSTANCE_NAMESPACE -o jsonpath='{.status.currentPrimary}')
kubectl exec -n $INSTANCE_NAMESPACE $PRIMARY -- psql -U postgres -c "SELECT application_name, state, sync_state FROM pg_stat_replication;"
----

Step five:: Describe and check logs for all replica pods to identify scheduling or crash issues.
+
[source,bash]
----
kubectl describe pods -n $INSTANCE_NAMESPACE -l cnpg.io/cluster=postgresql,cnpg.io/instanceRole=replica
kubectl logs -n $INSTANCE_NAMESPACE -l cnpg.io/cluster=postgresql,cnpg.io/instanceRole=replica --prefix | grep -i "error\|fatal\|replication\|receiver"
----
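
After replicas reconnect, a quick sanity check is to compare the number of streaming replicas against the configured instance count (a sketch; assumes the `PRIMARY` variable from step four is still set):

[source,bash]
----
# Expected standbys = configured instances minus the primary
EXPECTED=$(kubectl get cluster postgresql -n $INSTANCE_NAMESPACE -o jsonpath='{.spec.instances}')
STREAMING=$(kubectl exec -n $INSTANCE_NAMESPACE $PRIMARY -- psql -U postgres -tAc "SELECT count(*) FROM pg_stat_replication WHERE state = 'streaming';")
echo "streaming replicas: $STREAMING (expected: $((EXPECTED - 1)))"
----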

= Alert rule: CNPGClusterInstancesOnSameNode

== icon:glasses[] Overview

This alert triggers when multiple PostgreSQL pods from the same cluster are scheduled on the same Kubernetes node.
This defeats node-level redundancy: if the node fails, multiple instances go down simultaneously.

== icon:bug[] Steps for Debugging

Step one:: Identify the affected namespace from the alert. Set it as a variable.
+
[source,bash]
----
INSTANCE_NAMESPACE='<instance-namespace>'
----

Step two:: List all PostgreSQL pods and their nodes.
+
[source,bash]
----
kubectl get pods -n $INSTANCE_NAMESPACE -l cnpg.io/cluster=postgresql -o wide
----

Step three:: Check whether the nodes have issues (taints, resource pressure) that prevent pods from being scheduled elsewhere. Use the node name from the output above.
+
[source,bash]
----
NODE_NAME='<node-name>'
kubectl describe node $NODE_NAME | grep -E -A5 "Taints|Conditions|Allocatable"
----

Step four:: Check if a pod anti-affinity rule is configured on the cluster.
+
[source,bash]
----
kubectl get cluster postgresql -n $INSTANCE_NAMESPACE -o jsonpath='{.spec.affinity}'
----

Step five:: If the cluster is undersized (fewer nodes than instances), the scheduler has no choice but to co-locate pods.
Check the number of available nodes vs. the number of PostgreSQL instances.
+
[source,bash]
----
kubectl get nodes
kubectl get cluster postgresql -n $INSTANCE_NAMESPACE -o jsonpath='{.spec.instances}'
----

NOTE: CNPG sets pod anti-affinity by default but it is a soft preference (`preferredDuringSchedulingIgnoredDuringExecution`). Co-location can still occur when the cluster has fewer nodes than instances.
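
If co-location must be prevented outright and enough nodes exist, the anti-affinity can be tightened to a hard requirement. A sketch (verify the field names against your CNPG version first, and note that pods stay `Pending` when no compliant node is available):

[source,bash]
----
# Switch the cluster from preferred to required pod anti-affinity
kubectl patch cluster postgresql -n $INSTANCE_NAMESPACE --type merge \
  -p '{"spec":{"affinity":{"enablePodAntiAffinity":true,"podAntiAffinityType":"required"}}}'
----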

= Alert rule: CNPGClusterOffline

== icon:glasses[] Overview

This alert triggers when all CNPG collectors in a namespace report `cnpg_collector_up == 0`, indicating that the cluster is unhealthy or the monitoring exporter has failed on all instances.

WARNING: This alert relies on `cnpg_collector_up` being present. If all pods are killed and the metric is absent entirely, this alert will not fire. In that case, investigate manually.

== icon:bug[] Steps for Debugging

Step one:: Identify the affected namespace from the alert. Set it as a variable.
+
[source,bash]
----
INSTANCE_NAMESPACE='<instance-namespace>'
----

Step two:: Check if any PostgreSQL pods are running.
+
[source,bash]
----
kubectl get pods -n $INSTANCE_NAMESPACE
----

Step three:: Check the CNPG cluster resource for status conditions.
+
[source,bash]
----
kubectl describe cluster postgresql -n $INSTANCE_NAMESPACE
----

Step four:: If pods exist but the collector is not healthy, check the pod logs for errors.
+
[source,bash]
----
kubectl logs -n $INSTANCE_NAMESPACE -l cnpg.io/cluster=postgresql --prefix | grep -i "collector\|error\|fatal"
----

Step five:: If no pods are running, check recent events and node pressure.
+
[source,bash]
----
kubectl get events -n $INSTANCE_NAMESPACE --sort-by='.lastTimestamp'
kubectl describe nodes | grep -A5 "Conditions:"
----

Step six:: If the cluster is hibernated, check for the hibernation annotation.
+
[source,bash]
----
kubectl get cluster postgresql -n $INSTANCE_NAMESPACE -o jsonpath='{.metadata.annotations}'
----
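
The hibernation state can also be read directly (a sketch, assuming the standard `cnpg.io/hibernation` annotation key used by declarative hibernation; empty output means the annotation is not set):

[source,bash]
----
kubectl get cluster postgresql -n $INSTANCE_NAMESPACE -o jsonpath='{.metadata.annotations.cnpg\.io/hibernation}'
----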

= Alert rule: CNPGClusterZoneSpreadWarning

== icon:glasses[] Overview

This alert triggers when PostgreSQL pods are not evenly spread across availability zones - specifically when the number of pods exceeds the number of distinct zones hosting them.
This means at least one zone contains multiple instances, reducing availability zone redundancy.

NOTE: This alert can fire at the same time as `CNPGClusterInstancesOnSameNode`. Zone co-location is a broader condition: pods may be on different nodes but still in the same zone.

== icon:bug[] Steps for Debugging

Step one:: Identify the affected namespace from the alert. Set it as a variable.
+
[source,bash]
----
INSTANCE_NAMESPACE='<instance-namespace>'
----

Step two:: List all PostgreSQL pods and the node they are running on.
+
[source,bash]
----
kubectl get pods -n $INSTANCE_NAMESPACE -l cnpg.io/cluster=postgresql -o wide
----

Step three:: Check which availability zone each node belongs to.
+
[source,bash]
----
kubectl get nodes -L topology.kubernetes.io/zone
----

Step four:: Cross-reference the pod nodes from step two with the zone output from step three to identify which zone has multiple instances.
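+
The cross-reference can be scripted (a sketch; assumes nodes carry the standard `topology.kubernetes.io/zone` label):
+
[source,bash]
----
# Print pod, node, and zone, sorted by zone; duplicate zones indicate co-location
kubectl get pods -n $INSTANCE_NAMESPACE -l cnpg.io/cluster=postgresql \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.nodeName}{"\n"}{end}' |
while read -r POD NODE; do
  ZONE=$(kubectl get node "$NODE" -o jsonpath='{.metadata.labels.topology\.kubernetes\.io/zone}')
  printf '%s\t%s\t%s\n' "$POD" "$NODE" "$ZONE"
done | sort -k3
----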

Step five:: Check how many nodes are available per zone.
+
[source,bash]
----
kubectl get nodes -L topology.kubernetes.io/zone --no-headers | awk '{print $NF}' | sort | uniq -c
----

Step six:: If a zone has only one node and the cluster has more instances than zones, the scheduler cannot spread pods evenly.
Check the number of instances configured.
+
[source,bash]
----
kubectl get cluster postgresql -n $INSTANCE_NAMESPACE -o jsonpath='{.spec.instances}'
----

NOTE: CNPG uses a soft topology spread constraint by default. Uneven spread occurs when zone capacity is insufficient to host one instance per zone.

= Alert rule: CNPGPostgreSQLArchiveFailing

== icon:glasses[] Overview

This alert triggers when WAL archiving has been failing - specifically when the last failed archive attempt is more recent than the last successful one.
Failing archiving puts point-in-time recovery (PITR) and backups at risk.
CNPG will not recycle WAL segments that have not been successfully archived - instead they accumulate in `pg_wal/`, which can fill up the disk and crash the instance.

== icon:bug[] Steps for Debugging

Step one:: Identify the affected namespace from the alert. Set it as a variable.
+
[source,bash]
----
INSTANCE_NAMESPACE='<instance-namespace>'
----

Step two:: Find the primary pod and check the archive status.
+
[source,bash]
----
PRIMARY=$(kubectl get cluster postgresql -n $INSTANCE_NAMESPACE -o jsonpath='{.status.currentPrimary}')
kubectl exec -n $INSTANCE_NAMESPACE $PRIMARY -- psql -U postgres -c "SELECT archived_count, last_archived_wal, last_archived_time, failed_count, last_failed_wal, last_failed_time FROM pg_stat_archiver;"
----

Step three:: Check the primary pod logs for archiving errors. The Barman plugin logs the archive command output there.
+
[source,bash]
----
kubectl logs $PRIMARY -n $INSTANCE_NAMESPACE | grep -i "archive\|barman\|error\|fatal"
----

Step four:: Check the object store secret and the `ScheduledBackup` configuration.
+
[source,bash]
----
kubectl get secret -n $INSTANCE_NAMESPACE | grep backup
kubectl get scheduledbackup -n $INSTANCE_NAMESPACE
kubectl describe scheduledbackup -n $INSTANCE_NAMESPACE
----

Step five:: Check the `Backup` resources for recent status.
+
[source,bash]
----
kubectl get backup.postgresql.cnpg.io -n $INSTANCE_NAMESPACE --sort-by='.metadata.creationTimestamp'
----
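
Because unarchived WAL segments accumulate, it's worth checking disk pressure on the primary while archiving is failing (a sketch; the paths assume the CNPG default data volume layout and may need adjusting):

[source,bash]
----
# WAL size and free space on the data volume
kubectl exec -n $INSTANCE_NAMESPACE $PRIMARY -- du -sh /var/lib/postgresql/data/pgdata/pg_wal
kubectl exec -n $INSTANCE_NAMESPACE $PRIMARY -- df -h /var/lib/postgresql/data
----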

= Alert rule: CNPGPostgreSQLFencingOn

== icon:glasses[] Overview

This alert triggers when a PostgreSQL instance has fencing enabled (`cnpg_collector_fencing_on == 1`).
A fenced instance is forcibly shut down and isolated - the operator will not restart it until fencing is removed.
Fencing is always applied manually and always requires manual removal.

== icon:bug[] Steps for Debugging

Step one:: Identify the affected namespace from the alert. Set it as a variable.
+
[source,bash]
----
INSTANCE_NAMESPACE='<instance-namespace>'
----

Step two:: Check which instance is fenced.
+
[source,bash]
----
kubectl get cluster postgresql -n $INSTANCE_NAMESPACE -o jsonpath='{.metadata.annotations}'
----
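+
The fenced instances can also be read directly (a sketch, assuming the standard `cnpg.io/fencedInstances` annotation key; `["*"]` means the whole cluster is fenced):
+
[source,bash]
----
kubectl get cluster postgresql -n $INSTANCE_NAMESPACE -o jsonpath='{.metadata.annotations.cnpg\.io/fencedInstances}'
----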

Step three:: If fencing was applied intentionally (for maintenance or inspection), complete the required work and then unfence.

Step four:: Remove fencing using the CNPG kubectl plugin.
+
[source,bash]
----
# Unfence a specific instance
kubectl cnpg fencing off postgresql <instance-name> -n $INSTANCE_NAMESPACE

# Unfence all instances in the cluster
kubectl cnpg fencing off postgresql "*" -n $INSTANCE_NAMESPACE
----

Step five:: Verify the cluster recovers and all instances are running.
+
[source,bash]
----
kubectl get pods -n $INSTANCE_NAMESPACE -l cnpg.io/cluster=postgresql
kubectl get cluster postgresql -n $INSTANCE_NAMESPACE
----

= Alert rule: CNPGPostgreSQLManualSwitchoverRequired

== icon:glasses[] Overview

This alert triggers when the CNPG operator has set `cnpg_collector_manual_switchover_required == 1`, indicating that a primary switchover is needed but cannot proceed automatically.
This typically occurs after a maintenance operation or when the `primaryUpdateStrategy` is set to `supervised`.

== icon:bug[] Steps for Debugging

Step one:: Identify the affected namespace from the alert. Set it as a variable.
+
[source,bash]
----
INSTANCE_NAMESPACE='<instance-namespace>'
----

Step two:: Check the cluster status and which instance is the current primary.
+
[source,bash]
----
kubectl get cluster postgresql -n $INSTANCE_NAMESPACE
----

Step three:: Check the cluster conditions for the reason a manual switchover is required.
+
[source,bash]
----
kubectl describe cluster postgresql -n $INSTANCE_NAMESPACE | grep -A10 "Conditions:"
----

Step four:: Perform the switchover using the CNPG kubectl plugin.
+
[source,bash]
----
kubectl cnpg promote postgresql <target-instance> -n $INSTANCE_NAMESPACE
----

NOTE: Replace `<target-instance>` with the name of the replica pod to promote (for example, `postgresql-2`).
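
After the promotion, the plugin's status view gives a consolidated summary of instance roles and replication (a sketch; requires the `cnpg` kubectl plugin already used above):

[source,bash]
----
kubectl cnpg status postgresql -n $INSTANCE_NAMESPACE
----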

= Alert rule: CNPGPostgreSQLNoStreamingReplicas

== icon:glasses[] Overview

This alert triggers when the primary has zero streaming replicas connected, while replicas are expected to exist.
HA is completely broken - if the primary fails now, there is no replica to promote.

== icon:bug[] Steps for Debugging

Step one:: Identify the affected namespace from the alert. Set it as a variable.
+
[source,bash]
----
INSTANCE_NAMESPACE='<instance-namespace>'
----

Step two:: Check which pods are running and their roles.
+
[source,bash]
----
kubectl get pods -n $INSTANCE_NAMESPACE -l cnpg.io/cluster=postgresql -L cnpg.io/instanceRole
----

Step three:: Resolve the primary pod name and check replication status.
+
[source,bash]
----
PRIMARY=$(kubectl get cluster postgresql -n $INSTANCE_NAMESPACE -o jsonpath='{.status.currentPrimary}')
kubectl exec -n $INSTANCE_NAMESPACE $PRIMARY -- psql -U postgres -c "SELECT application_name, state, sync_state FROM pg_stat_replication;"
----

Step four:: Check logs for all replica pods for connection errors.
+
[source,bash]
----
kubectl logs -n $INSTANCE_NAMESPACE -l cnpg.io/cluster=postgresql,cnpg.io/instanceRole=replica --prefix | grep -i "error\|fatal\|replication\|receiver"
----

Step five:: Check the overall cluster status for conditions or events.
+
[source,bash]
----
kubectl describe cluster postgresql -n $INSTANCE_NAMESPACE
kubectl get events -n $INSTANCE_NAMESPACE --sort-by='.lastTimestamp'
----
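
If replicas are up but never reconnect, the replication slots on the primary are worth inspecting (a sketch; assumes the `PRIMARY` variable from step three is still set - an inactive slot for a replica suggests it cannot establish the streaming connection):

[source,bash]
----
kubectl exec -n $INSTANCE_NAMESPACE $PRIMARY -- psql -U postgres -c "SELECT slot_name, slot_type, active, restart_lsn FROM pg_replication_slots;"
----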
