# CNPGClusterHighPhysicalReplicationLagWarning

## Description

The `CNPGClusterHighPhysicalReplicationLagWarning` alert is triggered when physical replication lag in the CloudNativePG cluster exceeds 1 second.

## Impact

High physical replication lag can cause the cluster replicas to fall out of sync with the primary. Queries to the `-r` and `-ro` endpoints may return stale data, and in the event of a failover, any data that has not yet been replicated from the primary to the replicas may be lost.

## Diagnosis

Check replication status in the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/) or by running:

```bash
kubectl exec --namespace <namespace> --stdin --tty services/<cluster_name>-rw -- psql -c "SELECT * FROM pg_stat_replication;"
```
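
The lag columns in `pg_stat_replication` (available since PostgreSQL 10) give a quick per-replica view of how far each standby is behind. A minimal query, assuming the same service naming as above:

```bash
kubectl exec --namespace <namespace> --stdin --tty services/<cluster_name>-rw -- psql -c \
  "SELECT application_name, state, sync_state, write_lag, flush_lag, replay_lag FROM pg_stat_replication;"
```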

High physical replication lag can be caused by a number of factors, including:

- Network congestion on the node interface

Inspect the network interface statistics using the `Kubernetes Cluster` section of the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).

- High CPU or memory load on primary or replicas

Inspect the CPU and memory usage of the CloudNativePG cluster instances using the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).
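
If the dashboard is not available, a rough picture of current usage can also be obtained with `kubectl top` (this assumes the metrics-server is installed in the Kubernetes cluster):

```bash
kubectl top pods --namespace <namespace>
kubectl top nodes
```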

- Disk I/O bottlenecks on replicas

Inspect the disk I/O statistics using the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).

- Long-running queries

Inspect the `Stat Activity` section of the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).
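
Long-running queries can also be listed directly from `pg_stat_activity`; the sketch below orders active backends by transaction start time (the column selection is illustrative):

```bash
kubectl exec --namespace <namespace> --stdin --tty services/<cluster_name>-rw -- psql -c \
  "SELECT pid, state, now() - xact_start AS xact_runtime, left(query, 60) AS query FROM pg_stat_activity WHERE state <> 'idle' ORDER BY xact_start;"
```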

- Suboptimal PostgreSQL configuration, e.g. too few `max_wal_senders`. Set this to at least the number of cluster instances (the default of 10 is usually sufficient).

Inspect the `PostgreSQL Parameters` section of the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).
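
The current value can also be checked directly on the primary:

```bash
kubectl exec --namespace <namespace> --stdin --tty services/<cluster_name>-rw -- psql -c "SHOW max_wal_senders;"
```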

## Mitigation

- Terminate long-running transactions that generate excessive changes.

```bash
kubectl exec -it services/<cluster_name>-rw --namespace <namespace> -- psql
```
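
From the `psql` session, long-running transactions can be identified and, once you have confirmed it is safe, terminated. A sketch; the 5-minute threshold is only an example:

```sql
-- List transactions that have been running for more than 5 minutes (example threshold).
SELECT pid, now() - xact_start AS xact_runtime, state, left(query, 60) AS query
  FROM pg_stat_activity
 WHERE xact_start < now() - interval '5 minutes'
   AND state <> 'idle';

-- Terminate a specific backend by its pid (placeholder, as with <namespace> above).
SELECT pg_terminate_backend(<pid>);
```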

- Increase the memory and CPU resources of the instances under heavy load. This can be done by setting `cluster.resources.requests` and `cluster.resources.limits` in your Helm values. Set both `requests` and `limits` to the same values to achieve the Guaranteed QoS class. This will require a restart of the CloudNativePG cluster instances and a primary switchover, which will cause a brief service disruption.
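
For example, in the Helm values (the sizes below are placeholders; pick values that match your workload):

```yaml
cluster:
  resources:
    requests:
      cpu: "2"
      memory: 4Gi
    limits:
      cpu: "2"
      memory: 4Gi
```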

- Enable `wal_compression` by setting the `cluster.postgresql.parameters.wal_compression` parameter to `on`. Doing so will reduce the size of the WAL files and can help reduce replication lag in a congested network. Changing `wal_compression` does not require a restart of the CloudNativePG cluster.
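
For example, in the Helm values:

```yaml
cluster:
  postgresql:
    parameters:
      wal_compression: "on"
```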

- Increase IOPS or throughput of the storage used by the cluster to alleviate disk I/O bottlenecks. This requires creating a new storage class with higher IOPS/throughput and rebuilding the cluster instances and their PVCs one by one using the new storage class. This is a slow process that will also affect the cluster's availability.

If you decide to go this route:

1. Start by creating a new storage class. Storage classes are immutable, so you cannot change the storage class of existing Persistent Volume Claims (PVCs).

2. Make sure to only replace one instance at a time to avoid service disruption.

3. Double-check that you are deleting the correct pod.

4. Don't start with the active primary instance. Delete one of the standby replicas first, along with its PVCs:

```bash
kubectl delete --namespace <namespace> pod/<pod-name> pvc/<pod-name> pvc/<pod-name>-wal
```
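
After each deletion, wait for the replacement instance to be recreated on the new storage class and to catch up before moving on to the next one. Assuming the `cnpg` kubectl plugin is installed, the cluster state can be verified with:

```bash
kubectl get pods --namespace <namespace> -l cnpg.io/cluster=<cluster_name>
kubectl cnpg status <cluster_name> --namespace <namespace>
```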

- If the cluster has 9+ instances, ensure that the `max_wal_senders` parameter is set to a value greater than or equal to the total number of instances in the cluster.
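
Assuming the same Helm values layout used for `wal_compression` above, this could look like the following (16 is only an example; use a value at least equal to the number of instances, with some headroom):

```yaml
cluster:
  postgresql:
    parameters:
      max_wal_senders: "16"
```

Note that unlike `wal_compression`, changing `max_wal_senders` requires a restart of the PostgreSQL instances.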