
Commit 2c85b15

docs(cluster): Improved Runbooks and monitoring (#774)
Signed-off-by: Philippe Noël <philippemnoel@gmail.com>
Signed-off-by: Itay Grudev <itay@verito.digital>
1 parent a7f4522 commit 2c85b15

19 files changed: +1729 -2 lines changed

charts/cluster/README.md

Lines changed: 1 addition & 0 deletions
@@ -168,6 +168,7 @@ refer to the [CloudNativePG Documentation](https://cloudnative-pg.io/documentat
 | cluster.monitoring.customQueriesSecret | list | `[]` | The list of secrets containing the custom queries |
 | cluster.monitoring.disableDefaultQueries | bool | `false` | Whether the default queries should be injected. Set it to true if you don't want to inject default queries into the cluster. |
 | cluster.monitoring.enabled | bool | `false` | Whether to enable monitoring |
+| cluster.monitoring.instrumentation.logicalReplication | bool | `true` | Enable logical replication metrics |
 | cluster.monitoring.podMonitor.enabled | bool | `true` | Whether to enable the PodMonitor |
 | cluster.monitoring.podMonitor.metricRelabelings | list | `[]` | The list of metric relabelings for the PodMonitor. Applied to samples before ingestion. |
 | cluster.monitoring.podMonitor.relabelings | list | `[]` | The list of relabelings for the PodMonitor. Applied to samples before scraping. |
Lines changed: 69 additions & 0 deletions
@@ -0,0 +1,69 @@
# CNPGClusterHighPhysicalReplicationLagWarning

## Description

The `CNPGClusterHighPhysicalReplicationLagWarning` alert is triggered when physical replication lag in the CloudNativePG cluster exceeds 1 second.

## Impact

High physical replication lag causes the cluster replicas to fall out of sync with the primary. Queries to the `-r` and `-ro` endpoints may return stale data, and in the event of a failover, data that has not yet been replicated from the primary to the replicas may be lost.

## Diagnosis

Check replication status in the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/) or by running:

```bash
kubectl exec --namespace <namespace> --stdin --tty services/<cluster_name>-rw -- psql -c "SELECT * FROM pg_stat_replication;"
```
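
For a more targeted view than `SELECT *`, a query along the following lines reports per-replica time and byte lag. This is only a sketch; it uses the standard `pg_stat_replication` columns and the same `<cluster_name>-rw` service as above.

```bash
kubectl exec --namespace <namespace> --stdin --tty services/<cluster_name>-rw -- psql -c \
  "SELECT application_name, state, sync_state,
          write_lag, flush_lag, replay_lag,
          pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
   FROM pg_stat_replication;"
```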

High physical replication lag can be caused by a number of factors, including:

- Network congestion on the node interface

  Inspect the network interface statistics using the `Kubernetes Cluster` section of the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).

- High CPU or memory load on the primary or replicas

  Inspect the CPU and memory usage of the CloudNativePG cluster instances using the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).

- Disk I/O bottlenecks on replicas

  Inspect the disk I/O statistics using the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).

- Long-running queries

  Inspect the `Stat Activity` section of the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).

- Suboptimal PostgreSQL configuration, e.g. too few `max_wal_senders`. Set this parameter to at least the number of cluster instances (the default of 10 is usually sufficient); you can verify the current value with the check shown after this list.

  Inspect the `PostgreSQL Parameters` section of the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).
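
To check the current `max_wal_senders` setting and how many WAL senders are in use, you can run the following (a quick check reusing the same `<cluster_name>-rw` service as above):

```bash
# Configured limit for WAL sender processes
kubectl exec --namespace <namespace> --stdin --tty services/<cluster_name>-rw -- psql -c "SHOW max_wal_senders;"

# WAL senders currently serving replicas and other streaming clients
kubectl exec --namespace <namespace> --stdin --tty services/<cluster_name>-rw -- psql -c "SELECT count(*) FROM pg_stat_replication;"
```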

## Mitigation

- Terminate long-running transactions that generate excessive changes. Connect to the primary with:

```bash
kubectl exec -it services/<cluster_name>-rw --namespace <namespace> -- psql
```
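
From that session, or non-interactively as sketched below, you can list long-running transactions in `pg_stat_activity` and terminate an offending backend. The 5-minute threshold is only an example; adjust it to your workload.

```bash
# List transactions that have been open for more than 5 minutes (example threshold)
kubectl exec -it services/<cluster_name>-rw --namespace <namespace> -- psql -c \
  "SELECT pid, usename, state, xact_start, query
   FROM pg_stat_activity
   WHERE xact_start < now() - interval '5 minutes'
   ORDER BY xact_start;"

# Terminate a specific backend once you have confirmed it is safe to do so
kubectl exec -it services/<cluster_name>-rw --namespace <namespace> -- psql -c \
  "SELECT pg_terminate_backend(<pid>);"
```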
- Increase the memory and CPU resources of the instances under heavy load by setting `cluster.resources.requests` and `cluster.resources.limits` in your Helm values. Set `requests` and `limits` to the same values to give the instances the Guaranteed QoS class. This requires a restart of the CloudNativePG cluster instances and a primary switchover, which will cause a brief service disruption.
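
A minimal sketch of the Helm change, assuming a release named `<release_name>` installed from the `cnpg/cluster` chart and example resource sizes:

```bash
# Equal requests and limits give the instance pods the Guaranteed QoS class
helm upgrade <release_name> cnpg/cluster --namespace <namespace> --reuse-values \
  --set cluster.resources.requests.cpu=2 \
  --set cluster.resources.requests.memory=4Gi \
  --set cluster.resources.limits.cpu=2 \
  --set cluster.resources.limits.memory=4Gi
```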
- Enable `wal_compression` by setting the `cluster.postgresql.parameters.wal_compression` parameter to `on`. Doing so will reduce the size of the WAL files and can help reduce replication lag in a congested network. Changing `wal_compression` does not require a restart of the CloudNativePG cluster.
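
A sketch of the corresponding Helm change, with the same assumed release and chart names as above; `--set-string` keeps the value as the literal string `on`:

```bash
helm upgrade <release_name> cnpg/cluster --namespace <namespace> --reuse-values \
  --set-string cluster.postgresql.parameters.wal_compression=on

# Confirm the setting took effect
kubectl exec --namespace <namespace> --stdin --tty services/<cluster_name>-rw -- psql -c "SHOW wal_compression;"
```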
- Increase IOPS or throughput of the storage used by the cluster to alleviate disk I/O bottlenecks. This requires creating a new storage class with higher IOPS/throughput and rebuilding cluster instances and their PVCs one by one using the new storage class. This is a slow process that will also affect the cluster's availability.
If you decide to go this route:
1. Start by creating a new storage class. The storage class of an existing Persistent Volume Claim (PVC) is immutable, so the instances' PVCs have to be recreated with the new class.

2. Make sure to only replace one instance at a time to avoid service disruption.

3. Double-check that you are deleting the correct pod.

4. Don't start with the active primary instance. Delete one of the standby replicas first.
```bash
kubectl delete --namespace <namespace> pod/<pod-name> pvc/<pod-name> pvc/<pod-name>-wal
```
- If the cluster has 9 or more instances, ensure that the `max_wal_senders` parameter is set to a value greater than or equal to the total number of instances in your cluster.
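
A sketch of raising the parameter via Helm, with the same assumed release and chart names as above. `max_wal_senders` only changes at server restart, so expect the operator to restart the cluster instances:

```bash
# Example: a 12-instance cluster, so max_wal_senders is raised above 12
helm upgrade <release_name> cnpg/cluster --namespace <namespace> --reuse-values \
  --set-string cluster.postgresql.parameters.max_wal_senders=16
```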
