diff --git a/charts/cluster/README.md b/charts/cluster/README.md
index b78ce0d9fc..99d70a2e00 100644
--- a/charts/cluster/README.md
+++ b/charts/cluster/README.md
@@ -168,6 +168,7 @@ refer to the [CloudNativePG Documentation](https://cloudnative-pg.io/documentat
 | cluster.monitoring.customQueriesSecret | list | `[]` | The list of secrets containing the custom queries |
 | cluster.monitoring.disableDefaultQueries | bool | `false` | Whether the default queries should be injected. Set it to true if you don't want to inject default queries into the cluster. |
 | cluster.monitoring.enabled | bool | `false` | Whether to enable monitoring |
+| cluster.monitoring.instrumentation.logicalReplication | bool | `true` | Enable logical replication metrics |
 | cluster.monitoring.podMonitor.enabled | bool | `true` | Whether to enable the PodMonitor |
 | cluster.monitoring.podMonitor.metricRelabelings | list | `[]` | The list of metric relabelings for the PodMonitor. Applied to samples before ingestion. |
 | cluster.monitoring.podMonitor.relabelings | list | `[]` | The list of relabelings for the PodMonitor. Applied to samples before scraping. |
diff --git a/charts/cluster/docs/runbooks/CNPGClusterHighPhysicalReplicationLagWarning.md b/charts/cluster/docs/runbooks/CNPGClusterHighPhysicalReplicationLagWarning.md
new file mode 100644
index 0000000000..757754fbf7
--- /dev/null
+++ b/charts/cluster/docs/runbooks/CNPGClusterHighPhysicalReplicationLagWarning.md
@@ -0,0 +1,69 @@
+# CNPGClusterHighPhysicalReplicationLagWarning
+
+## Description
+
+The `CNPGClusterHighPhysicalReplicationLagWarning` alert is triggered when physical replication lag in the CloudNativePG cluster exceeds 1 second.
+
+## Impact
+
+High physical replication lag can cause the cluster replicas to become out of sync. Queries to the `-r` and `-ro` endpoints may return stale data. In the event of a failover, any data that has not yet been replicated from the primary to the replicas may be lost.
+
+## Diagnosis
+
+Check replication status in the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/) or by running:
+
+```bash
+kubectl exec --namespace <namespace> --stdin --tty services/<cluster>-rw -- psql -c "SELECT * FROM pg_stat_replication;"
+```
+
+High physical replication lag can be caused by a number of factors, including:
+
+- Network congestion on the node interface
+
+Inspect the network interface statistics using the `Kubernetes Cluster` section of the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).
+
+- High CPU or memory load on the primary or replicas
+
+Inspect the CPU and memory usage of the CloudNativePG cluster instances using the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).
+
+- Disk I/O bottlenecks on replicas
+
+Inspect the disk I/O statistics using the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).
+
+- Long-running queries
+
+Inspect the `Stat Activity` section of the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).
+
+- Suboptimal PostgreSQL configuration, e.g. too few `max_wal_senders`. Set this to at least the number of cluster instances (the default of 10 is usually sufficient).
+
+Inspect the `PostgreSQL Parameters` section of the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).
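+
+If you prefer the command line over the dashboard, the relevant settings can also be checked directly with `psql` (a minimal example, assuming the same `<namespace>` and `<cluster>` placeholders used above):
+
+```bash
+kubectl exec --namespace <namespace> --stdin --tty services/<cluster>-rw -- psql -c "SELECT name, setting FROM pg_settings WHERE name IN ('max_wal_senders', 'wal_compression', 'max_replication_slots');"
+```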
+
+## Mitigation
+
+- Terminate long-running transactions that generate excessive changes.
+
+```bash
+kubectl exec -it services/cluster-rw --namespace <namespace> -- psql
+```
+
+- Increase the memory and CPU resources of the instances under heavy load. This can be done by setting `cluster.resources.requests` and `cluster.resources.limits` in your Helm values. Set both `requests` and `limits` to the same value to achieve the Guaranteed QoS class. This will require a restart of the CloudNativePG cluster instances and a primary switchover, which will cause a brief service disruption.
+
+- Enable `wal_compression` by setting the `cluster.postgresql.parameters.wal_compression` parameter to `on`. Doing so reduces the size of the WAL files and can help reduce replication lag in a congested network. Changing `wal_compression` does not require a restart of the CloudNativePG cluster.
+
+- Increase the IOPS or throughput of the storage used by the cluster to alleviate disk I/O bottlenecks. This requires creating a new storage class with higher IOPS/throughput and rebuilding the cluster instances and their PVCs one by one using the new storage class. This is a slow process that will also affect the cluster's availability.
+
+If you decide to go this route:
+
+1. Start by creating a new storage class. Storage classes are immutable, so you cannot change the storage class of existing Persistent Volume Claims (PVCs).
+
+2. Make sure to only replace one instance at a time to avoid service disruption.
+
+3. Double-check that you are deleting the correct pod.
+
+4. Don't start with the active primary instance. Delete one of the standby replicas first.
+
+```bash
+kubectl delete --namespace <namespace> pod/<instance-name> pvc/<instance-name> pvc/<instance-name>-wal
+```
+
+- If the cluster has 9 or more instances, ensure that the `max_wal_senders` parameter is set to a value greater than or equal to the total number of instances in your cluster.
diff --git a/charts/cluster/docs/runbooks/CNPGClusterLogicalReplicationErrors.md b/charts/cluster/docs/runbooks/CNPGClusterLogicalReplicationErrors.md
new file mode 100644
index 0000000000..13f7eb9dd1
--- /dev/null
+++ b/charts/cluster/docs/runbooks/CNPGClusterLogicalReplicationErrors.md
@@ -0,0 +1,381 @@
+# CNPGClusterLogicalReplicationErrors
+
+## Description
+
+The `CNPGClusterLogicalReplicationErrors` alert indicates that a logical replication subscription is experiencing errors during data replication. This includes:
+
+1. **Apply Errors**: Errors that occur when applying received changes from the publisher
+2. 
**Sync Errors**: Errors that occur during the initial table synchronization phase + +- **Warning level**: Any error detected in the last 5 minutes +- **Critical level**: 5 or more errors in the last 5 minutes + +## Impact + +- **Data Inconsistency**: The subscriber may have missing or incorrect data +- **Replication Paused**: Depending on configuration, replication might stop on errors +- **Growing Lag**: Errors can cause replication to fall behind +- **Critical**: Persistent errors may lead to complete replication failure + +## Diagnosis + +### Step 1: Check Error Details + +```bash +# Connect to the subscriber and check subscription status +kubectl exec -it svc/SUBSCRIBER-CLUSTER-rw -n NAMESPACE -- psql -c " +SELECT + subname, + subenabled, + apply_error_count, + sync_error_count, + stats_reset +FROM pg_stat_subscription +WHERE apply_error_count > 0 OR sync_error_count > 0; +" + +# Check the last error message +kubectl exec -it svc/SUBSCRIBER-CLUSTER-rw -n NAMESPACE -- psql -c " +SELECT + subname, + last_msg_receipt_time, + latest_end_time, + CASE + WHEN apply_error_count > 0 THEN 'Apply errors detected' + WHEN sync_error_count > 0 THEN 'Sync errors detected' + END as error_type +FROM pg_stat_subscription; +" +``` + +### Step 2: Check PostgreSQL Logs + +```bash +# Get the pod name +POD=$(kubectl get pods -n NAMESPACE -l app=postgresql -o name | head -1 | cut -d/ -f2) + +# Check recent logs for errors +kubectl logs -n NAMESPACE $POD --tail=100 | grep -i "replication\|subscription\|error" + +# Stream logs for real-time monitoring +kubectl logs -n NAMESPACE $POD -f | grep -i "replication\|subscription\|error" +``` + +### Step 3: Identify Common Error Patterns + +1. **Constraint Violations**: + ```bash + kubectl logs -n NAMESPACE $POD | grep "violates.*constraint" + ``` + +2. **Permission Issues**: + ```bash + kubectl logs -n NAMESPACE $POD | grep "permission denied\|role" + ``` + +3. **Data Type Mismatches**: + ```bash + kubectl logs -n NAMESPACE $POD | grep "invalid input syntax\|datatype" + ``` + +4. **Connection Issues**: + ```bash + kubectl logs -n NAMESPACE $POD | grep "connection\|timeout" + ``` + +### Step 4: Verify Publication/Subscription Configuration + +```bash +# On publisher - check publication tables +kubectl exec -it svc/PUBLISHER-CLUSTER-rw -n NAMESPACE -- psql -c " +SELECT pubname, puballtables, pubinsert, pubupdate, pubdelete +FROM pg_publication; +" + +# On subscriber - check subscription details +kubectl exec -it svc/SUBSCRIBER-CLUSTER-rw -n NAMESPACE -- psql -c " +SELECT + subname, + srconninfo, + srschema, + srslotname, + srsynccommit +FROM pg_subscription; +" + +# Check which tables are being replicated +kubectl exec -it svc/SUBSCRIBER-CLUSTER-rw -n NAMESPACE -- psql -c " +SELECT + relid::regclass as table_name, + srsubstate as state +FROM pg_subscription_rel +JOIN pg_class ON relid = oid +WHERE srsubstate NOT IN ('r', 's'); -- Not ready or synchronizing +" +``` + +### Step 5: Check for Data Conflicts + +```bash +# Check for conflicting primary keys +kubectl exec -it svc/SUBSCRIBER-CLUSTER-rw -n NAMESPACE -- psql -c " +SELECT schemaname, tablename, attname, n_distinct, correlation +FROM pg_stats +WHERE schemaname NOT IN ('pg_catalog', 'information_schema') +ORDER BY schemaname, tablename; +" +``` + +## Resolution + +### For Constraint Violations + +1. **Identify the Constraint**: + ```sql + -- Find the violated constraint + SELECT conname, contype, pg_get_constraintdef(oid) + FROM pg_constraint + WHERE conrelid = 'table_name'::regclass; + ``` + +2. 
**Resolve Data Conflicts**: + ```sql + -- Option 1: Remove conflicting data on subscriber + DELETE FROM table_name WHERE id = conflicting_id; + + -- Option 2: Update conflicting data + UPDATE table_name + SET conflicting_column = new_value + WHERE id = conflicting_id; + + -- Option 3: Temporarily disable constraint (use with caution) + ALTER TABLE table_name DISABLE TRIGGER ALL; + -- After sync, re-enable + ALTER TABLE table_name ENABLE TRIGGER ALL; + ``` + +### For Permission Issues + +1. **Check Subscription Owner**: + ```sql + SELECT usename, usesuper, usecreatedb + FROM pg_user + WHERE usename = current_user; + ``` + +2. **Grant Necessary Permissions**: + ```sql + -- On subscriber, ensure subscription owner has rights + GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO subscription_user; + GRANT ALL PRIVILEGES ON ALL SEQUENCES IN SCHEMA public TO subscription_user; + ``` + +### For Data Type Mismatches + +1. **Verify Schema Consistency**: + ```sql + -- On publisher + \d+ table_name + + -- On subscriber + \d+ table_name + + -- Compare columns and types + SELECT column_name, data_type, is_nullable + FROM information_schema.columns + WHERE table_name = 'table_name'; + ``` + +2. **Fix Schema Issues**: + ```sql + -- Alter table to match publisher schema + ALTER TABLE table_name ALTER COLUMN column_name TYPE new_type; + ``` + +### For Initial Sync Errors + +1. **Check if Tables Exist**: + ```sql + -- On subscriber, ensure tables exist + SELECT tablename FROM pg_tables WHERE schemaname = 'public'; + ``` + +2. **Create Missing Tables**: + ```sql + -- Export schema from publisher + pg_dump -h PUBLISHER-HOST -U postgres -s -t table_name database_name + + -- Import into subscriber + psql -h SUBSCRIBER-HOST -U postgres -d database_name < schema_dump.sql + ``` + +3. **Reset Subscription**: + ```bash + # WARNING: This will resync all data + kubectl cnpg subscription restart SUBSCRIPTION-NAME -n NAMESPACE + + # Or completely recreate + kubectl cnpg subscription delete SUBSCRIPTION-NAME -n NAMESPACE + # Recreate with proper configuration + ``` + +### For Connection/Timeout Issues + +1. **Check Connectivity**: + ```bash + # Test connection from subscriber to publisher + kubectl exec -it svc/SUBSCRIBER-CLUSTER-rw -n NAMESPACE -- \ + psql -h PUBLISHER-HOST -U postgres -d database_name -c "SELECT 1;" + ``` + +2. **Increase Timeout Values**: + ```yaml + # In subscription configuration + spec: + parameters: + application_name: "my_subscription" + synchronous_commit: "off" + # Increase timeout for slow networks + ``` + +## Recovery Procedures + +Choose one of the following approaches based on your situation: + +### Option 1: Resolve Data Conflict (Recommended - Lets Replication Retry Automatically) + +**When to use**: When you have a specific constraint violation (e.g., duplicate key) and want to let the publisher's data replicate correctly. + +The most common cause of replication errors is conflicting data between publisher and subscriber. PostgreSQL's logical replication **stops** when it encounters a conflict and requires manual intervention. + +#### Step 1: Identify the conflicting data + +Check the PostgreSQL logs for the conflict details: + +```bash +kubectl logs -n NAMESPACE $POD | grep "conflict detected\|duplicate key" +``` + +You'll see something like: +``` +ERROR: duplicate key value violates unique constraint "test_pkey" +DETAIL: Key (c)=(1) already exists. 
+CONTEXT: processing remote data for replication origin "pg_16395" during "INSERT" +for replication target relation "public.test" in transaction 725 finished at 0/14C0378 +``` + +This tells you which table and key is causing the conflict. + +#### Step 2: Remove or fix the conflicting row on the subscriber + +```sql +-- For INSERT conflicts: Delete the conflicting row to let publisher's data replicate +DELETE FROM table_name WHERE id = conflicting_id; +``` + +**That's it!** Once you remove the conflicting row, logical replication will **automatically retry** the transaction and apply the publisher's data. You do NOT need to manually skip the transaction. + +**Important**: Only delete the subscriber's data if you're certain the publisher's version should win. + +### Option 2: Skip Transaction Without Applying Publisher's Data (Use With Caution) + +**When to use**: When you want to keep the subscriber's version of the data and permanently ignore what the publisher tried to send. This causes data divergence. + +If you've decided that the subscriber's conflicting data is correct and you want to ignore the publisher's transaction: + +```sql +-- Using ALTER SUBSCRIPTION SKIP +-- The subscription must be enabled for this to work +ALTER SUBSCRIPTION your_subscription SKIP (lsn = '0/14C0378'); +``` + +**WARNING**: This permanently skips the transaction and causes the subscriber to differ from the publisher. Document what was skipped. + +### Option 3: Full Resynchronization (For Multiple Conflicts or Unknown State) + +**When to use**: When you have many conflicts, corrupted data, or prefer to start fresh rather than manually fixing individual rows. + +**WARNING**: This will re-copy all table data and may take a long time for large tables. + +```bash +# Mark subscription for full refresh +kubectl exec -it svc/SUBSCRIBER-CLUSTER-rw -n NAMESPACE -- psql -c " +ALTER SUBSCRIPTION your_subscription REFRESH PUBLICATION WITH (copy_data = true); +" + +# Or restart the subscription +kubectl cnpg subscription restart your_subscription -n NAMESPACE +``` + +## Important Notes + +- **PostgreSQL logical replication automatically retries after you fix the conflict** - Just delete or fix the conflicting row, and replication will resume on its own +- **Only use SKIP if you want to ignore the publisher's data** - Skipping means you're choosing to keep the subscriber's version and create data divergence +- **For typical constraint violations** - Delete the subscriber's conflicting row (Option 1), don't skip the transaction + +## Prevention + +1. **Schema Changes**: + - Always test schema changes in staging first + - Use DDL replication tools or manually sync schemas + - Coordinate schema changes between publisher and subscriber + +2. **Data Validation**: + ```sql + -- Regular data consistency checks + SELECT COUNT(*) FROM table_name; + -- Compare counts between publisher and subscriber + ``` + +3. **Monitoring**: + - Set up alerts for error rates + - Monitor pg_stat_subscription regularly + - Log error details for faster troubleshooting + +4. 
**Best Practices**: + - Don't modify subscriber data directly (unless bidirectional replication) + - Use consistent character sets and collations + - Ensure sufficient disk space for WAL retention + +## Common Error Scenarios + +### Primary Key Conflicts +```sql +-- Find duplicates +SELECT id, COUNT(*) +FROM table_name +GROUP BY id +HAVING COUNT(*) > 1; + +-- Resolve by updating or removing duplicates +``` + +### Missing Sequences +```sql +-- Check sequence ownership +SELECT relname, seqrelid::regclass +FROM pg_depend +WHERE refobjid = 'table_name'::regclass + AND deptype = 'a'; + +-- Sync sequence values +SELECT setval('sequence_name', (SELECT max(id) FROM table_name)); +``` + +### Trigger Conflicts +```sql +-- Disable problematic triggers during sync +ALTER TABLE table_name DISABLE TRIGGER trigger_name; + +-- Re-enable after sync +ALTER TABLE table_name ENABLE TRIGGER trigger_name; +``` + +## When to Escalate + +- Contact support if: + - Errors persist after all troubleshooting steps + - You encounter frequent constraint violations + - The schema cannot be synchronized + - You need to skip transactions repeatedly + - Error rate is increasing despite fixes \ No newline at end of file diff --git a/charts/cluster/docs/runbooks/CNPGClusterLogicalReplicationLagging.md b/charts/cluster/docs/runbooks/CNPGClusterLogicalReplicationLagging.md new file mode 100644 index 0000000000..66356cdcb4 --- /dev/null +++ b/charts/cluster/docs/runbooks/CNPGClusterLogicalReplicationLagging.md @@ -0,0 +1,233 @@ +# CNPGClusterLogicalReplicationLagging + +## Description + +The `CNPGClusterLogicalReplicationLagging` alert indicates that a CloudNativePG cluster with a logical replication subscription is falling behind its publisher. This alert aggregates three types of lag: + +1. **Receipt Lag** (`cnpg_pg_stat_subscription_receipt_lag_seconds`): Time since the last WAL message was received from the publisher +2. **Apply Lag** (`cnpg_pg_stat_subscription_apply_lag_seconds`): Time delay between receiving and actually applying changes +3. 
**LSN Distance** (`cnpg_pg_stat_subscription_buffered_lag_bytes`): Amount of WAL data buffered but not yet applied (measured in bytes) + +- **Warning level**: Any lag metric exceeds 60s or 1GB +- **Critical level**: Any lag metric exceeds 300s or 4GB + +## Impact + +The cluster remains operational, but: +- Queries to the subscriber will return stale data +- Data inconsistency between publisher and subscriber +- In critical cases, disk space on the publisher may fill up with unapplied WAL +- Recovery time increases with lag duration + +## Diagnosis + +### Step 1: Identify the Lag Type + +Connect to the subscriber and check the current state: + +```bash +kubectl exec -it svc/SUBSCRIBER-CLUSTER-rw -n NAMESPACE -- psql -c " +SELECT + subname, + enabled, + EXTRACT(EPOCH FROM (NOW() - last_msg_receipt_time)) as receipt_lag_seconds, + EXTRACT(EPOCH FROM (NOW() - latest_end_time)) as apply_lag_seconds, + pg_wal_lsn_diff(received_lsn, latest_end_lsn) as pending_bytes, + CASE + WHEN EXTRACT(EPOCH FROM (NOW() - last_msg_receipt_time)) > 60 THEN 'High receipt lag' + WHEN EXTRACT(EPOCH FROM (NOW() - latest_end_time)) > 60 THEN 'High apply lag' + WHEN pg_wal_lsn_diff(received_lsn, latest_end_lsn) > 1024^3 THEN 'High LSN distance' + END as primary_issue +FROM pg_stat_subscription; +" +``` + +### Step 2: Check Network Connectivity + +For **receipt lag** issues: + +```bash +# Check network latency between publisher and subscriber +kubectl exec -it svc/SUBSCRIBER-CLUSTER-rw -n NAMESPACE -- \ + ping -c 10 PUBLISHER-HOSTNAME + +# Check bandwidth (if tools are available) +kubectl exec -it svc/SUBSCRIBER-CLUSTER-rw -n NAMESPACE -- \ + nc -zv PUBLISHER-HOSTNAME 5432 +``` + +### Step 3: Check Resource Utilization + +For **apply lag** issues: + +```bash +# Check CPU/Memory usage on subscriber +kubectl top pod -n NAMESPACE -l app=postgresql + +# Check disk I/O +kubectl exec -it svc/SUBSCRIBER-CLUSTER-rw -n NAMESPACE -- \ + iostat -x 1 5 + +# Check for long-running queries +kubectl exec -it svc/SUBSCRIBER-CLUSTER-rw -n NAMESPACE -- psql -c " +SELECT pid, now() - pg_stat_activity.query_start AS duration, query +FROM pg_stat_activity +WHERE state = 'active' AND now() - query_start > interval '5 minutes' +ORDER BY duration DESC; +" +``` + +### Step 4: Check Configuration + +```bash +# Verify replication worker settings +kubectl exec -it svc/SUBSCRIBER-CLUSTER-rw -n NAMESPACE -- psql -c " +SHOW max_worker_processes; +SHOW max_logical_replication_workers; +SHOW max_parallel_workers; +" + +# Ensure adequate worker processes: +# max_worker_processes >= max_parallel_workers + max_logical_replication_workers +``` + +### Step 5: Monitor Trends + +Use the CloudNativePG Grafana Dashboard: +- Navigate to the Logical Replication section +- Examine all lag graphs over time +- Check if lag is stable, increasing, or fluctuating +- Correlate with workload spikes + +## Resolution + +### For Receipt Lag (Network Issues) + +1. **Check Network Latency**: + - Verify network connectivity between clusters + - Consider placing clusters in the same region/availability zone + - Check for network congestion or throttling + +2. **Optimize Network Configuration**: + ```yaml + # In the subscriber's postgresql configuration + postgresql: + parameters: + wal_sender_timeout: '60s' + wal_receiver_status_interval: '10s' + ``` + +### For Apply Lag (Resource Issues) + +1. **Scale Up Resources**: + ```yaml + # Increase CPU/memory for the subscriber + resources: + requests: + cpu: 2 + memory: 8Gi + limits: + cpu: 4 + memory: 16Gi + ``` + +2. 
**Optimize Disk I/O**: + - Use faster storage (SSD if not already) + - Consider increasing storage IOPS + - Check for disk bottlenecks + +3. **Tune PostgreSQL Settings**: + ```yaml + postgresql: + parameters: + # Increase for better write performance + wal_buffers: '16MB' + checkpoint_completion_target: 0.9 + # Reduce checkpoint frequency + max_wal_size: '4GB' + min_wal_size: '1GB' + ``` + +### For High Transaction Volume + +1. **Batch Large Transactions**: + - Break large transactions into smaller ones + - Use `COPY` instead of many INSERT statements + +2. **Consider Row Filtering**: + ```sql + -- Only replicate needed data + ALTER PUBLICATION publication_name SET (publish = 'insert, update, delete'); + ALTER PUBLICATION publication_name ADD TABLE table_name WHERE (condition); + ``` + +3. **Temporarily Disable Triggers**: + ```sql + -- On subscriber for performance-critical periods + ALTER TABLE table_name DISABLE TRIGGER ALL; + -- Remember to re-enable after + ``` + +### General Tuning + +1. **Increase Replication Slots**: + ```yaml + # If multiple publications + postgresql: + parameters: + max_replication_slots: 10 + max_wal_senders: 10 + ``` + +2. **Monitor and Restart**: + ```bash + # If subscriber is stuck + kubectl cnpg subscription restart SUBSCRIPTION-NAME -n NAMESPACE + + # Or restart the entire cluster + kubectl cnpg restart SUBSCRIBER-CLUSTER -n NAMESPACE + ``` + +## Prevention + +1. **Right-size Resources**: + - Allocate adequate CPU, memory, and storage IOPS + - Monitor resource utilization regularly + +2. **Network Optimization**: + - Place publisher and subscriber close to each other + - Use dedicated network connections if possible + +3. **Regular Monitoring**: + - Set up proactive monitoring before issues become critical + - Review lag trends regularly + - Set up automated scaling based on metrics + +4. **Maintenance Windows**: + - Schedule large data operations during low-traffic periods + - Consider pausing replication during major maintenance + +## Additional Commands + +```bash +# Check replication slot status +kubectl exec -it svc/PUBLISHER-CLUSTER-rw -n NAMESPACE -- psql -c " +SELECT slot_name, active, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) as lag_bytes +FROM pg_replication_slots +WHERE slot_type = 'logical'; +" + +# Force sync (if needed) +kubectl cnpg subscription enable SUBSCRIPTION-NAME -n NAMESPACE + +# Check subscription details +kubectl exec -it svc/SUBSCRIBER-CLUSTER-rw -n NAMESPACE -- psql -c "\dRs+" +``` + +## When to Escalate + +- Contact support if: + - Lag continues to increase despite optimization + - Network issues persist between clusters + - Resource utilization is at maximum but lag continues + - You experience frequent replication failures \ No newline at end of file diff --git a/charts/cluster/docs/runbooks/CNPGClusterLogicalReplicationStopped.md b/charts/cluster/docs/runbooks/CNPGClusterLogicalReplicationStopped.md new file mode 100644 index 0000000000..de2190e22b --- /dev/null +++ b/charts/cluster/docs/runbooks/CNPGClusterLogicalReplicationStopped.md @@ -0,0 +1,336 @@ +# CNPGClusterLogicalReplicationStopped + +## Description + +The `CNPGClusterLogicalReplicationStopped` alert indicates that a logical replication subscription is not actively replicating data. This can occur in two scenarios: + +1. **Disabled Subscription**: The subscription has been explicitly disabled (`enabled = false`) +2. 
**Stuck Subscription**: The subscription is enabled but has no active worker process (no PID) with pending data + +- **Warning level**: Subscription stopped for 5 minutes +- **Critical level**: Subscription stopped for 15 minutes + +## Impact + +- **No Data Replication**: The subscriber will not receive any updates from the publisher +- **Data Divergence**: The subscriber data becomes increasingly stale +- **Disk Space**: WAL files may accumulate on the publisher +- **Critical**: Extended downtime may require full resynchronization + +## Diagnosis + +### Step 1: Check Subscription Status + +```bash +# Check all subscriptions and their status +kubectl exec -it svc/SUBSCRIBER-CLUSTER-rw -n NAMESPACE -- psql -c " +SELECT + pg_subscription.subname, + pg_subscription.enabled, + CASE + WHEN pg_subscription.enabled = false THEN 'Explicitly disabled' + WHEN pid IS NULL AND buffered_lag_bytes > 0 THEN 'Stuck (no worker)' + WHEN pid IS NOT NULL THEN 'Active' + ELSE 'Unknown' + END as status, + pg_wal_lsn_diff(received_lsn, latest_end_lsn) as pending_bytes, + pid IS NOT NULL as has_worker +FROM pg_subscription +LEFT JOIN pg_stat_subscription ON pg_subscription.oid = pg_stat_subscription.subid; +" +``` + +### Step 2: Check Worker Process + +```bash +# Check if replication worker is running +kubectl exec -it svc/SUBSCRIBER-CLUSTER-rw -n NAMESPACE -- psql -c " +SELECT + pid, + application_name, + state, + backend_type, + query_start +FROM pg_stat_activity +WHERE application_name LIKE '%subscription%' OR backend_type = 'logical replication worker'; +" +``` + +### Step 3: Verify Subscription Details + +```bash +# Get subscription configuration +kubectl exec -it svc/SUBSCRIBER-CLUSTER-rw -n NAMESPACE -- psql -c " +SELECT + subname, + srconninfo, + srsynccommit, + srslotname, + srsyncstate as sync_state +FROM pg_subscription; +" +``` + +### Step 4: Check PostgreSQL Logs + +```bash +# Get the pod name +POD=$(kubectl get pods -n NAMESPACE -l app=postgresql -o name | head -1 | cut -d/ -f2) + +# Check for subscription-related errors +kubectl logs -n NAMESPACE $POD --tail=200 | grep -i "subscription\|replication\|worker" +``` + +### Step 5: Test Connectivity to Publisher + +```bash +# Extract connection info from subscription +kubectl exec -it svc/SUBSCRIBER-CLUSTER-rw -n NAMESPACE -- psql -c " +SELECT srconninfo FROM pg_subscription WHERE subname = 'your_subscription_name'; +" | grep -o "host=[^ ]*" | cut -d= -f2 + +# Test connection +kubectl exec -it svc/SUBSCRIBER-CLUSTER-rw -n NAMESPACE -- \ + psql "host=PUBLISHER-HOST port=5432 dbname=DATABASE user=USER" -c "SELECT version();" +``` + +## Resolution + +### If Subscription is Disabled + +1. **Check if Disable Was Intentional**: + ```bash + # Check recent activity + kubectl get events -n NAMESPACE --field-selector reason=SubscriptionDisabled + + # Check audit logs if RBAC is enabled + kubectl auth can-i create subscriptions + ``` + +2. **Enable the Subscription**: + ```sql + -- Enable the subscription + ALTER SUBSCRIPTION subscription_name ENABLE; + ``` + + Or using kubectl: + ```bash + kubectl cnpg subscription enable subscription_name -n NAMESPACE + ``` + +### If Subscription is Stuck + +1. 
**Check for Worker Resource Limits**: + ```bash + # Check max_logical_replication_workers + kubectl exec -it svc/SUBSCRIBER-CLUSTER-rw -n NAMESPACE -- psql -c " + SHOW max_logical_replication_workers; + SHOW max_worker_processes; + " + + # Count active replication workers + kubectl exec -it svc/SUBSCRIBER-CLUSTER-rw -n NAMESPACE -- psql -c " + SELECT COUNT(*) FROM pg_stat_activity WHERE backend_type = 'logical replication worker'; + " + ``` + +2. **Increase Worker Limits if Needed**: + ```yaml + # In the CNPG cluster configuration + postgresql: + parameters: + max_logical_replication_workers: 10 + max_worker_processes: 20 + max_replication_slots: 10 + ``` + +3. **Restart the Subscription**: + ```bash + # First try to restart just the subscription + kubectl cnpg subscription restart subscription_name -n NAMESPACE + + # If that doesn't work, restart the entire cluster + kubectl cnpg restart subscriber-cluster -n NAMESPACE + ``` + +4. **Check for Stuck Transactions**: + ```sql + -- Check for long-running transactions that might block replication + SELECT pid, now() - pg_stat_activity.query_start AS duration, query + FROM pg_stat_activity + WHERE state = 'active' + AND now() - query_start > interval '10 minutes' + AND pid NOT IN (SELECT pid FROM pg_stat_activity WHERE application_name LIKE '%subscription%'); + + -- Terminate blocking transactions if necessary + SELECT pg_terminate_backend(pid); + ``` + +### If Connection Issues + +1. **Verify Publication Exists**: + ```bash + # On publisher + kubectl exec -it svc/PUBLISHER-CLUSTER-rw -n NAMESPACE -- psql -c " + SELECT pubname FROM pg_publication; + " + ``` + +2. **Check Replication Slot Status**: + ```bash + # On publisher + kubectl exec -it svc/PUBLISHER-CLUSTER-rw -n NAMESPACE -- psql -c " + SELECT slot_name, active, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) as lag + FROM pg_replication_slots + WHERE slot_type = 'logical'; + " + ``` + +3. **Recreate Subscription**: + ```sql + -- Drop and recreate the subscription + DROP SUBSCRIPTION IF EXISTS subscription_name; + + CREATE SUBSCRIPTION subscription_name + CONNECTION 'host=publisher-host port=5432 dbname=database_name user=replication_user password=xxx' + PUBLICATION publication_name + WITH ( + copy_data = true, + synchronized_commit = 'off', + create_slot = true + ); + ``` + +### If WAL Retention Issues + +1. **Check WAL Retention**: + ```bash + # On publisher, check wal_keep_size + kubectl exec -it svc/PUBLISHER-CLUSTER-rw -n NAMESPACE -- psql -c "SHOW wal_keep_size;" + + # Check if WAL was removed before subscription could catch up + kubectl exec -it svc/PUBLISHER-CLUSTER-rw -n NAMESPACE -- psql -c " + SELECT slot_name, restart_lsn, pg_current_wal_lsn() + FROM pg_replication_slots; + " + ``` + +2. 
**Increase WAL Retention**: + ```yaml + # In publisher configuration + postgresql: + parameters: + wal_keep_size: '2GB' + max_slot_wal_keep_size: '4GB' + ``` + +## Advanced Troubleshooting + +### Manual Worker Creation + +```sql +-- If workers aren't starting automatically +SELECT pg_reload_conf(); + +-- Force subscription to start worker +ALTER SUBSCRIPTION subscription_name ENABLE; +ALTER SUBSCRIPTION subscription_name REFRESH PUBLICATION; +``` + +### Check System Resources + +```bash +# Check for OOM kills or resource constraints +kubectl describe pod -n NAMESPACE POD-NAME + +# Check if the pod was restarted +kubectl get pods -n NAMESPACE -l app=postgresql + +# Check node resources +kubectl top nodes +``` + +### Full Resync Procedure + +```bash +# Step 1: Mark all tables for resync +kubectl exec -it svc/SUBSCRIBER-CLUSTER-rw -n NAMESPACE -- psql -c " +SELECT schemaname, tablename +FROM pg_tables +WHERE schemaname NOT IN ('pg_catalog', 'information_schema'); +" + +# Step 2: Disable subscription +kubectl exec -it svc/SUBSCRIBER-CLUSTER-rw -n NAMESPACE -- psql -c " +ALTER SUBSCRIPTION subscription_name DISABLE; +" + +# Step 3: Truncate subscriber tables (if safe) +kubectl exec -it svc/SUBSCRIBER-CLUSTER-rw -n NAMESPACE -- psql -c " +TRUNCATE TABLE table_name CASCADE; -- Repeat for each table +" + +# Step 4: Re-enable with full copy +kubectl exec -it svc/SUBSCRIBER-CLUSTER-rw -n NAMESPACE -- psql -c " +ALTER SUBSCRIPTION subscription_name ENABLE; +ALTER SUBSCRIPTION subscription_name REFRESH PUBLICATION WITH (copy_data = true); +" +``` + +## Prevention + +1. **Monitoring**: + - Set up alerts for disabled subscriptions + - Monitor worker process counts + - Track subscription state changes + +2. **Resource Planning**: + - Ensure adequate worker processes + - Monitor disk space for WAL retention + - Set appropriate timeouts + +3. **High Availability**: + ```yaml + # Configure subscription retry parameters + postgresql: + parameters: + wal_receiver_timeout: '60s' + wal_receiver_status_interval: '10s' + wal_retrieve_retry_interval: '5s' + ``` + +4. 
**Backup Strategy**: + - Regular backups of both publisher and subscriber + - Document subscription configurations + - Test recovery procedures + +## Quick Reference Commands + +```bash +# Check subscription status +kubectl exec -it svc/CLUSTER-rw -n NS -- psql -c "SELECT * FROM pg_stat_subscription;" + +# Enable subscription +kubectl exec -it svc/CLUSTER-rw -n NS -- psql -c "ALTER SUBSCRIPTION sub_name ENABLE;" + +# Restart subscription +kubectl cnpg subscription restart sub_name -n NS + +# Restart cluster +kubectl cnpg restart CLUSTER -n NS + +# Check replication slots +kubectl exec -it svc/PUBLISHER-rw -n NS -- psql -c "SELECT * FROM pg_replication_slots;" + +# Check workers +kubectl exec -it svc/CLUSTER-rw -n NS -- psql -c "SELECT * FROM pg_stat_activity WHERE backend_type = 'logical replication worker';" +``` + +## When to Escalate + +- Contact support if: + - Subscription remains stuck after multiple restarts + - Workers fail to start despite adequate resources + - WAL retention issues prevent catch-up + - Frequent disconnections occur + - Data cannot be resynchronized successfully \ No newline at end of file diff --git a/charts/cluster/docs/runbooks/CNPGClusterPhysicalReplicationLag.md b/charts/cluster/docs/runbooks/CNPGClusterPhysicalReplicationLag.md new file mode 100644 index 0000000000..677302b77f --- /dev/null +++ b/charts/cluster/docs/runbooks/CNPGClusterPhysicalReplicationLag.md @@ -0,0 +1,206 @@ +# CNPGClusterPhysicalReplicationLag + +## Description + +The `CNPGClusterPhysicalReplicationLag` alerts indicate that physical replication lag in the CloudNativePG cluster is exceeding acceptable thresholds. Physical replication lag measures how far behind the standby replicas are from the primary instance. + +- **Warning level**: Replication lag exceeds 1 second +- **Critical level**: Replication lag exceeds 15 seconds + +## Impact + +Physical replication lag can cause the cluster replicas to become out of sync. Queries to the `-r` and `-ro` endpoints may return stale data. In the event of a failover, the data that has not yet been replicated from the primary to the replicas may be lost during failover. 
+ +- **Warning**: Minor data staleness, acceptable for read-heavy workloads with some tolerance for outdated data +- **Critical**: Significant data loss risk during failover, stale data affecting business operations + +## Diagnosis + +### Step 1: Check Replication Status + +Check replication status in the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/) or by running: + +```bash +kubectl exec --namespace --stdin --tty services/-rw -- psql -c "SELECT * FROM pg_stat_replication;" +``` + +### Step 2: Identify Common Causes + +High physical replication lag can be caused by a number of factors: + +**Network Issues:** +- Network congestion on the node interface +- Insufficient bandwidth between primary and replicas + +```bash +# Inspect network interface statistics +kubectl exec -it -- ss -i +``` + +**Resource Contention:** +- High CPU or memory load on primary or replicas +- Disk I/O bottlenecks on replicas + +```bash +# Check resource usage +kubectl top pods -n -l cnpg.io/podRole=instance + +# Check disk I/O +kubectl exec -it -- iostat -x 1 5 +``` + +**Database Issues:** +- Long-running queries blocking replication +- Suboptimal PostgreSQL configuration + +```bash +# Check for long-running queries +kubectl exec -it services/-rw -- psql -c " +SELECT pid, now() - pg_stat_activity.query_start AS duration, query +FROM pg_stat_activity +WHERE state = 'active' + AND now() - query_start > interval '5 minutes' +ORDER BY duration DESC; +" +``` + +### Step 3: Check PostgreSQL Configuration + +Inspect the `PostgreSQL Parameters` section of the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/) or check directly: + +```bash +kubectl exec -it services/-rw -- psql -c " +SHOW max_wal_senders; +SHOW wal_compression; +SHOW max_replication_slots; +" +``` + +## Resolution + +### For Warning Level Alerts (1-15 seconds lag) + +1. **Monitor Resource Usage:** + - Check CPU and Memory usage of the CloudNativePG cluster instances + - Monitor network traffic between primary and replicas + - Review disk I/O statistics + +2. **Identify and Address Minor Issues:** + - Look for and optimize long-running queries + - Check for temporary resource spikes + - Ensure adequate network bandwidth + +### For Critical Level Alerts (>15 seconds lag) + +1. **Immediate Actions:** + ```bash + # Terminate long-running transactions that generate excessive changes + kubectl exec -it services/-rw -- psql -c " + SELECT pg_terminate_backend(pid) + FROM pg_stat_activity + WHERE state = 'active' + AND now() - query_start > interval '30 minutes' + AND query NOT LIKE '%autovacuum%'; + " + ``` + +2. **Scale Up Resources:** + Increase the Memory and CPU resources of the instances under heavy load. This can be done by setting `cluster.resources.requests` and `cluster.resources.limits` in your Helm values. Set both `requests` and `limits` to the same value to achieve QoS Guaranteed. + + ```yaml + cluster: + resources: + requests: + cpu: 4 + memory: 16Gi + limits: + cpu: 4 + memory: 16Gi + ``` + +3. **Enable WAL Compression:** + ```yaml + cluster: + postgresql: + parameters: + wal_compression: "on" + ``` + This will reduce the size of the WAL files and can help reduce replication lag in congested networks. Changing `wal_compression` does not require a restart. + +4. **Upgrade Storage Performance:** + Increase IOPS or throughput of the storage used by the cluster to alleviate disk I/O bottlenecks. + + **Process:** + 1. 
Create a new storage class with higher IOPS/throughput + 2. Replace cluster instances one by one using the new storage class + 3. Start with standby replicas, not the primary + 4. Delete and recreate each instance with new storage: + + ```bash + kubectl delete --namespace pod/ pvc/ pvc/-wal + ``` + +5. **Increase WAL Senders:** + For clusters with 9+ instances, ensure `max_wal_senders` is adequate: + ```yaml + cluster: + postgresql: + parameters: + max_wal_senders: 15 # Should be >= number of instances + ``` + +## Prevention + +1. **Resource Planning:** + - Allocate adequate CPU, memory, and storage IOPS + - Monitor resource utilization regularly + - Set appropriate resource limits and requests + +2. **Network Optimization:** + - Ensure sufficient network bandwidth between replicas + - Consider placing replicas in the same availability zone + - Monitor network latency and throughput + +3. **Configuration Tuning:** + - Enable WAL compression to reduce replication bandwidth + - Ensure adequate `max_wal_senders` for cluster size + - Monitor and tune checkpoint settings + +4. **Regular Maintenance:** + - Monitor replication lag trends + - Review long-running query patterns + - Plan capacity upgrades before reaching limits + +## Quick Reference Commands + +```bash +# Check replication status +kubectl exec -n services/-w -- psql -c "SELECT * FROM pg_stat_replication;" + +# Check resource usage +kubectl top pods -n -l cnpg.io/podRole=instance + +# Check long-running queries +kubectl exec -it services/-rw -- psql -c " +SELECT pid, now() - pg_stat_activity.query_start AS duration, query +FROM pg_stat_activity +WHERE state = 'active' + AND now() - query_start > interval '5 minutes' +ORDER BY duration DESC; +" + +# Restart a replica (if needed) +kubectl delete pod -n + +# Check PostgreSQL parameters +kubectl exec -it services/-rw -- psql -c "SHOW max_wal_senders; SHOW wal_compression;" +``` + +## When to Escalate + +- Contact support if: + - Replication lag continues to increase despite optimization + - Network issues persist between cluster instances + - Resource utilization is at maximum but lag continues + - You experience frequent replication failures + - Lag remains critical for more than 30 minutes \ No newline at end of file diff --git a/charts/cluster/monitoring/metrics-clusters_postgresql_cnpg_io.yaml b/charts/cluster/monitoring/metrics-clusters_postgresql_cnpg_io.yaml new file mode 100644 index 0000000000..be6a92726b --- /dev/null +++ b/charts/cluster/monitoring/metrics-clusters_postgresql_cnpg_io.yaml @@ -0,0 +1,193 @@ +kind: CustomResourceStateMetrics +spec: + resources: + - groupVersionKind: + group: postgresql.cnpg.io + version: "v1" + kind: "Cluster" + labelsFromPath: + cluster: [metadata, name] + namespace: [metadata, namespace] + metrics: + - name: "image_info" + help: "Image used by the cluster pods" + each: + type: Info + info: + labelsFromPath: + image: [status, image] + + - name: "phase_info" + help: "Current phase of the cluster" + each: + type: Info + info: + labelsFromPath: + phase: [status, phase] + phase_reason: [status, phaseReason] + + - name: "instances_total" + help: "Total number of PVC Groups detected in the cluster" + each: + type: Gauge + gauge: + path: [status, instances] + + - name: "instances_ready" + help: "Total number of ready instances in the cluster" + each: + type: Gauge + gauge: + path: [status, readyInstances] + + - name: "instances_status_healthy_info" + help: "Cluster instances that are healthy" + each: + type: Info + info: + path: [status, 
instancesStatus, "healthy"] + labelsFromPath: + instance: [] + + - name: "instances_status_replicating_info" + help: "Cluster instances that are replicating" + each: + type: Info + info: + path: [status, instancesStatus, "replicating"] + labelsFromPath: + instance: [] + + - name: "instances_status_failed_info" + help: "Cluster instances that are failed" + each: + type: Info + info: + path: [status, instancesStatus, "failed"] + labelsFromPath: + instance: [] + + - name: "primary_info" + help: "Information about the current primary instance" + each: + type: Info + info: + labelsFromPath: + current_primary: [status, currentPrimary] + target_primary: [status, targetPrimary] + + - name: "primary_promotion_time" + help: "The timestamp when the last actual promotion to primary has occurred" + each: + type: Gauge + gauge: + path: [status, currentPrimaryTimestamp] + labelsFromPath: + primary: [status, currentPrimary] + + - name: "primary_failing_since_time" + help: "The timestamp when the primary was detected to be unhealthy This field is reported when .spec.failoverDelay is populated or during online upgrades" + each: + type: Gauge + gauge: + path: [status, currentPrimaryFailingSinceTimestamp] + labelsFromPath: + primary: [status, currentPrimary] + + - name: "first_recoverability_point" + help: "First recoverability point timestamp by backup method" + each: + type: Gauge + gauge: + path: [status, firstRecoverabilityPointByMethod] + labelFromKey: method + + - name: "last_successful_backup" + help: "Last successful backup timestamp by method" + each: + type: Gauge + gauge: + path: [status, lastSuccessfulBackupByMethod] + labelFromKey: method + + - name: "last_failed_backup" + help: "Last failed backup timestamp" + each: + type: Gauge + gauge: + path: [status, lastFailedBackup] + + - name: "dangling_pvc_info" + help: "List of all the PVCs created by this cluster and still available which are not attached to a Pod" + each: + type: Info + info: + path: [status, danglingPVC] + labelsFromPath: + pvc: [] + + - name: "resizing_pvc_info" + help: "List of all the PVCs that have ResizingPVC condition" + each: + type: Info + info: + path: [status, resizingPVC] + labelsFromPath: + pvc: [] + + - name: "initializing_pvc_info" + help: "List of all the PVCs that are being initialized by this cluster" + each: + type: Info + info: + path: [status, initializingPVC] + labelsFromPath: + pvc: [] + + - name: "healthy_pvc_info" + help: "List of all the PVCs not dangling nor initializing" + each: + type: Info + info: + path: [status, healthyPVC] + labelsFromPath: + pvc: [] + + - name: "unusable_pvc_info" + help: "List of all the PVCs that are unusable because another PVC is missing" + each: + type: Info + info: + path: [status, unusablePVC] + labelsFromPath: + pvc: [] + + - name: "conditions" + help: "Cluster conditions" + each: + type: Gauge + gauge: + path: [status, conditions] + labelsFromPath: + type: [type] + reason: [reason] + status: [status] + message: [message] + observed_generation: [observedGeneration] + valueFrom: [lastTransitionTime] + + - name: "plugin_status_info" + help: "Status of loaded plugins" + each: + type: Info + info: + path: [status, pluginStatus] + labelsFromPath: + name: [name] + version: [version] + capabilities: [capabilities] + operator_capabilities: [operatorCapabilities] + wal_capabilities: [walCapabilities] + backup_capabilities: [backupCapabilities] + restore_job_hook_capabilities: [restoreJobHookCapabilities] + status: [status] diff --git 
a/charts/cluster/prometheus_rules/cluster-high_replication_lag.yaml b/charts/cluster/prometheus_rules/cluster-high_replication_lag.yaml index 660db254f1..795d571be2 100644 --- a/charts/cluster/prometheus_rules/cluster-high_replication_lag.yaml +++ b/charts/cluster/prometheus_rules/cluster-high_replication_lag.yaml @@ -5,12 +5,12 @@ annotations: summary: CNPG Cluster high replication lag description: |- CloudNativePG Cluster "{{ .namespace }}/{{ .cluster }}" is experiencing a high replication lag of - {{ .value }}ms. + {{ .value }}s. High replication lag indicates network issues, busy instances, slow queries or suboptimal configuration. runbook_url: https://github.com/cloudnative-pg/charts/blob/main/charts/cluster/docs/runbooks/CNPGClusterHighReplicationLag.md expr: | - max(cnpg_pg_replication_lag{namespace="{{ .namespace }}",pod=~"{{ .podSelector }}"}) * 1000 > 1000 + max(cnpg_pg_replication_lag{namespace="{{ .namespace }}",pod=~"{{ .podSelector }}"}) > 1 for: 5m labels: severity: warning diff --git a/charts/cluster/prometheus_rules/cluster-logical_replication_errors-critical.yaml b/charts/cluster/prometheus_rules/cluster-logical_replication_errors-critical.yaml new file mode 100644 index 0000000000..93f469bf9c --- /dev/null +++ b/charts/cluster/prometheus_rules/cluster-logical_replication_errors-critical.yaml @@ -0,0 +1,18 @@ +{{- $alert := "CNPGClusterLogicalReplicationErrorsCritical" -}} +{{- if not (has $alert .excludeRules) -}} +alert: {{ $alert }} +annotations: + summary: CNPG Cluster critical logical replication errors + description: |- + CloudNativePG Cluster's "{{ .namespace }}/{{ .cluster }}" "{{ "{{ .subname }}" }}" subscription has experienced {{ .value }} errors in the last 5 minutes. + + CRITICAL: High error rate indicates persistent replication issues requiring immediate attention. This could lead to significant data inconsistency or complete replication failure. Errors include both apply errors and sync errors. The subscription may stop working if errors continue. + runbook_url: https://github.com/cloudnative-pg/charts/blob/main/charts/cluster/docs/runbooks/CNPGClusterLogicalReplicationErrors.md +expr: | + label_replace(increase(max by (namespace, job, subname) (cnpg_pg_stat_subscription_apply_error_count + cnpg_pg_stat_subscription_sync_error_count)[5m]), "cluster", "$1", "job", ".+/(.+)") >= 5 +for: 0m +labels: + severity: critical + namespace: {{ .namespace }} + cnpg_cluster: {{ .cluster }} +{{- end -}} diff --git a/charts/cluster/prometheus_rules/cluster-logical_replication_errors.yaml b/charts/cluster/prometheus_rules/cluster-logical_replication_errors.yaml new file mode 100644 index 0000000000..d8987dc83d --- /dev/null +++ b/charts/cluster/prometheus_rules/cluster-logical_replication_errors.yaml @@ -0,0 +1,18 @@ +{{- $alert := "CNPGClusterLogicalReplicationErrors" -}} +{{- if not (has $alert .excludeRules) -}} +alert: {{ $alert }} +annotations: + summary: CNPG Cluster logical replication errors detected + description: |- + CloudNativePG Cluster's "{{ .namespace }}/{{ .cluster }}" "{{ "{{ .subname }}" }}" subscription has experienced {{ .value }} errors. + + This includes both apply errors (during normal replication) and sync errors (during initial table sync). Errors indicate data consistency issues that need immediate attention to prevent data divergence. 
+ runbook_url: https://github.com/cloudnative-pg/charts/blob/main/charts/cluster/docs/runbooks/{{ $alert }}.md +expr: | + label_replace(increase(max by (namespace, job, subname) (cnpg_pg_stat_subscription_apply_error_count + cnpg_pg_stat_subscription_sync_error_count)[5m]), "cluster", "$1", "job", ".+/(.+)") > 0 +for: 1m +labels: + severity: warning + namespace: {{ .namespace }} + cnpg_cluster: {{ .cluster }} +{{- end -}} diff --git a/charts/cluster/prometheus_rules/cluster-logical_replication_lagging-critical.yaml b/charts/cluster/prometheus_rules/cluster-logical_replication_lagging-critical.yaml new file mode 100644 index 0000000000..d80c82627c --- /dev/null +++ b/charts/cluster/prometheus_rules/cluster-logical_replication_lagging-critical.yaml @@ -0,0 +1,32 @@ +{{- $alert := "CNPGClusterLogicalReplicationLaggingCritical" -}} +{{- if not (has $alert .excludeRules) -}} +alert: {{ $alert }} +annotations: + summary: CNPG Cluster critical logical replication lag + description: |- + CloudNativePG Cluster's "{{ .namespace }}/{{ .cluster }}" "{{ "{{ .subname }}" }}" subscription is experiencing critical replication lag! + + {{- if .labels.lag_type }} + Lag type: {{ .labels.lag_type }} + {{- end }} + Current lag: {{ .value }}s + + CRITICAL: The subscriber is significantly behind the publisher. Immediate action required. This could lead to significant data inconsistency, disk space exhaustion on publisher, or extended recovery time. + runbook_url: https://github.com/cloudnative-pg/charts/blob/main/charts/cluster/docs/runbooks/CNPGClusterLogicalReplicationLagging.md +expr: | + ( + # Receipt lag - not receiving WAL data + label_replace(max by (namespace, job, subname) (cnpg_pg_stat_subscription_receipt_lag_seconds), "cluster", "$1", "job", ".+/(.+)") > 300 + ) or ( + # Apply lag - not applying received data + label_replace(max by (namespace, job, subname) (cnpg_pg_stat_subscription_apply_lag_seconds), "cluster", "$1", "job", ".+/(.+)") > 300 + ) or ( + # LSN distance - large amount of pending data + label_replace(max by (namespace, job, subname) (cnpg_pg_stat_subscription_buffered_lag_bytes), "cluster", "$1", "job", ".+/(.+)") / 1024^3 > 4 + ) +for: 2m +labels: + severity: critical + namespace: {{ .namespace }} + cnpg_cluster: {{ .cluster }} +{{- end -}} diff --git a/charts/cluster/prometheus_rules/cluster-logical_replication_lagging.yaml b/charts/cluster/prometheus_rules/cluster-logical_replication_lagging.yaml new file mode 100644 index 0000000000..6994837cde --- /dev/null +++ b/charts/cluster/prometheus_rules/cluster-logical_replication_lagging.yaml @@ -0,0 +1,32 @@ +{{- $alert := "CNPGClusterLogicalReplicationLagging" -}} +{{- if not (has $alert .excludeRules) -}} +alert: {{ $alert }} +annotations: + summary: CNPG Cluster logical replication lagging + description: |- + CloudNativePG Cluster's "{{ .namespace }}/{{ .cluster }}" "{{ "{{ .subname }}" }}" subscription is experiencing replication lag. + + {{- if .labels.lag_type }} + Lag type: {{ .labels.lag_type }} + {{- end }} + Current lag: {{ .value }}s + + This alert indicates the subscriber is falling behind the publisher. The lag could be receipt lag (not receiving WAL data fast enough due to network issues), apply lag (receiving data but not applying it fast enough due to resource contention), or LSN distance (large amount of pending data to be applied). 
+  runbook_url: https://github.com/cloudnative-pg/charts/blob/main/charts/cluster/docs/runbooks/{{ $alert }}.md
+expr: |
+  (
+    # Receipt lag - time since last message received
+    label_replace(max by (namespace, job, subname) (cnpg_pg_stat_subscription_receipt_lag_seconds), "cluster", "$1", "job", ".+/(.+)") > 60
+  ) or (
+    # Apply lag - time delay in applying changes
+    label_replace(max by (namespace, job, subname) (cnpg_pg_stat_subscription_apply_lag_seconds), "cluster", "$1", "job", ".+/(.+)") > 60
+  ) or (
+    # LSN distance - bytes pending to be applied
+    label_replace(max by (namespace, job, subname) (cnpg_pg_stat_subscription_buffered_lag_bytes), "cluster", "$1", "job", ".+/(.+)") / 1024^3 > 1
+  )
+for: 5m
+labels:
+  severity: warning
+  namespace: {{ .namespace }}
+  cnpg_cluster: {{ .cluster }}
+{{- end -}}
diff --git a/charts/cluster/prometheus_rules/cluster-logical_replication_stopped-critical.yaml b/charts/cluster/prometheus_rules/cluster-logical_replication_stopped-critical.yaml
new file mode 100644
index 0000000000..0e2dd72418
--- /dev/null
+++ b/charts/cluster/prometheus_rules/cluster-logical_replication_stopped-critical.yaml
@@ -0,0 +1,29 @@
+{{- $alert := "CNPGClusterLogicalReplicationStoppedCritical" -}}
+{{- if not (has $alert .excludeRules) -}}
+alert: {{ $alert }}
+annotations:
+  summary: CNPG Cluster logical replication subscription CRITICAL
+  description: |-
+    CloudNativePG Cluster's "{{ .namespace }}/{{ .cluster }}" "{{ "{{ .subname }}" }}" subscription is in a critical state.
+
+    CRITICAL: The subscription has been stopped for more than 15 minutes. This will lead to significant data divergence and requires immediate intervention.
+
+    Status: {{ .labels.stop_reason }}
+    Duration: {{ .value }}s
+  runbook_url: https://github.com/cloudnative-pg/charts/blob/main/charts/cluster/docs/runbooks/CNPGClusterLogicalReplicationStopped.md
+expr: |
+  (
+    # Subscription is explicitly disabled
+    label_replace(max by (namespace, job, subname) (cnpg_pg_stat_subscription_enabled), "cluster", "$1", "job", ".+/(.+)") == 0
+  ) or (
+    # Subscription is enabled but stuck: no worker process reporting a PID while significant data is pending
+    label_replace(max by (namespace, job, subname) (cnpg_pg_stat_subscription_enabled), "cluster", "$1", "job", ".+/(.+)") == 1
+    unless label_replace(max by (namespace, job, subname) (cnpg_pg_stat_subscription_pid), "cluster", "$1", "job", ".+/(.+)") > 0
+    and label_replace(max by (namespace, job, subname) (cnpg_pg_stat_subscription_buffered_lag_bytes), "cluster", "$1", "job", ".+/(.+)") / 1024^3 > 0.1
+  )
+for: 15m
+labels:
+  severity: critical
+  namespace: {{ .namespace }}
+  cnpg_cluster: {{ .cluster }}
+{{- end -}}
diff --git a/charts/cluster/prometheus_rules/cluster-logical_replication_stopped.yaml b/charts/cluster/prometheus_rules/cluster-logical_replication_stopped.yaml
new file mode 100644
index 0000000000..8865ae05de
--- /dev/null
+++ b/charts/cluster/prometheus_rules/cluster-logical_replication_stopped.yaml
@@ -0,0 +1,28 @@
+{{- $alert := "CNPGClusterLogicalReplicationStopped" -}}
+{{- if not (has $alert .excludeRules) -}}
+alert: {{ $alert }}
+annotations:
+  summary: CNPG Cluster logical replication subscription stopped
+  description: |-
+    CloudNativePG Cluster's "{{ .namespace }}/{{ .cluster }}" "{{ "{{ .subname }}" }}" subscription is stopped.
+
+    Status: {{ .labels.stop_reason }}
+
+    The subscription is not actively replicating data. This could be intentional (disabled) or due to an issue preventing the subscription from working.
+  runbook_url: https://github.com/cloudnative-pg/charts/blob/main/charts/cluster/docs/runbooks/{{ $alert }}.md
+expr: |
+  (
+    # Subscription is explicitly disabled
+    label_replace(max by (namespace, job, subname) (cnpg_pg_stat_subscription_enabled), "cluster", "$1", "job", ".+/(.+)") == 0
+  ) or (
+    # Subscription is enabled but stuck: no worker process reporting a PID while significant data is pending
+    label_replace(max by (namespace, job, subname) (cnpg_pg_stat_subscription_enabled), "cluster", "$1", "job", ".+/(.+)") == 1
+    unless label_replace(max by (namespace, job, subname) (cnpg_pg_stat_subscription_pid), "cluster", "$1", "job", ".+/(.+)") > 0
+    and label_replace(max by (namespace, job, subname) (cnpg_pg_stat_subscription_buffered_lag_bytes), "cluster", "$1", "job", ".+/(.+)") / 1024^3 > 0.1
+  )
+for: 5m
+labels:
+  severity: warning
+  namespace: {{ .namespace }}
+  cnpg_cluster: {{ .cluster }}
+{{- end -}}
diff --git a/charts/cluster/prometheus_rules/cluster-physical_replication_lag-critical.yaml b/charts/cluster/prometheus_rules/cluster-physical_replication_lag-critical.yaml
new file mode 100644
index 0000000000..4bcb58a6f5
--- /dev/null
+++ b/charts/cluster/prometheus_rules/cluster-physical_replication_lag-critical.yaml
@@ -0,0 +1,18 @@
+{{- $alert := "CNPGClusterPhysicalReplicationLagCritical" -}}
+{{- if not (has $alert .excludeRules) -}}
+alert: {{ $alert }}
+annotations:
+  summary: CNPG Cluster very high physical replication lag
+  description: |-
+    CloudNativePG Cluster "{{ .namespace }}/{{ .cluster }}" is experiencing a very high physical replication lag of {{ .value }}s.
+
+    High replication lag indicates network issues, busy instances, slow queries or suboptimal configuration.
+  runbook_url: https://github.com/cloudnative-pg/charts/blob/main/charts/cluster/docs/runbooks/CNPGClusterPhysicalReplicationLag.md
+expr: |
+  max(cnpg_pg_replication_lag{namespace="{{ .namespace }}",pod=~"{{ .podSelector }}"}) > 15
+for: 5m
+labels:
+  severity: critical
+  namespace: {{ .namespace }}
+  cnpg_cluster: {{ .cluster }}
+{{- end -}}
diff --git a/charts/cluster/prometheus_rules/cluster-physical_replication_lag-warning.yaml b/charts/cluster/prometheus_rules/cluster-physical_replication_lag-warning.yaml
new file mode 100644
index 0000000000..a6ee90bd35
--- /dev/null
+++ b/charts/cluster/prometheus_rules/cluster-physical_replication_lag-warning.yaml
@@ -0,0 +1,18 @@
+{{- $alert := "CNPGClusterPhysicalReplicationLagWarning" -}}
+{{- if not (has $alert .excludeRules) -}}
+alert: {{ $alert }}
+annotations:
+  summary: CNPG Cluster high physical replication lag
+  description: |-
+    CloudNativePG Cluster "{{ .namespace }}/{{ .cluster }}" is experiencing a high physical replication lag of {{ .value }}s.
+
+    High replication lag indicates network issues, busy instances, slow queries or suboptimal configuration.
+ runbook_url: https://github.com/cloudnative-pg/charts/blob/main/charts/cluster/docs/runbooks/CNPGClusterPhysicalReplicationLag.md +expr: | + max(cnpg_pg_replication_lag{namespace="{{ .namespace }}",pod=~"{{ .podSelector }}"}) > 1 +for: 5m +labels: + severity: warning + namespace: {{ .namespace }} + cnpg_cluster: {{ .cluster }} +{{- end -}} diff --git a/charts/cluster/templates/monitoring-logical-replication.yaml b/charts/cluster/templates/monitoring-logical-replication.yaml new file mode 100644 index 0000000000..cb085ae132 --- /dev/null +++ b/charts/cluster/templates/monitoring-logical-replication.yaml @@ -0,0 +1,103 @@ +{{- if .Values.cluster.monitoring.instrumentation.logicalReplication }} +apiVersion: v1 +kind: ConfigMap +metadata: + name: {{ include "cluster.fullname" . }}-monitoring-logical-replication + namespace: {{ include "cluster.namespace" . }} + labels: + cnpg.io/reload: "" + {{- include "cluster.labels" . | nindent 4 }} +data: + custom-queries: | + pg_stat_subscription: + query: | + SELECT current_database() AS datname + , s.oid AS subid + , s.subname AS subname + , subenabled AS enabled + , worker_type + , relid + , pg_wal_lsn_diff(received_lsn, '0/0') AS received_lsn + , last_msg_send_time + , last_msg_receipt_time + , pg_wal_lsn_diff(latest_end_lsn, '0/0') AS latest_end_lsn + , latest_end_time + , apply_error_count + , sync_error_count + , stats_reset + , ss.pid + , CASE + WHEN received_lsn IS NOT NULL AND latest_end_lsn IS NOT NULL + THEN GREATEST(0, pg_wal_lsn_diff(received_lsn, latest_end_lsn)) + ELSE NULL + END AS buffered_lag_bytes + , CASE + WHEN last_msg_receipt_time IS NOT NULL + THEN EXTRACT(EPOCH FROM (NOW() - last_msg_receipt_time)) + ELSE NULL + END AS receipt_lag_seconds + , CASE + WHEN latest_end_time IS NOT NULL + THEN EXTRACT(EPOCH FROM (NOW() - latest_end_time)) + ELSE NULL + END AS apply_lag_seconds + FROM pg_subscription s + LEFT JOIN pg_stat_subscription ss ON s.oid = ss.subid + LEFT JOIN pg_stat_subscription_stats sss ON s.oid = sss.subid; + target_databases: ["*"] + metrics: + - datname: + description: Name of the database + usage: LABEL + - subid: + description: ID of the subscription + usage: LABEL + - subname: + description: Name of the subscription + usage: LABEL + - worker_type: + description: Type of the worker + usage: LABEL + - relid: + description: OID of the relation + usage: LABEL + - received_lsn: + description: Last written LSN received from the publisher + usage: GAUGE + - last_msg_send_time: + description: Timestamp of the last message sent + usage: GAUGE + - last_msg_receipt_time: + description: Timestamp of the last message receipt + usage: GAUGE + - latest_end_lsn: + description: Latest end LSN received + usage: GAUGE + - latest_end_time: + description: Timestamp of the latest end LSN processed + usage: GAUGE + - enabled: + description: Subscription status (enabled/disabled) + usage: GAUGE + - apply_error_count: + description: Number of times an error occurred while applying changes + usage: GAUGE + - sync_error_count: + description: Number of times an error occurred during the initial table synchronization + usage: GAUGE + - stats_reset: + description: Time at which these statistics were last reset + usage: GAUGE + - pid: + description: Process ID of the subscription worker process + usage: GAUGE + - buffered_lag_bytes: + description: Bytes buffered but not yet applied (received_lsn - latest_end_lsn) + usage: GAUGE + - receipt_lag_seconds: + description: Seconds since last message receipt + usage: GAUGE + - apply_lag_seconds: + 
description: Seconds since last apply operation + usage: GAUGE +{{- end }} diff --git a/charts/cluster/values.schema.json b/charts/cluster/values.schema.json index 1edcd45982..eda57fe67b 100644 --- a/charts/cluster/values.schema.json +++ b/charts/cluster/values.schema.json @@ -235,6 +235,14 @@ "enabled": { "type": "boolean" }, + "instrumentation": { + "type": "object", + "properties": { + "logicalReplication": { + "type": "boolean" + } + } + }, "podMonitor": { "type": "object", "properties": { diff --git a/charts/cluster/values.yaml b/charts/cluster/values.yaml index 9c5ca0202e..964ef2934e 100644 --- a/charts/cluster/values.yaml +++ b/charts/cluster/values.yaml @@ -321,6 +321,10 @@ cluster: # -- Exclude specified rules excludeRules: [] # - CNPGClusterZoneSpreadWarning + # Additional instrumentation via custom metrics + instrumentation: + # -- Enable logical replication metrics + logicalReplication: true # -- Whether the default queries should be injected. # Set it to true if you don't want to inject default queries into the cluster. disableDefaultQueries: false