
Commit 0424517

Add Postgres(self-hosted) critical HA upstream failure detection rule (prequel-dev#51)
* add cre-2024-0077
* add test.log
* fix format issues
* update categories
* update tags for consistency with existing ones
* Update categories.yaml
* update tags
* Update postgres-self-hosted.yaml
1 parent 42b0216 commit 0424517

4 files changed: +148 −0 lines changed
Lines changed: 63 additions & 0 deletions
@@ -0,0 +1,63 @@
rules:
  - metadata:
      kind: prequel
      id: 5UD1RZxGC5LJQnVmAkV11A
      gen: 1
    cre:
      id: CRE-2025-0072
      severity: 1
      title: "Self-hosted PostgreSQL HA: WAL Streaming & HA Controller Crisis (Replication Slot Loss, Disk Full, Etcd Quorum Failure)"
      category: "postgres-ha"
      author: Prequel
      description: |
        Detects high-severity failures in self-hosted PostgreSQL high-availability clusters managed by Patroni, Zalando, or similar HA controllers.
        This rule targets catastrophic conditions that break replication or cluster consensus:
        - WAL streaming failures due to missing replication slots (usually after disk full or crash events)
        - Persistent errors resolving HA controller endpoints (etcd/consul) and loss of HA controller quorum
        - Disk saturation leading to WAL write errors and replication breakage
      cause: |
        - Replication slot(s) "patroniN" missing or cannot be created due to disk full or corruption
        - PostgreSQL unable to stream WAL (Write-Ahead Log) to replicas, causing FATAL errors
        - HA controller (etcd/consul) DNS/name resolution failures or full cluster outage (quorum lost)
        - Disk full on primary prevents WAL writes or checkpointing
      tags:
        - ha
        - patroni
        - zalando
        - etcd
        - replication
        - wal
        - storage
        - quorum
        - crash
        - data-loss
        - timeout
      mitigation: |
        PREVENTION:
        - Monitor disk usage on all PostgreSQL nodes, especially WAL and archive directories
        - Set up alerting for replication lag and missing replication slots
        - Ensure HA controllers (etcd/consul) are running on redundant, reliable nodes
        RESPONSE:
        - Restore or recreate missing replication slots
        - Free up disk space and restart affected PostgreSQL instances
        - Restore etcd/consul cluster quorum; check container/network status
        - Perform manual failover if automatic recovery fails
      references:
        - https://patroni.readthedocs.io/en/latest/
        - https://www.postgresql.org/docs/current/warm-standby.html
        - https://etcd.io/docs/latest/op-guide/clustering/
      applications:
        - name: postgresql
      impact: |
        - Replication breakage; secondary/standby nodes cannot receive WAL
        - Potential for split-brain, data loss, or full cluster outage
        - Cluster may lose HA/failover capability; clients disconnected
      impactScore: 10
      mitigationScore: 6
      reports: 2
    rule:
      set:
        event:
          source: cre.log.postgresql
        match:
          - regex: 'FATAL.*could not start WAL streaming: (replication slot|ERROR: replication slot) "patroni[0-9]+" does not exist|ERROR.*replication slot "patroni[0-9]+" does not exist|ERROR.*dd: error writing.*No space left on device|failed to resolve host etcd[0-9]: \[Errno -3\] Temporary failure in name resolution|Failed to get list of machines from http://etcd[0-9]:2379/v3beta: MaxRetryError|etcd\.EtcdConnectionFailed: No more machines in the cluster|Request to server http://[0-9.]+:2379 failed: (ReadTimeoutError|MaxRetryError)|watchprefix failed: ProtocolError.*InvalidChunkLength'
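
Note: as a quick sanity check (separate from how the Prequel rule engine actually evaluates this rule), the match.regex above can be exercised against a few lines from the test.log added in this commit. The minimal Python sketch below copies the pattern and sample log lines verbatim from the diff; the final NOTICE line is included as a control that should not match.

import re

# The rule's match.regex, copied verbatim and split across raw-string literals for readability.
PATTERN = re.compile(
    r'FATAL.*could not start WAL streaming: (replication slot|ERROR: replication slot) "patroni[0-9]+" does not exist'
    r'|ERROR.*replication slot "patroni[0-9]+" does not exist'
    r'|ERROR.*dd: error writing.*No space left on device'
    r'|failed to resolve host etcd[0-9]: \[Errno -3\] Temporary failure in name resolution'
    r'|Failed to get list of machines from http://etcd[0-9]:2379/v3beta: MaxRetryError'
    r'|etcd\.EtcdConnectionFailed: No more machines in the cluster'
    r'|Request to server http://[0-9.]+:2379 failed: (ReadTimeoutError|MaxRetryError)'
    r'|watchprefix failed: ProtocolError.*InvalidChunkLength'
)

# Sample lines taken from test.log; the last (NOTICE) line is expected not to match.
samples = [
    '2025-06-03T00:40:29Z FATAL postgres could not start WAL streaming: replication slot "patroni1" does not exist',
    '2025-06-03T00:50:00Z WARNING etcd failed to resolve host etcd2: [Errno -3] Temporary failure in name resolution',
    '2025-06-03T00:50:12Z ERROR etcd Request to server http://192.168.80.6:2379 failed: ReadTimeoutError',
    '2025-06-03T00:15:00Z NOTICE postgres relation "big_data" already exists, skipping',
]

for line in samples:
    print('match' if PATTERN.search(line) else 'no match', '|', line)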

rules/rules/cre-2025-0077/test.log

Lines changed: 61 additions & 0 deletions
@@ -0,0 +1,61 @@
2025-06-03T00:40:29Z FATAL postgres could not start WAL streaming: replication slot "patroni1" does not exist
2025-06-03T00:40:29Z FATAL postgres could not start WAL streaming: replication slot "patroni1" does not exist
2025-06-03T00:40:34Z FATAL postgres could not start WAL streaming: replication slot "patroni1" does not exist
2025-06-03T00:40:29Z FATAL postgres could not start WAL streaming: replication slot "patroni3" does not exist
2025-06-03T00:40:29Z ERROR postgres dd: error writing '/var/lib/postgresql/data/bigfile': No space left on device
2025-06-03T00:40:29Z FATAL postgres could not start WAL streaming: replication slot "patroni3" does not exist
2025-06-03T00:40:34Z FATAL postgres could not start WAL streaming: replication slot "patroni3" does not exist
2025-06-03T00:40:29Z ERROR postgres replication slot "patroni3" does not exist
2025-06-03T00:40:29Z ERROR postgres replication slot "patroni1" does not exist
2025-06-03T00:40:29Z ERROR postgres replication slot "patroni1" does not exist
2025-06-03T00:40:34Z ERROR postgres replication slot "patroni3" does not exist
2025-06-03T00:40:34Z ERROR postgres replication slot "patroni1" does not exist

2025-06-03T00:50:00Z WARNING etcd failed to resolve host etcd2: [Errno -3] Temporary failure in name resolution
2025-06-03T00:50:00Z ERROR etcd Failed to get list of machines from http://etcd2:2379/v3beta: MaxRetryError("Connection refused")
2025-06-03T00:50:08Z WARNING etcd failed to resolve host etcd1: [Errno -3] Temporary failure in name resolution
2025-06-03T00:50:08Z ERROR etcd Failed to get list of machines from http://etcd1:2379/v3beta: MaxRetryError("Connection refused")
2025-06-03T00:50:12Z ERROR etcd Request to server http://192.168.80.6:2379 failed: ReadTimeoutError
2025-06-03T00:50:13Z ERROR etcd Request to server http://192.168.80.8:2379 failed: MaxRetryError("ConnectTimeout")
2025-06-03T00:50:15Z ERROR etcd Request to server http://192.168.80.7:2379 failed: MaxRetryError("ConnectTimeout")
2025-06-03T00:50:23Z WARNING etcd failed to resolve host etcd1: [Errno -3] Temporary failure in name resolution
2025-06-03T00:50:23Z ERROR etcd Failed to get list of machines from http://etcd1:2379/v3beta: MaxRetryError("Connection refused")

2025-06-02T18:28:52Z ERROR etcd watchprefix failed: <FailedPrecondition error: "grpc: the client connection is closing">
2025-06-02T18:28:52Z ERROR etcd Request to server http://172.23.0.4:2379 failed: MaxRetryError("Connection refused")
2025-06-02T18:28:52Z INFO patroni Reconnection allowed, trying another etcd node
2025-06-02T18:28:52Z INFO patroni Retrying on http://172.23.0.5:2379
2025-06-02T18:28:52Z INFO patroni Selected new etcd server http://172.23.0.5:2379
2025-06-02T18:28:52Z ERROR etcd Failed to get list of machines from http://172.23.0.4:2379/v3beta: MaxRetryError("Connection refused")
2025-06-02T18:28:52Z ERROR etcd Failed to get list of machines from http://172.23.0.3:2379/v3beta: MaxRetryError("Connection refused")
2025-06-02T18:28:52Z WARNING etcd Connected to Etcd node with term 2. Old known term 3. Switching again.
2025-06-02T18:28:52Z ERROR etcd Request to server http://172.23.0.5:2379 failed: StaleEtcdNode()
2025-06-02T18:28:52Z INFO patroni Reconnection allowed, trying yet another etcd node
2025-06-02T18:28:52Z ERROR etcd Failed to get list of machines from http://172.23.0.4:2379/v3beta: MaxRetryError("Connection refused")
2025-06-02T18:28:52Z ERROR etcd Failed to get list of machines from http://etcd2:2379/v3beta: MaxRetryError("Connection refused")
2025-06-02T18:28:52Z ERROR etcd Failed to get list of machines from http://172.23.0.3:2379/v3beta: MaxRetryError("Connection refused")
2025-06-02T18:28:54Z ERROR etcd Request to server http://172.23.0.4:2379 failed: MaxRetryError("ConnectTimeout")
2025-06-02T18:28:56Z ERROR etcd Request to server http://172.23.0.3:2379 failed: MaxRetryError("ConnectTimeout")
2025-06-02T18:28:57Z ERROR etcd Failed to get list of machines from http://172.23.0.4:2379/v3beta: MaxRetryError("ConnectTimeout")
2025-06-02T18:28:59Z ERROR etcd Failed to get list of machines from http://172.23.0.3:2379/v3beta: MaxRetryError("ConnectTimeout")
2025-06-02T18:29:02Z ERROR etcd Request to server http://172.23.0.5:2379 failed: ReadTimeoutError
2025-06-02T18:29:02Z INFO patroni Reconnection allowed, trying another etcd node
2025-06-02T18:29:02Z INFO patroni Retrying on http://172.23.0.3:2379

2025-06-02T14:47:18Z FATAL postgres could not start WAL streaming: replication slot "patroni1" does not exist
2025-06-02T14:47:18Z FATAL postgres could not start WAL streaming: replication slot "patroni1" does not exist
2025-06-02T14:47:23Z FATAL postgres could not start WAL streaming: replication slot "patroni1" does not exist
2025-06-02T14:47:19Z FATAL postgres could not start WAL streaming: replication slot "patroni2" does not exist
2025-06-02T14:47:19Z FATAL postgres could not start WAL streaming: replication slot "patroni2" does not exist
2025-06-02T14:47:24Z FATAL postgres could not start WAL streaming: replication slot "patroni2" does not exist
2025-06-02T14:47:18Z ERROR postgres replication slot "patroni1" does not exist
2025-06-02T14:47:18Z ERROR postgres replication slot "patroni1" does not exist
2025-06-02T14:47:19Z ERROR postgres replication slot "patroni2" does not exist
2025-06-02T14:47:19Z ERROR postgres replication slot "patroni2" does not exist
2025-06-02T14:47:23Z ERROR postgres replication slot "patroni1" does not exist
2025-06-02T14:47:24Z ERROR postgres replication slot "patroni2" does not exist

2025-06-03T00:15:00Z NOTICE postgres relation "big_data" already exists, skipping
2025-06-03T00:15:04Z INFO patroni Paused replica demo-patroni2
2025-06-03T00:15:14Z INFO patroni Waiting 10s for WAL to build up on leader
2025-06-03T00:15:15Z INFO patroni Resumed replica demo-patroni2

rules/tags/categories.yaml

Lines changed: 3 additions & 0 deletions
@@ -129,6 +129,9 @@ categories:
 - name: in-memory-database-problem
   displayName: In-Memory Database Problems
   description: Problems specific to in-memory data stores (e.g. Redis, Memcached)
+- name: postgres-ha
+  displayName: PostgreSQL High Availability
+  description: High-severity problems related to PostgreSQL in high-availability (HA) clusters, including replication, failover, WAL streaming, and HA controller outages.
 - name: kubernetes-storage-problems
   displayName: Kubernetes Storage Problems
   description: Problems related to container storage in Kubernetes

rules/tags/tags.yaml

Lines changed: 21 additions & 0 deletions
@@ -507,6 +507,27 @@ tags:
 - name: cluster-degradation
   displayName: Cluster Degradation
   description: Problems related to cluster availability
+- name: etcd
+  displayName: Etcd
+  description: Issues involving etcd clusters or consensus, especially in HA setups.
+- name: patroni
+  displayName: Patroni
+  description: Issues related to Patroni high-availability controller for PostgreSQL.
+- name: zalando
+  displayName: Zalando
+  description: Issues related to the Zalando Postgres Operator for HA Postgres.
+- name: ha
+  displayName: High Availability
+  description: Problems or incidents involving high-availability clusters, failover, or consensus.
+- name: replication
+  displayName: Replication
+  description: Replication failures, lag, or divergence in stateful systems.
+- name: wal
+  displayName: WAL
+  description: Issues with Write-Ahead Logging in databases.
+- name: quorum
+  displayName: Quorum
+  description: Loss or degradation of cluster quorum in distributed systems.
 - name: load-balancer-problem
   displayName: Load Balancer Problem
   description: Problems related to load balancers, such as misrouting, unhealthy backends, or configuration faults
