rules:
  - metadata:
      kind: prequel
      id: QsYzSA81AJSgnVqaQt4XGS
      version: "0.1.0"
    cre:
      id: CRE-2025-0082
      severity: 1
      title: "NATS JetStream HA failures: monitor goroutine, consumer stalls, and unsynced replicas"
      category: "message-queue-problem"
      author: Prequel
      description: |
        Detects high-availability failures in NATS JetStream clusters caused by:

        1. **Monitor goroutine failure** — after a node restarts, the Raft group fails to elect a leader
        2. **Consumer deadlock** — using DeliverPolicy=LastPerSubject + AckPolicy=Explicit with a low MaxAckPending
        3. **Unsynced replicas** — object store replication appears healthy, but data is lost or inconsistent between nodes

        These issues lead to invisible data loss, stalled consumers, or stream unavailability.
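      # Illustrative only: a minimal Go sketch of the scenario 2 consumer
      # settings using the nats.go JetStream API; the stream and durable names
      # here are hypothetical:
      #
      #   js, _ := nc.JetStream()
      #   js.AddConsumer("ORDERS", &nats.ConsumerConfig{
      #       Durable:       "workers",
      #       DeliverPolicy: nats.DeliverLastPerSubjectPolicy, // last message per subject
      #       AckPolicy:     nats.AckExplicitPolicy,           // every delivery must be ACKed
      #       MaxAckPending: 10,                               // low values like this risk the stall
      #   })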
      impact: |
        - **Scenario 1**: The stream becomes unusable (publishes/reads fail) because there is no Raft leader
        - **Scenario 2**: The consumer stalls with `context deadline exceeded`, and ACKs no longer advance the ack floor
        - **Scenario 3**: Object store data loss occurs silently across restarts despite healthy status
        All scenarios disrupt the reliability of JetStream-based systems and violate consistency expectations.
      cause: |
        - [Monitor failure]: The JetStream monitor goroutine did not start after a server restart
        - [Consumer stall]: ACK/sequence tracking becomes inconsistent under `LastPerSubject + Explicit ACK + low MaxAckPending`
        - [Replica drift]: Raft replicas fall out of sync silently (especially during a cold restart or recovery), leaving object store contents inconsistent between nodes
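      # Illustrative only: when embedding nats-server in Go, enable JetStream in
      # the server options before Start/ReadyForConnections rather than turning
      # it on after the server is running; the store dir and timeout here are
      # hypothetical:
      #
      #   opts := &server.Options{JetStream: true, StoreDir: "/data/jetstream"}
      #   ns, err := server.NewServer(opts)
      #   if err != nil { log.Fatal(err) }
      #   ns.Start()
      #   if !ns.ReadyForConnections(10 * time.Second) {
      #       log.Fatal("server not ready")
      #   }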
      mitigation: |
        - Always enable JetStream before calling ReadyForConnections
        - Use ProcessConfigString instead of enabling JetStream on the fly
        - Avoid MaxAckPending < 100 with DeliverPolicy=LastPerSubject
        - Run `nats stream-check --unsynced` checks regularly
        - To recover the object store:
          - Scale the stream down to replicas=1 and back up
          - Or remove the faulty replica via `nats stream cluster ... peer-remove`
        - Monitor for raftz and jsz inconsistencies in tooling
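      # Illustrative only: one possible recovery sequence for an unsynced object
      # store replica, following the mitigation steps above. The bucket-backing
      # stream name "OBJ_files" and the peer name are hypothetical, and flags
      # may differ between natscli versions:
      #
      #   nats stream edit OBJ_files --replicas 1    # collapse to a single replica
      #   nats stream edit OBJ_files --replicas 3    # scale back out to force a resync
      #   # or drop the faulty peer so it gets rebuilt:
      #   nats stream cluster peer-remove OBJ_files <peer-name>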
      mitigationScore: 8
      references:
        - "https://github.com/nats-io/nats-server/issues/6890"
        - "https://github.com/nats-io/nats-server/issues/6921"
        - "https://github.com/nats-io/nats-server/issues/6929"
      reports: 3
      version: "0.1.0"
      tags:
        - nats
        - jetstream
        - raft
        - ack-deadlock
        - unsynced-replica
      applications:
        - name: nats-server
          version: ">=2.11.3"
    rule:
      set:
        event:
          source: cre.log.nats
        match:
          - regex: "monitor goroutine not running|Fetch error: context deadline exceeded|UNSYNCED"
        negate:
          - "server shutdown"
          - "shutting down"
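    # Illustrative only: hypothetical log lines (not verbatim nats-server
    # output) showing what this matcher would and would not fire on:
    #
    #   fires:    JetStream cluster stream '$G > ORDERS' monitor goroutine not running
    #   fires:    Fetch error: context deadline exceeded
    #   fires:    Replica 'n2' reported UNSYNCED
    #   negated:  UNSYNCED state observed during server shutdown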