Commit e0cf716

Add NATS critical upstream failures detection rules (#86)
* feat: add NATS Jetstream CRE
* fix format
* add negate conditions
* Update nats-jetstream-ha.yaml
* Update tags.yaml
1 parent 95fc5b5 commit e0cf716

4 files changed: +81 -1 lines changed

Lines changed: 62 additions & 0 deletions
@@ -0,0 +1,62 @@
+rules:
+  - metadata:
+      kind: prequel
+      id: QsYzSA81AJSgnVqaQt4XGS
+      version: "0.1.0"
+    cre:
+      id: CRE-2025-0082
+      severity: 1
+      title: "NATS JetStream HA failures: monitor goroutine, consumer stalls and unsynced replicas"
+      category: "message-queue-problem"
+      author: Prequel
+      description: |
+        Detects high-availability failures in NATS JetStream clusters due to:
+
+        1. **Monitor goroutine failure** — after node restarts, Raft group fails to elect a leader
+        2. **Consumer deadlock** — using DeliverPolicy=LastPerSubject + AckPolicy=Explicit with low MaxAckPending
+        3. **Unsynced replicas** — object store replication appears healthy but data is lost or inconsistent between nodes
+
+        These issues lead to invisible data loss, stalled consumers, or stream unavailability.
+      impact: |
+        - **Scenario 1**: Stream becomes unusable (publishes/read fail) due to no Raft leader
+        - **Scenario 2**: Consumer stalls with `context deadline exceeded`, ACKs no longer move floor
+        - **Scenario 3**: Object Store data loss occurs silently across restarts despite healthy status
+        All scenarios disrupt reliability of JetStream-based systems and violate consistency expectations.
+      cause: |
+        - [Monitor failure]: JetStream monitor goroutine did not start after server restart
+        - [Consumer stall]: ACK/sequence tracking inconsistency under `LastPerSubject + Explicit ACK + low MaxAckPending`
+        - [Replica drift]: Raft replicas fall out of sync silently (especially during cold restart or recovery), leading to inconsistent object store contents
+      mitigation: |
+        - Always enable JetStream before ReadyForConnections
+        - Use ProcessConfigString instead of on-the-fly JS enablement
+        - Avoid MaxAckPending < 100 with DeliverPolicy=LastPerSubject
+        - Run regular `nats stream-check --unsynced` checks
+        - To recover object store:
+          - Scale stream to replicas=1 and back
+          - Or remove faulty replica via `nats stream cluster ... peer-remove`
+        - Monitor for raftz and jsz inconsistencies in tooling
+      mitigationScore: 8
+      references:
+        - "https://github.com/nats-io/nats-server/issues/6890"
+        - "https://github.com/nats-io/nats-server/issues/6921"
+        - "https://github.com/nats-io/nats-server/issues/6929"
+      reports: 3
+      version: "0.1.0"
+      tags:
+        - nats
+        - jetstream
+        - raft
+        - ack-deadlock
+        - unsynced-replica
+      applications:
+        - name: nats-server
+          version: ">=2.11.3"
+    rule:
+      set:
+        event:
+          source: cre.log.nats
+        match:
+          - regex: "monitor goroutine not running|Fetch error: context deadline exceeded|UNSYNCED"
+        negate:
+          - "server shutdown"
+          - "shutting down"
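
The consumer-deadlock mitigation in the rule above (avoid MaxAckPending < 100 with DeliverPolicy=LastPerSubject and explicit acks) could be checked before deployment with a small lint. The following is only a sketch: the function name `is_risky_consumer` is hypothetical, the config is assumed to be a plain dict using JetStream's JSON field names (`deliver_policy`, `ack_policy`, `max_ack_pending`), and a missing `max_ack_pending` is treated as not risky because the server default is not stated in this commit.

```python
# Threshold taken from the rule's mitigation text ("Avoid MaxAckPending < 100").
MIN_SAFE_MAX_ACK_PENDING = 100


def is_risky_consumer(config: dict) -> bool:
    """Flag the combination the rule describes as deadlock-prone.

    Assumption: `config` is a dict shaped like a JetStream consumer config
    in JSON form. This helper is illustrative and not part of any NATS tool.
    """
    max_ack_pending = config.get("max_ack_pending")
    if max_ack_pending is None:
        # Unknown value: do not guess the server default.
        return False
    return (
        config.get("deliver_policy") == "last_per_subject"
        and config.get("ack_policy") == "explicit"
        and max_ack_pending < MIN_SAFE_MAX_ACK_PENDING
    )


example = {
    "deliver_policy": "last_per_subject",
    "ack_policy": "explicit",
    "max_ack_pending": 10,
}
print(is_risky_consumer(example))
```

A check like this only encodes the threshold quoted in the mitigation list; it says nothing about whether a given workload will actually stall.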

rules/cre-2025-0082/test.log

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+2025-06-15 23:01:56 JetStream stream 'app > some-stream' is not current: monitor goroutine not running
+2025-06-15 23:08:15 [INFO] attempting to create stream `some-stream`
+2025-06-15 23:21:36 [ERROR] failed to create stream: context deadline exceeded
+2025-06-15 23:21:42 [ERROR] Fetch error: context deadline exceeded
+2025-06-15 23:48:11 [INFO] Running stream-check --unsynced on cluster...
+2025-06-15 23:48:17 [INFO] Found unsynced object store replicas:
+│ OBJ_task-documents │ S-R3F-qGDUF5RO │ 5094964000 │ ABE56KWF6GW4LIMZI723WJMPKODITQIM3OYVACQAQWXRHYNHQCQY2LT3 │ nats-0* │ 29 │ 2061405 │ 16 │ 6 │ 0 │ 1 │ 35 │ UNSYNCED │ nats-0 │ nats-2(current=true ,offline=false) nats-1(current=true ,offline=false) │ │
+│ OBJ_task-documents │ S-R3F-qGDUF5RO │ 5094964000 │ ABE56KWF6GW4LIMZI723WJMPKODITQIM3OYVACQAQWXRHYNHQCQY2LT3 │ nats-1 │ 29 │ 2061405 │ 16 │ 6 │ 0 │ 1 │ 35 │ UNSYNCED │ nats-0 │ nats-2(current=false,offline=false) nats-0(current=true ,offline=false) │ │
+│ OBJ_task-documents │ S-R3F-qGDUF5RO │ 5094964000 │ ABE56KWF6GW4LIMZI723WJMPKODITQIM3OYVACQAQWXRHYNHQCQY2LT3 │ nats-2 │ 23 │ 1676277 │ 12 │ 6 │ 0 │ 1 │ 29 │ UNSYNCED │ nats-0 │ nats-1(current=false,offline=false) nats-0(current=true ,offline=false) │ │
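
To see which lines of this test log the rule's regex would pick up, the match and negate terms can be replayed in a few lines of Python. This is a rough sketch under stated assumptions: negation is applied per line here, whereas the actual rule engine may evaluate `negate` terms differently (for example across a time window), and matching is assumed to be case-sensitive. The helper name `is_candidate` is hypothetical. The first six log lines are copied from test.log; the table row is abbreviated.

```python
import re

# Pattern and negate terms copied from the rule in this commit.
PATTERN = re.compile(
    r"monitor goroutine not running|Fetch error: context deadline exceeded|UNSYNCED"
)
NEGATE_TERMS = ("server shutdown", "shutting down")


def is_candidate(line: str) -> bool:
    """Return True if the line matches the pattern and has no negate term.

    Assumption: negation is checked on the same line as the match.
    """
    if any(term in line for term in NEGATE_TERMS):
        return False
    return PATTERN.search(line) is not None


TEST_LOG = [
    "2025-06-15 23:01:56 JetStream stream 'app > some-stream' is not current: monitor goroutine not running",
    "2025-06-15 23:08:15 [INFO] attempting to create stream `some-stream`",
    "2025-06-15 23:21:36 [ERROR] failed to create stream: context deadline exceeded",
    "2025-06-15 23:21:42 [ERROR] Fetch error: context deadline exceeded",
    "2025-06-15 23:48:11 [INFO] Running stream-check --unsynced on cluster...",
    "2025-06-15 23:48:17 [INFO] Found unsynced object store replicas:",
    # Abbreviated form of the first table row in test.log.
    "│ OBJ_task-documents │ S-R3F-qGDUF5RO │ ... │ UNSYNCED │ nats-0 │",
]

# 1-based indices of the lines that would count as matches.
matches = [i for i, line in enumerate(TEST_LOG, start=1) if is_candidate(line)]
print(matches)  # [1, 4, 7]
```

Under these assumptions, line 3 is not matched because the regex requires the `Fetch error:` prefix, and lines 5 and 6 are not matched because `unsynced` appears there in lowercase.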

rules/tags/categories.yaml

Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@ categories:
     description: Problems related to well-known external APIs
   - name: message-queue-problem
     displayName: Message Queue Problems
-    description: Problems related to message queues, like Kafka, RabbitMQ, NATS, and others
+    description: Problems related to message queues, like Kafka, RabbitMQ, NATS and others
   - name: asynchronous-task-problem
     displayName: Asynchronous Task Problems
     description: Problems related to asynchronous tasks, like Celery, Sidekiq, and others

rules/tags/tags.yaml

Lines changed: 9 additions & 0 deletions
@@ -672,6 +672,15 @@ tags:
   - name: sigkill
     displayName: SIGKILL
     description: Failures caused by processes being terminated with a SIGKILL signal.
+  - name: jetstream
+    displayName: JetStream
+    description: NATS JetStream persistence & streaming subsystem issues.
+  - name: ack-deadlock
+    displayName: Ack Deadlock
+    description: Deadlocks caused by unacknowledged messages or backpressure in JetStream acks.
+  - name: unsynced-replica
+    displayName: Unsynced Replica
+    description: JetStream replicas that fail to synchronize state with the leader after restart or failover.
   - name: connection-exhaustion
     displayName: Connection Exhaustion
     description: Problems where systems reach their maximum connection limits, preventing new connections and causing service degradation
