
Commit 1a703d6

docs: add FAQ entry for detecting unreachable nodes
Covers two detection methods:
- Monitoring replication lag via RaftMetrics::replication
- Monitoring heartbeat timestamps via RaftMetrics::heartbeat
1 parent 8cbc7c2 commit 1a703d6

2 files changed: +30 -0 lines

openraft/src/docs/faq/faq-toc.md

Lines changed: 1 addition & 0 deletions
@@ -6,6 +6,7 @@
 * [Why is log id a tuple of `(term, node_id, log_index)`?](#why-is-log-id-a-tuple-of-term-node_id-log_index)
 - [Replication](#replication)
 * [How to minimize error logging when a follower is offline](#how-to-minimize-error-logging-when-a-follower-is-offline)
+* [How to detect which nodes are currently down or unreachable?](#how-to-detect-which-nodes-are-currently-down-or-unreachable)
 - [Node management](#node-management)
 * [How to customize snapshot building policy?](#how-to-customize-snapshot-building-policy)
 - [Cluster management](#cluster-management)

openraft/src/docs/faq/faq.md

Lines changed: 29 additions & 0 deletions
@@ -86,6 +86,32 @@ See: [`leader-id`](`crate::docs::data::leader_id`) for details.
 Excessive error logging, like `ERROR openraft::replication: 248: RPCError err=NetworkError: ...`, occurs when a follower node becomes unresponsive. To alleviate this, implement a mechanism within [`RaftNetwork`][] that returns an [`Unreachable`][] error instead of a [`NetworkError`][] when immediate replication retries to the affected node are not advised.


+### How to detect which nodes are currently down or unreachable?
+
+To monitor node availability in your Raft cluster, use [`RaftMetrics`][] from
+the leader node via [`Raft::metrics()`][]. This provides real-time visibility
+into node reachability without requiring membership changes.
+
+There are two primary approaches to detecting unreachable nodes:
+
+**Method 1: Monitor replication lag**
+Check the field [`RaftMetrics::replication`][], which contains a
+`BTreeMap<NodeId, Option<LogId>>` showing the last replicated log for each node.
+If a node's replication significantly lags behind
+[`RaftMetrics::last_log_index`][], it indicates replication issues and the node
+may be down.
+
+**Method 2: Monitor heartbeat timestamps (since OpenRaft 0.10)**
+Use the field [`RaftMetrics::heartbeat`][], which stores a `BTreeMap<NodeId, Option<SerdeInstant>>`
+containing the timestamp of the last acknowledgment from each node. If a
+timestamp is significantly behind the current time, the node is likely
+unreachable.
+
+Both methods provide an "unreachable from the leader" perspective, which is
+typically what matters for cluster health monitoring. This approach allows you
+to maintain a list of active nodes without modifying cluster membership.
+
+
 ## Node management


@@ -341,6 +367,9 @@ OpenRaft intentionally supports this behavior because:


 [`RaftMetrics`]: `crate::metrics::RaftMetrics`
+[`RaftMetrics::heartbeat`]: `crate::metrics::RaftMetrics::heartbeat`
+[`RaftMetrics::last_log_index`]: `crate::metrics::RaftMetrics::last_log_index`
+[`RaftMetrics::replication`]: `crate::metrics::RaftMetrics::replication`
 [`RaftServerMetrics`]: `crate::metrics::RaftServerMetrics`
 [`RaftDataMetrics`]: `crate::metrics::RaftDataMetrics`
 [`Raft::metrics()`]: `crate::Raft::metrics`
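For illustration (not part of the commit above), the two checks described in the new FAQ entry could be expressed roughly as the sketch below. It deliberately uses only `std` types: the three inputs stand in for `RaftMetrics::last_log_index`, `RaftMetrics::replication`, and `RaftMetrics::heartbeat` as described in the entry, since the exact field types differ across OpenRaft versions. The helper name `suspect_nodes`, the thresholds, and the values in `main` are made up for the example.

```rust
use std::collections::{BTreeMap, BTreeSet};
use std::time::{Duration, Instant};

type NodeId = u64;

/// Hypothetical helper: decide which nodes look down from the leader's view.
fn suspect_nodes(
    leader_last_log_index: u64,                   // stands in for RaftMetrics::last_log_index
    replicated: &BTreeMap<NodeId, Option<u64>>,   // last replicated log index per node
    last_ack: &BTreeMap<NodeId, Option<Instant>>, // last acknowledgment time per node
    max_lag: u64,
    max_silence: Duration,
) -> BTreeSet<NodeId> {
    let mut suspects = BTreeSet::new();

    // Method 1: a node whose replication lags far behind the leader's
    // last log index is likely down or unreachable.
    for (node, idx) in replicated {
        let idx = idx.unwrap_or(0);
        if leader_last_log_index.saturating_sub(idx) > max_lag {
            suspects.insert(*node);
        }
    }

    // Method 2 (OpenRaft 0.10+): a node whose last acknowledgment is missing
    // or too old has not answered heartbeats recently.
    for (node, ack) in last_ack {
        match ack {
            Some(t) if t.elapsed() <= max_silence => {}
            _ => {
                suspects.insert(*node);
            }
        }
    }

    suspects
}

fn main() {
    // Made-up data: node 2 lags by 60 entries and has never acknowledged.
    let replicated = BTreeMap::from([(1, Some(100)), (2, Some(40))]);
    let last_ack = BTreeMap::from([(1, Some(Instant::now())), (2, None)]);
    let down = suspect_nodes(100, &replicated, &last_ack, 50, Duration::from_secs(3));
    println!("suspected down: {down:?}");
}
```

In practice, such a check would run periodically on the leader against the current metrics snapshot obtained via `Raft::metrics()`, with thresholds tuned to the cluster's heartbeat interval and write rate.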
