
Commit 1a703d6

docs: add FAQ entry for detecting unreachable nodes
Covers two detection methods:
- Monitoring replication lag via RaftMetrics::replication
- Monitoring heartbeat timestamps via RaftMetrics::heartbeat
1 parent 8cbc7c2 commit 1a703d6

2 files changed: +30 -0 lines

openraft/src/docs/faq/faq-toc.md

Lines changed: 1 addition & 0 deletions
@@ -6,6 +6,7 @@
 * [Why is log id a tuple of `(term, node_id, log_index)`?](#why-is-log-id-a-tuple-of-term-node_id-log_index)
 - [Replication](#replication)
 * [How to minimize error logging when a follower is offline](#how-to-minimize-error-logging-when-a-follower-is-offline)
+* [How to detect which nodes are currently down or unreachable?](#how-to-detect-which-nodes-are-currently-down-or-unreachable)
 - [Node management](#node-management)
 * [How to customize snapshot building policy?](#how-to-customize-snapshot-building-policy)
 - [Cluster management](#cluster-management)

openraft/src/docs/faq/faq.md

Lines changed: 29 additions & 0 deletions
@@ -86,6 +86,32 @@ See: [`leader-id`](`crate::docs::data::leader_id`) for details.
 Excessive error logging, like `ERROR openraft::replication: 248: RPCError err=NetworkError: ...`, occurs when a follower node becomes unresponsive. To alleviate this, implement a mechanism within [`RaftNetwork`][] that returns an [`Unreachable`][] error instead of a [`NetworkError`][] when immediate replication retries to the affected node are not advised.


+### How to detect which nodes are currently down or unreachable?
+
+To monitor node availability in your Raft cluster, use [`RaftMetrics`][] from
+the leader node via [`Raft::metrics()`][]. This provides real-time visibility
+into node reachability without requiring membership changes.
+
+There are two primary approaches to detecting unreachable nodes:
+
+**Method 1: Monitor replication lag**
+Check the field [`RaftMetrics::replication`][], which contains a
+`BTreeMap<NodeId, Option<LogId>>` showing the last replicated log for each node.
+If a node's replication significantly lags behind
+[`RaftMetrics::last_log_index`][], it indicates replication issues and the node
+may be down.
+
+**Method 2: Monitor heartbeat timestamps (since OpenRaft 0.10)**
+Use the field [`RaftMetrics::heartbeat`][], which stores a `BTreeMap<NodeId, Option<SerdeInstant>>`
+containing the timestamp of the last acknowledgment from each node. If a
+timestamp is significantly behind the current time, the node is likely
+unreachable.
+
+Both methods provide an "unreachable from the leader" perspective, which is
+typically what matters for cluster health monitoring. This approach allows you
+to maintain a list of active nodes without modifying cluster membership.
+
+
 ## Node management


@@ -341,6 +367,9 @@ OpenRaft intentionally supports this behavior because:


 [`RaftMetrics`]: `crate::metrics::RaftMetrics`
+[`RaftMetrics::heartbeat`]: `crate::metrics::RaftMetrics::heartbeat`
+[`RaftMetrics::last_log_index`]: `crate::metrics::RaftMetrics::last_log_index`
+[`RaftMetrics::replication`]: `crate::metrics::RaftMetrics::replication`
 [`RaftServerMetrics`]: `crate::metrics::RaftServerMetrics`
 [`RaftDataMetrics`]: `crate::metrics::RaftDataMetrics`
 [`Raft::metrics()`]: `crate::Raft::metrics`
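For illustration (not part of the commit above), the two checks described in the new FAQ entry could be expressed roughly as the sketch below. It deliberately uses only `std` types: the three inputs stand in for `RaftMetrics::last_log_index`, `RaftMetrics::replication`, and `RaftMetrics::heartbeat` as described in the entry, since the exact field types differ across OpenRaft versions. The helper name `suspect_nodes`, the thresholds, and the values in `main` are made up for the example.

```rust
use std::collections::{BTreeMap, BTreeSet};
use std::time::{Duration, Instant};

type NodeId = u64;

/// Hypothetical helper: decide which nodes look down from the leader's view.
fn suspect_nodes(
    leader_last_log_index: u64,                   // stands in for RaftMetrics::last_log_index
    replicated: &BTreeMap<NodeId, Option<u64>>,   // last replicated log index per node
    last_ack: &BTreeMap<NodeId, Option<Instant>>, // last acknowledgment time per node
    max_lag: u64,
    max_silence: Duration,
) -> BTreeSet<NodeId> {
    let mut suspects = BTreeSet::new();

    // Method 1: a node whose replication lags far behind the leader's
    // last log index is likely down or unreachable.
    for (node, idx) in replicated {
        let idx = idx.unwrap_or(0);
        if leader_last_log_index.saturating_sub(idx) > max_lag {
            suspects.insert(*node);
        }
    }

    // Method 2 (OpenRaft 0.10+): a node whose last acknowledgment is missing
    // or too old has not answered heartbeats recently.
    for (node, ack) in last_ack {
        match ack {
            Some(t) if t.elapsed() <= max_silence => {}
            _ => {
                suspects.insert(*node);
            }
        }
    }

    suspects
}

fn main() {
    // Made-up data: node 2 lags by 60 entries and has never acknowledged.
    let replicated = BTreeMap::from([(1, Some(100)), (2, Some(40))]);
    let last_ack = BTreeMap::from([(1, Some(Instant::now())), (2, None)]);
    let down = suspect_nodes(100, &replicated, &last_ack, 50, Duration::from_secs(3));
    println!("suspected down: {down:?}");
}
```

In practice, such a check would run periodically on the leader against the current metrics snapshot obtained via `Raft::metrics()`, with thresholds tuned to the cluster's heartbeat interval and write rate.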
