More detail in discovery troubleshooting docs (#86930) (#87409)

DaveCTurner · Adam Locke · web-flow · commit d9e1e8d6f85f · 2022-06-06T03:46:51.000-04:00
In #85074 we added docs on discovery troubleshooting that really only talked about troubleshooting master elections. There's also the case where the master is elected fine but some other node can't join it. This commit adds troubleshooting docs about that too. Co-authored-by: Adam Locke <adam.locke@elastic.co>
diff --git a/docs/reference/modules/discovery/discovery.asciidoc b/docs/reference/modules/discovery/discovery.asciidoc
@@ -36,48 +36,88 @@ process again.
 [[modules-discovery-troubleshooting]]
 ==== Troubleshooting discovery
 
-In most cases, the discovery process completes quickly, and the master node
-remains elected for a long period of time. If the cluster has no master for
-more than a few seconds or the master is unstable, the logs for each node will
-contain information explaining why:
+In most cases, the discovery and election process completes quickly, and the
+master node remains elected for a long period of time.
 
-* All nodes repeatedly log messages indicating that a master cannot be
-discovered or elected using a logger called
+If your cluster doesn't have a stable master, many of its features won't work
+correctly and {es} will report errors to clients and in its logs. You must fix
+the master node's instability before addressing these other issues. It will not
+be possible to solve any other issues while there is no elected master node or
+the elected master node is unstable.
+
+If your cluster has a stable master but some nodes can't discover or join it,
+these nodes will report errors to clients and in their logs. You must address
+the obstacles preventing these nodes from joining the cluster before addressing
+other issues. It will not be possible to solve any other issues reported by
+these nodes while they are unable to join the cluster.
+
+If the cluster has no elected master node for more than a few seconds, the
+master is unstable, or some nodes are unable to discover or join a stable
+master, then {es} will record information in its logs explaining why. If the
+problems persist for more than a few minutes, {es} will record additional
+information in its logs. To properly troubleshoot discovery and election
+problems, collect and analyse logs covering at least five minutes from all
+nodes.
+
+The following sections describe some common discovery and election problems.
+
+===== No master is elected
+
+When a node wins the master election, it logs a message containing
+`elected-as-master` and all nodes log a message containing
+`master node changed` identifying the new elected master node.
+
+If there is no elected master node and no node can win an election, all
+nodes will repeatedly log messages about the problem using a logger called
 `org.elasticsearch.cluster.coordination.ClusterFormationFailureHelper`. By
 default, this happens every 10 seconds.
 
-* If a node wins the election, it logs a message containing
-`elected-as-master`. If this happens repeatedly, the master node is unstable.
+Master elections only involve master-eligible nodes, so focus on the logs from
+master-eligible nodes in this situation. These nodes' logs will indicate the
+requirements for a master election, such as the discovery of a certain set of
+nodes.
 
-* When a node discovers the master or believes the master to have failed, it
-logs a message containing `master node changed`.
+If the logs indicate that {es} can't discover enough nodes to form a quorum,
+you must address the reasons preventing {es} from discovering the missing
+nodes. The missing nodes are needed to reconstruct the cluster metadata.
+Without the cluster metadata, the data in your cluster is meaningless. The
+cluster metadata is stored on a subset of the master-eligible nodes in the
+cluster. If a quorum can't be discovered, the missing nodes were the ones
+holding the cluster metadata.
 
-* If a node is unable to discover or elect a master for several minutes, it
-starts to report additional details about the failures in its logs. Be sure to
-capture log messages covering at least five minutes of discovery problems.
+Ensure there are enough nodes running to form a quorum and that every node can
+communicate with every other node over the network. {es} will report additional
+details about network connectivity if the election problems persist for more
+than a few minutes. If you can't start enough nodes to form a quorum, start a
+new cluster and restore data from a recent snapshot. Refer to
+<<modules-discovery-quorums>> for more information.
+
+If the logs indicate that {es} _has_ discovered a possible quorum of nodes, the
+typical reason that the cluster can't elect a master is that one of the other
+nodes can't discover a quorum. Inspect the logs on the other master-eligible
+nodes and ensure that they have all discovered enough nodes to form a quorum.
+
+===== Master is elected but unstable
+
+When a node wins the master election, it logs a message containing
+`elected-as-master`. If this happens repeatedly, the elected master node is
+unstable. In this situation, focus on the logs from the master-eligible nodes
+to understand why the election winner stops being the master and triggers
+another election.
+
+===== Node cannot discover or join stable master
+
+If there is a stable elected master but a node can't discover or join its
+cluster, it will repeatedly log messages about the problem using the
+`ClusterFormationFailureHelper` logger. Other log messages on the affected node
+and the elected master may provide additional information about the problem.
+
+===== Node joins cluster and leaves again
+
+If a node joins the cluster but {es} determines it to be faulty, it will be
+removed from the cluster again. See <<cluster-fault-detection-troubleshooting>>
+for more information.
 
-If your cluster doesn't have a stable master, many of its features won't work
-correctly. The cluster may report many kinds of error to clients and in its
-logs. You must fix the master node's instability before addressing these other
-issues. It will not be possible to solve any other issues while the master node
-is unstable.
-
-The logs from the `ClusterFormationFailureHelper` may indicate that a master
-election requires a certain set of nodes and that it has not discovered enough
-nodes to form a quorum. If so, you must address the reason preventing {es} from
-discovering the missing nodes. The missing nodes are needed to reconstruct the
-cluster metadata. Without the cluster metadata, the data in your cluster is
-meaningless. The cluster metadata is stored on a subset of the master-eligible
-nodes in the cluster. If a quorum cannot be discovered then the missing nodes
-were the ones holding the cluster metadata. If you cannot bring the missing
-nodes back into the cluster, start a new cluster and restore data from a recent
-snapshot. Refer to <<modules-discovery-quorums>> for more information.
-
-The logs from the `ClusterFormationFailureHelper` may also indicate that it has
-discovered a possible quorum of master-eligible nodes. If so, the usual reason
-that the cluster cannot elect a master is that one of the other nodes cannot
-discover a quorum. Inspect the logs on the other master-eligible nodes and
-ensure that every node has discovered a quorum.
 
 [[built-in-hosts-providers]]
 ==== Seed hosts providers