Commit d9e1e8d

DaveCTurner and Adam Locke authored
More detail in discovery troubleshooting docs (#86930) (#87409)
In #85074 we added docs on discovery troubleshooting that really only talked about troubleshooting master elections. There's also the case where the master is elected fine but some other node can't join it. This commit adds troubleshooting docs about that too.

Co-authored-by: Adam Locke <[email protected]>
Co-authored-by: Adam Locke <[email protected]>
1 parent 44a3c97 commit d9e1e8d

1 file changed: +75 -35 lines changed

docs/reference/modules/discovery/discovery.asciidoc

Lines changed: 75 additions & 35 deletions
@@ -36,48 +36,88 @@ process again.
 [[modules-discovery-troubleshooting]]
 ==== Troubleshooting discovery
 
-In most cases, the discovery process completes quickly, and the master node
-remains elected for a long period of time. If the cluster has no master for
-more than a few seconds or the master is unstable, the logs for each node will
-contain information explaining why:
+In most cases, the discovery and election process completes quickly, and the
+master node remains elected for a long period of time.
 
-* All nodes repeatedly log messages indicating that a master cannot be
-discovered or elected using a logger called
+If your cluster doesn't have a stable master, many of its features won't work
+correctly and {es} will report errors to clients and in its logs. You must fix
+the master node's instability before addressing these other issues. It will not
+be possible to solve any other issues while there is no elected master node or
+the elected master node is unstable.
+
+If your cluster has a stable master but some nodes can't discover or join it,
+these nodes will report errors to clients and in their logs. You must address
+the obstacles preventing these nodes from joining the cluster before addressing
+other issues. It will not be possible to solve any other issues reported by
+these nodes while they are unable to join the cluster.
+
+If the cluster has no elected master node for more than a few seconds, the
+master is unstable, or some nodes are unable to discover or join a stable
+master, then {es} will record information in its logs explaining why. If the
+problems persist for more than a few minutes, {es} will record additional
+information in its logs. To properly troubleshoot discovery and election
+problems, collect and analyse logs covering at least five minutes from all
+nodes.
+
+The following sections describe some common discovery and election problems.
+
+===== No master is elected
+
+When a node wins the master election, it logs a message containing
+`elected-as-master` and all nodes log a message containing
+`master node changed` identifying the new elected master node.
+
+If there is no elected master node and no node can win an election, all
+nodes will repeatedly log messages about the problem using a logger called
 `org.elasticsearch.cluster.coordination.ClusterFormationFailureHelper`. By
 default, this happens every 10 seconds.
 
-* If a node wins the election, it logs a message containing
-`elected-as-master`. If this happens repeatedly, the master node is unstable.
+Master elections only involve master-eligible nodes, so focus on the logs from
+master-eligible nodes in this situation. These nodes' logs will indicate the
+requirements for a master election, such as the discovery of a certain set of
+nodes.
 
-* When a node discovers the master or believes the master to have failed, it
-logs a message containing `master node changed`.
+If the logs indicate that {es} can't discover enough nodes to form a quorum,
+you must address the reasons preventing {es} from discovering the missing
+nodes. The missing nodes are needed to reconstruct the cluster metadata.
+Without the cluster metadata, the data in your cluster is meaningless. The
+cluster metadata is stored on a subset of the master-eligible nodes in the
+cluster. If a quorum can't be discovered, the missing nodes were the ones
+holding the cluster metadata.
 
-* If a node is unable to discover or elect a master for several minutes, it
-starts to report additional details about the failures in its logs. Be sure to
-capture log messages covering at least five minutes of discovery problems.
+Ensure there are enough nodes running to form a quorum and that every node can
+communicate with every other node over the network. {es} will report additional
+details about network connectivity if the election problems persist for more
+than a few minutes. If you can't start enough nodes to form a quorum, start a
+new cluster and restore data from a recent snapshot. Refer to
+<<modules-discovery-quorums>> for more information.
+
+If the logs indicate that {es} _has_ discovered a possible quorum of nodes, the
+typical reason that the cluster can't elect a master is that one of the other
+nodes can't discover a quorum. Inspect the logs on the other master-eligible
+nodes and ensure that they have all discovered enough nodes to form a quorum.
+
+===== Master is elected but unstable
+
+When a node wins the master election, it logs a message containing
+`elected-as-master`. If this happens repeatedly, the elected master node is
+unstable. In this situation, focus on the logs from the master-eligible nodes
+to understand why the election winner stops being the master and triggers
+another election.
+
+===== Node cannot discover or join stable master
+
+If there is a stable elected master but a node can't discover or join its
+cluster, it will repeatedly log messages about the problem using the
+`ClusterFormationFailureHelper` logger. Other log messages on the affected node
+and the elected master may provide additional information about the problem.
+
+===== Node joins cluster and leaves again
+
+If a node joins the cluster but {es} determines it to be faulty then it will be
+removed from the cluster again. See <<cluster-fault-detection-troubleshooting>>
+for more information.
 
-If your cluster doesn't have a stable master, many of its features won't work
-correctly. The cluster may report many kinds of error to clients and in its
-logs. You must fix the master node's instability before addressing these other
-issues. It will not be possible to solve any other issues while the master node
-is unstable.
-
-The logs from the `ClusterFormationFailureHelper` may indicate that a master
-election requires a certain set of nodes and that it has not discovered enough
-nodes to form a quorum. If so, you must address the reason preventing {es} from
-discovering the missing nodes. The missing nodes are needed to reconstruct the
-cluster metadata. Without the cluster metadata, the data in your cluster is
-meaningless. The cluster metadata is stored on a subset of the master-eligible
-nodes in the cluster. If a quorum cannot be discovered then the missing nodes
-were the ones holding the cluster metadata. If you cannot bring the missing
-nodes back into the cluster, start a new cluster and restore data from a recent
-snapshot. Refer to <<modules-discovery-quorums>> for more information.
-
-The logs from the `ClusterFormationFailureHelper` may also indicate that it has
-discovered a possible quorum of master-eligible nodes. If so, the usual reason
-that the cluster cannot elect a master is that one of the other nodes cannot
-discover a quorum. Inspect the logs on the other master-eligible nodes and
-ensure that every node has discovered a quorum.
 
 [[built-in-hosts-providers]]
 ==== Seed hosts providers

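The new docs text tells readers to look for three things in node logs: `elected-as-master`, `master node changed`, and messages from the `ClusterFormationFailureHelper` logger. A minimal Python sketch of that triage might look like the following. It is not part of the commit or of {es}. The function names, the `repeat_threshold` value, and the result strings are all assumptions made for illustration, and it matches only the substrings quoted in the docs rather than any real log line format.

```python
from collections import Counter

# Substrings quoted in the docs text; everything else here is illustrative.
MARKERS = (
    "elected-as-master",
    "master node changed",
    "ClusterFormationFailureHelper",
)


def count_markers(lines):
    """Count how many log lines contain each marker substring."""
    counts = Counter()
    for line in lines:
        for marker in MARKERS:
            if marker in line:
                counts[marker] += 1
    return counts


def classify(counts, repeat_threshold=2):
    """Map marker counts to one of the situations the docs describe.

    repeat_threshold is an arbitrary assumption for what counts as
    "repeatedly"; tune it to the time window you collected.
    """
    if counts["elected-as-master"] >= repeat_threshold:
        return "master elected repeatedly: possibly unstable"
    if counts["ClusterFormationFailureHelper"] > 0:
        return "cluster formation failures: no master, or node cannot discover or join it"
    return "no discovery or election problems found by this sketch"
```

To use it, pass the lines of a log file covering at least five minutes, for example `classify(count_markers(open(path)))`, where `path` is whatever log file you collected from a master-eligible node. The result is only a hint about which section of the troubleshooting docs to read first; the log messages themselves carry the details.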