@@ -35,313 +35,30 @@ starting from the beginning of the cluster state update. Refer to

[[cluster-fault-detection-troubleshooting]]
==== Troubleshooting an unstable cluster
- //tag::troubleshooting[]
- Normally, a node will only leave a cluster if deliberately shut down. If a node
- leaves the cluster unexpectedly, it's important to address the cause. A cluster
- in which nodes leave unexpectedly is unstable and can create several issues.
- For instance:

- * The cluster health may be yellow or red.
-
- * Some shards will be initializing and other shards may be failing.
-
- * Search, indexing, and monitoring operations may fail and report exceptions in
- logs.
-
- * The `.security` index may be unavailable, blocking access to the cluster.
-
- * The master may appear busy due to frequent cluster state updates.
-
- To troubleshoot a cluster in this state, first ensure the cluster has a
- <<discovery-troubleshooting,stable master>>. Next, focus on the nodes
- unexpectedly leaving the cluster ahead of all other issues. It will not be
- possible to solve other issues until the cluster has a stable master node and
- stable node membership.
-
- Diagnostics and statistics are usually not useful in an unstable cluster. These
- tools only offer a view of the state of the cluster at a single point in time.
- Instead, look at the cluster logs to see the pattern of behaviour over time.
- Focus particularly on logs from the elected master. When a node leaves the
- cluster, logs for the elected master include a message like this (with line
- breaks added to make it easier to read):
-
- [source,text]
- ----
- [2022-03-21T11:02:35,513][INFO ][o.e.c.c.NodeLeftExecutor] [instance-0000000000]
- node-left: [{instance-0000000004}{bfcMDTiDRkietFb9v_di7w}{aNlyORLASam1ammv2DzYXA}{172.27.47.21}{172.27.47.21:19054}{m}]
- with reason [disconnected]
- ----
-
- This message says that the `NodeLeftExecutor` on the elected master
- (`instance-0000000000`) processed a `node-left` task, identifying the node that
- was removed and the reason for its removal. When the node joins the cluster
- again, logs for the elected master will include a message like this (with line
- breaks added to make it easier to read):
-
- [source,text]
- ----
- [2022-03-21T11:02:59,892][INFO ][o.e.c.c.NodeJoinExecutor] [instance-0000000000]
- node-join: [{instance-0000000004}{bfcMDTiDRkietFb9v_di7w}{UNw_RuazQCSBskWZV8ID_w}{172.27.47.21}{172.27.47.21:19054}{m}]
- with reason [joining after restart, removed [24s] ago with reason [disconnected]]
- ----
-
- This message says that the `NodeJoinExecutor` on the elected master
- (`instance-0000000000`) processed a `node-join` task, identifying the node that
- was added to the cluster and the reason for the task.
-
- Other nodes may log similar messages, but report fewer details:
-
- [source,text]
- ----
- [2020-01-29T11:02:36,985][INFO ][o.e.c.s.ClusterApplierService]
- [instance-0000000001] removed {
- {instance-0000000004}{bfcMDTiDRkietFb9v_di7w}{aNlyORLASam1ammv2DzYXA}{172.27.47.21}{172.27.47.21:19054}{m}
- {tiebreaker-0000000003}{UNw_RuazQCSBskWZV8ID_w}{bltyVOQ-RNu20OQfTHSLtA}{172.27.161.154}{172.27.161.154:19251}{mv}
- }, term: 14, version: 1653415, reason: Publication{term=14, version=1653415}
- ----
-
- These messages are not especially useful for troubleshooting, so focus on the
- ones from the `NodeLeftExecutor` and `NodeJoinExecutor`, which are only emitted
- on the elected master and which contain more details. If you don't see the
- messages from the `NodeLeftExecutor` and `NodeJoinExecutor`, check that:
-
- * You're looking at the logs for the elected master node.
-
- * The logs cover the correct time period.
-
- * Logging is enabled at `INFO` level.
-
- Nodes will also log a message containing `master node changed` whenever they
- start or stop following the elected master. You can use these messages to
- determine each node's view of the state of the master over time.
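-
- For example, the following command is one way to pull the relevant messages out
- of the elected master's logs. Treat it as a sketch: the log path assumes a
- default package installation, so adjust it for your setup.
-
- [source,sh]
- ----
- # Log path assumes a package installation's default; adjust as needed.
- grep -E "node-left|node-join|master node changed" /var/log/elasticsearch/*.log
- ----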
-
- If a node restarts, it will leave the cluster and then join the cluster again.
- When it rejoins, the `NodeJoinExecutor` will log that it processed a
- `node-join` task indicating that the node is `joining after restart`. If a node
- is unexpectedly restarting, look at the node's logs to see why it is shutting
- down.
-
- The <<health-api>> API on the affected node will also provide some useful
- information about the situation.
-
- If the node did not restart then you should look at the reason for its
- departure more closely. Each reason has different troubleshooting steps,
- described below. There are three possible reasons:
-
- * `disconnected`: The connection from the master node to the removed node was
- closed.
-
- * `lagging`: The master published a cluster state update, but the removed node
- did not apply it within the permitted timeout. By default, this timeout is 2
- minutes. Refer to <<modules-discovery-settings>> for information about the
- settings which control this mechanism.
-
- * `followers check retry count exceeded`: The master sent a number of
- consecutive health checks to the removed node. These checks were rejected or
- timed out. By default, each health check times out after 10 seconds and {es}
- removes the node after three consecutively failed health checks. Refer
- to <<modules-discovery-settings>> for information about the settings which
- control this mechanism.
+ See <<troubleshooting-unstable-cluster>>.

[discrete]
===== Diagnosing `disconnected` nodes

- Nodes typically leave the cluster with reason `disconnected` when they shut
- down, but if they rejoin the cluster without restarting then there is some
- other problem.
-
- {es} is designed to run on a fairly reliable network. It opens a number of TCP
- connections between nodes and expects these connections to remain open
- <<long-lived-connections,forever>>. If a connection is closed then {es} will
- try to reconnect, so the occasional blip may fail some in-flight operations
- but should otherwise have limited impact on the cluster. In contrast,
- repeatedly dropped connections will severely affect its operation.
-
- The connections from the elected master node to every other node in the cluster
- are particularly important. The elected master never spontaneously closes its
- outbound connections to other nodes. Similarly, once an inbound connection is
- fully established, a node never spontaneously closes it unless the node is
- shutting down.
-
- If you see a node unexpectedly leave the cluster with the `disconnected`
- reason, something other than {es} likely caused the connection to close. A
- common cause is a misconfigured firewall with an improper timeout or another
- policy that's <<long-lived-connections,incompatible with {es}>>. It could also
- be caused by general connectivity issues, such as packet loss due to faulty
- hardware or network congestion. If you're an advanced user, configure the
- following loggers to get more detailed information about network exceptions:
-
- [source,yaml]
- ----
- logger.org.elasticsearch.transport.TcpTransport: DEBUG
- logger.org.elasticsearch.xpack.core.security.transport.netty4.SecurityNetty4Transport: DEBUG
- ----
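-
- If restarting nodes to apply these settings is not convenient, one alternative
- sketch is to set the same loggers dynamically through the cluster settings API.
- This assumes the cluster is reachable at `localhost:9200` and that your user is
- allowed to update persistent settings.
-
- [source,sh]
- ----
- # Assumes the cluster is reachable at localhost:9200; add credentials if needed.
- curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
- {
-   "persistent": {
-     "logger.org.elasticsearch.transport.TcpTransport": "DEBUG",
-     "logger.org.elasticsearch.xpack.core.security.transport.netty4.SecurityNetty4Transport": "DEBUG"
-   }
- }'
- ----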
-
- If these logs do not show enough information to diagnose the problem, obtain a
- packet capture simultaneously from the nodes at both ends of an unstable
- connection and analyse it alongside the {es} logs from those nodes to determine
- if traffic between the nodes is being disrupted by another device on the
- network.
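-
- If you need such a capture, `tcpdump` is one option. This is only a sketch:
- `10.0.0.1` stands in for the other node's address, `9300` is the default
- transport port, and the capture should run on both hosts at the same time.
-
- [source,sh]
- ----
- # 10.0.0.1 is a placeholder for the peer node; 9300 is the default transport port.
- tcpdump -i any -w transport-$(hostname).pcap host 10.0.0.1 and port 9300
- ----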
+ See <<troubleshooting-unstable-cluster-disconnected>>.

[discrete]
===== Diagnosing `lagging` nodes

- {es} needs every node to process cluster state updates reasonably quickly. If a
- node takes too long to process a cluster state update, it can be harmful to the
- cluster. The master will remove these nodes with the `lagging` reason. Refer to
- <<modules-discovery-settings>> for information about the settings which control
- this mechanism.
-
- Lagging is typically caused by performance issues on the removed node. However,
- a node may also lag due to severe network delays. To rule out network delays,
- ensure that `net.ipv4.tcp_retries2` is <<system-config-tcpretries,configured
- properly>>. Log messages that contain `warn threshold` may provide more
- information about the root cause.
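-
- For example, on Linux you can inspect and adjust this setting with `sysctl`.
- This is a sketch only: `5` is the value typically recommended for {es} (see
- <<system-config-tcpretries>>), and the change below does not persist across
- reboots (use `/etc/sysctl.d/` for that).
-
- [source,sh]
- ----
- sysctl net.ipv4.tcp_retries2        # check the current value
- sysctl -w net.ipv4.tcp_retries2=5   # apply the recommended value at runtime
- ----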
-
- If you're an advanced user, you can get more detailed information about what
- the node was doing when it was removed by configuring the following logger:
-
- [source,yaml]
- ----
- logger.org.elasticsearch.cluster.coordination.LagDetector: DEBUG
- ----
-
- When this logger is enabled, {es} will attempt to run the
- <<cluster-nodes-hot-threads>> API on the faulty node and report the results in
- the logs on the elected master. The results are compressed, encoded, and split
- into chunks to avoid truncation:
-
- [source,text]
- ----
- [DEBUG][o.e.c.c.LagDetector ] [master] hot threads from node [{node}{g3cCUaMDQJmQ2ZLtjr-3dg}{10.0.0.1:9300}] lagging at version [183619] despite commit of cluster state version [183620] [part 1]: H4sIAAAAAAAA/x...
- [DEBUG][o.e.c.c.LagDetector ] [master] hot threads from node [{node}{g3cCUaMDQJmQ2ZLtjr-3dg}{10.0.0.1:9300}] lagging at version [183619] despite commit of cluster state version [183620] [part 2]: p7x3w1hmOQVtuV...
- [DEBUG][o.e.c.c.LagDetector ] [master] hot threads from node [{node}{g3cCUaMDQJmQ2ZLtjr-3dg}{10.0.0.1:9300}] lagging at version [183619] despite commit of cluster state version [183620] [part 3]: v7uTboMGDbyOy+...
- [DEBUG][o.e.c.c.LagDetector ] [master] hot threads from node [{node}{g3cCUaMDQJmQ2ZLtjr-3dg}{10.0.0.1:9300}] lagging at version [183619] despite commit of cluster state version [183620] [part 4]: 4tse0RnPnLeDNN...
- [DEBUG][o.e.c.c.LagDetector ] [master] hot threads from node [{node}{g3cCUaMDQJmQ2ZLtjr-3dg}{10.0.0.1:9300}] lagging at version [183619] despite commit of cluster state version [183620] (gzip compressed, base64-encoded, and split into 4 parts on preceding log lines)
- ----
-
- To reconstruct the output, base64-decode the data and decompress it using
- `gzip`. For instance, on Unix-like systems:
-
- [source,sh]
- ----
- cat lagdetector.log | sed -e 's/.*://' | base64 --decode | gzip --decompress
- ----
+ See <<troubleshooting-unstable-cluster-lagging>>.

[discrete]
===== Diagnosing `follower check retry count exceeded` nodes

- Nodes sometimes leave the cluster with reason `follower check retry count
- exceeded` when they shut down, but if they rejoin the cluster without
- restarting then there is some other problem.
-
- {es} needs every node to respond to network messages successfully and
- reasonably quickly. If a node rejects requests or does not respond at all then
- it can be harmful to the cluster. If enough consecutive checks fail then the
- master will remove the node with reason `follower check retry count exceeded`
- and will indicate in the `node-left` message how many of the consecutive
- unsuccessful checks failed and how many of them timed out. Refer to
- <<modules-discovery-settings>> for information about the settings which control
- this mechanism.
-
- Timeouts and failures may be due to network delays or performance problems on
- the affected nodes. Ensure that `net.ipv4.tcp_retries2` is
- <<system-config-tcpretries,configured properly>> to eliminate network delays as
- a possible cause for this kind of instability. Log messages containing
- `warn threshold` may give further clues about the cause of the instability.
-
- If the last check failed with an exception then the exception is reported, and
- typically indicates the problem that needs to be addressed. If any of the
- checks timed out then narrow down the problem as follows.
-
- include::../../troubleshooting/network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-gc-vm]
-
- include::../../troubleshooting/network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-packet-capture-fault-detection]
-
- include::../../troubleshooting/network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-threads]
-
- By default the follower checks will time out after 30s, so if node departures
- are unpredictable then capture stack dumps every 15s to be sure that at least
- one stack dump was taken at the right time.
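-
- For example, the following loop is one rough way to capture periodic stack
- dumps via the <<cluster-nodes-hot-threads>> API; running `jstack` against the
- {es} process is an alternative. It assumes the node is reachable at
- `localhost:9200`.
-
- [source,sh]
- ----
- # Assumes the node is reachable at localhost:9200; add credentials if needed.
- while true; do
-   curl -s "localhost:9200/_nodes/hot_threads?threads=9999" > hot_threads_$(date +%s).txt
-   sleep 15
- done
- ----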
+ See <<troubleshooting-unstable-cluster-follower-check>>.

[discrete]
===== Diagnosing `ShardLockObtainFailedException` failures

- If a node leaves and rejoins the cluster then {es} will usually shut down and
- re-initialize its shards. If the shards do not shut down quickly enough then
- {es} may fail to re-initialize them due to a `ShardLockObtainFailedException`.
-
- To gather more information about the reason for shards shutting down slowly,
- configure the following logger:
-
- [source,yaml]
- ----
- logger.org.elasticsearch.env.NodeEnvironment: DEBUG
- ----
-
- When this logger is enabled, {es} will attempt to run the
- <<cluster-nodes-hot-threads>> API whenever it encounters a
- `ShardLockObtainFailedException`. The results are compressed, encoded, and
- split into chunks to avoid truncation:
-
- [source,text]
- ----
- [DEBUG][o.e.e.NodeEnvironment ] [master] hot threads while failing to obtain shard lock for [index][0] [part 1]: H4sIAAAAAAAA/x...
- [DEBUG][o.e.e.NodeEnvironment ] [master] hot threads while failing to obtain shard lock for [index][0] [part 2]: p7x3w1hmOQVtuV...
- [DEBUG][o.e.e.NodeEnvironment ] [master] hot threads while failing to obtain shard lock for [index][0] [part 3]: v7uTboMGDbyOy+...
- [DEBUG][o.e.e.NodeEnvironment ] [master] hot threads while failing to obtain shard lock for [index][0] [part 4]: 4tse0RnPnLeDNN...
- [DEBUG][o.e.e.NodeEnvironment ] [master] hot threads while failing to obtain shard lock for [index][0] (gzip compressed, base64-encoded, and split into 4 parts on preceding log lines)
- ----
-
- To reconstruct the output, base64-decode the data and decompress it using
- `gzip`. For instance, on Unix-like systems:
-
- [source,sh]
- ----
- cat shardlock.log | sed -e 's/.*://' | base64 --decode | gzip --decompress
- ----
+ See <<troubleshooting-unstable-cluster-shardlockobtainfailedexception>>.

[discrete]
===== Diagnosing other network disconnections

- {es} is designed to run on a fairly reliable network. It opens a number of TCP
- connections between nodes and expects these connections to remain open
- <<long-lived-connections,forever>>. If a connection is closed then {es} will
- try to reconnect, so the occasional blip may fail some in-flight operations
- but should otherwise have limited impact on the cluster. In contrast,
- repeatedly dropped connections will severely affect its operation.
-
- {es} nodes will only actively close an outbound connection to another node if
- the other node leaves the cluster. See
- <<cluster-fault-detection-troubleshooting>> for further information about
- identifying and troubleshooting this situation. If an outbound connection
- closes for some other reason, nodes will log a message such as the following:
-
- [source,text]
- ----
- [INFO ][o.e.t.ClusterConnectionManager] [node-1] transport connection to [{node-2}{g3cCUaMDQJmQ2ZLtjr-3dg}{10.0.0.1:9300}] closed by remote
- ----
-
- Similarly, once an inbound connection is fully established, a node never
- spontaneously closes it unless the node is shutting down.
-
- Therefore, if you see a node report that a connection to another node closed
- unexpectedly, something other than {es} likely caused the connection to close.
- A common cause is a misconfigured firewall with an improper timeout or another
- policy that's <<long-lived-connections,incompatible with {es}>>. It could also
- be caused by general connectivity issues, such as packet loss due to faulty
- hardware or network congestion. If you're an advanced user, configure the
- following loggers to get more detailed information about network exceptions:
-
- [source,yaml]
- ----
- logger.org.elasticsearch.transport.TcpTransport: DEBUG
- logger.org.elasticsearch.xpack.core.security.transport.netty4.SecurityNetty4Transport: DEBUG
- ----
-
- If these logs do not show enough information to diagnose the problem, obtain a
- packet capture simultaneously from the nodes at both ends of an unstable
- connection and analyse it alongside the {es} logs from those nodes to determine
- if traffic between the nodes is being disrupted by another device on the
- network.
- //end::troubleshooting[]
+ See <<troubleshooting-unstable-cluster-network>>.