@@ -151,17 +151,17 @@ down, but if they rejoin the cluster without restarting then there is some
other problem.

{es} is designed to run on a fairly reliable network. It opens a number of TCP
- connections between nodes and expects these connections to remain open forever.
- If a connection is closed then {es} will try and reconnect, so the occasional
- blip should have limited impact on the cluster even if the affected node
- briefly leaves the cluster. In contrast, repeatedly-dropped connections will
- severely affect its operation.
+ connections between nodes and expects these connections to remain open
+ <<long-lived-connections,forever>>. If a connection is closed then {es} will
+ try and reconnect, so the occasional blip may fail some in-flight operations
+ but should otherwise have limited impact on the cluster. In contrast,
+ repeatedly-dropped connections will severely affect its operation.

The connections from the elected master node to every other node in the cluster
are particularly important. The elected master never spontaneously closes its
- outbound connections to other nodes. Similarly, once a connection is fully
- established, a node never spontaneously close its inbound connections unless
- the node is shutting down.
+ outbound connections to other nodes. Similarly, once an inbound connection is
+ fully established, a node never spontaneously closes it unless the node is
+ shutting down.

If you see a node unexpectedly leave the cluster with the `disconnected`
reason, something other than {es} likely caused the connection to close. A
@@ -301,3 +301,47 @@ To reconstruct the output, base64-decode the data and decompress it using
cat shardlock.log | sed -e 's/.*://' | base64 --decode | gzip --decompress
----
//end::troubleshooting[]
+
+ [discrete]
+ ===== Diagnosing other network disconnections
+
+ {es} is designed to run on a fairly reliable network. It opens a number of TCP
+ connections between nodes and expects these connections to remain open
+ <<long-lived-connections,forever>>. If a connection is closed then {es} will
+ try and reconnect, so the occasional blip may fail some in-flight operations
+ but should otherwise have limited impact on the cluster. In contrast,
+ repeatedly-dropped connections will severely affect its operation.
+
+ {es} nodes will only actively close an outbound connection to another node if
+ the other node leaves the cluster. See
+ <<cluster-fault-detection-troubleshooting>> for further information about
+ identifying and troubleshooting this situation. If an outbound connection
+ closes for some other reason, nodes will log a message such as the following:
+
+ [source,text]
+ ----
+ [INFO ][o.e.t.ClusterConnectionManager] [node-1] transport connection to [{node-2}{g3cCUaMDQJmQ2ZLtjr-3dg}{10.0.0.1:9300}] closed by remote
+ ----
+
+ Similarly, once an inbound connection is fully established, a node never
+ spontaneously closes it unless the node is shutting down.
+
+ Therefore if you see a node report that a connection to another node closed
+ unexpectedly, something other than {es} likely caused the connection to close.
+ A common cause is a misconfigured firewall with an improper timeout or another
+ policy that's <<long-lived-connections,incompatible with {es}>>. It could also
+ be caused by general connectivity issues, such as packet loss due to faulty
+ hardware or network congestion. If you're an advanced user, configure the
+ following loggers to get more detailed information about network exceptions:
+
+ [source,yaml]
+ ----
+ logger.org.elasticsearch.transport.TcpTransport: DEBUG
+ logger.org.elasticsearch.xpack.core.security.transport.netty4.SecurityNetty4Transport: DEBUG
+ ----
+
+ If these logs do not show enough information to diagnose the problem, obtain a
+ packet capture simultaneously from the nodes at both ends of an unstable
+ connection and analyse it alongside the {es} logs from those nodes to determine
+ if traffic between the nodes is being disrupted by another device on the
+ network.
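
The new section distinguishes an occasional blip from repeatedly-dropped connections. As a rough aid for telling the two apart, the sketch below counts `closed by remote` messages per pair of nodes. It is not part of this change or of {es} itself: the function name `count_remote_closes` is hypothetical, and the pattern is modelled only on the single sample message shown in the added text, so real log lines in other layouts may need a different expression.

```python
import re
from collections import Counter

# Modelled on the sample ClusterConnectionManager message in the added docs.
# The pattern is searched rather than anchored, on the assumption that real
# log lines may carry extra leading fields such as a timestamp.
CLOSED_BY_REMOTE = re.compile(
    r"\[o\.e\.t\.ClusterConnectionManager\]\s*\[(?P<local>[^\]]+)\]\s+"
    r"transport connection to \[\{(?P<remote>[^}]*)\}.*?\]\s+closed by remote"
)


def count_remote_closes(lines):
    """Count 'closed by remote' messages per (local node, remote node) pair."""
    counts = Counter()
    for line in lines:
        match = CLOSED_BY_REMOTE.search(line)
        if match:
            counts[(match.group("local"), match.group("remote"))] += 1
    return counts
```

Passing an open log file (or any iterable of lines) to `count_remote_closes` and inspecting `most_common()` on the result shows which connections close most often. A pair with a high count is a reasonable candidate for the packet capture described above.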