Commit 7f06144

Add links to network disconnect troubleshooting
Makes the docs added in elastic#112271 more discoverable.
1 parent 59a42ed commit 7f06144

6 files changed: +33 -14 lines changed

docs/reference/modules/discovery/fault-detection.asciidoc

Lines changed: 5 additions & 0 deletions
@@ -144,6 +144,7 @@ to <<modules-discovery-settings>> for information about the settings which
 control this mechanism.

 [discrete]
+[[cluster-fault-detection-troubleshooting-disconnected]]
 ===== Diagnosing `disconnected` nodes

 Nodes typically leave the cluster with reason `disconnected` when they shut
@@ -184,6 +185,7 @@ if traffic between the nodes is being disrupted by another device on the
 network.

 [discrete]
+[[cluster-fault-detection-troubleshooting-lagging]]
 ===== Diagnosing `lagging` nodes

 {es} needs every node to process cluster state updates reasonably quickly. If a
@@ -229,6 +231,7 @@ cat lagdetector.log | sed -e 's/.*://' | base64 --decode | gzip --decompress
 ----

 [discrete]
+[[cluster-fault-detection-troubleshooting-follower-check]]
 ===== Diagnosing `follower check retry count exceeded` nodes

 Nodes sometimes leave the cluster with reason `follower check retry count
@@ -265,6 +268,7 @@ are unpredictable then capture stack dumps every 15s to be sure that at least
 one stack dump was taken at the right time.

 [discrete]
+[[cluster-fault-detection-troubleshooting-shardlockobtainfailedexception]]
 ===== Diagnosing `ShardLockObtainFailedException` failures

 If a node leaves and rejoins the cluster then {es} will usually shut down and
@@ -302,6 +306,7 @@ cat shardlock.log | sed -e 's/.*://' | base64 --decode | gzip --decompress
 ----

 [discrete]
+[[cluster-fault-detection-troubleshooting-network]]
 ===== Diagnosing other network disconnections

 {es} is designed to run on a fairly reliable network. It opens a number of TCP

docs/reference/modules/transport.asciidoc

Lines changed: 12 additions & 10 deletions
@@ -185,16 +185,18 @@ configured, and defaults otherwise to `transport.tcp.reuse_address`.

 A transport connection between two nodes is made up of a number of long-lived
 TCP connections, some of which may be idle for an extended period of time.
-Nonetheless, Elasticsearch requires these connections to remain open, and it
-can disrupt the operation of your cluster if any inter-node connections are
-closed by an external influence such as a firewall. It is important to
-configure your network to preserve long-lived idle connections between
-Elasticsearch nodes, for instance by leaving `*.tcp.keep_alive` enabled and
-ensuring that the keepalive interval is shorter than any timeout that might
-cause idle connections to be closed, or by setting `transport.ping_schedule` if
-keepalives cannot be configured. Devices which drop connections when they reach
-a certain age are a common source of problems to Elasticsearch clusters, and
-must not be used.
+Nonetheless, {es} requires these connections to remain open, and it can disrupt
+the operation of your cluster if any inter-node connections are closed by an
+external influence such as a firewall. It is important to configure your network
+to preserve long-lived idle connections between {es} nodes, for instance by
+leaving `*.tcp.keep_alive` enabled and ensuring that the keepalive interval is
+shorter than any timeout that might cause idle connections to be closed, or by
+setting `transport.ping_schedule` if keepalives cannot be configured. Devices
+which drop connections when they reach a certain age are a common source of
+problems to {es} clusters, and must not be used.
+
+For information about troubleshooting unexpected network disconnections, see
+<<cluster-fault-detection-troubleshooting-network>>.

 [[request-compression]]
 ===== Request compression

server/src/main/java/org/elasticsearch/common/ReferenceDocs.java

Lines changed: 1 addition & 0 deletions
@@ -43,6 +43,7 @@ public enum ReferenceDocs {
     UNSTABLE_CLUSTER_TROUBLESHOOTING,
     LAGGING_NODE_TROUBLESHOOTING,
     SHARD_LOCK_TROUBLESHOOTING,
+    NETWORK_DISCONNECT_TROUBLESHOOTING,
     CONCURRENT_REPOSITORY_WRITERS,
     ARCHIVE_INDICES,
     HTTP_TRACER,
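
The new enum constant is passed straight to a `[{}]` log placeholder elsewhere in this commit, and the updated test expects a URL under `https://www.elastic.co/guide/en/elasticsearch/reference/` in that position, so the constant's string form apparently renders as a full documentation link. The real `ReferenceDocs` implementation is not part of this diff. The following is only a minimal sketch of one way such a lookup could work: the class `ReferenceLinkSketch`, the nested `Doc` enum, the in-memory `LINKS` map and the `current` version segment are all hypothetical, while the link fragments are copied from the `reference-docs-links.json` hunk of this commit.

```java
import java.util.Map;

final class ReferenceLinkSketch {
    // Assumption: the "current" version segment is hypothetical; the diff only shows the
    // prefix up to ".../reference/" (in the test expectation).
    static final String BASE_URL = "https://www.elastic.co/guide/en/elasticsearch/reference/current/";

    // Fragments copied from the reference-docs-links.json hunk in this commit.
    static final Map<String, String> LINKS = Map.of(
        "LAGGING_NODE_TROUBLESHOOTING",
        "troubleshooting-unstable-cluster.html#cluster-fault-detection-troubleshooting-lagging",
        "SHARD_LOCK_TROUBLESHOOTING",
        "troubleshooting-unstable-cluster.html#cluster-fault-detection-troubleshooting-shardlockobtainfailedexception",
        "NETWORK_DISCONNECT_TROUBLESHOOTING",
        "troubleshooting-unstable-cluster.html#cluster-fault-detection-troubleshooting-network"
    );

    // Hypothetical stand-in for the real enum: toString() resolves the constant to a URL,
    // so passing the constant to a logger placeholder prints the link.
    enum Doc {
        LAGGING_NODE_TROUBLESHOOTING,
        SHARD_LOCK_TROUBLESHOOTING,
        NETWORK_DISCONNECT_TROUBLESHOOTING;

        @Override
        public String toString() {
            return BASE_URL + LINKS.get(name());
        }
    }

    public static void main(String[] args) {
        System.out.println(Doc.NETWORK_DISCONNECT_TROUBLESHOOTING);
    }
}
```

Keying the links on explicit anchors such as `cluster-fault-detection-troubleshooting-network`, rather than on generated IDs like `_diagnosing_lagging_nodes_2`, means the fragment no longer depends on how the docs build numbers duplicate headings.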

server/src/main/java/org/elasticsearch/transport/ClusterConnectionManager.java

Lines changed: 8 additions & 1 deletion
@@ -12,6 +12,7 @@
 import org.elasticsearch.action.ActionListener;
 import org.elasticsearch.action.support.ContextPreservingActionListener;
 import org.elasticsearch.cluster.node.DiscoveryNode;
+import org.elasticsearch.common.ReferenceDocs;
 import org.elasticsearch.common.settings.Settings;
 import org.elasticsearch.common.util.concurrent.ConcurrentCollections;
 import org.elasticsearch.common.util.concurrent.ListenableFuture;
@@ -237,7 +238,13 @@ private void connectToNodeOrRetry(
     if (connectingRefCounter.hasReferences() == false) {
         logger.trace("connection manager shut down, closing transport connection to [{}]", node);
     } else if (conn.hasReferences()) {
-        logger.info("transport connection to [{}] closed by remote", node.descriptionWithoutAttributes());
+        logger.info(
+            """
+                transport connection to [{}] closed by remote; \
+                if unexpected, see [{}] for troubleshooting guidance""",
+            node.descriptionWithoutAttributes(),
+            ReferenceDocs.NETWORK_DISCONNECT_TROUBLESHOOTING
+        );
         // In production code we only close connections via ref-counting, so this message confirms that a
         // 'node-left ... reason: disconnected' event was caused by external factors. Put differently, if a
         // node leaves the cluster with "reason: disconnected" but without this message being logged then
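
The new log call builds its message from a Java text block (Java 15 and later) whose first line ends in a backslash. That escape suppresses the line break, so the template is still one line at runtime even though it spans two lines in the source. Here is a minimal sketch of that behaviour. The class `TextBlockMessageSketch` and its `format` helper are hypothetical stand-ins for the logger's `{}` substitution, and the node description and URL in `main` are made-up sample values.

```java
final class TextBlockMessageSketch {
    // Same shape as the template in the diff: the trailing backslash joins the two source
    // lines, and the closing delimiter on the last content line avoids a trailing newline.
    static final String TEMPLATE = """
        transport connection to [{}] closed by remote; \
        if unexpected, see [{}] for troubleshooting guidance""";

    // Hypothetical helper: replaces each "{}" in order with the matching argument,
    // purely to make the resulting single-line message visible.
    static String format(String template, Object... args) {
        StringBuilder out = new StringBuilder();
        int from = 0;
        for (Object arg : args) {
            int at = template.indexOf("{}", from);
            if (at < 0) {
                break; // more arguments than placeholders: ignore the extras
            }
            out.append(template, from, at).append(arg);
            from = at + 2;
        }
        return out.append(template.substring(from)).toString();
    }

    public static void main(String[] args) {
        // Sample values only; a real node description and link would differ.
        System.out.println(format(TEMPLATE, "{node-1}{127.0.0.1:9300}", "https://example.invalid/docs"));
    }
}
```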

server/src/main/resources/org/elasticsearch/common/reference-docs-links.json

Lines changed: 3 additions & 2 deletions
@@ -2,8 +2,9 @@
   "INITIAL_MASTER_NODES": "important-settings.html#initial_master_nodes",
   "DISCOVERY_TROUBLESHOOTING": "discovery-troubleshooting.html",
   "UNSTABLE_CLUSTER_TROUBLESHOOTING": "troubleshooting-unstable-cluster.html",
-  "LAGGING_NODE_TROUBLESHOOTING": "troubleshooting-unstable-cluster.html#_diagnosing_lagging_nodes_2",
-  "SHARD_LOCK_TROUBLESHOOTING": "troubleshooting-unstable-cluster.html#_diagnosing_shardlockobtainfailedexception_failures_2",
+  "LAGGING_NODE_TROUBLESHOOTING": "troubleshooting-unstable-cluster.html#cluster-fault-detection-troubleshooting-lagging",
+  "SHARD_LOCK_TROUBLESHOOTING": "troubleshooting-unstable-cluster.html#cluster-fault-detection-troubleshooting-shardlockobtainfailedexception",
+  "NETWORK_DISCONNECT_TROUBLESHOOTING": "troubleshooting-unstable-cluster.html#cluster-fault-detection-troubleshooting-network",
   "CONCURRENT_REPOSITORY_WRITERS": "diagnosing-corrupted-repositories.html",
   "ARCHIVE_INDICES": "archive-indices.html",
   "HTTP_TRACER": "modules-network.html#http-rest-request-tracer",

server/src/test/java/org/elasticsearch/transport/ClusterConnectionManagerTests.java

Lines changed: 4 additions & 1 deletion
@@ -188,7 +188,10 @@ public void testDisconnectLogging() {
         "remotely-triggered close message",
         ClusterConnectionManager.class.getCanonicalName(),
         Level.INFO,
-        "transport connection to [" + remoteClose.descriptionWithoutAttributes() + "] closed by remote"
+        "transport connection to ["
+            + remoteClose.descriptionWithoutAttributes()
+            + "] closed by remote; "
+            + "if unexpected, see [https://www.elastic.co/guide/en/elasticsearch/reference/*] for troubleshooting guidance"
     )
 );
 mockLog.addExpectation(
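
The updated expectation ends the documentation URL with `*`, which suggests the expected message is compared as a wildcard pattern, so the test does not depend on the docs version or the page fragment. The diff does not show how the test framework performs that comparison. The sketch below is therefore an assumption-laden illustration of the idea, not the framework's code: `WildcardMatchSketch` and `simpleMatch` are hypothetical names, and the node description and version in `main` are sample values.

```java
import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class WildcardMatchSketch {
    // Treats '*' as "any run of characters" and every other character literally, so the
    // '[', ']' and '.' that appear in the log message are not interpreted as regex syntax.
    static boolean simpleMatch(String pattern, String text) {
        String regex = Arrays.stream(pattern.split("\\*", -1)) // -1 keeps trailing empty pieces
            .map(Pattern::quote)
            .collect(Collectors.joining(".*"));
        return Pattern.compile(regex, Pattern.DOTALL).matcher(text).matches();
    }

    public static void main(String[] args) {
        String expected = "transport connection to [{node-1}] closed by remote; "
            + "if unexpected, see [https://www.elastic.co/guide/en/elasticsearch/reference/*] for troubleshooting guidance";
        // Sample logged message; the version segment and node description are made up.
        String logged = "transport connection to [{node-1}] closed by remote; "
            + "if unexpected, see [https://www.elastic.co/guide/en/elasticsearch/reference/8.15/"
            + "troubleshooting-unstable-cluster.html#cluster-fault-detection-troubleshooting-network]"
            + " for troubleshooting guidance";
        System.out.println(simpleMatch(expected, logged));
    }
}
```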
