Add remote cluster network troubleshooting docs (#107072) (#107084)

DaveCTurner · web-flow · commit 9da2b3d8b22b · 2024-04-04T03:00:13.000-04:00
Spells out in a little more detail our expectations for remote cluster
connections, including an example log message when the network is
unreliable and some suggestions for how to troubleshoot further.
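
As a rough illustration of the `ss -tonie` check suggested in the new
Resolution section, the following Python sketch scans captured `ss` output for
established connections to a given remote port and reports which timer, if
any, `ss` shows for each one. The function name `summarize_timers`, the sample
output, and the addresses in it are hypothetical. The field layout is assumed
to match typical iproute2 output and may differ between versions.

```python
import re

# Assumed shape of the timer field printed by `ss -o`,
# e.g. timer:(keepalive,1min58sec,0)
TIMER_RE = re.compile(r"timer:\(([a-z]+),([^,]*),(\d+)\)")


def summarize_timers(ss_output, remote_port):
    """Return (peer, timer_name) for each ESTAB connection to remote_port.

    timer_name is None when the line carries no timer field.
    """
    results = []
    for line in ss_output.splitlines():
        fields = line.split()
        # Skip the header row and the indented detail rows added by `-i`.
        if len(fields) < 5 or fields[0] != "ESTAB":
            continue
        peer = fields[4]
        if not peer.endswith(f":{remote_port}"):
            continue
        match = TIMER_RE.search(line)
        results.append((peer, match.group(1) if match else None))
    return results


# Hypothetical sample, hand-written for illustration (not real output).
SAMPLE = """\
State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
ESTAB 0 0 192.168.0.10:51234 192.168.0.42:9443 timer:(keepalive,1min58sec,0) uid:1000 ino:12345 sk:1 <->
\t cubic wscale:7,7 rto:204 rtt:0.5/0.25
ESTAB 0 0 192.168.0.10:51236 192.168.0.42:9443 uid:1000 ino:12346 sk:2 <->
ESTAB 0 0 192.168.0.10:40000 192.168.0.50:9200 timer:(keepalive,30sec,0) uid:1000 ino:12347 sk:3 <->
"""

for peer, timer in summarize_timers(SAMPLE, 9443):
    print(peer, timer)
```

A connection shown without a `keepalive` timer in a single snapshot is not
necessarily misconfigured, because `ss` reports only the timer that is
currently pending, so it is worth sampling several times before drawing
conclusions.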
diff --git a/docs/reference/modules/cluster/remote-clusters-troubleshooting.asciidoc b/docs/reference/modules/cluster/remote-clusters-troubleshooting.asciidoc
@@ -77,6 +77,46 @@ org.elasticsearch.transport.ConnectTransportException: [][192.168.0.42:9443] *co
 server is enabled>> on the remote cluster.
 * Ensure no firewall is blocking the communication.
 
+[[remote-clusters-unreliable-network]]
+===== Remote cluster connection is unreliable
+
+====== Symptom
+
+The local cluster can connect to the remote cluster, but the connection does
+not work reliably. For example, some cross-cluster requests may succeed while
+others report connection errors, time out, or appear to be stuck waiting for
+the remote cluster to respond.
+
+When {es} detects that the remote cluster connection is not working, it logs a
+message like the following:
+[source,txt,subs=+quotes]
+----
+[2023-06-28T16:36:47,264][INFO ][o.e.t.ClusterConnectionManager] [local-node] transport connection to [{my-remote#192.168.0.42:9443}{...}] closed by remote
+----
+{es} also logs this message when the remote cluster node to which it is
+connected is shut down or restarted.
+
+Note that with some network configurations it could take minutes or hours for
+the operating system to detect that a connection has stopped working. Until the
+failure is detected and reported to {es}, requests involving the remote cluster
+may time out or may appear to be stuck.
+
+====== Resolution
+
+* Ensure that the network between the clusters is as reliable as possible.
+
+* Ensure that the network is configured to permit <<long-lived-connections>>.
+
+* Ensure that the network is configured to detect faulty connections quickly.
+  In particular, you must enable and fully support TCP keepalives, and set a
+  short <<system-config-tcpretries,retransmission timeout>>.
+
+* On Linux systems, run `ss -tonie` to inspect the configuration of each
+  network connection between the clusters.
+
+* If the problems persist, capture network packets at both ends of the
+  connection and analyse the traffic to look for delays and lost messages.
+
 [[remote-clusters-troubleshooting-tls-trust]]
 ===== TLS trust not established