
Commit 9da2b3d

Add remote cluster network troubleshooting docs (#107072) (#107084)
Spells out in a little more detail our expectations for remote cluster connections, including an example log message when the network is unreliable and some suggestions for how to troubleshoot further.
1 parent 98d60b3 commit 9da2b3d

1 file changed: +40 -0 lines changed
docs/reference/modules/cluster/remote-clusters-troubleshooting.asciidoc

Lines changed: 40 additions & 0 deletions
@@ -77,6 +77,46 @@ org.elasticsearch.transport.ConnectTransportException: [][192.168.0.42:9443] *co
 server is enabled>> on the remote cluster.
 * Ensure no firewall is blocking the communication.
 
+[[remote-clusters-unreliable-network]]
+===== Remote cluster connection is unreliable
+
+====== Symptom
+
+The local cluster can connect to the remote cluster, but the connection does
+not work reliably. For example, some cross-cluster requests may succeed while
+others report connection errors, time out, or appear to be stuck waiting for
+the remote cluster to respond.
+
+When {es} detects that the remote cluster connection is not working, it will
+report the following message in its logs:
+[source,txt,subs=+quotes]
+----
+[2023-06-28T16:36:47,264][INFO ][o.e.t.ClusterConnectionManager] [local-node] transport connection to [{my-remote#192.168.0.42:9443}{...}] closed by remote
+----
+This message will also be logged if the node of the remote cluster to which
+{es} is connected is shut down or restarted.
+
+Note that with some network configurations it could take minutes or hours for
+the operating system to detect that a connection has stopped working. Until the
+failure is detected and reported to {es}, requests involving the remote cluster
+may time out or may appear to be stuck.
+
+====== Resolution
+
+* Ensure that the network between the clusters is as reliable as possible.
+
+* Ensure that the network is configured to permit <<long-lived-connections>>.
+
+* Ensure that the network is configured to detect faulty connections quickly.
+In particular, you must enable and fully support TCP keepalives, and set a
+short <<system-config-tcpretries,retransmission timeout>>.
+
+* On Linux systems, execute `ss -tonie` to verify the details of the
+configuration of each network connection between the clusters.
+
+* If the problems persist, capture network packets at both ends of the
+connection and analyse the traffic to look for delays and lost messages.
+
 [[remote-clusters-troubleshooting-tls-trust]]
 ===== TLS trust not established
 
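
The following sketch is not part of the commit. It illustrates one way to gauge how often the "closed by remote" message from the new Symptom section appears in a node's log, grouped by remote cluster alias and address. The regular expression is modelled only on the single sample line in the diff, and the function name `count_closed_by_remote` is hypothetical; real log layouts may differ, so treat the pattern as an assumption to adjust.

```python
import re
from collections import Counter

# Assumption: the layout matches the one sample line shown in the docs:
# [timestamp][LEVEL ][o.e.t.ClusterConnectionManager] [node] transport
# connection to [{alias#host:port}{...}] closed by remote
CLOSED_BY_REMOTE = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\]"                 # [2023-06-28T16:36:47,264]
    r"\[(?P<level>[A-Z]+)\s*\]"                   # [INFO ]
    r"\[o\.e\.t\.ClusterConnectionManager\]\s+"   # logger name
    r"\[(?P<node>[^\]]+)\]\s+"                    # [local-node]
    r"transport connection to "
    r"\[\{(?P<remote>[^#}]+)#(?P<address>[^}]+)\}"  # {my-remote#192.168.0.42:9443}
    r".*closed by remote\s*$"
)


def count_closed_by_remote(lines):
    """Count 'closed by remote' messages per (remote alias, address)."""
    counts = Counter()
    for line in lines:
        match = CLOSED_BY_REMOTE.match(line)
        if match:
            counts[(match["remote"], match["address"])] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage (hypothetical path): python count_closed.py < elasticsearch.log
    for (remote, address), n in count_closed_by_remote(sys.stdin).most_common():
        print(f"{remote} {address} {n}")
```

A high count for one alias only suggests that the connection is dropping repeatedly. As the docs note, the same message is also logged when the remote node is shut down or restarted, so the counts need to be read alongside the remote cluster's own restart history.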
