Spells out in a little more detail our expectations for remote cluster
connections, including an example log message when the network is
unreliable and some suggestions for how to troubleshoot further.
* Ensure no firewall is blocking the communication.

[[remote-clusters-unreliable-network]]
===== Remote cluster connection is unreliable

====== Symptom

The local cluster can connect to the remote cluster, but the connection does
not work reliably. For example, some cross-cluster requests may succeed while
others report connection errors, time out, or appear to be stuck waiting for
the remote cluster to respond.

When {es} detects that the remote cluster connection is not working, it will
report the following message in its logs:
[source,txt,subs=+quotes]
----
[2023-06-28T16:36:47,264][INFO ][o.e.t.ClusterConnectionManager] [local-node] transport connection to [{my-remote#192.168.0.42:9443}{...}] closed by remote
----
This message will also be logged if the node of the remote cluster to which
{es} is connected is shut down or restarted.
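
To see how often this message occurs, and for which remote connections, the log can be scanned for it. The following is a minimal Python sketch, not an official tool. It assumes the log line has exactly the shape of the single example above; the pattern, the `ClosedConnection` record and the `find_closed_connections` helper are hypothetical names introduced for illustration, and other {es} versions or logging layouts may need a different pattern.

```python
import re
from typing import Iterable, Iterator, NamedTuple

# Pattern modelled on the one example message shown above (assumption:
# other versions or log layouts may format the line differently).
CLOSED_BY_REMOTE = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\]"                 # [2023-06-28T16:36:47,264]
    r"\[(?P<level>[A-Z]+)\s*\]"                   # [INFO ]
    r"\[o\.e\.t\.ClusterConnectionManager\]\s*"   # abbreviated logger name
    r"\[(?P<node>[^\]]+)\]\s*"                    # [local-node]
    r"transport connection to \[(?P<target>.*)\] closed by remote\s*$"
)


class ClosedConnection(NamedTuple):
    timestamp: str  # timestamp exactly as written in the log
    node: str       # name of the node that logged the message
    target: str     # description of the remote node, as logged


def find_closed_connections(lines: Iterable[str]) -> Iterator[ClosedConnection]:
    """Yield one record for each 'closed by remote' message in `lines`."""
    for line in lines:
        match = CLOSED_BY_REMOTE.match(line)
        if match:
            yield ClosedConnection(
                match["timestamp"], match["node"], match["target"]
            )
```

For example, passing an open log file to `find_closed_connections` and grouping the results by `target` shows whether the closures cluster around one remote node. Remember that the same message is also logged for ordinary shutdowns and restarts of the remote node.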

Note that with some network configurations it could take minutes or hours for
the operating system to detect that a connection has stopped working. Until the
failure is detected and reported to {es}, requests involving the remote cluster
may time out or may appear to be stuck.
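
As a rough illustration of where a delay of hours can come from, consider TCP keepalives on an otherwise idle connection. The sketch below is a back-of-the-envelope calculation, not a description of how {es} itself works. It assumes the commonly cited Linux defaults for `tcp_keepalive_time`, `tcp_keepalive_intvl` and `tcp_keepalive_probes`; the tuned values are hypothetical and chosen only for comparison. The function name is introduced here for illustration.

```python
def keepalive_detection_seconds(idle: int, interval: int, probes: int) -> int:
    """Approximate worst-case time for TCP keepalives to notice a dead peer.

    The connection must first sit idle for `idle` seconds, after which
    `probes` unanswered probes are sent `interval` seconds apart.
    """
    return idle + interval * probes


# Commonly cited Linux defaults (assumption: check your own system, since
# distributions and administrators may change them).
linux_default = keepalive_detection_seconds(idle=7200, interval=75, probes=9)

# Hypothetical tuned values, chosen only to illustrate the difference.
tuned = keepalive_detection_seconds(idle=60, interval=10, probes=3)

print(f"defaults: {linux_default} s (about {linux_default / 3600:.1f} hours)")
print(f"tuned:    {tuned} s")
```

With the assumed defaults the total is 7875 seconds, a little over two hours. This estimate covers idle connections only; when unacknowledged data is in flight, the retransmission timeout determines how quickly the failure is noticed.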

====== Resolution

* Ensure that the network between the clusters is as reliable as possible.

* Ensure that the network is configured to permit <<long-lived-connections>>.

* Ensure that the network is configured to detect faulty connections quickly.
In particular, you must enable and fully support TCP keepalives, and set a
short <<system-config-tcpretries,retransmission timeout>>.

* On Linux systems, execute `ss -tonie` to verify the details of the
configuration of each network connection between the clusters.

* If the problems persist, capture network packets at both ends of the
connection and analyse the traffic to look for delays and lost messages.
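
As a complement to inspecting individual connections, the system-wide Linux TCP settings behind the keepalive and retransmission advice above can be reviewed with a short script. This is a minimal sketch under stated assumptions: it reads the standard `/proc/sys/net/ipv4` files, and the thresholds (`max_retries2=5`, `max_keepalive_time=300`) are illustrative defaults for this example rather than values taken from this section, so adjust them to the guidance you follow. The function names are hypothetical.

```python
from pathlib import Path
from typing import Dict, List, Optional

PROC_IPV4 = Path("/proc/sys/net/ipv4")
SETTINGS = (
    "tcp_keepalive_time",
    "tcp_keepalive_intvl",
    "tcp_keepalive_probes",
    "tcp_retries2",
)


def read_tcp_settings(base: Path = PROC_IPV4) -> Dict[str, Optional[int]]:
    """Read each setting as an integer, or None if it cannot be read."""
    values: Dict[str, Optional[int]] = {}
    for name in SETTINGS:
        try:
            values[name] = int((base / name).read_text().strip())
        except (OSError, ValueError):
            values[name] = None  # not Linux, or the file is unreadable
    return values


def review_tcp_settings(
    values: Dict[str, Optional[int]],
    max_retries2: int = 5,          # illustrative threshold (assumption)
    max_keepalive_time: int = 300,  # illustrative threshold (assumption)
) -> List[str]:
    """Return a human-readable finding for each setting that looks too lax."""
    findings: List[str] = []
    limits = {
        "tcp_retries2": max_retries2,
        "tcp_keepalive_time": max_keepalive_time,
    }
    for name, limit in limits.items():
        value = values.get(name)
        if value is None:
            findings.append(f"{name}: could not be read")
        elif value > limit:
            findings.append(f"{name}={value} is above the chosen limit of {limit}")
    return findings


if __name__ == "__main__":
    for finding in review_tcp_settings(read_tcp_settings()):
        print(finding)
```

These files hold system-wide defaults only. An application can override the keepalive parameters on each socket, so the per-connection output of `ss -tonie` remains the more direct view of what a given connection actually uses.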