transport: log network reconnects with same peer process #128415
Conversation
ClusterConnectionManager now caches the previous ephemeralId (created on process-start) of peer nodes on disconnect in a connection history table. On reconnect, when a peer has the same ephemeralId as it did previously, this is logged to indicate a network failure. The connectionHistory is trimmed to the current set of peers by NodeConnectionsService.
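To illustrate the mechanism described above, here is a minimal, self-contained sketch of the idea (hypothetical names, not the actual Elasticsearch code): remember the peer's ephemeralId when a connection closes, and compare it when the node connects again.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    class ReconnectDetector {
        record DisconnectRecord(String ephemeralId, long disconnectTimeMillis, Exception cause) {}

        private final Map<String, DisconnectRecord> historyByNodeId = new ConcurrentHashMap<>();

        void onDisconnect(String nodeId, String ephemeralId, Exception cause) {
            historyByNodeId.put(nodeId, new DisconnectRecord(ephemeralId, System.currentTimeMillis(), cause));
        }

        void onConnect(String nodeId, String ephemeralId) {
            DisconnectRecord previous = historyByNodeId.remove(nodeId);
            if (previous != null && previous.ephemeralId().equals(ephemeralId)) {
                // Same process on the other end: the earlier disconnect was a network problem, not a restart.
                System.out.println("reconnected to node [" + nodeId + "] that did not restart; earlier disconnect cause: " + previous.cause());
            }
        }
    }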
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

I wasn't able to find a way to test the ClusterConnectionManager's connectionHistory table when integrated through the NodeConnectionsService.
Looking good, just a few questions and minor comments.
        
          
(outdated review comment on server/src/main/java/org/elasticsearch/transport/ClusterConnectionManager.java, resolved)
        
/**
 * Keep the connection history for the nodes listed
 */
void retainConnectionHistory(List<DiscoveryNode> nodes);
In the javadoc I think we should mention that we discard history for nodes not in the list? If you know the Set API then it's suggested by the name retain, but if you don't it might not be obvious.
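For example, the javadoc could make the discard behaviour explicit (a sketch of the suggested wording only, not the final text):

    /**
     * Keep the connection history only for the nodes listed; history for any node
     * not present in {@code nodes} is discarded.
     */
    void retainConnectionHistory(List<DiscoveryNode> nodes);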
        runnables.add(connectionTarget.connect(null));
    }
}
transportService.retainConnectionHistory(nodes);
We might be able to use DiscoveryNodes#getAllNodes() rather than building up an auxiliary collection, that might be marginally more efficient? Set#retainAll seems to take a Collection, but we'd need to change the ConnectionManager#retainConnectionHistory interface to accommodate.
Do we need a separate collection here at all? We could just pass discoveryNodes around I think.
But also, really this is cleaning out the nodes about which we no longer care, so I think we should be doing this in disconnectFromNodesExcept instead.
Nick raised an important point about the race between the connection history table and the close callback.
A connection's close callback will always put an entry in the history table. If this close is a consequence of a cluster state change and disconnect in NodeConnectionsService, then it will add a node history right after it's supposed to be cleaned out.
Cleaning out the node history table whenever we disconnect from some nodes or connect to some new nodes works fine, but it means the history table will always lag a version behind, in what it's holding onto.
I came up with a concurrency scheme that works for keeping the node history current in NodeConnectionsService, but it's more complicated.
public void onFailure(Exception e) {
    final NodeConnectionHistory hist = new NodeConnectionHistory(node.getEphemeralId(), e);
    nodeHistory.put(conn.getNode().getId(), hist);
}
Do we want to store the connection history even when conn.hasReferences() == false ? I'm not 100% familiar with this code, but I wonder if we might get the occasional ungraceful disconnect after we've released all our references?
I guess in that case we would eventually discard the entry via retainConnectionHistory anyway.
Do we need to be careful with the timing of calls to retainConnectionHistory versus these close handlers firing? I guess any entries that are added after a purge would not survive subsequent purges.
node.descriptionWithoutAttributes(),
e,
ReferenceDocs.NETWORK_DISCONNECT_TROUBLESHOOTING
);
It looks like previously we would only have logged at debug level in this scenario? unless I'm reading it wrong. I'm not sure how interesting this case is (as we were disconnecting from the node anyway)?
assertTrue("recent disconnects should be listed", connectionManager.connectionHistorySize() == 2);

connectionManager.retainConnectionHistory(Collections.emptyList());
assertTrue("connection history should be emptied", connectionManager.connectionHistorySize() == 0);
I wonder if it would be better to expose a read-only copy of the map for testing this, that would allow us to assert that the correct IDs were present?
I think ClusterConnectionManager isn't quite the right place to do this - the job of this connection manager is to look after all node-to-node connections, including ones used for discovery and remote cluster connections too. There are situations where we might close and re-establish these kinds of connection without either end restarting, and without that being a problem worthy of logging.
NodeConnectionsService is the class that knows about connections to nodes in the cluster. I'd rather we implemented the logging about unexpected reconnects there. That does raise some difficulties about how to expose the exception that closed the connection, if such an exception exists. I did say that this bit would be tricky 😁 Nonetheless I'd rather we got the logging to happen in the right place first and then we can think about the plumbing needed to achieve this extra detail.
value = "org.elasticsearch.transport.ClusterConnectionManager:WARN",
reason = "to ensure we log cluster manager disconnect events on WARN level"
)
public void testExceptionalDisconnectLoggingInClusterConnectionManager() throws Exception {
Could we put this into its own test suite? This suite is supposed to be about ESLoggingHandler which is unrelated to the logging in ClusterConnectionManager. I think this test should work fine in the :server test suite, no need to hide it in the transport-netty4 module.
Also could you open a separate PR to move testConnectionLogging and testExceptionalDisconnectLogging out of this test suite - they're testing the logging in TcpTransport which is similarly unrelated to ESLoggingHandler. IIRC they were added here for historical reasons, but these days we use the Netty transport everywhere so these should work in :server too.
        
          
(outdated review comment on server/src/main/java/org/elasticsearch/transport/ClusterConnectionManager.java, resolved)
        
NodeConnectionHistory hist = nodeHistory.remove(connNode.getId());
if (hist != null && hist.ephemeralId.equals(connNode.getEphemeralId())) {
Could we extract this to a separate method rather than adding to this already over-long and over-nested code directly?
Also I'd rather use nodeConnectionHistory instead of hist. Abbreviated variable names are a hindrance to readers, particularly if they don't have English as a first language, and there's no disadvantage to using the full type name here.
(nit: also it can be final)
if (hist.disconnectCause != null) {
    logger.warn(
        () -> format(
            "transport connection reopened to node with same ephemeralId [%s], close exception:",
Users don't really know what ephemeralId is so I think will find this message confusing. Could we say something like reopened transport connection to node [%s] which disconnected exceptionally [%s/%dms] ago but did not restart, so the disconnection is unexpected? NB also tracking the disconnection duration here.
Similarly disconnected gracefully in the other branch.
Also can we link ReferenceDocs.NETWORK_DISCONNECT_TROUBLESHOOTING?
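A sketch of what the suggested message could look like, assuming a millisSinceDisconnect value and a disconnectCause available at the call site (the exact wording and plumbing changed in later revisions of this PR):

    logger.warn(
        () -> format(
            "reopened transport connection to node [%s] which disconnected exceptionally [%s/%dms] ago "
                + "but did not restart, so the disconnection is unexpected; see [%s] for troubleshooting guidance",
            node.descriptionWithoutAttributes(),
            TimeValue.timeValueMillis(millisSinceDisconnect),
            millisSinceDisconnect,
            ReferenceDocs.NETWORK_DISCONNECT_TROUBLESHOOTING
        ),
        disconnectCause
    );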
        
          
(review comment on server/src/main/java/org/elasticsearch/transport/ClusterConnectionManager.java, resolved)
        
Thanks for the feedback everyone. It looks like I can repurpose the
- moved test out of ESLoggingHandlerIt into a separate ClusterConnectionManagerIntegTests file
- moved connection history into NodeConnectionsService, and adopted a consistency scheme
- rewrote re-connection log message to include duration
- changed log level of local disconnect with exception to debug
logger.warn(
    """
        transport connection to [{}] closed by remote with exception [{}]; \
        if unexpected, see [{}] for troubleshooting guidance""",
I think this isn't guaranteed to be a WARN worthy event - if the node shut down then we might get a Connection reset or similar but that's not something that needs action, and we do log those exceptions elsewhere. On reflection I'd rather leave the logging in ClusterConnectionManager alone in this PR and just look at the new logs from the NodeConnectionsService.
import org.elasticsearch.test.junit.annotations.TestLogging;

@ESIntegTestCase.ClusterScope(numDataNodes = 2, scope = ESIntegTestCase.Scope.TEST)
public class ClusterConnectionManagerIntegTests extends ESIntegTestCase {
nit: ESIntegTestCase tests should have names ending in IT and be in the internalClusterTest source set. But as mentioned in my previous comment we probably don't want to change this here.
    }
}

private class ConnectionHistory {
Yeah I like the look of this. Maybe ConnectionHistory implements TransportConnectionListener rather than having another layer of indirection?
Also this needs to be covered in NodeConnectionsServiceTests.
 * Each node in the cluster always has a nodeHistory entry that is either the dummy value or a connection history record. This
 * allows node disconnect callbacks to discard their entry if the disconnect occurred because of a change in cluster state.
 */
private final NodeConnectionHistory dummy = new NodeConnectionHistory("", 0, null);
Can be static I think, it's a global constant. We tend to name global constants in SHOUTY_SNAKE_CASE reflecting their meaning, so here I'd suggest CONNECTED or CONNECTED_MARKER or something like that. This way you get to say nodeConnectionHistory != CONNECTED_MARKER below which makes it clearer to the reader what this predicate means.
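A sketch of the suggested constant and check, using the names proposed in this comment:

    private static final NodeConnectionHistory CONNECTED_MARKER = new NodeConnectionHistory("", 0, null);

    // ... later, when deciding whether to log:
    if (nodeConnectionHistory != CONNECTED_MARKER) {
        // a real disconnection was recorded for this node
    }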
nit: also looks like the javadoc is for the nodeHistory field
"reopened transport connection to node [%s] "
    + "which disconnected exceptionally [%dms] ago but did not "
    + "restart, so the disconnection is unexpected; "
    + "if unexpected, see [{}] for troubleshooting guidance",
No need for if unexpected here, I think the point is that this situation is always unexpected.
    + "restart, so the disconnection is unexpected; "
    + "if unexpected, see [{}] for troubleshooting guidance",
node.descriptionWithoutAttributes(),
nodeConnectionHistory.disconnectTime,
This'll show the absolute disconnect time in milliseconds (i.e. since 1970) whereas I think we want to see the duration between the disconnect and the current time.
Thanks for the feedback David -- this was definitely a light pass on everything other than the concurrency scheme, and I wanted to get notes on it before adding complete testing and getting everything else just right. In hindsight, I probably shouldn't have tried to address everything else at the same time by committing first-draft versions.
void reserveConnectionHistoryForNodes(DiscoveryNodes nodes) {
    for (DiscoveryNode node : nodes) {
        nodeHistory.put(node.getId(), dummy);
This might need to be putIfAbsent so we don't over-write any actual current NodeConnectionHistory entries right?
I'm not sure. My read was these two calls would come from cluster state changing to add or remove nodes from this table. Inclusion is controlled by these calls, which unconditionally add or remove entries. The close callback has to be careful to check if it has an entry that's valid: this protects against long-running callbacks inserting garbage into the table.
The DiscoveryNodes passed to connectToNodes contains all the nodes in the cluster, including any existing ones, so if there's a node which already exists in the cluster, and is currently disconnected, then it will have an entry in nodeHistory which isn't dummy that this line will overwrite on any cluster state update. So yeah I think putIfAbsent is what we want here.
I get it now -- for whatever reason, I thought it was passing in the deltas, but it's obvious from connectToNodes that the node connections service is doing that calculation
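A sketch of the putIfAbsent variant discussed above, using the field names from the quoted code:

    void reserveConnectionHistoryForNodes(DiscoveryNodes nodes) {
        for (DiscoveryNode node : nodes) {
            // keep any real disconnection record for an already-known node that is currently disconnected
            nodeHistory.putIfAbsent(node.getId(), dummy);
        }
    }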
});
}

void reserveConnectionHistoryForNodes(DiscoveryNodes nodes) {
nit: I wonder if this should be called something like startTrackingConnectionHistory (and the other method stop...), the "reserving" language seems like an implementation detail leaking?
I do like the implementation though, nice approach to fixing the race.
NodeConnectionHistory nodeConnectionHistory = nodeHistory.get(node.getId());
if (nodeConnectionHistory != null) {
    nodeHistory.replace(node.getId(), nodeConnectionHistory, dummy);
}
This looks a little racy, although in practice I think it's fine because ClusterConnectionManager protects against opening multiple connections to the same node concurrently. Still, if we did all this (including the logging) within a nodeHistory.compute(node.getId, ...) then there'd obviously be no races.
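A sketch of the compute-based variant suggested here, so the read, compare, and replace happen atomically (names taken from the quoted code; the logging detail is illustrative only):

    nodeHistory.compute(node.getId(), (nodeId, previous) -> {
        if (previous != null && previous != dummy && previous.ephemeralId.equals(node.getEphemeralId())) {
            // same peer process as before the disconnect: the earlier close was a network problem
            logger.warn("reopened transport connection to node [{}] which did not restart", node.descriptionWithoutAttributes());
        }
        return dummy; // the node is connected again
    });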
void removeConnectionHistoryForNodes(Set<DiscoveryNode> nodes) {
    final int startSize = nodeHistory.size();
    for (DiscoveryNode node : nodes) {
        nodeHistory.remove(node.getId());
There's kind of an implicit invariant here that org.elasticsearch.cluster.NodeConnectionsService.ConnectionHistory#nodeHistory and org.elasticsearch.cluster.NodeConnectionsService#targetsByNode have the same keys. At the very least we should be able to assert this. I also wonder if we should be calling nodeHistory.retainAll() to make it super-clear that we are keeping these keysets aligned.
But then that got me thinking, maybe we should be tracking the connection history of each target node in ConnectionTarget rather than trying to maintain two parallel maps. Could that work?
This is a great idea... ConnectionTarget has exactly the lifecycle needed. I think because I moved it from elsewhere and am having a rough week over here, this didn't occur to me.
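A sketch of where this ends up, matching the later revisions quoted below: each ConnectionTarget carries its own record, so there is no second map to keep aligned with targetsByNode.

    private class ConnectionTarget {
        // access is synchronized by the service mutex; null while the node is connected
        @Nullable
        private DisconnectionHistory disconnectionHistory;
        // ... existing connection-tracking fields and methods ...
    }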
}

@Override
public void onNodeDisconnected(DiscoveryNode node, Transport.Connection connection) {
I just spotted we're already executing this in a close-listener, but one that runs under ActionListener.running(...) so it drops the exception. I think it'd be nicer to adjust this callback to take a @Nullable Exception e parameter rather than having to add a second close listener just to pick up the exception as done here.
- consolidated ConnectionHistory into ConnectionTarget, protected with the service mutex
- added logging test for reconnection with and without exception
- grew TransportConnectionListener onNodeDisconnected to include a nullable exception
- reverted ClusterConnectionManager tests and logging
I think this addresses everything so far. The test doesn't check on the internals of ConnectionTarget -- but no other tests do. I do have a more complete test I can adapt. One question/issue that came up earlier in the discussion of this PR was around the lifecycle of

So far, I've been careful to use the service's view for storage and retrieval, and the connection's view for comparison. It's particularly awkward in this implementation. I am wondering if this makes sense to David: whether this idea/concern is valid, and whether it works out if this is true. I am also hoping to do a real-world test, or have something with more complete integration.
          
It's created (randomly) once during node startup, see
The
    
Yep I think this is going to work. A few superficial comments but otherwise this looks ready to start working on some testing.
 * Called once a node connection is closed and unregistered.
 */
default void onNodeDisconnected(DiscoveryNode node, Transport.Connection connection) {}
default void onNodeDisconnected(DiscoveryNode node, Transport.Connection connection, @Nullable Exception closeException) {}
👍 while we're at it I think connection is unused, we could drop that here too. Bit odd to share the connection with the callback after it's closed. Can be done in a separate PR tho - if you did this first then there'd be much less noise in this one.
void handleClose(@Nullable Exception e) {
    connectedNodes.remove(node, conn);
    connectionListener.onNodeDisconnected(node, conn, e);
    managerRefs.decRef();
It was like this already, but it strikes me that this decRef should be in a finally just in case some future onNodeDisconnected implementation throws an exception (can be done in a follow-up)
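A sketch of that follow-up, wrapping the existing body so the decRef always runs even if a listener implementation throws:

    void handleClose(@Nullable Exception e) {
        try {
            connectedNodes.remove(node, conn);
            connectionListener.onNodeDisconnected(node, conn, e);
        } finally {
            managerRefs.decRef();
        }
    }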
});
}

record ConnectionHistory(String ephemeralId, long disconnectTime, Exception disconnectCause) {}
Maybe DisconnectionHistory? Also I think we don't need the ephemeralId any more, since targetsByNode is keyed by DiscoveryNode (and hence by ephemeralId).
Also I'd prefer the time field to be named disconnectTimeMillis - most absolute times are indeed in milliseconds in this codebase but it isn't guaranteed and we've had unit-confusion bugs in the past that this naming convention would have avoided.
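With those suggestions applied, the record would look roughly like this, which is the shape the later revisions in this PR converge on:

    record DisconnectionHistory(long disconnectTimeMillis, @Nullable Exception disconnectCause) {}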
private final AtomicInteger consecutiveFailureCount = new AtomicInteger();
private final AtomicReference<Releasable> connectionRef = new AtomicReference<>();

// access is synchronized by the service mutex
👍
/**
 * Receives connection/disconnection events from the transport, and records in per-node ConnectionHistory
 * structures for logging network issues. ConnectionHistory records are stored in ConnectionTargets.
Nit: this comment is documenting the class, not the constructor, so it should be before the private class line above.
Argh, my python docstring habits are getting the better of me :)
this.threadPool = threadPool;
this.transportService = transportService;
this.reconnectInterval = NodeConnectionsService.CLUSTER_NODE_RECONNECT_INTERVAL_SETTING.get(settings);
this.connectionHistoryListener = new ConnectionHistoryListener();
Nit: I don't think we need to keep hold of the connectionHistoryListener in a field here. Also it's best not to subscribe this to things until the constructor has returned. So maybe this should be:
- this.connectionHistoryListener = new ConnectionHistoryListener();
+ transportService.addConnectionListener(new ConnectionHistoryListener())
dropping the constructor in ConnectionHistoryListener
I gave this a try, and do prefer your suggestion.
It turned out that all the test failures were from the roughly 1000 tests that instantiate NodeConnectionsService with a null TransportService.
I went for a different solution that calls into addConnectionListener during doStart, removes it in doStop, and does need to store the listener as a field to support this.
👍 tho there's no need to remove the listener in doStop
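A sketch of the doStart-based registration described above; the field and listener names are assumptions based on later comments in this PR, and the rest of doStart is omitted:

    @Override
    protected void doStart() {
        // registering here rather than in the constructor keeps tests that pass a null TransportService working
        transportService.addConnectionListener(connectionChangeListener);
    }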
if (connectionHistory.disconnectCause != null) {
    logger.warn(
        () -> format(
            "reopened transport connection to node [%s] "
nit: this branch uses a concatenated format string but the other branch has a single multi-line string. I prefer the latter, but either way I'd prefer to be consistent
Thanks also for the notes about ephemeral ids. I understood the lifecycle aspect, but was confused about how a peer node hears about it. Specifically, I was confused about how it knows the ephemeral id before opening a connection, because I expected you'd need the connection to know the ephemeral id. The chicken-egg aspect of the cluster state now seems obvious -- it wouldn't be in cluster state if the peer hadn't heard from it... I can clean up some aspects of the discovery node and id retrieval; it's overly verbose now.
- updated ConnectionChangeListener constructor and moved registration to service.doStart()
- renamed DisconnectionHistory record and ConnectionChangeListener
- fixed up DiscoveryNode vs. Transport.Connection.getNode() confusion (these are the same)
- fixed log formatting
- edited TransportConnectionListener interface to take a nullable closeException instead of the Transport.Connection
- completed test of DisconnectionHistory, at init, post-connection, post-disconnection, and post-reconnection
- moved docs for ConnectionChangeListener, and added docs for DisconnectionHistory
 * Called once a node connection is closed and unregistered.
 */
default void onNodeDisconnected(DiscoveryNode node, Transport.Connection connection) {}
default void onNodeDisconnected(DiscoveryNode node, @Nullable Exception closeException) {}
I mentioned earlier that we could consider pulling this API change out to a separate PR. As things stand I now think we should definitely do that - it's a simple refactoring (needs no test changes) and will make this change much more focussed.
// exposed for testing
protected ConnectionTarget connectionTargetForNode(DiscoveryNode node) {
I don't think we need to expose the whole ConnectionTarget out to tests - we could just allow access to the DisconnectionHistory for a node and keep the ConnectionTarget class private.
private final AtomicReference<Releasable> connectionRef = new AtomicReference<>();

// access is synchronized by the service mutex
protected DisconnectionHistory disconnectionHistory = null;
nit: suggest marking this nullable (and describing what the null value means)
- protected DisconnectionHistory disconnectionHistory = null;
+ @Nullable // if node is connected
+ protected DisconnectionHistory disconnectionHistory = null;
MockTransport transport = new MockTransport(deterministicTaskQueue.getThreadPool());
TestTransportService transportService = new TestTransportService(transport, deterministicTaskQueue.getThreadPool());
Should probably use threadPool here too rather than creating distinct threadpools for each service.
public void testDisconnectionHistory() {
    final Settings.Builder settings = Settings.builder();
    settings.put(CLUSTER_NODE_RECONNECT_INTERVAL_SETTING.getKey(), "100ms");
We're using a DeterministicTaskQueue so we are simulating the passage of time, no need to set a short retry interval here. I'd suggest dropping this and just using the default (via CLUSTER_NODE_RECONNECT_INTERVAL_SETTING.get(Settings.EMPTY).millis()) below. We could also randomly pick a different value:
    final long reconnectIntervalMillis;
    if (randomBoolean()) {
        reconnectIntervalMillis = randomLongBetween(1, 100000);
        settings.put(CLUSTER_NODE_RECONNECT_INTERVAL_SETTING.getKey(), TimeValue.timeValueMillis(reconnectIntervalMillis));
    } else {
        reconnectIntervalMillis = CLUSTER_NODE_RECONNECT_INTERVAL_SETTING.get(Settings.EMPTY).millis();
    }

It's not really the point of this test tho, so I think the default would be fine.
Level.WARN,
"reopened transport connection to node ["
    + exceptionalClose.descriptionWithoutAttributes()
    + "] which disconnected exceptionally [*ms] ago "
We control the passage of time so we can assert that the reconnect happens exactly one reconnect interval later:
-   + "] which disconnected exceptionally [*ms] ago "
+   + "] which disconnected exceptionally ["
+   + CLUSTER_NODE_RECONNECT_INTERVAL_SETTING.get(Settings.EMPTY).millis()
+   + "ms] ago "
        
          
(review comment on server/src/test/java/org/elasticsearch/cluster/NodeConnectionsServiceTests.java, resolved)
        
assertDisconnectionHistoryDetails(service, threadPool, gracefulClose, null);
assertDisconnectionHistoryDetails(service, threadPool, exceptionalClose, RuntimeException.class);

runTasksUntil(deterministicTaskQueue, 200);
I'd prefer this to be deterministicTaskQueue.getCurrentTimeMillis() + ${RECONNECT_INTERVAL} - it just happens that today we start the clock at zero.
Oh I see -- I completely misread that as a time duration
() -> format(
    """
        reopened transport connection to node [%s] \
        which disconnected exceptionally [%dms] ago but did not \
I have a slight preference for including both the number of milliseconds and a human-readable representation of the time, see e.g. org.elasticsearch.action.support.SubscribableListener#scheduleTimeout. Sometimes these things may be minutes/hours long and it's hard to eyeball such large timespans in terms of milliseconds.
- which disconnected exceptionally [%dms] ago but did not \
+ which disconnected exceptionally [%s/%dms] ago but did not \
…disruption-reconnects_rebase-after-refactoring
- expose only DisconnectionHistory instead of ConnectionTarget as protected
- Nullable annotation on ConnectionTarget's DisconnectionHistory field
- log seconds and milliseconds since last connect, as #s/#ms
- in test, re-use thread pool, narrowed try-block for log checking, used default reconnection period, and updated logs to test for time formatting
Looks great, just one comment about the human-readable times in the message
() -> format(
    """
        reopened transport connection to node [%s] \
        which disconnected exceptionally [%ds/%dms] ago but did not \
Ah this could be minutes/hours/days too, not just seconds - we should convert to a TimeValue and use its toString().
Oh I see I got lost in understanding what TimeValue puts out. Thanks!
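A sketch of the TimeValue-based formatting being discussed, with assumed local variable names:

    long millisSinceDisconnect = threadPool.absoluteTimeInMillis() - disconnectionHistory.disconnectTimeMillis();
    logger.warn(
        "reopened transport connection to node [{}] which disconnected exceptionally [{}/{}ms] ago but did not restart",
        node.descriptionWithoutAttributes(),
        TimeValue.timeValueMillis(millisSinceDisconnect),  // TimeValue#toString is human-readable, e.g. 37.5m
        millisSinceDisconnect
    );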
LGTM great stuff
final long reconnectIntervalMillis = CLUSTER_NODE_RECONNECT_INTERVAL_SETTING.get(Settings.EMPTY).millis();
final TimeValue reconnectIntervalTimeValue = TimeValue.timeValueMillis(reconnectIntervalMillis);
nit: suggest doing these in the opposite order:
- final long reconnectIntervalMillis = CLUSTER_NODE_RECONNECT_INTERVAL_SETTING.get(Settings.EMPTY).millis();
- final TimeValue reconnectIntervalTimeValue = TimeValue.timeValueMillis(reconnectIntervalMillis);
+ final TimeValue reconnectIntervalTimeValue = CLUSTER_NODE_RECONNECT_INTERVAL_SETTING.get(Settings.EMPTY);
+ final long reconnectIntervalMillis = reconnectIntervalTimeValue.millis();
public long getDisconnectTimeMillis() {
    return disconnectTimeMillis;
}

public Exception getDisconnectCause() {
    return disconnectCause;
}
Nit: These methods do not seem necessary for a record class?
Ah, thanks for this -- I did look up making record fields public after accessing them from the test, but for some reason came away thinking I needed to write my own accessors. This helped me realize that I can just do disconnectTimeMillis() instead...
/**
 * Receives connection/disconnection events from the transport, and records them in per-node DisconnectionHistory
 * structures for logging network issues. DisconnectionHistory records are stored their node's ConnectionTarget.
Nit: DisconnectionHistory records are stored their node's ConnectionTarget, should it be "... stored in their node's ..."
LGTM
- fixed comment typo in ConnectionChangeListener
- removed hand-written accessors for DisconnectionHistory for public defaults
- corrected settings access for TimeValue/millis