15290: Log Unexpected Disconnects #127736
Conversation
Previously, exceptions encountered on a Netty channel were caught and logged at some level, but never passed to the TcpChannel or Transport.Connection close listeners, which limited observability. This change propagates those exceptions: TcpChannel.onException and NodeChannels.closeAndFail now report them, and the corresponding close listeners receive them. Some test infrastructure (FakeTcpChannel) and some assertions in close-listener onFailure methods have been updated accordingly.
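As a rough illustration of the consumer side (not the PR's actual code; it assumes the existing TcpChannel.addCloseListener and ActionListener APIs, and ChannelCloseObserver is a made-up name), a close listener can now observe the failure that caused the close:

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import org.elasticsearch.action.ActionListener;
import org.elasticsearch.transport.TcpChannel;

final class ChannelCloseObserver {
    private static final Logger logger = LogManager.getLogger(ChannelCloseObserver.class);

    static void watch(TcpChannel channel) {
        channel.addCloseListener(new ActionListener<Void>() {
            @Override
            public void onResponse(Void ignored) {
                logger.debug("channel [{}] closed normally", channel);
            }

            @Override
            public void onFailure(Exception e) {
                // Previously this path was effectively unreachable; with this change
                // it carries the exception that caused the channel to close.
                logger.warn("channel [" + channel + "] closed exceptionally", e);
            }
        });
    }
}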
ClusterConnectionManager now caches the previous ephemeralId (which is scoped to a node's process lifetime) of peer nodes when they disconnect. On reconnect, if a peer presents the same ephemeralId as in the previous connection, this is logged, since an unchanged ephemeralId means the process did not restart and the disconnect was most likely a network failure. The ephemeralId table is garbage-collected every hour, with entries older than an hour removed.
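A minimal sketch of that check (field and method names here are illustrative, not the PR's actual implementation; it assumes DiscoveryNode's getId() and getEphemeralId() accessors):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import org.elasticsearch.cluster.node.DiscoveryNode;

final class DisconnectHistory {
    private static final Logger logger = LogManager.getLogger(DisconnectHistory.class);

    // node id -> ephemeralId seen when the node last disconnected
    private final Map<String, String> lastEphemeralIdByNode = new ConcurrentHashMap<>();

    void nodeDisconnected(DiscoveryNode node) {
        lastEphemeralIdByNode.put(node.getId(), node.getEphemeralId());
    }

    void nodeConnected(DiscoveryNode node) {
        final String previous = lastEphemeralIdByNode.remove(node.getId());
        if (previous != null && previous.equals(node.getEphemeralId())) {
            // Same process reconnecting: the earlier disconnect was not a restart,
            // so it most likely indicates a network-level failure.
            logger.warn("reconnected to node [{}] with the same ephemeral id as before its disconnect", node);
        }
    }
}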
Nice idea, I like it.
I suggest breaking out the change to the close listener into a separate PR. It deserves its own tests etc., and in case of a bug it'd help to be able to bisect between these two parts of the change.
@Override
public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) throws Exception {
    Netty4TcpChannel channel = ctx.channel().attr(CHANNEL_KEY).get();
    if (cause instanceof Error) {
We should also use org.elasticsearch.ExceptionsHelper#maybeDieOnAnotherThread here - if an Error is thrown then we cannot continue and must not just quietly suppress it.
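Something along these lines, sketched as the handler method only; the exact placement and the TcpChannel.onException signature are assumed here rather than taken from the PR:

@Override
public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) throws Exception {
    // If the cause is a fatal Error we cannot recover; this helper ensures it is
    // not silently swallowed.
    ExceptionsHelper.maybeDieOnAnotherThread(cause);
    final Netty4TcpChannel channel = ctx.channel().attr(CHANNEL_KEY).get();
    if (channel != null) {
        // Attribute the failure to the channel so its close listeners can see it.
        channel.onException(cause instanceof Exception e ? e : new Exception(cause));
    }
    ctx.close();
}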
I'm very new to how exceptions are reported here. The maybeDieOnAnotherThread helper is invoked a few lines below -- it's snuck in on line 329 in this PR.
One observation: these changes shouldn't modify how the exception is handled (I almost did in a few places!). This commit is more about pinning the exception to the channel, so that if the channel is closed there is some attribution for why.
Does this explanation address things? I really barely know what that helper does!
private void collectHistoryGarbage() {
    final long now = System.currentTimeMillis();
    final long hour = 60 * 60 * 1000;
I don't think this should be time-based. I'd rather we dropped the entry when the node is removed from the cluster membership (unexpected membership changes are already logged appropriately - see JoinReasonService). If the node remains in the cluster membership then we should report an unexpected reconnection no matter how long it's been disconnected.
(If we were to make this time-based then we should make the timeout configurable via a setting rather than hard-coded at 1 hour).
In principle we could also just make it size-based, expiring the oldest entries to limit the size of the map to no more than (say) double the size of the cluster. That's what JoinReasonService does, because there we cannot fall back on something stable like the cluster membership. But here we can, so I think we should use that precision.
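A sketch of the membership-based alternative (the disconnectHistory map and the DisconnectHistoryCleaner class are hypothetical; it assumes the existing ClusterStateListener API):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.elasticsearch.cluster.ClusterChangedEvent;
import org.elasticsearch.cluster.ClusterStateListener;
import org.elasticsearch.cluster.node.DiscoveryNode;

final class DisconnectHistoryCleaner implements ClusterStateListener {
    // node id -> ephemeralId recorded at disconnect time (hypothetical)
    private final Map<String, String> disconnectHistory = new ConcurrentHashMap<>();

    @Override
    public void clusterChanged(ClusterChangedEvent event) {
        // Drop history entries only when the node actually leaves the cluster
        // membership, instead of after a fixed timeout.
        for (DiscoveryNode removed : event.nodesDelta().removedNodes()) {
            disconnectHistory.remove(removed.getId());
        }
    }
}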
Good idea!
Thanks for reviewing this so promptly, David -- I know everyone's trying to cram in work before EAH next week. There are a few things that I'm wondering about:
This change makes network errors visible at higher levels (TcpChannel, Transport.Connection) by passing the exceptions that cause channels to close to the close listeners' onFailure methods.
This pull request has two commits:
the first enables exception visibility in the close listeners' onFailure() methods and fixes the areas of code and tests that assumed onFailure would never run. TcpChannels now have an onException() method for reporting an exception, and this reporting has been added in several places where exceptions are received (the inbound handler, channel initialization). Some test code (FakeTcpChannel) was updated to match the real implementation and to adjust the tested assumptions, and some production code that asserted onFailure would never fire has been changed to ignore the exception or log it (a rough sketch of this channel-side bookkeeping follows the list below).
the second records disconnect and reconnect events inside ClusterConnectionManager and logs the case where a reconnecting node has the same ephemeralId as a recent disconnect, which implies that the process did not restart but that the network connection dropped. Network exceptions are logged, and the connection history is garbage-collected every hour.
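For illustration only (the names and structure here are hypothetical, not the PR's implementation), the channel-side bookkeeping from the first commit could look roughly like this: record the first reported exception and fail the close listeners with it when the channel closes.

import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicReference;
import org.elasticsearch.action.ActionListener;

final class ChannelCloseState {
    private final AtomicReference<Exception> closeCause = new AtomicReference<>();
    private final List<ActionListener<Void>> closeListeners = new CopyOnWriteArrayList<>();

    void onException(Exception e) {
        closeCause.compareAndSet(null, e);   // remember the first reported cause
    }

    void addCloseListener(ActionListener<Void> listener) {
        closeListeners.add(listener);
    }

    void notifyClosed() {
        final Exception cause = closeCause.get();
        for (ActionListener<Void> listener : closeListeners) {
            if (cause == null) {
                listener.onResponse(null);    // clean close
            } else {
                listener.onFailure(cause);    // exceptional close, now observable
            }
        }
    }
}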
Still to do:
Closes #125290