15290: Log Unexpected Disconnects #127736
Conversation
Previously, exceptions encountered on a Netty channel were caught and logged at some level, but never passed to the TcpChannel or Transport.Connection close listeners, which limited observability. This change propagates those exceptions: TcpChannel.onException and NodeChannels.closeAndFail now report them, and the corresponding close listeners receive them. Some test infrastructure (FakeTcpChannel) and some assertions in close-listener onFailure methods have been updated accordingly.
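As a rough illustration of the consumer side (not the PR's actual code; it assumes the existing TcpChannel.addCloseListener and ActionListener APIs, and ChannelCloseObserver is a made-up name), a close listener can now observe the failure that caused the close:

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import org.elasticsearch.action.ActionListener;
import org.elasticsearch.transport.TcpChannel;

final class ChannelCloseObserver {
    private static final Logger logger = LogManager.getLogger(ChannelCloseObserver.class);

    static void watch(TcpChannel channel) {
        channel.addCloseListener(new ActionListener<Void>() {
            @Override
            public void onResponse(Void ignored) {
                logger.debug("channel [{}] closed normally", channel);
            }

            @Override
            public void onFailure(Exception e) {
                // Previously this path was effectively unreachable; with this change
                // it carries the exception that caused the channel to close.
                logger.warn("channel [" + channel + "] closed exceptionally", e);
            }
        });
    }
}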
ClusterConnectionManager now caches the previous ephemeralId (which is scoped to a node's process lifetime) of peer nodes when they disconnect. On reconnect, if a peer presents the same ephemeralId as in the previous connection, this is logged, since an unchanged ephemeralId means the process did not restart and the disconnect was most likely a network failure. The ephemeralId table is garbage-collected every hour, with entries older than an hour removed.
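A minimal sketch of that check (field and method names here are illustrative, not the PR's actual implementation; it assumes DiscoveryNode's getId() and getEphemeralId() accessors):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import org.elasticsearch.cluster.node.DiscoveryNode;

final class DisconnectHistory {
    private static final Logger logger = LogManager.getLogger(DisconnectHistory.class);

    // node id -> ephemeralId seen when the node last disconnected
    private final Map<String, String> lastEphemeralIdByNode = new ConcurrentHashMap<>();

    void nodeDisconnected(DiscoveryNode node) {
        lastEphemeralIdByNode.put(node.getId(), node.getEphemeralId());
    }

    void nodeConnected(DiscoveryNode node) {
        final String previous = lastEphemeralIdByNode.remove(node.getId());
        if (previous != null && previous.equals(node.getEphemeralId())) {
            // Same process reconnecting: the earlier disconnect was not a restart,
            // so it most likely indicates a network-level failure.
            logger.warn("reconnected to node [{}] with the same ephemeral id as before its disconnect", node);
        }
    }
}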
Nice idea, I like it.
I suggest breaking out the change to the close listener into a separate PR. It deserves its own tests etc., and in case of a bug it'd help to be able to bisect between these two parts of the change.
@Override
public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) throws Exception {
    Netty4TcpChannel channel = ctx.channel().attr(CHANNEL_KEY).get();
    if (cause instanceof Error) {
We should also use org.elasticsearch.ExceptionsHelper#maybeDieOnAnotherThread here - if an Error is thrown then we cannot continue and must not just quietly suppress it.
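Something along these lines, sketched as the handler method only; the exact placement and the TcpChannel.onException signature are assumed here rather than taken from the PR:

@Override
public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) throws Exception {
    // If the cause is a fatal Error we cannot recover; this helper ensures it is
    // not silently swallowed.
    ExceptionsHelper.maybeDieOnAnotherThread(cause);
    final Netty4TcpChannel channel = ctx.channel().attr(CHANNEL_KEY).get();
    if (channel != null) {
        // Attribute the failure to the channel so its close listeners can see it.
        channel.onException(cause instanceof Exception e ? e : new Exception(cause));
    }
    ctx.close();
}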
I'm very new to how exceptions are reported here. The maybeDieOnAnotherThread helper is invoked a few lines below -- it's snuck in on line 329 in this PR.
One observation: these changes shouldn't modify how the exception is handled (I almost did in a few places!). This commit is more about pinning the exception to the channel, so that if the channel is closed there is some attribution for why.
Does this explanation address things? I really barely know what that helper does!
private void collectHistoryGarbage() {
    final long now = System.currentTimeMillis();
    final long hour = 60 * 60 * 1000;
I don't think this should be time-based. I'd rather we dropped the entry when the node is removed from the cluster membership (unexpected membership changes are already logged appropriately - see JoinReasonService). If the node remains in the cluster membership then we should report an unexpected reconnection no matter how long it's been disconnected.
(If we were to make this time-based then we should make the timeout configurable via a setting rather than hard-coded at 1 hour).
In principle we could also just make it size-based, expiring the oldest entries to limit the size of the map to no more than (say) double the size of the cluster. That's what JoinReasonService does, because there we cannot fall back on something stable like the cluster membership. But here we can, so I think we should use that precision.
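A sketch of the membership-based alternative (the disconnectHistory map and the DisconnectHistoryCleaner class are hypothetical; it assumes the existing ClusterStateListener API):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.elasticsearch.cluster.ClusterChangedEvent;
import org.elasticsearch.cluster.ClusterStateListener;
import org.elasticsearch.cluster.node.DiscoveryNode;

final class DisconnectHistoryCleaner implements ClusterStateListener {
    // node id -> ephemeralId recorded at disconnect time (hypothetical)
    private final Map<String, String> disconnectHistory = new ConcurrentHashMap<>();

    @Override
    public void clusterChanged(ClusterChangedEvent event) {
        // Drop history entries only when the node actually leaves the cluster
        // membership, instead of after a fixed timeout.
        for (DiscoveryNode removed : event.nodesDelta().removedNodes()) {
            disconnectHistory.remove(removed.getId());
        }
    }
}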
Good idea!
Thanks for reviewing this so promptly, David -- I know everyone's trying to cram in work before EAH next week. There are a few things that I'm wondering about:
This change makes network errors visible at higher levels (TcpChannel, Transport.Connection) by passing the exceptions that cause channels to close to the close listeners' onFailure methods.
This pull request has two commits:
the first enables exception visibility in the close listeners' onFailure() methods and fixes the areas of code and tests that assumed onFailure would never run. TcpChannels now have an onException() method for reporting an exception, and this reporting has been added in several places where exceptions are received (the inbound handler, channel initialization). Some test code (FakeTcpChannel) was updated to match the real implementation and to adjust the tested assumptions, and some production code that asserted onFailure would never fire has been changed to ignore the exception or log it (a rough sketch of this channel-side bookkeeping follows the list below).
the second records disconnect and reconnect events inside ClusterConnectionManager and logs the case where a reconnecting node has the same ephemeralId as a recent disconnect, which implies that the process did not restart but that the network connection dropped. Network exceptions are logged, and the connection history is garbage-collected every hour.
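For illustration only (the names and structure here are hypothetical, not the PR's implementation), the channel-side bookkeeping from the first commit could look roughly like this: record the first reported exception and fail the close listeners with it when the channel closes.

import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicReference;
import org.elasticsearch.action.ActionListener;

final class ChannelCloseState {
    private final AtomicReference<Exception> closeCause = new AtomicReference<>();
    private final List<ActionListener<Void>> closeListeners = new CopyOnWriteArrayList<>();

    void onException(Exception e) {
        closeCause.compareAndSet(null, e);   // remember the first reported cause
    }

    void addCloseListener(ActionListener<Void> listener) {
        closeListeners.add(listener);
    }

    void notifyClosed() {
        final Exception cause = closeCause.get();
        for (ActionListener<Void> listener : closeListeners) {
            if (cause == null) {
                listener.onResponse(null);    // clean close
            } else {
                listener.onFailure(cause);    // exceptional close, now observable
            }
        }
    }
}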
Still to do:
Closes #125290