-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Master Node Disconnect #134213
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Master Node Disconnect #134213
Conversation
Extends the Coordinator so that we don't prematurely close the connection to a joining node. This prevents a `node-join: [{}] with reason [{}]; for troubleshooting guidance, see {}` WARN log being emitted unnecessarily. Closes elastic#126192 Jira Ticket - ES-11449
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination) |
// NB we are on the master update thread here at the end of processing the failed cluster state update, so this | ||
// all happens before any cluster state update that re-elects a master | ||
// assert ThreadPool.assertCurrentThreadPool(MasterService.MASTER_UPDATE_THREAD_NAME); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment is currently a lie, and we definitely shouldn't have any commented-out code like this. At the very least we should delete these lines.
But the issue here is that this indicates that we're misusing FailedToCommitClusterStateException
and throwing it from places where the cluster state update has not even been published, which means we can be certain its effects won't appear in some future state. In those cases we don't need to delay the completion of the join listener. It's kind of ok, but would you investigate whether we can reasonably fix these to use a different exception (e.g. NotMasterException
) and tighten up the meaning of FailedToCommitClusterStateException
so that it can only happen if we're unsure of the outcome?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would you investigate whether we can reasonably fix these
If so, that should be a separate PR which lands ahead of this one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At the very least we should delete these lines.
Apologies! I clearly commented out the code to run the tests and then forgot to delete them (insert face palm here)
investigate whether we can reasonably fix these to use a different exception
To confirm, this is to investigate where we're throwing the FailedToCommitClusterStateException
and ensuring they're only thrown from appropriate places? I can do that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ensuring they're only thrown from appropriate places
Yep
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 🎉
Extends the Coordinator so that we don't prematurely close the connection to a joining node. This prevents a
node-join: [{}] with reason [{}]; for troubleshooting guidance, see {}
WARN log being emitted unnecessarily.Closes #126192
Jira Ticket - ES-11449
As a note, this change was previously deployed here, #132023. However, the assertion
assert ThreadPool.assertCurrentThreadPool(MasterService.MASTER_UPDATE_THREAD_NAME);
was failing a number of tests, includingDataTierAllocationDeciderIT
. The issue was thatFailedToCommitClusterStateExceptions
were being thrown incorrectly by nodes that were not the master. ES-13061 resolved this, by replacing them withNotMasterExceptions
.The assertion statement is still present (since it should hold).
I have run both the
NodeJoiningIt
andDataTierAllocationDeciderIT
suites 1000+ times.