Contributor

@joshua-adams-1 joshua-adams-1 commented Sep 5, 2025

Extends the `Coordinator` so that we don't prematurely close the connection to a joining node. This prevents a `node-join: [{}] with reason [{}]; for troubleshooting guidance, see {}` WARN log being emitted unnecessarily.

Closes #126192

Jira Ticket - ES-11449


As a note, this change was previously deployed in #132023. However, the assertion `assert ThreadPool.assertCurrentThreadPool(MasterService.MASTER_UPDATE_THREAD_NAME);` caused a number of tests to fail, including `DataTierAllocationDeciderIT`. The issue was that `FailedToCommitClusterStateException`s were being thrown incorrectly by nodes that were not the master. ES-13061 resolved this by replacing them with `NotMasterException`s.

The assertion statement is still present (since it should hold).
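
For readers unfamiliar with this style of check, here is a minimal sketch of a thread-name assertion. It is not the Elasticsearch implementation: the class `ThreadNameAssertionSketch`, the helpers `isOnNamedThread` and `runOn`, the bracketed-marker naming convention, and the `master-update` marker are all assumptions made for illustration.

```java
// Hypothetical sketch; none of these names come from Elasticsearch.
public class ThreadNameAssertionSketch {

    // Returns true if the current thread's name contains the marker in brackets,
    // e.g. "node-0[master-update][T#1]" contains "[master-update]".
    static boolean isOnNamedThread(String expectedMarker) {
        return Thread.currentThread().getName().contains("[" + expectedMarker + "]");
    }

    // Runs the check on a freshly created thread with the given name and
    // returns the result to the caller.
    static boolean runOn(String threadName, String expectedMarker) {
        boolean[] result = new boolean[1];
        Thread t = new Thread(() -> result[0] = isOnNamedThread(expectedMarker), threadName);
        t.start();
        try {
            t.join(); // join() makes the write to result[0] visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for check", e);
        }
        return result[0];
    }
}
```

A check like this only holds if every caller really does run on the expected thread, which is why throwing an exception from an unexpected code path made the assertion trip.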

I have run both the `NodeJoiningIT` and `DataTierAllocationDeciderIT` suites 1000+ times.

@joshua-adams-1 joshua-adams-1 self-assigned this Sep 5, 2025
@joshua-adams-1 joshua-adams-1 added >non-issue :Distributed Coordination/Distributed and removed v9.2.0 labels Sep 5, 2025
@joshua-adams-1 joshua-adams-1 marked this pull request as ready for review September 5, 2025 13:19
@joshua-adams-1 joshua-adams-1 requested a review from a team as a code owner September 5, 2025 13:19
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination label Sep 5, 2025
@joshua-adams-1 joshua-adams-1 added v9.2.0 and removed Team:Distributed Coordination labels Sep 5, 2025
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination label Sep 5, 2025
Comment on lines 685 to 688
// NB we are on the master update thread here at the end of processing the failed cluster state update, so this
// all happens before any cluster state update that re-elects a master
// assert ThreadPool.assertCurrentThreadPool(MasterService.MASTER_UPDATE_THREAD_NAME);

Contributor

This comment is currently a lie, and we definitely shouldn't have any commented-out code like this. At the very least we should delete these lines.

But the issue here is that this indicates that we're misusing `FailedToCommitClusterStateException`: we throw it from places where the cluster state update has not even been published, which means we can be certain its effects won't appear in some future state. In those cases we don't need to delay the completion of the join listener. It's kind of ok, but would you investigate whether we can reasonably fix these to use a different exception (e.g. `NotMasterException`) and tighten up the meaning of `FailedToCommitClusterStateException` so that it can only happen if we're unsure of the outcome?
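
To make the distinction concrete, here is a minimal sketch of the idea, not the actual `Coordinator` code. It assumes two hypothetical exception types standing in for the real ones and a hypothetical `JoinCompletionSketch` class: a definite failure completes the join listener immediately, while an uncertain outcome defers completion until the outcome is known.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical stand-ins for illustration only; these are NOT the Elasticsearch classes.
public class JoinCompletionSketch {

    // Definite failure: the update was never published, so it cannot appear in a later state.
    static class NotMasterSketchException extends RuntimeException {
        NotMasterSketchException(String message) {
            super(message);
        }
    }

    // Uncertain outcome: the update may still take effect in some future state.
    static class FailedToCommitSketchException extends RuntimeException {
        FailedToCommitSketchException(String message) {
            super(message);
        }
    }

    // Listener completions that must wait until the outcome of the update is known.
    private final List<Runnable> deferred = new ArrayList<>();

    // Completes the listener now for definite failures, later for uncertain ones.
    void onJoinFailure(RuntimeException failure, Consumer<Exception> joinListener) {
        if (failure instanceof FailedToCommitSketchException) {
            deferred.add(() -> joinListener.accept(failure));
        } else {
            joinListener.accept(failure);
        }
    }

    // Called once the outcome is known; runs and clears all deferred completions.
    void onOutcomeKnown() {
        List<Runnable> toRun = new ArrayList<>(deferred);
        deferred.clear();
        toRun.forEach(Runnable::run);
    }

    int deferredCount() {
        return deferred.size();
    }
}
```

Under this reading, throwing the "uncertain" exception from a path where nothing was published only adds an unnecessary delay, which matches the "kind of ok" assessment above.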

Contributor

> would you investigate whether we can reasonably fix these

If so, that should be a separate PR which lands ahead of this one.

Contributor Author

@joshua-adams-1 joshua-adams-1 Sep 9, 2025

> At the very least we should delete these lines.

Apologies! I clearly commented out the code to run the tests and then forgot to delete it (insert face palm here).

> investigate whether we can reasonably fix these to use a different exception

To confirm, this is to investigate where we're throwing `FailedToCommitClusterStateException` and ensuring they're only thrown from appropriate places? I can do that.

Contributor

> ensuring they're only thrown from appropriate places

Yep

Contributor

@DaveCTurner DaveCTurner left a comment

LGTM 🎉

@joshua-adams-1 joshua-adams-1 merged commit 7439976 into elastic:main Oct 9, 2025
34 checks passed
@joshua-adams-1 joshua-adams-1 deleted the master-node-disconnect branch October 9, 2025 16:11

Labels

:Distributed Coordination/Distributed >non-issue Team:Distributed Coordination v9.3.0

Development

Successfully merging this pull request may close these issues.

Master node disconnects from joining node too early during re-election

3 participants