Master Node Disconnect #134213

joshua-adams-1 · 2025-09-05T11:26:04Z

Extends the Coordinator so that we don't prematurely close the connection to a joining node. This prevents a `node-join: [{}] with reason [{}]; for troubleshooting guidance, see {}` WARN log from being emitted unnecessarily.

Closes #126192

Jira Ticket - ES-11449
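
As a rough illustration of the idea in the description (keep the connection to a joining node open until the join outcome is known), here is a minimal sketch. It assumes a reference-counted connection handle; `ConnectionHoldSketch`, `RefCountedConnection`, `acquire` and `releaseInitial` are hypothetical names invented for this example and are not the Coordinator's actual API.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative only: hypothetical types, not Elasticsearch code.
final class ConnectionHoldSketch {

    // A connection that stays open while at least one reference is held.
    static final class RefCountedConnection {
        // Starts at 1: the reference held by whoever opened the connection.
        private final AtomicInteger refs = new AtomicInteger(1);

        boolean isOpen() {
            return refs.get() > 0;
        }

        // Takes an extra reference; the returned action releases it at most once.
        Runnable acquire() {
            refs.incrementAndGet();
            AtomicBoolean released = new AtomicBoolean();
            return () -> {
                if (released.compareAndSet(false, true)) {
                    refs.decrementAndGet();
                }
            };
        }

        // Drops the opener's initial reference.
        void releaseInitial() {
            refs.decrementAndGet();
        }
    }

    public static void main(String[] args) {
        RefCountedConnection connection = new RefCountedConnection();
        Runnable releaseJoinHold = connection.acquire(); // held while the join is in flight
        connection.releaseInitial();                      // the join request itself has been handled
        System.out.println("open while join pending: " + connection.isOpen());
        releaseJoinHold.run();                            // join outcome is now known
        System.out.println("open after join completes: " + connection.isOpen());
    }
}
```

The point of the sketch is only the ordering: the extra reference is released when the join completes, not when the request handler returns.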

As a note, this change was previously deployed in #132023. However, the assertion `assert ThreadPool.assertCurrentThreadPool(MasterService.MASTER_UPDATE_THREAD_NAME);` was failing in a number of tests, including `DataTierAllocationDeciderIT`. The issue was that `FailedToCommitClusterStateException`s were being thrown incorrectly by nodes that were not the master. ES-13061 resolved this by replacing them with `NotMasterException`s.

The assertion statement is still present (since it should hold).
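
For readers unfamiliar with this kind of check, here is a minimal sketch of a thread-name-based assertion. It is not the `ThreadPool.assertCurrentThreadPool` implementation: the class `ThreadNameAssertionSketch`, its helper methods, the constant's value and the thread-name format are all assumptions made for illustration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative only: a hypothetical stand-in, not the Elasticsearch implementation.
final class ThreadNameAssertionSketch {

    // Assumed name fragment carried by the master update thread.
    static final String MASTER_UPDATE_THREAD_NAME = "masterService#updateTask";

    // Returns true if the current thread's name contains the expected fragment.
    static boolean isCurrentThread(String expectedNameFragment) {
        return Thread.currentThread().getName().contains(expectedNameFragment);
    }

    // Runs the check on a single-threaded executor whose thread has the given name.
    static boolean checkOnThreadNamed(String threadName) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> new Thread(r, threadName));
        try {
            return CompletableFuture
                .supplyAsync(() -> isCurrentThread(MASTER_UPDATE_THREAD_NAME), executor)
                .join();
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        // The thread-name formats below are assumptions for the example.
        System.out.println(checkOnThreadNamed("node-0[" + MASTER_UPDATE_THREAD_NAME + "][T#1]"));
        System.out.println(checkOnThreadNamed("node-0[generic][T#1]"));
    }
}
```

A check like this fails as soon as the surrounding code is reached from an unexpected thread, which is how the earlier attempt surfaced the exceptions thrown by non-master nodes.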

I have run both the NodeJoiningIT and DataTierAllocationDeciderIT suites 1000+ times.


elasticsearchmachine · 2025-09-05T13:19:51Z

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

DaveCTurner · 2025-09-09T12:43:38Z

server/src/main/java/org/elasticsearch/cluster/coordination/Coordinator.java

+                            // NB we are on the master update thread here at the end of processing the failed cluster state update, so this
+                            // all happens before any cluster state update that re-elects a master
+                            // assert ThreadPool.assertCurrentThreadPool(MasterService.MASTER_UPDATE_THREAD_NAME);
+


This comment is currently a lie, and we definitely shouldn't have any commented-out code like this. At the very least we should delete these lines.

But the issue here is that this indicates that we're misusing FailedToCommitClusterStateException and throwing it from places where the cluster state update has not even been published, which means we can be certain its effects won't appear in some future state. In those cases we don't need to delay the completion of the join listener. It's kind of ok, but would you investigate whether we can reasonably fix these to use a different exception (e.g. NotMasterException) and tighten up the meaning of FailedToCommitClusterStateException so that it can only happen if we're unsure of the outcome?
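
To make the distinction in this comment concrete, here is a minimal sketch under stated assumptions. `NotMasterLikeException` and `FailedToCommitLikeException` are hypothetical stand-ins for the two exception types, and `onJoinFailure` / `onNextStateApplied` are invented hooks; none of this is the Coordinator's real code. The sketch completes the join listener immediately when the update was never published, and defers completion only when the outcome is uncertain.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: every type here is a hypothetical stand-in, not Elasticsearch code.
final class JoinFailureHandlingSketch {

    // Stand-in: the node was not master, so the update was never published.
    static final class NotMasterLikeException extends RuntimeException {
        NotMasterLikeException(String message) {
            super(message);
        }
    }

    // Stand-in: publication started, but whether it committed is unknown.
    static final class FailedToCommitLikeException extends RuntimeException {
        FailedToCommitLikeException(String message) {
            super(message);
        }
    }

    private final List<Runnable> deferred = new ArrayList<>();
    final List<String> events = new ArrayList<>();

    // Called when the cluster state update for a join fails.
    void onJoinFailure(String nodeName, RuntimeException failure) {
        Runnable complete = () -> events.add("completed join listener for " + nodeName);
        if (failure instanceof FailedToCommitLikeException) {
            // Outcome uncertain: wait for a later state before completing the listener.
            deferred.add(complete);
            events.add("deferred " + nodeName);
        } else {
            // The update cannot appear in a future state, so complete right away.
            complete.run();
        }
    }

    // Called once a subsequent cluster state has been applied.
    void onNextStateApplied() {
        List<Runnable> toRun = new ArrayList<>(deferred);
        deferred.clear();
        toRun.forEach(Runnable::run);
    }

    public static void main(String[] args) {
        JoinFailureHandlingSketch sketch = new JoinFailureHandlingSketch();
        sketch.onJoinFailure("node-a", new NotMasterLikeException("not master"));
        sketch.onJoinFailure("node-b", new FailedToCommitLikeException("publication failed"));
        sketch.onNextStateApplied();
        sketch.events.forEach(System.out::println);
    }
}
```

If the uncertain-outcome exception is also thrown before publication, the deferral branch runs in cases that don't need it, which is the misuse the comment asks to investigate.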

> would you investigate whether we can reasonably fix these

If so, that should be a separate PR which lands ahead of this one.

> At the very least we should delete these lines.

Apologies! I clearly commented out the code to run the tests and then forgot to delete it (insert face palm here).

> investigate whether we can reasonably fix these to use a different exception

To confirm, this is to investigate where we're throwing the FailedToCommitClusterStateException and ensuring they're only thrown from appropriate places? I can do that.

> ensuring they're only thrown from appropriate places

Yep


DaveCTurner

LGTM 🎉

joshua-adams-1 self-assigned this Sep 5, 2025

elasticsearchmachine added the v9.2.0 label Sep 5, 2025

joshua-adams-1 added the >non-issue and :Distributed Coordination/Distributed labels and removed the v9.2.0 label Sep 5, 2025

[CI] Auto commit changes from spotless

3487710

joshua-adams-1 requested a review from DaveCTurner September 5, 2025 13:19

joshua-adams-1 marked this pull request as ready for review September 5, 2025 13:19

joshua-adams-1 requested a review from a team as a code owner September 5, 2025 13:19

elasticsearchmachine added the Team:Distributed Coordination label Sep 5, 2025

joshua-adams-1 added the v9.2.0 label and removed the Team:Distributed Coordination label Sep 5, 2025

elasticsearchmachine added the Team:Distributed Coordination label Sep 5, 2025

DaveCTurner reviewed Sep 9, 2025


This was referenced Sep 18, 2025

Change FailedToCommitClusterStateException to NotMasterException #135008

Merged

Update exception messages #135017

Merged

This was referenced Sep 30, 2025

Changes FailedToCommitClusterStateException to NotMasterException #135548

Merged

Replace pre publication failed to commit cluster state exceptions #135706

Merged

elasticsearchmachine added v9.3.0 and removed v9.2.0 labels Oct 2, 2025

This was referenced Oct 2, 2025

Remove FailedToCommitClusterStateException Check #135846

Merged

Replace FailedToCommitClusterStateException with NotMasterException #136083

Merged

joshua-adams-1 added 4 commits October 8, 2025 16:21

Merge branch 'main' into master-node-disconnect

233cb32

Merge branch 'master-node-disconnect' of https://github.com/joshua-ad…

8770a25

…ams-1/elasticsearch into master-node-disconnect

Merge branch 'main' into master-node-disconnect

a11759b

Uncomments out assertion

6881f0f

joshua-adams-1 requested a review from DaveCTurner October 9, 2025 10:56

DaveCTurner approved these changes Oct 9, 2025


joshua-adams-1 merged commit 7439976 into elastic:main Oct 9, 2025
34 checks passed

joshua-adams-1 deleted the master-node-disconnect branch October 9, 2025 16:11
