Conversation

@joshua-adams-1 joshua-adams-1 commented Jul 28, 2025

Extends the `Coordinator` so that we don't prematurely close the connection to a joining node. This prevents a `node-join: [{}] with reason [{}]; for troubleshooting guidance, see {}` WARN log from being emitted unnecessarily.

Closes #126192

Jira Ticket - ES-11449

Extends the `onFailure` method of the callback in
`Coordinator.handleJoinRequest` so that it only fails if the node is not
already in the `ClusterState`. This avoids a rare corner case where a
master node's `ClusterState` already contains a node that is attempting
to join the cluster, causing an incorrect error message to be logged.
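
A minimal sketch of that guard, reusing `joinRequest` and `joinListener` from the surrounding callback (`connectionReleasable` is a hypothetical name; the merged change is more involved and also defers closing the connection, as discussed in the review below):

```java
@Override
public void onFailure(Exception e) {
    if (clusterService.state().nodes().nodeExists(joinRequest.getSourceNode().getId())) {
        // The joining node is already in the applied cluster state, so the join
        // effectively succeeded: keep the connection open and skip the WARN log.
        joinListener.onResponse(null);
    } else {
        // Genuine failure: release the connection and propagate the exception.
        Releasables.close(connectionReleasable);
        joinListener.onFailure(e);
    }
}
```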
@joshua-adams-1 joshua-adams-1 self-assigned this Jul 28, 2025
@joshua-adams-1 joshua-adams-1 added the >non-issue, :Distributed Coordination/Cluster Coordination (Cluster formation and cluster state publication, including cluster membership and fault detection), and v9.2.0 labels Jul 28, 2025
@joshua-adams-1 joshua-adams-1 changed the title from Master node disconnect to [WIP] Master node disconnect Jul 28, 2025
@joshua-adams-1 joshua-adams-1 force-pushed the master-node-disconnect branch from fbc0026 to 962b3b2 Compare July 28, 2025 17:31
Adds two new integration test suites, `NodeJoiningIT` and
`NodeJoiningMasterElectionIT`. These include tests covering the
coordinator logic that runs when a node joins the cluster.

There is also a proposed solution in the `Coordinator` class, currently
commented out. This will be uncommented in a follow-up commit.
@joshua-adams-1 joshua-adams-1 force-pushed the master-node-disconnect branch from dcfb910 to 3e501b4 Compare August 7, 2025 16:19

try {
ensureSufficientMasterEligibleNodes();
DiscoveryNode masterNode = internalCluster().clusterService().state().nodes().getMasterNode();
Member

Regarding test flakiness, one contributor could be where you get the info that you use in your assertions. Using `internalCluster().clusterService()` doesn't necessarily give you the most up-to-date value, since it might not fetch the instance from the current master node. This is documented on that method, but I want to make sure you have considered this random behaviour that this utility can have. You can fetch the master node instance by providing the master node name, or, similar to the test in `TransportMasterNodeActionIT` (which I assume this is partially based on), use a master client.
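
For example, a sketch of pinning the lookup to the elected master using the internal test cluster utilities:

```java
String masterName = internalCluster().getMasterName();
// Fetch the ClusterService instance belonging to the elected master node,
// rather than one from a randomly-chosen node.
ClusterService masterClusterService = internalCluster().getInstance(ClusterService.class, masterName);
DiscoveryNode masterNode = masterClusterService.state().nodes().getMasterNode();
```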

Member

Having tried that out briefly, I think the test failure is no longer due to the latch you mentioned but rather to the log expectation assertions. Hope that helps!

Member

(at least based on 20-something runs that used to get 8-10 failures in the latch check, which now only fails once later at the log expectation assertion)

@joshua-adams-1 joshua-adams-1 force-pushed the master-node-disconnect branch from d967904 to dac102a Compare August 11, 2025 14:30
Contributor

@DaveCTurner DaveCTurner left a comment

Unclear if this is ready for review or not - it's still marked as a draft but you've requested reviews.

I note that you're using force-push rather than merge:

Please don't do that once other folks have started to look at a PR unless absolutely unavoidable. It makes it much harder to track the history of changes against the comments.

* @param cleanupTasks The list of cleanup tasks
* @return A latch that will be released when the old master acknowledges the new master's election
*/
protected CountDownLatch configureElectionLatchForNewMaster(String newMaster, List<Releasable> cleanupTasks) {
Contributor

I think this'd be much simpler implemented with ClusterServiceUtils.addTemporaryStateListener. Not sure why we didn't do so when first written - I think it must have evolved to this from something more complex.
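
For instance, something like this sketch (where `newMasterName` is a hypothetical variable naming the node we expect to win the election):

```java
// Completes once this node applies a state in which newMasterName is master
SubscribableListener<Void> electionListener = ClusterServiceUtils.addTemporaryStateListener(
    internalCluster().clusterService(),
    state -> {
        DiscoveryNode master = state.nodes().getMasterNode();
        return master != null && newMasterName.equals(master.getName());
    }
);
// ... trigger the election ...
safeAwait(electionListener);
```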

Contributor Author

Good catch, have updated in a668ab1

* @param cleanupTasks The list of cleanup tasks
* @return A cyclic barrier which when awaited on will un-block the applier
*/
protected static CyclicBarrier blockClusterStateApplier(String nodeName, ArrayList<Releasable> cleanupTasks) {
Contributor

Hmm I don't think you need to pull this one up to the base class, it only has one caller.

Contributor Author

Updated in a668ab1

* @param cleanupTasks The list of cleanup tasks
* @return A latch that will be released when the master acknowledges its re-election
*/
protected CountDownLatch configureElectionLatchForReElectedMaster(
Contributor

Do you need a separate method here? In the new tests I think we can just wait for the term to increase, as observed by any node - no need to worry about which node is master or anything so fiddly.

Also this one is only called from one place. I think it's a premature optimization to generalize these two test suites like this when they have so little in common.
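
Waiting for the term to increase might look something like this sketch, reusing the `addTemporaryStateListener` utility suggested above (`originalTerm` is a hypothetical local captured before the disruption):

```java
long originalTerm = internalCluster().clusterService().state().metadata().coordinationMetadata().term();
// Any node will eventually observe the increased term once a new election completes
SubscribableListener<Void> termBumpedListener = ClusterServiceUtils.addTemporaryStateListener(
    internalCluster().clusterService(),
    state -> state.metadata().coordinationMetadata().term() > originalTerm
);
// ... trigger the re-election ...
safeAwait(termBumpedListener);
```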

Contributor Author

Updated in a668ab1

* master accepts its own failed state update before standing down, we can still
* establish a quorum without its (or our own) join.
*/
protected static String ensureSufficientMasterEligibleNodes() {
Contributor

The new tests don't need 5 master nodes, and indeed having 5 (rather than 3) makes the situation unnecessarily complex.

Contributor Author

Updated in a668ab1

Comment on lines 688 to 693
1. (T, V+1) is accepted -> NodeConnectionsService now stores an open connection to N. It can be closed.
2. (T, V+1) is rejected -> A new cluster state is published without N in it.
It is right to close the connection and retry.
3. The above scenario occurs. We do not close the connection after (T, V+1) fails and keep it open:
3.1 (T+1, V+2) is accepted -> By waiting, we did not close the connection to N unnecessarily
3.2 (T+1, V+2) is rejected -> A new cluster state is published without N in it. Closing is correct here.
Contributor

Nice, I think that makes sense and is much simpler than I anticipated.

Comment on lines 697 to 701
public void clusterChanged(ClusterChangedEvent event) {
// Now it's safe to close the connection
Releasables.close(response);
// Remove this listener to avoid memory leaks
clusterService.removeListener(this);
Contributor

This doesn't quite work though, because we apply a cluster state when the master fails too. The state we apply in that case differs from the last committed state only in the values of things which are local properties of the state, i.e. which do not have strong consistency guarantees, such as DiscoveryNodes#masterNodeId. I think we need to keep the connection open until we've applied the next committed state (i.e. one with a non-null master).

I did see the new tests fail in this way when run in a loop, but unfortunately I didn't have things configured to collect a build scan. I'm trying again.

Also we should only insert this listener when the update fails with a FailedToCommitClusterStateException - for all other outcomes there's no need to wait for another state to be committed, indeed we cannot expect another state to be committed any time soon if the cluster is stable.
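
Something along these lines, I think (a sketch of the suggestion rather than the actual diff; `response` is the connection `Releasable` from the surrounding code):

```java
if (e instanceof FailedToCommitClusterStateException) {
    // Defer closing the connection until the next *committed* state is applied,
    // i.e. one with a non-null master. The locally-applied no-master state that
    // follows a failed publication must not trigger the close.
    clusterService.addListener(new ClusterStateListener() {
        @Override
        public void clusterChanged(ClusterChangedEvent event) {
            if (event.state().nodes().getMasterNode() != null) {
                Releasables.close(response);
                clusterService.removeListener(this);
            }
        }
    });
} else {
    // For all other outcomes, close immediately as before.
    Releasables.close(response);
}
```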

Contributor Author

This makes sense. I've changed this logic to only close the connection when a committed state is seen, but how would I know whether a FailedToCommitClusterStateException is thrown?

Contributor

Rather than using ActionListener#runBefore, you'd typically capture the exception with ActionListener#delegateResponse. But in this case we're already capturing the success response with ActionListener#delegateFailure, so it's probably best to just create a completely new ActionListener that does the right thing with both success and exceptional responses.
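
Roughly this shape, I'd guess (a sketch; `waitForNextCommittedState` is a hypothetical helper standing in for the cluster-state-listener logic above):

```java
ActionListener<Void> publicationListener = new ActionListener<>() {
    @Override
    public void onResponse(Void ignored) {
        // The join was committed: keep the connection open and complete the join.
        joinListener.onResponse(null);
    }

    @Override
    public void onFailure(Exception e) {
        if (e instanceof FailedToCommitClusterStateException) {
            // Defer: the cluster-state listener decides when it is safe to
            // close the connection and how to complete the join listener.
            waitForNextCommittedState(e);
        } else {
            Releasables.close(response);
            joinListener.onFailure(e);
        }
    }
};
```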

Contributor Author

Updated in a668ab1


// exposed for tests
boolean missingJoinVoteFrom(DiscoveryNode node) {
logger.info("Missing vote from: {}", node.getName());
Contributor

Stray logging leftover from debugging? I don't think we want an INFO log here, it will raise support cases asking what it means.

Contributor Author

Updated in a668ab1

* @param name of the node whose existence should be verified
* @return <code>true</code> if the node exists. Otherwise <code>false</code>
*/
public boolean nodeExistsWithName(String name) {
Contributor

This is only used in tests, I don't think there's a need to have it in the production codebase.

Contributor Author

Updated in a668ab1

@joshua-adams-1
Contributor Author

Unclear if this is ready for review or not - it's still marked as a draft but you've requested reviews.

I note that you're using force-push rather than merge:

Please don't do that once other folks have started to look at a PR unless absolutely unavoidable. It makes it much harder to track the history of changes against the comments.

Hey! Sorry, no I wasn't expecting a review yet. I added you as a reviewer, but I didn't realise that action sent it for review; there's a button at the bottom of my PR saying Ready for Review, and I was under the impression that clicking that was what sent it for review. Still learning, sorry! Thank you for the comments, I'll get it cleaned up. And I shall stop force-pushing!

@joshua-adams-1 joshua-adams-1 changed the title from [WIP] Master node disconnect to Master node disconnect Aug 12, 2025
private String generateNodeDescriptionForNewDiscoveryNode(int numberOfNodesOriginallyInCluster, DiscoveryNode masterNode) {
// Nodes are named `node_s0`, `node_s1` etc ...
// Therefore, if there are N nodes in the cluster, named `node_s0` ... `node_sN-1`, N+1 will be named `node_sN`
String newNodeName = "node_s" + numberOfNodesOriginallyInCluster;
Contributor

This isn't correct, sometimes the new node will be of the form node_tN. There may be other possibilities too even today, I'm not sure, and we certainly don't know what'll happen in future. More generally, this is heavily overspecifying the conditions we actually want, and coupling this test to the implementation details of how nodes are named in the cluster. If we changed the format of these messages at all, we'd have to update this code too.

I think we should be asserting that we expect to see no messages at all from NodeJoinExecutor at level WARN. The content isn't really relevant.
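
With the `MockLog` test utility, that could look something like this sketch (the `startNode` call stands in for whatever action triggers the join):

```java
MockLog.assertThatLogger(
    () -> internalCluster().startNode(),  // action expected to emit no warnings
    NodeJoinExecutor.class,
    new MockLog.UnseenEventExpectation(
        "no node-join warnings",
        NodeJoinExecutor.class.getCanonicalName(),
        Level.WARN,
        "*"  // match any message: any WARN from this logger fails the test
    )
);
```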

Contributor Author

Good point, I have updated in cc67155

- Removes MasterElectionTestCase
- Reverts TransportMasterNodeActionIT
- Modifies NodeJoiningIT to use ClusterServiceUtils
rather than CountDownLatches
- Modifies the Coordinator to keep the connection open
only when there is a FailedToCommitClusterStateException,
and until the next committed cluster state update
@joshua-adams-1
Contributor Author

Thank you David for your help on this! Appreciate there's been a bit of churn.

Releasables.close(response);
joinListener.onFailure(e);
// Immediate condition check in case another node is elected master
if (clusterService.state().nodes().nodeExists(joinRequest.getSourceNode().getId())) {
Contributor

We also need to fail the join listener and remove the cluster-state listener if the last-applied state is committed (i.e. has a master node) and doesn't include the joining node.

Does it work to call clusterStateListener.clusterChanged(new ClusterChangedEvent("", clusterService.state(), clusterService.state()))?

Contributor

Does it work

Hmm no, not quite: that might complete `ll` twice, which is generally something we should try to avoid. It turns out that, as things are implemented today, `ll` will be a `SubscribableListener`, which has well-defined semantics when completed multiple times:

* If this listener is completed more than once then all results other than the first (whether successful or otherwise) are silently
* discarded. All subscribed listeners will be notified of the same result, exactly once, even if several completions occur concurrently.

However, we can't rely on that being true in future; it's not guaranteed that `ll` will always be a `SubscribableListener` in this context. I'd prefer we explicitly deduplicate this work, e.g. by creating another `SubscribableListener`.
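
i.e. something like this sketch of the deduplication idea:

```java
// A fresh SubscribableListener makes the at-most-once semantics explicit:
// however many times the two code paths complete it, ll is notified exactly once.
SubscribableListener<Void> completedOnce = new SubscribableListener<>();
completedOnce.addListener(ll);
// Both the cluster-state listener and the immediate check can now safely call
// completedOnce.onResponse(...) or completedOnce.onFailure(...).
```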

Contributor

However we can't rely on that being true in future

I contemplated making it not be true by adding a check that these listeners are not completed multiple times:

diff --git a/server/src/main/java/org/elasticsearch/action/support/SubscribableListener.java b/server/src/main/java/org/elasticsearch/action/support/SubscribableListener.java
index 9eddbb55b776..b7c8c4a28279 100644
--- a/server/src/main/java/org/elasticsearch/action/support/SubscribableListener.java
+++ b/server/src/main/java/org/elasticsearch/action/support/SubscribableListener.java
@@ -135,7 +135,7 @@ public class SubscribableListener<T> implements ActionListener<T> {
      */
     public static <T> SubscribableListener<T> newForked(CheckedConsumer<ActionListener<T>, ? extends Exception> fork) {
         final var listener = new SubscribableListener<T>();
-        ActionListener.run(listener, fork::accept);
+        ActionListener.run(ActionListener.assertOnce(listener), fork::accept);
         return listener;
     }

However, on reflection this seems unnecessarily strict, and indeed it causes the `SubscribableListener` test suite to fail because we actually already assert that `newForked` and `andThen` receive the returned listener instance, which is therefore safe to complete more than once. I think it's best to document this fact (see #133391), and then we can rely on it here too (so disregard my previous message).

safeAwait(publishingBanRemovedListener);
logger.info("Master publishing ban removed");
// Assert the master was re-elected
assertTrue(masterNodeName.equals(internalCluster().getMasterName()) && originalTerm < getTerm(masterNodeName));
Contributor

nit: slight preference for this being two assertThat calls: if this fails, we get no feedback about why, whereas if we use assertThat twice then we'll be able to see whether it was the master that changed (and it'll identify the new master) or whether the term didn't increase.
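
i.e. something like this (with Hamcrest's `equalTo` and `greaterThan` statically imported, and reusing the test's `getTerm` helper):

```java
// Failure output identifies the unexpected master or the non-increasing term
assertThat(internalCluster().getMasterName(), equalTo(masterNodeName));
assertThat(getTerm(masterNodeName), greaterThan(originalTerm));
```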

Comment on lines 708 to 714
// Another node was elected, and doesn't have the node in it
if (clusterService.state().nodes().getMasterNode() != null
&& clusterService.state().nodes().nodeExists(joinRequest.getSourceNode().getId()) == false) {
// Remove this listener to avoid memory leaks
clusterService.removeListener(clusterStateListener);
ll.onFailure(e);
}
Contributor

It seems like it'd be simpler to call clusterStateListener.clusterChanged directly here, constructing a "fake" ClusterChangedEvent to carry the current cluster state. I think we want the same logic both ways: particularly if the node is for some reason in the current cluster state at this point then we can complete ll successfully too.

Node N should join the cluster, but it should not be disconnected (#ES-11449)
*/
@TestLogging(reason = "test includes assertions about logging", value = "org.elasticsearch.cluster.coordination:INFO")
public void testNodeTriesToJoinClusterAndThenSameMasterIsElected() {
Contributor

Btw this test is a really good discriminator of the improvement, it failed for me 9 out of 10 times I ran it having reverted the behaviour change in Coordinator, and passed 10 out of 10 with the change in place. Great stuff.


if (discoveryNodes.nodeExists(joinRequest.getSourceNode().getId())) {
// Remove this listener to avoid memory leaks
clusterService.removeListener(this);
Contributor

Why have this within the if branches? It's on both branches so we call it either way, and I don't see a need to call it after calling nodeExists or anything like that?

Contributor

@DaveCTurner DaveCTurner left a comment

LGTM

@joshua-adams-1 joshua-adams-1 merged commit 656a7a9 into elastic:main Sep 3, 2025
33 checks passed
@joshua-adams-1 joshua-adams-1 deleted the master-node-disconnect branch September 3, 2025 08:06
joshua-adams-1 added a commit to joshua-adams-1/elasticsearch that referenced this pull request Sep 3, 2025
DaveCTurner added a commit that referenced this pull request Sep 3, 2025
joshua-adams-1 added a commit to joshua-adams-1/elasticsearch that referenced this pull request Sep 3, 2025
Unmutes `DataTierAllocationDeciderIT.testDesiredNodesAreTakenIntoAccountInAutoExpandReplicas`
and `DataTierAllocationDeciderIT.testShardsAreKeptInPreferredTierUntilTheNextTierIsInItsFinalState`
since the problematic commit elastic#132023 has been reverted
@joshua-adams-1 joshua-adams-1 mentioned this pull request Sep 3, 2025
joshua-adams-1 added a commit that referenced this pull request Sep 4, 2025
Unmutes `DataTierAllocationDeciderIT.testDesiredNodesAreTakenIntoAccountInAutoExpandReplicas`
and `DataTierAllocationDeciderIT.testShardsAreKeptInPreferredTierUntilTheNextTierIsInItsFinalState`
since the problematic commit #132023 has been reverted

Labels

:Distributed Coordination/Cluster Coordination (Cluster formation and cluster state publication, including cluster membership and fault detection), >non-issue, Team:Distributed Coordination (Meta label for Distributed Coordination team), v9.2.0

Development

Successfully merging this pull request may close these issues.

Master node disconnects from joining node too early during re-election