Conversation

@ywangd (Member) commented Jul 10, 2025

The original cluster should be properly formed before the tests kick off. This PR ensures that.

Relates: #129118
Resolves: #130883
Resolves: #130979

Nodes that are not part of the publish quorum may not have applied the latest cluster state update when awaitMasterNode returns. This PR fixes that by waiting for the desired number of nodes.
@ywangd requested review from DaveCTurner and pxsalehi July 10, 2025 03:17
@ywangd added the >test, :Distributed Coordination/Cluster Coordination, and v9.2.0 labels Jul 10, 2025
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@elasticsearchmachine added the Team:Distributed Coordination label Jul 10, 2025
internalCluster().stopCurrentMasterNode();
awaitMasterNode();
assertNotEquals(originalMaster, internalCluster().getMasterName());
ensureStableCluster(2 + numDataNodes); // wait for all nodes to join
Contributor

This weakens the test slightly more than I'd like. The trouble here is in fact that these tests run with autoManageMasterNodes = false, so they skip over the call to validateClusterFormed() when starting nodes here:

if (autoManageMasterNodes) {
    validateClusterFormed();
}

That makes sense in general because if we're not auto-bootstrapping the test cluster we cannot expect it to be fully formed after each node starts. But in these tests we can expect the cluster to be fully formed once we've finished starting all the nodes, so we should call that method explicitly ourselves there. Importantly, I think we should do that before stopping the original master node, since the new master node should not spuriously drop any of the other nodes from the cluster when it takes over.

I think I'd like us to rework these tests slightly: rather than an assertBusy() on the current state having the right size of configuration, we should use ESIntegTestCase#awaitClusterState(Predicate<ClusterState>) to wait for a state which has the right voting configuration and the right number of nodes.
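
For illustration, a minimal sketch of that rework. The expected voting configuration size of 3 and the numDataNodes variable are assumptions taken from the surrounding test context, and awaitClusterState is the helper named above:

// Validate formation explicitly before stopping the original master, since
// autoManageMasterNodes = false means it was skipped at node startup.
internalCluster().validateClusterFormed();

internalCluster().stopCurrentMasterNode();
awaitMasterNode();

// Wait for a committed state with the expected voting configuration and
// node count, rather than polling with assertBusy().
awaitClusterState(
    state -> state.coordinationMetadata().getLastCommittedConfiguration().getNodeIds().size() == 3
        && state.nodes().size() == 2 + numDataNodes
);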

Member Author

Updated as suggested in ce1452c
Please let me know if it looks right to you. Thanks!

Member Author

The update also includes a fix for a new failure of a similar nature, #130979.


  /**
-  * Waits for all nodes in the cluster to have a consistent view of which node is currently the master.
+  * Waits for all nodes forming the publish quorum in the cluster to have a consistent view of which node is currently the master.
Contributor

This isn't correct. A publish quorum is some subset of the master-eligible nodes in the cluster (e.g. 2 of the 3 masters), which is much weaker than what this method does. Here we're waiting for all the nodes in the cluster (i.e. in ClusterState#nodes).

Member Author

Thanks! I meant that the nodes could still be on different cluster state versions after this method returns; is that true? But you are right that the nodes should all have the same master. The trouble here is that there could still be nodes trying to join the cluster when this returns. In that case, it does not guarantee that "all nodes" (which is more than ClusterState#nodes) see the same master. Technically that is the current behaviour, but it kind of defeats the method's purpose in such a situation?

Contributor

The setWaitForEvents(Priority.LANGUID) means that there was a point in time during the execution of this method where there was a committed cluster state and all the nodes in that cluster state were on the same cluster state version (or stuff was timing out, but that doesn't happen in these tests). It can't guarantee that another publication hasn't started before the method returns, of course.

You're right that there may also be nodes still trying to join the cluster at this point, but they're not in the cluster (they haven't joined it yet). In practice, it's up to the caller to account for this.
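
For reference, a sketch of the kind of wait being described, as it might appear in test code (a minimal example; the exact request parameters used by the method are not shown in this thread):

// Blocks until a cluster state has been committed and applied, with the
// pending-task queue drained down to LANGUID (lowest) priority. It says
// nothing about publications that start after the call returns.
clusterAdmin().prepareHealth(TEST_REQUEST_TIMEOUT)
    .setWaitForEvents(Priority.LANGUID)
    .get();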

Member Author

I reverted changes to the comment in 5445ef6. We can update it separately (if at all). Thanks!

@ywangd requested a review from DaveCTurner July 10, 2025 08:34

internalCluster().stopCurrentMasterNode();
awaitMasterNode();
internalCluster().validateClusterFormed();
Contributor

I'd rather this remained as just an awaitMasterNode(). That should be sufficient here...

Member Author

I think it is theoretically possible that the data node fails to join the new master, so that awaitMasterNode() returns without seeing the data node. If we later ask the data node (via a randomly chosen client) about its master, it could fail with an NPE. That said, it is not a practical concern for this test, so I reverted to just awaitMasterNode().

Contributor

> theoretically possible that the data node fails to join the new master

No, that shouldn't be possible (i.e. I'd want the test to fail if that happens). The new master will start from the state that the old master last committed, which will include the data node, so the data node will automatically be part of the new master's cluster. The new master would only remove the data node from its cluster if something positively fails (e.g. network disconnect).

equalTo(false)
);
final ClusterState state = clusterAdmin().prepareState(TEST_REQUEST_TIMEOUT).get().getState();
assertThat(state.nodes().size(), equalTo(2 + numDataNodes));
Contributor

... except for this new assertion, which I think we should revert. This requires us to wait for the old master to be removed, and that shouldn't be necessary to make the more important assertion that the elected master is not voting-only.
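
A sketch of the assertion pared back as suggested; the role constant DiscoveryNodeRole.VOTING_ONLY_NODE_ROLE and the surrounding test context are assumptions, not the exact test code:

final ClusterState state = clusterAdmin().prepareState(TEST_REQUEST_TIMEOUT).get().getState();
// The assertion that matters: the elected master is a full master node,
// not a voting-only one.
assertThat(
    state.nodes().getMasterNode().getRoles().contains(DiscoveryNodeRole.VOTING_ONLY_NODE_ROLE),
    equalTo(false)
);
// No check on state.nodes().size(): the old master may not have been
// removed from the cluster yet, and waiting for that isn't necessary.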

Member Author

Good point. I removed this.


// start a fresh full master node, which will be brought into the cluster as master by the voting-only nodes
final String newMaster = internalCluster().startNode();
internalCluster().validateClusterFormed();
Contributor

I think we should revert this too, it's stronger than needed. InternalTestCluster#getMasterName already waits for the new master to be elected, and that's enough.
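
That is, a sketch of the reverted form; the equality assertion is an assumption mirroring the test's expectation that the fresh full master node wins the election:

// start a fresh full master node; getMasterName() itself waits for a master
// to be elected, so an explicit validateClusterFormed() is not needed here
final String newMaster = internalCluster().startNode();
assertThat(internalCluster().getMasterName(), equalTo(newMaster));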

Member Author

I think my reply above applies here as well. Thanks!

Contributor

@DaveCTurner left a comment

LGTM

@ywangd (Member Author) commented Jul 10, 2025

This has been a good case for me to dig a bit deeper into the cluster coordination area. Thanks a lot for the review and discussion!

@ywangd added the auto-merge-without-approval label Jul 10, 2025
@ywangd changed the title from "[Test] Wait for stable cluster before assertion" to "[Test] Wait for cluster to form before assertions" Jul 10, 2025
@elasticsearchmachine merged commit e2ec28d into elastic:main Jul 10, 2025
33 checks passed
@ywangd deleted the es-130883-fix branch July 10, 2025 11:23
mridula-s109 pushed a commit to mridula-s109/elasticsearch that referenced this pull request Jul 17, 2025

The original cluster should be properly formed before the tests kick off. This PR ensures that.

Relates: elastic#129118 Resolves: elastic#130883 Resolves: elastic#130979