-
Notifications
You must be signed in to change notification settings - Fork 25.6k
[Test] Wait for clsuter to form before assertions #130973
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 1 commit
8018a91
5445ef6
ce1452c
c84e96a
c039174
5ded2bd
84945a8
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||||
|---|---|---|---|---|---|---|---|---|
|
|
@@ -96,7 +96,8 @@ public void testPreferFullMasterOverVotingOnlyNodes() throws Exception { | |||||||
| internalCluster().setBootstrapMasterNodeIndex(0); | ||||||||
| internalCluster().startNodes(2); | ||||||||
| internalCluster().startNode(addRoles(Set.of(DiscoveryNodeRole.VOTING_ONLY_NODE_ROLE))); | ||||||||
| internalCluster().startDataOnlyNodes(randomInt(2)); | ||||||||
| final int numDataNodes = randomInt(2); | ||||||||
| internalCluster().startDataOnlyNodes(numDataNodes); | ||||||||
| assertBusy( | ||||||||
| () -> assertThat( | ||||||||
| clusterAdmin().prepareState(TEST_REQUEST_TIMEOUT).get().getState().getLastCommittedConfiguration().getNodeIds().size(), | ||||||||
|
|
@@ -108,6 +109,7 @@ public void testPreferFullMasterOverVotingOnlyNodes() throws Exception { | |||||||
| internalCluster().stopCurrentMasterNode(); | ||||||||
| awaitMasterNode(); | ||||||||
| assertNotEquals(originalMaster, internalCluster().getMasterName()); | ||||||||
| ensureStableCluster(2 + numDataNodes); // wait for all nodes to join | ||||||||
|
||||||||
| if (autoManageMasterNodes) { | |
| validateClusterFormed(); | |
| } |
That makes sense in general because if we're not auto-bootstrapping the test cluster we cannot expect it to be fully-formed after each node starts. But in these tests we can expect the cluster to be fully formed after we've finished starting all the nodes, so we should call that method explicitly ourselves there. Importantly, I think we should do that before stopping the original master node, since the new master node should not spuriously drop any of the other nodes from the cluster when it takes over.
I think I'd like us to rework these tests slightly so that rather than an assertBusy() on the current state having the right size of configuration we use ESIntegTestCase#awaitClusterState(Predicate<ClusterState>) to wait for a state which has the right voting configuration and the right number of nodes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated as suggested in ce1452c
Please let me know if it looks right to you. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The update also includes fix for a new but similar nature failure #130979
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't correct. A publish quorum is some subset of the master-eligible nodes in the cluster (e.g. 2 of the 3 masters) which is much weaker than what this does. Here we're waiting for all the nodes in the cluster (i.e. in
ClusterState#nodes).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I meant the nodes could still be in a different cluster state version after this method is returned, is this true? But you are right that the nodes should all have the same master. The trouble here is that there could still be nodes trying to join the cluster when this returns. In that case, it does not guarantee the "all nodes", which is more than
ClusterState#nodes, to see the same master. Technically it is the current behaviour, but kindas defeats its purpose in such situation?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The
setWaitForEvents(Priority.LANGUID)means that there was a point in time during the execution of this method where there was a committed cluster state and all the nodes in that cluster state were on the same cluster state version (or stuff was timing out, but that doesn't happen in these tests). It can't guarantee that another publication hasn't started before the method returns ofc.You're right that there may also be nodes still trying to join the cluster at this point, but they're not in the cluster (they haven't joined it yet). In practice, it's up to the caller to account for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reverted changes to the comment in 5445ef6. We can update it separately (if at all). Thanks!