Skip to content

Conversation

@JeremyDahlgren
Copy link
Contributor

@JeremyDahlgren JeremyDahlgren commented May 9, 2025

Prevents assertion errors when the master node is null after obtaining the cluster state in the second call.
See PRs #127213 and #127891 for details about recent changes to this method, and where the code used in this commit was proposed.

Per #127213 getMasterName() was deemed unsafe to call in unstable clusters and scenarios where there could be election jitter. Updating the javadoc to indicate the stable cluster requirement for these methods would have been a good addition, but it likely would not prevent future usages of the methods in such unstable test scenarios. The utility method added in serverless PR 3833 is an example of code added after #127213, and led to a failing test seen in serverless issue 3867. It was suggested in the PR under review for that issue that the root cause of the NPE should be fixed.

There have also been other recent failing tests where the PRs to resolve them required refactoring to avoid the use of getMasterName(), serverless issue 3865 and #127523. It is likely we'll run into more test failures, there currently are 233 calls to getMasterName(). I haven't gone through all of them to see under what cluster conditions it is being called. It might make sense to apply David's suggested change (submitted in this PR) to avoid future failures. Note that if the current cluster state satisfies the predicate (a non-null master node) the listener will be completed immediately.

Prevents assertion errors when the master node is null after obtaining
the cluster state in the second call.  See PRs 127213 and 127891 for
details about recent changes to this method, and where the code used in
this commit was proposed.
@JeremyDahlgren JeremyDahlgren added >test Issues or PRs that are addressing/adding tests :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. Team:Distributed Coordination Meta label for Distributed Coordination team v9.1.0 labels May 9, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@nielsbauman
Copy link
Contributor

What exactly would be the value of this change? I understand that we're trying to avoid test failures where tests use getMasterName in unstable clusters, but I don't think this is the right way.

IMO, it makes sense that getMasterName shouldn't be called in unstable clusters, simply because we would have very little confidence that the returned value is the "actual" master at that point. If we make tests explicitly wait for a stable cluster, the intent of the test will be much clearer and reduce flakiness.

I think #127891 was valuable in changing the exception message to something a lot more understandable. Other than adding documentation to this method that it shouldn't be used in unstable clusters, I don't immediately see much more we can do to avoid people from using this method in unstable clusters.

@DaveCTurner
Copy link
Contributor

Yeah I agree with Niels, it's trappy if getMasterName() returns the name of a node that was observed to be the master but then has stopped being the master by the time the method returns, which is what these test failures indicate is happening. I'd rather not hide that trappiness as proposed here.

Instead, if a test expects (i.e. causes) the cluster to be unstable, it should be explicitly waiting for stabilisation before starting to think about identifying the master.

@JeremyDahlgren
Copy link
Contributor Author

What exactly would be the value of this change? I understand that we're trying to avoid test failures where tests use getMasterName in unstable clusters, but I don't think this is the right way.

No value, will close, thank you and David for your clear explanations. I will reference the original comment from when this issue was discusssed in 127213 and this PR when needed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. Team:Distributed Coordination Meta label for Distributed Coordination team >test Issues or PRs that are addressing/adding tests v9.1.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants