Fix race condition in `InternalTestCluster.getMasterName()` #127985

JeremyDahlgren · 2025-05-09T16:33:48Z

Prevents assertion errors when the master node is null after obtaining the cluster state in the second call.
See PRs #127213 and #127891 for details about recent changes to this method, and where the code used in this commit was proposed.

Per #127213 getMasterName() was deemed unsafe to call in unstable clusters and scenarios where there could be election jitter. Updating the javadoc to indicate the stable cluster requirement for these methods would have been a good addition, but it likely would not prevent future usages of the methods in such unstable test scenarios. The utility method added in serverless PR 3833 is an example of code added after #127213, and led to a failing test seen in serverless issue 3867. It was suggested in the PR under review for that issue that the root cause of the NPE should be fixed.

There have also been other recent failing tests where the PRs to resolve them required refactoring to avoid the use of getMasterName(), serverless issue 3865 and #127523. It is likely we'll run into more test failures, there currently are 233 calls to getMasterName(). I haven't gone through all of them to see under what cluster conditions it is being called. It might make sense to apply David's suggested change (submitted in this PR) to avoid future failures. Note that if the current cluster state satisfies the predicate (a non-null master node) the listener will be completed immediately.

Prevents assertion errors when the master node is null after obtaining the cluster state in the second call. See PRs 127213 and 127891 for details about recent changes to this method, and where the code used in this commit was proposed.

elasticsearchmachine · 2025-05-09T16:34:12Z

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

nielsbauman · 2025-05-09T18:36:48Z

What exactly would be the value of this change? I understand that we're trying to avoid test failures where tests use getMasterName in unstable clusters, but I don't think this is the right way.

IMO, it makes sense that getMasterName shouldn't be called in unstable clusters, simply because we would have very little confidence that the returned value is the "actual" master at that point. If we make tests explicitly wait for a stable cluster, the intent of the test will be much clearer and reduce flakiness.

I think #127891 was valuable in changing the exception message to something a lot more understandable. Other than adding documentation to this method that it shouldn't be used in unstable clusters, I don't immediately see much more we can do to avoid people from using this method in unstable clusters.

DaveCTurner · 2025-05-10T08:03:50Z

Yeah I agree with Niels, it's trappy if getMasterName() returns the name of a node that was observed to be the master but then has stopped being the master by the time the method returns, which is what these test failures indicate is happening. I'd rather not hide that trappiness as proposed here.

Instead, if a test expects (i.e. causes) the cluster to be unstable, it should be explicitly waiting for stabilisation before starting to think about identifying the master.

JeremyDahlgren · 2025-05-12T13:46:16Z

What exactly would be the value of this change? I understand that we're trying to avoid test failures where tests use getMasterName in unstable clusters, but I don't think this is the right way.

No value, will close, thank you and David for your clear explanations. I will reference the original comment from when this issue was discusssed in 127213 and this PR when needed.

JeremyDahlgren requested review from DaveCTurner, nielsbauman, pxsalehi and ywangd May 9, 2025 16:35

JeremyDahlgren closed this May 12, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix race condition in `InternalTestCluster.getMasterName()` #127985

Fix race condition in `InternalTestCluster.getMasterName()` #127985

Uh oh!

JeremyDahlgren commented May 9, 2025 •

edited

Loading

Uh oh!

elasticsearchmachine commented May 9, 2025

Uh oh!

nielsbauman commented May 9, 2025

Uh oh!

DaveCTurner commented May 10, 2025

Uh oh!

JeremyDahlgren commented May 12, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Fix race condition in InternalTestCluster.getMasterName() #127985

Fix race condition in InternalTestCluster.getMasterName() #127985

Uh oh!

Conversation

JeremyDahlgren commented May 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

elasticsearchmachine commented May 9, 2025

Uh oh!

nielsbauman commented May 9, 2025

Uh oh!

DaveCTurner commented May 10, 2025

Uh oh!

JeremyDahlgren commented May 12, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Fix race condition in `InternalTestCluster.getMasterName()` #127985

Fix race condition in `InternalTestCluster.getMasterName()` #127985

JeremyDahlgren commented May 9, 2025 •

edited

Loading