Fix race condition in InternalTestCluster.getMasterName()
#127985
Closed
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Prevents assertion errors when the master node is null after obtaining the cluster state in the second call.
See PRs #127213 and #127891 for details about recent changes to this method, and where the code used in this commit was proposed.
Per #127213
getMasterName()was deemed unsafe to call in unstable clusters and scenarios where there could be election jitter. Updating the javadoc to indicate the stable cluster requirement for these methods would have been a good addition, but it likely would not prevent future usages of the methods in such unstable test scenarios. The utility method added in serverless PR 3833 is an example of code added after #127213, and led to a failing test seen in serverless issue 3867. It was suggested in the PR under review for that issue that the root cause of the NPE should be fixed.There have also been other recent failing tests where the PRs to resolve them required refactoring to avoid the use of
getMasterName(), serverless issue 3865 and #127523. It is likely we'll run into more test failures, there currently are 233 calls togetMasterName(). I haven't gone through all of them to see under what cluster conditions it is being called. It might make sense to apply David's suggested change (submitted in this PR) to avoid future failures. Note that if the current cluster state satisfies the predicate (a non-null master node) the listener will be completed immediately.