-
Couldn't load subscription status.
- Fork 25.6k
Fix race condition in RemoteClusterService.collectNodes()
#131937
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix race condition in RemoteClusterService.collectNodes()
#131937
Conversation
It is possible for a linked remote to get unlinked in between the containsKey() and get() calls in collectNodes(). This change adds a test that produces the NullPointerException and adds a fix.
|
Hi @JeremyDahlgren, I've created a changelog YAML for you. |
|
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
|
||
| @Override | ||
| public void onFailure(Exception e) { | ||
| assertNotNull(e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can verify the exception is NoSuchRemoteClusterException.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I put the test in a loop and we can hit a few different exceptions here. I refactored to test them as done in testCollectNodes() above.
server/src/test/java/org/elasticsearch/transport/RemoteClusterServiceTests.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See inline comment about early-completion. Also could we rework this to just do the lookup with get rather than the pre-flight check? You'd have to store the looked-up connections in some intermediate collection but that's ok, this shouldn't be on a hot path and the collection won't be unmanageably massive.
| RemoteClusterConnection connection = this.remoteClusters.get(cluster); | ||
| // Ensure the connection is not null, it could have been removed since the containsKey() call above. | ||
| if (connection == null) { | ||
| if (countDown.fastForward()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a fan of using fastForward like this - it means the listener completes early, potentially while other collectNodes calls are still running, which can cause duplicated work and other confusions. The same concern applies to the fastForward call that already existed.
Instead, could we adjust this all to use a RefCountingListener? That would (a) delay the completion properly and (b) collect multiple failures (up to 10) as well as (c) being more idiomatic of modern Elasticsearch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks David, much cleaner!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
…131937) It is possible for a linked remote to get unlinked in between the containsKey() and get() calls in collectNodes(). This change adds a test that produces the NullPointerException and adds a fix.
…131937) It is possible for a linked remote to get unlinked in between the containsKey() and get() calls in collectNodes(). This change adds a test that produces the NullPointerException and adds a fix.
It is possible for a linked remote to get unlinked in between the
containsKey()andget()calls incollectNodes().This change adds a test that produces the
NullPointerExceptionand adds a fix.