Skip to content

Conversation

@JeremyDahlgren
Copy link
Contributor

It is possible for a linked remote to get unlinked in between the containsKey() and get() calls in collectNodes().
This change adds a test that produces the NullPointerException and adds a fix.

It is possible for a linked remote to get unlinked in
between the containsKey() and get() calls in collectNodes().
This change adds a test that produces the NullPointerException
and adds a fix.
@JeremyDahlgren JeremyDahlgren added >bug Team:Distributed Coordination Meta label for Distributed Coordination team :Distributed Coordination/Distributed A catch all label for anything in the Distributed Coordination area. Please avoid if you can. v9.2.0 labels Jul 25, 2025
@elasticsearchmachine
Copy link
Collaborator

Hi @JeremyDahlgren, I've created a changelog YAML for you.

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

Copy link
Member

@ywangd ywangd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM


@Override
public void onFailure(Exception e) {
assertNotNull(e);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can verify the exception is NoSuchRemoteClusterException.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put the test in a loop and we can hit a few different exceptions here. I refactored to test them as done in testCollectNodes() above.

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See inline comment about early-completion. Also could we rework this to just do the lookup with get rather than the pre-flight check? You'd have to store the looked-up connections in some intermediate collection but that's ok, this shouldn't be on a hot path and the collection won't be unmanageably massive.

RemoteClusterConnection connection = this.remoteClusters.get(cluster);
// Ensure the connection is not null, it could have been removed since the containsKey() call above.
if (connection == null) {
if (countDown.fastForward()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a fan of using fastForward like this - it means the listener completes early, potentially while other collectNodes calls are still running, which can cause duplicated work and other confusions. The same concern applies to the fastForward call that already existed.

Instead, could we adjust this all to use a RefCountingListener? That would (a) delay the completion properly and (b) collect multiple failures (up to 10) as well as (c) being more idiomatic of modern Elasticsearch.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks David, much cleaner!

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@JeremyDahlgren JeremyDahlgren merged commit ccf9893 into elastic:main Jul 30, 2025
33 checks passed
afoucret pushed a commit to afoucret/elasticsearch that referenced this pull request Jul 31, 2025
…131937)

It is possible for a linked remote to get unlinked in
between the containsKey() and get() calls in collectNodes().
This change adds a test that produces the NullPointerException
and adds a fix.
smalyshev pushed a commit to smalyshev/elasticsearch that referenced this pull request Jul 31, 2025
…131937)

It is possible for a linked remote to get unlinked in
between the containsKey() and get() calls in collectNodes().
This change adds a test that produces the NullPointerException
and adds a fix.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

>bug :Distributed Coordination/Distributed A catch all label for anything in the Distributed Coordination area. Please avoid if you can. Team:Distributed Coordination Meta label for Distributed Coordination team v9.2.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants