Skip to content

Prevent dropping valid peer connections on failure#20961

Open
gagandhakrey wants to merge 1 commit intoopensearch-project:mainfrom
gagandhakrey:fix/peerfinder-race-condition
Open

Prevent dropping valid peer connections on failure#20961
gagandhakrey wants to merge 1 commit intoopensearch-project:mainfrom
gagandhakrey:fix/peerfinder-race-condition

Conversation

@gagandhakrey
Copy link

@gagandhakrey gagandhakrey commented Mar 22, 2026

Description

Fixed a race condition in cluster discovery where a stale connection failure could wipe out a newer, valid connection to the same address.
The issue was in the onFailure callback — when an older Peer connection attempt failed, it would blindly call peersByAddress.remove(transportAddress), not caring whether a newer connection to that same address had already been established. This caused unnecessary disconnects and flapping.
The fix is simple: swap the unconditional remove(transportAddress) for remove(transportAddress, Peer.this), which only removes the entry if it's still pointing to this specific Peer instance. That way, a failed old connection can't accidentally clean up someone else's active one.

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Gagan Dhakrey <gagandhakrey@Gagans-MacBook-Pro.local>
@gagandhakrey gagandhakrey requested a review from a team as a code owner March 22, 2026 19:44
@github-actions
Copy link
Contributor

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 No relevant tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review

Correctness Validation

The fix uses peersByAddress.remove(transportAddress, Peer.this) which relies on the map's conditional remove (ConcurrentHashMap or similar). It's important to verify that peersByAddress is indeed a map type that supports the two-argument remove(key, value) method (i.e., implements ConcurrentMap or similar), otherwise this will not compile or behave as expected.

peersByAddress.remove(transportAddress, Peer.this);
Missing Cleanup

When a connection failure occurs and the peer is removed from peersByAddress, it should be verified whether any additional cleanup is needed for the Peer instance itself (e.g., closing the connection, removing from other data structures). The old code may have relied on the remove triggering downstream cleanup that is now conditionally skipped.

public void onFailure(Exception e) {
    logger.debug(() -> new ParameterizedMessage("{} connection failed", Peer.this), e);
    synchronized (mutex) {
        peersByAddress.remove(transportAddress, Peer.this);
    }
}

@github-actions
Copy link
Contributor

❌ Gradle check result for 729e31d: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant