Prevent dropping valid peer connections on failure #20961
gagandhakrey wants to merge 1 commit into opensearch-project:main
Signed-off-by: Gagan Dhakrey <gagandhakrey@Gagans-MacBook-Pro.local>
❌ Gradle check result for 729e31d: FAILURE. Please examine the workflow log, locate and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
Description
Fixed a race condition in cluster discovery where a stale connection failure could wipe out a newer, valid connection to the same address.
The issue was in the onFailure callback: when an older Peer connection attempt failed, it unconditionally called peersByAddress.remove(transportAddress), regardless of whether a newer connection to that same address had already been established. This caused unnecessary disconnects and connection flapping.
The fix is simple: swap the unconditional remove(transportAddress) for remove(transportAddress, Peer.this), which only removes the entry if it still maps to this specific Peer instance. That way, a failed old connection attempt can no longer remove the entry that belongs to a newer, active one.
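The difference between the two remove calls can be illustrated with a small standalone sketch. This is not the OpenSearch code: the Peer class here is a hypothetical stand-in, addresses are plain strings rather than TransportAddress, and the method names are made up for illustration. It assumes peersByAddress is a ConcurrentMap and relies only on the standard ConcurrentMap.remove(key) and ConcurrentMap.remove(key, value) methods.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Illustrative sketch only; every name except peersByAddress is hypothetical.
final class StalePeerRemovalSketch {

    /** Minimal stand-in for one connection attempt to an address. */
    static final class Peer {
        final String address;

        Peer(String address) {
            this.address = address;
        }
        // equals() is not overridden, so remove(key, value) compares by identity.
    }

    /** Old behaviour: removes whatever is currently mapped to the address. */
    static void onFailureUnconditional(ConcurrentMap<String, Peer> peersByAddress, Peer failed) {
        peersByAddress.remove(failed.address);
    }

    /** Fixed behaviour: removes the entry only if it is still the failed peer. */
    static boolean onFailureConditional(ConcurrentMap<String, Peer> peersByAddress, Peer failed) {
        return peersByAddress.remove(failed.address, failed);
    }

    public static void main(String[] args) {
        ConcurrentMap<String, Peer> peersByAddress = new ConcurrentHashMap<>();
        Peer stale = new Peer("10.0.0.1:9300");
        Peer fresh = new Peer("10.0.0.1:9300");

        // A newer attempt has already replaced the stale one in the map.
        peersByAddress.put(fresh.address, fresh);

        // The stale attempt now fails; the conditional remove leaves the fresh entry alone.
        boolean removed = onFailureConditional(peersByAddress, stale);
        System.out.println("conditional removed entry: " + removed);
        System.out.println("fresh peer still present: " + (peersByAddress.get(fresh.address) == fresh));

        // The unconditional remove drops the fresh entry as well.
        onFailureUnconditional(peersByAddress, stale);
        System.out.println("map empty after unconditional remove: " + peersByAddress.isEmpty());
    }
}
```

Because the two-argument remove is a single atomic check-and-remove on the map, no extra locking is needed in this sketch to avoid removing a peer that was inserted between the check and the removal.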
Related Issues
Resolves #[Issue number to be closed when this PR is merged]
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.