Fix race condition when resolving new location for multiple shards at once #128062
Conversation
Pinging @elastic/es-analytical-engine (Team:Analytics)
```diff
 for (ShardId shardId : pendingShardIds) {
-    if (targetShards.getShard(shardId).remainingNodes.isEmpty()) {
+    if (targetShards.getShard(shardId).remainingNodes.isEmpty()
+        && (isRetryableFailure(shardFailures.get(shardId)) == false || pendingRetries.contains(shardId))) {
```
It was previously possible that, between executing line 195 and line 206, one of the data nodes returned a NoSuchShard exception and added a new pending shard ID. As a result, such a shard might still have no remainingNodes to query. This change is meant to detect that situation and defer the shard's resolution to the next round.
This was detected by the testSearchWhileRelocating integration test, which is fairly slow and expensive, so I added the testRetryMultipleMovedShards unit test, which makes the issue much easier to reproduce.
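To make the new condition concrete, here is a minimal, self-contained sketch of the deferral logic in plain Java. All of the names in it (ShardState, shardsToFailNow, the failure map, the retryability rule) are hypothetical stand-ins rather than the actual Elasticsearch internals; only the shape of the guard mirrors the diff above.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-ins for the real sender state; not the actual ES code.
class ResolutionSketch {

    record ShardId(int id) {}

    // A shard's remaining candidate nodes; empty once every known copy failed.
    static final class ShardState {
        final Set<String> remainingNodes = new HashSet<>();
    }

    // Assumed rule for this sketch: any recorded failure stands in for a
    // retryable "no such shard" response from a data node.
    static boolean isRetryableFailure(Exception failure) {
        return failure != null;
    }

    /**
     * Decide which pending shards should be failed now. A shard with no
     * remaining nodes is only failed if its failure is not retryable, or if
     * a retry was already scheduled for it; otherwise resolution of its new
     * location may still be in flight, so it is deferred to the next round.
     */
    static List<ShardId> shardsToFailNow(
        List<ShardId> pendingShardIds,
        Map<ShardId, ShardState> targetShards,
        Map<ShardId, Exception> shardFailures,
        Set<ShardId> pendingRetries
    ) {
        return pendingShardIds.stream()
            .filter(shardId -> targetShards.get(shardId).remainingNodes.isEmpty()
                && (isRetryableFailure(shardFailures.get(shardId)) == false
                    || pendingRetries.contains(shardId)))
            .toList();
    }

    public static void main(String[] args) {
        ShardId moved = new ShardId(1);
        Map<ShardId, ShardState> targets = Map.of(moved, new ShardState());
        Map<ShardId, Exception> failures = Map.of(moved, new RuntimeException("no such shard"));
        // The failure is retryable and no retry is pending yet, so the shard
        // is deferred rather than failed: prints an empty list.
        System.out.println(shardsToFailNow(List.of(moved), targets, failures, Set.of()));
    }
}
```

The key point is that an empty remainingNodes set alone no longer proves a shard is unreachable: a concurrently reported retryable failure means its new location is still being resolved, so the shard is carried over to the next round instead of being failed immediately.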
dnhatn left a comment
LGTM. Thanks for fixing this.
💔 Backport failed
You can use sqren/backport to manually backport by running
… once (elastic#128062) (cherry picked from commit f4b6086)
This fixes a race condition that occurred when retrying the resolution of multiple moved shards at once.
Please see inline comments for more details.
Related to: #127188
Closes: #128082