@idegtiarenko idegtiarenko commented May 14, 2025

This fixes a race condition that occurs when retrying the resolution of multiple moved shards at once.
Please see inline comments for more details.

Related to: #127188
Closes: #128082

@idegtiarenko idegtiarenko requested review from dnhatn and nik9000 May 14, 2025 08:50
@idegtiarenko idegtiarenko added >non-issue Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) :Analytics/ES|QL AKA ESQL v8.19.0 v9.1.0 labels May 14, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

  for (ShardId shardId : pendingShardIds) {
-     if (targetShards.getShard(shardId).remainingNodes.isEmpty()) {
+     if (targetShards.getShard(shardId).remainingNodes.isEmpty()
+         && (isRetryableFailure(shardFailures.get(shardId)) == false || pendingRetries.contains(shardId))) {
Contributor Author

Previously, between executing line 195 and line 206, one of the data nodes could return with a NoSuchShard exception and add a new pendingShardId. As a result, that shard might still have no remainingNodes to query when the check runs. This change is supposed to detect such situations and defer the resolution of those shards to the next round.

This was detected by the testSearchWhileRelocating integration test, which is fairly slow and expensive, so I added the testRetryMultipleMovedShards unit test, which makes the problem much easier to reproduce.
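
For readers following the diff, here is a minimal sketch of the interleaving described above. It is not the actual DataNodeRequestSender code: the class `MovedShardCheckSketch`, the methods `oldCheck` and `newCheck`, the `retryableFailures` set, and the string shard and node ids are all hypothetical stand-ins. The sketch assumes that `pendingRetries` holds the shards whose re-resolution was already part of the current round, and that a shard that fails afterwards is absent from it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only: every name here is a hypothetical stand-in,
// not a real Elasticsearch class or method.
final class MovedShardCheckSketch {

    // Old condition: give up on any pending shard with no nodes left to query.
    static List<String> oldCheck(List<String> pendingShardIds, Map<String, List<String>> remainingNodes) {
        List<String> givenUp = new ArrayList<>();
        for (String shardId : pendingShardIds) {
            if (remainingNodes.getOrDefault(shardId, List.<String>of()).isEmpty()) {
                givenUp.add(shardId);
            }
        }
        return givenUp;
    }

    // New condition: additionally require that the failure is not retryable,
    // or that the shard was already re-resolved in this round (pendingRetries).
    static List<String> newCheck(
        List<String> pendingShardIds,
        Map<String, List<String>> remainingNodes,
        Set<String> retryableFailures, // assumption: shards whose last failure is retryable
        Set<String> pendingRetries     // assumption: shards re-resolved in the current round
    ) {
        List<String> givenUp = new ArrayList<>();
        for (String shardId : pendingShardIds) {
            boolean noNodesLeft = remainingNodes.getOrDefault(shardId, List.<String>of()).isEmpty();
            boolean retryable = retryableFailures.contains(shardId);
            if (noNodesLeft && (retryable == false || pendingRetries.contains(shardId))) {
                givenUp.add(shardId);
            }
            // Otherwise the shard is left alone and handled in the next round.
        }
        return givenUp;
    }

    public static void main(String[] args) {
        // shard-1 was re-resolved this round and found a new node.
        // shard-2 failed with a retryable error after the retry set was captured,
        // so it has no remaining nodes yet and is not in pendingRetries.
        List<String> pending = List.of("shard-1", "shard-2");
        Map<String, List<String>> remaining = Map.of(
            "shard-1", List.of("node-b"),
            "shard-2", List.<String>of()
        );
        Set<String> retryable = Set.of("shard-1", "shard-2");
        Set<String> pendingRetries = Set.of("shard-1");

        System.out.println(oldCheck(pending, remaining));                            // prints [shard-2]
        System.out.println(newCheck(pending, remaining, retryable, pendingRetries)); // prints []
    }
}
```

Under these assumptions, the old condition gives up on `shard-2` too early, while the new one leaves it for the next round, where it can be re-resolved like any other moved shard.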

Member

@dnhatn dnhatn left a comment

LGTM. Thanks for fixing this.

@idegtiarenko idegtiarenko added the auto-backport Automatically create backport pull requests when merged label May 20, 2025
@idegtiarenko idegtiarenko merged commit f4b6086 into elastic:main May 20, 2025
17 checks passed
@idegtiarenko idegtiarenko deleted the fix_127188 branch May 20, 2025 08:11
@elasticsearchmachine
Collaborator

💔 Backport failed

Status Branch Result
8.19 Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 128062

Development

Successfully merging this pull request may close these issues.

[CI] DataNodeRequestSenderTests testRetryOnlyMovedShards failing
