Fix race condition when resolving new location for multiple shards at once #128062
Conversation
Pinging @elastic/es-analytical-engine (Team:Analytics)
```diff
 for (ShardId shardId : pendingShardIds) {
-    if (targetShards.getShard(shardId).remainingNodes.isEmpty()) {
+    if (targetShards.getShard(shardId).remainingNodes.isEmpty()
+        && (isRetryableFailure(shardFailures.get(shardId)) == false || pendingRetries.contains(shardId))) {
```
It was previously possible that, between executing line 195 and line 206, one of the data nodes returned a NoSuchShard exception and added a new pending shard ID. As a result, such a shard might still have no remainingNodes to query. This change is meant to detect that situation and defer the shard's resolution to the next round.
This was detected by the testSearchWhileRelocating integration test, which is fairly slow and expensive, so I added the testRetryMultipleMovedShards unit test, which makes the issue much easier to reproduce.
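To make the new condition concrete, here is a minimal, self-contained sketch of the deferral logic in plain Java. All of the names in it (ShardState, shardsToFailNow, the failure map, the retryability rule) are hypothetical stand-ins rather than the actual Elasticsearch internals; only the shape of the guard mirrors the diff above.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-ins for the real sender state; not the actual ES code.
class ResolutionSketch {

    record ShardId(int id) {}

    // A shard's remaining candidate nodes; empty once every known copy failed.
    static final class ShardState {
        final Set<String> remainingNodes = new HashSet<>();
    }

    // Assumed rule for this sketch: any recorded failure stands in for a
    // retryable "no such shard" response from a data node.
    static boolean isRetryableFailure(Exception failure) {
        return failure != null;
    }

    /**
     * Decide which pending shards should be failed now. A shard with no
     * remaining nodes is only failed if its failure is not retryable, or if
     * a retry was already scheduled for it; otherwise resolution of its new
     * location may still be in flight, so it is deferred to the next round.
     */
    static List<ShardId> shardsToFailNow(
        List<ShardId> pendingShardIds,
        Map<ShardId, ShardState> targetShards,
        Map<ShardId, Exception> shardFailures,
        Set<ShardId> pendingRetries
    ) {
        return pendingShardIds.stream()
            .filter(shardId -> targetShards.get(shardId).remainingNodes.isEmpty()
                && (isRetryableFailure(shardFailures.get(shardId)) == false
                    || pendingRetries.contains(shardId)))
            .toList();
    }

    public static void main(String[] args) {
        ShardId moved = new ShardId(1);
        Map<ShardId, ShardState> targets = Map.of(moved, new ShardState());
        Map<ShardId, Exception> failures = Map.of(moved, new RuntimeException("no such shard"));
        // The failure is retryable and no retry is pending yet, so the shard
        // is deferred rather than failed: prints an empty list.
        System.out.println(shardsToFailNow(List.of(moved), targets, failures, Set.of()));
    }
}
```

The key point is that an empty remainingNodes set alone no longer proves a shard is unreachable: a concurrently reported retryable failure means its new location is still being resolved, so the shard is carried over to the next round instead of being failed immediately.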
dnhatn left a comment
LGTM. Thanks for fixing this.
💔 Backport failed
You can use sqren/backport to manually backport by running
… once (elastic#128062) (cherry picked from commit f4b6086)
This fixes a race condition that occurred when retrying the resolution of multiple moved shards at once.
Please see inline comments for more details.
Related to: #127188
Closes: #128082