Fix request processing scheduling #127464

idegtiarenko · 2025-04-28T12:38:22Z

This change fixes a concurrency issue in the handling of moved shards.

The test was failing because shard failures were visible before the retry was processed. To fix this, the error handling is updated to:

schedule retries before recording shard failures
block request sending as soon as a moved shard is detected (previously, sending was locked only once we had accumulated the list of shards and started resolving their new locations).
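
The ordering described above can be illustrated with a minimal sketch. It is a simplified, hypothetical model rather than the code in this PR: the class `RetryOrderingSketch`, its methods, the `sendingBlocked` flag (standing in for the sending lock), and the string shard IDs are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model; it does not mirror the real DataNodeRequestSender.
public final class RetryOrderingSketch {
    private final Set<String> pendingRetries = new LinkedHashSet<>();
    private final Map<String, Exception> shardFailures = new LinkedHashMap<>();
    private final List<String> events = new ArrayList<>();
    private boolean sendingBlocked; // stand-in for the sending lock

    // Called when a request fails because the shard moved to another node.
    public synchronized void onMovedShardFailure(String shardId, Exception e) {
        // 1. Block further sends as soon as the moved shard is detected.
        sendingBlocked = true;
        // 2. Schedule the retry first ...
        pendingRetries.add(shardId);
        events.add("retry-scheduled:" + shardId);
        // 3. ... and only then make the failure visible to observers.
        shardFailures.put(shardId, e);
        events.add("failure-recorded:" + shardId);
    }

    // Called once new shard locations are resolved: hands out the retries and unblocks sending.
    public synchronized Set<String> drainRetries() {
        Set<String> toRetry = new LinkedHashSet<>(pendingRetries);
        pendingRetries.clear();
        sendingBlocked = false;
        return toRetry;
    }

    public synchronized boolean canSend() {
        return sendingBlocked == false;
    }

    public synchronized List<String> events() {
        return new ArrayList<>(events);
    }

    public synchronized Set<String> failedShards() {
        return new LinkedHashSet<>(shardFailures.keySet());
    }
}
```

In this model, anyone who observes a recorded failure for a moved shard can rely on the retry already being scheduled and on sending being blocked until `drainRetries()` runs.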

Closes: #127168

elasticsearchmachine · 2025-04-28T12:38:46Z

Pinging @elastic/es-analytical-engine (Team:Analytics)

dnhatn · 2025-04-28T15:26:08Z

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plugin/DataNodeRequestSender.java

                }

-                if (pendingRetries.isEmpty() == false && remainingUnavailableShardResolutionAttempts.decrementAndGet() >= 0) {
+                if (sendingLock.isHeldByCurrentThread()) {


I think isHeldByCurrentThread should be used for assertions or debugging purposes, not in production code.

Agreed. Let me find a way to replace it.
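
For context, the JDK documentation for `ReentrantLock.isHeldByCurrentThread()` notes that the method is typically used for debugging and testing. A minimal sketch of that assertion-only usage follows; the class `LockAssertionSketch` and its methods are hypothetical and not taken from this PR.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical example: isHeldByCurrentThread() guards an invariant, not control flow.
public final class LockAssertionSketch {
    private final ReentrantLock lock = new ReentrantLock();
    private int counter;

    public void increment() {
        lock.lock();
        try {
            incrementLocked();
        } finally {
            lock.unlock();
        }
    }

    // Must only be called while holding the lock; checked when assertions are enabled (-ea).
    private void incrementLocked() {
        assert lock.isHeldByCurrentThread() : "caller must hold the lock";
        counter++;
    }

    public int value() {
        lock.lock();
        try {
            return counter;
        } finally {
            lock.unlock();
        }
    }

    public boolean isLocked() {
        return lock.isLocked();
    }
}
```

The program behaves the same whether or not assertions are enabled; the check only documents and verifies the locking contract.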

dnhatn · 2025-04-29T17:29:04Z

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plugin/DataNodeRequestSender.java

-                    pendingRetries.add(shardId);
+                    if (pendingRetries == null && remainingUnavailableShardResolutionAttempts.decrementAndGet() >= 0) {
+                        pendingRetries = new HashSet<>();
+                        sendingLock.lock();


I am concerned that we are scattering sendingLock#lock and sendingLock#unlock in two different places. Can we keep them close?

They are in the same inner listener, guarding the same structure at the moment.
They were previously in the same method, but that was not enough and caused a bug.
I suspect we could do this by creating a releasable inner wrapper class on top of pendingRetries = new HashSet<>(), but that sounds like overkill that would not really help readability.

@idegtiarenko Sorry, I should have provided more detail. My concern is that we acquire the sending lock in maybeScheduleRetry and release it in onAfter, which is linked to the status of pendingRetries. While the implementation is technically correct, I think we should stick to the simplest lock pattern unless there is a strong reason to do otherwise:

lock/tryLock try { ... } finally { unlock }

I think we can follow this lock pattern in the DataNodeRequestSender class.
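
The pattern referred to here can be sketched as follows. This is only an illustration under assumptions: the class `LockPatternSketch` and its methods are hypothetical, and the sketch does not show how DataNodeRequestSender was eventually changed.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical example of keeping lock and unlock together in one method.
public final class LockPatternSketch {
    private final ReentrantLock sendingLock = new ReentrantLock();

    // Blocking variant: lock, run, always unlock.
    public void withLock(Runnable action) {
        sendingLock.lock();
        try {
            action.run();
        } finally {
            sendingLock.unlock();
        }
    }

    // Non-blocking variant: returns false without running the action if the lock is taken.
    public boolean tryWithLock(Runnable action) {
        if (sendingLock.tryLock() == false) {
            return false;
        }
        try {
            action.run();
            return true;
        } finally {
            sendingLock.unlock();
        }
    }

    public boolean isLocked() {
        return sendingLock.isLocked();
    }
}
```

Because the unlock sits in a `finally` block in the same method as the lock, the lock is released even when the action throws, and a reader can verify the pairing without following state across callbacks.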

This sounds a lot like the original implementation (within onAfter), which was not correct.
Do you see a way to implement this suggestion while still locking conditionally (only if shard movement is detected) and blocking concurrent requests once it is detected that new shard location resolution is required?

As discussed yesterday, I moved retry scheduling to trySendingRequestsForPendingShards in 8a0dcc6.

dnhatn

LGTM. Thanks @idegtiarenko

Fix request processing scheduling

de2f695

idegtiarenko added >non-issue Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) :Analytics/ES|QL AKA ESQL v9.1.0 labels Apr 28, 2025

idegtiarenko requested review from dnhatn and nik9000 April 28, 2025 12:38

idegtiarenko added 3 commits April 28, 2025 15:53

Merge branch 'main' into debug_DataNodeRequestSenderIT

a5fde53

fix retry limit

7780ed7

Merge branch 'main' into debug_DataNodeRequestSenderIT

3425c26

dnhatn reviewed Apr 28, 2025


idegtiarenko added 2 commits April 29, 2025 08:54

replace isHeldByCurrentThreadcheck

b2dd3da

Merge branch 'main' into debug_DataNodeRequestSenderIT

90d297c

idegtiarenko requested a review from dnhatn April 29, 2025 07:01

dnhatn reviewed Apr 29, 2025


idegtiarenko added 2 commits May 6, 2025 09:55

Merge branch 'main' into debug_DataNodeRequestSenderIT

733e5c8

Move retry scheduling to trySendingRequestsForPendingShards

8a0dcc6

idegtiarenko requested a review from dnhatn May 6, 2025 11:31

handle all shards unassigned

b193ec8

dnhatn approved these changes May 6, 2025


Merge branch 'main' into debug_DataNodeRequestSenderIT

95ed34f

idegtiarenko merged commit c922e52 into elastic:main May 7, 2025
17 checks passed

idegtiarenko deleted the debug_DataNodeRequestSenderIT branch May 7, 2025 06:20

idegtiarenko mentioned this pull request May 7, 2025

[8.19] Retry shard movements during ESQL query #127807

Merged

idegtiarenko added the v8.19.0 label May 7, 2025

ywangd pushed a commit to ywangd/elasticsearch that referenced this pull request May 9, 2025

Fix request processing scheduling (elastic#127464)

a9bc9da

afoucret pushed a commit to afoucret/elasticsearch that referenced this pull request May 9, 2025

Fix request processing scheduling (elastic#127464)

37d1053

jfreden pushed a commit to jfreden/elasticsearch that referenced this pull request May 12, 2025

Fix request processing scheduling (elastic#127464)

06cec93
