ywangd
Member

@ywangd ywangd commented Sep 11, 2025

Resolves: #134277

@ywangd ywangd requested a review from DaveCTurner September 11, 2025 08:43
@ywangd ywangd added the >test, auto-backport, :Distributed Coordination/Distributed, v9.2.0, v9.0.7, v8.19.5, and v9.1.5 labels Sep 11, 2025
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination Meta label for Distributed Coordination team label Sep 11, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

Comment on lines +424 to +428
// Wait for the latch, the listener for releasing node responses is called before it.
// We need to wait for the latch because the cancellation may be detected in CancellableFanOut#onCompletion with
// the responseHandled flag being true. The flag is set by the cancellation listener which is still in process of
// draining existing responses.
safeAwait(onCancelledLatch);
Member Author

It turns out we still need the latch. It was actually part of the original issue, i.e. the cancellation can come in after all node responses are collected and right before the final response is sent. In this case, the final response is short-circuited to an exception while the cancellation listener is still doing its work (draining and releasing the node responses).
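
For readers following the thread, the ordering described above can be sketched in isolation. This is a minimal illustration, not the Elasticsearch test or production code: the class name, the `finalResponseSent` latch, and the `retained` counter are hypothetical stand-ins, and only `onCancelledLatch` borrows its name from the test. It assumes the cancellation listener releases responses on another thread after the final response has already been delivered.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the race discussed above; not Elasticsearch code.
class LatchAfterCancellationSketch {

    /** Returns how many node responses are still retained once the latch has been awaited. */
    static int retainedAfterAwaitingLatch() {
        final int nodeResponses = 3;
        AtomicInteger retained = new AtomicInteger(nodeResponses); // one ref per node response
        CountDownLatch finalResponseSent = new CountDownLatch(1);
        CountDownLatch onCancelledLatch = new CountDownLatch(1);

        Thread cancellationListener = new Thread(() -> {
            try {
                // The final (cancellation) response has already gone out ...
                finalResponseSent.await();
                // ... while this listener is still releasing the node responses.
                for (int i = 0; i < nodeResponses; i++) {
                    retained.decrementAndGet();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                onCancelledLatch.countDown();
            }
        });
        cancellationListener.start();

        // The caller observes the final response here; 'retained' may still be non-zero.
        finalResponseSent.countDown();

        // Only after the latch is it safe to assert that everything was released.
        try {
            if (!onCancelledLatch.await(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("cancellation listener did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return retained.get();
    }

    public static void main(String[] args) {
        System.out.println("retained after latch: " + retainedAfterAwaitingLatch());
    }
}
```

Checking `retained` right after `finalResponseSent.countDown()` would be racy in this sketch, which is the reason the test waits on the latch before asserting.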

Contributor

Hmm ok I think I see, so you mean the final response can be sent in between these two lines:

semaphore.release();
// finally, release refs to all the per-item listeners (without calling onItemFailure, so this is also fast)
cancellableTask.notifyIfCancelled(itemCancellationListener);

On reflection this seems kinda surprising if not an outright bug: we generally prefer to delay sending the response until everything is released.

Member Author

Not the above lines. The final response (cancellation) can be sent at this line

before the node responses are released by this line

Releasables.wrap(Iterators.map(drainedResponses.iterator(), r -> r::decRef)).close();

A potential alternative fix is to add a similar synchronized block to drain node responses right before the first line linked above. What do you think?
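
To make the "drain, then release" idea concrete, here is a rough sketch of the general pattern under stated assumptions. It is not the `CancellableFanOut` or `TransportNodesAction` implementation: the class, its fields, and the use of `AtomicInteger` as a ref count are all hypothetical. It shows that draining under a lock and releasing outside the lock are two separate steps, so there is a window in which responses are drained but not yet released.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch; names are illustrative and not taken from the Elasticsearch source.
class DrainThenReleaseSketch {
    private final Object mutex = new Object();
    // Each counter models one response's ref count; the list becomes null once drained.
    private List<AtomicInteger> responses = new ArrayList<>();

    /** Tracks a response, or releases it immediately if draining already happened. */
    void add(AtomicInteger refCount) {
        synchronized (mutex) {
            if (responses != null) {
                responses.add(refCount);
                return;
            }
        }
        refCount.decrementAndGet(); // already drained: release straight away
    }

    /** Atomically takes ownership of everything collected so far. */
    List<AtomicInteger> drain() {
        synchronized (mutex) {
            List<AtomicInteger> drained = responses == null ? List.of() : responses;
            responses = null;
            return drained;
        }
    }

    /** Returns {refs held after draining, refs held after releasing}. */
    static int[] demo() {
        DrainThenReleaseSketch sketch = new DrainThenReleaseSketch();
        AtomicInteger a = new AtomicInteger(1);
        AtomicInteger b = new AtomicInteger(1);
        sketch.add(a);
        sketch.add(b);

        List<AtomicInteger> drained = sketch.drain();
        int heldAfterDrain = a.get() + b.get(); // drained, but nothing released yet

        drained.forEach(AtomicInteger::decrementAndGet); // release outside the lock
        int heldAfterRelease = a.get() + b.get();

        return new int[] { heldAfterDrain, heldAfterRelease };
    }

    public static void main(String[] args) {
        int[] result = demo();
        System.out.println("held after drain: " + result[0] + ", after release: " + result[1]);
    }
}
```

Anything that completes between `drain()` and the release loop can observe responses that are no longer tracked but still hold a reference.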

Contributor

Ah yeah sorry this is all quite a tangle. I don't think it'll work to drain responses in onCompletion - we could already have drained them in the listener within addReleaseOnCancellationListener but still not quite released them yet.

On reflection it looks like there are other ways we can complete the final listener before releasing all the node-level responses, e.g. here:

try (var ignored = Releasables.wrap(Iterators.map(responses.iterator(), r -> r::decRef))) {
    newResponseAsync(task, request, actionContext, responses, exceptions, l);
}

Let's leave this alone then; I appreciate the comment in the test calling out that this is slightly odd.
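
The ordering in that try-with-resources shape can be shown with a small standalone sketch. It assumes the work inside the block completes the listener synchronously, which may not always hold for the real `newResponseAsync`. The local `Releasable` interface and the event strings are hypothetical stand-ins rather than the Elasticsearch types.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch of "listener completes before resources are released".
class CompleteBeforeReleaseSketch {

    /** Local stand-in for a closeable resource whose close() throws no checked exception. */
    interface Releasable extends AutoCloseable {
        @Override
        void close();
    }

    /** Records the order in which the listener completes and the resources are released. */
    static List<String> run() {
        List<String> events = new ArrayList<>();
        Consumer<String> finalListener = response -> events.add("final listener completed");
        Releasable releaseNodeResponses = () -> events.add("node responses released");

        try (Releasable ignored = releaseNodeResponses) {
            // If the listener is completed synchronously inside the block,
            // it runs before close() releases the node responses.
            finalListener.accept("response");
        }
        return events;
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

In this sketch the listener always observes unreleased node responses, because `close()` only runs when the block exits.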

Contributor

@DaveCTurner DaveCTurner left a comment

LGTM

@ywangd
Member Author

ywangd commented Sep 15, 2025

@elasticmachine update branch

@ywangd ywangd added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Sep 15, 2025
@ywangd
Member Author

ywangd commented Sep 15, 2025

@elasticmachine update branch

@elasticmachine
Collaborator

There are no new commits on the base branch.

@elasticsearchmachine elasticsearchmachine merged commit 2ea81d0 into elastic:main Sep 17, 2025
35 checks passed
@ywangd ywangd deleted the es-134277-fix-again branch September 17, 2025 03:08
@elasticsearchmachine
Collaborator

💔 Backport failed

Status Branch Result
9.0 Commit could not be cherrypicked due to conflicts
8.19 Commit could not be cherrypicked due to conflicts
9.1 Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 134532

ywangd added a commit to ywangd/elasticsearch that referenced this pull request Sep 17, 2025
Resolves: elastic#134277
(cherry picked from commit 2ea81d0)

# Conflicts:
#	muted-tests.yml
@ywangd
Member Author

ywangd commented Sep 17, 2025

💚 All backports created successfully

Status Branch Result
9.1
9.0
8.19

Questions ?

Please refer to the Backport tool documentation

ywangd added a commit to ywangd/elasticsearch that referenced this pull request Sep 17, 2025
Resolves: elastic#134277
(cherry picked from commit 2ea81d0)

# Conflicts:
#	muted-tests.yml
elasticsearchmachine pushed a commit that referenced this pull request Sep 17, 2025
Resolves: #134277
(cherry picked from commit 2ea81d0)

# Conflicts:
#	muted-tests.yml
elasticsearchmachine pushed a commit that referenced this pull request Sep 17, 2025
Resolves: #134277
(cherry picked from commit 2ea81d0)

# Conflicts:
#	muted-tests.yml
elasticsearchmachine pushed a commit that referenced this pull request Sep 17, 2025
Resolves: #134277
(cherry picked from commit 2ea81d0)

# Conflicts:
#	muted-tests.yml
mridula-s109 pushed a commit to mridula-s109/elasticsearch that referenced this pull request Sep 17, 2025
gmjehovich pushed a commit to gmjehovich/elasticsearch that referenced this pull request Sep 18, 2025
Labels
auto-backport: Automatically create backport pull requests when merged
auto-merge-without-approval: Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!)
backport pending
:Distributed Coordination/Distributed: A catch all label for anything in the Distributed Coordination area. Please avoid if you can.
Team:Distributed Coordination: Meta label for Distributed Coordination team
>test: Issues or PRs that are addressing/adding tests
v8.19.5
v9.0.7
v9.1.5
v9.2.0
Development

Successfully merging this pull request may close these issues.

[CI] TransportNodesActionTests testConcurrentlyCompletionAndCancellation failing
4 participants