Improve cancellability in TransportTasksAction #96279
Conversation
Each `TransportTasksAction` fans out to multiple nodes, accumulates responses and retains them until all the nodes have responded, and then converts the responses into a final result. Similarly to elastic#92987 and elastic#93484, we should accumulate the responses in a structure that doesn't require so much copying later on, and should drop the received responses if the task is cancelled while some nodes' responses are still pending.
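The accumulate-then-release pattern the description refers to can be sketched roughly as follows. This is a hedged illustration with hypothetical names (`ResponseAccumulator`, `onNodeResponse`), not the actual Elasticsearch types: per-node responses go into a concurrent structure rather than being copied around, and cancellation drains it so partial results are freed before all nodes have responded.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of a cancellation-aware response accumulator.
class ResponseAccumulator<T> {
    private final Queue<T> responses = new ConcurrentLinkedQueue<>();
    private final AtomicBoolean cancelled = new AtomicBoolean();

    void onNodeResponse(T response) {
        if (cancelled.get()) {
            return;                        // drop late responses after cancellation
        }
        responses.add(response);
        if (cancelled.get()) {
            responses.clear();             // handle cancellation racing with the add
        }
    }

    void cancel() {
        cancelled.set(true);
        responses.clear();                 // release retained partial results early
    }

    int size() {
        return responses.size();
    }
}
```

The check-after-add in `onNodeResponse` covers the race where cancellation lands between the `cancelled` check and the `add`, so a late response cannot be retained forever.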
Pinging @elastic/es-distributed (Team:Distributed)
Hi @DaveCTurner, I've created a changelog YAML for you.
In a busy cluster the list-tasks API may retain information about a very large number of tasks while waiting for all nodes to respond. This commit makes the API cancellable so that unnecessary partial results can be released earlier. Relates elastic#96279, which implements the early-release functionality.
LGTM! I've left a few cosmetic comments.
```java
reachabilityChecker.ensureUnreachable();
```

```java
while (true) {
```
Would it make sense to use `while (taskResponseListeners.peek() != null)` here instead of manually breaking out of the loop?
That'd work, but I'd rather just do a single read, avoiding having to reason about the relationship between `peek()` and `poll()`.
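The two loop shapes under discussion can be illustrated with a small, self-contained example (hypothetical `drainWithSinglePoll` helper, not the PR's actual code). Polling once per iteration means each element is observed by exactly one read, so there is no need to reason about whether a `peek()` and the following `poll()` see the same element when the queue is drained concurrently.

```java
import java.util.Queue;

// Hypothetical helper showing the single-read drain loop the author prefers.
class DrainExample {
    static int drainWithSinglePoll(Queue<Runnable> queue) {
        int ran = 0;
        while (true) {
            Runnable r = queue.poll();   // single read per iteration
            if (r == null) {
                break;                   // queue is (currently) empty
            }
            r.run();
            ran++;
        }
        return ran;
    }
}
```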
```java
        return;
    }

logger.debug(Strings.format("failed to execute on node [{}]", nodeId), e);
```
I was wondering if we can perform the formatting lazily by using a supplier: `logger.debug(() -> Strings.format(`
Yes; it's not very common to get here, but I'll do that.
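The point of the supplier suggestion can be shown with a toy logger (hypothetical `LazyLogDemo` class, not log4j or the Elasticsearch `Strings` utility): when the message is passed as a `Supplier`, the string is only built if debug logging is actually enabled.

```java
import java.util.function.Supplier;

// Toy logger illustrating lazy message formatting via a Supplier.
class LazyLogDemo {
    boolean debugEnabled = false;
    int formats = 0;                       // counts how often we actually format

    void debug(Supplier<String> message) {
        if (debugEnabled) {
            System.out.println(message.get());  // format only when needed
        }
    }

    String format(String template, Object arg) {
        formats++;
        return template.replace("{}", String.valueOf(arg));
    }
}
```

With an eager call the formatting cost is paid on every invocation even when debug logging is off; the supplier defers it to the rare case where the message is emitted.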
```java
@Override
public void onResponse(NodeTasksResponse nodeResponse) {
    synchronized (taskResponses) {
        taskResponses.addAll(nodeResponse.results);
```
I've run some tests and have seen a lot of cases where `nodeResponse.results` is empty. Would it make sense to perform an `isEmpty` check on `nodeResponse.results` before acquiring the lock on `taskResponses`?
Yes that makes sense.
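The agreed change amounts to a fast path that skips lock acquisition when a node returned no results. A minimal sketch, assuming a hypothetical `Collector` class rather than the PR's actual listener:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the suggested isEmpty fast path before synchronizing.
class Collector<T> {
    private final List<T> taskResponses = new ArrayList<>();

    void addAll(List<T> nodeResults) {
        if (nodeResults.isEmpty()) {
            return;                        // common case: skip the lock entirely
        }
        synchronized (taskResponses) {
            taskResponses.addAll(nodeResults);
        }
    }

    int size() {
        synchronized (taskResponses) {
            return taskResponses.size();
        }
    }
}
```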
```java
private void nodeOperation(CancellableTask task, NodeTaskRequest nodeTaskRequest, ActionListener<NodeTasksResponse> listener) {
    TasksRequest request = nodeTaskRequest.tasksRequest;
    processTasks(request, ActionListener.wrap(tasks -> nodeOperation(task, listener, request, tasks), listener::onFailure));
    final var taskResponses = new ArrayList<TaskResponse>();
```
Does it make sense to pre-size `taskResponses` based on the number of nodes from which we collect responses?
I don't think the number of nodes would be a particularly useful estimate for this - as you say, many nodes return nothing, but sometimes they will return thousands of results. These are pretty low-throughput APIs so I think the default sizing is ok.
@elasticmachine update branch
LGTM.
Slightly complex (but good). I wonder if a comment or two on how the result lists are captured locally, and how cancellation ensures they are freed, could make sense?
We have this somewhat-complex pattern in 3 places already, and elastic#96279 will introduce a couple more, so this commit extracts it as a dedicated utility. Relates elastic#92987 Relates elastic#93484
Yeah, it is a bit convoluted, isn't it? I extracted a utility (and added more commentary) in #96373, and will update this PR to use it once that's merged.