Conversation

rdettai-sk (Collaborator) commented Nov 28, 2025

Description

We observed that query spikes create huge backlogs of leaf search tasks that don't get cancelled when the queries time out.

This is caused by the timeout cancellation not being propagated to the spawned tasks.

This implementation is based on JoinSet, a Tokio primitive that helps manage the lifecycle of a group of tasks. It ensures that all the tasks get cancelled when the leaf request times out, because dropping the JoinSet aborts every task still running in it.
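
A minimal sketch of the pattern (simplified placeholder types, not the actual Quickwit leaf search code): the per-split tasks are spawned into a JoinSet and the set is drained under a timeout, so when the timeout fires the JoinSet is dropped and the remaining tasks are aborted with it.

```rust
use std::time::Duration;
use tokio::task::JoinSet;
use tokio::time::timeout;

// Placeholder for the per-split search work.
async fn search_one_split(split_id: String) -> Result<u64, String> {
    Ok(split_id.len() as u64)
}

async fn leaf_search(split_ids: Vec<String>) -> Result<Vec<u64>, String> {
    let mut join_set = JoinSet::new();
    for split_id in split_ids {
        join_set.spawn(search_one_split(split_id));
    }
    // Drain the set under a deadline. The JoinSet is moved into this future,
    // so if the timeout fires, the future is dropped, the JoinSet is dropped
    // with it, and all still-running tasks are aborted.
    let drain = async move {
        let mut responses = Vec::new();
        while let Some(join_result) = join_set.join_next().await {
            responses.push(join_result.map_err(|err| err.to_string())??);
        }
        Ok::<_, String>(responses)
    };
    timeout(Duration::from_secs(30), drain)
        .await
        .map_err(|_elapsed| "leaf search timed out".to_string())?
}
```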

How was this PR tested?

Describe how you tested this PR.

rdettai-sk self-assigned this Nov 28, 2025
guilload self-requested a review December 2, 2025 18:30

try_join_all(leaf_request_tasks),
)
.await??;
let leaf_responses: Vec<LeafSearchResponse> = try_join_vec.try_join_all().await?;

Member:

I know this is how things are currently, but do we actually need the responses to come back in the same order? That seems odd.

rdettai-sk (author):

I asked myself the same question. I came to the conclusion that it's likely for reproducibility reasons: same list of splits + same query => same result. But that only holds at the leaf level. Given that the split list doesn't seem to be deterministic on the root (no order for list_indexes_metadata()), I don't know how much we really win from this.

Member:

Probably nothing. Let's get rid of that constraint so we can use JoinSet directly.

rdettai-sk (author):

Fine by me! It simplifies the code quite a bit. I'll apply the same orderless processing when gathering join errors from individual splits.
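
For reference, a toy illustration of the difference (dummy values, not the actual leaf search types): try_join_all returns results in the same order as the input futures, while JoinSet yields them in completion order.

```rust
use futures::future::try_join_all;
use tokio::task::JoinSet;

#[tokio::main]
async fn main() {
    // try_join_all: results come back in the order the futures were given.
    let ordered: Vec<u32> =
        try_join_all((0..3u32).map(|i| async move { Ok::<_, ()>(i * 10) }))
            .await
            .unwrap();
    assert_eq!(ordered, vec![0, 10, 20]);

    // JoinSet: results are yielded as the tasks finish, in no guaranteed order.
    let mut join_set = JoinSet::new();
    for i in 0..3u32 {
        join_set.spawn(async move { i * 10 });
    }
    let mut unordered = Vec::new();
    while let Some(join_result) = join_set.join_next().await {
        unordered.push(join_result.unwrap());
    }
    unordered.sort_unstable();
    assert_eq!(unordered, vec![0, 10, 20]);
}
```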

rdettai-sk (author):

@guilload done! I simplified the code to use plain JoinSet and rebased the PR 😃

rdettai-sk force-pushed the propagate-leaf-search-cancel branch from 97823ad to 05093f0 on December 5, 2025, 14:05

}
}
while let Some(result) = join_set.join_next().await {
incremental_merge_collector.add_result(result??)?;

Member:

This is not the original behavior, right? Before, we would keep going; now we return an error right away.

rdettai-sk (author):

We were already erroring right away for JoinError. For regular errors we were continuing, but adding a SplitSearchError with an "unknown" split_id to the list of failed splits. I think the most likely reason a child request might fail is a merge error. Given that the user doesn't know how many splits failed, nor which ones, it seems quite unlikely that the result can be reasonably used. I can revert this part if you think the partial result is valuable in this scenario.
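
If the partial result does turn out to be valuable, a rough sketch of what continuing would look like (placeholder types, not the actual code): keep draining the JoinSet, collect the successful responses, and record per-split failures instead of returning on the first error.

```rust
use tokio::task::JoinSet;

// Sketch: drain the JoinSet to completion, keeping successful responses and
// recording failures instead of bailing out on the first error.
async fn drain_with_partial_results(
    mut join_set: JoinSet<Result<u64, String>>,
) -> (Vec<u64>, Vec<String>) {
    let mut responses = Vec::new();
    let mut failures = Vec::new();
    while let Some(join_result) = join_set.join_next().await {
        match join_result {
            // The task panicked or was aborted.
            Err(join_error) => failures.push(join_error.to_string()),
            // The task ran, but the per-split search returned an error.
            Ok(Err(split_error)) => failures.push(split_error),
            Ok(Ok(response)) => responses.push(response),
        }
    }
    (responses, failures)
}
```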

// An explicit task cancellation is not an error.
continue;
}
let position = split_with_task_id

Member:

It would be easier to have the future return the split id.

rdettai-sk (author):

That doesn't work in case of a panic (JoinError): the future never gets to return anything, so the split id has to be recovered some other way, e.g. from the task id.
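
For illustration, a rough sketch of the task-id approach (placeholder types; assumes a Tokio version where task ids are available on AbortHandle, JoinError, and JoinSet::join_next_with_id): the split id is recorded against the task id at spawn time, so it can still be recovered when the task panics and all we get back is a JoinError.

```rust
use std::collections::HashMap;
use tokio::task::{Id, JoinSet};

// Placeholder for the per-split search work.
async fn search_one_split(_split_id: String) -> Result<u64, String> {
    Ok(0)
}

// Sketch: remember which task id maps to which split at spawn time, so the
// split can still be identified when the task panics.
async fn leaf_search(split_ids: Vec<String>) -> (Vec<u64>, Vec<(String, String)>) {
    let mut join_set: JoinSet<Result<u64, String>> = JoinSet::new();
    let mut split_by_task_id: HashMap<Id, String> = HashMap::new();
    for split_id in split_ids {
        let abort_handle = join_set.spawn(search_one_split(split_id.clone()));
        split_by_task_id.insert(abort_handle.id(), split_id);
    }
    let mut responses = Vec::new();
    let mut failures = Vec::new();
    while let Some(join_result) = join_set.join_next_with_id().await {
        match join_result {
            Ok((_task_id, Ok(response))) => responses.push(response),
            Ok((task_id, Err(error))) => {
                failures.push((split_by_task_id[&task_id].clone(), error));
            }
            // The future panicked and never returned anything, but the
            // JoinError still carries the task id.
            Err(join_error) => {
                let split_id = split_by_task_id[&join_error.id()].clone();
                failures.push((split_id, join_error.to_string()));
            }
        }
    }
    (responses, failures)
}
```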

}
.await
};
let timeout = self.searcher_context.searcher_config.request_timeout();

Member:

If the root search is wrapped with the same timeout value and we cancel all the in-flight futures when we cancel the root search, why do we need this timeout? Belt and suspenders?

rdettai-sk (author):

I would rather not trust the RPC for cancellation. In case of a connection drop, that would rely on network-level timeouts to perform the cancellation, and those are often hidden defaults that are hard to figure out. It's also close to impossible to test against regressions (even though, to be honest, the current cancellation also lacks tests and I haven't found a good solution yet).

I also have this other PR where I enable different timeouts for different search sizes. In that PR the root timeout is disabled because the actual timeout is chosen on the leaf (open to change that, see PR description for details).
