Conversation

rdettai-sk (Collaborator) commented Nov 28, 2025

Description

We observed that query spikes create huge backlogs of leaf search tasks that don't get cancelled when the queries time out.

This is caused by the timeout cancellation not being propagated to the spawned tasks.

This implementation is based on JoinSet, a Tokio primitive that helps manage the lifecycle of a group of tasks. It ensures that all the tasks get cancelled when the leaf request times out, because dropping the JoinSet aborts every task still running in it.
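
A minimal sketch of the pattern (simplified placeholder types, not the actual Quickwit leaf search code): the per-split tasks are spawned into a JoinSet and the set is drained under a timeout, so when the timeout fires the JoinSet is dropped and the remaining tasks are aborted with it.

```rust
use std::time::Duration;
use tokio::task::JoinSet;
use tokio::time::timeout;

// Placeholder for the per-split search work.
async fn search_one_split(split_id: String) -> Result<u64, String> {
    Ok(split_id.len() as u64)
}

async fn leaf_search(split_ids: Vec<String>) -> Result<Vec<u64>, String> {
    let mut join_set = JoinSet::new();
    for split_id in split_ids {
        join_set.spawn(search_one_split(split_id));
    }
    // Drain the set under a deadline. The JoinSet is moved into this future,
    // so if the timeout fires, the future is dropped, the JoinSet is dropped
    // with it, and all still-running tasks are aborted.
    let drain = async move {
        let mut responses = Vec::new();
        while let Some(join_result) = join_set.join_next().await {
            responses.push(join_result.map_err(|err| err.to_string())??);
        }
        Ok::<_, String>(responses)
    };
    timeout(Duration::from_secs(30), drain)
        .await
        .map_err(|_elapsed| "leaf search timed out".to_string())?
}
```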

How was this PR tested?

Describe how you tested this PR.

rdettai-sk self-assigned this Nov 28, 2025
guilload self-requested a review December 2, 2025 18:30

try_join_all(leaf_request_tasks),
)
.await??;
let leaf_responses: Vec<LeafSearchResponse> = try_join_vec.try_join_all().await?;

Member:

I know this is how things are currently, but do we actually need the responses to come back in the same order? That seems odd.

rdettai-sk (author):

I asked myself the same question. I came to the conclusion that it's likely for reproducibility reasons: same list of splits + same query => same result. But that only holds at the leaf level. Given that the split list doesn't seem to be deterministic on the root (no order for list_indexes_metadata()), I don't know how much we really win from this.

Member:

Probably nothing. Let's get rid of that constraint so we can use JoinSet directly.

rdettai-sk (author):

Fine by me! It simplifies the code quite a bit. I'll apply the same orderless processing when gathering join errors from individual splits.
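
For reference, a toy illustration of the difference (dummy values, not the actual leaf search types): try_join_all returns results in the same order as the input futures, while JoinSet yields them in completion order.

```rust
use futures::future::try_join_all;
use tokio::task::JoinSet;

#[tokio::main]
async fn main() {
    // try_join_all: results come back in the order the futures were given.
    let ordered: Vec<u32> =
        try_join_all((0..3u32).map(|i| async move { Ok::<_, ()>(i * 10) }))
            .await
            .unwrap();
    assert_eq!(ordered, vec![0, 10, 20]);

    // JoinSet: results are yielded as the tasks finish, in no guaranteed order.
    let mut join_set = JoinSet::new();
    for i in 0..3u32 {
        join_set.spawn(async move { i * 10 });
    }
    let mut unordered = Vec::new();
    while let Some(join_result) = join_set.join_next().await {
        unordered.push(join_result.unwrap());
    }
    unordered.sort_unstable();
    assert_eq!(unordered, vec![0, 10, 20]);
}
```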

rdettai-sk (author):

@guilload done! I simplified the code to use plain JoinSet and rebased the PR 😃

rdettai-sk force-pushed the propagate-leaf-search-cancel branch from 97823ad to 05093f0 on December 5, 2025, 14:05

}
}
while let Some(result) = join_set.join_next().await {
incremental_merge_collector.add_result(result??)?;

Member:

This is not the original behavior, right? Before, we would keep going; now we return an error right away.

rdettai-sk (author):

We were already erroring right away for JoinError. For regular errors we were continuing, but adding a SplitSearchError with an "unknown" split_id to the list of failed splits. I think the most likely reason a child request might fail is a merge error. Given that the user doesn't know how many splits failed, nor which ones, it seems quite unlikely that the result can be reasonably used. I can revert this part if you think the partial result is valuable in this scenario.
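
If the partial result does turn out to be valuable, a rough sketch of what continuing would look like (placeholder types, not the actual code): keep draining the JoinSet, collect the successful responses, and record per-split failures instead of returning on the first error.

```rust
use tokio::task::JoinSet;

// Sketch: drain the JoinSet to completion, keeping successful responses and
// recording failures instead of bailing out on the first error.
async fn drain_with_partial_results(
    mut join_set: JoinSet<Result<u64, String>>,
) -> (Vec<u64>, Vec<String>) {
    let mut responses = Vec::new();
    let mut failures = Vec::new();
    while let Some(join_result) = join_set.join_next().await {
        match join_result {
            // The task panicked or was aborted.
            Err(join_error) => failures.push(join_error.to_string()),
            // The task ran, but the per-split search returned an error.
            Ok(Err(split_error)) => failures.push(split_error),
            Ok(Ok(response)) => responses.push(response),
        }
    }
    (responses, failures)
}
```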

// An explicit task cancellation is not an error.
continue;
}
let position = split_with_task_id

Member:

It would be easier to have the future return the split id.

rdettai-sk (author):

That doesn't work in case of a panic (JoinError): the future never gets to return anything, so the split id has to be recovered some other way, e.g. from the task id.
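
For illustration, a rough sketch of the task-id approach (placeholder types; assumes a Tokio version where task ids are available on AbortHandle, JoinError, and JoinSet::join_next_with_id): the split id is recorded against the task id at spawn time, so it can still be recovered when the task panics and all we get back is a JoinError.

```rust
use std::collections::HashMap;
use tokio::task::{Id, JoinSet};

// Placeholder for the per-split search work.
async fn search_one_split(_split_id: String) -> Result<u64, String> {
    Ok(0)
}

// Sketch: remember which task id maps to which split at spawn time, so the
// split can still be identified when the task panics.
async fn leaf_search(split_ids: Vec<String>) -> (Vec<u64>, Vec<(String, String)>) {
    let mut join_set: JoinSet<Result<u64, String>> = JoinSet::new();
    let mut split_by_task_id: HashMap<Id, String> = HashMap::new();
    for split_id in split_ids {
        let abort_handle = join_set.spawn(search_one_split(split_id.clone()));
        split_by_task_id.insert(abort_handle.id(), split_id);
    }
    let mut responses = Vec::new();
    let mut failures = Vec::new();
    while let Some(join_result) = join_set.join_next_with_id().await {
        match join_result {
            Ok((_task_id, Ok(response))) => responses.push(response),
            Ok((task_id, Err(error))) => {
                failures.push((split_by_task_id[&task_id].clone(), error));
            }
            // The future panicked and never returned anything, but the
            // JoinError still carries the task id.
            Err(join_error) => {
                let split_id = split_by_task_id[&join_error.id()].clone();
                failures.push((split_id, join_error.to_string()));
            }
        }
    }
    (responses, failures)
}
```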

}
.await
};
let timeout = self.searcher_context.searcher_config.request_timeout();

Member:

If the root search is wrapped with the same timeout value and we cancel all the in-flight futures when we cancel the root search, why do we need this timeout? Belt and suspenders?

rdettai-sk (author):

I would rather not trust the RPC for cancellation. In case of a connection drop, that would rely on network-level timeouts to perform the cancellation, and those are often hidden defaults that are hard to figure out. It's also close to impossible to test against regressions (even though, to be honest, the current cancellation also lacks tests and I haven't found a good solution yet).

I also have this other PR where I enable different timeouts for different search sizes. In that PR the root timeout is disabled because the actual timeout is chosen on the leaf (open to change that, see PR description for details).
