Add cancellation support to IndicesRequestCache#141708

Open
drempapis wants to merge 29 commits into elastic:main from drempapis:fix/cache-cancellation-support
Contributor

@drempapis drempapis commented Feb 3, 2026

Related GitHub issue: #108703

The problem
When expensive queries fill up the search thread pool, threads can become blocked in IndicesRequestCache.getOrCompute waiting for other threads to compute cached results. If these queries are cancelled, the waiting threads don't react to the cancellation and continue blocking indefinitely. This can lead to search thread pool exhaustion, requiring node restarts to recover.

This PR adds cancellation support to the cache's blocking operations, so that waiting threads are notified when their task is cancelled. However, it does not prevent the search pool from filling with blocking tasks. A follow-up PR will change the cache to use SubscribableListener for a fully asynchronous solution. I have left that out of this PR to keep the change small and self-contained.
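To make the notification mechanism concrete, here is a minimal sketch of how a task could expose a Consumer<Runnable> registrar of the kind that blockOnFuture accepts. The class CancellableTaskSketch and its methods addListener, cancel and registrar are hypothetical names invented for illustration; they are not the Elasticsearch task API and not code from this PR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical stand-in for a cancellable task; not the Elasticsearch class.
final class CancellableTaskSketch {
    private final List<Runnable> listeners = new ArrayList<>();
    private boolean cancelled;

    // Registers a callback to run on cancellation. If the task is already
    // cancelled, the callback runs immediately on the calling thread.
    void addListener(Runnable listener) {
        boolean runNow;
        synchronized (this) {
            runNow = cancelled;
            if (runNow == false) {
                listeners.add(listener);
            }
        }
        if (runNow) {
            listener.run();
        }
    }

    // Marks the task cancelled and notifies each registered callback once.
    void cancel() {
        List<Runnable> toRun;
        synchronized (this) {
            if (cancelled) {
                return;
            }
            cancelled = true;
            toRun = new ArrayList<>(listeners);
            listeners.clear();
        }
        toRun.forEach(Runnable::run);
    }

    // The registrar is just a view of addListener as a Consumer<Runnable>.
    Consumer<Runnable> registrar() {
        return this::addListener;
    }

    public static void main(String[] args) {
        CancellableTaskSketch task = new CancellableTaskSketch();
        task.registrar().accept(() -> System.out.println("waiter notified"));
        task.cancel();
    }
}
```

Running callbacks outside the synchronized block is a deliberate choice in this sketch, so that a callback which itself takes locks cannot deadlock against the task's monitor.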

@drempapis drempapis changed the title Add cancellation support to IndicesRequestCache to prevent search thread pool exhaustion Add cancellation support to IndicesRequestCache Feb 3, 2026
@elasticsearchmachine elasticsearchmachine added needs:triage Requires assignment of a team area label v9.4.0 labels Feb 3, 2026
@drempapis drempapis added >non-issue Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch :Search Foundations/Search Catch all for Search Foundations and removed needs:triage Requires assignment of a team area label labels Feb 3, 2026
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-foundations (Team:Search Foundations)

@drempapis drempapis requested a review from DaveCTurner February 4, 2026 14:51
@drempapis drempapis added >bug and removed >non-issue labels Feb 11, 2026
@drempapis drempapis removed the request for review from DaveCTurner February 25, 2026 08:32
@eranweiss-elastic eranweiss-elastic self-requested a review March 16, 2026 13:28
* @throws TaskCancelledException if the operation was cancelled
*/
private static <T> T blockOnFuture(CompletableFuture<T> future, Consumer<Runnable> cancellationRegistrar) throws ExecutionException,
InterruptedException {
Contributor

The Javadoc mentions TaskCancelledException being thrown, but the method definition doesn't.

Contributor Author

That's true, removed

private void cleanupFailedFuture(CacheSegment segment, K key, CompletableFuture<Entry<K, V>> future) {
segment.writeLock.lock();
try {
if (segment.map != null && segment.map.get(key) == future) {
Contributor

I think that segment.map.get(key) != future deserves some handling. It's not expected, but maybe at least a log.

Contributor Author

I added a debug log when the key maps to a different future: Skipped cleanup for key [] because the future was replaced.

Do you think that it is enough? Do you suggest doing something else here?
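For readers following the thread, the guarded cleanup under discussion can be sketched as below. This is a simplified illustration, not the PR's code: the class CleanupSketch, its single lock and map, the boolean return value, and the use of java.util.logging at FINE level are all assumptions made to keep the example self-contained.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.locks.ReentrantLock;
import java.util.logging.Logger;

// Hypothetical single-segment cache holding in-flight futures.
final class CleanupSketch<K, V> {
    private static final Logger logger = Logger.getLogger(CleanupSketch.class.getName());

    private final ReentrantLock writeLock = new ReentrantLock();
    private final Map<K, CompletableFuture<V>> map = new HashMap<>();

    void put(K key, CompletableFuture<V> future) {
        writeLock.lock();
        try {
            map.put(key, future);
        } finally {
            writeLock.unlock();
        }
    }

    CompletableFuture<V> get(K key) {
        writeLock.lock();
        try {
            return map.get(key);
        } finally {
            writeLock.unlock();
        }
    }

    // Removes the entry only if the key still maps to this exact future
    // (identity comparison). Returns true if the entry was removed.
    boolean cleanupFailedFuture(K key, CompletableFuture<V> future) {
        writeLock.lock();
        try {
            if (map.get(key) == future) {
                map.remove(key);
                return true;
            }
            // Another thread installed a different future; leave it in place.
            logger.fine(() -> "Skipped cleanup for key [" + key + "] because the future was replaced");
            return false;
        } finally {
            writeLock.unlock();
        }
    }

    public static void main(String[] args) {
        CleanupSketch<String, String> cache = new CleanupSketch<>();
        CompletableFuture<String> stale = new CompletableFuture<>();
        CompletableFuture<String> fresh = new CompletableFuture<>();
        cache.put("k", stale);
        cache.put("k", fresh);
        System.out.println(cache.cleanupFailedFuture("k", stale));
        System.out.println(cache.cleanupFailedFuture("k", fresh));
    }
}
```

The identity check matters because removing by key alone could evict a newer in-flight computation that another thread installed after the failure.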

if (cancellationRegistrar != null) {
cancellationRegistrar.accept(() -> {
cancelled.set(true);
latch.countDown();
Contributor

I'm a bit worried about this if check. It assumes that a non-null cancellationRegistrar is passed whenever the task can be cancelled, but doesn't enforce it. If a task is cancelled while waiting on a future and cancellationRegistrar is null, this will be a deadlock.
A safer way to do this would be to call latch.countDown() regardless of the value of cancellationRegistrar.

Contributor Author

@drempapis drempapis Mar 20, 2026

The code is structured as follows:

future.whenComplete((value, throwable) -> {
    if (throwable != null) {
        error.set(throwable);
    } else {
        result.set(value);
    }
    latch.countDown();
});

if (cancellationRegistrar != null) {
    cancellationRegistrar.accept(() -> {
        cancelled.set(true);
        latch.countDown();
    });
}

The whenComplete callback, which always calls latch.countDown(), is registered before the cancellationRegistrar check. The latch is always released when the future completes, regardless of whether a registrar is provided.

When cancellationRegistrar == null, the waiting thread cannot exit early if its task is cancelled. It will remain blocked until the computation finishes. This is by design, not a deadlock.

Calling latch.countDown() unconditionally, regardless of cancellationRegistrar, would count the latch down immediately, causing latch.await() to return instantly with no result.

I've added a test to prove that when cancellationRegistrar is null, the thread is not deadlocked. It simply cannot exit early and must wait for the future to complete.
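The behaviour described in this reply can be reproduced with a self-contained sketch. The two callback registrations mirror the snippet quoted above; everything after latch.await() is an assumption about how the result might be unpacked, and java.util.concurrent.CancellationException stands in for TaskCancelledException so that the example compiles without Elasticsearch on the classpath.

```java
import java.util.concurrent.CancellationException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

final class BlockOnFutureSketch {

    // Blocks until the future completes or the cancellation callback fires.
    // A null registrar means no early exit: the caller waits for the future.
    static <T> T blockOnFuture(CompletableFuture<T> future, Consumer<Runnable> cancellationRegistrar)
        throws ExecutionException, InterruptedException {
        CountDownLatch latch = new CountDownLatch(1);
        AtomicReference<T> result = new AtomicReference<>();
        AtomicReference<Throwable> error = new AtomicReference<>();
        AtomicBoolean cancelled = new AtomicBoolean();

        // Always registered, so the latch is released whenever the future completes.
        future.whenComplete((value, throwable) -> {
            if (throwable != null) {
                error.set(throwable);
            } else {
                result.set(value);
            }
            latch.countDown();
        });

        // Optional early exit when the waiting task is cancelled.
        if (cancellationRegistrar != null) {
            cancellationRegistrar.accept(() -> {
                cancelled.set(true);
                latch.countDown();
            });
        }

        latch.await();

        // Assumed unpacking order: cancellation wins. If cancelled is false here,
        // the latch was released by whenComplete, so result or error is set.
        if (cancelled.get()) {
            throw new CancellationException("task cancelled while waiting");
        }
        if (error.get() != null) {
            throw new ExecutionException(error.get());
        }
        return result.get();
    }

    public static void main(String[] args) throws Exception {
        // No registrar: returns once the future completes.
        System.out.println(blockOnFuture(CompletableFuture.completedFuture("cached"), null));
        // A registrar that fires immediately simulates an already-cancelled task.
        try {
            blockOnFuture(new CompletableFuture<String>(), Runnable::run);
        } catch (CancellationException e) {
            System.out.println("cancelled");
        }
    }
}
```

In this sketch, counting the latch down unconditionally after registering the callbacks would make latch.await() return before either result or error is set, which is the failure mode described above.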

* @throws InterruptedException if the thread was interrupted
* @throws TaskCancelledException if the operation was cancelled
*/
private static <T> T blockOnFuture(CompletableFuture<T> future, Consumer<Runnable> cancellationRegistrar) throws ExecutionException,
Contributor

Because this is a static method, and has locking, I think it would be nice to test it in CacheTests.java.

Contributor Author

+1, added more tests

Labels

>bug :Search Foundations/Search Catch all for Search Foundations Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch v9.4.0

3 participants