
Conversation

@nicktindall (Contributor) commented Apr 10, 2025

  • I added a ThrottledTaskRunner to optionally execute the shared blob cache evictions asynchronously
    • SearchableSnapshotIndexEventListener and SearchableSnapshotIndexFoldersDeletionListener now perform evictions according to the following logic:
      • For DELETED shards that we know won't be re-assigned to the node, we schedule asynchronous eviction
      • For FAILED shards that may be re-assigned, but for which we would like to clear potentially invalid state, we retain synchronous evictions
      • For any other index removal, we just let the cache entries expire
    • We limit to 5 concurrent deletion threads by default. I think there are limited gains to be had from increasing concurrency much beyond this, because the removals happen inside a mutex anyway (the scan for what to remove can be done concurrently). Open to reducing it to fewer than 5 if we think that’s appropriate.
  • Whenever we are clearing the cache, we only clear the shared cache for partially mounted indices. It looked like only partially mounted indices use the shared cache. I imagine that if recommendations are followed and people use dedicated frozen nodes, there will be limited impact from that change, because the cache will either be empty or always scanned. Open to removing those changes for simplicity’s sake.
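
A minimal sketch of how the asynchronous path described above could be wired on top of ThrottledTaskRunner follows; the field names, the executor choice and the surrounding class are illustrative assumptions, not the exact code in this PR.

// Sketch only: assumes ThrottledTaskRunner#enqueueTask(ActionListener<Releasable>) and an
// existing synchronous forceEvict(Predicate<KeyType>) on the shared blob cache service.
private final ThrottledTaskRunner asyncEvictionsRunner = new ThrottledTaskRunner(
    "shared_blob_cache_evictions",
    5, // capped concurrency; removals serialise on an internal mutex anyway
    threadPool.generic()
);

public void forceEvictAsync(Predicate<KeyType> cacheKeyPredicate) {
    asyncEvictionsRunner.enqueueTask(new ActionListener<>() {
        @Override
        public void onResponse(Releasable releasable) {
            try (releasable) {
                forceEvict(cacheKeyPredicate); // scan and evict off the cluster applier thread
            }
        }

        @Override
        public void onFailure(Exception e) {
            logger.warn("failed to evict shared blob cache entries asynchronously", e);
        }
    });
}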

I don’t know if there’s anything to be gained by moving evictions from the CacheService off the applier thread any more than they already are. I have concerns about making that more asynchronous than it is, because CacheService has a method called waitForCacheFilesEvictionIfNeeded which takes the shardsEvictionsMutex and blocks until any pending evictions for the specified shard have completed. It uses the pendingShardsEvictions map to know whether there are any pending evictions. If we add another layer of asynchrony, we would potentially be adding a “shadow” queue of evictions that this method doesn’t know about, and I wonder if that might break things.
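
To make that concern concrete, the waiting pattern described above looks roughly like the following simplified paraphrase; the key type and the helper names (ShardId keys, genericExecutor, evictShard) are placeholders, not the actual CacheService code.

// Simplified paraphrase: evictions are registered under the mutex before being submitted,
// and the wait only covers evictions that appear in pendingShardsEvictions.
private final Object shardsEvictionsMutex = new Object();
private final Map<ShardId, Future<?>> pendingShardsEvictions = new HashMap<>();

void markShardAsEvictedInCache(ShardId shardId) {
    synchronized (shardsEvictionsMutex) {
        pendingShardsEvictions.computeIfAbsent(shardId, id -> genericExecutor.submit(() -> evictShard(id)));
    }
}

void waitForCacheFilesEvictionIfNeeded(ShardId shardId) {
    final Future<?> pending;
    synchronized (shardsEvictionsMutex) {
        pending = pendingShardsEvictions.get(shardId);
    }
    if (pending != null) {
        FutureUtils.get(pending); // blocks until the registered eviction completes
    }
    // An eviction queued in an extra asynchronous layer would never appear in
    // pendingShardsEvictions, so this method could return while it is still outstanding.
}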

If there are performance issues with CacheService evictions, I think we’d be better off optimising the enqueueing and processing of evictions there; some ideas for that include:

  • Reduce the amount of lock contention. There is potentially a lot of contention for the shardsEvictionsMutex between the evicting threads on generic and the calls to markShardAsEvictedInCache. I believe there are ways to reduce that.
  • Reduce the amount of concurrent evictions. Currently there is no limit other than the size of the generic pool. We could add a ThrottledTaskRunner, and it might reduce the contention enough to make markShardAsEvictedInCache faster.
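
As a sketch of the first idea (reducing contention), the pending-evictions bookkeeping could move to a concurrent map so that registering an eviction no longer contends on a single mutex with the evicting threads; the names below are illustrative assumptions, not current CacheService code.

// Sketch of idea 1: per-key locking via ConcurrentHashMap instead of one global mutex.
private final ConcurrentMap<ShardId, Future<?>> pendingShardsEvictions = new ConcurrentHashMap<>();

void markShardAsEvictedInCache(ShardId shardId) {
    // computeIfAbsent locks only the bin for this key; duplicate requests reuse the pending future
    pendingShardsEvictions.computeIfAbsent(shardId, id -> genericExecutor.submit(() -> {
        try {
            evictShard(id);
        } finally {
            pendingShardsEvictions.remove(id); // completed evictions are no longer "pending"
        }
    }));
}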

Relates: ES-10744

@elasticsearchmachine elasticsearchmachine added the v9.1.0 and needs:triage (Requires assignment of a team area label) labels Apr 10, 2025
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Indexing (Meta label for Distributed Indexing team) label and removed the needs:triage (Requires assignment of a team area label) label Apr 10, 2025
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-distributed-indexing (Team:Distributed Indexing)

@elasticsearchmachine (Collaborator)

Hi @nicktindall, I've created a changelog YAML for you.

@henningandersen (Contributor) left a comment

Looks good, will leave actual approval to Tanguy.

I think we could maybe do the spawn further out to avoid too many tasks - but it may not really be helpful.

I guess this does not address contention on the CacheService evictions - but we can see if we need to address that too.

@tlrx (Member) left a comment

I left some comments about the (lack of) reasons to evict cache entries for partially mounted shards. I do like the new forceEvictAsync method; it might be useful in other places too.

final SharedBlobCacheService<CacheKey> sharedBlobCacheService =
    SearchableSnapshotIndexFoldersDeletionListener.this.frozenCacheServiceSupplier.get();
assert sharedBlobCacheService != null : "frozen cache service not initialized";
sharedBlobCacheService.forceEvictAsync(SearchableSnapshots.forceEvictPredicate(shardId, indexSettings.getSettings()));
Member

I'm afraid I don't remember all the decisions around evicting cache entries for partially mounted shards 😞

I suspect that we made it this way to allow cache regions to be reused sooner without waiting for them to decay. It was also useful at the beginning when some corrupted data got written in the cache for some reason, as the forced eviction would clear the mess for us.

But besides this, for partially mounted shards I don't see much reason to force the eviction of cache entries vs. just letting them expire in the cache. And if the shard is quickly relocated then reassigned to the same node, I think there is a risk that the async force eviction now runs concurrently with a shard recovery?

So maybe we could only force-evict asynchronously when the shard is deleted or failed, and leave the entries in the cache if it's no longer assigned.

Contributor Author

Thanks! That sounds reasonable. It's easy to implement in SearchableSnapshotIndexEventListener#beforeIndexRemoved because we have the IndexRemovalReason. It's a bit trickier in SearchableSnapshotIndexFoldersDeletionListener#before(Index|Shard)FoldersDeleted because we lack that context; I'll trace back to where those deletions originate to see if there's an obvious way to propagate it.

Contributor Author

I had a go at propagating the reason for the deletion to the listeners. This allows the listener to trigger the asynchronous eviction when we know the shards/indices aren't coming back (i.e. only on DELETED). It meant changes in a few places.

  • I used the IndexRemovalReason to communicate the reason for deletion. I don't like borrowing that from an unrelated interface but we did already have it in scope in some of these places. If we think it's right to use it I could break it out to be a top-level enum rather than being under IndicesClusterStateService.AllocatedIndices.IndexRemovalReason.
  • There are some places that now take a reasonText and an IndexRemovalReason. We could get rid of the reason text if we don't feel it's adding anything, but it would mean some log messages would change. It sometimes seems to offer more context: for example, the text is different when the IndexService executes a pending delete vs. when it succeeds on the first attempt, and "delete unassigned index" specifies that the index is being deleted despite not being assigned to the local node.
  • I think the only time it's safe to schedule an asynchronous delete is on an IndexRemovalReason.DELETED. I don't think FAILURE is appropriate, because I assume we could retry after one of those? I don't think I have the context to make this call.

Member

Thanks Nick.

  • If we think it's right to use it I could break it out to be a top-level enum rather than being under IndicesClusterStateService.AllocatedIndices.IndexRemovalReason.

That makes sense.

  • There are some places that now take a reasonText and an IndexRemovalReason

Thanks for having kept the reason as text. It provides a bit more context, and people are also used to searching for it in logs.

  • I think the only time it's safe to schedule an asynchronous delete is on an IndexRemovalReason.DELETED. I don't think FAILURE is appropriate, because I assume we could retry after one of those?

Yes, it is possible that the failed shard gets reassigned to the same node after it failed. But in that case, we don't really know the cause of the failure, and I think it would be preferable to synchronously evict the cache. It makes sure that cached data are cleaned up so that retries will fetch them again from the source of truth (in case the cached data are the cause of the failure: if we were not evicting them, the shard would never have a chance to recover).

It goes against the purpose of this PR, but shard failures should be the exception, so I think keeping the synchronous eviction is OK for failures.

@nicktindall (Contributor Author) May 21, 2025

I have moved IndexRemovalReason to be a top-level enum in the org.elasticsearch.indices.cluster package. I wasn't sure if this was the best location for it, but I think it has the most meaning in the org.elasticsearch.indices.cluster.IndicesClusterStateService.AllocatedIndices interface where it came from, so I left it close to there.

I changed the logic to:

  • evict DELETED shards/indices asynchronously
  • evict FAILED shards/indices synchronously
  • leave everything else to age out of the cache

I'll investigate what testing might be appropriate.
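
For illustration, here is a sketch of that three-way decision as it might look in the listener; apart from the IndexRemovalReason constants, the names here are assumptions rather than the exact code merged in this PR.

// Sketch of the eviction policy described above: DELETED async, FAILURE sync, otherwise decay.
private void evictCacheEntries(ShardId shardId, Settings settings, IndexRemovalReason reason) {
    final var predicate = SearchableSnapshots.forceEvictPredicate(shardId, settings);
    switch (reason) {
        case DELETED ->
            // the shard will not be reassigned to this node: evict off the applier thread
            sharedBlobCacheService.forceEvictAsync(predicate);
        case FAILURE ->
            // the shard may be retried on this node: evict synchronously so potentially
            // invalid cache entries cannot be reused by, or race with, a recovery
            sharedBlobCacheService.forceEvict(predicate);
        default -> {
            // NO_LONGER_ASSIGNED, CLOSED, etc.: let cache entries age out of the cache
        }
    }
}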

@nicktindall nicktindall requested a review from a team as a code owner April 23, 2025 03:40

this.indexSettings,
shardPaths,
IndexRemovalReason.FAILURE
)
Contributor Author

This may be a mis-categorisation as FAILURE. The javadoc seems to suggest it's deleting remnants of a different shard rather than the shard being created, due to a name collision. So we're deleting not because the shard failed to start, but to clear old state from a shard that used to have the same name as the one being started.

Member

I think it's OK to use FAILURE, but maybe worth a comment?

Contributor Author

Actually, the more I look at it, the more I think it's vanilla enough to use without a comment. It's just clearing some bad state, which is the same as all the other cases. The fact that the bad state came from an earlier event is kind of irrelevant.

@nicktindall nicktindall requested a review from tlrx April 23, 2025 10:50
@tlrx (Member) left a comment

Sorry for the late review. I left a comment about evictions in case of shard failures; otherwise looks good.


@nicktindall nicktindall changed the title from "Evict from the shared blob cache asynchronously" to "Optimise shared-blob-cache evictions" May 23, 2025
@nicktindall nicktindall requested a review from tlrx May 23, 2025 09:02
@tlrx (Member) left a comment

LGTM

@nicktindall nicktindall merged commit 83fe2ed into elastic:main May 27, 2025
19 checks passed
@nicktindall nicktindall deleted the evict_from_the_shared_blob_cache_asynchronously branch May 27, 2025 00:01
nicktindall added a commit to nicktindall/elasticsearch that referenced this pull request May 27, 2025
Closes: ES-10744

Co-authored-by: Tanguy Leroux <[email protected]>

(cherry picked from commit 83fe2ed)
elasticsearchmachine pushed a commit that referenced this pull request May 28, 2025
* Optimise shared-blob-cache evictions (#126581)

Closes: ES-10744

Co-authored-by: Tanguy Leroux <[email protected]>

(cherry picked from commit 83fe2ed)

* Update docs/changelog/128539.yaml

* Fix changelogs

* Delete docs/changelog/128539.yaml

* Restore original changelog

Labels

:Distributed Indexing/Searchable Snapshots (Searchable snapshots / frozen indices), >enhancement, Team:Distributed Indexing (Meta label for Distributed Indexing team), v9.1.0
