
Conversation

@nicktindall (Contributor) commented Apr 10, 2025

  • I added a ThrottledTaskRunner to optionally execute the shared blob cache evictions asynchronously
    • SearchableSnapshotIndexEventListener and SearchableSnapshotIndexFoldersDeletionListener now perform evictions according to the following logic:
      • For DELETED shards that we know won't be re-assigned to the node, we schedule asynchronous eviction
      • For FAILED shards that may be re-assigned, but for which we would like to clear potentially invalid state, we retain synchronous evictions
      • For any other index removal, we just let the cache entries expire
    • We limit to 5 concurrent deletion threads by default. I think there are limited gains to be had from increasing concurrency much beyond this, because the removals happen inside a mutex anyway (the scan for what to remove can be done concurrently). Open to reducing it to fewer than 5 if we think that’s appropriate.
  • Whenever we are clearing the cache, we only clear the shared cache for partially mounted indices. It looked like only partially mounted indices use the shared cache. I imagine that if recommendations are followed and people use dedicated frozen nodes, there will be limited impact from that change, because the cache will either be empty or always scanned. Open to removing those changes for simplicity’s sake.
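
A minimal sketch of how the asynchronous path described above could be wired on top of ThrottledTaskRunner follows; the field names, the executor choice and the surrounding class are illustrative assumptions, not the exact code in this PR.

// Sketch only: assumes ThrottledTaskRunner#enqueueTask(ActionListener<Releasable>) and an
// existing synchronous forceEvict(Predicate<KeyType>) on the shared blob cache service.
private final ThrottledTaskRunner asyncEvictionsRunner = new ThrottledTaskRunner(
    "shared_blob_cache_evictions",
    5, // capped concurrency; removals serialise on an internal mutex anyway
    threadPool.generic()
);

public void forceEvictAsync(Predicate<KeyType> cacheKeyPredicate) {
    asyncEvictionsRunner.enqueueTask(new ActionListener<>() {
        @Override
        public void onResponse(Releasable releasable) {
            try (releasable) {
                forceEvict(cacheKeyPredicate); // scan and evict off the cluster applier thread
            }
        }

        @Override
        public void onFailure(Exception e) {
            logger.warn("failed to evict shared blob cache entries asynchronously", e);
        }
    });
}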

I don’t know if there’s anything to be gained by moving evictions from the CacheService off the applier thread any more than they already are. I have concerns about making that more asynchronous than it is, because CacheService has a method called waitForCacheFilesEvictionIfNeeded which takes the shardsEvictionsMutex and blocks until any pending evictions for the specified shard have completed. It uses the pendingShardsEvictions map to know whether there are any pending evictions. If we add another layer of asynchrony, we would potentially be adding a “shadow” queue of evictions that this method doesn’t know about, and I wonder if that might break things.
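
To make that concern concrete, the waiting pattern described above looks roughly like the following simplified paraphrase; the key type and the helper names (ShardId keys, genericExecutor, evictShard) are placeholders, not the actual CacheService code.

// Simplified paraphrase: evictions are registered under the mutex before being submitted,
// and the wait only covers evictions that appear in pendingShardsEvictions.
private final Object shardsEvictionsMutex = new Object();
private final Map<ShardId, Future<?>> pendingShardsEvictions = new HashMap<>();

void markShardAsEvictedInCache(ShardId shardId) {
    synchronized (shardsEvictionsMutex) {
        pendingShardsEvictions.computeIfAbsent(shardId, id -> genericExecutor.submit(() -> evictShard(id)));
    }
}

void waitForCacheFilesEvictionIfNeeded(ShardId shardId) {
    final Future<?> pending;
    synchronized (shardsEvictionsMutex) {
        pending = pendingShardsEvictions.get(shardId);
    }
    if (pending != null) {
        FutureUtils.get(pending); // blocks until the registered eviction completes
    }
    // An eviction queued in an extra asynchronous layer would never appear in
    // pendingShardsEvictions, so this method could return while it is still outstanding.
}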

If there are performance issues with CacheService evictions, I think we’d be better off optimising the enqueueing and processing of evictions there; some ideas for that include:

  • Reduce the amount of lock contention. There is potentially a lot of contention for the shardsEvictionsMutex between the evicting threads on generic and the calls to markShardAsEvictedInCache. I believe there are ways to reduce that.
  • Reduce the amount of concurrent evictions. Currently there is no limit other than the size of the generic pool. We could add a ThrottledTaskRunner, and it might reduce the contention enough to make markShardAsEvictedInCache faster.
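
As a sketch of the first idea (reducing contention), the pending-evictions bookkeeping could move to a concurrent map so that registering an eviction no longer contends on a single mutex with the evicting threads; the names below are illustrative assumptions, not current CacheService code.

// Sketch of idea 1: per-key locking via ConcurrentHashMap instead of one global mutex.
private final ConcurrentMap<ShardId, Future<?>> pendingShardsEvictions = new ConcurrentHashMap<>();

void markShardAsEvictedInCache(ShardId shardId) {
    // computeIfAbsent locks only the bin for this key; duplicate requests reuse the pending future
    pendingShardsEvictions.computeIfAbsent(shardId, id -> genericExecutor.submit(() -> {
        try {
            evictShard(id);
        } finally {
            pendingShardsEvictions.remove(id); // completed evictions are no longer "pending"
        }
    }));
}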

Relates: ES-10744

@elasticsearchmachine elasticsearchmachine added the v9.1.0 and needs:triage (Requires assignment of a team area label) labels Apr 10, 2025
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Indexing (Meta label for Distributed Indexing team) label and removed the needs:triage (Requires assignment of a team area label) label Apr 10, 2025
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-distributed-indexing (Team:Distributed Indexing)

@elasticsearchmachine (Collaborator)

Hi @nicktindall, I've created a changelog YAML for you.

@henningandersen (Contributor) left a comment

Looks good, will leave actual approval to Tanguy.

I think we could maybe do the spawn further out to avoid too many tasks - but it may not really be helpful.

I guess this does not address contention on the CacheService evictions - but we can see if we need to address that too.

@tlrx (Member) left a comment

I left some comments about the (lack of) reasons to evict cache entries for partially mounted shards. I do like the new forceEvictAsync method; it might be useful in other places too.

final SharedBlobCacheService<CacheKey> sharedBlobCacheService =
    SearchableSnapshotIndexFoldersDeletionListener.this.frozenCacheServiceSupplier.get();
assert sharedBlobCacheService != null : "frozen cache service not initialized";
sharedBlobCacheService.forceEvictAsync(SearchableSnapshots.forceEvictPredicate(shardId, indexSettings.getSettings()));
Member

I'm afraid I don't remember all the decisions around evicting cache entries for partially mounted shards 😞

I suspect that we made it this way to allow cache regions to be reused sooner without waiting for them to decay. It was also useful at the beginning when some corrupted data got written in the cache for some reason, as the forced eviction would clear the mess for us.

But besides this, for partially mounted shards I don't see much reason to force the eviction of cache entries vs. just letting them expire in the cache. And if the shard is quickly relocated then reassigned to the same node, I think there is a risk that the async force eviction now runs concurrently with a shard recovery?

So maybe we could only force-evict asynchronously when the shard is deleted or failed, and leave the entries in the cache if it's no longer assigned.

Contributor Author

Thanks! That sounds reasonable. It's easy to implement in SearchableSnapshotIndexEventListener#beforeIndexRemoved because we have the IndexRemovalReason. It's a bit trickier in SearchableSnapshotIndexFoldersDeletionListener#before(Index|Shard)FoldersDeleted because we lack that context; I'll trace back to where those deletions originate to see if there's an obvious way to propagate it.

Contributor Author

I had a go at propagating the reason for the deletion to the listeners. This allows the listener to trigger the asynchronous eviction when we know the shards/indices aren't coming back (i.e. only on DELETED). It meant changes in a few places.

  • I used the IndexRemovalReason to communicate the reason for deletion. I don't like borrowing that from an unrelated interface but we did already have it in scope in some of these places. If we think it's right to use it I could break it out to be a top-level enum rather than being under IndicesClusterStateService.AllocatedIndices.IndexRemovalReason.
  • There are some places that now take a reasonText and an IndexRemovalReason. We could get rid of the reason text if we don't feel it's adding anything, but it would mean some log messages would change. It sometimes seems to offer more context: for example, the text is different when the IndexService executes a pending delete vs. when it succeeds on the first attempt, and "delete unassigned index" specifies that the index is being deleted despite not being assigned to the local node.
  • I think the only time it's safe to schedule an asynchronous delete is on an IndexRemovalReason.DELETED. I don't think FAILURE is appropriate, because I assume we could retry after one of those? I don't think I have the context to make this call.

Member

Thanks Nick.

  • If we think it's right to use it I could break it out to be a top-level enum rather than being under IndicesClusterStateService.AllocatedIndices.IndexRemovalReason.

That makes sense.

  • There are some places that now take a reasonText and an IndexRemovalReason

Thanks for having kept the reason as text. It provides a bit more context, and people are also used to searching for it in logs.

  • I think the only time it's safe to schedule an asynchronous delete is on an IndexRemovalReason.DELETED. I don't think FAILURE is appropriate, because I assume we could retry after one of those?

Yes, it is possible that the failed shard gets reassigned to the same node after it failed. But in that case, we don't really know the cause of the failure, and I think it would be preferable to synchronously evict the cache. It makes sure that cached data are cleaned up so that retries will fetch them again from the source of truth (in case the cached data are the cause of the failure: if we were not evicting them, the shard would never have a chance to recover).

It goes against the purpose of this PR, but shard failures should be the exception, so I think keeping the synchronous eviction is OK for failures.

@nicktindall (Contributor Author) May 21, 2025

I have moved IndexRemovalReason to be a top-level enum in the org.elasticsearch.indices.cluster package. I wasn't sure if this was the best location for it, but I think it has the most meaning in the org.elasticsearch.indices.cluster.IndicesClusterStateService.AllocatedIndices interface where it came from, so I left it close to there.

I changed the logic to:

  • evict DELETED shards/indices asynchronously
  • evict FAILED shards/indices synchronously
  • leave everything else to age out of the cache

I'll investigate what testing might be appropriate.
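
For illustration, here is a sketch of that three-way decision as it might look in the listener; apart from the IndexRemovalReason constants, the names here are assumptions rather than the exact code merged in this PR.

// Sketch of the eviction policy described above: DELETED async, FAILURE sync, otherwise decay.
private void evictCacheEntries(ShardId shardId, Settings settings, IndexRemovalReason reason) {
    final var predicate = SearchableSnapshots.forceEvictPredicate(shardId, settings);
    switch (reason) {
        case DELETED ->
            // the shard will not be reassigned to this node: evict off the applier thread
            sharedBlobCacheService.forceEvictAsync(predicate);
        case FAILURE ->
            // the shard may be retried on this node: evict synchronously so potentially
            // invalid cache entries cannot be reused by, or race with, a recovery
            sharedBlobCacheService.forceEvict(predicate);
        default -> {
            // NO_LONGER_ASSIGNED, CLOSED, etc.: let cache entries age out of the cache
        }
    }
}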

@nicktindall nicktindall requested a review from a team as a code owner April 23, 2025 03:40

this.indexSettings,
shardPaths,
IndexRemovalReason.FAILURE
)
Contributor Author

This may be a mis-categorisation as FAILURE. The javadoc seems to suggest it's deleting remnants of a different shard rather than the shard being created, due to a name collision. So we're deleting not because the shard failed to start, but to clear old state from a shard that used to have the same name as the one being started.

Member

I think it's OK to use FAILURE, but maybe worth a comment?

Contributor Author

Actually, the more I look at it, the more I think it's vanilla enough to use without a comment. It's just clearing some bad state, which is the same as all the other cases. The fact that the bad state came from an earlier event is kind of irrelevant.

@nicktindall nicktindall requested a review from tlrx April 23, 2025 10:50
@tlrx (Member) left a comment

Sorry for the late review. I left a comment about evictions in case of shard failures; otherwise looks good.


@nicktindall nicktindall changed the title from "Evict from the shared blob cache asynchronously" to "Optimise shared-blob-cache evictions" May 23, 2025
@nicktindall nicktindall requested a review from tlrx May 23, 2025 09:02
@tlrx (Member) left a comment

LGTM

@nicktindall nicktindall merged commit 83fe2ed into elastic:main May 27, 2025
19 checks passed
@nicktindall nicktindall deleted the evict_from_the_shared_blob_cache_asynchronously branch May 27, 2025 00:01
nicktindall added a commit to nicktindall/elasticsearch that referenced this pull request May 27, 2025
Closes: ES-10744

Co-authored-by: Tanguy Leroux <[email protected]>

(cherry picked from commit 83fe2ed)
elasticsearchmachine pushed a commit that referenced this pull request May 28, 2025
* Optimise shared-blob-cache evictions (#126581)

Closes: ES-10744

Co-authored-by: Tanguy Leroux <[email protected]>

(cherry picked from commit 83fe2ed)

* Update docs/changelog/128539.yaml

* Fix changelogs

* Delete docs/changelog/128539.yaml

* Restore original changelog

Labels

:Distributed Indexing/Searchable Snapshots (Searchable snapshots / frozen indices), >enhancement, Team:Distributed Indexing (Meta label for Distributed Indexing team), v9.1.0
