Optimise shared-blob-cache evictions #126581
Conversation
Pinging @elastic/es-distributed-indexing (Team:Distributed Indexing)
Hi @nicktindall, I've created a changelog YAML for you.
henningandersen
left a comment
Looks good, will leave actual approval to Tanguy.
I think we could maybe do the spawn further out to avoid too many tasks - but it may not really be helpful.
I guess this does not address contention on the CacheService evictions - but we can see if we need to address that too.
tlrx
left a comment
I left a comment about the (lack of) reasons to evict cache entries for partially mounted shards. I do like the new forceEvictAsync method; it might be useful in other places too.
final SharedBlobCacheService<CacheKey> sharedBlobCacheService =
    SearchableSnapshotIndexFoldersDeletionListener.this.frozenCacheServiceSupplier.get();
assert sharedBlobCacheService != null : "frozen cache service not initialized";
sharedBlobCacheService.forceEvictAsync(SearchableSnapshots.forceEvictPredicate(shardId, indexSettings.getSettings()));
I'm afraid I don't remember all the decisions around evicting cache entries for partially mounted shards 😞
I suspect that we made it this way to allow cache regions to be reused sooner without waiting for them to decay. It was also useful at the beginning when some corrupted data got written in the cache for some reason, as the forced eviction would clear the mess for us.
But besides this, for partially mounted shards I don't see much reason to force the eviction of cache entries vs. just letting them expire in cache. And if the shard is quickly relocated then reassigned to the same node, I think there is a risk that the async force eviction now runs concurrently with a shard recovery?
So maybe we could only force-evict asynchronously when the shard is deleted or failed, and let cache entries in cache if it's no longer assigned.
Thanks! That sounds reasonable. It's easy to implement in SearchableSnapshotIndexEventListener#beforeIndexRemoved because we have the IndexRemovalReason. It's a bit trickier in SearchableSnapshotIndexFoldersDeletionListener#before(Index|Shard)FoldersDeleted because we lack that context; I'll trace back to where those deletions originate to see if there's an obvious way to propagate it.
I had a go at propagating the reason for the deletion to the listeners. This allows the listener to trigger the asynchronous eviction when we know the shards/indices aren't coming back (i.e. only on DELETE). It meant changes in a few places.
- I used the IndexRemovalReason to communicate the reason for deletion. I don't like borrowing that from an unrelated interface, but we did already have it in scope in some of these places. If we think it's right to use it, I could break it out to be a top-level enum rather than being under IndicesClusterStateService.AllocatedIndices.IndexRemovalReason.
- There are some places that now take a reasonText and an IndexRemovalReason (a rough sketch of that shape is shown below). We could get rid of the reason text if we don't feel it's adding anything, but it would mean some log messages would change. It sometimes seems to offer more context: for example, the text is different when the IndexService executes a pending delete vs. when it succeeds on the first attempt, and "delete unassigned index" specifies that the index is being deleted despite not being assigned to the local node.
- I think the only time it's safe to schedule an asynchronous delete is on an IndexRemovalReason.DELETE. I don't think FAILURE is appropriate, because I assume we could retry after one of those? I don't have the context to make this call, I don't think.
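For illustration only, here is a hypothetical sketch of the shape described in the second bullet above: a deletion listener that receives both the human-readable reason text and the structured IndexRemovalReason. The interface name, method names, and parameters are assumptions made for this example, not the exact signatures changed in this PR.
import java.nio.file.Path;
import org.elasticsearch.index.Index;
import org.elasticsearch.index.IndexSettings;
import org.elasticsearch.index.shard.ShardId;
import org.elasticsearch.indices.cluster.IndexRemovalReason; // assumed location of the enum, per the later comments

// Hypothetical listener shape carrying both the free-text reason and the enum.
public interface HypotheticalIndexFoldersDeletionListener {

    // Called before the folders of a whole index are deleted.
    void beforeIndexFoldersDeleted(Index index, IndexSettings indexSettings, Path[] indexPaths, String reasonText, IndexRemovalReason reason);

    // Called before the folders of a single shard are deleted.
    void beforeShardFoldersDeleted(ShardId shardId, IndexSettings indexSettings, Path[] shardPaths, String reasonText, IndexRemovalReason reason);
}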
Thanks Nick.
- If we think it's right to use it I could break it out to be a top-level enum rather than being under IndicesClusterStateService.AllocatedIndices.IndexRemovalReason.
That makes sense.
- There are some places that now take a reasonText and an IndexRemovalReason.
Thanks for having kept the reason as text. It provides a bit more context and people are also used to searching for them in logs.
- I think the only time it's safe to schedule an asynchronous delete is on an IndexRemovalReason.DELETE. I don't think FAILURE is appropriate, because I assume we could retry after one of those?
Yes, it is possible that the failed shard gets reassigned to the same node after it failed. But in that case we don't really know the cause of the failure, and it would be preferable to synchronously evict the cache, I think. It makes sure that cached data are cleaned up so that retries will fetch them again from the source of truth (in the case where the cached data are the cause of the failure, if we were not evicting them then the shard would never have a chance to recover).
It goes against the purpose of this PR, but shard failures should be the exception, so I think keeping the synchronous eviction is OK for failures.
I have moved IndexRemovalReason to be a top-level enum in the org.elasticsearch.indices.cluster package. I wasn't sure if this was the best location for it, but I think it has most meaning in the org.elasticsearch.indices.cluster.IndicesClusterStateService.AllocatedIndices interface where it came from, so I left it close to there.
I changed the logic to:
- evict DELETED shards/indices asynchronously
- evict FAILED shards/indices synchronously
- leave everything else to age out of the cache
I'll investigate what testing might be appropriate (see the sketch of this branching below).
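To make the branching above concrete, here is a rough sketch under assumed names; it is not the literal code in this PR, though forceEvictAsync, forceEvict and SearchableSnapshots.forceEvictPredicate all appear elsewhere in the conversation. The method name and surrounding wiring are illustrative.
// Sketch only: how a folders-deletion listener could branch on the removal reason.
private void evictForRemovedShard(ShardId shardId, IndexSettings indexSettings, IndexRemovalReason reason) {
    final SharedBlobCacheService<CacheKey> sharedBlobCacheService = frozenCacheServiceSupplier.get();
    assert sharedBlobCacheService != null : "frozen cache service not initialized";
    switch (reason) {
        // DELETED: the shard will not come back to this node, so the eviction can safely
        // happen in the background, off the cluster applier thread.
        case DELETED -> sharedBlobCacheService.forceEvictAsync(
            SearchableSnapshots.forceEvictPredicate(shardId, indexSettings.getSettings())
        );
        // FAILURE: the shard may be retried on this node, so evict synchronously to make
        // sure a retry cannot read potentially invalid cached data.
        case FAILURE -> sharedBlobCacheService.forceEvict(
            SearchableSnapshots.forceEvictPredicate(shardId, indexSettings.getSettings())
        );
        // Anything else (e.g. the shard is simply no longer assigned here): do nothing and
        // let the cache entries age out on their own.
        default -> { }
    }
}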
…ache/shared/SharedBlobCacheService.java Co-authored-by: Tanguy Leroux <[email protected]>
…_blob_cache_asynchronously
…_blob_cache_asynchronously
this.indexSettings,
shardPaths,
IndexRemovalReason.FAILURE
)
This may be a mis-categorisation as FAILURE. The javadoc seems to suggest it's deleting remnants of a different shard rather than the shard being created, due to a name collision. So we're deleting not because the shard failed to start, but to clear old state from a shard that used to have the same name as the one being started.
I think it's OK to use FAILURE, but maybe worth a comment?
Actually, the more I look at it, the more I think it's vanilla enough to use without a comment. It's just clearing some bad state, which is the same as all the other cases. The fact that the bad state came from an earlier event is kind of irrelevant.
tlrx
left a comment
Sorry for the late review. I left a comment about evictions in case of shard failures; otherwise looks good.
…_blob_cache_asynchronously
tlrx
left a comment
LGTM
Closes: ES-10744 Co-authored-by: Tanguy Leroux <[email protected]> (cherry picked from commit 83fe2ed)
* Optimise shared-blob-cache evictions (#126581) Closes: ES-10744 Co-authored-by: Tanguy Leroux <[email protected]> (cherry picked from commit 83fe2ed)
* Update docs/changelog/128539.yaml
* Fix changelogs
* Delete docs/changelog/128539.yaml
* Restore original changelog
This PR:
- Uses a ThrottledTaskRunner to optionally execute the shared blob cache evictions asynchronously
- SearchableSnapshotIndexEventListener and SearchableSnapshotIndexFoldersDeletionListener now perform evictions according to the following logic:
  - For DELETED shards that we know won't be re-assigned to the node, we schedule asynchronous eviction
  - For FAILED shards that may be re-assigned, but for which we would like to clear potentially invalid state, we retain synchronous evictions
I don't know if there's anything to be gained by moving evictions from the CacheService off the applier thread any more than they already are. I have concerns about making that more asynchronous than it is, because in CacheService there is a method called waitForCacheFilesEvictionIfNeeded which takes the shardsEvictionMutex and blocks until any pending shard evictions for the specified shard are completed. It uses the pendingShardsEvictions map to know if there are any pending evictions. If we add another layer of asynchrony, we will potentially be adding a "shadow" queue of evictions that this method doesn't know about. I wonder if that might break things.
If there are performance issues with CacheService evictions, I think we'd be better off optimising the enqueueing and processing of evictions there. Some ideas for that include:
- There is contention on the shardsEvictionsMutex between the evicting threads on generic and the calls to markShardAsEvictedInCache. I believe there are ways to reduce that.
- The evictions could be run through a ThrottledTaskRunner, which might reduce the contention enough to make markShardAsEvictedInCache faster.
Relates: ES-10744
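As a companion illustration, here is a minimal sketch of how a forceEvictAsync method can be layered over ThrottledTaskRunner. The wrapper class, field names, and error handling are assumptions made for the example; the actual wiring inside SharedBlobCacheService in this PR may differ.
import java.util.function.Predicate;
import org.elasticsearch.action.ActionListener;
import org.elasticsearch.common.util.concurrent.ThrottledTaskRunner;

// Hypothetical stand-in for the cache service; only the async-eviction path is sketched.
class AsyncEvictionSketch<KeyType> {

    // Limits how many eviction tasks run concurrently on the executor it wraps.
    private final ThrottledTaskRunner evictionRunner;

    AsyncEvictionSketch(ThrottledTaskRunner evictionRunner) {
        this.evictionRunner = evictionRunner;
    }

    // Schedules the potentially slow eviction off the calling (e.g. cluster applier) thread.
    void forceEvictAsync(Predicate<KeyType> predicate) {
        evictionRunner.enqueueTask(ActionListener.wrap(releasable -> {
            try (releasable) {
                forceEvict(predicate); // the existing synchronous eviction
            }
        }, e -> { /* best effort: a real implementation would log the failure */ }));
    }

    // Placeholder for the existing synchronous eviction over matching cache entries.
    void forceEvict(Predicate<KeyType> predicate) {}
}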