Remove DocumentSubsetBitsetCache locking #133681
Conversation
This is a WIP commit, in that these locks will be going away entirely, but I want the 'ignored' name to be available.
The code is simpler this way, and doesn't require an allocation.
If the set has been emptied in an `onCacheEviction` call, then remove it from the map.
Since we're tidying the map as we go, it's harder to reason about whether the set is null or merely missing one particular entry, so treat those conditions as the same.
This test hits the race condition described in `onClose` a little less than once in a thousand runs on my machine, so we can't check for the same level of strict internal consistency between the two data structures (it's possible for the cache to contain a bitset that isn't referenced by the keysByIndex structure, and that's okay).
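For illustration, here is a minimal sketch of that tidying under simplified assumptions: `IndexKey` and `BitsetKey` are hypothetical stand-in types, and this is not the actual `DocumentSubsetBitsetCache` code. It shows how `ConcurrentHashMap.computeIfPresent` can remove the per-index set once it empties, without any explicit lock:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// A sketch only, not the Elasticsearch implementation; the key types are
// hypothetical stand-ins used to illustrate lock-free tidying on eviction.
class EvictionTidyingSketch<IndexKey, BitsetKey> {

    private final ConcurrentHashMap<IndexKey, Set<BitsetKey>> keysByIndex = new ConcurrentHashMap<>();

    void onCacheEviction(IndexKey index, BitsetKey evictedKey) {
        keysByIndex.computeIfPresent(index, (ignored, keys) -> {
            keys.remove(evictedKey);
            // If this eviction emptied the set, returning null drops the whole
            // entry from the map. A missing entry and a set that doesn't contain
            // this particular key are treated the same: nothing left to tidy.
            return keys.isEmpty() ? null : keys;
        });
    }
}
```

Returning null from the remapping function is what removes the map entry, so the "empty set" and "no set" states collapse into a single condition.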
Pinging @elastic/es-security (Team:Security)
Hi @joegallo, I've created a changelog YAML for you.
Hi @joegallo, I've updated the changelog YAML for you.
LGTM, thanks for all the work on this.
It's amazing when a substantial investment in time leads to such an improvement simply by removing code.
lgtm!
💔 Backport failed
You can use sqren/backport to manually backport by running
For heavy users of Document Level Security (DLS), where the entire DLS bitset cache cannot be held in memory at once, we can experience a high degree of cache churn. Due to the locking that keeps the `DocumentSubsetBitsetCache`'s `bitsetCache` and `keysByIndex` in sync, we end up seeing an extreme level of lock contention when entries are evicted from the cache (as they are, frequently, if the cache is churning). The purpose of the `keysByIndex` data structure is to allow us to proactively evict entries from the cache in the event that their associated segment becomes inaccessible (e.g. because of a segment merge, or because an index is closed or deleted).

By removing the locking around the updates to the `bitsetCache` and `keysByIndex` structures, we get significantly improved throughput, but it becomes possible (though the chances are quite small) that we will no longer have an entry in the `keysByIndex` structure for an index that is open and for which there is an entry in the `bitsetCache`. As a consequence, if that segment later becomes inaccessible, we will not proactively remove the entry from the cache. This is not a true memory leak, however, as the maximum size and TTL policies of the cache still apply, and the entry will be removed from the cache eventually.

I've created a small benchmark that indexes ~10 million documents across 8 indices and then runs a selection of searches and aggregations against those indices via 63 user accounts associated with different DLS role queries. If all of the data were in the cache at the same time, it would occupy approximately 64mb, but the cache is limited to 48mb during the benchmark run.
On `main`, without these changes, I see the following results from the benchmark:

And with these changes (approximately 15x better throughput and latency):
Because sufficiently poor performance can be characterized as a bug, I'm labeling this PR as a `>bug` and I intend to backport it to all the relevant branches.

Note: because I've removed the `bitsetCache.get` call from the `onCacheEviction` method, this PR happens to fix #132842.
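To make the relationship between the two structures concrete, here is a minimal sketch under simplified assumptions: the `SegmentKey`/`BitsetKey` types and the tiny `Cache` interface are hypothetical, and this is not the real `DocumentSubsetBitsetCache`. It only illustrates why `keysByIndex` exists and where the (benign) race can occur once the locks are gone:

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, simplified types throughout; a sketch of the idea, not the real code.
class BitsetCacheSketch<SegmentKey, BitsetKey, BitSetT> {

    // Stand-in for the size- and TTL-bounded cache that actually holds the bitsets.
    interface Cache<K, V> {
        void put(K key, V value);
        void invalidate(K key);
    }

    private final Cache<BitsetKey, BitSetT> bitsetCache;
    // Reverse index from segment to the cache keys built against it, so entries can
    // be evicted proactively when the segment becomes inaccessible.
    private final Map<SegmentKey, Set<BitsetKey>> keysByIndex = new ConcurrentHashMap<>();

    BitsetCacheSketch(Cache<BitsetKey, BitSetT> bitsetCache) {
        this.bitsetCache = bitsetCache;
    }

    void addEntry(SegmentKey segment, BitsetKey key, BitSetT bitset) {
        bitsetCache.put(key, bitset);
        // Without a lock spanning both structures, this update can interleave with
        // onClose: in rare cases the cache may hold an entry whose key is missing
        // from keysByIndex. That entry is no longer evicted proactively, but the
        // cache's maximum-size and TTL policies still remove it eventually.
        keysByIndex.computeIfAbsent(segment, ignored -> ConcurrentHashMap.newKeySet()).add(key);
    }

    void onClose(SegmentKey segment) {
        // Proactive eviction: the segment became inaccessible (merged away, or its
        // index was closed or deleted), so drop every bitset cached against it.
        Set<BitsetKey> keys = keysByIndex.remove(segment);
        if (keys != null) {
            keys.forEach(bitsetCache::invalidate);
        }
    }
}
```

The trade-off is deliberate: a rare, bounded inconsistency between the two structures in exchange for removing the lock that every cache update and eviction previously contended on.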