@joegallo joegallo commented Aug 27, 2025

For heavy users of Document Level Security (DLS), where the entire DLS bitset cache cannot be held in memory at once, we can see a high degree of cache churn. Because of the locking that keeps the DocumentSubsetBitsetCache's bitsetCache and keysByIndex structures in sync, we end up with an extreme level of lock contention when entries are evicted from the cache (which happens frequently when the cache is churning).

The purpose of the keysByIndex data structure is to allow us to proactively evict entries from the cache in the event that their associated segment becomes inaccessible (e.g. because of a segment merge, or if an index is closed or deleted).

By removing the locking around the updates to the bitsetCache and keysByIndex structures, we get significantly improved throughput, but it becomes possible (though the chances are quite small) that we will no longer have an entry in the keysByIndex structure for a segment that is still open and for which there is an entry in the bitsetCache. As a consequence, if that segment later becomes inaccessible, we will not proactively remove its entry from the cache. This is not a true memory leak, however, as the maximum size and TTL policies of the cache still apply, and the entry will eventually be removed from the cache.
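
To illustrate the shape of the change, here is a simplified, self-contained sketch of the pattern described above. It is not the actual DocumentSubsetBitsetCache code: only the bitsetCache and keysByIndex names come from this PR, and every other type and method name is a stand-in.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Simplified model of the pattern described above; not the real implementation.
// "IndexKey" stands in for the per-segment cache key, and "BitsetKey"/"CachedBitset"
// stand in for the real Lucene/Elasticsearch types.
class DualStructureCacheSketch<IndexKey, BitsetKey, CachedBitset> {

    private final Map<BitsetKey, CachedBitset> bitsetCache = new ConcurrentHashMap<>();
    private final Map<IndexKey, Set<BitsetKey>> keysByIndex = new ConcurrentHashMap<>();
    private final ReentrantLock cacheModificationLock = new ReentrantLock();

    // Before: a shared lock keeps the two structures perfectly in sync, but every
    // insertion and every eviction contends on the same lock.
    void putLocked(IndexKey indexKey, BitsetKey bitsetKey, CachedBitset bitset) {
        cacheModificationLock.lock();
        try {
            bitsetCache.put(bitsetKey, bitset);
            keysByIndex.computeIfAbsent(indexKey, k -> ConcurrentHashMap.newKeySet()).add(bitsetKey);
        } finally {
            cacheModificationLock.unlock();
        }
    }

    // After: each structure is individually thread-safe and is updated without a
    // shared lock. A rare interleaving with a concurrent eviction can leave a
    // bitsetCache entry that keysByIndex no longer references, so a later segment
    // close may not proactively evict it, but size/TTL eviction still reclaims it.
    void putUnlocked(IndexKey indexKey, BitsetKey bitsetKey, CachedBitset bitset) {
        keysByIndex.computeIfAbsent(indexKey, k -> ConcurrentHashMap.newKeySet()).add(bitsetKey);
        bitsetCache.put(bitsetKey, bitset);
    }
}
```

The "after" variant trades a perfectly consistent view of the two structures for uncontended updates, which is exactly the trade-off described above.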

I've created a small benchmark that indexes ~10 million documents across 8 indices and then runs a selection of searches and aggregations against those indices via 63 user accounts associated with different DLS role queries. If all of the resulting bitsets were in the cache at the same time, they would occupy approximately 64 MB, but the cache is limited to 48 MB during the benchmark run.
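
For reference, that 48 MB limit corresponds to the DLS bitset cache size setting. The snippet below is only a sketch of how such a limit can be expressed, and is an assumption about how the benchmark was configured rather than anything contained in this PR.

```java
import org.elasticsearch.common.settings.Settings;

public class DlsCacheSettingSketch {
    public static void main(String[] args) {
        // Assumed configuration for a 48 MB DLS bitset cache; the equivalent
        // elasticsearch.yml entry would be:
        //   xpack.security.dls.bitset.cache.size: 48mb
        Settings nodeSettings = Settings.builder()
            .put("xpack.security.dls.bitset.cache.size", "48mb")
            .build();
        System.out.println(nodeSettings.get("xpack.security.dls.bitset.cache.size"));
    }
}
```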

On main without these changes, I see the following results from the benchmark:

| Metric | Task | Value | Unit |
|---|---|---:|---|
| Mean Throughput | dls-search | 76.79 | ops/s |
| Median Throughput | dls-search | 76.84 | ops/s |
| 50th percentile latency | dls-search | 1604.59 | ms |
| 90th percentile latency | dls-search | 2624.26 | ms |
| 99th percentile latency | dls-search | 3547.23 | ms |

And with these changes (approximately 15x better throughput and latency):

| Metric | Task | Value | Unit |
|---|---|---:|---|
| Mean Throughput | dls-search | 1169.86 | ops/s |
| Median Throughput | dls-search | 1170.12 | ops/s |
| 50th percentile latency | dls-search | 93.6846 | ms |
| 90th percentile latency | dls-search | 167.648 | ms |
| 99th percentile latency | dls-search | 269.456 | ms |

Because sufficiently poor performance can be characterized as a bug, I'm labeling this PR as a >bug and I intend to backport it to all the relevant branches.

Note: because I've removed the bitsetCache.get call from the onCacheEviction method, this PR happens to fix #132842 (DocumentSubsetBitsetCache eviction increments misses statistic).
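
To see why, here is a hypothetical, self-contained model of the statistics problem (it does not use the real Elasticsearch cache API): calling get from inside the eviction callback is an unsuccessful lookup, since the entry is already gone, so it inflates the miss counter; using the value already carried by the removal notification avoids the lookup entirely.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model of the statistics problem; not the actual Elasticsearch
// cache API. The point: a get() issued from the eviction callback always
// misses, because the entry has already been removed.
class EvictionStatsSketch {

    record RemovalNotification(String key, Object value) {}

    private final Map<String, Object> entries = new ConcurrentHashMap<>();
    private long misses = 0;

    Object get(String key) {
        Object value = entries.get(key);
        if (value == null) {
            misses++; // every unsuccessful lookup is counted as a miss
        }
        return value;
    }

    long misses() {
        return misses;
    }

    void evict(String key) {
        Object removed = entries.remove(key);
        if (removed != null) {
            onCacheEviction(new RemovalNotification(key, removed));
        }
    }

    void onCacheEviction(RemovalNotification notification) {
        // Before: re-reading the evicted entry through get() always misses and
        // skews the statistics:
        //   Object value = get(notification.key());
        // After: use the value the notification already carries.
        Object value = notification.value();
        // ... use 'value' for eviction bookkeeping (e.g. tidying keysByIndex) ...
    }
}
```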

- This is a WIP commit, in that these locks will be going away entirely, but I want the 'ignored' name to be available.
- The code is simpler this way, and doesn't require an allocation.
- If the set has been emptied in an onCacheEviction call, then remove it from the map (see the sketch below).
- Since we're tidying the map as we go, it's harder to reason about whether the error is that the set is null versus whether the set doesn't contain one particular entry, so treat those conditions as being the same.
- This test hits the race condition described in `onClose` a little less than once in a thousand runs on my machine, so we can't check for the same level of strict internal consistency between the two data structures (it's possible for the cache to contain a bitset that isn't referenced by the keysByIndex structure, and that's okay).
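
A rough sketch of the tidy-as-we-go behavior described in the notes above (the names and types here are assumptions; the real DocumentSubsetBitsetCache code differs in detail):

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of tidying keysByIndex as entries are evicted: drop a per-index set
// once it becomes empty. Names and types are assumptions, not the real code.
class KeysByIndexTidySketch {

    private final Map<String, Set<String>> keysByIndex = new ConcurrentHashMap<>();

    void onCacheEviction(String indexKey, String cacheKey) {
        // compute() runs atomically per map entry, so concurrent evictions for
        // the same index don't lose each other's updates.
        keysByIndex.compute(indexKey, (ignored, keys) -> {
            if (keys == null) {
                // Either the set was never created or it was already tidied
                // away; with tidy-as-we-go those two situations are
                // indistinguishable, so they're treated the same.
                return null;
            }
            keys.remove(cacheKey);
            return keys.isEmpty() ? null : keys; // returning null removes the map entry
        });
    }
}
```

Because compute runs atomically per key, a set is dropped from the map exactly when its last key is removed, and a missing set is indistinguishable from one that was just tidied away, which is why those two conditions are treated identically.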
@joegallo joegallo added labels: >bug, :Security/Authorization (Roles, Privileges, DLS/FLS, RBAC/ABAC), Team:Security (Meta label for security team), auto-backport (Automatically create backport pull requests when merged), v9.2.0, v9.1.4, v9.0.7, v8.18.7, v8.19.4 (Aug 27, 2025)
@elasticsearchmachine
Collaborator

Pinging @elastic/es-security (Team:Security)

@elasticsearchmachine
Collaborator

Hi @joegallo, I've created a changelog YAML for you.

@joegallo joegallo requested a review from tvernum August 27, 2025 19:39
@elasticsearchmachine
Collaborator

Hi @joegallo, I've updated the changelog YAML for you.

@tvernum tvernum left a comment

LGTM, thanks for all the work on this.

It's amazing when a substantial investment in time leads to such an improvement simply by removing code.

@szybia szybia left a comment

lgtm!

@joegallo joegallo merged commit 98a73ce into elastic:main Aug 28, 2025
39 checks passed
@joegallo joegallo deleted the remove-bitsetcache-locking branch August 28, 2025 10:18
@elasticsearchmachine
Collaborator

💔 Backport failed

| Branch | Result |
|---|---|
| 9.1 | Commit could not be cherrypicked due to conflicts |
| 9.0 | Commit could not be cherrypicked due to conflicts |
| 8.18 | Commit could not be cherrypicked due to conflicts |
| 8.19 | Commit could not be cherrypicked due to conflicts |

You can use sqren/backport to manually backport by running `backport --upstream elastic/elasticsearch --pr 133681`.

@joegallo
Contributor Author

#133707 will be autobackported to 8.18 and 8.19, so between that and #133705 we should be all set on the backports.

