Skip to content

Conversation

@mjmbischoff
Copy link
Contributor

@mjmbischoff mjmbischoff commented Jul 9, 2025

Chore: Extracting a Cache interface to allow exploring alternative implementations. Alternative implementations are outside the scope of this PR.

I know indirection is bad for performance, but the Cache/LRUCache already lists trade-offs, I expect things to get inlined quickly enough during runtime, so everything considered, I don't think this will 'do us in'.

Along the way found some minor bugs and issues

  • Renamed Cache to LRUCache and extracted an interface and put under the old name to allow other implementations to be tested.
  • Adjusted tests but made explicit the ones who rely on the LRUCache implementation to test if settings are applied correctly. (While CacheBuilder#build() returns a Cache it still always returns the LRUCache implementation)
  • Made Cache.CacheStats / Cache.Stats a record. Getters we're kept for now, should perhaps inline access in the future.
  • Updated the javadoc of the interface to provide enough guarantees for current usage while also lowering it to provide more flexibility for future alternative implementations
  • Notification of removal is documented in the javadoc, also described the situation with regards to registering a RemovalListener and how the interface doesn't mandate how, how many, and what the effects are if registered listener is swapped or if multiple listeners can be registered and what happens when they are added or removed.
  • Added a test to assert if implementations of Cache follow the contract

Context

The Cache/LRUCache is the only implementation of a cache within the ES code base, the implementation dates from 2015 when different priorities were being juggled. Being the only implementation part of the codebase, it's usage has grown over the years, for example: Enrich, Interference, Searchable snapshots. While the choices at the time were sound, as the external factors have changed, it makes sense to revisit these. As a first step PR which looks to extract an Interface and formalize the contract between implementation and consumer.

Somewhat complete list of use cases that use the cache:

  • Geoip ingest processor
  • User agent ingest processor
  • 'Query Caches' (BitsetFilterCache, IndicesRequestCache(shard level), IndicesFieldDataCache) (NOT LRUQueryCache as it is supplied / managed as part of Lucene)
  • Script cache
  • Security(dls, field permissions, Role-privileges, automata, idp / saml, APIkey, LDAP, etc)
  • Enrich cache
  • Interference (Amazon SageMaker)
  • ML (Job auditing)
  • searchable snapshots

NOTES (ramblings of the author, for context, can be ignored):

The wording of javadoc of the old Cache/LRUCache might have sounded 'overly scary/dangerous' leading to some interesting usage/interaction

As already pointed out in comments org.elasticsearch.xpack.core.security.support.CacheIteratorHelper, it's a little wonk. I'm also unsure if the extra protection is needed. Regardless the design of that class to leak the locks and not pull up interaction and fully decorate a cache a la Collections.unmodifiableXXX(XXX) is... 'not (personally) preferred'. But this would be a separate PR.

A computed weight is being tracked as long, which can overflow. Always calculating it on the fly or calculating it on the fly after detecting overflow is an option. Also wondering about thread safety as we're incrementing and something like an AtomicLong is needed. 'Best-effort' :-) Changing the LRUCache impl is out of scope of this PR, so left alone.


For the reviewer:

old Cache was refactored to LRUCache, please diff these, there shouldn't be any changes beyond the name and pulling the shared inner classes out. Github shows a diff between the old Cache and the interface that is now created at the old coordinate. It seems that context was lost which makes it appear like a much bigger PR than it is.

Similar with CacheTests and LRUCacheTests. (CacheTests is completely new)

…nder the old name to allow other implementations to be tested.

- Found and fixed minor bug where we tested max weight against count() instead of weight
- Found and fixed minor issue with possible long overflow.
- Adjusted tests but made explicit the ones who rely on the `LRUCache` implementation to test if settings are applied correctly.
While CacheBuilder#build() returns a `Cache` it still always returns the `LRUCache` implementation
@elasticsearchmachine elasticsearchmachine added v9.2.0 external-contributor Pull request authored by a developer outside the Elasticsearch team labels Jul 10, 2025
@mjmbischoff mjmbischoff marked this pull request as ready for review July 14, 2025 14:24
@elasticsearchmachine elasticsearchmachine added the needs:triage Requires assignment of a team area label label Jul 14, 2025
@mjmbischoff mjmbischoff self-assigned this Jul 15, 2025
@ywangd ywangd removed their request for review July 21, 2025 00:35
@ywangd
Copy link
Member

ywangd commented Jul 21, 2025

Not sure about the context of this change. The code is owned by @elastic/es-core-infra, so I took myself off the reviewers.

@gbanasiak gbanasiak added the :Core/Infra/Core Core issues without another label label Jul 21, 2025
@elasticsearchmachine elasticsearchmachine added Team:Core/Infra Meta label for core/infra team and removed needs:triage Requires assignment of a team area label labels Jul 21, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

@mjmbischoff mjmbischoff removed the serverless-linked Added by automation, don't add manually label Jul 23, 2025
@nielsbauman nielsbauman removed their request for review July 28, 2025 14:29
@mjmbischoff mjmbischoff requested a review from rjernst July 31, 2025 16:36
Copy link
Member

@rjernst rjernst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The javadoc for Cache notes that the interface is still tightly coupled with LRUCache. I wonder if it would be cleaner to do a simple rename of Cache to LRUCache, then (in a separate change) work on defining a (decoupled) interface, and migrating uses of LRUCache to it as necessary.

@mjmbischoff
Copy link
Contributor Author

@rjernst

The javadoc for Cache notes that the interface is still tightly coupled with LRUCache. I wonder if it would be cleaner to do a simple rename of Cache to LRUCache, then (in a separate change) work on defining a (decoupled) interface, and migrating uses of LRUCache to it as necessary.

Good point, I think I changed my stance on that but in that case I should update the docs, let's park that for a moment; could still be a good idea to slim down the PR. However if we split the refactor with the interface extraction it will mean that all consumers (all places where the class is used) need to be updated. That would mean the PR would touch many more files. Keeping the action together actually reduces the size, in my opinion.

So I think I was able to remove the problematic methods as they weren't used in the code base, or only from the class testing the LRUCache. I was also able to finish my alternative implementation and plug it in, so I guess by that simple fact that statement isn't true as far as the Interface goes.

The PR looks bigger as it doesn't see the rename separately, but LRUCache is unchanged apart from the class rename.
Same for the associated tests.

The CacheBuilder always returns a LRUCache as before, I adjusted the tests to reflect this. A followup PR introducing a new implementation should address this or even two separate PR's

I would suggest removing that remark if it's ok. Let me check and update the javadoc for the whole interface again.

mjmbischoff and others added 2 commits August 8, 2025 14:24
removing statement about coupling that is no longer accurate.
@mjmbischoff
Copy link
Contributor Author

@rjernst All green again, Let me know if the above is alright

@mjmbischoff mjmbischoff added the auto-backport Automatically create backport pull requests when merged label Aug 18, 2025
@mjmbischoff mjmbischoff requested a review from rjernst August 18, 2025 21:50
* An LRU sequencing of the values in the cache. This sequence is not protected from mutations
* to the cache (except for {@link Iterator#remove()}. The result of iteration under any other mutation is
* undefined.
* An Iterable that allows to transverse all keys in the cache. Modifications might be visible and no guarantee is made on ordering of
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

keys -> values

public long weight() {
return weight;
}
long weight();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems implementation specific? The meaning of weight seems opaque, and isn't necessarily representable by a long value. Why does it even need to be retrieved given there is no setter on the interface?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like with the forEach method, the method exists. Getting rid of it would require changing the call side and expanding the PR.

In it's current use it's implied that this is bytes. sometimes heap, sometimes of heap, sometimes it's only the payload, sometimes it covers the whole entry. Because the weigher is external, the unit is invariant: We ask the weigher for a number for each entry and check if it's more than a max. Could do a lot here but not without changing the call sites and increasing the PR size significantly.

* An Iterable that allows to transverse all keys in the cache. Modifications might be visible and no guarantee is made on ordering of
* these modifications as new entries might end up in the already transversed part of the cache. So while transversing; if A and B are
* added to the cache, and in that order, we might only observe B. Implementations should allow the cache to be modified while
* transversing but only {@link Iterator#remove()} is guaranteed to not affect further transversal.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

transversing -> traversing

* undefined.
* An Iterable that allows to transverse all keys in the cache. Modifications might be visible and no guarantee is made on ordering of
* these modifications as new entries might end up in the already transversed part of the cache. So while transversing; if A and B are
* added to the cache, and in that order, we might only observe B. Implementations should allow the cache to be modified while
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a lot of notes about "implementations should..." in these javadocs. If there are restrictions on implementations, it should be reflected in the API. If implementations want to add restrictions, it should be in their own javadocs. But these notes detract from defining what the API actually guarantees, or doesn't.

As examples, see the javadocs for Collection.iterator() and Map.values():
https://docs.oracle.com/javase/8/docs/api/java/util/Collection.html#iterator--
https://docs.oracle.com/javase/8/docs/api/java/util/Map.html#values--

private final long hits;
private final long misses;
private final long evictions;
void forEach(BiConsumer<Key, Value> consumer);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this in the existing API? Why is it necessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I didn't introduce any new methods: all are extracted from the current implementation and then filtered for use in the codebase.

I think I have the same feels, but I didn't want to increase the size of the PR further by adjusting the call side. This would be something for a followup PR, to unify and have a single style to iterate over data inside the cache,

public class CacheTests extends ESTestCase {
private int numberOfEntries;

private List<CacheSupplier<String, String>> knownImplementations;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests seem to still be testing the LRU implementation. Presumably the point should be to test the interface and any default methods. I think a mock implementation in this test suite would better test that.

public class CacheTests extends ESTestCase {
private int numberOfEntries;

private List<CacheSupplier<String, String>> knownImplementations;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests seem to still be testing the LRU implementation. Presumably the point should be to test the interface and any default methods. I think a mock implementation in this test suite would better test that.

@mjmbischoff mjmbischoff marked this pull request as draft September 4, 2025 22:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto-backport Automatically create backport pull requests when merged :Core/Infra/Core Core issues without another label external-contributor Pull request authored by a developer outside the Elasticsearch team >refactoring Team:Core/Infra Meta label for core/infra team >tech debt v9.3.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants