-
Notifications
You must be signed in to change notification settings - Fork 0
LRUStoreCache #1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've left a first round of comments - hopefully enough to start iterating on. I haven't yet reviewed the implementation or tests, and will do that next.
Co-authored-by: David Stansby <[email protected]>
pyproject.toml
Outdated
"ignore:Unclosed client session <aiohttp.client.ClientSession.*:ResourceWarning" | ||
] | ||
markers = [ | ||
"asyncio: mark test as asyncio test", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you need to add this to stop some tests failing (or pytest in general failing)? It seems unrelated to the tests you added, so I'm a bit surprised.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can remove it for now, but yes I think there were some failing tests within the github actions when I didn't include it. Tests pass locally fine without it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've reviewd the implementation, and have left (lots of!) comments. I get the impression that you might have taken code from v2 of zarr-python and added code to work with v3 of zarr-python, but then haven't removed a lot of the code that is no longer relevant for v3 zarr-python. For example, there are lots of hasattr()
checks that will either always return False
or True
if you look at the definition of the Store
abstract class, and there are lots of methods you've defined that duplicate functionality in the async
methods.
I think my comments inline cover most of what needs changing/updating, but it would be good if you could cross check what you've written against some other v3 stores to make sure the code structure and methods implemented are similar.
Co-authored-by: David Stansby <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some general comments:
- There are several
if hasattr(...)
blocks that are redundant, because all stores define those attributes. You can check which properties/attributes/methods all stores have at https://zarr.readthedocs.io/en/stable/api/zarr/abc/store/index.html#zarr.abc.store.Store. These redudnant blocks should be simplified. I started commenting on these, but there's enough that I stopped, so you'll have to find them all. - There are several branches that claim to handle "dict-like" objects, but
self._store
is always a store. So these branches can be removed. I started commenting on these, but there's enough that I stopped, so you'll have to find them all. - There are a few methods that are redundant or duplicate functionality, and should be removed (see inline comments)
src/zarr/storage/_cache.py
Outdated
Extract directory listing from store keys by filtering keys that start with the given path. | ||
""" | ||
children: set[str] = set() | ||
# Handle both Store objects and dict-like objects |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are you handling both Store and dict-like objects, when the type of store in the signature only allows Store objects?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove handling of both types of objects.
src/zarr/storage/_cache.py
Outdated
self._keys_cache = [] | ||
return self._keys_cache | ||
|
||
def listdir(self, path: Path) -> list[str]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This whole method can be deleted, because it's made redundant by list_dir()
, which is the method that the Store
base class says has to be implemented.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deleted this.
src/zarr/storage/_cache.py
Outdated
return sorted(children) | ||
|
||
|
||
def listdir(store: Store, path: Path | None = None) -> list[str]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This whole function can be deleted, because it's only used in listdir()
below, which can also be delted (see other comment below).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deleted this.
src/zarr/storage/_cache.py
Outdated
supports_writes: bool = True | ||
supports_deletes: bool = True | ||
supports_partial_writes: bool = True | ||
supports_listing: bool = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this will not always be True
, because it depends on if the underlying store supports listing. In fact, I think all these properties will depend on the underlying store. So they should be re-implemented as properties that return the values from the underlying store.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made into properties that use underlying store.
src/zarr/storage/_cache.py
Outdated
self._contains_cache.clear() | ||
self._listdir_cache.clear() | ||
|
||
def _invalidate_keys_unsafe(self) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of defining this and invalidate_keys()
and duplicating code, I would keep just invalidate_keys()
, and when you need to call it just always call it outsdie a with self._mutex:
block (because it handles the mutex iteself).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough, implemented this.
src/zarr/storage/_cache.py
Outdated
self._contains_cache.clear() | ||
self._listdir_cache.clear() | ||
|
||
def _invalidate_value_unsafe(self, key: Any) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
src/zarr/storage/_cache.py
Outdated
self._check_writable() | ||
|
||
# Check if it's a Store object vs dict-like object | ||
if hasattr(self._store, "supports_listing"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is redundant, all stores define the supports_listing
property.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed.
src/zarr/storage/_cache.py
Outdated
with self._mutex: | ||
self._invalidate_keys_unsafe() | ||
cache_key = self._normalize_key(key) | ||
self._invalidate_value_unsafe(cache_key) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here I think it's easier to just do _invalidate_keys()
and then _invalidate_value()
without the mutex block.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
src/zarr/storage/_cache.py
Outdated
|
||
async def exists(self, key: str) -> bool: | ||
# Delegate to the underlying store | ||
return await self._store.exists(key) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be checking the cache first instead of checking the store every time? Otherwise it defeats the point of having the cache!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True, matching with getsize now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👏
I've addressed a lot of the inline comments and removed the dict-like branches, but I'm getting lots of the same mypy errors when removing if hasattr blocks either:
Will we always be delegating to the underlying store in these cases? In which instance, I can get rid of the if-else statement which resolves the mypy issues. If we want to have the underlying store or the cache for these methods then I'm not sure how to keep the if-else statement without running into mypy issues. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've commented and/or implemented changes for the inline comments.
src/zarr/storage/_cache.py
Outdated
Extract directory listing from store keys by filtering keys that start with the given path. | ||
""" | ||
children: set[str] = set() | ||
# Handle both Store objects and dict-like objects |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove handling of both types of objects.
src/zarr/storage/_cache.py
Outdated
self._keys_cache = [] | ||
return self._keys_cache | ||
|
||
def listdir(self, path: Path) -> list[str]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deleted this.
src/zarr/storage/_cache.py
Outdated
return sorted(children) | ||
|
||
|
||
def listdir(store: Store, path: Path | None = None) -> list[str]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deleted this.
src/zarr/storage/_cache.py
Outdated
supports_writes: bool = True | ||
supports_deletes: bool = True | ||
supports_partial_writes: bool = True | ||
supports_listing: bool = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made into properties that use underlying store.
src/zarr/storage/_cache.py
Outdated
supports_partial_writes: bool = True | ||
supports_listing: bool = True | ||
|
||
root: Path |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed this.
src/zarr/storage/_cache.py
Outdated
with self._mutex: | ||
self._invalidate_keys_unsafe() | ||
cache_key = self._normalize_key(key) | ||
self._invalidate_value_unsafe(cache_key) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
src/zarr/storage/_cache.py
Outdated
self._check_writable() | ||
|
||
# Check if it's a Store object vs dict-like object | ||
if hasattr(self._store, "supports_listing"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed.
src/zarr/storage/_cache.py
Outdated
|
||
async def exists(self, key: str) -> bool: | ||
# Delegate to the underlying store | ||
return await self._store.exists(key) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True, matching with getsize now.
src/zarr/storage/_cache.py
Outdated
underlying_store = self._store.with_read_only(read_only) | ||
return LRUStoreCache(underlying_store, max_size=self._max_size) | ||
|
||
def _normalize_key(self, key: str | Path) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to cache_key = key
without normalize_key function so that it's obvious that it's a cached key but can change if cleaner.
src/zarr/storage/_cache.py
Outdated
raise ValueError("max_size must be a positive integer (bytes)") | ||
|
||
# Always inherit read_only state from the underlying store | ||
read_only = getattr(store, "read_only", False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep changed this now.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The implementaiton is looking real nice now 🕺 . I left a few more comments, but nothing major. Mostly more code that I think can be binned.
I had a first look at the tests, and they're a bit funny at the moment. I really like the idea behind CountingDict
to track the method calls, but it needs some improvement because you should be testing the method calls on an actual Store
object, not a dict
. I left a suggestion as to the best way to implement this in the inline comments. Once that's done, I will probably have some more comments/questions on the tests.
tests/test_store/test_cache.py
Outdated
from zarr.testing.store import StoreTests | ||
|
||
|
||
class CountingDict(dict[Any, Any]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Checking how many times the different methods have been called is a really neat idea! This currently needs some fixing though, because you should be testing how many times method on a Store
obejct is called, but this is not a Store
object (although it sort of looks like one).
So my recommendation to fix this is to remove CountingDict
, and instead implement a store that is a thin wrapper of a MemoryStore
, that also contains logic to track method calls. Something like:
class CounterStore(MemoryStore):
"""
A thin wrapper of MemoryStore to count different method calls for testing.
"""
def __init__(self) -> None:
super().__init__()
self.counter: Counter[tuple[str, Any] | str] = Counter()
async def clear(self) -> None:
self.counter["clear"] += 1
# docstring inherited
self._store_dict.clear()
# TODO: implement other methods that should be tracked
Does that make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have some more questions, left inline. As well as checking the tests run, can you check that pre-commit
runs too, and fix any errors? It looks like there's currently a few typing errors that need addressing in the new code.
src/zarr/storage/_cache.py
Outdated
self.misses += 1 | ||
size = await self._store.getsize(key) | ||
|
||
# Try to get and cache the value if it's reasonably small i.e. ≤10% of max cache size |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What was the motivation for you choosing this threshold for the value size?
One of the ideas of the cache store is it's useful for stores (e.g., remote stores) where making a query for the store is the reason for them being slow, not then getting a value from the store. So my thinking was it getsize(key)
takes just as long as getvalue(key)
for a remote store, the implementation of LRUCacheStore.getsize
might as well 1) get the value 2) cache it, and 3) 'manually' calculate and return the size of the value, instead of relying on the underlying _store.getsize(key)
implementation.
Please could you do some benchmarking using a remote store to see if it makes sense to do what I suggested above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed this now and did some quick benchmarking so your suggestion is correct that the time saved is worth the new implementation.
I've run pre-commit for the 2 files I changed i.e. _cache.py and the tests |
caching_store.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like you accidentally committed this file?
uv.lock
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like you accidentally committed this file too?
[Description of PR]
TODO:
docs/user-guide/*.rst
changes/