Skip to content

Conversation

ruaridhg
Copy link
Owner

[Description of PR]

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/user-guide/*.rst
  • Changes documented as a new file in changes/
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@ruaridhg ruaridhg marked this pull request as ready for review August 27, 2025 14:37
Copy link

@dstansby dstansby left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left a first round of comments - hopefully enough to start iterating on. I haven't yet reviewed the implementation or tests, and will do that next.

pyproject.toml Outdated
"ignore:Unclosed client session <aiohttp.client.ClientSession.*:ResourceWarning"
]
markers = [
"asyncio: mark test as asyncio test",
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you need to add this to stop some tests failing (or pytest in general failing)? It seems unrelated to the tests you added, so I'm a bit surprised.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can remove it for now, but yes I think there were some failing tests within the github actions when I didn't include it. Tests pass locally fine without it.

Copy link

@dstansby dstansby left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've reviewd the implementation, and have left (lots of!) comments. I get the impression that you might have taken code from v2 of zarr-python and added code to work with v3 of zarr-python, but then haven't removed a lot of the code that is no longer relevant for v3 zarr-python. For example, there are lots of hasattr() checks that will either always return False or True if you look at the definition of the Store abstract class, and there are lots of methods you've defined that duplicate functionality in the async methods.

I think my comments inline cover most of what needs changing/updating, but it would be good if you could cross check what you've written against some other v3 stores to make sure the code structure and methods implemented are similar.

@ruaridhg ruaridhg requested a review from dstansby September 4, 2025 10:06
Copy link

@dstansby dstansby left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some general comments:

  • There are several if hasattr(...) blocks that are redundant, because all stores define those attributes. You can check which properties/attributes/methods all stores have at https://zarr.readthedocs.io/en/stable/api/zarr/abc/store/index.html#zarr.abc.store.Store. These redudnant blocks should be simplified. I started commenting on these, but there's enough that I stopped, so you'll have to find them all.
  • There are several branches that claim to handle "dict-like" objects, but self._store is always a store. So these branches can be removed. I started commenting on these, but there's enough that I stopped, so you'll have to find them all.
  • There are a few methods that are redundant or duplicate functionality, and should be removed (see inline comments)

Extract directory listing from store keys by filtering keys that start with the given path.
"""
children: set[str] = set()
# Handle both Store objects and dict-like objects
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are you handling both Store and dict-like objects, when the type of store in the signature only allows Store objects?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove handling of both types of objects.

self._keys_cache = []
return self._keys_cache

def listdir(self, path: Path) -> list[str]:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This whole method can be deleted, because it's made redundant by list_dir(), which is the method that the Store base class says has to be implemented.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deleted this.

return sorted(children)


def listdir(store: Store, path: Path | None = None) -> list[str]:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This whole function can be deleted, because it's only used in listdir() below, which can also be delted (see other comment below).

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deleted this.

supports_writes: bool = True
supports_deletes: bool = True
supports_partial_writes: bool = True
supports_listing: bool = True
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will not always be True, because it depends on if the underlying store supports listing. In fact, I think all these properties will depend on the underlying store. So they should be re-implemented as properties that return the values from the underlying store.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made into properties that use underlying store.

self._contains_cache.clear()
self._listdir_cache.clear()

def _invalidate_keys_unsafe(self) -> None:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of defining this and invalidate_keys() and duplicating code, I would keep just invalidate_keys(), and when you need to call it just always call it outsdie a with self._mutex: block (because it handles the mutex iteself).

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough, implemented this.

self._contains_cache.clear()
self._listdir_cache.clear()

def _invalidate_value_unsafe(self, key: Any) -> None:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

self._check_writable()

# Check if it's a Store object vs dict-like object
if hasattr(self._store, "supports_listing"):
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is redundant, all stores define the supports_listing property.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

Comment on lines 498 to 501
with self._mutex:
self._invalidate_keys_unsafe()
cache_key = self._normalize_key(key)
self._invalidate_value_unsafe(cache_key)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here I think it's easier to just do _invalidate_keys() and then _invalidate_value() without the mutex block.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


async def exists(self, key: str) -> bool:
# Delegate to the underlying store
return await self._store.exists(key)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be checking the cache first instead of checking the store every time? Otherwise it defeats the point of having the cache!

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, matching with getsize now.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👏

@ruaridhg
Copy link
Owner Author

ruaridhg commented Sep 5, 2025

Some general comments:

  • There are several if hasattr(...) blocks that are redundant, because all stores define those attributes. You can check which properties/attributes/methods all stores have at https://zarr.readthedocs.io/en/stable/api/zarr/abc/store/index.html#zarr.abc.store.Store. These redudnant blocks should be simplified. I started commenting on these, but there's enough that I stopped, so you'll have to find them all.
  • There are several branches that claim to handle "dict-like" objects, but self._store is always a store. So these branches can be removed. I started commenting on these, but there's enough that I stopped, so you'll have to find them all.
  • There are a few methods that are redundant or duplicate functionality, and should be removed (see inline comments)

I've addressed a lot of the inline comments and removed the dict-like branches, but I'm getting lots of the same mypy errors when removing if hasattr blocks either:
Function "list" could always be true in boolean context [truthy-function] - when I changed it to if self._store.list:
Statement is unreachable [unreachable] - when I changed it to if callable(self._store.keys):

async def list(self) -> AsyncIterator[str]:
        # Delegate to the underlying store
        if self._store.list:
            async for key in self._store.list():
                yield key
        else:
            # Fallback for stores that don't have async list
            if callable(self._store.keys):
                for key in list(self._store.keys()):
                    yield key

Will we always be delegating to the underlying store in these cases? In which instance, I can get rid of the if-else statement which resolves the mypy issues. If we want to have the underlying store or the cache for these methods then I'm not sure how to keep the if-else statement without running into mypy issues.

Copy link
Owner Author

@ruaridhg ruaridhg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've commented and/or implemented changes for the inline comments.

Extract directory listing from store keys by filtering keys that start with the given path.
"""
children: set[str] = set()
# Handle both Store objects and dict-like objects
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove handling of both types of objects.

self._keys_cache = []
return self._keys_cache

def listdir(self, path: Path) -> list[str]:
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deleted this.

return sorted(children)


def listdir(store: Store, path: Path | None = None) -> list[str]:
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deleted this.

supports_writes: bool = True
supports_deletes: bool = True
supports_partial_writes: bool = True
supports_listing: bool = True
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made into properties that use underlying store.

supports_partial_writes: bool = True
supports_listing: bool = True

root: Path
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed this.

Comment on lines 498 to 501
with self._mutex:
self._invalidate_keys_unsafe()
cache_key = self._normalize_key(key)
self._invalidate_value_unsafe(cache_key)
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

self._check_writable()

# Check if it's a Store object vs dict-like object
if hasattr(self._store, "supports_listing"):
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.


async def exists(self, key: str) -> bool:
# Delegate to the underlying store
return await self._store.exists(key)
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, matching with getsize now.

underlying_store = self._store.with_read_only(read_only)
return LRUStoreCache(underlying_store, max_size=self._max_size)

def _normalize_key(self, key: str | Path) -> str:
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to cache_key = key without normalize_key function so that it's obvious that it's a cached key but can change if cleaner.

raise ValueError("max_size must be a positive integer (bytes)")

# Always inherit read_only state from the underlying store
read_only = getattr(store, "read_only", False)
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep changed this now.

@dstansby
Copy link

dstansby commented Sep 5, 2025

Function "list" could always be true in boolean context [truthy-function] - when I changed it to if self._store.list:
Statement is unreachable [unreachable] - when I changed it to if callable(self._store.keys):

.list is a function, so I'm not actually sure how if self._store.list: is evaluated. Instead, you can use .supports_listing to check if the store supports listing or not, and if it doesn't an error should be raised by LRUCacheStore because it won't be possible to list the underlying store.

@ruaridhg ruaridhg requested a review from dstansby September 5, 2025 14:01
Copy link

@dstansby dstansby left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementaiton is looking real nice now 🕺 . I left a few more comments, but nothing major. Mostly more code that I think can be binned.

I had a first look at the tests, and they're a bit funny at the moment. I really like the idea behind CountingDict to track the method calls, but it needs some improvement because you should be testing the method calls on an actual Store object, not a dict. I left a suggestion as to the best way to implement this in the inline comments. Once that's done, I will probably have some more comments/questions on the tests.

from zarr.testing.store import StoreTests


class CountingDict(dict[Any, Any]):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checking how many times the different methods have been called is a really neat idea! This currently needs some fixing though, because you should be testing how many times method on a Store obejct is called, but this is not a Store object (although it sort of looks like one).

So my recommendation to fix this is to remove CountingDict, and instead implement a store that is a thin wrapper of a MemoryStore, that also contains logic to track method calls. Something like:

class CounterStore(MemoryStore):
    """
    A thin wrapper of MemoryStore to count different method calls for testing.
    """

    def __init__(self) -> None:
        super().__init__()
        self.counter: Counter[tuple[str, Any] | str] = Counter()

    async def clear(self) -> None:
        self.counter["clear"] += 1
        # docstring inherited
        self._store_dict.clear()

# TODO: implement other methods that should be tracked

Does that make sense?

@ruaridhg ruaridhg requested a review from dstansby September 15, 2025 12:20
Copy link

@dstansby dstansby left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some more questions, left inline. As well as checking the tests run, can you check that pre-commit runs too, and fix any errors? It looks like there's currently a few typing errors that need addressing in the new code.

self.misses += 1
size = await self._store.getsize(key)

# Try to get and cache the value if it's reasonably small i.e. ≤10% of max cache size

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What was the motivation for you choosing this threshold for the value size?

One of the ideas of the cache store is it's useful for stores (e.g., remote stores) where making a query for the store is the reason for them being slow, not then getting a value from the store. So my thinking was it getsize(key) takes just as long as getvalue(key) for a remote store, the implementation of LRUCacheStore.getsize might as well 1) get the value 2) cache it, and 3) 'manually' calculate and return the size of the value, instead of relying on the underlying _store.getsize(key) implementation.

Please could you do some benchmarking using a remote store to see if it makes sense to do what I suggested above?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed this now and did some quick benchmarking so your suggestion is correct that the time saved is worth the new implementation.

@ruaridhg
Copy link
Owner Author

I have some more questions, left inline. As well as checking the tests run, can you check that pre-commit runs too, and fix any errors? It looks like there's currently a few typing errors that need addressing in the new code.

I've run pre-commit for the 2 files I changed i.e. _cache.py and the tests

@ruaridhg ruaridhg requested a review from dstansby September 16, 2025 15:36
caching_store.py Outdated

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like you accidentally committed this file?

uv.lock Outdated

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like you accidentally committed this file too?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants