Merged
44 commits
abb764e
Add _cache.py first attempt
ruaridhg Jul 31, 2025
d72078f
test.py ran without error, creating test.zarr/
ruaridhg Jul 31, 2025
e1266b4
Added testing for cache.py LRUStoreCache for v3
ruaridhg Aug 4, 2025
40e6f46
Fix ruff errors
ruaridhg Aug 4, 2025
eadc7bb
Add working example comparing LocalStore to LRUStoreCache
ruaridhg Aug 4, 2025
5f90a71
Delete test.py to clean-up
ruaridhg Aug 4, 2025
ae51d23
Added lrustorecache to changes and user-guide docs
ruaridhg Aug 7, 2025
e58329a
Fix linting issues
ruaridhg Aug 7, 2025
26bd3fc
Implement dual store cache
ruaridhg Aug 8, 2025
5c92d48
Fixed failing tests
ruaridhg Aug 8, 2025
f0c302c
Fix linting errors
ruaridhg Aug 8, 2025
11f17d6
Add logger info
ruaridhg Aug 11, 2025
a7810dc
Delete unnecessary extra functionality
ruaridhg Aug 11, 2025
a607ce0
Rename to caching_store
ruaridhg Aug 11, 2025
8e79e3e
Add test_storage.py
ruaridhg Aug 11, 2025
d31e565
Fix logic in _caching_store.py
ruaridhg Aug 11, 2025
92cd63c
Update tests to match caching_store implemtation
ruaridhg Aug 11, 2025
aa38def
Delete LRUStoreCache files
ruaridhg Aug 11, 2025
86dda09
Update __init__
ruaridhg Aug 11, 2025
bb807d0
Add functionality for max_size
ruaridhg Aug 11, 2025
ed4b284
Add tests for cache_info and clear_cache
ruaridhg Aug 11, 2025
0fe580b
Delete test.py
ruaridhg Aug 11, 2025
1d9a1f7
Fix linting errors
ruaridhg Aug 11, 2025
16ae3bd
Update feature description
ruaridhg Aug 11, 2025
62b739f
Fix errors
ruaridhg Aug 11, 2025
f51fdb8
Fix cachingstore.rst errors
ruaridhg Aug 11, 2025
ffa9822
Fix cachingstore.rst errors
ruaridhg Aug 11, 2025
cda4767
Merge branch 'main' into rmg/cache_remote_stores_locally
ruaridhg Aug 11, 2025
d20843a
Fixed eviction key logic with proper size tracking
ruaridhg Aug 11, 2025
4b8d0a6
Increase code coverage to 98%
ruaridhg Aug 11, 2025
84a87e2
Fix linting errors
ruaridhg Aug 11, 2025
f3b6b3e
Merge branch 'main' into rmg/cache_remote_stores_locally
d-v-b Aug 21, 2025
114a29a
Merge branch 'main' into rmg/cache_remote_stores_locally
d-v-b Aug 28, 2025
39cb6b1
Merge branch 'main' into rmg/cache_remote_stores_locally
d-v-b Sep 17, 2025
1f200ed
Merge branch 'main' of https://github.com/zarr-developers/zarr-python…
d-v-b Sep 30, 2025
f9c8c09
move cache store to experimental, fix bugs
d-v-b Sep 30, 2025
6861490
update changelog
d-v-b Sep 30, 2025
41d182c
remove logging config override, remove dead code, adjust evict_key lo…
d-v-b Oct 1, 2025
83539d3
add docs
d-v-b Oct 1, 2025
56db161
add tests for relaxed cache coherency
d-v-b Oct 1, 2025
3d21514
adjust code examples (but we don't know if they work, because we don'…
d-v-b Oct 1, 2025
923cc53
apply changes based on AI code review, and move tests into tests/test…
d-v-b Oct 2, 2025
e319ce3
update changelog
d-v-b Oct 2, 2025
af81c17
fix exception log
d-v-b Oct 2, 2025
1 change: 1 addition & 0 deletions changes/3357.feature.rst
@@ -0,0 +1 @@
Add CacheStore to Zarr 3.0
304 changes: 304 additions & 0 deletions docs/user-guide/cachingstore.rst
@@ -0,0 +1,304 @@
.. only:: doctest

   >>> import shutil
   >>> shutil.rmtree('test.zarr', ignore_errors=True)

.. _user-guide-cachestore:

CacheStore guide
================

The :class:`zarr.storage.CacheStore` provides a dual-store caching implementation
that can be wrapped around any Zarr store to improve performance for repeated data access.
This is particularly useful when working with remote stores (e.g., S3, HTTP) where network
latency can significantly impact data access speed.

The CacheStore implements a cache that uses a separate Store instance as the cache backend,
providing persistent caching capabilities with time-based expiration, size-based eviction,
and flexible cache storage options. It automatically evicts the least recently used items
when the cache reaches its maximum size.

.. note::
   The CacheStore is a wrapper store that maintains compatibility with the full
   :class:`zarr.abc.store.Store` API while adding transparent caching functionality.

Basic Usage
-----------

Creating a CacheStore requires both a source store and a cache store. The cache store
can be any Store implementation, providing flexibility in cache persistence:

>>> import zarr
>>> import zarr.storage
>>> import numpy as np
>>>
>>> # Create a local store and a separate cache store
>>> source_store = zarr.storage.LocalStore('test.zarr')
>>> cache_store = zarr.storage.MemoryStore() # In-memory cache
>>> cached_store = zarr.storage.CacheStore(
... store=source_store,
... cache_store=cache_store,
... max_size=256*1024*1024 # 256MB cache
Member

Thanks for this PR. I think it would be better to have the LRU functionality on the cache_store (in this example the MemoryStore). Otherwise the enclosing CacheStore would need to keep track of all keys and their access order in the inner store. That could be problematic if the inner store would be shared with other CacheStores or other code.

Contributor

> That could be problematic if the inner store would be shared with other CacheStores or other code.

As long as one of the design goals is to use a regular zarr store as the caching layer, there will be nothing we can do to guard against external access to the cache store. For example, if someone uses a LocalStore as a cache, we can't protect the local file system from external modification. I think it's the user's responsibility to ensure that they don't use the same cache for separate CacheStores.

Contributor

but our default behavior could be to create a fresh MemoryStore, which would be a safe default

Member

My main concern here is about the abstraction. The LRU fits better in the inner store than in the CacheStore, imo. There could even be an LRUStore that wraps a store and implements the tracking and eviction.
The safety concern is, as you pointed out, something the user should take care of.

Contributor

> My main concern here is about the abstraction. The LRU fits better in the inner store than in the CacheStore, imo.

That makes sense, maybe we could implement LRUStore as another store wrapper?

Contributor

> I think it would be better to have the LRU functionality on the cache_store (in this example the MemoryStore)

To understand, is the suggestion here that

  1. There's no CacheStore class
  2. Instead all the logic for caching is implemented on Store, and there is a cache_store property that can be set to a second store to enable caching?

Contributor

@normanrz unless you have a concrete proposal for a refactor that would be workable in the scope of this PR, I would suggest we move forward with the PR as-is, and use experience to dial in the abstraction in a later PR.

But knowing that design here might change, I think we should introduce an "experimental" storage module, e.g. zarr.storage.experimental, or a top-level experimental module, zarr.experimental, and put this class there until we are sure that the API is final.

Thoughts? I would like to ship this important feature while also retaining the ability to safely adjust it later. An experimental module seems like a safe way to do that.

Just commenting that I'd love to have this included sooner rather than later, it will be immediately useful 🎉 Thanks @ruaridhg for taking the initiative and putting this together!

Member

Nice use of experimental here! 🤩

... )
>>>
>>> # Create an array using the cached store
>>> zarr_array = zarr.zeros((100, 100), chunks=(10, 10), dtype='f8', store=cached_store, mode='w')
>>>
>>> # Write some data to force chunk creation
>>> zarr_array[:] = np.random.random((100, 100))

The dual-store architecture allows you to use different store types for source and cache,
such as a remote store for source data and a local store for persistent caching.
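
As a quick check that the wrapper behaves like a regular store, data written through the
cached store can be read back the same way; repeated reads of the same chunks are then
served from the cache store rather than the source:

>>> # Read a block back through the cached store
>>> block = zarr_array[0:10, 0:10]
>>> block.shape
(10, 10)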

Performance Benefits
--------------------

The CacheStore provides significant performance improvements for repeated data access:

>>> import time
>>>
>>> # Benchmark reading with cache
>>> start = time.time()
>>> for _ in range(100):
...     _ = zarr_array[:]
>>> elapsed_cache = time.time() - start
>>>
>>> # Compare with direct store access (without cache)
>>> zarr_array_nocache = zarr.open('test.zarr', mode='r')
>>> start = time.time()
>>> for _ in range(100):
...     _ = zarr_array_nocache[:]
>>> elapsed_nocache = time.time() - start
>>>
>>> # Cache provides speedup for repeated access
>>> speedup = elapsed_nocache / elapsed_cache # doctest: +SKIP

Cache effectiveness is particularly pronounced with repeated access to the same data chunks.
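
One way to confirm that the benchmark above actually exercised the cache is to check
that chunk keys ended up in the cache store, using the ``cache_info()`` method described
in more detail below:

>>> cached_store.cache_info()['cached_keys'] > 0
True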

Remote Store Caching
--------------------

The CacheStore is most beneficial when used with remote stores where network latency
is a significant factor. You can use different store types for source and cache:

>>> from zarr.storage import FsspecStore, LocalStore
>>>
>>> # Create a remote store (S3 example) - for demonstration only
>>> remote_store = FsspecStore.from_url('s3://bucket/data.zarr', storage_options={'anon': True}) # doctest: +SKIP
>>>
>>> # Use a local store for persistent caching
>>> local_cache_store = LocalStore('cache_data') # doctest: +SKIP
>>>
>>> # Create cached store with persistent local cache
>>> cached_store = zarr.storage.CacheStore( # doctest: +SKIP
... store=remote_store,
... cache_store=local_cache_store,
... max_size=512*1024*1024 # 512MB cache
... )
>>>
>>> # Open array through cached store
>>> z = zarr.open(cached_store) # doctest: +SKIP

The first access to any chunk will be slow (network retrieval), but subsequent accesses
to the same chunk will be served from the local cache, providing dramatic speedup.
The cache persists between sessions when using a LocalStore for the cache backend.
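
Because the cache lives in an ordinary store, a later session can point a new CacheStore
at the same cache location and reuse chunks fetched earlier; a sketch of that pattern
(the bucket URL and cache path below are placeholders):

>>> # In a later session: reattach to the same local cache directory
>>> cached_store = zarr.storage.CacheStore( # doctest: +SKIP
...     store=FsspecStore.from_url('s3://bucket/data.zarr', storage_options={'anon': True}),
...     cache_store=LocalStore('cache_data'),
...     max_size=512*1024*1024
... )
>>> z = zarr.open(cached_store) # doctest: +SKIP
>>> # Chunks cached in an earlier session are read from 'cache_data' without re-downloading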

Cache Configuration
-------------------

The CacheStore can be configured with several parameters:

**max_size**: Controls the maximum size of cached data in bytes

>>> # 256MB cache with size limit
>>> cache = zarr.storage.CacheStore(
... store=source_store,
... cache_store=cache_store,
... max_size=256*1024*1024
... )
>>>
>>> # Unlimited cache size (use with caution)
>>> cache = zarr.storage.CacheStore(
... store=source_store,
... cache_store=cache_store,
... max_size=None
... )

**max_age_seconds**: Controls time-based cache expiration

>>> # Cache expires after 1 hour
>>> cache = zarr.storage.CacheStore(
... store=source_store,
... cache_store=cache_store,
... max_age_seconds=3600
... )
>>>
>>> # Cache never expires
>>> cache = zarr.storage.CacheStore(
... store=source_store,
... cache_store=cache_store,
... max_age_seconds="infinity"
... )

**cache_set_data**: Controls whether written data is cached

>>> # Cache data when writing (default)
>>> cache = zarr.storage.CacheStore(
... store=source_store,
... cache_store=cache_store,
... cache_set_data=True
... )
>>>
>>> # Don't cache written data (read-only cache)
>>> cache = zarr.storage.CacheStore(
... store=source_store,
... cache_store=cache_store,
... cache_set_data=False
... )
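
These options can be combined as needed; for example, a cache capped at 1GB whose entries
expire after an hour and which does not cache written data:

>>> cache = zarr.storage.CacheStore(
...     store=source_store,
...     cache_store=cache_store,
...     max_size=1024*1024*1024,
...     max_age_seconds=3600,
...     cache_set_data=False
... )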

Cache Statistics
----------------

The CacheStore provides statistics to monitor cache performance and state:

>>> # Access some data to generate cache activity
>>> data = zarr_array[0:50, 0:50] # First access - cache miss
>>> data = zarr_array[0:50, 0:50] # Second access - cache hit
>>>
>>> # Get comprehensive cache information
>>> info = cached_store.cache_info()
>>> info['cache_store_type'] # doctest: +SKIP
'MemoryStore'
>>> isinstance(info['max_age_seconds'], (int, str))
True
>>> isinstance(info['max_size'], (int, type(None)))
True
>>> info['current_size'] >= 0
True
>>> info['tracked_keys'] >= 0
True
>>> info['cached_keys'] >= 0
True
>>> isinstance(info['cache_set_data'], bool)
True

The ``cache_info()`` method returns a dictionary with detailed information about the cache state.
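
During development it can be handy to dump the whole dictionary, for example when
tuning ``max_size`` (the output varies from run to run, so it is skipped here):

>>> info = cached_store.cache_info()
>>> for key in sorted(info): # doctest: +SKIP
...     print(key, info[key])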

Cache Management
----------------

The CacheStore provides methods for manual cache management:

>>> # Clear all cached data and tracking information
>>> import asyncio
>>> asyncio.run(cached_store.clear_cache()) # doctest: +SKIP
>>>
>>> # Check cache info after clearing
>>> info = cached_store.cache_info() # doctest: +SKIP
>>> info['tracked_keys'] == 0 # doctest: +SKIP
True
>>> info['current_size'] == 0 # doctest: +SKIP
True

The ``clear_cache()`` method is an async method that clears both the cache store
(if it supports the ``clear`` method) and all internal tracking data.
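
In code that is itself running inside an event loop, the same method can be awaited
directly instead of going through ``asyncio.run``; a minimal sketch (``reset_cache``
is only an illustrative wrapper, not part of the API):

>>> async def reset_cache(store: zarr.storage.CacheStore) -> None:
...     # Drop cached chunks and tracking data, e.g. before re-reading updated source data
...     await store.clear_cache()
>>> asyncio.run(reset_cache(cached_store)) # doctest: +SKIP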

Best Practices
--------------

1. **Choose appropriate cache store**: Use MemoryStore for fast temporary caching or LocalStore for persistent caching
2. **Size the cache appropriately**: Set ``max_size`` based on available storage and expected data access patterns
3. **Use with remote stores**: The cache provides the most benefit when wrapping slow remote stores
4. **Monitor cache statistics**: Use ``cache_info()`` to tune cache size and access patterns (see the sketch after this list)
5. **Consider data locality**: Group related data accesses together to improve cache efficiency
6. **Set appropriate expiration**: Use ``max_age_seconds`` for time-sensitive data or ``"infinity"`` for static data
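
As suggested in item 4, a small helper can report cache usage against the configured
limit; a sketch that relies only on the keys documented in the Cache Statistics section
(``report_cache`` is an illustrative helper, not part of the API):

>>> def report_cache(store: zarr.storage.CacheStore) -> None:
...     info = store.cache_info()
...     used, limit = info['current_size'], info['max_size']
...     if limit is not None:
...         print(f"cache usage: {used}/{limit} bytes ({100 * used / limit:.1f}%)")
...     else:
...         print(f"cache usage: {used} bytes (no size limit)")
...     print(f"cached keys: {info['cached_keys']}")
>>> report_cache(cached_store) # doctest: +SKIP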

Working with Different Store Types
----------------------------------

The CacheStore can wrap any store that implements the :class:`zarr.abc.store.Store` interface
and use any store type for the cache backend:

Local Store with Memory Cache
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

>>> from zarr.storage import LocalStore, MemoryStore
>>> source_store = LocalStore('data.zarr')
>>> cache_store = MemoryStore()
>>> cached_store = zarr.storage.CacheStore(
... store=source_store,
... cache_store=cache_store,
... max_size=128*1024*1024
... )

Remote Store with Local Cache
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

>>> from zarr.storage import FsspecStore, LocalStore
>>> remote_store = FsspecStore.from_url('s3://bucket/data.zarr', storage_options={'anon': True}) # doctest: +SKIP
>>> local_cache = LocalStore('local_cache') # doctest: +SKIP
>>> cached_store = zarr.storage.CacheStore( # doctest: +SKIP
... store=remote_store,
... cache_store=local_cache,
... max_size=1024*1024*1024,
... max_age_seconds=3600
... )

Memory Store with Persistent Cache
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

>>> from zarr.storage import MemoryStore, LocalStore
>>> memory_store = MemoryStore()
>>> persistent_cache = LocalStore('persistent_cache')
>>> cached_store = zarr.storage.CacheStore(
... store=memory_store,
... cache_store=persistent_cache,
... max_size=256*1024*1024
... )

The dual-store architecture provides flexibility in choosing the best combination
of source and cache stores for your specific use case.

Examples from Real Usage
------------------------

Here's a complete example demonstrating cache effectiveness:

>>> import zarr
>>> import zarr.storage
>>> import time
>>> import numpy as np
>>>
>>> # Create test data with dual-store cache
>>> source_store = zarr.storage.LocalStore('benchmark.zarr')
>>> cache_store = zarr.storage.MemoryStore()
>>> cached_store = zarr.storage.CacheStore(
... store=source_store,
... cache_store=cache_store,
... max_size=256*1024*1024
... )
>>> zarr_array = zarr.zeros((100, 100), chunks=(10, 10), dtype='f8', store=cached_store, mode='w')
>>> zarr_array[:] = np.random.random((100, 100))
>>>
>>> # Demonstrate cache effectiveness with repeated access
>>> start = time.time()
>>> data = zarr_array[20:30, 20:30] # First access (cache miss)
>>> first_access = time.time() - start
>>>
>>> start = time.time()
>>> data = zarr_array[20:30, 20:30] # Second access (cache hit)
>>> second_access = time.time() - start
>>>
>>> # Check cache statistics
>>> info = cached_store.cache_info()
>>> info['cached_keys'] > 0 # Should have cached keys
True
>>> info['current_size'] > 0 # Should have cached data
True

This example shows how the CacheStore can significantly reduce access times for repeated
data reads, which is particularly important when working with remote data sources. The dual-store
architecture allows for flexible cache persistence and management.

1 change: 1 addition & 0 deletions docs/user-guide/index.rst
@@ -23,6 +23,7 @@ Advanced Topics
data_types
performance
consolidated_metadata
cachingstore
extending
gpu

1 change: 1 addition & 0 deletions pyproject.toml
@@ -392,6 +392,7 @@ filterwarnings = [
"ignore:Unclosed client session <aiohttp.client.ClientSession.*:ResourceWarning"
]
markers = [
"asyncio: mark test as asyncio test",
"gpu: mark a test as requiring CuPy and GPU",
"slow_hypothesis: slow hypothesis tests",
]
2 changes: 2 additions & 0 deletions src/zarr/storage/__init__.py
@@ -4,6 +4,7 @@
from typing import Any

from zarr.errors import ZarrDeprecationWarning
from zarr.storage._caching_store import CacheStore
from zarr.storage._common import StoreLike, StorePath
from zarr.storage._fsspec import FsspecStore
from zarr.storage._local import LocalStore
@@ -14,6 +15,7 @@
from zarr.storage._zip import ZipStore

__all__ = [
"CacheStore",
"FsspecStore",
"GpuMemoryStore",
"LocalStore",