
CacheStore containing source store and cache store #3366

Status: Open. Wants to merge 31 commits into base: main.

Commits (31):
abb764e
Add _cache.py first attempt
ruaridhg Jul 31, 2025
d72078f
test.py ran without error, creating test.zarr/
ruaridhg Jul 31, 2025
e1266b4
Added testing for cache.py LRUStoreCache for v3
ruaridhg Aug 4, 2025
40e6f46
Fix ruff errors
ruaridhg Aug 4, 2025
eadc7bb
Add working example comparing LocalStore to LRUStoreCache
ruaridhg Aug 4, 2025
5f90a71
Delete test.py to clean-up
ruaridhg Aug 4, 2025
ae51d23
Added lrustorecache to changes and user-guide docs
ruaridhg Aug 7, 2025
e58329a
Fix linting issues
ruaridhg Aug 7, 2025
26bd3fc
Implement dual store cache
ruaridhg Aug 8, 2025
5c92d48
Fixed failing tests
ruaridhg Aug 8, 2025
f0c302c
Fix linting errors
ruaridhg Aug 8, 2025
11f17d6
Add logger info
ruaridhg Aug 11, 2025
a7810dc
Delete unnecessary extra functionality
ruaridhg Aug 11, 2025
a607ce0
Rename to caching_store
ruaridhg Aug 11, 2025
8e79e3e
Add test_storage.py
ruaridhg Aug 11, 2025
d31e565
Fix logic in _caching_store.py
ruaridhg Aug 11, 2025
92cd63c
Update tests to match caching_store implementation
ruaridhg Aug 11, 2025
aa38def
Delete LRUStoreCache files
ruaridhg Aug 11, 2025
86dda09
Update __init__
ruaridhg Aug 11, 2025
bb807d0
Add functionality for max_size
ruaridhg Aug 11, 2025
ed4b284
Add tests for cache_info and clear_cache
ruaridhg Aug 11, 2025
0fe580b
Delete test.py
ruaridhg Aug 11, 2025
1d9a1f7
Fix linting errors
ruaridhg Aug 11, 2025
16ae3bd
Update feature description
ruaridhg Aug 11, 2025
62b739f
Fix errors
ruaridhg Aug 11, 2025
f51fdb8
Fix cachingstore.rst errors
ruaridhg Aug 11, 2025
ffa9822
Fix cachingstore.rst errors
ruaridhg Aug 11, 2025
cda4767
Merge branch 'main' into rmg/cache_remote_stores_locally
ruaridhg Aug 11, 2025
d20843a
Fixed eviction key logic with proper size tracking
ruaridhg Aug 11, 2025
4b8d0a6
Increase code coverage to 98%
ruaridhg Aug 11, 2025
84a87e2
Fix linting errors
ruaridhg Aug 11, 2025
1 change: 1 addition & 0 deletions changes/3357.feature.rst
@@ -0,0 +1 @@
Add CacheStore to Zarr 3.0
304 changes: 304 additions & 0 deletions docs/user-guide/cachingstore.rst
@@ -0,0 +1,304 @@
.. only:: doctest

    >>> import shutil
    >>> shutil.rmtree('test.zarr', ignore_errors=True)

.. _user-guide-cachestore:

CacheStore guide
================

The :class:`zarr.storage.CacheStore` provides a dual-store caching implementation
that can be wrapped around any Zarr store to improve performance for repeated data access.
This is particularly useful when working with remote stores (e.g., S3, HTTP) where network
latency can significantly impact data access speed.

The CacheStore implements a cache that uses a separate Store instance as the cache backend,
providing persistent caching capabilities with time-based expiration, size-based eviction,
and flexible cache storage options. It automatically evicts the least recently used items
when the cache reaches its maximum size.
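
The read-through behavior described above can be sketched in a few lines. This is an illustrative model only; the class name and attributes are hypothetical and do not reflect the actual ``CacheStore`` internals:

```python
# Illustrative sketch of the dual-store read path; the class name and
# attributes are hypothetical, not the actual CacheStore internals.
import time

class DualStoreCacheSketch:
    def __init__(self, source, cache, max_age_seconds=None):
        self.source = source      # authoritative store
        self.cache = cache        # any mapping-like store used as the cache
        self.max_age = max_age_seconds
        self._stamps = {}         # key -> monotonic time when cached

    def _fresh(self, key):
        # A key is fresh if entries never expire or it was cached recently enough.
        if self.max_age is None:
            return True
        return (time.monotonic() - self._stamps.get(key, 0.0)) < self.max_age

    def get(self, key):
        if key in self._stamps and self._fresh(key):
            return self.cache[key]        # cache hit
        value = self.source[key]          # cache miss: read the source store
        self.cache[key] = value           # populate the cache
        self._stamps[key] = time.monotonic()
        return value
```

The same pattern generalizes to writes (``cache_set_data``) and expiry; the real store also tracks cumulative size for eviction.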

.. note::
    The CacheStore is a wrapper store that maintains compatibility with the full
    :class:`zarr.abc.store.Store` API while adding transparent caching functionality.

Basic Usage
-----------

Creating a CacheStore requires both a source store and a cache store. The cache store
can be any Store implementation, providing flexibility in cache persistence:

>>> import zarr
>>> import zarr.storage
>>> import numpy as np
>>>
>>> # Create a local store and a separate cache store
>>> source_store = zarr.storage.LocalStore('test.zarr')
>>> cache_store = zarr.storage.MemoryStore() # In-memory cache
>>> cached_store = zarr.storage.CacheStore(
... store=source_store,
... cache_store=cache_store,
... max_size=256*1024*1024 # 256MB cache
... )
>>>
>>> # Create an array using the cached store
>>> zarr_array = zarr.zeros((100, 100), chunks=(10, 10), dtype='f8', store=cached_store, mode='w')
>>>
>>> # Write some data to force chunk creation
>>> zarr_array[:] = np.random.random((100, 100))

Review thread (attached to the snippet above):

Member: Thanks for this PR. I think it would be better to have the LRU functionality on the cache_store (in this example the MemoryStore). Otherwise the enclosing CacheStore would need to keep track of all keys and their access order in the inner store. That could be problematic if the inner store were shared with other CacheStores or other code.

Contributor: > That could be problematic if the inner store were shared with other CacheStores or other code.

As long as one of the design goals is to use a regular zarr store as the caching layer, there is nothing we can do to guard against external access to the cache store. For example, if someone uses a LocalStore as a cache, we can't protect the local file system from external modification. I think it's the user's responsibility to ensure that they don't use the same cache for separate CacheStores.

Contributor: But our default behavior could be to create a fresh MemoryStore, which would be a safe default.

Member: My main concern here is about the abstraction. The LRU fits better in the inner store than in the CacheStore, imo. There could even be an LRUStore that wraps a store and implements the tracking and eviction. The safety concern is, as you pointed out, something the user should take care of.

Contributor: That makes sense; maybe we could implement LRUStore as another store wrapper?

The dual-store architecture allows you to use different store types for source and cache,
such as a remote store for source data and a local store for persistent caching.

Performance Benefits
--------------------

The CacheStore provides significant performance improvements for repeated data access:

>>> import time
>>>
>>> # Benchmark reading with cache
>>> start = time.time()
>>> for _ in range(100):
...     _ = zarr_array[:]
>>> elapsed_cache = time.time() - start
>>>
>>> # Compare with direct store access (without cache)
>>> zarr_array_nocache = zarr.open('test.zarr', mode='r')
>>> start = time.time()
>>> for _ in range(100):
...     _ = zarr_array_nocache[:]
>>> elapsed_nocache = time.time() - start
>>>
>>> # Cache provides speedup for repeated access
>>> speedup = elapsed_nocache / elapsed_cache # doctest: +SKIP

Cache effectiveness is particularly pronounced with repeated access to the same data chunks.

Remote Store Caching
--------------------

The CacheStore is most beneficial when used with remote stores where network latency
is a significant factor. You can use different store types for source and cache:

>>> from zarr.storage import FsspecStore, LocalStore
>>>
>>> # Create a remote store (S3 example) - for demonstration only
>>> remote_store = FsspecStore.from_url('s3://bucket/data.zarr', storage_options={'anon': True}) # doctest: +SKIP
>>>
>>> # Use a local store for persistent caching
>>> local_cache_store = LocalStore('cache_data') # doctest: +SKIP
>>>
>>> # Create cached store with persistent local cache
>>> cached_store = zarr.storage.CacheStore( # doctest: +SKIP
... store=remote_store,
... cache_store=local_cache_store,
... max_size=512*1024*1024 # 512MB cache
... )
>>>
>>> # Open array through cached store
>>> z = zarr.open(cached_store) # doctest: +SKIP

The first access to any chunk will be slow (network retrieval), but subsequent accesses
to the same chunk will be served from the local cache, providing dramatic speedup.
The cache persists between sessions when using a LocalStore for the cache backend.

Cache Configuration
-------------------

The CacheStore can be configured with several parameters:

**max_size**: Controls the maximum size of cached data in bytes

>>> # 256MB cache with size limit
>>> cache = zarr.storage.CacheStore(
... store=source_store,
... cache_store=cache_store,
... max_size=256*1024*1024
... )
>>>
>>> # Unlimited cache size (use with caution)
>>> cache = zarr.storage.CacheStore(
... store=source_store,
... cache_store=cache_store,
... max_size=None
... )
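
When ``max_size`` is set, the least recently used entries are evicted first. A hypothetical sketch of that eviction policy (illustrative only, not the actual ``CacheStore`` code):

```python
# Hypothetical sketch of size-based LRU eviction (illustrative only, not
# the actual CacheStore code): before storing a new value, evict the
# least-recently-used entries until the new value fits under max_size.
from collections import OrderedDict

def lru_put(cache: OrderedDict, key: str, value: bytes, max_size: int) -> None:
    current = sum(len(v) for v in cache.values())
    while cache and current + len(value) > max_size:
        _, evicted = cache.popitem(last=False)  # drop the oldest entry
        current -= len(evicted)
    cache[key] = value
    cache.move_to_end(key)                      # mark as most recently used
```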

**max_age_seconds**: Controls time-based cache expiration

>>> # Cache expires after 1 hour
>>> cache = zarr.storage.CacheStore(
... store=source_store,
... cache_store=cache_store,
... max_age_seconds=3600
... )
>>>
>>> # Cache never expires
>>> cache = zarr.storage.CacheStore(
... store=source_store,
... cache_store=cache_store,
... max_age_seconds="infinity"
... )

**cache_set_data**: Controls whether written data is cached

>>> # Cache data when writing (default)
>>> cache = zarr.storage.CacheStore(
... store=source_store,
... cache_store=cache_store,
... cache_set_data=True
... )
>>>
>>> # Don't cache written data (read-only cache)
>>> cache = zarr.storage.CacheStore(
... store=source_store,
... cache_store=cache_store,
... cache_set_data=False
... )

Cache Statistics
----------------

The CacheStore provides statistics to monitor cache performance and state:

>>> # Access some data to generate cache activity
>>> data = zarr_array[0:50, 0:50] # First access - cache miss
>>> data = zarr_array[0:50, 0:50] # Second access - cache hit
>>>
>>> # Get comprehensive cache information
>>> info = cached_store.cache_info()
>>> info['cache_store_type'] # doctest: +SKIP
'MemoryStore'
>>> isinstance(info['max_age_seconds'], (int, str))
True
>>> isinstance(info['max_size'], (int, type(None)))
True
>>> info['current_size'] >= 0
True
>>> info['tracked_keys'] >= 0
True
>>> info['cached_keys'] >= 0
True
>>> isinstance(info['cache_set_data'], bool)
True

The ``cache_info()`` method returns a dictionary with detailed information about the cache state.

Cache Management
----------------

The CacheStore provides methods for manual cache management:

>>> # Clear all cached data and tracking information
>>> import asyncio
>>> asyncio.run(cached_store.clear_cache()) # doctest: +SKIP
>>>
>>> # Check cache info after clearing
>>> info = cached_store.cache_info() # doctest: +SKIP
>>> info['tracked_keys'] == 0 # doctest: +SKIP
True
>>> info['current_size'] == 0 # doctest: +SKIP
True

The ``clear_cache()`` method is an async method that clears both the cache store
(if it supports the ``clear`` method) and all internal tracking data.

Best Practices
--------------

1. **Choose appropriate cache store**: Use MemoryStore for fast temporary caching or LocalStore for persistent caching
2. **Size the cache appropriately**: Set ``max_size`` based on available storage and expected data access patterns
3. **Use with remote stores**: The cache provides the most benefit when wrapping slow remote stores
4. **Monitor cache statistics**: Use ``cache_info()`` to tune cache size and access patterns
5. **Consider data locality**: Group related data accesses together to improve cache efficiency
6. **Set appropriate expiration**: Use ``max_age_seconds`` for time-sensitive data or ``"infinity"`` for static data
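
Practice 2 can be made concrete with a small sizing heuristic: estimate a ``max_size`` that holds the expected working set of chunks. The helper below is a sketch; its name and overhead factor are illustrative assumptions, not part of the zarr API:

```python
# Hypothetical helper for sizing the cache: estimate a max_size in bytes
# that holds n_hot_chunks chunks, with a small overhead factor for codec
# and metadata bytes. Name and overhead value are illustrative.
import math

def suggested_max_size(chunk_shape, itemsize, n_hot_chunks, overhead=1.2):
    """Estimate a max_size (bytes) that keeps n_hot_chunks chunks cached."""
    chunk_bytes = math.prod(chunk_shape) * itemsize
    return int(chunk_bytes * n_hot_chunks * overhead)

# e.g. keep roughly 1000 hot (10, 10) float64 chunks cached
size = suggested_max_size((10, 10), 8, 1000)
```

Compare the result against ``cache_info()['current_size']`` over time to see whether the working set actually fits.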

Working with Different Store Types
----------------------------------

The CacheStore can wrap any store that implements the :class:`zarr.abc.store.Store` interface
and use any store type for the cache backend:

Local Store with Memory Cache
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

>>> from zarr.storage import LocalStore, MemoryStore
>>> source_store = LocalStore('data.zarr')
>>> cache_store = MemoryStore()
>>> cached_store = zarr.storage.CacheStore(
... store=source_store,
... cache_store=cache_store,
... max_size=128*1024*1024
... )

Remote Store with Local Cache
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

>>> from zarr.storage import FsspecStore, LocalStore
>>> remote_store = FsspecStore.from_url('s3://bucket/data.zarr', storage_options={'anon': True}) # doctest: +SKIP
>>> local_cache = LocalStore('local_cache') # doctest: +SKIP
>>> cached_store = zarr.storage.CacheStore( # doctest: +SKIP
... store=remote_store,
... cache_store=local_cache,
... max_size=1024*1024*1024,
... max_age_seconds=3600
... )

Memory Store with Persistent Cache
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

>>> from zarr.storage import MemoryStore, LocalStore
>>> memory_store = MemoryStore()
>>> persistent_cache = LocalStore('persistent_cache')
>>> cached_store = zarr.storage.CacheStore(
... store=memory_store,
... cache_store=persistent_cache,
... max_size=256*1024*1024
... )

The dual-store architecture provides flexibility in choosing the best combination
of source and cache stores for your specific use case.

Examples from Real Usage
------------------------

Here's a complete example demonstrating cache effectiveness:

>>> import zarr
>>> import zarr.storage
>>> import time
>>> import numpy as np
>>>
>>> # Create test data with dual-store cache
>>> source_store = zarr.storage.LocalStore('benchmark.zarr')
>>> cache_store = zarr.storage.MemoryStore()
>>> cached_store = zarr.storage.CacheStore(
... store=source_store,
... cache_store=cache_store,
... max_size=256*1024*1024
... )
>>> zarr_array = zarr.zeros((100, 100), chunks=(10, 10), dtype='f8', store=cached_store, mode='w')
>>> zarr_array[:] = np.random.random((100, 100))
>>>
>>> # Demonstrate cache effectiveness with repeated access
>>> start = time.time()
>>> data = zarr_array[20:30, 20:30] # First access (cache miss)
>>> first_access = time.time() - start
>>>
>>> start = time.time()
>>> data = zarr_array[20:30, 20:30] # Second access (cache hit)
>>> second_access = time.time() - start
>>>
>>> # Check cache statistics
>>> info = cached_store.cache_info()
>>> info['cached_keys'] > 0 # Should have cached keys
True
>>> info['current_size'] > 0 # Should have cached data
True

This example shows how the CacheStore can significantly reduce access times for repeated
data reads, particularly important when working with remote data sources. The dual-store
architecture allows for flexible cache persistence and management.

1 change: 1 addition & 0 deletions docs/user-guide/index.rst
@@ -23,6 +23,7 @@ Advanced Topics
data_types
performance
consolidated_metadata
cachingstore
extending
gpu

1 change: 1 addition & 0 deletions pyproject.toml
@@ -392,6 +392,7 @@ filterwarnings = [
"ignore:Unclosed client session <aiohttp.client.ClientSession.*:ResourceWarning"
]
markers = [
"asyncio: mark test as asyncio test",
"gpu: mark a test as requiring CuPy and GPU",
"slow_hypothesis: slow hypothesis tests",
]
2 changes: 2 additions & 0 deletions src/zarr/storage/__init__.py
@@ -4,6 +4,7 @@
from typing import Any

from zarr.errors import ZarrDeprecationWarning
from zarr.storage._caching_store import CacheStore
from zarr.storage._common import StoreLike, StorePath
from zarr.storage._fsspec import FsspecStore
from zarr.storage._local import LocalStore
@@ -14,6 +15,7 @@
from zarr.storage._zip import ZipStore

__all__ = [
"CacheStore",
"FsspecStore",
"GpuMemoryStore",
"LocalStore",