Open
Changes from 32 commits
40 commits
abb764e
Add _cache.py first attempt
ruaridhg Jul 31, 2025
d72078f
test.py ran without error, creating test.zarr/
ruaridhg Jul 31, 2025
e1266b4
Added testing for cache.py LRUStoreCache for v3
ruaridhg Aug 4, 2025
40e6f46
Fix ruff errors
ruaridhg Aug 4, 2025
eadc7bb
Add working example comparing LocalStore to LRUStoreCache
ruaridhg Aug 4, 2025
5f90a71
Delete test.py to clean-up
ruaridhg Aug 4, 2025
ae51d23
Added lrustorecache to changes and user-guide docs
ruaridhg Aug 7, 2025
e58329a
Fix linting issues
ruaridhg Aug 7, 2025
995ad1b
Merge branch 'zarr-developers:main' into rmg/LRUStoreCache
ruaridhg Aug 27, 2025
e84ebbe
Fix doctest errors
ruaridhg Aug 28, 2025
8b22c6b
Update docs/user-guide/lrustorecache.rst
ruaridhg Aug 29, 2025
715296e
Update LRUStoreCache docstring and modify max_size to remove None as …
ruaridhg Aug 29, 2025
34328f4
Expand changes description
ruaridhg Aug 29, 2025
6033416
Improve wording in lrustorecache.rst
ruaridhg Aug 29, 2025
b41d9e4
Fix pre-commit errors and failing tests
ruaridhg Aug 29, 2025
54322d2
Remove asyncio marker from pyproject.toml
ruaridhg Sep 1, 2025
ae65b38
Apply suggestions from code review
ruaridhg Sep 1, 2025
94634b3
Fixed failing tests with some PR review comments addressed
ruaridhg Sep 1, 2025
b31fd7c
Modify **_item before potential deletion
ruaridhg Sep 1, 2025
f211f9a
Remove **_item methods
ruaridhg Sep 1, 2025
5431e41
Add warning for data exceeding cache and test
ruaridhg Sep 1, 2025
fcab264
Remove unused functions
ruaridhg Sep 3, 2025
b27014d
Fix linting
ruaridhg Sep 3, 2025
aa9f12e
Add tests to increase code coverage
ruaridhg Sep 3, 2025
b7f4458
Add methods for consistency with other stores
ruaridhg Sep 3, 2025
fde1ff7
Add in test for else statement in listdir
ruaridhg Sep 3, 2025
2a2692f
Modify listdir method for LRUStoreCache
ruaridhg Sep 4, 2025
7e5b83d
Apply suggestions from code review
ruaridhg Sep 4, 2025
5915d84
Matching underline lengths for titles
ruaridhg Sep 5, 2025
7c1ff74
Address latest PR comments removing redundant functions and updating …
ruaridhg Sep 5, 2025
8aaef7e
Remove hasattr and dict-like object references
ruaridhg Sep 5, 2025
80fa2b2
Fix remaining mypy issues
ruaridhg Sep 5, 2025
4f4be57
Updated _cache.py to remove redundant functions
ruaridhg Sep 15, 2025
b4c2aca
Add tests for new getsize implementation
ruaridhg Sep 15, 2025
7761c5c
Modify getsize
ruaridhg Sep 15, 2025
95353d9
Delete test files
ruaridhg Sep 15, 2025
5ade440
Delete local tests
ruaridhg Sep 15, 2025
8b46576
Remove dict-like references in LRUStoreCache and tests
ruaridhg Sep 15, 2025
115390f
Remove dimension separator test function
ruaridhg Sep 15, 2025
51aeab7
Remove unused files
ruaridhg Sep 17, 2025
3 changes: 3 additions & 0 deletions changes/3357.feature.rst
@@ -0,0 +1,3 @@
Add LRUStoreCache for improved performance with remote stores

The new ``LRUStoreCache`` provides a least-recently-used (LRU) caching layer that can be wrapped around any zarr store to significantly improve performance, especially for remote stores where network latency is a bottleneck.
186 changes: 186 additions & 0 deletions docs/user-guide/lrustorecache.rst
@@ -0,0 +1,186 @@
.. only:: doctest

   >>> import shutil
   >>> shutil.rmtree('test.zarr', ignore_errors=True)

.. _user-guide-lrustorecache:

LRUStoreCache guide
===================

The :class:`zarr.storage.LRUStoreCache` provides a least-recently-used (LRU) cache layer
that can be wrapped around any Zarr store to improve performance for repeated data access.
This is particularly useful when working with remote stores (e.g., S3, HTTP) where network
latency can significantly impact data access speed.

The LRUStoreCache keeps recently accessed data chunks in memory and automatically
evicts the least recently used items when the cache reaches its maximum size.
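
The eviction rule described above can be sketched in a few lines of standard-library Python. This is a toy model, not the implementation in ``_cache.py``: the class name ``ByteBoundedLRU`` and its methods are hypothetical, and it silently skips values larger than the whole cache.

```python
from collections import OrderedDict
from typing import Optional


class ByteBoundedLRU:
    """Toy LRU cache bounded by total value size in bytes (illustrative only)."""

    def __init__(self, max_size: int) -> None:
        self.max_size = max_size
        self.current_size = 0
        self._values = OrderedDict()  # key -> bytes, oldest entry first

    def get(self, key: str) -> Optional[bytes]:
        value = self._values.get(key)
        if value is not None:
            # Mark the key as most recently used.
            self._values.move_to_end(key)
        return value

    def put(self, key: str, value: bytes) -> None:
        if len(value) > self.max_size:
            # A value larger than the whole cache is never stored.
            return
        if key in self._values:
            self.current_size -= len(self._values.pop(key))
        # Evict from the least recently used end until the new value fits.
        while self.current_size + len(value) > self.max_size:
            _, evicted = self._values.popitem(last=False)
            self.current_size -= len(evicted)
        self._values[key] = value
        self.current_size += len(value)


lru = ByteBoundedLRU(max_size=20)
lru.put("c/0/0", b"x" * 10)
lru.put("c/0/1", b"y" * 10)
lru.get("c/0/0")             # touch c/0/0, so c/0/1 is now least recently used
lru.put("c/0/2", b"z" * 10)  # evicts c/0/1 to make room
remaining = sorted(lru._values)
```

Reading ``c/0/0`` before the third write is what protects it from eviction: recency of access, not frequency, decides which entry is dropped.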

.. note::
   The LRUStoreCache is a wrapper store that maintains compatibility with the full
   :class:`zarr.abc.store.Store` API while adding transparent caching functionality.

Basic Usage
-----------

To create an LRUStoreCache, wrap any existing store with the cache:

>>> import zarr
>>> import zarr.storage
>>> import numpy as np
>>>
>>> # Create a local store and wrap it with LRU cache
>>> local_store = zarr.storage.LocalStore('test.zarr')
>>> cache = zarr.storage.LRUStoreCache(local_store, max_size=1024 * 1024 * 256)  # 256 MiB cache
>>>
>>> # Create an array using the cached store
>>> zarr_array = zarr.zeros((100, 100), chunks=(10, 10), dtype='f8', store=cache, mode='w')
>>>
>>> # Write some data to force chunk creation
>>> zarr_array[:] = np.random.random((100, 100))

The ``max_size`` parameter controls the maximum memory usage of the cache in bytes. Set it to
``None`` for unlimited cache size (use with caution).
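
As a rough aid for choosing ``max_size``, the arithmetic below estimates how many chunks of the example array fit in the cache. It assumes the cache accounts for the size of the stored bytes and that chunks are stored uncompressed, so the figures are upper bounds on chunk size; the variable names are illustrative only.

```python
import math

shape = (100, 100)
chunks = (10, 10)
itemsize = 8                  # bytes per float64 ('f8') element
max_size = 1024 * 1024 * 256  # 256 MiB, as in the example above

# Size of one uncompressed chunk and number of chunks in the array.
chunk_nbytes = math.prod(chunks) * itemsize
n_chunks = math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))

# Upper bound on the bytes needed to cache the whole array.
array_nbytes = n_chunks * chunk_nbytes

# Number of uncompressed chunks of this size that fit within max_size.
chunks_that_fit = max_size // chunk_nbytes
```

Here each chunk is 800 bytes and the whole array is 80,000 bytes, so the example array fits in the cache many times over.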

Performance Benefits
--------------------

The LRUStoreCache can reduce access time for repeated reads, because chunks that are
already held in memory are not fetched from the underlying store again:

>>> import time
>>>
>>> # Benchmark reading with cache
>>> start = time.perf_counter()
>>> for _ in range(100):
...     _ = zarr_array[:]
>>> elapsed_cache = time.perf_counter() - start
>>>
>>> # Compare with direct store access (without cache)
>>> zarr_array_nocache = zarr.open('test.zarr', mode='r')
>>> start = time.perf_counter()
>>> for _ in range(100):
...     _ = zarr_array_nocache[:]
>>> elapsed_nocache = time.perf_counter() - start
>>>
>>> speedup = elapsed_nocache / elapsed_cache

The benefit grows with the number of times the same chunks are read and with the latency of the underlying store.

Remote Store Caching
--------------------

The LRUStoreCache is most beneficial when used with remote stores where network latency
is a significant factor. Here's a conceptual example::

    # Example with a remote store (requires gcsfs)
    import gcsfs

    # Create a remote store (Google Cloud Storage example)
    gcs = gcsfs.GCSFileSystem(token='anon')
    remote_store = gcsfs.GCSMap(
        root='your-bucket/data.zarr',
        gcs=gcs,
        check=False
    )

    # Wrap with LRU cache for better performance
    cached_store = zarr.storage.LRUStoreCache(remote_store, max_size=2**28)

    # Open array through cached store
    z = zarr.open(cached_store)

The first access to any chunk is slow because it requires a network retrieval, but subsequent
accesses to the same chunk are served from the in-memory cache and avoid the network round trip.
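
The slow-first, fast-afterwards behaviour can be reproduced without network access. The sketch below uses a stand-in backend that sleeps on every fetch. ``SlowBackend`` and ``ReadThroughCache`` are hypothetical names, not part of zarr, and the cache is unbounded for brevity.

```python
import time


class SlowBackend:
    """Stand-in for a remote store: every fetch pays a fixed latency."""

    def __init__(self, data: dict, latency: float = 0.01) -> None:
        self._data = data
        self._latency = latency
        self.fetches = 0

    def get(self, key: str) -> bytes:
        time.sleep(self._latency)  # simulated network round trip
        self.fetches += 1
        return self._data[key]


class ReadThroughCache:
    """Serve repeated reads from memory and fall back to the backend on a miss."""

    def __init__(self, backend: SlowBackend) -> None:
        self._backend = backend
        self._values = {}

    def get(self, key: str) -> bytes:
        if key not in self._values:
            self._values[key] = self._backend.get(key)
        return self._values[key]


backend = SlowBackend({"c/0/0": b"chunk-bytes"})
cached = ReadThroughCache(backend)

start = time.perf_counter()
first = cached.get("c/0/0")   # miss: pays the simulated latency
first_elapsed = time.perf_counter() - start

start = time.perf_counter()
second = cached.get("c/0/0")  # hit: served from the in-memory dict
second_elapsed = time.perf_counter() - start
```

Only one fetch reaches the backend, however many times the same key is read afterwards.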

Cache Configuration
-------------------

The LRUStoreCache can be configured with several parameters:

**max_size**: Controls the maximum memory usage of the cache in bytes

>>> # Create a base store for demonstration
>>> store = zarr.storage.LocalStore('config_example.zarr')
>>>
>>> # 256 MiB cache
>>> sized_cache = zarr.storage.LRUStoreCache(store, max_size=2**28)
>>>
>>> # Unlimited cache size (use with caution)
>>> unbounded_cache = zarr.storage.LRUStoreCache(store, max_size=None)

**read_only**: Create a read-only cache

>>> read_only_cache = zarr.storage.LRUStoreCache(store, max_size=2**28, read_only=True)

Cache Statistics
----------------

The LRUStoreCache provides statistics to monitor cache performance:

>>> # Read the same region twice to generate cache activity; the second
>>> # read should be served from the cache
>>> data = zarr_array[0:50, 0:50]
>>> data = zarr_array[0:50, 0:50]
>>>
>>> cache_hits = cache.hits
>>> cache_misses = cache.misses
>>> total_requests = cache_hits + cache_misses
>>> cache_hit_ratio = cache_hits / total_requests if total_requests > 0 else 0
>>> # The hit ratio rises as the same chunks are read repeatedly

Cache Management
----------------

The cache provides methods for manual cache management:

>>> # Clear all cached values but keep keys cache
>>> cache.invalidate_values()
>>>
>>> # Clear keys cache
>>> cache.invalidate_keys()
>>>
>>> # Clear entire cache
>>> cache.invalidate()
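
One reason to invalidate manually is that the underlying store may be changed by something other than the cache. The toy model below illustrates the distinction between a key cache and a value cache. ``TwoLevelCache`` is a hypothetical class written for this illustration and makes no claim about how ``LRUStoreCache`` is structured internally.

```python
class TwoLevelCache:
    """Toy cache with a separate key listing and value cache (illustrative only)."""

    def __init__(self, backend: dict) -> None:
        self._backend = backend
        self._keys = None   # cached key listing, or None if not loaded yet
        self._values = {}   # cached values by key

    def keys(self) -> list:
        if self._keys is None:
            self._keys = sorted(self._backend)
        return self._keys

    def get(self, key: str) -> bytes:
        if key not in self._values:
            self._values[key] = self._backend[key]
        return self._values[key]

    def invalidate_values(self) -> None:
        self._values.clear()

    def invalidate_keys(self) -> None:
        self._keys = None

    def invalidate(self) -> None:
        self.invalidate_keys()
        self.invalidate_values()


backing = {"a": b"1"}
toy = TwoLevelCache(backing)
toy.keys()
toy.get("a")

# Change the backing data without going through the cache.
backing["a"] = b"9"
backing["b"] = b"2"

stale_keys = toy.keys()    # still the old listing
stale_value = toy.get("a")  # still the old value

toy.invalidate()
fresh_keys = toy.keys()
fresh_value = toy.get("a")
```

Until ``invalidate()`` is called, the toy cache keeps returning the listing and value it saw first; afterwards both are reloaded from the backing data.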

Best Practices
--------------

1. **Size the cache appropriately**: Set ``max_size`` based on available memory and expected data access patterns
2. **Use with remote stores**: The cache provides the most benefit when wrapping slow remote stores
3. **Monitor cache statistics**: Use hit/miss ratios to tune cache size and access patterns
4. **Consider data locality**: Access data in chunks sequentially rather than jumping around randomly to maximize cache reuse
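
The effect of access order on cache reuse (points 1 and 4) can be seen by replaying two access patterns against a small LRU model. This is a sketch with made-up chunk indices and a cache that holds a fixed number of keys rather than bytes; ``count_hits`` is a hypothetical helper.

```python
from collections import OrderedDict


def count_hits(accesses, capacity):
    """Replay a sequence of keys against an LRU cache holding `capacity` keys."""
    cache = OrderedDict()
    hits = 0
    for key in accesses:
        if key in cache:
            hits += 1
            cache.move_to_end(key)  # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used key
    return hits


capacity = 4
# Finish with one group of four chunks before moving to the next: 40 reads.
blocked = [k for base in (0, 4) for _ in range(5) for k in range(base, base + 4)]
# Sweep over all eight chunks on every pass: the same 40 reads in another order.
cyclic = [k for _ in range(5) for k in range(8)]

blocked_hits = count_hits(blocked, capacity)
cyclic_hits = count_hits(cyclic, capacity)
```

Both patterns read each of the eight chunks five times, but the blocked order gets 32 hits while the cyclic order gets none, because every chunk is evicted before it is read again. Either a larger cache or a more local access order restores reuse.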

Examples from Real Usage
------------------------

Here's a complete example demonstrating cache effectiveness:

>>> import zarr
>>> import zarr.storage
>>> import time
>>> import numpy as np
>>>
>>> # Create test data
>>> local_store = zarr.storage.LocalStore('benchmark.zarr')
>>> cache = zarr.storage.LRUStoreCache(local_store, max_size=2**28)
>>> zarr_array = zarr.zeros((100, 100), chunks=(10, 10), dtype='f8', store=cache, mode='w')
>>> zarr_array[:] = np.random.random((100, 100))
>>>
>>> # Demonstrate cache effectiveness with repeated access
>>> # First access (cache miss):
>>> start = time.perf_counter()
>>> data = zarr_array[20:30, 20:30]
>>> first_access = time.perf_counter() - start
>>>
>>> # Second access (cache hit):
>>> start = time.perf_counter()
>>> data = zarr_array[20:30, 20:30]  # Same data should be cached
>>> second_access = time.perf_counter() - start
>>>
>>> # Calculate cache performance metrics
>>> cache_speedup = first_access / second_access

This example shows how the LRUStoreCache can significantly reduce access times for repeated
data reads, particularly important when working with remote data sources.

2 changes: 2 additions & 0 deletions src/zarr/storage/__init__.py
@@ -4,6 +4,7 @@
from typing import Any

from zarr.errors import ZarrDeprecationWarning
from zarr.storage._cache import LRUStoreCache
from zarr.storage._common import StoreLike, StorePath
from zarr.storage._fsspec import FsspecStore
from zarr.storage._local import LocalStore
@@ -16,6 +17,7 @@
__all__ = [
"FsspecStore",
"GpuMemoryStore",
"LRUStoreCache",
"LocalStore",
"LoggingStore",
"MemoryStore",