LRUStoreCache #1
@@ -0,0 +1,3 @@
Add LRUStoreCache for improved performance with remote stores

The new ``LRUStoreCache`` provides a least-recently-used (LRU) caching layer that can be wrapped around any zarr store to significantly improve performance, especially for remote stores where network latency is a bottleneck.
@@ -0,0 +1,186 @@
.. only:: doctest

   >>> import shutil
   >>> shutil.rmtree('test.zarr', ignore_errors=True)

.. _user-guide-lrustorecache:

LRUStoreCache guide
===================

The :class:`zarr.storage.LRUStoreCache` provides a least-recently-used (LRU) cache layer
that can be wrapped around any Zarr store to improve performance for repeated data access.
This is particularly useful when working with remote stores (e.g., S3, HTTP) where network
latency can significantly impact data access speed.

The LRUStoreCache implements a cache that stores frequently accessed data chunks in memory,
automatically evicting the least recently used items when the cache reaches its maximum size.
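
The eviction policy itself is easy to picture. The following is a minimal, self-contained
sketch of a byte-budgeted LRU mapping built on ``collections.OrderedDict``; it is purely
illustrative, and the ``TinyLRU`` name and layout are invented for this example rather than
taken from zarr's internal implementation::

    from collections import OrderedDict

    class TinyLRU:
        """Toy byte-budgeted LRU mapping, for illustration only."""

        def __init__(self, max_size):
            self.max_size = max_size
            self._data = OrderedDict()
            self._size = 0

        def get(self, key):
            value = self._data.pop(key)  # raises KeyError on a miss
            self._data[key] = value      # re-insert: now most recently used
            return value

        def put(self, key, value):
            if key in self._data:
                self._size -= len(self._data.pop(key))
            self._data[key] = value
            self._size += len(value)
            # Evict least recently used entries until within the byte budget
            while self.max_size is not None and self._size > self.max_size:
                _, evicted = self._data.popitem(last=False)
                self._size -= len(evicted)

    lru = TinyLRU(max_size=8)
    lru.put('a', b'1234')
    lru.put('b', b'5678')
    lru.get('a')            # touch 'a', so 'b' is now least recently used
    lru.put('c', b'9999')   # over budget: 'b' is evicted
    sorted(lru._data)       # ['a', 'c']

Both reads and writes count as "use", so the entry evicted is always the one that has gone
untouched the longest.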

.. note::
   The LRUStoreCache is a wrapper store that maintains compatibility with the full
   :class:`zarr.abc.store.Store` API while adding transparent caching functionality.

Basic Usage
-----------

Creating an LRUStoreCache is straightforward: simply wrap any existing store with the cache:

>>> import zarr
>>> import zarr.storage
>>> import numpy as np
>>>
>>> # Create a local store and wrap it with LRU cache
>>> local_store = zarr.storage.LocalStore('test.zarr')
>>> cache = zarr.storage.LRUStoreCache(local_store, max_size=1024 * 1024 * 256)  # 256MB cache
>>>
>>> # Create an array using the cached store
>>> zarr_array = zarr.zeros((100, 100), chunks=(10, 10), dtype='f8', store=cache, mode='w')
>>>
>>> # Write some data to force chunk creation
>>> zarr_array[:] = np.random.random((100, 100))

The ``max_size`` parameter controls the maximum memory usage of the cache in bytes. Set it to
``None`` for an unlimited cache size (use with caution).

Performance Benefits
--------------------

The LRUStoreCache provides significant performance improvements for repeated data access:

>>> import time
>>>
>>> # Benchmark reading with cache
>>> start = time.time()
>>> for _ in range(100):
...     _ = zarr_array[:]
>>> elapsed_cache = time.time() - start
>>>
>>> # Compare with direct store access (without cache)
>>> zarr_array_nocache = zarr.open('test.zarr', mode='r')
>>> start = time.time()
>>> for _ in range(100):
...     _ = zarr_array_nocache[:]
>>> elapsed_nocache = time.time() - start
>>>
>>> speedup = elapsed_nocache / elapsed_cache

Cache effectiveness is particularly pronounced with repeated access to the same data chunks.

Remote Store Caching
--------------------

The LRUStoreCache is most beneficial when used with remote stores where network latency
is a significant factor. Here's a conceptual example::

    # Example with a remote store (requires gcsfs)
    import gcsfs

    # Create a remote store (Google Cloud Storage example)
    gcs = gcsfs.GCSFileSystem(token='anon')
    remote_store = gcsfs.GCSMap(
        root='your-bucket/data.zarr',
        gcs=gcs,
        check=False
    )

    # Wrap with LRU cache for better performance
    cached_store = zarr.storage.LRUStoreCache(remote_store, max_size=2**28)

    # Open array through cached store
    z = zarr.open(cached_store)

The first access to any chunk will be slow (network retrieval), but subsequent accesses
to the same chunk will be served from the local cache, providing a dramatic speedup.
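
The read-through behaviour behind that speedup can be demonstrated without any network at
all. In this self-contained simulation, ``CountingStore`` and ``CachedStore`` are
hypothetical stand-ins invented for illustration (they are not zarr classes); the counter
plays the role of network round trips::

    class CountingStore(dict):
        """Stand-in for a remote store: counts how often the backend is hit."""

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.fetches = 0

        def __getitem__(self, key):
            self.fetches += 1  # each call simulates one network round trip
            return super().__getitem__(key)

    class CachedStore:
        """Read-through cache in front of a slow store (unbounded, for brevity)."""

        def __init__(self, store):
            self.store = store
            self._cache = {}

        def __getitem__(self, key):
            if key not in self._cache:
                self._cache[key] = self.store[key]  # miss: fetch from the backend
            return self._cache[key]                 # hit: served from memory

    backend = CountingStore({'c/0/0': b'chunk-data'})
    cached = CachedStore(backend)
    for _ in range(10):
        _ = cached['c/0/0']
    backend.fetches  # 1: only the first read reached the backend

Ten reads of the same chunk cost one backend fetch; the other nine are served from memory,
which is exactly the effect the LRUStoreCache provides for repeated chunk access.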

Cache Configuration
-------------------

The LRUStoreCache can be configured with several parameters:

**max_size**: Controls the maximum memory usage of the cache in bytes

>>> # Create a base store for demonstration
>>> store = zarr.storage.LocalStore('config_example.zarr')
>>>
>>> # 256MB cache
>>> config_cache = zarr.storage.LRUStoreCache(store, max_size=2**28)
>>>
>>> # Unlimited cache size (use with caution)
>>> config_cache = zarr.storage.LRUStoreCache(store, max_size=None)

**read_only**: Create a read-only cache

>>> readonly_cache = zarr.storage.LRUStoreCache(store, max_size=2**28, read_only=True)

Cache Statistics
----------------

The LRUStoreCache provides statistics to monitor cache performance:

>>> # Access some data to generate cache activity
>>> data = zarr_array[0:50, 0:50]  # First access: cache miss
>>> data = zarr_array[0:50, 0:50]  # Second access: cache hit
>>>
>>> cache_hits = cache.hits
>>> cache_misses = cache.misses
>>> total_requests = cache.hits + cache.misses
>>> cache_hit_ratio = cache.hits / total_requests if total_requests > 0 else 0
>>> # A hit ratio above 50% is typical for repeated access patterns

Cache Management
----------------

The cache provides methods for manual cache management:

>>> # Clear all cached values but keep the keys cache
>>> cache.invalidate_values()
>>>
>>> # Clear the keys cache
>>> cache.invalidate_keys()
>>>
>>> # Clear the entire cache
>>> cache.invalidate()

Best Practices
--------------

1. **Size the cache appropriately**: Set ``max_size`` based on available memory and expected data access patterns.
2. **Use with remote stores**: The cache provides the most benefit when wrapping slow remote stores.
3. **Monitor cache statistics**: Use hit/miss ratios to tune cache size and access patterns.
4. **Consider data locality**: Access chunks sequentially rather than jumping around randomly, to maximize cache reuse.
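
The fourth point can be made concrete with a small, deterministic simulation that is
independent of zarr (the ``lru_hits`` helper is invented for illustration)::

    from collections import OrderedDict

    def lru_hits(accesses, capacity):
        """Count cache hits for a sequence of chunk keys under an LRU of given capacity."""
        cache, hits = OrderedDict(), 0
        for key in accesses:
            if key in cache:
                hits += 1
                cache.move_to_end(key)          # mark as most recently used
            else:
                cache[key] = True
                if len(cache) > capacity:
                    cache.popitem(last=False)   # evict least recently used
        return hits

    chunks = list(range(8))
    # Working set fits in the cache: every pass after the first is all hits.
    lru_hits(chunks * 3, capacity=8)   # 16 hits out of 24 accesses
    # Working set larger than the cache, scanned cyclically: LRU thrashes.
    lru_hits(chunks * 3, capacity=4)   # 0 hits out of 24 accesses

When the working set exceeds the cache and is scanned cyclically, each chunk is evicted
just before it is needed again, so keeping related accesses close together in time (or
sizing ``max_size`` to cover the working set) matters as much as having a cache at all.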

Examples from Real Usage
------------------------

Here's a complete example demonstrating cache effectiveness:

>>> import zarr
>>> import zarr.storage
>>> import time
>>> import numpy as np
>>>
>>> # Create test data
>>> local_store = zarr.storage.LocalStore('benchmark.zarr')
>>> cache = zarr.storage.LRUStoreCache(local_store, max_size=2**28)
>>> zarr_array = zarr.zeros((100, 100), chunks=(10, 10), dtype='f8', store=cache, mode='w')
>>> zarr_array[:] = np.random.random((100, 100))
>>>
>>> # First access (cache miss)
>>> start = time.time()
>>> data = zarr_array[20:30, 20:30]
>>> first_access = time.time() - start
>>>
>>> # Second access (cache hit): the same chunk is served from memory
>>> start = time.time()
>>> data = zarr_array[20:30, 20:30]
>>> second_access = time.time() - start
>>>
>>> # Calculate the cache speedup
>>> cache_speedup = first_access / second_access

This example shows how the LRUStoreCache can significantly reduce access times for repeated
data reads, which is particularly important when working with remote data sources.
@@ -392,6 +392,7 @@ filterwarnings = [
    "ignore:Unclosed client session <aiohttp.client.ClientSession.*:ResourceWarning"
]
markers = [
    "asyncio: mark test as asyncio test",
    "gpu: mark a test as requiring CuPy and GPU",
    "slow_hypothesis: slow hypothesis tests",
]