forked from zarr-developers/zarr-python
LRUStoreCache #1
Open
ruaridhg wants to merge 40 commits into main from rmg/LRUStoreCache
Changes from 32 commits
abb764e Add _cache.py first attempt (ruaridhg)
d72078f test.py ran without error, creating test.zarr/ (ruaridhg)
e1266b4 Added testing for cache.py LRUStoreCache for v3 (ruaridhg)
40e6f46 Fix ruff errors (ruaridhg)
eadc7bb Add working example comparing LocalStore to LRUStoreCache (ruaridhg)
5f90a71 Delete test.py to clean-up (ruaridhg)
ae51d23 Added lrustorecache to changes and user-guide docs (ruaridhg)
e58329a Fix linting issues (ruaridhg)
995ad1b Merge branch 'zarr-developers:main' into rmg/LRUStoreCache (ruaridhg)
e84ebbe Fix doctest errors (ruaridhg)
8b22c6b Update docs/user-guide/lrustorecache.rst (ruaridhg)
715296e Update LRUStoreCache docstring and modify max_size to remove None as … (ruaridhg)
34328f4 Expand changes description (ruaridhg)
6033416 Improve wording in lrustorecache.rst (ruaridhg)
b41d9e4 Fix pre-commit errors and failing tests (ruaridhg)
54322d2 Remove asyncio marker from pyproject.toml (ruaridhg)
ae65b38 Apply suggestions from code review (ruaridhg)
94634b3 Fixed failing tests with some PR review comments addressed (ruaridhg)
b31fd7c Modify **_item before potential deletion (ruaridhg)
f211f9a Remove **_item methods (ruaridhg)
5431e41 Add warning for data exceeding cache and test (ruaridhg)
fcab264 Remove unused functions (ruaridhg)
b27014d Fix linting (ruaridhg)
aa9f12e Add tests to increase code coverage (ruaridhg)
b7f4458 Add methods for consistency with other stores (ruaridhg)
fde1ff7 Add in test for else statement in listdir (ruaridhg)
2a2692f Modify listdir method for LRUStoreCache (ruaridhg)
7e5b83d Apply suggestions from code review (ruaridhg)
5915d84 Matching underline lengths for titles (ruaridhg)
7c1ff74 Address latest PR comments removing redundant functions and updating … (ruaridhg)
8aaef7e Remove hasattr and dict-like object references (ruaridhg)
80fa2b2 Fix remaining mypy issues (ruaridhg)
4f4be57 Updated _cache.py to remove redundant functions (ruaridhg)
b4c2aca Add tests for new getsize implementation (ruaridhg)
7761c5c Modify getsize (ruaridhg)
95353d9 Delete test files (ruaridhg)
5ade440 Delete local tests (ruaridhg)
8b46576 Remove dict-like references in LRUStoreCache and tests (ruaridhg)
115390f Remove dimension separator test function (ruaridhg)
51aeab7 Remove unused files (ruaridhg)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
Add LRUStoreCache for improved performance with remote stores | ||
|
||
The new ``LRUStoreCache`` provides a least-recently-used (LRU) caching layer that can be wrapped around any zarr store to significantly improve performance, especially for remote stores where network latency is a bottleneck. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,186 @@ | ||
.. only:: doctest | ||
|
||
   >>> import shutil | ||
   >>> shutil.rmtree('test.zarr', ignore_errors=True) | ||
|
||
.. _user-guide-lrustorecache: | ||
|
||
LRUStoreCache guide | ||
=================== | ||
|
||
The :class:`zarr.storage.LRUStoreCache` provides a least-recently-used (LRU) cache layer | ||
that can be wrapped around any Zarr store to improve performance for repeated data access. | ||
This is particularly useful when working with remote stores (e.g., S3, HTTP) where network | ||
latency can significantly impact data access speed. | ||
|
||
The LRUStoreCache keeps recently accessed data chunks in memory and | ||
automatically evicts the least recently used items when the cache reaches its maximum size. | ||
|
||
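The eviction behaviour described above can be sketched in a few lines of plain Python. This is a toy model for illustration only, not the code in this PR: the class name ``ToyLRUCache``, its ``get`` method, and the ``load`` callback are hypothetical, and values that are larger than the whole cache are simply not cached.

```python
from collections import OrderedDict


class ToyLRUCache:
    """Toy byte-bounded LRU cache (hypothetical; not zarr's implementation)."""

    def __init__(self, max_size):
        self.max_size = max_size      # capacity in bytes
        self.current_size = 0         # bytes currently held
        self.hits = 0
        self.misses = 0
        self._values = OrderedDict()  # oldest entry first, newest last

    def get(self, key, load):
        """Return the cached value, calling ``load(key)`` on a miss."""
        if key in self._values:
            self.hits += 1
            self._values.move_to_end(key)  # mark as most recently used
            return self._values[key]
        self.misses += 1
        value = load(key)
        self._put(key, value)
        return value

    def _put(self, key, value):
        if len(value) > self.max_size:
            return  # larger than the whole cache: do not cache it
        # Evict least recently used entries until the new value fits.
        while self.current_size + len(value) > self.max_size:
            _, evicted = self._values.popitem(last=False)
            self.current_size -= len(evicted)
        self._values[key] = value
        self.current_size += len(value)


# Three 40-byte "chunks" behind a cache with room for only two of them.
backing = {"c/0": b"a" * 40, "c/1": b"b" * 40, "c/2": b"c" * 40}
toy = ToyLRUCache(max_size=100)
for key in ["c/0", "c/1", "c/0", "c/2", "c/1"]:
    toy.get(key, backing.__getitem__)
print(toy.hits, toy.misses)  # prints "1 4"
```

Reading ``c/2`` evicts ``c/1`` rather than ``c/0``, because ``c/0`` was touched more recently; the final read of ``c/1`` is therefore a miss.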
.. note:: | ||
   The LRUStoreCache is a wrapper store that maintains compatibility with the full | ||
   :class:`zarr.abc.store.Store` API while adding transparent caching functionality. | ||
|
||
Basic Usage | ||
----------- | ||
|
||
To create an LRUStoreCache, wrap any existing store with the cache: | ||
|
||
>>> import zarr | ||
>>> import zarr.storage | ||
>>> import numpy as np | ||
>>> | ||
>>> # Create a local store and wrap it with LRU cache | ||
>>> local_store = zarr.storage.LocalStore('test.zarr') | ||
>>> cache = zarr.storage.LRUStoreCache(local_store, max_size=1024 * 1024 * 256) # 256MB cache | ||
>>> | ||
>>> # Create an array using the cached store | ||
>>> zarr_array = zarr.zeros((100, 100), chunks=(10, 10), dtype='f8', store=cache, mode='w') | ||
>>> | ||
>>> # Write some data to force chunk creation | ||
>>> zarr_array[:] = np.random.random((100, 100)) | ||
|
||
The ``max_size`` parameter controls the maximum memory usage of the cache in bytes. Set it to | ||
``None`` for unlimited cache size (use with caution). | ||
|
||
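When choosing ``max_size``, a rough upper bound on the memory a dataset needs is easy to work out by hand. The sketch below does the arithmetic for the array used above and assumes uncompressed chunks; stored chunks are usually compressed and therefore smaller, so treat the result as a ceiling rather than a measurement.

```python
import math

# Array from the example above: 100 x 100 float64 values in 10 x 10 chunks.
shape, chunks, itemsize = (100, 100), (10, 10), 8

chunk_bytes = math.prod(chunks) * itemsize  # bytes per uncompressed chunk
n_chunks = math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))
array_bytes = chunk_bytes * n_chunks        # bytes for the whole array

max_size = 1024 * 1024 * 256                # 256 MiB, as in the example
chunks_that_fit = max_size // chunk_bytes

print(chunk_bytes, n_chunks, array_bytes)   # prints "800 100 80000"
print(array_bytes <= max_size)              # prints "True"
```

Here the whole array is far smaller than the cache, so no eviction takes place once every chunk has been read.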
Performance Benefits | ||
-------------------- | ||
|
||
The LRUStoreCache can speed up repeated data access, because chunks that are already cached are served from memory instead of the underlying store: | ||
|
||
>>> import time | ||
>>> | ||
>>> # Benchmark repeated reads through the cache | ||
>>> start = time.perf_counter() | ||
>>> for _ in range(100): | ||
...     _ = zarr_array[:] | ||
>>> elapsed_cache = time.perf_counter() - start | ||
>>> | ||
>>> # Compare with direct store access (without cache) | ||
>>> zarr_array_nocache = zarr.open('test.zarr', mode='r') | ||
>>> start = time.perf_counter() | ||
>>> for _ in range(100): | ||
...     _ = zarr_array_nocache[:] | ||
>>> elapsed_nocache = time.perf_counter() - start | ||
>>> | ||
>>> speedup = elapsed_nocache / elapsed_cache | ||
|
||
The cache pays off most when the same chunks are read repeatedly; chunks that are read only once gain nothing from it. | ||
|
||
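Single wall-clock measurements like the ones above are noisy. A slightly more robust pattern is to repeat the measurement and take the median. The helper below is a sketch using only the standard library; ``time_reads`` is a hypothetical name, and the ``read`` callable stands in for something like ``lambda: zarr_array[:]``.

```python
import time
from statistics import median


def time_reads(read, repeats=5, loops=20):
    """Return the median time per call of ``read`` in seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        for _ in range(loops):
            read()
        samples.append(time.perf_counter() - start)
    return median(samples) / loops


# Stand-in workload; replace with a read through the cached or uncached array.
data = list(range(1000))
per_call = time_reads(lambda: sum(data))
print(f"{per_call * 1e6:.1f} microseconds per call")
```

Calling the helper once for the cached array and once for the uncached array gives two comparable per-read timings.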
Remote Store Caching | ||
-------------------- | ||
|
||
The LRUStoreCache is most beneficial when used with remote stores where network latency | ||
is a significant factor. Here's a conceptual example:: | ||
|
||
    # Example with a remote store (requires gcsfs to be installed) | ||
|
||
    # Create a remote store (Google Cloud Storage example) using | ||
    # zarr's fsspec-backed store with anonymous access | ||
    remote_store = zarr.storage.FsspecStore.from_url( | ||
        'gs://your-bucket/data.zarr', | ||
        storage_options={'token': 'anon'}, | ||
    ) | ||
|
||
    # Wrap with LRU cache for better performance | ||
    cached_store = zarr.storage.LRUStoreCache(remote_store, max_size=2**28) | ||
|
||
    # Open array through cached store | ||
    z = zarr.open(cached_store) | ||
|
||
The first access to any chunk is slow because it requires a network round trip; subsequent accesses | ||
to the same chunk are served from the in-memory cache for as long as the chunk has not been evicted. | ||
|
||
Cache Configuration | ||
------------------- | ||
|
||
The LRUStoreCache accepts the following parameters: | ||
|
||
**max_size**: Controls the maximum memory usage of the cache in bytes | ||
|
||
>>> # Create a base store for demonstration | ||
>>> store = zarr.storage.LocalStore('config_example.zarr') | ||
>>> | ||
>>> # 256MB cache (a separate name keeps ``cache`` bound to the store behind ``zarr_array``) | ||
>>> sized_cache = zarr.storage.LRUStoreCache(store, max_size=2**28) | ||
>>> | ||
>>> # Unlimited cache size (use with caution) | ||
>>> unlimited_cache = zarr.storage.LRUStoreCache(store, max_size=None) | ||
|
||
**read_only**: Create a read-only cache | ||
|
||
>>> read_only_cache = zarr.storage.LRUStoreCache(store, max_size=2**28, read_only=True) | ||
|
||
Cache Statistics | ||
---------------- | ||
|
||
The LRUStoreCache provides statistics to monitor cache performance: | ||
|
||
>>> # Access some data to generate cache activity | ||
>>> data = zarr_array[0:50, 0:50]  # Chunks that are not cached yet count as misses | ||
>>> data = zarr_array[0:50, 0:50]  # Repeated read of cached chunks counts as hits | ||
>>> | ||
>>> cache_hits = cache.hits | ||
>>> cache_misses = cache.misses | ||
>>> total_requests = cache.hits + cache.misses | ||
>>> cache_hit_ratio = cache.hits / total_requests if total_requests > 0 else 0.0 | ||
>>> # The hit ratio rises the more often the same chunks are read again | ||
|
||
Cache Management | ||
---------------- | ||
|
||
The cache provides methods for manual cache management: | ||
|
||
>>> # Clear all cached values but keep keys cache | ||
>>> cache.invalidate_values() | ||
>>> | ||
>>> # Clear keys cache | ||
>>> cache.invalidate_keys() | ||
>>> | ||
>>> # Clear entire cache | ||
>>> cache.invalidate() | ||
|
||
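The split between a keys cache and a values cache can be pictured with a toy model. The sketch below is an assumption based only on the method names and comments above, not on the implementation in this PR; ``ToyTwoLevelCache`` and its internals are hypothetical. It shows why invalidation matters when the underlying store is changed by something other than the cache.

```python
class ToyTwoLevelCache:
    """Toy cache with separate key-listing and value caches (hypothetical)."""

    def __init__(self, backing):
        self._backing = backing
        self._keys_cache = None   # cached key listing, or None if not loaded
        self._values_cache = {}   # cached values by key

    def keys(self):
        if self._keys_cache is None:
            self._keys_cache = sorted(self._backing)
        return self._keys_cache

    def get(self, key):
        if key not in self._values_cache:
            self._values_cache[key] = self._backing[key]
        return self._values_cache[key]

    def invalidate_values(self):
        self._values_cache.clear()

    def invalidate_keys(self):
        self._keys_cache = None

    def invalidate(self):
        self.invalidate_keys()
        self.invalidate_values()


backing = {"a": b"1"}
toy = ToyTwoLevelCache(backing)
toy.keys(), toy.get("a")          # populate both caches

# The backing store changes behind the cache's back.
backing["a"] = b"9"
backing["b"] = b"2"
stale_value, stale_keys = toy.get("a"), toy.keys()

toy.invalidate_values()           # values are re-read, key listing is kept
fresh_value, kept_keys = toy.get("a"), toy.keys()

toy.invalidate_keys()             # key listing is rebuilt on next use
fresh_keys = toy.keys()
print(stale_value, fresh_value, kept_keys, fresh_keys)
```

Until the matching invalidation call is made, the toy cache keeps returning the old value and the old key listing.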
Best Practices | ||
-------------- | ||
|
||
1. **Size the cache appropriately**: Set ``max_size`` based on available memory and expected data access patterns | ||
2. **Use with remote stores**: The cache provides the most benefit when wrapping slow remote stores | ||
3. **Monitor cache statistics**: Use hit/miss ratios to tune cache size and access patterns | ||
4. **Consider data locality**: Group reads that touch the same chunks so that cached chunks are reused before they are evicted | ||
|
||
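The effect of cache size and access pattern on the hit ratio can be illustrated with the standard library's ``functools.lru_cache``. Note the assumption: ``lru_cache`` bounds the number of entries rather than the number of bytes, so this only mirrors the LRU policy, not the LRUStoreCache itself. ``hit_ratio`` is a hypothetical helper.

```python
from functools import lru_cache


def hit_ratio(keys, maxsize):
    """Replay ``keys`` against an LRU cache of ``maxsize`` entries."""

    @lru_cache(maxsize=maxsize)
    def load(key):
        return key  # stands in for fetching a chunk

    for key in keys:
        load(key)
    info = load.cache_info()
    total = info.hits + info.misses
    return info.hits / total if total else 0.0


# Working set of 4 chunks fits in the cache: only the first pass misses.
fits = hit_ratio([0, 1, 2, 3] * 3, maxsize=4)

# Working set of 5 chunks scanned cyclically: each chunk is evicted
# just before it is needed again, so every access misses.
thrash = hit_ratio([0, 1, 2, 3, 4] * 3, maxsize=4)

print(round(fits, 2), thrash)  # prints "0.67 0.0"
```

A cache that is only slightly too small for a cyclic scan can therefore perform as badly as no cache at all, which is one reason to watch the hit and miss counters when tuning ``max_size``.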
Examples from Real Usage | ||
------------------------ | ||
|
||
Here's a complete example demonstrating cache effectiveness: | ||
|
||
>>> import zarr | ||
>>> import zarr.storage | ||
>>> import time | ||
>>> import numpy as np | ||
>>> | ||
>>> # Create test data | ||
>>> local_store = zarr.storage.LocalStore('benchmark.zarr') | ||
>>> cache = zarr.storage.LRUStoreCache(local_store, max_size=2**28) | ||
>>> zarr_array = zarr.zeros((100, 100), chunks=(10, 10), dtype='f8', store=cache, mode='w') | ||
>>> zarr_array[:] = np.random.random((100, 100)) | ||
>>> | ||
>>> # Demonstrate cache effectiveness with repeated access | ||
>>> # First access (cache miss): | ||
>>> start = time.perf_counter() | ||
>>> data = zarr_array[20:30, 20:30] | ||
>>> first_access = time.perf_counter() - start | ||
>>> | ||
>>> # Second access (cache hit): | ||
>>> start = time.perf_counter() | ||
>>> data = zarr_array[20:30, 20:30]  # Same data should be cached | ||
>>> second_access = time.perf_counter() - start | ||
>>> | ||
>>> # Calculate cache performance metrics | ||
>>> cache_speedup = first_access / second_access | ||
|
||
This example shows how the LRUStoreCache can reduce access times for repeated | ||
data reads, which matters most when working with remote data sources. | ||
|
||