
Commit 2eb89e1

ruaridhg and d-v-b authored
CacheStore containing source store and cache store (#3366)
* Add _cache.py first attempt
* test.py ran without error, creating test.zarr/
* Added testing for cache.py LRUStoreCache for v3
* Fix ruff errors
* Add working example comparing LocalStore to LRUStoreCache
* Delete test.py to clean-up
* Added lrustorecache to changes and user-guide docs
* Fix linting issues
* Implement dual store cache
* Fixed failing tests
* Fix linting errors
* Add logger info
* Delete unnecessary extra functionality
* Rename to caching_store
* Add test_storage.py
* Fix logic in _caching_store.py
* Update tests to match caching_store implemtation
* Delete LRUStoreCache files
* Update __init__
* Add functionality for max_size
* Add tests for cache_info and clear_cache
* Delete test.py
* Fix linting errors
* Update feature description
* Fix errors
* Fix cachingstore.rst errors
* Fix cachingstore.rst errors
* Fixed eviction key logic with proper size tracking
* Increase code coverage to 98%
* Fix linting errors
* move cache store to experimental, fix bugs
* update changelog
* remove logging config override, remove dead code, adjust evict_key logic, and avoid calling exists unnecessarily
* add docs
* add tests for relaxed cache coherency
* adjust code examples (but we don't know if they work, because we don't have doctests working)
* apply changes based on AI code review, and move tests into tests/test_experimental
* update changelog
* fix exception log

---------

Co-authored-by: ruaridhg <[email protected]>
Co-authored-by: Davis Bennett <[email protected]>
1 parent 1cd4f7e commit 2eb89e1

File tree

6 files changed: +1522, -0 lines changed

changes/3366.feature.md

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
Adds `zarr.experimental.cache_store.CacheStore`, a `Store` that implements caching by combining two other `Store` instances. See the [docs page](https://zarr.readthedocs.io/en/latest/user-guide/experimental#cachestore) for more information about this feature.

docs/user-guide/experimental.md

Lines changed: 272 additions & 0 deletions
@@ -0,0 +1,272 @@
# Experimental features

This section contains documentation for experimental Zarr Python features. The features described here are exciting and potentially useful, but also volatile -- we might change them at any time. Take this into account if you consider depending on these features.

## `CacheStore`

Zarr Python 3.1.4 adds `zarr.experimental.cache_store.CacheStore`, a dual-store caching implementation
that can be wrapped around any Zarr store to improve performance for repeated data access.
This is particularly useful when working with remote stores (e.g., S3, HTTP) where network
latency can significantly impact data access speed.

The CacheStore implements a cache that uses a separate Store instance as the cache backend,
providing persistent caching capabilities with time-based expiration, size-based eviction,
and flexible cache storage options. It automatically evicts the least recently used items
when the cache reaches its maximum size.

Because the `CacheStore` uses an ordinary Zarr `Store` object as the caching layer, you can reuse the data stored in the cache later.

> **Note:** The CacheStore is a wrapper store that maintains compatibility with the full
> `zarr.abc.store.Store` API while adding transparent caching functionality.

## Basic Usage

Creating a CacheStore requires both a source store and a cache store. The cache store
can be any Store implementation, providing flexibility in cache persistence:

```python exec="true" session="experimental" source="above" result="ansi"
import zarr
from zarr.storage import LocalStore
import numpy as np
from tempfile import mkdtemp
from zarr.experimental.cache_store import CacheStore

# Create a local store and a separate cache store
local_store_path = mkdtemp(suffix='.zarr')
source_store = LocalStore(local_store_path)
cache_store = zarr.storage.MemoryStore()  # In-memory cache
cached_store = CacheStore(
    store=source_store,
    cache_store=cache_store,
    max_size=256*1024*1024  # 256MB cache
)

# Create an array using the cached store
zarr_array = zarr.zeros((100, 100), chunks=(10, 10), dtype='f8', store=cached_store, mode='w')

# Write some data to force chunk creation
zarr_array[:] = np.random.random((100, 100))
```

The dual-store architecture allows you to use different store types for source and cache,
such as a remote store for source data and a local store for persistent caching.

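
For example, the remote-source / local-cache combination that motivates this feature might look like the sketch below. It is illustrative rather than executed as part of these docs: the S3 URL is a placeholder, and `FsspecStore` needs the matching fsspec backend (here, `s3fs`) installed.

```python
# Sketch: cache chunks from a (hypothetical) remote bucket in a local directory.
from tempfile import mkdtemp

from zarr.experimental.cache_store import CacheStore
from zarr.storage import FsspecStore, LocalStore

remote_source = FsspecStore.from_url(
    "s3://example-bucket/data.zarr",      # placeholder URL
    read_only=True,
    storage_options={"anon": True},
)
local_cache = LocalStore(mkdtemp(suffix=".zarr-cache"))

cached_remote = CacheStore(
    store=remote_source,
    cache_store=local_cache,
    max_size=512 * 1024 * 1024,  # keep up to 512MB of remote chunks locally
)
```
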
## Performance Benefits

The CacheStore provides significant performance improvements for repeated data access:

```python exec="true" session="experimental" source="above" result="ansi"
import time

# Benchmark reading with cache
start = time.time()
for _ in range(100):
    _ = zarr_array[:]
elapsed_cache = time.time() - start

# Compare with direct store access (without cache)
zarr_array_nocache = zarr.open(local_store_path, mode='r')
start = time.time()
for _ in range(100):
    _ = zarr_array_nocache[:]
elapsed_nocache = time.time() - start

# Cache provides speedup for repeated access
speedup = elapsed_nocache / elapsed_cache
print(f"Speedup from caching: {speedup:.1f}x")
```

Cache effectiveness is particularly pronounced with repeated access to the same data chunks.

## Cache Configuration

The CacheStore can be configured with several parameters:

**max_size**: Controls the maximum size of cached data in bytes

```python exec="true" session="experimental" source="above" result="ansi"
# 256MB cache with size limit
cache = CacheStore(
    store=source_store,
    cache_store=cache_store,
    max_size=256*1024*1024
)

# Unlimited cache size (use with caution)
cache = CacheStore(
    store=source_store,
    cache_store=cache_store,
    max_size=None
)
```

**max_age_seconds**: Controls time-based cache expiration

```python exec="true" session="experimental" source="above" result="ansi"
# Cache expires after 1 hour
cache = CacheStore(
    store=source_store,
    cache_store=cache_store,
    max_age_seconds=3600
)

# Cache never expires
cache = CacheStore(
    store=source_store,
    cache_store=cache_store,
    max_age_seconds="infinity"
)
```

**cache_set_data**: Controls whether written data is cached

```python exec="true" session="experimental" source="above" result="ansi"
# Cache data when writing (default)
cache = CacheStore(
    store=source_store,
    cache_store=cache_store,
    cache_set_data=True
)

# Don't cache written data (read-only cache)
cache = CacheStore(
    store=source_store,
    cache_store=cache_store,
    cache_set_data=False
)
```

## Cache Statistics

The CacheStore provides statistics to monitor cache performance and state:

```python exec="true" session="experimental" source="above" result="ansi"
# Access some data to generate cache activity
data = zarr_array[0:50, 0:50]  # First access - cache miss
data = zarr_array[0:50, 0:50]  # Second access - cache hit

# Get comprehensive cache information
info = cached_store.cache_info()
print(info['cache_store_type'])  # e.g., 'MemoryStore'
print(info['max_age_seconds'])
print(info['max_size'])
print(info['current_size'])
print(info['tracked_keys'])
print(info['cached_keys'])
print(info['cache_set_data'])
```

The `cache_info()` method returns a dictionary with detailed information about the cache state.

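
Because the return value is a plain dictionary, the whole report can also be handled with ordinary dict operations. The snippet below is a small non-executed sketch that prints every field in one pass and derives a utilization figure when a size limit is set:

```python
# Sketch: dump the full cache_info() report and derive a utilization figure.
info = cached_store.cache_info()
for field, value in info.items():
    print(f"{field}: {value}")

# current_size / max_size only makes sense when a size limit is configured
if info["max_size"] is not None:
    print(f"cache utilization: {info['current_size'] / info['max_size']:.1%}")
```
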
## Cache Management

The CacheStore provides methods for manual cache management:

```python exec="true" session="experimental" source="above" result="ansi"
# Clear all cached data and tracking information
import asyncio
asyncio.run(cached_store.clear_cache())

# Check cache info after clearing
info = cached_store.cache_info()
assert info['tracked_keys'] == 0
assert info['current_size'] == 0
```

The `clear_cache()` method is an async method that clears both the cache store
(if it supports the `clear` method) and all internal tracking data.

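
Since `clear_cache()` is a coroutine, it can also be awaited from your own async code rather than wrapped directly in `asyncio.run()`. A small non-executed sketch:

```python
import asyncio

async def reset_cache(cache):
    # clear_cache() is a coroutine, so it can be awaited from async code
    await cache.clear_cache()
    return cache.cache_info()

info = asyncio.run(reset_cache(cached_store))
```
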
## Best Practices

1. **Choose an appropriate cache store**: Use MemoryStore for fast temporary caching or LocalStore for persistent caching
2. **Size the cache appropriately**: Set `max_size` based on available storage and expected data access patterns
3. **Use with remote stores**: The cache provides the most benefit when wrapping slow remote stores
4. **Monitor cache statistics**: Use `cache_info()` to tune cache size and access patterns (see the sketch after this list)
5. **Consider data locality**: Group related data accesses together to improve cache efficiency
6. **Set appropriate expiration**: Use `max_age_seconds` for time-sensitive data or "infinity" for static data

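
As a sketch of point 4, the fields reported by `cache_info()` can drive a small monitoring helper. The helper and its threshold below are illustrative, not part of the `CacheStore` API:

```python
# Hypothetical helper built on the documented cache_info() fields.
def report_cache_pressure(cache, threshold=0.9):
    """Print cache usage and warn when a size-limited cache is nearly full."""
    info = cache.cache_info()
    size, limit = info["current_size"], info["max_size"]
    print(f"{info['cached_keys']} cached keys, {size} bytes in cache")
    if limit is not None and size >= threshold * limit:
        print(f"cache is over {threshold:.0%} full; evictions will occur soon")

report_cache_pressure(cached_store)
```
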
## Working with Different Store Types

The CacheStore can wrap any store that implements the `zarr.abc.store.Store` interface
and use any store type for the cache backend:

### Local Store with Memory Cache

```python exec="true" session="experimental-memory-cache" source="above" result="ansi"
from zarr.storage import LocalStore, MemoryStore
from zarr.experimental.cache_store import CacheStore
from tempfile import mkdtemp

local_store_path = mkdtemp(suffix='.zarr')
source_store = LocalStore(local_store_path)
cache_store = MemoryStore()
cached_store = CacheStore(
    store=source_store,
    cache_store=cache_store,
    max_size=128*1024*1024
)
```

### Memory Store with Persistent Cache

```python exec="true" session="experimental-local-cache" source="above" result="ansi"
from tempfile import mkdtemp
from zarr.storage import MemoryStore, LocalStore
from zarr.experimental.cache_store import CacheStore

memory_store = MemoryStore()
local_store_path = mkdtemp(suffix='.zarr')
persistent_cache = LocalStore(local_store_path)
cached_store = CacheStore(
    store=memory_store,
    cache_store=persistent_cache,
    max_size=256*1024*1024
)
```
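
Because the cache in this example is a `LocalStore`, the cached data outlives the `CacheStore` instance that wrote it: a new `CacheStore` pointed at the same cache directory can serve previously cached keys without going back to the source store, subject to `max_age_seconds` expiration. A minimal, non-executed sketch:

```python
# Sketch: re-attach to the persistent cache directory created above.
# Previously cached keys can be served from `persistent_cache` instead of the
# source store (subject to max_age_seconds expiration).
reattached = CacheStore(
    store=memory_store,
    cache_store=persistent_cache,
    max_size=256 * 1024 * 1024,
)
print(reattached.cache_info()["cache_store_type"])  # 'LocalStore'
```
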
The dual-store architecture provides flexibility in choosing the best combination
of source and cache stores for your specific use case.

## Examples from Real Usage

Here's a complete example demonstrating cache effectiveness:

```python exec="true" session="experimental-final" source="above" result="ansi"
import numpy as np
import time
from tempfile import mkdtemp
import zarr
import zarr.storage
from zarr.experimental.cache_store import CacheStore

# Create test data with dual-store cache
local_store_path = mkdtemp(suffix='.zarr')
source_store = zarr.storage.LocalStore(local_store_path)
cache_store = zarr.storage.MemoryStore()
cached_store = CacheStore(
    store=source_store,
    cache_store=cache_store,
    max_size=256*1024*1024
)
zarr_array = zarr.zeros((100, 100), chunks=(10, 10), dtype='f8', store=cached_store, mode='w')
zarr_array[:] = np.random.random((100, 100))

# Demonstrate cache effectiveness with repeated access
start = time.time()
data = zarr_array[20:30, 20:30]  # First access (cache miss)
first_access = time.time() - start

start = time.time()
data = zarr_array[20:30, 20:30]  # Second access (cache hit)
second_access = time.time() - start

# Check cache statistics
info = cached_store.cache_info()
assert info['cached_keys'] > 0  # Should have cached keys
assert info['current_size'] > 0  # Should have cached data
print(f"Cache contains {info['cached_keys']} keys with {info['current_size']} bytes")
```

This example shows how the CacheStore can significantly reduce access times for repeated
data reads, which is particularly important when working with remote data sources. The dual-store
architecture allows for flexible cache persistence and management.

mkdocs.yml

Lines changed: 1 addition & 0 deletions
@@ -25,6 +25,7 @@ nav:
     - user-guide/extending.md
     - user-guide/gpu.md
     - user-guide/consolidated_metadata.md
+    - user-guide/experimental.md
   - API Reference:
     - api/index.md
     - api/array.md

pyproject.toml

Lines changed: 1 addition & 0 deletions
@@ -399,6 +399,7 @@ filterwarnings = [
     "ignore:Unclosed client session <aiohttp.client.ClientSession.*:ResourceWarning"
 ]
 markers = [
+    "asyncio: mark test as asyncio test",
     "gpu: mark a test as requiring CuPy and GPU",
     "slow_hypothesis: slow hypothesis tests",
 ]
