Commit 16ae3bd

Update feature description
1 parent 1d9a1f7 commit 16ae3bd

3 files changed: +298 -211 lines changed

changes/3357.feature.rst

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
1-
Add LRUStoreCache to Zarr 3.0
1+
Add CacheStore to Zarr 3.0

docs/user-guide/cachingstore.rst

Lines changed: 297 additions & 0 deletions
@@ -0,0 +1,297 @@
1+
.. only:: doctest
2+
3+
>>> import shutil
4+
>>> shutil.rmtree('test.zarr', ignore_errors=True)
5+
6+
.. _user-guide-cachestore:
7+
8+
CacheStore guide
9+
================
10+
11+
The :class:`zarr.storage.CacheStore` provides a dual-store caching implementation
12+
that can be wrapped around any Zarr store to improve performance for repeated data access.
13+
This is particularly useful when working with remote stores (e.g., S3, HTTP) where network
14+
latency can significantly impact data access speed.
15+
16+
The CacheStore implements a cache that uses a separate Store instance as the cache backend,
17+
providing persistent caching capabilities with time-based expiration, size-based eviction,
18+
and flexible cache storage options. It automatically evicts the least recently used items
19+
when the cache reaches its maximum size.
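
The size-based eviction described above can be pictured with a minimal sketch. The ``LRUByteCache`` class below is hypothetical and is not the CacheStore implementation; it only assumes that the cache tracks a byte budget and discards the least recently used entry first.

```python
from collections import OrderedDict


class LRUByteCache:
    """Hypothetical size-bounded LRU cache (illustration only)."""

    def __init__(self, max_size):
        self.max_size = max_size      # byte budget; None means unlimited
        self.current_size = 0
        self._items = OrderedDict()   # least recently used entry comes first

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def set(self, key, value):
        if key in self._items:
            # Replacing an entry: release the bytes of the old value first.
            self.current_size -= len(self._items.pop(key))
        if self.max_size is not None and len(value) > self.max_size:
            return  # the value can never fit, so it is not cached
        self._items[key] = value
        self.current_size += len(value)
        # Evict from the least recently used end until the budget is met.
        while self.max_size is not None and self.current_size > self.max_size:
            _, evicted = self._items.popitem(last=False)
            self.current_size -= len(evicted)


cache = LRUByteCache(max_size=8)
cache.set("a", b"1234")
cache.set("b", b"5678")
cache.get("a")           # "a" becomes the most recently used entry
cache.set("c", b"abcd")  # budget exceeded, so "b" is evicted
```

Reading ``"a"`` before inserting ``"c"`` is what protects it from eviction; without that read, ``"a"`` would have been the oldest entry.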
20+
21+
.. note::
22+
The CacheStore is a wrapper store that maintains compatibility with the full
23+
:class:`zarr.abc.store.Store` API while adding transparent caching functionality.
24+
25+
Basic Usage
26+
-----------
27+
28+
Creating a CacheStore requires both a source store and a cache store. The cache store
29+
can be any Store implementation, providing flexibility in cache persistence:
30+
31+
>>> import zarr
32+
>>> import zarr.storage
33+
>>> import numpy as np
34+
>>>
35+
>>> # Create a local store and a separate cache store
36+
>>> source_store = zarr.storage.LocalStore('test.zarr')
37+
>>> cache_store = zarr.storage.MemoryStore() # In-memory cache
38+
>>> cached_store = zarr.storage.CacheStore(
39+
... store=source_store,
40+
... cache_store=cache_store,
41+
... max_size=256*1024*1024 # 256MB cache
42+
... )
43+
>>>
44+
>>> # Create an array using the cached store
45+
>>> zarr_array = zarr.zeros((100, 100), chunks=(10, 10), dtype='f8', store=cached_store, mode='w')
46+
>>>
47+
>>> # Write some data to force chunk creation
48+
>>> zarr_array[:] = np.random.random((100, 100))
49+
50+
The dual-store architecture allows you to use different store types for source and cache,
51+
such as a remote store for source data and a local store for persistent caching.
52+
53+
Performance Benefits
54+
--------------------
55+
56+
The CacheStore speeds up repeated reads of the same chunks, because later reads are served from the cache store instead of the source store:
57+
58+
>>> import time
59+
>>>
60+
>>> # Benchmark reading with cache
61+
>>> start = time.time()
62+
>>> for _ in range(100):
63+
... _ = zarr_array[:]
64+
>>> elapsed_cache = time.time() - start
65+
>>>
66+
>>> # Compare with direct store access (without cache)
67+
>>> zarr_array_nocache = zarr.open('test.zarr', mode='r')
68+
>>> start = time.time()
69+
>>> for _ in range(100):
70+
... _ = zarr_array_nocache[:]
71+
>>> elapsed_nocache = time.time() - start
72+
>>>
73+
>>> print(f"Speedup: {elapsed_nocache/elapsed_cache:.2f}x")
74+
75+
Cache effectiveness is particularly pronounced with repeated access to the same data chunks.
76+
77+
Remote Store Caching
78+
--------------------
79+
80+
The CacheStore is most beneficial when used with remote stores where network latency
81+
is a significant factor. You can use different store types for source and cache:
82+
83+
>>> from zarr.storage import FsspecStore, LocalStore
84+
>>>
85+
>>> # Create a remote store (S3 example)
86+
>>> remote_store = FsspecStore.from_url('s3://bucket/data.zarr', storage_options={'anon': True})
87+
>>>
88+
>>> # Use a local store for persistent caching
89+
>>> local_cache_store = LocalStore('cache_data')
90+
>>>
91+
>>> # Create cached store with persistent local cache
92+
>>> cached_store = zarr.storage.CacheStore(
93+
... store=remote_store,
94+
... cache_store=local_cache_store,
95+
... max_size=512*1024*1024 # 512MB cache
96+
... )
97+
>>>
98+
>>> # Open array through cached store
99+
>>> z = zarr.open(cached_store)
100+
101+
The first access to any chunk is slow because it requires a network round trip. Subsequent
102+
accesses to the same chunk are served from the local cache and avoid the network entirely.
103+
The cache persists between sessions when using a LocalStore for the cache backend.
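
The miss-then-hit behaviour can be sketched without a network connection. The classes below are hypothetical stand-ins: ``CountingSource`` plays the role of the slow remote store and a plain dict plays the role of the cache store. This is not how CacheStore is implemented, only the read-through idea it relies on.

```python
class CountingSource:
    """Stand-in for a slow remote store; counts how often it is read."""

    def __init__(self, data):
        self._data = data
        self.reads = 0

    def get(self, key):
        self.reads += 1
        return self._data.get(key)


class ReadThroughCache:
    """Hypothetical read-through wrapper (illustration only)."""

    def __init__(self, source, cache):
        self.source = source
        self.cache = cache  # any dict-like object

    def get(self, key):
        if key in self.cache:
            return self.cache[key]      # cache hit: the source is not touched
        value = self.source.get(key)    # cache miss: fetch from the source
        if value is not None:
            self.cache[key] = value     # populate the cache for next time
        return value


source = CountingSource({"c/0/0": b"chunk-bytes"})
cached = ReadThroughCache(source, cache={})
first = cached.get("c/0/0")   # miss, reads from the source
second = cached.get("c/0/0")  # hit, served from the cache
```

After both calls ``source.reads`` is 1, which is the effect the guide describes: only the first access pays the cost of the source store.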
104+
105+
Cache Configuration
106+
-------------------
107+
108+
The CacheStore can be configured with several parameters:
109+
110+
**max_size**: Controls the maximum size of cached data in bytes
111+
112+
>>> # 256MB cache with size limit
113+
>>> cache = zarr.storage.CacheStore(
114+
... store=source_store,
115+
... cache_store=cache_store,
116+
... max_size=256*1024*1024
117+
... )
118+
>>>
119+
>>> # Unlimited cache size (use with caution)
120+
>>> cache = zarr.storage.CacheStore(
121+
... store=source_store,
122+
... cache_store=cache_store,
123+
... max_size=None
124+
... )
125+
126+
**max_age_seconds**: Controls time-based cache expiration
127+
128+
>>> # Cache expires after 1 hour
129+
>>> cache = zarr.storage.CacheStore(
130+
... store=source_store,
131+
... cache_store=cache_store,
132+
... max_age_seconds=3600
133+
... )
134+
>>>
135+
>>> # Cache never expires
136+
>>> cache = zarr.storage.CacheStore(
137+
... store=source_store,
138+
... cache_store=cache_store,
139+
... max_age_seconds="infinity"
140+
... )
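
Time-based expiration can be pictured with the following minimal sketch. ``ExpiringCache`` is a hypothetical class, and the exact boundary rule (an entry is stale once it is strictly older than ``max_age_seconds``) is an assumption rather than documented CacheStore behaviour. The clock is injectable so the example does not need to sleep.

```python
import time


class ExpiringCache:
    """Hypothetical cache with time-based expiration (illustration only)."""

    def __init__(self, max_age_seconds, clock=time.monotonic):
        self.max_age_seconds = max_age_seconds  # a number, or "infinity"
        self._clock = clock
        self._entries = {}  # key -> (value, time the value was stored)

    def set(self, key, value):
        self._entries[key] = (value, self._clock())

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, stored_at = entry
        if self.max_age_seconds != "infinity":
            age = self._clock() - stored_at
            if age > self.max_age_seconds:
                del self._entries[key]  # stale entry: drop it, report a miss
                return None
        return value


now = [0.0]  # fake clock that the example advances by hand
cache = ExpiringCache(max_age_seconds=3600, clock=lambda: now[0])
cache.set("k", b"v")
now[0] = 1800.0
fresh = cache.get("k")  # half an hour old: still valid
now[0] = 3601.0
stale = cache.get("k")  # older than one hour: expired
```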
141+
142+
**cache_set_data**: Controls whether written data is cached
143+
144+
>>> # Cache data when writing (default)
145+
>>> cache = zarr.storage.CacheStore(
146+
... store=source_store,
147+
... cache_store=cache_store,
148+
... cache_set_data=True
149+
... )
150+
>>>
151+
>>> # Don't cache written data (the cache is populated by reads only)
152+
>>> cache = zarr.storage.CacheStore(
153+
... store=source_store,
154+
... cache_store=cache_store,
155+
... cache_set_data=False
156+
... )
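
The effect of ``cache_set_data`` can be sketched as follows. ``WriteAwareCache`` is hypothetical, with dicts standing in for both stores. Dropping a stale cached copy when the flag is off is a design choice of this sketch, not a confirmed detail of CacheStore.

```python
class WriteAwareCache:
    """Hypothetical wrapper showing a cache-on-write flag (illustration only)."""

    def __init__(self, source, cache, cache_set_data=True):
        self.source = source  # dict standing in for the source store
        self.cache = cache    # dict standing in for the cache store
        self.cache_set_data = cache_set_data

    def set(self, key, value):
        self.source[key] = value           # writes always reach the source
        if self.cache_set_data:
            self.cache[key] = value        # keep the written bytes for later reads
        else:
            self.cache.pop(key, None)      # avoid serving an outdated copy


caching = WriteAwareCache(source={}, cache={}, cache_set_data=True)
caching.set("c/0", b"new")

read_only = WriteAwareCache(source={}, cache={"c/0": b"old"}, cache_set_data=False)
read_only.set("c/0", b"new")
```

With the flag on, a write is immediately readable from the cache. With it off, the cache fills only when data is read back, which may suit write-heavy workloads where written chunks are rarely re-read.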
157+
158+
Cache Statistics
159+
----------------
160+
161+
The CacheStore provides statistics to monitor cache performance and state:
162+
163+
>>> # Access some data to generate cache activity
164+
>>> data = zarr_array[0:50, 0:50] # First access - cache miss
165+
>>> data = zarr_array[0:50, 0:50] # Second access - cache hit
166+
>>>
167+
>>> # Get comprehensive cache information
168+
>>> info = cached_store.cache_info()
169+
>>> print(f"Cache store type: {info['cache_store_type']}")
170+
>>> print(f"Max age: {info['max_age_seconds']} seconds")
171+
>>> print(f"Max size: {info['max_size']} bytes")
172+
>>> print(f"Current size: {info['current_size']} bytes")
173+
>>> print(f"Tracked keys: {info['tracked_keys']}")
174+
>>> print(f"Cached keys: {info['cached_keys']}")
175+
>>> print(f"Cache set data: {info['cache_set_data']}")
176+
177+
The ``cache_info()`` method returns a dictionary with detailed information about the cache state.
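
A dictionary of this shape can be condensed into a one-line summary for logging. The helper below is a sketch: ``summarize_cache`` is a hypothetical function, the sample dictionary is hand-written rather than real output, and it assumes that ``cached_keys`` is a count and that ``max_size`` is ``None`` for an unlimited cache.

```python
def summarize_cache(info):
    """Format the size-related fields of a cache-info dictionary."""
    max_size = info["max_size"]
    current = info["current_size"]
    if max_size is None:
        usage = "unlimited"
    else:
        usage = f"{100 * current / max_size:.1f}% of {max_size} bytes"
    return f"{info['cached_keys']} keys cached, {current} bytes used ({usage})"


# Hand-written sample using the key names shown above (not real output).
example_info = {"max_size": 1000, "current_size": 250, "cached_keys": 3}
summary = summarize_cache(example_info)
```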
178+
179+
Cache Management
180+
----------------
181+
182+
The CacheStore provides methods for manual cache management:
183+
184+
>>> # Clear all cached data and tracking information
185+
>>> import asyncio; asyncio.run(cached_store.clear_cache())
186+
>>>
187+
>>> # Check cache info after clearing
188+
>>> info = cached_store.cache_info()
189+
>>> print(f"Tracked keys after clear: {info['tracked_keys']}") # Should be 0
190+
>>> print(f"Current size after clear: {info['current_size']}") # Should be 0
191+
192+
The ``clear_cache()`` method is asynchronous. It clears both the cache store
193+
(if that store supports the ``clear`` method) and all internal tracking data.
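
The two steps (clear the backend if it can be cleared, then reset the bookkeeping) can be sketched as follows. ``DictBackend`` and ``TrackedCache`` are hypothetical classes used only for illustration. Because the method is a coroutine, it is driven with ``asyncio.run`` from synchronous code.

```python
import asyncio


class DictBackend:
    """Stand-in cache backend with an async ``clear`` method."""

    def __init__(self):
        self.data = {}

    async def clear(self):
        self.data.clear()


class TrackedCache:
    """Hypothetical cache that tracks per-key sizes (illustration only)."""

    def __init__(self, backend):
        self.backend = backend
        self.key_sizes = {}
        self.current_size = 0

    async def put(self, key, value):
        self.backend.data[key] = value
        self.current_size += len(value) - self.key_sizes.get(key, 0)
        self.key_sizes[key] = len(value)

    async def clear_cache(self):
        clear = getattr(self.backend, "clear", None)
        if clear is not None:
            await clear()          # only if the backend supports clearing
        self.key_sizes.clear()     # always reset the tracking data
        self.current_size = 0


async def main():
    cache = TrackedCache(DictBackend())
    await cache.put("c/0", b"12345")
    size_before = cache.current_size
    await cache.clear_cache()
    return size_before, cache.current_size, len(cache.backend.data)


before, after, remaining = asyncio.run(main())
```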
194+
195+
Best Practices
196+
--------------
197+
198+
1. **Choose an appropriate cache store**: Use MemoryStore for fast temporary caching or LocalStore for persistent caching
199+
2. **Size the cache appropriately**: Set ``max_size`` based on available storage and expected data access patterns
200+
3. **Use with remote stores**: The cache provides the most benefit when wrapping slow remote stores
201+
4. **Monitor cache statistics**: Use ``cache_info()`` to tune cache size and access patterns
202+
5. **Consider data locality**: Group related data accesses together to improve cache efficiency
203+
6. **Set appropriate expiration**: Use ``max_age_seconds`` for time-sensitive data or ``"infinity"`` for static data
204+
205+
Working with Different Store Types
206+
----------------------------------
207+
208+
The CacheStore can wrap any store that implements the :class:`zarr.abc.store.Store` interface
209+
and use any store type for the cache backend:
210+
211+
Local Store with Memory Cache
212+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
213+
214+
>>> from zarr.storage import LocalStore, MemoryStore
215+
>>> source_store = LocalStore('data.zarr')
216+
>>> cache_store = MemoryStore()
217+
>>> cached_store = zarr.storage.CacheStore(
218+
... store=source_store,
219+
... cache_store=cache_store,
220+
... max_size=128*1024*1024
221+
... )
222+
223+
Remote Store with Local Cache
224+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
225+
226+
>>> from zarr.storage import FsspecStore, LocalStore
227+
>>> remote_store = FsspecStore.from_url('s3://bucket/data.zarr', storage_options={'anon': True})
228+
>>> local_cache = LocalStore('local_cache')
229+
>>> cached_store = zarr.storage.CacheStore(
230+
... store=remote_store,
231+
... cache_store=local_cache,
232+
... max_size=1024*1024*1024,
233+
... max_age_seconds=3600
234+
... )
235+
236+
Memory Store with Persistent Cache
237+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
238+
239+
>>> from zarr.storage import MemoryStore, LocalStore
240+
>>> memory_store = MemoryStore()
241+
>>> persistent_cache = LocalStore('persistent_cache')
242+
>>> cached_store = zarr.storage.CacheStore(
243+
... store=memory_store,
244+
... cache_store=persistent_cache,
245+
... max_size=256*1024*1024
246+
... )
247+
248+
The dual-store architecture provides flexibility in choosing the best combination
249+
of source and cache stores for your specific use case.
250+
251+
Examples from Real Usage
252+
------------------------
253+
254+
Here's a complete example demonstrating cache effectiveness:
255+
256+
>>> import zarr
257+
>>> import zarr.storage
258+
>>> import time
259+
>>> import numpy as np
260+
>>>
261+
>>> # Create test data with dual-store cache
262+
>>> source_store = zarr.storage.LocalStore('benchmark.zarr')
263+
>>> cache_store = zarr.storage.MemoryStore()
264+
>>> cached_store = zarr.storage.CacheStore(
265+
... store=source_store,
266+
... cache_store=cache_store,
267+
... max_size=256*1024*1024
268+
... )
269+
>>> zarr_array = zarr.zeros((100, 100), chunks=(10, 10), dtype='f8', store=cached_store, mode='w')
270+
>>> zarr_array[:] = np.random.random((100, 100))
271+
>>>
272+
>>> # Demonstrate cache effectiveness with repeated access
273+
>>> print("First access (cache miss):")
274+
>>> start = time.time()
275+
>>> data = zarr_array[20:30, 20:30]
276+
>>> first_access = time.time() - start
277+
>>>
278+
>>> print("Second access (cache hit):")
279+
>>> start = time.time()
280+
>>> data = zarr_array[20:30, 20:30] # Same data should be cached
281+
>>> second_access = time.time() - start
282+
>>>
283+
>>> print(f"First access time: {first_access:.4f} s")
284+
>>> print(f"Second access time: {second_access:.4f} s")
285+
>>> print(f"Cache speedup: {first_access/second_access:.2f}x")
286+
>>>
287+
>>> # Check cache statistics
288+
>>> info = cached_store.cache_info()
289+
>>> print(f"Cached keys: {info['cached_keys']}")
290+
>>> print(f"Current cache size: {info['current_size']} bytes")
291+
292+
This example shows how the CacheStore can reduce access times for repeated data reads,
293+
which matters most when working with remote data sources. The dual-store
294+
architecture allows for flexible cache persistence and management.
295+
296+
.. _Zip Store Specification: https://github.com/zarr-developers/zarr-specs/pull/311
297+
.. _fsspec: https://filesystem-spec.readthedocs.io
