Commit f5b3be5

Merge branch 'main' into obstore-generic
2 parents ea18a1b + 2eb89e1 commit f5b3be5

File tree

13 files changed, +1676 −0 lines


changes/3366.feature.md

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
Adds `zarr.experimental.cache_store.CacheStore`, a `Store` that implements caching by combining two other `Store` instances. See the [docs page](https://zarr.readthedocs.io/en/latest/user-guide/experimental#cachestore) for more information about this feature.

changes/3490.feature.md

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
Adds a `zarr.experimental` module for unstable user-facing features.

docs/user-guide/experimental.md

Lines changed: 272 additions & 0 deletions
@@ -0,0 +1,272 @@
# Experimental features

This section contains documentation for experimental Zarr Python features. The features described here are exciting and potentially useful, but also volatile -- we might change them at any time. Take this into account if you consider depending on these features.

## `CacheStore`

Zarr Python 3.1.4 adds `zarr.experimental.cache_store.CacheStore`, a dual-store caching implementation that can be wrapped around any Zarr store to improve performance for repeated data access. This is particularly useful when working with remote stores (e.g., S3, HTTP), where network latency can significantly impact data access speed.

The CacheStore implements a cache that uses a separate Store instance as the cache backend, providing persistent caching capabilities with time-based expiration, size-based eviction, and flexible cache storage options. It automatically evicts the least recently used items when the cache reaches its maximum size.
Because the `CacheStore` uses an ordinary Zarr `Store` object as the caching layer, you can reuse the data stored in the cache later.

> **Note:** The CacheStore is a wrapper store that maintains compatibility with the full
> `zarr.abc.store.Store` API while adding transparent caching functionality.

## Basic Usage

Creating a CacheStore requires both a source store and a cache store. The cache store can be any Store implementation, providing flexibility in cache persistence:

```python exec="true" session="experimental" source="above" result="ansi"
import zarr
from zarr.storage import LocalStore
import numpy as np
from tempfile import mkdtemp
from zarr.experimental.cache_store import CacheStore

# Create a local store and a separate cache store
local_store_path = mkdtemp(suffix='.zarr')
source_store = LocalStore(local_store_path)
cache_store = zarr.storage.MemoryStore()  # In-memory cache
cached_store = CacheStore(
    store=source_store,
    cache_store=cache_store,
    max_size=256*1024*1024  # 256MB cache
)

# Create an array using the cached store
zarr_array = zarr.zeros((100, 100), chunks=(10, 10), dtype='f8', store=cached_store, mode='w')

# Write some data to force chunk creation
zarr_array[:] = np.random.random((100, 100))
```

The dual-store architecture allows you to use different store types for source and cache, such as a remote store for source data and a local store for persistent caching.

## Performance Benefits

The CacheStore provides significant performance improvements for repeated data access:

```python exec="true" session="experimental" source="above" result="ansi"
import time

# Benchmark reading with cache
start = time.time()
for _ in range(100):
    _ = zarr_array[:]
elapsed_cache = time.time() - start

# Compare with direct store access (without cache)
zarr_array_nocache = zarr.open(local_store_path, mode='r')
start = time.time()
for _ in range(100):
    _ = zarr_array_nocache[:]
elapsed_nocache = time.time() - start

# Cache provides speedup for repeated access
speedup = elapsed_nocache / elapsed_cache
```

Cache effectiveness is particularly pronounced with repeated access to the same data chunks.
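The hit/miss dynamic behind this speedup can be illustrated with a toy read-through cache that counts accesses (a standalone sketch; the real `CacheStore` does not expose hit/miss counters through `cache_info()`):

```python
class CountingCache:
    """Toy read-through cache that counts hits and misses."""

    def __init__(self, fetch) -> None:
        self.fetch = fetch          # slow source lookup, e.g. a remote read
        self.entries: dict[str, bytes] = {}
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> bytes:
        if key in self.entries:
            self.hits += 1
            return self.entries[key]
        self.misses += 1
        value = self.fetch(key)     # only misses touch the slow source
        self.entries[key] = value
        return value

source = {"chunk/0.0": b"aaaa", "chunk/0.1": b"bbbb"}
cache = CountingCache(source.__getitem__)
for _ in range(10):
    cache.get("chunk/0.0")   # 1 miss, then 9 hits
cache.get("chunk/0.1")       # 1 miss
print(cache.hits, cache.misses)  # 9 2
```

With a remote source, every hit replaces a network round trip, which is where the benchmark's speedup comes from.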
## Cache Configuration

The CacheStore can be configured with several parameters:

**max_size**: Controls the maximum size of cached data in bytes

```python exec="true" session="experimental" source="above" result="ansi"
# 256MB cache with size limit
cache = CacheStore(
    store=source_store,
    cache_store=cache_store,
    max_size=256*1024*1024
)

# Unlimited cache size (use with caution)
cache = CacheStore(
    store=source_store,
    cache_store=cache_store,
    max_size=None
)
```

**max_age_seconds**: Controls time-based cache expiration

```python exec="true" session="experimental" source="above" result="ansi"
# Cache expires after 1 hour
cache = CacheStore(
    store=source_store,
    cache_store=cache_store,
    max_age_seconds=3600
)

# Cache never expires
cache = CacheStore(
    store=source_store,
    cache_store=cache_store,
    max_age_seconds="infinity"
)
```
**cache_set_data**: Controls whether written data is cached

```python exec="true" session="experimental" source="above" result="ansi"
# Cache data when writing (default)
cache = CacheStore(
    store=source_store,
    cache_store=cache_store,
    cache_set_data=True
)

# Don't cache written data (read-only cache)
cache = CacheStore(
    store=source_store,
    cache_store=cache_store,
    cache_set_data=False
)
```

## Cache Statistics

The CacheStore provides statistics to monitor cache performance and state:

```python exec="true" session="experimental" source="above" result="ansi"
# Access some data to generate cache activity
data = zarr_array[0:50, 0:50]  # First access - cache miss
data = zarr_array[0:50, 0:50]  # Second access - cache hit

# Get comprehensive cache information
info = cached_store.cache_info()
print(info['cache_store_type'])  # e.g., 'MemoryStore'
print(info['max_age_seconds'])
print(info['max_size'])
print(info['current_size'])
print(info['tracked_keys'])
print(info['cached_keys'])
print(info['cache_set_data'])
```

The `cache_info()` method returns a dictionary with detailed information about the cache state.

## Cache Management

The CacheStore provides methods for manual cache management:

```python exec="true" session="experimental" source="above" result="ansi"
# Clear all cached data and tracking information
import asyncio
asyncio.run(cached_store.clear_cache())

# Check cache info after clearing
info = cached_store.cache_info()
assert info['tracked_keys'] == 0
assert info['current_size'] == 0
```

The `clear_cache()` method is an async method that clears both the cache store (if it supports the `clear` method) and all internal tracking data.

## Best Practices

1. **Choose an appropriate cache store**: Use MemoryStore for fast temporary caching or LocalStore for persistent caching
2. **Size the cache appropriately**: Set `max_size` based on available storage and expected data access patterns
3. **Use with remote stores**: The cache provides the most benefit when wrapping slow remote stores
4. **Monitor cache statistics**: Use `cache_info()` to tune cache size and access patterns
5. **Consider data locality**: Group related data accesses together to improve cache efficiency
6. **Set appropriate expiration**: Use `max_age_seconds` for time-sensitive data or "infinity" for static data

## Working with Different Store Types

The CacheStore can wrap any store that implements the `zarr.abc.store.Store` interface and use any store type for the cache backend:

### Local Store with Memory Cache

```python exec="true" session="experimental-memory-cache" source="above" result="ansi"
from zarr.storage import LocalStore, MemoryStore
from zarr.experimental.cache_store import CacheStore
from tempfile import mkdtemp

local_store_path = mkdtemp(suffix='.zarr')
source_store = LocalStore(local_store_path)
cache_store = MemoryStore()
cached_store = CacheStore(
    store=source_store,
    cache_store=cache_store,
    max_size=128*1024*1024
)
```

### Memory Store with Persistent Cache

```python exec="true" session="experimental-local-cache" source="above" result="ansi"
from tempfile import mkdtemp
from zarr.storage import MemoryStore, LocalStore
from zarr.experimental.cache_store import CacheStore

memory_store = MemoryStore()
local_store_path = mkdtemp(suffix='.zarr')
persistent_cache = LocalStore(local_store_path)
cached_store = CacheStore(
    store=memory_store,
    cache_store=persistent_cache,
    max_size=256*1024*1024
)
```

The dual-store architecture provides flexibility in choosing the best combination of source and cache stores for your specific use case.

## Examples from Real Usage

Here's a complete example demonstrating cache effectiveness:

```python exec="true" session="experimental-final" source="above" result="ansi"
import numpy as np
import time
from tempfile import mkdtemp
import zarr
import zarr.storage
from zarr.experimental.cache_store import CacheStore

# Create test data with dual-store cache
local_store_path = mkdtemp(suffix='.zarr')
source_store = zarr.storage.LocalStore(local_store_path)
cache_store = zarr.storage.MemoryStore()
cached_store = CacheStore(
    store=source_store,
    cache_store=cache_store,
    max_size=256*1024*1024
)
zarr_array = zarr.zeros((100, 100), chunks=(10, 10), dtype='f8', store=cached_store, mode='w')
zarr_array[:] = np.random.random((100, 100))

# Demonstrate cache effectiveness with repeated access
start = time.time()
data = zarr_array[20:30, 20:30]  # First access (cache miss)
first_access = time.time() - start

start = time.time()
data = zarr_array[20:30, 20:30]  # Second access (cache hit)
second_access = time.time() - start

# Check cache statistics
info = cached_store.cache_info()
assert info['cached_keys'] > 0  # Should have cached keys
assert info['current_size'] > 0  # Should have cached data
print(f"Cache contains {info['cached_keys']} keys with {info['current_size']} bytes")
```

This example shows how the CacheStore can significantly reduce access times for repeated data reads, which is especially important when working with remote data sources. The dual-store architecture allows for flexible cache persistence and management.

mkdocs.yml

Lines changed: 1 addition & 0 deletions
@@ -25,6 +25,7 @@ nav:
 - user-guide/extending.md
 - user-guide/gpu.md
 - user-guide/consolidated_metadata.md
+- user-guide/experimental.md
 - API Reference:
 - api/index.md
 - api/array.md

pyproject.toml

Lines changed: 1 addition & 0 deletions
@@ -399,6 +399,7 @@ filterwarnings = [
     "ignore:Unclosed client session <aiohttp.client.ClientSession.*:ResourceWarning"
 ]
 markers = [
+    "asyncio: mark test as asyncio test",
     "gpu: mark a test as requiring CuPy and GPU",
     "slow_hypothesis: slow hypothesis tests",
 ]

src/zarr/core/dtype/npy/common.py

Lines changed: 59 additions & 0 deletions
@@ -58,6 +58,12 @@
 IntishFloat = NewType("IntishFloat", float)
 """A type for floats that represent integers, like 1.0 (but not 1.1)."""

+IntishStr = NewType("IntishStr", str)
+"""A type for strings that represent integers, like "0" or "42"."""
+
+FloatishStr = NewType("FloatishStr", str)
+"""A type for strings that represent floats, like "3.14" or "-2.5"."""
+
 NumpyEndiannessStr = Literal[">", "<", "="]
 NUMPY_ENDIANNESS_STR: Final = ">", "<", "="

@@ -488,6 +494,59 @@ def check_json_intish_float(data: JSON) -> TypeGuard[IntishFloat]:
     return isinstance(data, float) and data.is_integer()


+def check_json_intish_str(data: JSON) -> TypeGuard[IntishStr]:
+    """
+    Check if a JSON value is a string that represents an integer, like "0", "42", or "-5".
+
+    Parameters
+    ----------
+    data : JSON
+        The JSON value to check.
+
+    Returns
+    -------
+    bool
+        True if the data is a string representing an integer, False otherwise.
+    """
+    if not isinstance(data, str):
+        return False
+
+    try:
+        int(data)
+    except ValueError:
+        return False
+    else:
+        return True
+
+
+def check_json_floatish_str(data: JSON) -> TypeGuard[FloatishStr]:
+    """
+    Check if a JSON value is a string that represents a float, like "3.14", "-2.5", or "0.0".
+
+    Note: This function is intended to be used AFTER check_json_float_v2/v3, so it only
+    handles regular string representations that those functions don't cover.
+
+    Parameters
+    ----------
+    data : JSON
+        The JSON value to check.
+
+    Returns
+    -------
+    bool
+        True if the data is a string representing a regular float, False otherwise.
+    """
+    if not isinstance(data, str):
+        return False
+
+    try:
+        float(data)
+    except ValueError:
+        return False
+    else:
+        return True
+
+
 def check_json_str(data: JSON) -> TypeGuard[str]:
     """
     Check if a JSON value is a string.
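The two checkers added above defer entirely to Python's `int()`/`float()` parsing. A standalone sketch of the same logic (reimplemented here without zarr's type aliases so it runs on its own) makes the edge cases visible:

```python
def check_intish_str(data: object) -> bool:
    """True only for strings that int() accepts, mirroring check_json_intish_str."""
    if not isinstance(data, str):
        return False
    try:
        int(data)
    except ValueError:
        return False
    return True

def check_floatish_str(data: object) -> bool:
    """True only for strings that float() accepts, mirroring check_json_floatish_str."""
    if not isinstance(data, str):
        return False
    try:
        float(data)
    except ValueError:
        return False
    return True

print(check_intish_str("42"))      # True
print(check_intish_str("3.14"))    # False: int("3.14") raises ValueError
print(check_intish_str(42))        # False: not a string
print(check_floatish_str("3.14"))  # True
print(check_floatish_str("nan"))   # True: float("nan") parses, so NaN strings pass
print(check_floatish_str("abc"))   # False
```

Note that `float()` also accepts `"inf"`, `"nan"`, and surrounding whitespace, so such strings count as floatish; this is why the docstring stresses calling the checker only after the v2/v3 special-value checks.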

src/zarr/core/dtype/npy/float.py

Lines changed: 5 additions & 0 deletions
@@ -19,6 +19,7 @@
     TFloatScalar_co,
     check_json_float_v2,
     check_json_float_v3,
+    check_json_floatish_str,
     endianness_to_numpy_str,
     float_from_json_v2,
     float_from_json_v3,
@@ -270,13 +271,17 @@ def from_json_scalar(self, data: JSON, *, zarr_format: ZarrFormat) -> TFloatScal
         if zarr_format == 2:
             if check_json_float_v2(data):
                 return self._cast_scalar_unchecked(float_from_json_v2(data))
+            elif check_json_floatish_str(data):
+                return self._cast_scalar_unchecked(float(data))
             else:
                 raise TypeError(
                     f"Invalid type: {data}. Expected a float or a special string encoding of a float."
                 )
         elif zarr_format == 3:
             if check_json_float_v3(data):
                 return self._cast_scalar_unchecked(float_from_json_v3(data))
+            elif check_json_floatish_str(data):
+                return self._cast_scalar_unchecked(float(data))
             else:
                 raise TypeError(
                     f"Invalid type: {data}. Expected a float or a special string encoding of a float."
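The control flow this diff adds to `from_json_scalar` can be sketched standalone: accept native JSON floats and the special string encodings first, then fall back to parsing plain numeric strings, and raise for everything else. The helper below is a simplified stand-in, not the real zarr function, and its special-value table is a simplification of the v2/v3 encodings:

```python
def decode_float_scalar(data: object) -> float:
    """Simplified stand-in for the float path of from_json_scalar."""
    special = {"Infinity": float("inf"), "-Infinity": float("-inf"), "NaN": float("nan")}
    if isinstance(data, (int, float)) and not isinstance(data, bool):
        return float(data)                  # plain JSON number
    if isinstance(data, str) and data in special:
        return special[data]                # special string encoding
    if isinstance(data, str):
        try:
            return float(data)              # new fallback: "3.14" -> 3.14
        except ValueError:
            pass
    raise TypeError(f"Invalid type: {data!r}. Expected a float or a string encoding of a float.")

print(decode_float_scalar(1.5))      # 1.5
print(decode_float_scalar("3.14"))   # 3.14
print(decode_float_scalar("NaN"))    # nan
```

Keeping the string fallback in the `elif` branch preserves the original behavior for everything the v2/v3 checks already accepted; only previously-rejected plain numeric strings change outcome.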
