
Commit 1111c93

Merge branch 'main' into feat/v2-v3-codecs
2 parents b2526d2 + dc5334e commit 1111c93

38 files changed: +2194 -320 lines changed

.github/workflows/test.yml

Lines changed: 2 additions & 3 deletions
@@ -129,11 +129,10 @@ jobs:
129129
pip install hatch
130130
- name: Set Up Hatch Env
131131
run: |
132-
hatch env create docs
133-
hatch env run -e docs list-env
132+
hatch run doctest:pip list
134133
- name: Run Tests
135134
run: |
136-
hatch env run --env docs check
135+
hatch run doctest:test
137136
138137
test-complete:
139138
name: Test complete

changes/3366.feature.md

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
1+
Adds `zarr.experimental.cache_store.CacheStore`, a `Store` that implements caching by combining two other `Store` instances. See the [docs page](https://zarr.readthedocs.io/en/latest/user-guide/experimental#cachestore) for more information about this feature.

changes/3490.feature.md

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
1+
Adds a `zarr.experimental` module for unstable user-facing features.

changes/3502.doc.md

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
1+
Reorganize the top-level `examples` directory to give each example its own sub-directory. Adds content to the docs for each example.

docs/user-guide/data_types.md

Lines changed: 2 additions & 8 deletions
@@ -298,14 +298,8 @@ assert scalar_value == np.int8(42)
298298
Each Zarr data type is a separate Python class that inherits from
299299
[ZDType][zarr.dtype.ZDType]. You can define a custom data type by
300300
writing your own subclass of [ZDType][zarr.dtype.ZDType] and adding
301-
your data type to the data type registry. A complete example of this process is included below.
302-
303-
The source code for this example can be found in the `examples/custom_dtype.py` file in the Zarr
304-
Python project directory.
305-
306-
```python
307-
--8<-- "examples/custom_dtype.py"
308-
```
301+
your data type to the data type registry. For an executable demonstration
302+
of this process, see the [`custom_dtype` example](../user-guide/examples/custom_dtype.md).
309303

310304
### Data Type Resolution
311305

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
1+
--8<-- "examples/custom_dtype/README.md"
2+
3+
## Source Code
4+
5+
```python
6+
--8<-- "examples/custom_dtype/custom_dtype.py"
7+
```

docs/user-guide/experimental.md

Lines changed: 272 additions & 0 deletions
@@ -0,0 +1,272 @@
1+
# Experimental features
2+
3+
This section contains documentation for experimental Zarr Python features. The features described here are exciting and potentially useful, but also volatile -- we might change them at any time. Take this into account if you consider depending on these features.
4+
5+
## `CacheStore`
6+
7+
Zarr Python 3.1.4 adds `zarr.experimental.cache_store.CacheStore`, which provides a dual-store caching implementation
8+
that can be wrapped around any Zarr store to improve performance for repeated data access.
9+
This is particularly useful when working with remote stores (e.g., S3, HTTP) where network
10+
latency can significantly impact data access speed.
11+
12+
The CacheStore implements a cache that uses a separate Store instance as the cache backend,
13+
providing persistent caching capabilities with time-based expiration, size-based eviction,
14+
and flexible cache storage options. It automatically evicts the least recently used items
15+
when the cache reaches its maximum size.
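
The eviction and expiration behaviour described above can be illustrated with a minimal, standalone sketch. This is not the `CacheStore` implementation: `TinyLRUCache` and its methods are hypothetical names used only to show how a byte budget, least-recently-used eviction, and a maximum age can interact.

```python
import time
from collections import OrderedDict


class TinyLRUCache:
    """Hypothetical byte-budgeted LRU cache with optional time-based expiry."""

    def __init__(self, max_size=None, max_age_seconds=None, clock=time.monotonic):
        self.max_size = max_size                # byte budget; None means unlimited
        self.max_age_seconds = max_age_seconds  # None means entries never expire
        self._clock = clock
        self._entries = OrderedDict()           # key -> (value, time stored)
        self.current_size = 0

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, stored_at = entry
        expired = (
            self.max_age_seconds is not None
            and self._clock() - stored_at > self.max_age_seconds
        )
        if expired:
            self._remove(key)                   # stale entry: treat as a miss
            return None
        self._entries.move_to_end(key)          # mark as most recently used
        return value

    def set(self, key, value):
        if key in self._entries:
            self._remove(key)
        if self.max_size is not None and len(value) > self.max_size:
            return                              # larger than the whole budget
        self._entries[key] = (value, self._clock())
        self.current_size += len(value)
        while self.max_size is not None and self.current_size > self.max_size:
            oldest_key = next(iter(self._entries))
            self._remove(oldest_key)            # evict the least recently used key

    def _remove(self, key):
        value, _ = self._entries.pop(key)
        self.current_size -= len(value)


cache = TinyLRUCache(max_size=8)
cache.set("a", b"1234")
cache.set("b", b"5678")
cache.get("a")           # "a" becomes the most recently used key
cache.set("c", b"abcd")  # the budget is exceeded, so "b" is evicted
```

In this sketch a read refreshes an entry's position in the eviction order but not its age, which is one of several reasonable design choices.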
16+
17+
Because the `CacheStore` uses an ordinary Zarr `Store` object as the caching layer, you can reuse the data stored in the cache later.
18+
19+
> **Note:** The CacheStore is a wrapper store that maintains compatibility with the full
20+
> `zarr.abc.store.Store` API while adding transparent caching functionality.
21+
22+
## Basic Usage
23+
24+
Creating a CacheStore requires both a source store and a cache store. The cache store
25+
can be any Store implementation, providing flexibility in cache persistence:
26+
27+
```python exec="true" session="experimental" source="above" result="ansi"
28+
import zarr
29+
from zarr.storage import LocalStore
30+
import numpy as np
31+
from tempfile import mkdtemp
32+
from zarr.experimental.cache_store import CacheStore
33+
34+
# Create a local store and a separate cache store
35+
local_store_path = mkdtemp(suffix='.zarr')
36+
source_store = LocalStore(local_store_path)
37+
cache_store = zarr.storage.MemoryStore() # In-memory cache
38+
cached_store = CacheStore(
39+
store=source_store,
40+
cache_store=cache_store,
41+
max_size=256*1024*1024 # 256MB cache
42+
)
43+
44+
# Create an array using the cached store
45+
zarr_array = zarr.zeros((100, 100), chunks=(10, 10), dtype='f8', store=cached_store, mode='w')
46+
47+
# Write some data to force chunk creation
48+
zarr_array[:] = np.random.random((100, 100))
49+
```
50+
51+
The dual-store architecture allows you to use different store types for source and cache,
52+
such as a remote store for source data and a local store for persistent caching.
53+
54+
## Performance Benefits
55+
56+
The CacheStore can speed up repeated reads of the same data, because chunks already held in the cache are served without going back to the source store:
57+
58+
```python exec="true" session="experimental" source="above" result="ansi"
59+
import time
60+
61+
# Benchmark reading with cache
62+
start = time.time()
63+
for _ in range(100):
64+
    _ = zarr_array[:]
65+
elapsed_cache = time.time() - start
66+
67+
# Compare with direct store access (without cache)
68+
zarr_array_nocache = zarr.open(local_store_path, mode='r')
69+
start = time.time()
70+
for _ in range(100):
71+
    _ = zarr_array_nocache[:]
72+
elapsed_nocache = time.time() - start
73+
74+
# Cache provides speedup for repeated access
75+
speedup = elapsed_nocache / elapsed_cache
76+
```
77+
78+
Cache effectiveness is particularly pronounced with repeated access to the same data chunks.
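
To see why repeated access benefits most, here is a minimal sketch of the read-through pattern using only the standard library. `SlowSource` and `read_through` are hypothetical stand-ins rather than Zarr APIs, and the latency value is an arbitrary assumption.

```python
import time


class SlowSource:
    """Hypothetical stand-in for a high-latency store; counts its reads."""

    def __init__(self, data, latency=0.005):
        self.data = data
        self.latency = latency
        self.reads = 0

    def get(self, key):
        self.reads += 1
        time.sleep(self.latency)  # simulate network latency
        return self.data[key]


def read_through(source, cache, key):
    """Return the value for key, fetching from source only on a cache miss."""
    if key not in cache:
        cache[key] = source.get(key)
    return cache[key]


source = SlowSource({"c/2/2": b"\x00" * 800})
cache = {}

for _ in range(100):
    chunk = read_through(source, cache, "c/2/2")

print(f"{source.reads} source read(s) for 100 requests")
```

Only the first request pays the simulated latency. Every later request for the same key is answered from the in-memory dictionary.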
79+
80+
81+
## Cache Configuration
82+
83+
The CacheStore can be configured with several parameters:
84+
85+
**max_size**: Controls the maximum size of cached data in bytes
86+
87+
```python exec="true" session="experimental" source="above" result="ansi"
88+
# 256MB cache with size limit
89+
cache = CacheStore(
90+
store=source_store,
91+
cache_store=cache_store,
92+
max_size=256*1024*1024
93+
)
94+
95+
# Unlimited cache size (use with caution)
96+
cache = CacheStore(
97+
store=source_store,
98+
cache_store=cache_store,
99+
max_size=None
100+
)
101+
```
102+
103+
**max_age_seconds**: Controls time-based cache expiration
104+
105+
```python exec="true" session="experimental" source="above" result="ansi"
106+
# Cache expires after 1 hour
107+
cache = CacheStore(
108+
store=source_store,
109+
cache_store=cache_store,
110+
max_age_seconds=3600
111+
)
112+
113+
# Cache never expires
114+
cache = CacheStore(
115+
store=source_store,
116+
cache_store=cache_store,
117+
max_age_seconds="infinity"
118+
)
119+
```
120+
121+
**cache_set_data**: Controls whether written data is cached
122+
123+
```python exec="true" session="experimental" source="above" result="ansi"
124+
# Cache data when writing (default)
125+
cache = CacheStore(
126+
store=source_store,
127+
cache_store=cache_store,
128+
cache_set_data=True
129+
)
130+
131+
# Don't cache written data (the cache is populated on reads only)
132+
cache = CacheStore(
133+
store=source_store,
134+
cache_store=cache_store,
135+
cache_set_data=False
136+
)
137+
```
138+
139+
## Cache Statistics
140+
141+
The CacheStore provides statistics to monitor cache performance and state:
142+
143+
```python exec="true" session="experimental" source="above" result="ansi"
144+
# Access some data to generate cache activity
145+
data = zarr_array[0:50, 0:50]  # First access - a cache miss only if these chunks are not cached yet
146+
data = zarr_array[0:50, 0:50] # Second access - cache hit
147+
148+
# Get comprehensive cache information
149+
info = cached_store.cache_info()
150+
print(info['cache_store_type']) # e.g., 'MemoryStore'
151+
print(info['max_age_seconds'])
152+
print(info['max_size'])
153+
print(info['current_size'])
154+
print(info['tracked_keys'])
155+
print(info['cached_keys'])
156+
print(info['cache_set_data'])
157+
```
158+
159+
The `cache_info()` method returns a dictionary with detailed information about the cache state.
160+
161+
## Cache Management
162+
163+
The CacheStore provides methods for manual cache management:
164+
165+
```python exec="true" session="experimental" source="above" result="ansi"
166+
# Clear all cached data and tracking information
167+
import asyncio
168+
asyncio.run(cached_store.clear_cache())
169+
170+
# Check cache info after clearing
171+
info = cached_store.cache_info()
172+
assert info['tracked_keys'] == 0
173+
assert info['current_size'] == 0
174+
```
175+
176+
The `clear_cache()` method is an async method that clears both the cache store
177+
(if it supports the `clear` method) and all internal tracking data.
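
Because `clear_cache()` is a coroutine, code that already runs inside an event loop should await it rather than call `asyncio.run()`. The sketch below uses a hypothetical stand-in class, `DemoCache`, so that it runs without Zarr installed. Only the calling pattern is the point.

```python
import asyncio


class DemoCache:
    """Hypothetical stand-in exposing an async clear_cache() method."""

    def __init__(self):
        self.tracked = {"c/0/0": 800}  # key -> size in bytes

    async def clear_cache(self):
        self.tracked.clear()


async def refresh(cache):
    # Inside a coroutine, await the method directly. Calling asyncio.run()
    # here would raise RuntimeError because an event loop is already running.
    await cache.clear_cache()
    return len(cache.tracked)


demo = DemoCache()
remaining = asyncio.run(refresh(demo))  # synchronous entry point
```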
178+
179+
## Best Practices
180+
181+
1. **Choose appropriate cache store**: Use MemoryStore for fast temporary caching or LocalStore for persistent caching
182+
2. **Size the cache appropriately**: Set `max_size` based on available storage and expected data access patterns
183+
3. **Use with remote stores**: The cache provides the most benefit when wrapping slow remote stores
184+
4. **Monitor cache statistics**: Use `cache_info()` to tune cache size and access patterns
185+
5. **Consider data locality**: Group related data accesses together to improve cache efficiency
186+
6. **Set appropriate expiration**: Set `max_age_seconds` to a finite number of seconds for time-sensitive data, or to `"infinity"` for static data
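
As a rough aid for sizing the cache, the following sketch estimates how many chunks fit in a given `max_size`. It assumes uncompressed chunk sizes as an approximate upper bound. The bytes actually cached depend on the codecs in use and on metadata keys, so treat the result as an estimate. The helper names are hypothetical.

```python
import math


def chunk_nbytes(chunk_shape, itemsize):
    """Uncompressed size of one chunk in bytes."""
    return math.prod(chunk_shape) * itemsize


def chunks_that_fit(max_size, chunk_shape, itemsize):
    """Estimate how many uncompressed chunks fit in max_size bytes."""
    return max_size // chunk_nbytes(chunk_shape, itemsize)


shape, chunks, itemsize = (100, 100), (10, 10), 8  # 'f8' uses 8 bytes per item

per_chunk = chunk_nbytes(chunks, itemsize)
n_chunks = math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))
working_set = per_chunk * n_chunks

budget = 256 * 1024 * 1024
print(f"{per_chunk} bytes per chunk, {n_chunks} chunks, {working_set} bytes in total")
print(f"The budget holds about {chunks_that_fit(budget, chunks, itemsize)} chunks")
```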
187+
188+
## Working with Different Store Types
189+
190+
The CacheStore can wrap any store that implements the `zarr.abc.store.Store` interface
191+
and use any store type for the cache backend:
192+
193+
### Local Store with Memory Cache
194+
195+
```python exec="true" session="experimental-memory-cache" source="above" result="ansi"
196+
from zarr.storage import LocalStore, MemoryStore
197+
from zarr.experimental.cache_store import CacheStore
198+
from tempfile import mkdtemp
199+
200+
local_store_path = mkdtemp(suffix='.zarr')
201+
source_store = LocalStore(local_store_path)
202+
cache_store = MemoryStore()
203+
cached_store = CacheStore(
204+
store=source_store,
205+
cache_store=cache_store,
206+
max_size=128*1024*1024
207+
)
208+
```
209+
210+
### Memory Store with Persistent Cache
211+
212+
```python exec="true" session="experimental-local-cache" source="above" result="ansi"
213+
from tempfile import mkdtemp
214+
from zarr.storage import MemoryStore, LocalStore
215+
from zarr.experimental.cache_store import CacheStore
216+
217+
memory_store = MemoryStore()
218+
local_store_path = mkdtemp(suffix='.zarr')
219+
persistent_cache = LocalStore(local_store_path)
220+
cached_store = CacheStore(
221+
store=memory_store,
222+
cache_store=persistent_cache,
223+
max_size=256*1024*1024
224+
)
225+
```
226+
227+
The dual-store architecture provides flexibility in choosing the best combination
228+
of source and cache stores for your specific use case.
229+
230+
## Examples from Real Usage
231+
232+
Here's a complete example demonstrating cache effectiveness:
233+
234+
```python exec="true" session="experimental-final" source="above" result="ansi"
235+
import numpy as np
236+
import time
237+
from tempfile import mkdtemp
238+
import zarr
239+
import zarr.storage
240+
from zarr.experimental.cache_store import CacheStore
241+
242+
# Create test data with dual-store cache
243+
local_store_path = mkdtemp(suffix='.zarr')
244+
source_store = zarr.storage.LocalStore(local_store_path)
245+
cache_store = zarr.storage.MemoryStore()
246+
cached_store = CacheStore(
247+
store=source_store,
248+
cache_store=cache_store,
249+
max_size=256*1024*1024
250+
)
251+
zarr_array = zarr.zeros((100, 100), chunks=(10, 10), dtype='f8', store=cached_store, mode='w')
252+
zarr_array[:] = np.random.random((100, 100))
253+
254+
# Demonstrate cache effectiveness with repeated access
255+
start = time.time()
256+
data = zarr_array[20:30, 20:30]  # First access (may already be a hit, since writes are cached by default)
257+
first_access = time.time() - start
258+
259+
start = time.time()
260+
data = zarr_array[20:30, 20:30] # Second access (cache hit)
261+
second_access = time.time() - start
262+
263+
# Check cache statistics
264+
info = cached_store.cache_info()
265+
assert info['cached_keys'] > 0 # Should have cached keys
266+
assert info['current_size'] > 0 # Should have cached data
267+
print(f"Cache contains {info['cached_keys']} keys with {info['current_size']} bytes")
268+
```
269+
270+
This example shows how the CacheStore can significantly reduce access times for repeated
271+
data reads, particularly important when working with remote data sources. The dual-store
272+
architecture allows for flexible cache persistence and management.

docs/user-guide/storage.md

Lines changed: 4 additions & 0 deletions
@@ -25,6 +25,7 @@ print(group)
2525

2626
```python exec="true" session="storage" source="above" result="ansi"
2727
# Implicitly create a read-only FsspecStore
28+
# Note: requires s3fs to be installed
2829
group = zarr.open_group(
2930
store='s3://noaa-nwm-retro-v2-zarr-pds',
3031
mode='r',
@@ -59,6 +60,7 @@ print(group)
5960

6061
- an FSSpec URI string, indicating a [remote store](#remote-store) location:
6162
```python exec="true" session="storage" source="above" result="ansi"
63+
# Note: requires s3fs to be installed
6264
group = zarr.open_group(
6365
store='s3://noaa-nwm-retro-v2-zarr-pds',
6466
mode='r',
@@ -125,6 +127,7 @@ that implements the [AbstractFileSystem](https://filesystem-spec.readthedocs.io/
125127
API. `storage_options` can be used to configure the fsspec backend:
126128

127129
```python exec="true" session="storage" source="above" result="ansi"
130+
# Note: requires s3fs to be installed
128131
store = zarr.storage.FsspecStore.from_url(
129132
's3://noaa-nwm-retro-v2-zarr-pds',
130133
read_only=True,
@@ -138,6 +141,7 @@ The type of filesystem (e.g. S3, https, etc..) is inferred from the scheme of th
138141
In case a specific filesystem is needed, one can explicitly create it. For example, to create an S3 filesystem:
139142

140143
```python exec="true" session="storage" source="above" result="ansi"
144+
# Note: requires s3fs to be installed
141145
import fsspec
142146
fs = fsspec.filesystem(
143147
's3', anon=True, asynchronous=True,

examples/README.md

Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
1+
# Zarr Python Examples
2+
3+
This directory contains complete, runnable examples demonstrating various features and use cases of Zarr Python.
4+
5+
## Directory Structure
6+
7+
Each example is organized in its own subdirectory with the following structure:
8+
9+
```
10+
examples/
11+
├── example_name/
12+
│ ├── README.md # Documentation for the example
13+
│ └── example_name.py # Python source code
14+
└── ...
15+
```
16+
17+
## Adding New Examples
18+
19+
To add a new example:
20+
21+
1. Create a new subdirectory: `examples/my_example/`
22+
2. Add your Python code: `examples/my_example/my_example.py`
23+
3. Create documentation: `examples/my_example/README.md`
24+
4. Create a documentation page at `docs/user-guide/examples/my_example.md`. The documentation page should simply link to the `README.md` and the source code, e.g.:
25+
26+
````
27+
# docs/user-guide/examples/my_example.md
28+
--8<-- "examples/my_example/README.md"
29+
30+
## Source Code
31+
32+
```python
33+
--8<-- "examples/my_example/my_example.py"
34+
```
35+
````
36+
5. Update `mkdocs.yml` to include the new example in the navigation.
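
Steps 1 to 4 above could be automated with a small helper along these lines. This is a sketch, not part of the repository: `scaffold_example` is a hypothetical function, the placeholder file contents are assumptions, and step 5 (editing `mkdocs.yml`) is left as a manual step. The demo writes into a temporary directory rather than a real checkout.

```python
import tempfile
from pathlib import Path


def scaffold_example(repo_root, name):
    """Create placeholder files for an example called `name` (steps 1 to 4)."""
    root = Path(repo_root)
    fence = "`" * 3  # built at runtime to avoid nesting code fences
    title = name.replace("_", " ").title()

    files = {
        root / "examples" / name / f"{name}.py": '"""TODO: example code."""\n',
        root / "examples" / name / "README.md": (
            f"# {title}\n\nTODO: describe the example and how to run it.\n"
        ),
        root / "docs" / "user-guide" / "examples" / f"{name}.md": (
            f'--8<-- "examples/{name}/README.md"\n\n'
            "## Source Code\n\n"
            f"{fence}python\n"
            f'--8<-- "examples/{name}/{name}.py"\n'
            f"{fence}\n"
        ),
    }
    for path, content in files.items():
        path.parent.mkdir(parents=True, exist_ok=True)
        if not path.exists():  # never overwrite existing work
            path.write_text(content)
    return list(files)


with tempfile.TemporaryDirectory() as tmp:
    created = scaffold_example(tmp, "my_example")
    created_names = [p.relative_to(tmp).as_posix() for p in created]
    docs_text = created[2].read_text()
```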
37+
38+
### Example README.md Format
39+
40+
Your README.md should include:
41+
42+
- A title (`# Example Name`)
43+
- Description of what the example demonstrates
44+
- Instructions for running the example
