Commit a6c3769
Feat: Add caching and request splitting readable stores (#27)
* Feat: Add CachingReadableStore
* Add SplittingReadableStore
1 parent 3c30d8c commit a6c3769

File tree

9 files changed: +1522 −0 lines changed

docs/api/cache.md

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+::: obspec_utils.cache.CachingReadableStore

docs/api/splitting.md

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+::: obspec_utils.splitting.SplittingReadableStore

docs/caching-architecture.md

Lines changed: 308 additions & 0 deletions
@@ -0,0 +1,308 @@
# Caching Architecture

This document describes the caching architecture in obspec-utils, the design decisions behind it, and guidance on when to use each caching strategy.

## Overview

obspec-utils provides caching at two levels:

1. **Reader-level caching**: Temporary, scoped to a single file reader's lifetime
2. **Store-level caching**: Shared across all consumers of a store

These serve different phases of a typical VirtualiZarr workflow and have different trade-offs.

## Two-Phase Workflow

When working with virtual datasets (e.g., via VirtualiZarr), data access happens in two distinct phases:

### Phase 1: Parsing (Reader-Level)

```
┌─────────────────────────────────────────────────────────────┐
│                        PARSING PHASE                         │
│                   (per-file, short-lived)                    │
│                                                              │
│   open_virtual_dataset()                                     │
│           │                                                  │
│           ▼                                                  │
│        Reader ──────► Store ──────► Network                  │
│           │                                                  │
│           └── Reader-level caching                           │
│               - Caches full file for HDF5/NetCDF parsing     │
│               - Released when reader closes                  │
│               - Isolated per reader instance                 │
└─────────────────────────────────────────────────────────────┘
```

During parsing:

- A reader opens the file and parses headers/metadata
- The file may be read multiple times during parsing (HDF5 structure traversal)
- Once parsing completes, the reader closes and its cache is released
- Result: a virtual dataset with chunk *references* (not actual data)

**Reader-level caching is appropriate here because:**

- File metadata is read once per file, then discarded
- Cache lifecycle matches reader lifecycle (automatic cleanup)
- No cross-contamination between different files being parsed
- Memory is freed immediately when parsing completes
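As a concrete illustration, here is a minimal sketch of the parsing phase, assuming `store` is an obspec-compatible store and that `EagerStoreReader` exposes the file-like `read`/`seek` interface that `h5py` accepts (see the reader examples below); the exact parser wiring in VirtualiZarr may differ:

```python
import h5py

from obspec_utils.obspec import EagerStoreReader

# Hypothetical parsing pass: the whole file is pulled into memory once,
# so h5py's many small metadata reads are served locally rather than
# over the network. `store` is assumed to be an obspec-compatible store.
reader = EagerStoreReader(store, "file.nc")
try:
    with h5py.File(reader, mode="r") as f:
        variable_names = list(f.keys())  # traverse structure, build references
finally:
    reader.close()  # the reader's cache is released here
```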
### Phase 2: Data Access (Store-Level)

```
┌─────────────────────────────────────────────────────────────┐
│                      DATA ACCESS PHASE                       │
│                    (shared, long-lived)                      │
│                                                              │
│   xarray computation / .load() / .compute()                  │
│           │                                                  │
│           ▼                                                  │
│   ManifestStore ──────► Store ──────► Network                │
│           │                                                  │
│           └── Store-level caching                            │
│               - Shared across calls                          │
│               - Persists across                              │
│                 different consumers                          │
└─────────────────────────────────────────────────────────────┘
```

During data access:

- Computations trigger reads of actual chunk data
- Same chunks may be accessed multiple times (overlapping operations, retries)
- Cache should persist and be shared across different access patterns

**Store-level caching is appropriate here because:**

- Chunks may be re-read by different computations
- Cache should be shared across all consumers of the store
- Lifecycle is independent of any single reader

## Caching Implementations

### Reader-Level: EagerStoreReader

`EagerStoreReader` loads the entire file into memory on construction:

```python
from obspec_utils.obspec import EagerStoreReader

# File is fully loaded into memory
reader = EagerStoreReader(store, "file.nc")

# All reads served from memory
data = reader.read(1000)
reader.seek(0)
more_data = reader.read(500)

# Cache released when reader closes
reader.close()
```

**Characteristics:**

- Fetches the file using parallel `get_ranges()` requests for speed
- Caches the contents in a `BytesIO` buffer
- Cache is isolated to this reader instance
- Memory freed on `close()` or context manager exit

**When to use:**

- Parsing HDF5/NetCDF files (need random access during parsing)
- Small-to-medium files that fit in memory
- When you'll read most of the file anyway

### Reader-Level: ParallelStoreReader

`ParallelStoreReader` uses chunk-based LRU caching:

```python
from obspec_utils.obspec import ParallelStoreReader

reader = ParallelStoreReader(
    store, "file.nc",
    chunk_size=256 * 1024,   # 256 KB chunks
    max_cached_chunks=64,    # Up to 64 chunks cached
)

# Chunks fetched on demand via get_ranges()
data = reader.read(1000)
```

**Characteristics:**

- Bounded memory usage: `chunk_size * max_cached_chunks`
- LRU eviction when cache is full
- Good for sparse/random access patterns
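With the settings shown above, for example, the cache is bounded at 256 KB × 64 = 16 MB.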
**When to use:**

- Large files with sparse access
- Memory-constrained environments
- Unknown access patterns

### Reader-Level: BufferedStoreReader

`BufferedStoreReader` provides read-ahead buffering for sequential access:

```python
from obspec_utils.obspec import BufferedStoreReader

reader = BufferedStoreReader(store, "file.nc", buffer_size=1024 * 1024)

# Sequential reads benefit from buffering
while chunk := reader.read(4096):
    process(chunk)  # placeholder for your per-chunk handling
```

**Characteristics:**

- Position-aware buffering (read-ahead)
- Best for sequential/streaming access
- Minimal memory overhead

**When to use:**

- Sequential file processing
- Streaming workloads
- When you won't revisit earlier data

### Store-Level: CachingReadableStore

`CachingReadableStore` wraps any store to cache full objects:

```python
from obspec_utils.cache import CachingReadableStore

# Wrap the store with caching
cached_store = CachingReadableStore(
    store,
    max_size=256 * 1024 * 1024,  # 256 MB cache
)

# All consumers share the same cache
data1 = cached_store.get("file1.nc")  # Fetched from network, cached
data2 = cached_store.get("file1.nc")  # Served from cache
```

**Characteristics:**

- Caches full objects (entire files)
- LRU eviction when `max_size` is exceeded
- Thread-safe (works with `ThreadPoolExecutor`)
- Shared across all consumers of the wrapped store

**When to use:**

- Multiple consumers reading the same files
- Repeated access to small-to-medium files
- When store-level sharing is beneficial

**Limitations:**

- Not shared across processes (Dask workers, `ProcessPoolExecutor`)
- Each process maintains its own cache
- Full-object granularity (not ideal for partial reads of large files)

## Distributed Considerations

### Threading (Shared Memory)

With `ThreadPoolExecutor` or similar:

- Store-level caching IS shared across threads
- All threads benefit from the same cache
- Thread-safe implementations are required (and provided)
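For instance, a minimal sketch of cache sharing across a thread pool, assuming `store` is an already-constructed obspec-compatible store and following the `CachingReadableStore` usage shown above:

```python
from concurrent.futures import ThreadPoolExecutor

from obspec_utils.cache import CachingReadableStore

# One wrapper instance, shared by every thread. `store` is assumed to be
# an already-constructed obspec-compatible store.
cached_store = CachingReadableStore(store, max_size=256 * 1024 * 1024)

paths = ["file1.nc", "file2.nc", "file1.nc"]  # note the repeated path

# The second request for "file1.nc" is served from the shared cache
# rather than refetched from the network.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(cached_store.get, paths))
```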
### Multi-Process (Separate Memory)

With `ProcessPoolExecutor`, Dask distributed, or Lithops:

- Each worker process has its own memory space
- Store wrappers are serialized and copied to each worker
- **Caches are NOT shared** across workers
- Each worker maintains an independent cache

This is typically acceptable when:

- Workloads are partitioned by file (each worker processes different files)
- The alternative (no caching) would be worse

For workloads requiring cross-worker cache sharing, consider:

- External caching (Redis, memcached)
- Shared filesystem caching
- Restructuring workloads to minimize cross-worker file access
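When per-worker caching is acceptable, one common pattern is to construct the wrapped store inside the worker function, so each process builds (and pays for) its own cache. A sketch, where `make_store()` is a hypothetical factory for your obspec-compatible store:

```python
from concurrent.futures import ProcessPoolExecutor

from obspec_utils.cache import CachingReadableStore

def process_file(path: str) -> str:
    # Runs inside a worker process: the store and its cache are private
    # to this process; nothing is shared with other workers.
    store = make_store()  # hypothetical factory, defined elsewhere
    cached = CachingReadableStore(store)
    data = cached.get(path)  # cached only within this worker
    ...  # parse/compute on `data`
    return path

if __name__ == "__main__":
    paths = ["a.nc", "b.nc", "c.nc"]
    with ProcessPoolExecutor() as pool:
        done = list(pool.map(process_file, paths))
```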
## Decision Guide

### Which reader should I use?

| Access Pattern | Recommended Reader |
|----------------|--------------------|
| Parse HDF5/NetCDF file | `EagerStoreReader` |
| Sequential streaming | `BufferedStoreReader` |
| Sparse random access | `ParallelStoreReader` |
| Unknown pattern, large file | `ParallelStoreReader` |
| Small file, repeated access | `EagerStoreReader` |

### Should I use store-level caching?

| Scenario | Recommendation |
|----------|----------------|
| Single-threaded, repeated file access | Yes, `CachingReadableStore` |
| Multi-threaded, shared files | Yes, `CachingReadableStore` |
| Distributed workers, partitioned by file | Optional (per-worker cache) |
| Distributed workers, shared files | Consider external caching |
| One-time file processing | No (use reader-level only) |

## Store Wrappers

### SplittingReadableStore

`SplittingReadableStore` accelerates `get()` by splitting large requests into parallel `get_ranges()` calls:

```python
from obspec_utils.splitting import SplittingReadableStore

fast_store = SplittingReadableStore(
    store,
    request_size=12 * 1024 * 1024,  # 12 MB per request
    max_concurrent_requests=18,
)
```

This extracts the parallel fetching logic from `EagerStoreReader` into a composable wrapper. It composes naturally with `CachingReadableStore`:

```python
from obstore.store import S3Store  # assuming an obstore-backed store; any obspec-compatible store works

from obspec_utils.splitting import SplittingReadableStore
from obspec_utils.cache import CachingReadableStore

# Compose: fast parallel fetches + caching
store = S3Store(bucket="my-bucket")
store = SplittingReadableStore(store)  # Split large fetches
store = CachingReadableStore(store)    # Cache results

# First get(): parallel fetch -> cache
# Second get(): served from cache
```

**Characteristics:**

- Only affects `get()` and `get_async()`; range requests pass through unchanged
- Requires `head()` support to determine the file size (falls back to a single request otherwise)
- Tuned for cloud storage (12 MB chunks, 18 concurrent requests by default)
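With those defaults, for example, a single 100 MB `get()` becomes nine range requests (eight full 12 MB ranges plus a 4 MB remainder), all issued concurrently, well under the 18-request cap.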
## Summary

| Layer | Scope | Lifetime | Use Case |
|-------|-------|----------|----------|
| Reader-level | Per-file, per-instance | Reader lifetime | Parsing phase |
| Store-level | Shared across consumers | Application lifetime | Data access phase |

The two-level architecture reflects the reality that:

1. **Parsing** benefits from isolated, temporary caching (reader-level)
2. **Data access** benefits from shared, persistent caching (store-level)
3. **Distributed settings** require understanding that in-memory caches are per-process

mkdocs.yml

Lines changed: 3 additions & 0 deletions
@@ -14,9 +14,12 @@ extra:

 nav:
   - "index.md"
+  - "Caching Architecture": "caching-architecture.md"
   - "API":
       - Typing: "api/typing.md"
       - Aiohttp Store Adapters: "api/aiohttp.md"
+      - Caching: "api/cache.md"
+      - Splitting: "api/splitting.md"
       - Obspec File Readers: "api/obspec.md"
       - Store Registries: "api/registry.md"
       - Tracing: "api/tracing.md"
