# Caching Architecture

This document describes the caching architecture in obspec-utils and the design decisions behind it, and offers guidance on when to use different caching strategies.

## Overview

obspec-utils provides caching at two levels:

1. **Reader-level caching**: Temporary, scoped to a single file reader's lifetime
2. **Store-level caching**: Shared across all consumers of a store

These serve different phases of a typical VirtualiZarr workflow and have different trade-offs.

## Two-Phase Workflow

When working with virtual datasets (e.g., via VirtualiZarr), data access happens in two distinct phases:

### Phase 1: Parsing (Reader-Level)

```
┌─────────────────────────────────────────────────────────────┐
│                        PARSING PHASE                        │
│                   (per-file, short-lived)                   │
│                                                             │
│   open_virtual_dataset()                                    │
│         │                                                   │
│         ▼                                                   │
│   Reader ──────► Store ──────► Network                      │
│      │                                                      │
│      └── Reader-level caching                               │
│          - Caches full file for HDF5/NetCDF parsing         │
│          - Released when reader closes                      │
│          - Isolated per reader instance                     │
└─────────────────────────────────────────────────────────────┘
```

During parsing:

- A reader opens the file and parses headers/metadata
- The file may be read multiple times during parsing (HDF5 structure traversal)
- Once parsing completes, the reader closes and its cache is released
- Result: a virtual dataset with chunk *references* (not actual data)

**Reader-level caching is appropriate here because:**

- File metadata is read once per file, then discarded
- Cache lifecycle matches reader lifecycle (automatic cleanup)
- No cross-contamination between different files being parsed
- Memory is freed immediately when parsing completes
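
A minimal sketch of this lifecycle, relying on the context-manager support described for `EagerStoreReader` below; `parse_hdf5_metadata()` is a hypothetical placeholder for whatever parser consumes the file-like reader:

```python
from obspec_utils.obspec import EagerStoreReader

# The cache lives exactly as long as the reader.
with EagerStoreReader(store, "data/file.nc") as reader:
    # HDF5 structure traversal: many seeks and reads, all from memory.
    refs = parse_hdf5_metadata(reader)
# Reader closed here; the cached bytes have been released.
```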

### Phase 2: Data Access (Store-Level)

```
┌─────────────────────────────────────────────────────────────┐
│                      DATA ACCESS PHASE                      │
│                    (shared, long-lived)                     │
│                                                             │
│   xarray computation / .load() / .compute()                 │
│         │                                                   │
│         ▼                                                   │
│   ManifestStore ──────► Store ──────► Network               │
│         │                                                   │
│         └── Store-level caching                             │
│             - Shared across calls                           │
│             - Persists across                               │
│               different consumers                           │
└─────────────────────────────────────────────────────────────┘
```

During data access:

- Computations trigger reads of actual chunk data
- The same chunks may be accessed multiple times (overlapping operations, retries)
- The cache should persist and be shared across different access patterns

**Store-level caching is appropriate here because:**

- Chunks may be re-read by different computations
- The cache should be shared across all consumers of the store
- Lifecycle is independent of any single reader
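
A minimal sketch of the sharing pattern, assuming only the `get()` behavior described under `CachingReadableStore` below (how the wrapped store is threaded into a manifest-backed store depends on your VirtualiZarr version):

```python
from obspec_utils.cache import CachingReadableStore

# Wrap the underlying store once; every consumer shares this cache.
cached_store = CachingReadableStore(store)

# Hand cached_store to whatever reads chunk data. Reads that repeat
# across computations are then served from memory, not the network.
chunk = cached_store.get("chunks/0.0.0")  # path is illustrative
```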

## Caching Implementations

### Reader-Level: EagerStoreReader

`EagerStoreReader` loads the entire file into memory on construction:

```python
from obspec_utils.obspec import EagerStoreReader

# File is fully loaded into memory
reader = EagerStoreReader(store, "file.nc")

# All reads served from memory
data = reader.read(1000)
reader.seek(0)
more_data = reader.read(500)

# Cache released when reader closes
reader.close()
```

**Characteristics:**

- Fetches the file using parallel `get_ranges()` requests for speed
- Caches the bytes in a `BytesIO` buffer
- Cache is isolated to this reader instance
- Memory freed on `close()` or context manager exit

**When to use:**

- Parsing HDF5/NetCDF files (need random access during parsing)
- Small-to-medium files that fit in memory
- When you'll read most of the file anyway

### Reader-Level: ParallelStoreReader

`ParallelStoreReader` uses chunk-based LRU caching:

```python
from obspec_utils.obspec import ParallelStoreReader

reader = ParallelStoreReader(
    store, "file.nc",
    chunk_size=256 * 1024,   # 256 KB chunks
    max_cached_chunks=64,    # Up to 64 chunks cached
)

# Chunks fetched on demand via get_ranges()
data = reader.read(1000)
```

**Characteristics:**

- Bounded memory usage: `chunk_size * max_cached_chunks`
- LRU eviction when cache is full
- Good for sparse/random access patterns
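
With the settings above, for example, the cache is bounded at 64 × 256 KB = 16 MB, regardless of file size.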

**When to use:**

- Large files with sparse access
- Memory-constrained environments
- Unknown access patterns

### Reader-Level: BufferedStoreReader

`BufferedStoreReader` provides read-ahead buffering for sequential access:

```python
from obspec_utils.obspec import BufferedStoreReader

reader = BufferedStoreReader(store, "file.nc", buffer_size=1024 * 1024)

# Sequential reads benefit from buffering
while chunk := reader.read(4096):
    process(chunk)  # process() is a placeholder for your own handler
```

**Characteristics:**

- Position-aware buffering (read-ahead)
- Best for sequential/streaming access
- Minimal memory overhead

**When to use:**

- Sequential file processing
- Streaming workloads
- When you won't revisit earlier data

### Store-Level: CachingReadableStore

`CachingReadableStore` wraps any store to cache full objects:

```python
from obspec_utils.cache import CachingReadableStore

# Wrap the store with caching
cached_store = CachingReadableStore(
    store,
    max_size=256 * 1024 * 1024,  # 256 MB cache
)

# All consumers share the same cache
data1 = cached_store.get("file1.nc")  # Fetched from network, cached
data2 = cached_store.get("file1.nc")  # Served from cache
```

**Characteristics:**

- Caches full objects (entire files)
- LRU eviction when `max_size` exceeded
- Thread-safe (works with `ThreadPoolExecutor`)
- Shared across all consumers of the wrapped store
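
Because the cache is thread-safe, one wrapped store can serve a whole thread pool. A minimal sketch (the `fetch` helper and paths are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

from obspec_utils.cache import CachingReadableStore

cached_store = CachingReadableStore(store, max_size=256 * 1024 * 1024)

def fetch(path):
    # Concurrent get() calls from different threads share one LRU cache.
    return cached_store.get(path)

with ThreadPoolExecutor(max_workers=8) as pool:
    # "a.nc" hits the network once; the repeated request is a cache hit.
    results = list(pool.map(fetch, ["a.nc", "b.nc", "a.nc"]))
```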

**When to use:**

- Multiple consumers reading the same files
- Repeated access to small-to-medium files
- When store-level sharing is beneficial

**Limitations:**

- Not shared across processes (Dask workers, ProcessPoolExecutor)
- Each process maintains its own cache
- Full-object granularity (not ideal for partial reads of large files)

## Distributed Considerations

### Threading (Shared Memory)

With `ThreadPoolExecutor` or similar:

- Store-level caching IS shared across threads
- All threads benefit from the same cache
- Thread-safe implementations are required (obspec-utils provides them)

### Multi-Process (Separate Memory)

With `ProcessPoolExecutor`, Dask distributed, or Lithops:

- Each worker process has its own memory space
- Store wrappers are serialized and copied to each worker
- **Caches are NOT shared** across workers
- Each worker maintains an independent cache

This is typically acceptable when:

- Workloads are partitioned by file (each worker processes different files)
- The alternative (no caching) would be worse
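
For the partitioned-by-file case, one pattern is to construct the cached store inside each worker, which makes the per-process cache explicit. A sketch, where `make_store()`, `parse()`, and `paths` are hypothetical placeholders:

```python
from concurrent.futures import ProcessPoolExecutor

from obspec_utils.cache import CachingReadableStore

def process_file(path):
    # Built inside the worker: in-memory caches cannot cross process
    # boundaries, so each process wraps its own store.
    cached_store = CachingReadableStore(make_store())
    return parse(cached_store.get(path))

with ProcessPoolExecutor() as pool:
    # Partitioned by file: each worker caches only the files it touches.
    results = list(pool.map(process_file, paths))
```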

For workloads requiring cross-worker cache sharing, consider:

- External caching (Redis, memcached)
- Shared filesystem caching
- Restructuring workloads to minimize cross-worker file access

## Decision Guide

### Which reader should I use?

| Access Pattern | Recommended Reader |
|---------------|-------------------|
| Parse HDF5/NetCDF file | `EagerStoreReader` |
| Sequential streaming | `BufferedStoreReader` |
| Sparse random access | `ParallelStoreReader` |
| Unknown pattern, large file | `ParallelStoreReader` |
| Small file, repeated access | `EagerStoreReader` |

### Should I use store-level caching?

| Scenario | Recommendation |
|----------|---------------|
| Single-threaded, repeated file access | Yes, `CachingReadableStore` |
| Multi-threaded, shared files | Yes, `CachingReadableStore` |
| Distributed workers, partitioned by file | Optional (per-worker cache) |
| Distributed workers, shared files | Consider external caching |
| One-time file processing | No (use reader-level only) |

## Store Wrappers

### SplittingReadableStore

`SplittingReadableStore` accelerates `get()` by splitting large requests into parallel `get_ranges()`:

```python
from obspec_utils.splitting import SplittingReadableStore

fast_store = SplittingReadableStore(
    store,
    request_size=12 * 1024 * 1024,  # 12 MB per request
    max_concurrent_requests=18,
)
```

This extracts the parallel fetching logic from `EagerStoreReader` into a composable wrapper. It composes naturally with `CachingReadableStore`:

```python
from obstore.store import S3Store  # assuming obstore supplies the underlying store

from obspec_utils.cache import CachingReadableStore
from obspec_utils.splitting import SplittingReadableStore

# Compose: fast parallel fetches + caching
store = S3Store(bucket="my-bucket")
store = SplittingReadableStore(store)  # Split large fetches
store = CachingReadableStore(store)    # Cache results

# First get(): parallel fetch -> cache
# Second get(): served from cache
```

**Characteristics:**

- Only affects `get()` and `get_async()`; range requests pass through unchanged
- Requires `head()` support to determine file size (falls back to a single request otherwise)
- Tuned for cloud storage (12 MB chunks, 18 concurrent requests by default)

## Summary

| Layer | Scope | Lifetime | Use Case |
|-------|-------|----------|----------|
| Reader-level | Per-file, per-instance | Reader lifetime | Parsing phase |
| Store-level | Shared across consumers | Application lifetime | Data access phase |

The two-level architecture reflects the reality that:

1. **Parsing** benefits from isolated, temporary caching (reader-level)
2. **Data access** benefits from shared, persistent caching (store-level)
3. **Distributed settings** require understanding that in-memory caches are per-process
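
Putting the phases together, a hedged end-to-end sketch (`parse_metadata()` and `compute()` stand in for a VirtualiZarr-style parser and computation):

```python
from obspec_utils.cache import CachingReadableStore
from obspec_utils.obspec import EagerStoreReader

# Phase 1: reader-level cache, scoped to parsing a single file.
with EagerStoreReader(store, "data/file.nc") as reader:
    virtual_ds = parse_metadata(reader)  # chunk references only

# Phase 2: store-level cache, shared by every consumer of chunk data.
cached_store = CachingReadableStore(store)
result = compute(virtual_ds, cached_store)  # repeated chunk reads hit the cache
```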