Description
Create the skill covering chunk size heuristics, formulas, decision trees, and real-world optimization case studies for the zarr-chunk-optimization plugin.
Directory: plugins/zarr-chunk-optimization/skills/chunk-strategy/
Research Reference
Full research document: .agents/research-zarr-chunk-optimization-and-zarr-plugin.md
Files to Create
1. SKILL.md (300+ lines)
Frontmatter:
name: chunk-strategy
description: |
Use this skill when the user asks to "choose chunk sizes", "optimize zarr chunking",
"determine chunk dimensions", "analyze access patterns for chunking", "calculate optimal
chunks", or needs guidance on Zarr chunk size selection, access pattern analysis, chunk
alignment with Dask, sharding strategy, or the trade-offs between temporal and spatial
chunking approaches.

Content must include:
- Quick Reference Card — chunk size formula table:

  | Metric | Value | Source |
  | --- | --- | --- |
  | Minimum uncompressed chunk | 1 MB | Zarr docs |
  | Optimal range (cloud) | 100 MB - 1 GB | Dask best practices |
  | S3 byte-range sweet spot | 8-16 MB | AWS S3 best practices |
  | Max task graph | 10K-100K chunks | Dask guidelines |
  | Parallelism target | chunks >= 2 * workers | Dask guidelines |
  | Total concurrency | dask_threads * zarr_async_concurrency | Zarr docs |
  | Dask alignment | Dask chunks = N * Zarr chunks | Dask docs |
  | Shard reduction | shard_volume / chunk_volume | Zarr v3 spec |

- Decision Tree for Chunk Strategy Selection — flowchart covering:
- Cloud vs local storage?
- Primary access pattern (temporal/spatial/mixed)?
- Using Dask? → alignment rules
- Zarr v3 available? → sharding option
- Data sparse? → write_empty_chunks=False
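The size thresholds above reduce to a few lines of arithmetic. A minimal sketch, using a hand-written dtype-size table (a real implementation would use numpy's `itemsize`; the helper name is illustrative):

```python
from math import prod

DTYPE_BYTES = {'uint8': 1, 'float32': 4, 'float64': 8}  # illustrative subset

def chunk_size_mb(chunk_shape, dtype):
    """Uncompressed size of one chunk in MB."""
    return prod(chunk_shape) * DTYPE_BYTES[dtype] / 2**20

# A (365, 100, 100) float32 chunk is ~13.9 MB uncompressed:
size = chunk_size_mb((365, 100, 100), 'float32')
meets_cloud_minimum = size >= 1      # reference card: >= 1 MB minimum for cloud
in_s3_sweet_spot = 8 <= size <= 16   # reference card: S3 byte-range sweet spot
```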
- Core Concepts:
- Chunk alignment with access patterns (the 63x performance evidence)
- The fundamental trade-off: time-series vs spatial access (Nguyen et al. 1405x/713x)
- Versatile middle-range strategies
- Why minimum 1 MB matters for cloud (HTTP request overhead 10-100ms)
- Sharding Section (Zarr v3):
- When to use: many small logical chunks needed but object count must be low
- Sizing: 100GB array / 1MB chunks = 100K objects; with 1GB shards = 100 objects
- Memory constraint: entire shard must fit in writer memory
- Code example:

  ```python
  import zarr

  z = zarr.create_array(
      store={},  # in-memory store; use a real store path in practice
      shape=(10000, 10000, 1000),
      shards=(1000, 1000, 1000),
      chunks=(100, 100, 100),
      dtype='uint8',
  )
  ```
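The object-count arithmetic in the sizing bullet above can be checked directly (the 100 GB / 1 MB / 1 GB figures come from that example):

```python
# Sizing example from above: 100 GB array, 1 MB chunks, 1 GB shards.
array_bytes = 100 * 2**30
chunk_bytes = 1 * 2**20
shard_bytes = 1 * 2**30

objects_without_shards = array_bytes // chunk_bytes  # 102,400 (~100K objects)
objects_with_shards = array_bytes // shard_bytes     # 100 objects
chunks_per_shard = shard_bytes // chunk_bytes        # shard_volume / chunk_volume = 1,024
```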
- Access Pattern Taxonomy with Recommendations:

  | Access Pattern | Chunk Strategy | Example |
  | --- | --- | --- |
  | Single lat/lon time-series | Maximize time dim, minimize spatial | (T, 10, 10) |
  | Spatial maps at single time | Minimize time dim, maximize spatial | (1, Y, X) |
  | Mixed access | Balanced across all dims | (30, 90, 180) |
  | Ensemble/scenario queries | Include scenario in fast dims | (1, 1, 2, 3, 1, 1) |

- Empty Chunk Optimization:
  - write_empty_chunks=False (default): skips fill-value chunks — benchmark: 0.25 s
  - write_empty_chunks=True: writes all chunks — benchmark: 0.48 s (nearly 2x slower for sparse data)
- Memory Layout:
  - config={'order': 'C'} (row-major) vs config={'order': 'F'} (column-major / Fortran)
  - Different layouts provide different compression ratios depending on the data's correlation structure
- Concurrency Configuration:

  ```python
  zarr.config.set({'async.concurrency': 128})  # Default: 10 (conservative).
  # Increase for cloud, decrease for local.
  # WARNING: total_concurrency = dask_threads * zarr_async_concurrency
  ```
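A worked example of that warning, assuming a hypothetical cluster with 8 Dask threads per worker:

```python
# Hypothetical Dask configuration: 8 threads per worker process.
dask_threads = 8
zarr_async_concurrency = 128  # the value set above

# Each Dask thread can have this many Zarr requests in flight at once,
# so the effective request concurrency multiplies:
total_concurrency = dask_threads * zarr_async_concurrency  # 1024

# For scale, AWS documents roughly 5,500 GET requests/s per S3 prefix,
# so totals in the thousands can approach throttling territory.
```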
- Links to references/PATTERNS.md, references/EXAMPLES.md, references/COMMON_ISSUES.md
2. assets/chunk-calculator.py
Python script that calculates recommended chunk sizes given:
- Inputs:
  - shape: tuple — array shape (e.g., (3650, 721, 1440))
  - dtype: string — data type (e.g., 'float32')
  - access_pattern: string — 'temporal', 'spatial', or 'balanced'
  - target_chunk_mb: float — target chunk size in MB (default: 100)
  - min_chunk_mb: float — minimum chunk size in MB (default: 1)
  - num_workers: int — number of Dask workers (default: 4)
- Outputs:
- Recommended chunk shape (tuple)
- Estimated chunk size in MB
- Estimated number of chunks
- Estimated task graph size
- Whether sharding is recommended
- Must be a working, runnable Python script using only numpy for calculations
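A minimal sketch of the calculator's core heuristic, under an assumed strategy (give the access pattern's preferred axes their full extent, then split the remaining element budget evenly across the other axes). The function name and DTYPE_BYTES table are illustrative; the real script should use numpy per the spec and also report chunk counts, task graph size, and sharding advice:

```python
from math import prod

# Illustrative dtype-size table; the real script can use numpy's itemsize.
DTYPE_BYTES = {'uint8': 1, 'int16': 2, 'float32': 4, 'float64': 8}

def recommend_chunks(shape, dtype='float32', access_pattern='balanced',
                     target_chunk_mb=100.0):
    """Hypothetical core heuristic: max out preferred axes, then split
    the remaining element budget evenly across the other axes."""
    budget = max(1, int(target_chunk_mb * 2**20 / DTYPE_BYTES[dtype]))
    n = len(shape)
    if access_pattern == 'temporal':
        full = {0}                  # time axis gets its full extent
    elif access_pattern == 'spatial':
        full = set(range(1, n))     # trailing (spatial) axes get full extent
    else:
        full = set()                # balanced: no axis is maxed out
    chunks = [shape[ax] if ax in full else 1 for ax in range(n)]
    rest = [ax for ax in range(n) if ax not in full]
    if rest:
        remaining = max(1, budget // prod(chunks))
        per_axis = max(1, int(remaining ** (1 / len(rest))))
        for ax in rest:
            chunks[ax] = min(shape[ax], per_axis)
    return tuple(chunks)

# ERA5-like (time, lat, lon) float32 array from the examples:
recommend_chunks((3650, 721, 1440), 'float32', 'temporal')  # -> (3650, 84, 84)
recommend_chunks((3650, 721, 1440), 'float32', 'balanced')  # -> (297, 297, 297)
```

Both results land just under the 100 MB target while matching the access pattern's preferred orientation.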
3. assets/chunk-decision-tree.md
ASCII/markdown decision tree in visual format:
Is this for cloud or local storage?
├── Cloud → Minimum 1 MB chunks, target 100 MB+
│ ├── Primary access pattern?
│ │ ├── Temporal (time-series at locations)
│ │ │ → Maximize time dimension, minimize spatial
│ │ ├── Spatial (maps at time steps)
│ │ │ → Minimize time dimension, maximize spatial
│ │ └── Mixed/Unknown
│ │ → Balanced chunks across all dimensions
│ ├── Using Dask?
│ │ ├── Yes → Ensure Dask chunks are integer multiples of Zarr chunks
│ │ │ → Watch total_concurrency = dask_threads * zarr_async_concurrency
│ │ └── No → Focus on Zarr-level chunk optimization
│ ├── Zarr v3 available?
│ │ ├── Yes → Consider sharding if need small logical chunks
│ │ └── No → Optimize chunk count directly
│ └── Data sparse?
│ ├── Yes → Set write_empty_chunks=False (2x write speedup)
│ └── No → Default behavior
└── Local → More flexible sizing, 1-10 MB often sufficient
4. references/PATTERNS.md
Must include 6+ patterns, each with: description, when to use, chunk formula, code example, trade-offs:
- Temporal-First Chunking — climate/weather time-series (e.g., (3650, 10, 10))
- Spatial-First Chunking — map generation, regional analysis (e.g., (1, 721, 1440))
- Balanced Spatio-Temporal — mixed workloads (e.g., (30, 90, 180))
- Ensemble/Scenario Chunking — multi-scenario datasets (e.g., (1, 1, 2, 3, 1, 1))
- Sharded Chunks (Zarr v3) — many small logical chunks within larger shards
- Dask-Aligned Chunking — ensuring Dask chunks are integer multiples of Zarr chunks
5. references/EXAMPLES.md
Must include 4+ complete case studies:
- Climate Dataset Optimization — shape (3650, 721, 1440), temporal vs spatial access, benchmarked results showing 63x difference
- Satellite Imagery — high-resolution spatial data, tile-aligned chunks
- Ensemble Weather Forecasts — scenario dimension chunking
- Pluvial Flooding Dataset — shape (4, 1, 6, 3, 6000, 6000), pinned-location queries, recommended chunks [1, 1, 2, 3, 1, 1]
Each example: dataset description, shape, access patterns, chunking strategy, rationale, code snippet.
6. references/COMMON_ISSUES.md
Must include 6+ issues:
- Chunks too small for cloud storage (< 1 MB) → excessive HTTP requests, 10-100ms overhead per request
- Chunks too large → excessive memory usage and unnecessary data transfer
- Chunk orientation mismatched with access pattern → orders of magnitude slower (up to 1405x)
- Dask chunks not aligned with Zarr chunks → redundant decompression
- Total concurrency overflow → cloud storage throttling (dask_threads * zarr_async_concurrency)
- Sparse data with write_empty_chunks=True → 2x slower writes
Each issue: symptoms, cause, solution with code, prevention.
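For the Dask-alignment issue, the solution section could include a helper along these lines (a sketch; the function name is hypothetical):

```python
from math import ceil

def align_dask_chunk(requested, zarr_chunk):
    """Round a requested Dask chunk length (along one axis) up to the
    nearest integer multiple of the Zarr chunk length, so a Dask task
    never reads a partial Zarr chunk and forces redundant decompression."""
    return max(1, ceil(requested / zarr_chunk)) * zarr_chunk

align_dask_chunk(250, 100)  # -> 300 (3 Zarr chunks per Dask chunk)
align_dask_chunk(100, 100)  # -> 100 (already aligned)
```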
Acceptance Criteria
- SKILL.md is 300+ lines with complete chunk sizing knowledge
- chunk-calculator.py is a working, runnable Python script
- chunk-decision-tree.md provides clear visual decision guidance
- PATTERNS.md covers 6+ distinct chunking patterns with code examples
- EXAMPLES.md has 4+ real-world case studies with concrete numbers
- COMMON_ISSUES.md covers 6+ common mistakes with solutions
- Follows the skill pattern from plugins/scientific-domain-applications/skills/xarray-for-multidimensional-data/
Dependencies
- Depends on feat: Create zarr-chunk-optimization plugin scaffold #61 (plugin scaffold)