
feat: Create chunk-strategy skill #64

@cdcore09

Description

Create the skill covering chunk size heuristics, formulas, decision trees, and real-world optimization case studies for the zarr-chunk-optimization plugin.

Directory: plugins/zarr-chunk-optimization/skills/chunk-strategy/

Research Reference

Full research document: .agents/research-zarr-chunk-optimization-and-zarr-plugin.md

Files to Create

1. SKILL.md (300+ lines)

Frontmatter:

name: chunk-strategy
description: |
  Use this skill when the user asks to "choose chunk sizes", "optimize zarr chunking",
  "determine chunk dimensions", "analyze access patterns for chunking", "calculate optimal
  chunks", or needs guidance on Zarr chunk size selection, access pattern analysis, chunk
  alignment with Dask, sharding strategy, or the trade-offs between temporal and spatial
  chunking approaches.

Content must include:

  • Quick Reference Card — chunk size formula table:

    | Metric | Value | Source |
    |---|---|---|
    | Minimum uncompressed chunk | 1 MB | Zarr docs |
    | Optimal range (cloud) | 100 MB - 1 GB | Dask best practices |
    | S3 byte-range sweet spot | 8-16 MB | AWS S3 best practices |
    | Max task graph | 10K-100K chunks | Dask guidelines |
    | Parallelism target | chunks >= 2 * workers | Dask guidelines |
    | Total concurrency | dask_threads * zarr_async_concurrency | Zarr docs |
    | Dask alignment | Dask chunks = N * Zarr chunks | Dask docs |
    | Shard reduction | shard_volume / chunk_volume | Zarr v3 spec |
  • Decision Tree for Chunk Strategy Selection — flowchart covering:

    • Cloud vs local storage?
    • Primary access pattern (temporal/spatial/mixed)?
    • Using Dask? → alignment rules
    • Zarr v3 available? → sharding option
    • Data sparse? → write_empty_chunks=False
  • Core Concepts:

    • Chunk alignment with access patterns (the 63x performance evidence)
    • The fundamental trade-off: time-series vs spatial access (Nguyen et al. 1405x/713x)
    • Versatile middle-range strategies
    • Why minimum 1 MB matters for cloud (HTTP request overhead 10-100ms)
  • Sharding Section (Zarr v3):

    • When to use: many small logical chunks needed but object count must be low
    • Sizing: 100GB array / 1MB chunks = 100K objects; with 1GB shards = 100 objects
    • Memory constraint: entire shard must fit in writer memory
    • Code example:
      z = zarr.create_array(store={}, shape=(10000, 10000, 1000),
                            shards=(1000, 1000, 1000),
                            chunks=(100, 100, 100), dtype='uint8')
  • Access Pattern Taxonomy with Recommendations:

    | Access Pattern | Chunk Strategy | Example |
    |---|---|---|
    | Single lat/lon time-series | Maximize time dim, minimize spatial | (T, 10, 10) |
    | Spatial maps at single time | Minimize time dim, maximize spatial | (1, Y, X) |
    | Mixed access | Balanced across all dims | (30, 90, 180) |
    | Ensemble/scenario queries | Include scenario in fast dims | (1, 1, 2, 3, 1, 1) |
  • Empty Chunk Optimization:

    • write_empty_chunks=False (default): skips fill-value chunks — benchmark: 0.25s
    • write_empty_chunks=True: writes all chunks — benchmark: 0.48s (nearly 2x slower for sparse data)
  • Memory Layout:

    • config={'order': 'C'} (row-major) vs config={'order': 'F'} (column-major / Fortran)
    • Different layouts provide different compression ratios depending on data correlation structure
  • Concurrency Configuration:

    zarr.config.set({'async.concurrency': 128})
    # Default: 10 (conservative). Increase for cloud, decrease for local.
    # WARNING: total_concurrency = dask_threads * zarr_async_concurrency
  • Links to references/PATTERNS.md, references/EXAMPLES.md, references/COMMON_ISSUES.md
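The quick-reference numbers above can be sanity-checked in a few lines of numpy. A minimal sketch (`chunk_size_mb` is a hypothetical helper name, not part of the skill spec):

```python
import numpy as np

def chunk_size_mb(chunk_shape, dtype):
    """Uncompressed size of one chunk in MB (element count * itemsize)."""
    return np.prod(chunk_shape) * np.dtype(dtype).itemsize / 1e6

# Temporal-first chunk for a 10-year daily dataset: (3650, 10, 10) float32
mb = chunk_size_mb((3650, 10, 10), "float32")
assert mb >= 1.0, "below the 1 MB cloud minimum"
print(f"{mb:.2f} MB per chunk")  # → 1.46 MB per chunk
```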

2. assets/chunk-calculator.py

Python script that calculates recommended chunk sizes given:

  • Inputs:
    • shape: tuple — array shape (e.g., (3650, 721, 1440))
    • dtype: string — data type (e.g., 'float32')
    • access_pattern: string — 'temporal', 'spatial', or 'balanced'
    • target_chunk_mb: float — target chunk size in MB (default: 100)
    • min_chunk_mb: float — minimum chunk size in MB (default: 1)
    • num_workers: int — number of Dask workers (default: 4)
  • Outputs:
    • Recommended chunk shape (tuple)
    • Estimated chunk size in MB
    • Estimated number of chunks
    • Estimated task graph size
    • Whether sharding is recommended
  • Must be a working, runnable Python script using only numpy for calculations
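One possible shape for the calculator's core logic, as a sketch only: it assumes a 3-D (time, y, x) array with time on axis 0, and the name `recommend_chunks` is illustrative. The real script should add CLI argument parsing and input validation on top of this.

```python
import math
import numpy as np

def recommend_chunks(shape, dtype, access_pattern="balanced",
                     target_chunk_mb=100.0, min_chunk_mb=1.0, num_workers=4):
    """Return (chunk_shape, chunk_mb, n_chunks, sharding_recommended)."""
    itemsize = np.dtype(dtype).itemsize
    target_elems = target_chunk_mb * 1e6 / itemsize

    if access_pattern == "temporal":
        # Full time axis; spend the remaining budget on a square spatial tile.
        t = shape[0]
        side = max(1, int(math.sqrt(target_elems / t)))
        chunks = (t, min(shape[1], side), min(shape[2], side))
    elif access_pattern == "spatial":
        # Full spatial plane; thin time axis.
        plane = shape[1] * shape[2]
        t = max(1, int(target_elems // plane))
        chunks = (min(shape[0], t), shape[1], shape[2])
    else:  # balanced: scale every axis by the same factor
        f = (target_elems / math.prod(shape)) ** (1 / len(shape))
        chunks = tuple(max(1, min(s, round(s * f))) for s in shape)

    chunk_mb = math.prod(chunks) * itemsize / 1e6
    n_chunks = math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))
    sharding = n_chunks > 100_000  # too many objects for object storage
    if n_chunks < 2 * num_workers:
        print(f"warning: only {n_chunks} chunks for {num_workers} workers")
    if chunk_mb < min_chunk_mb:
        print(f"warning: {chunk_mb:.2f} MB is below the {min_chunk_mb} MB floor")
    return chunks, chunk_mb, n_chunks, sharding
```

For the ERA5-like shape from the issue, `recommend_chunks((3650, 721, 1440), "float32", "temporal")` yields chunks of (3650, 82, 82) at roughly 98 MB each, 162 chunks total.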

3. assets/chunk-decision-tree.md

A visual decision tree rendered in ASCII within markdown:

Is this for cloud or local storage?
├── Cloud → Minimum 1 MB chunks, target 100 MB+
│   ├── Primary access pattern?
│   │   ├── Temporal (time-series at locations)
│   │   │   → Maximize time dimension, minimize spatial
│   │   ├── Spatial (maps at time steps)
│   │   │   → Minimize time dimension, maximize spatial
│   │   └── Mixed/Unknown
│   │       → Balanced chunks across all dimensions
│   ├── Using Dask?
│   │   ├── Yes → Ensure Dask chunks are integer multiples of Zarr chunks
│   │   │       → Watch total_concurrency = dask_threads * zarr_async_concurrency
│   │   └── No → Focus on Zarr-level chunk optimization
│   ├── Zarr v3 available?
│   │   ├── Yes → Consider sharding if need small logical chunks
│   │   └── No → Optimize chunk count directly
│   └── Data sparse?
│       ├── Yes → Set write_empty_chunks=False (2x write speedup)
│       └── No → Default behavior
└── Local → More flexible sizing, 1-10 MB often sufficient

4. references/PATTERNS.md

Must include 6+ patterns, each with: description, when to use, chunk formula, code example, trade-offs:

  1. Temporal-First Chunking — climate/weather time-series (e.g., (3650, 10, 10))
  2. Spatial-First Chunking — map generation, regional analysis (e.g., (1, 721, 1440))
  3. Balanced Spatio-Temporal — mixed workloads (e.g., (30, 90, 180))
  4. Ensemble/Scenario Chunking — multi-scenario datasets (e.g., (1, 1, 2, 3, 1, 1))
  5. Sharded Chunks (Zarr v3) — many small logical chunks within larger shards
  6. Dask-Aligned Chunking — ensuring Dask chunks are multiples of Zarr chunks
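Pattern 6 reduces to a divisibility check: a Dask chunk that is not an integer multiple of the underlying Zarr chunk straddles Zarr chunk boundaries and forces redundant reads and decompression. A sketch with hypothetical helper names:

```python
def dask_aligned(zarr_chunks, multiple):
    """Build Dask chunks as an integer multiple of the Zarr chunks per axis."""
    return tuple(z * m for z, m in zip(zarr_chunks, multiple))

def is_aligned(dask_chunks, zarr_chunks):
    """True if every Dask chunk edge falls on a Zarr chunk boundary."""
    return all(d % z == 0 for d, z in zip(dask_chunks, zarr_chunks))

zarr_chunks = (100, 100, 100)
dask_chunks = dask_aligned(zarr_chunks, (5, 5, 5))   # (500, 500, 500)
assert is_aligned(dask_chunks, zarr_chunks)
assert not is_aligned((150, 100, 100), zarr_chunks)  # 150 straddles two Zarr chunks
```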

5. references/EXAMPLES.md

Must include 4+ complete case studies:

  1. Climate Dataset Optimization — shape (3650, 721, 1440), temporal vs spatial access, benchmarked results showing 63x difference
  2. Satellite Imagery — high-resolution spatial data, tile-aligned chunks
  3. Ensemble Weather Forecasts — scenario dimension chunking
  4. Pluvial Flooding Dataset — shape (4, 1, 6, 3, 6000, 6000), pinned-location queries, recommended chunks [1, 1, 2, 3, 1, 1]

Each example: dataset description, shape, access patterns, chunking strategy, rationale, code snippet.

6. references/COMMON_ISSUES.md

Must include 6+ issues:

  1. Chunks too small for cloud storage (< 1 MB) → excessive HTTP requests, 10-100ms overhead per request
  2. Chunks too large → excessive memory usage and unnecessary data transfer
  3. Chunk orientation mismatched with access pattern → orders of magnitude slower (up to 1405x)
  4. Dask chunks not aligned with Zarr chunks → redundant decompression
  5. Total concurrency overflow → cloud storage throttling (dask_threads * zarr_async_concurrency)
  6. Sparse data with write_empty_chunks=True → 2x slower writes

Each issue: symptoms, cause, solution with code, prevention.
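Issue 6 can be illustrated without touching storage at all: the win from `write_empty_chunks=False` is simply that all-fill-value chunks are detected and never written. A pure-numpy sketch of that detection (`blocks_to_write` is a hypothetical name, not a Zarr API):

```python
import numpy as np

def blocks_to_write(arr, chunk, fill_value=0):
    """Count 2-D chunks that contain data vs. total chunks in the grid."""
    written = total = 0
    for i in range(0, arr.shape[0], chunk):
        for j in range(0, arr.shape[1], chunk):
            total += 1
            if np.any(arr[i:i + chunk, j:j + chunk] != fill_value):
                written += 1
    return written, total

sparse = np.zeros((400, 400), dtype="float32")
sparse[:100, :100] = 1.0              # data only in one corner
print(blocks_to_write(sparse, 100))   # → (1, 16): 15 of 16 writes skipped
```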

Acceptance Criteria

  • SKILL.md is 300+ lines with complete chunk sizing knowledge
  • chunk-calculator.py is a working, runnable Python script
  • chunk-decision-tree.md provides clear visual decision guidance
  • PATTERNS.md covers 6+ distinct chunking patterns with code examples
  • EXAMPLES.md has 4+ real-world case studies with concrete numbers
  • COMMON_ISSUES.md covers 6+ common mistakes with solutions
  • Follows the skill pattern from plugins/scientific-domain-applications/skills/xarray-for-multidimensional-data/

Dependencies

Metadata

Labels

enhancement (New feature or request), skill (Skill creation or modification)
