feat: Create zarr-fundamentals skill #70

@cdcore09

Description

Create the core Zarr operations skill covering array creation, I/O, metadata, groups, indexing, data types, and synchronization for the zarr-data-format plugin.

Directory: plugins/zarr-data-format/skills/zarr-fundamentals/

Research Reference

Full research document: .agents/research-zarr-chunk-optimization-and-zarr-plugin.md

Files to Create

1. SKILL.md (400+ lines)

Frontmatter:

name: zarr-fundamentals
description: |
  Use this skill when the user asks to "create a zarr array", "open a zarr store",
  "read zarr data", "write zarr arrays", "manage zarr groups", "set zarr metadata",
  "use zarr indexing", "understand zarr format", "work with zarr v3", or needs guidance
  on core Zarr operations including array creation, hierarchical groups, metadata/attributes,
  advanced indexing modes, data types, thread/process safety, and Zarr v2 vs v3 differences.

Content must include:

  • Quick Reference: Essential Imports and Operations

    import zarr
    import numpy as np
    
    # Create array (v3)
    z = zarr.create_array(store="data.zarr", shape=(10000, 10000),
                          chunks=(1000, 1000), dtype='float32', zarr_format=3)
    
    # Open existing
    z = zarr.open_array("data.zarr", mode='r')
    
    # Create group hierarchy (in its own store: mode='w' deletes whatever is already at the path)
    root = zarr.open_group("climate.zarr", mode='w')
    grp = root.create_group("temperature")
    arr = grp.create_array("t2m", shape=(365, 721, 1440), chunks=(30, 90, 180),
                           dtype='float32')
    
    # Metadata
    arr.attrs['units'] = 'K'
    arr.attrs['standard_name'] = 'air_temperature'
    
    # Inspect
    print(z.info)         # Quick summary
    print(root.tree())    # Group hierarchy (zarr-python 3 needs the optional 'rich' package for this)
  • Installation:

    # Using pixi (recommended)
    pixi add zarr numpy numcodecs
    # Using pip (optional extras such as "zarr[remote]" add cloud-storage dependencies)
    pip install zarr numpy numcodecs
  • Zarr v2 vs v3 Differences:

    | Feature               | v2                        | v3                               |
    |-----------------------|---------------------------|----------------------------------|
    | Metadata files        | .zarray, .zattrs, .zgroup | zarr.json                        |
    | Default compressor    | Blosc                     | Zstd                             |
    | Sharding              | Not available             | Supported                        |
    | I/O model             | Synchronous               | Async (asyncio)                  |
    | Python requirement    | 3.8+                      | 3.11+                            |
    | Format parameter      | zarr_format=2             | zarr_format=3                    |
    | Consolidated metadata | Supported                 | Not in spec (functionally works) |
    | Store API             | Legacy store classes      | New Store ABC                    |
  • Array Creation: All creation functions with parameters:

    • zarr.create_array() — primary creation function
    • zarr.zeros(), zarr.ones(), zarr.full(), zarr.empty()
    • zarr.open_array() — open existing or create new
    • Key parameters: shape, chunks, dtype, fill_value, compressor/compressors, shards, zarr_format
  • Group Management:

    • zarr.create_group(), zarr.open_group()
    • Nested navigation: root['subgroup/array']
    • .tree() for visualization
    • Recommended hierarchy for scientific data
  • Metadata and Attributes:

    • .attrs dictionary interface
    • CF conventions for scientific data (long_name, units, standard_name, coordinates, grid_mapping)
    • .info for quick summary
    • .info_complete() for detailed metadata (slow for large arrays)
  • All 6 Indexing Modes with Examples:

    1. Basic slicing: z[0, :], z[10:20, :]
    2. Coordinate selection: z.get_coordinate_selection([2, 5]) or z.vindex[[0, 2], [1, 3]]
    3. Mask selection: z.get_mask_selection(mask_array) or z.vindex[boolean_mask]
    4. Orthogonal indexing: z.get_orthogonal_selection(([0, 2], slice(None))) or z.oindex[[0, 2], :]
    5. Block indexing: z.get_block_selection(1) or z.blocks[1] (chunk-aligned)
    6. Structured field selection: z['field_name'], z.get_coordinate_selection([0, 2], fields=['foo'])
  • Supported Data Types:

    • Standard numeric: int8-64, uint8-64, float16-64, complex64-128
    • Fixed-length strings: 'S6', 'U20'
    • Variable-length: VLenUTF8(), VLenBytes()
    • Object arrays: numcodecs.JSON(), MsgPack(), Pickle()
    • Ragged arrays: VLenArray(int)
    • Categorical: Categorize(labels, dtype=object)
    • Datetime/Timedelta: 'M8[D]', 'm8'
  • Thread/Process Safety:

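    # zarr-python 2.x API: the synchronizer classes live in zarr.sync,
    # which is not part of zarr-python 3
    # Thread-safe
    z = zarr.open_array('data.zarr', synchronizer=zarr.ThreadSynchronizer())
    # Process-safe (file locking; requires the fasteners package)
    sync = zarr.ProcessSynchronizer('data.sync')
    z = zarr.open_array('data.zarr', synchronizer=sync)
    • Concurrent reads are safe without locking, as are concurrent writes that each touch different chunks
    • Threads that may write to the same chunk need a ThreadSynchronizer
    • Multi-process writes require a ProcessSynchronizer unless each process writes to its own set of chunks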
  • Sharding (v3):

    z = zarr.create_array(store={}, shape=(10000, 10000, 1000),
                          shards=(1000, 1000, 1000),
                          chunks=(100, 100, 100), dtype='uint8')
    • Shards group multiple chunks into a single storage object
    • A shard is the minimum unit of writing, while a chunk remains the minimum unit of reading
  • v2 → v3 Migration Notes

2. assets/zarr-quickstart.py

Complete quickstart script demonstrating:

  • Creating a Zarr v3 array
  • Writing data
  • Reading data back
  • Creating group hierarchy
  • Setting metadata
  • Inspecting with .info and .tree()
  • Basic and advanced indexing
  • Compression configuration

3. references/PATTERNS.md (6+ patterns)

  1. Creating a Hierarchical Scientific Data Store — group hierarchy with CF-compliant metadata
  2. Opening and Reading Remote Zarr Data — URL access patterns
  3. Appending Data to Existing Arrays — resize + write, append_dim via xarray
  4. Advanced Indexing Patterns — orthogonal, block, coordinate selection use cases
  5. Using Shards in Zarr v3 — configuration and trade-offs
  6. Concurrent Access with Synchronizers — thread and process safety patterns

4. references/EXAMPLES.md (4+ examples)

  1. Creating a Scientific Dataset from Scratch — climate-like data with dimensions, coords, metadata
  2. Reading and Querying a Remote Zarr Store — public data from Pangeo/AWS
  3. Building a Hierarchical Store with Groups — multi-variable, multi-level
  4. Working with Structured and Ragged Arrays — custom dtypes, variable-length data

5. references/COMMON_ISSUES.md (5+ issues)

  1. Zarr v2 vs v3 API confusion — common API differences and how to handle
  2. Metadata not persisting — need explicit .attrs assignment
  3. Memory errors with large arrays — chunking not configured
  4. Concurrent write corruption — missing synchronizer
  5. .info_complete() slow on large arrays — use .info for quick checks

Acceptance Criteria

  • SKILL.md is 400+ lines with comprehensive Zarr fundamentals
  • Covers both v2 and v3 APIs with clear distinctions
  • All 6 indexing modes documented with complete code examples
  • All data types documented
  • Thread/process safety documented
  • Quickstart script works end-to-end
  • Follows the skill pattern from plugins/scientific-domain-applications/skills/xarray-for-multidimensional-data/

Dependencies

Labels

enhancement (New feature or request), skill (Skill creation or modification)
