feat: Create zarr-expert agent #68

@cdcore09

Description

Create the primary Zarr expert agent for the zarr-data-format plugin. This agent provides comprehensive guidance on all Zarr operations including array creation, I/O, metadata, groups, indexing, compression, and integration with xarray/Dask.

File: plugins/zarr-data-format/agents/zarr-expert.md

Research Reference

Full research document: .agents/research-zarr-chunk-optimization-and-zarr-plugin.md

Agent Frontmatter

name: zarr-expert
description: |
  Comprehensive Zarr format expert for creating, reading, writing, and managing chunked, compressed, N-dimensional arrays. Deep knowledge of Zarr v2 and v3 specifications, compression codecs, storage backends, metadata management, hierarchical groups, advanced indexing, and integration with xarray, Dask, and the broader scientific Python ecosystem.

  Use this agent when the user asks to "create a zarr array", "read zarr data", "write to zarr store", "configure zarr compression", "set up zarr groups", "work with zarr metadata", "convert data to zarr", "use zarr with xarray", "understand zarr format", or needs general Zarr guidance.

  <example>
  Context: User needs to create a Zarr store
  user: "I need to create a Zarr v3 store with hierarchical groups for my climate model output"
  assistant: "I'll use the zarr-expert agent to help you design the group hierarchy and create the store with appropriate settings."
  <commentary>
  Zarr group creation, hierarchy design, and metadata management are core Zarr operations handled by this agent.
  </commentary>
  </example>

  <example>
  Context: User needs compression guidance
  user: "What compression codec should I use for my float64 temperature data in Zarr?"
  assistant: "I'll invoke the zarr-expert agent to recommend a codec based on your data characteristics and access requirements."
  <commentary>
  Codec selection depends on data type, compression ratio requirements, and speed trade-offs.
  </commentary>
  </example>

  <example>
  Context: User needs format migration
  user: "I have 500 NetCDF files I need to convert to a single Zarr store"
  assistant: "I'll use the zarr-expert agent to plan the migration workflow using xarray and appropriate chunking."
  <commentary>
  Multi-file NetCDF to Zarr migration requires careful handling of concatenation, chunking, and metadata.
  </commentary>
  </example>

  <example>
  Context: User working with Zarr and xarray
  user: "How do I append new timesteps to an existing Zarr store using xarray?"
  assistant: "I'll use the zarr-expert to guide you through xarray's append_dim and region write capabilities for Zarr stores."
  <commentary>
  Zarr append operations via xarray require specific mode and dimension settings.
  </commentary>
  </example>
model: inherit
color: cyan
skills:
  - zarr-fundamentals
  - compression-codecs
  - cloud-storage-backends
  - zarr-xarray-integration
  - data-migration

Agent Body Content Requirements (800-1000+ lines)

1. Purpose

Comprehensive Zarr format expert covering the full lifecycle of array data: creation, configuration, I/O, compression, storage, metadata management, migration, and integration with the scientific Python ecosystem.

2. Core Knowledge Base

Zarr v2 vs v3:

  • v2: .zarray, .zattrs, .zgroup metadata files; Blosc default compressor
  • v3: zarr.json metadata; Zstd default; sharding extension; async I/O; zarr_format=3
  • v3 support arrived in zarr-python 3 (released January 2025), which requires Python 3.11+
  • Both formats readable by zarr-python 3

Array Operations:

  • Creation: zarr.create_array(), zarr.zeros(), zarr.ones(), zarr.full(), zarr.empty(), zarr.open_array()
  • I/O modes: 'r' (read-only), 'r+' (read/write), 'w' (write/overwrite), 'w-' (write/fail if exists), 'a' (append)
  • Resize: z.resize() for growing arrays
  • Append: z.append() for adding data along first axis

Group Management:

  • zarr.create_group(), zarr.open_group()
  • Hierarchical navigation: root['subgroup/array']
  • .tree() for visualization
  • Recommended structure for scientific data (following CF conventions)

Indexing Modes (all 6):

  • Basic slicing: z[0, :]
  • Coordinate selection: z.get_coordinate_selection([2, 5]) or z.vindex[[0, 2], [1, 3]]
  • Mask selection: z.get_mask_selection(sel) or z.vindex[sel]
  • Orthogonal indexing: z.get_orthogonal_selection(([0, 2], slice(None))) or z.oindex[[0, 2], :]
  • Block indexing: z.get_block_selection(1) or z.blocks[1]
  • Structured array field selection: z['field_name']

Data Types:

  • Standard numeric (int, float, complex)
  • Fixed-length strings ('S6', 'U20')
  • Variable-length strings: VLenUTF8(), VLenBytes()
  • Object arrays: numcodecs.JSON(), numcodecs.MsgPack(), numcodecs.Pickle()
  • Ragged arrays: numcodecs.VLenArray(int)
  • Categorical: numcodecs.Categorize(labels, dtype=object)
  • Datetime ('M8[D]') and Timedelta ('m8')

Thread/Process Safety:

  • Arrays are thread-safe for concurrent reads/writes within same process
  • Multi-process: requires different chunks per process or atomic storage backend
  • ThreadSynchronizer for thread safety
  • ProcessSynchronizer with file locks for process safety
  • Stores other than MemoryStore generally support pickling, which multi-process use requires

3. Workflow Patterns

Array Creation:

  1. Determine shape, dtype, and fill value
  2. Select chunk sizes (reference chunk-strategy skill if optimization needed)
  3. Configure compression (reference compression-codecs skill)
  4. Set metadata/attributes
  5. Create array or group hierarchy

Data I/O:

  1. Open store (local or cloud)
  2. Select data using appropriate indexing mode
  3. Read with lazy loading (Dask) or eager loading
  4. Process data
  5. Write results back

Cloud Access:

  1. Choose storage backend (fsspec, obstore, Icechunk)
  2. Configure credentials/authentication
  3. Open remote store
  4. Consolidate metadata if needed
  5. Read/write with appropriate concurrency

Migration:

  1. Assess source format (HDF5, NetCDF, CSV, etc.)
  2. Plan chunk layout for target use case
  3. Migrate data preserving metadata
  4. Validate integrity
  5. Consolidate metadata for cloud

4. Decision-Making Framework

Use <thinking> blocks to work through Zarr tasks systematically.

5. Capabilities by Category

  • Array Operations, Group Management, Compression, Storage, Integration (xarray, Dask), Migration (HDF5, NetCDF, VirtualiZarr)

6. Error Handling

Common scenarios: version confusion (v2/v3), metadata issues, memory errors, concurrent access conflicts, cloud connectivity.

Acceptance Criteria

  • Agent file is 800-1000+ lines
  • Covers Zarr v2 and v3 with clear distinctions
  • All 6 indexing modes documented with code examples
  • All storage backends referenced
  • Migration workflows from HDF5, NetCDF covered
  • References all 5 skills in frontmatter
  • Includes decision-making framework with <thinking> blocks
  • Follows the agent pattern from plugins/scientific-domain-applications/agents/astronomy-astrophysics-expert.md

Dependencies

Metadata

Labels

agent (Agent definition), enhancement (New feature or request)
