feat: Create zarr-cloud-architect agent #69

@cdcore09

Description

Create the cloud storage specialist agent for the zarr-data-format plugin. This agent provides expert guidance on integrating Zarr with AWS S3, Google Cloud Storage, and Azure Blob Storage.

File: plugins/zarr-data-format/agents/zarr-cloud-architect.md

Research Reference

Full research document: .agents/research-zarr-chunk-optimization-and-zarr-plugin.md

Agent Frontmatter

name: zarr-cloud-architect
description: |
  Specialist in integrating Zarr with cloud object stores (AWS S3, Google Cloud Storage, Azure Blob Storage). Expert in storage backend selection (fsspec, obstore, Icechunk), authentication configuration, metadata consolidation for cloud performance, and cloud-specific Zarr optimization.

  Use this agent when the user asks to "store zarr on S3", "read zarr from GCS", "configure azure blob for zarr", "set up cloud zarr store", "optimize zarr for cloud", "use obstore with zarr", "configure icechunk", or needs cloud-specific Zarr guidance.

  <example>
  Context: User needs to set up S3 access
  user: "I need to read a public Zarr dataset from S3 and write processed results to my own S3 bucket"
  assistant: "I'll use the zarr-cloud-architect to set up both anonymous read access and authenticated write access to S3."
  <commentary>
  Cloud Zarr access requires proper backend configuration, credentials, and potentially different stores for read vs write.
  </commentary>
  </example>

  <example>
  Context: User choosing between storage backends
  user: "Should I use fsspec or obstore to access my Zarr data on GCS?"
  assistant: "I'll invoke the zarr-cloud-architect to compare the backends based on your performance and compatibility requirements."
  <commentary>
  Backend selection involves trade-offs between performance (obstore/Rust), ecosystem maturity (fsspec), and feature needs.
  </commentary>
  </example>

  <example>
  Context: User needs versioned Zarr storage
  user: "I need ACID transactions and version control for my Zarr data on S3"
  assistant: "I'll use the zarr-cloud-architect to guide you through setting up Icechunk as your storage engine."
  <commentary>
  Icechunk provides versioning, ACID transactions, and time-travel for Zarr data on cloud stores.
  </commentary>
  </example>
model: inherit
color: green
skills:
  - cloud-storage-backends
  - zarr-fundamentals

Agent Body Content Requirements (500-800+ lines)

1. Purpose

Cloud storage integration specialist for Zarr, covering all major cloud providers and storage backends.

2. Storage Backend Expertise

fsspec Ecosystem:

  • s3fs — AWS S3 (most mature, widely used)
  • gcsfs — Google Cloud Storage
  • adlfs — Azure Data Lake / Blob Storage
  • fsspec's built-in HTTP filesystem (requires aiohttp): HTTP/HTTPS read-only access
  • FsspecStore — Zarr's wrapper for fsspec filesystems
  • Caching protocols: simplecache::, filecache::, blockcache::
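
The caching protocols listed above are applied by chaining them in front of the remote URL, with options keyed by protocol name. A minimal sketch, assuming fsspec's URL-chaining convention; the helper names `chain_cache` and `chained_storage_options`, the bucket path, and the cache directory are hypothetical and only illustrate the shape of the arguments:

```python
# Sketch only: builds the URL and options dict, performs no network access.
CACHE_PROTOCOLS = ("simplecache", "filecache", "blockcache")


def chain_cache(url, cache="simplecache"):
    """Prefix a remote URL with one of fsspec's caching protocols."""
    if cache not in CACHE_PROTOCOLS:
        raise ValueError(f"unknown cache protocol: {cache!r}")
    return f"{cache}::{url}"


def chained_storage_options(remote_protocol, remote_options,
                            cache="simplecache", cache_dir=None):
    """Build per-protocol options for a chained URL.

    Each layer of the chain receives its own dict, keyed by protocol name.
    cache_dir (hypothetical path) maps to fsspec's `cache_storage` option.
    """
    options = {remote_protocol: dict(remote_options)}
    if cache_dir is not None:
        options[cache] = {"cache_storage": cache_dir}
    return options


url = chain_cache("s3://bucket/data.zarr")
options = chained_storage_options("s3", {"anon": True},
                                  cache_dir="/tmp/zarr-cache")
print(url)
print(options)
```

The resulting `url` and `options` are what would be passed as the store URL and `storage_options` when opening a group.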

obstore (Rust-based):

  • Built on Apache Arrow's object_store crate
  • obstore.store.S3Store, GCSStore, AzureStore
  • zarr.storage.ObjectStore(obstore_store): Zarr integration (zarr-python 3)
  • Performance: can fully saturate EC2↔S3 network bandwidth
  • Smaller ecosystem than fsspec but significantly faster
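
The obstore path above can be sketched as follows. This assumes zarr-python 3 with `zarr.storage.ObjectStore` and an installed `obstore` package; the helper names, bucket, prefix, and region are hypothetical. The library imports are deferred into the function so that the argument-building part can be checked without either package installed:

```python
def obstore_s3_kwargs(bucket, prefix, region, anonymous=False):
    """Collect keyword arguments for obstore.store.S3Store."""
    kwargs = {"bucket": bucket, "prefix": prefix, "region": region}
    if anonymous:
        # Public buckets: send unsigned requests instead of looking up credentials.
        kwargs["skip_signature"] = True
    return kwargs


def open_group_with_obstore(kwargs):
    """Open a Zarr group read-only through obstore (needs network access)."""
    # Deferred imports: only required when the function is actually called.
    import zarr
    from obstore.store import S3Store
    from zarr.storage import ObjectStore

    s3 = S3Store(**kwargs)
    return zarr.open_group(store=ObjectStore(s3, read_only=True), mode="r")


kwargs = obstore_s3_kwargs("my-bucket", "path/to/data.zarr", "us-east-1",
                           anonymous=True)
print(kwargs)
```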

Icechunk:

  • Versioned storage engine for Zarr (Rust-based)
  • ACID transactions for concurrent writes
  • Time-travel: read data as of any previous version
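  • icechunk.Repository.open_or_create(icechunk.s3_storage(bucket=..., prefix=..., from_env=True)), then repo.writable_session("main").store as the Zarr store (Icechunk 1.0 API; the pre-1.0 IcechunkStore.open_or_create(storage=StorageConfig.s3_from_env(...)) form is outdated)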
  • Released 1.0 in July 2025
  • Integrates with VirtualiZarr for zero-copy ingestion
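
The versioning and time-travel workflow above might look like the sketch below. It assumes the Icechunk 1.0 Python API (`Repository`, sessions, `commit`) together with zarr-python 3, and should be checked against the installed Icechunk version; the function name, array name, and commit message are hypothetical. Imports are deferred so the block loads without either package:

```python
def versioned_write_sketch(bucket, prefix):
    """Write one array in a transaction, then re-read it at that snapshot."""
    # Deferred imports: icechunk and zarr are only needed when this runs.
    import icechunk
    import zarr

    # Credentials are taken from the environment (assumption: from_env=True).
    storage = icechunk.s3_storage(bucket=bucket, prefix=prefix, from_env=True)
    repo = icechunk.Repository.open_or_create(storage)

    # All writes happen inside a session on a branch and become visible
    # to other readers only once the session is committed.
    session = repo.writable_session("main")
    root = zarr.group(store=session.store)
    arr = root.create_array("temperature", shape=(10,), dtype="float32",
                            chunks=(5,))
    arr[:] = 1.0
    snapshot_id = session.commit("add temperature array")

    # Time travel: open a read-only session pinned to that snapshot.
    old = repo.readonly_session(snapshot_id=snapshot_id)
    return snapshot_id, zarr.open_group(store=old.store, mode="r")
```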

Backend Selection Matrix:

| Need | Recommended Backend |
|---|---|
| Maximum throughput | obstore |
| Ecosystem compatibility | fsspec (s3fs/gcsfs) |
| Versioning / ACID | Icechunk |
| Simple URL access | fsspec URL shorthand |
| Caching for repeated reads | fsspec with simplecache:: |
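
The matrix above could be encoded in the agent's guidance as a simple lookup. A minimal sketch; `BACKEND_MATRIX` and `recommend_backend` are hypothetical names, and the entries are copied from the matrix, not derived from benchmarks:

```python
# Need -> recommended backend, taken verbatim from the selection matrix.
BACKEND_MATRIX = {
    "maximum throughput": "obstore",
    "ecosystem compatibility": "fsspec (s3fs/gcsfs)",
    "versioning / acid": "Icechunk",
    "simple url access": "fsspec URL shorthand",
    "caching for repeated reads": "fsspec with simplecache::",
}


def recommend_backend(need):
    """Return the recommended backend for a stated need (case-insensitive)."""
    key = need.strip().lower()
    if key not in BACKEND_MATRIX:
        known = ", ".join(sorted(BACKEND_MATRIX))
        raise ValueError(f"unknown need {need!r}; expected one of: {known}")
    return BACKEND_MATRIX[key]


print(recommend_backend("Versioning / ACID"))
```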

3. Cloud Provider Configuration

AWS S3:

  • Credentials: AWS CLI profile, environment variables, IAM roles, anonymous
  • Regions: set region_name for the bucket's region; endpoint_url is for S3-compatible or custom endpoints
  • Anonymous access: storage_options={'anon': True}
  • fsspec: s3fs.S3FileSystem(anon=True, client_kwargs={'region_name': 'us-east-1'})
  • obstore: obstore.store.S3Store(bucket, prefix=..., skip_signature=True)
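
For the common case of reading a public dataset and writing to a private bucket, the two sets of s3fs options differ only in `anon`. A minimal sketch, assuming s3fs accepts the region through `client_kwargs`; the helper names and bucket URLs are hypothetical, and `zarr` is imported lazily so the option-building part runs on its own:

```python
def s3fs_options(anon=False, region=None, endpoint_url=None):
    """Build storage_options for s3fs.

    With anon=False, s3fs falls back to the usual AWS credential chain
    (environment variables, CLI profile, IAM role).
    """
    options = {"anon": anon}
    if region is not None:
        options["client_kwargs"] = {"region_name": region}
    if endpoint_url is not None:
        # Only needed for S3-compatible services or custom endpoints.
        options["endpoint_url"] = endpoint_url
    return options


def open_s3_group(url, options, mode="r"):
    """Open a Zarr group on S3 (requires zarr and s3fs, plus network access)."""
    import zarr

    return zarr.open_group(url, mode=mode, storage_options=options)


read_opts = s3fs_options(anon=True, region="us-east-1")   # public source
write_opts = s3fs_options(anon=False, region="us-east-1")  # own bucket
print(read_opts)
print(write_opts)
```

Usage would then be along the lines of `open_s3_group("s3://public-bucket/data.zarr", read_opts)` for the source and `open_s3_group("s3://my-bucket/out.zarr", write_opts, mode="a")` for the destination.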

Google Cloud Storage:

  • Authentication: service account JSON, application default credentials, anonymous
  • Project ID requirement for authenticated access
  • gcsfs.GCSFileSystem(token='anon') for anonymous access (token=None instead tries default credentials first)
  • obstore.store.GCSStore(bucket, prefix=..., skip_signature=True) for anonymous
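
The three GCS authentication modes above map onto gcsfs's `token` argument. A minimal sketch; `gcsfs_options`, the mode names, the project ID, and the key-file path are hypothetical:

```python
def gcsfs_options(mode, project=None, key_file=None):
    """Build storage_options for gcsfs.

    mode: "anonymous", "default" (application default credentials),
          or "service_account" (path to a JSON key file).
    """
    if mode == "anonymous":
        # No project is needed for anonymous reads of public buckets.
        return {"token": "anon"}
    if mode == "default":
        token = "google_default"
    elif mode == "service_account":
        if key_file is None:
            raise ValueError("service_account mode requires key_file")
        token = key_file
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    options = {"token": token}
    if project is not None:
        options["project"] = project
    return options


print(gcsfs_options("anonymous"))
print(gcsfs_options("default", project="my-project"))
```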

Azure Blob Storage:

  • Connection strings, SAS tokens, managed identity, anonymous
  • adlfs.AzureBlobFileSystem(account_name='...', account_key='...')
  • Container + blob path structure
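
The Azure options above can be collected the same way. A minimal sketch for adlfs covering account key, SAS token, connection string, and anonymous access; managed identity is left out because it needs a credential object from the Azure SDK. The helper names and all account, container, and path values are hypothetical:

```python
def adlfs_options(method, account_name=None, secret=None):
    """Build storage_options for adlfs.AzureBlobFileSystem."""
    if method == "connection_string":
        if secret is None:
            raise ValueError("connection_string mode requires secret")
        # A connection string already embeds the account name.
        return {"connection_string": secret}
    if account_name is None:
        raise ValueError(f"{method} mode requires account_name")
    if method == "anonymous":
        return {"account_name": account_name, "anon": True}
    if method in ("account_key", "sas_token"):
        if secret is None:
            raise ValueError(f"{method} mode requires secret")
        return {"account_name": account_name, method: secret}
    raise ValueError(f"unknown method: {method!r}")


def azure_url(container, path):
    """Container + blob path, using the abfs:// scheme registered by adlfs."""
    return f"abfs://{container}/{path.lstrip('/')}"


print(azure_url("my-container", "datasets/data.zarr"))
print(adlfs_options("anonymous", account_name="myaccount"))
```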

4. Performance Optimization

  • Metadata consolidation — critical for cloud (reduces N metadata reads to 1):

    zarr.consolidate_metadata(store)
    root = zarr.open_consolidated(store)  # v2
    • Note: Not yet in v3 spec but functionally useful
  • Concurrency tuning — per cloud provider:

    zarr.config.set({'async.concurrency': 128})
    • S3: 64-128 typically optimal
    • GCS: 32-64 typically optimal
    • Azure: 32-64 typically optimal
  • Caching layers — for repeated reads:

    g = zarr.open_group("simplecache::s3://bucket/data.zarr",
                        storage_options={"s3": {"anon": True}})
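
The per-provider concurrency ranges above could be turned into a starting value that is then tuned by measurement. A minimal sketch, assuming zarr-python 3's `zarr.config`; `CONCURRENCY_RANGES`, `starting_concurrency`, and `apply_concurrency` are hypothetical names, and picking the midpoint of each range is an arbitrary choice, not a recommendation from the source:

```python
# (low, high) ranges copied from the list above.
CONCURRENCY_RANGES = {
    "s3": (64, 128),
    "gcs": (32, 64),
    "azure": (32, 64),
}


def starting_concurrency(provider):
    """Return the midpoint of the provider's range as a first guess."""
    try:
        low, high = CONCURRENCY_RANGES[provider.lower()]
    except KeyError:
        raise ValueError(f"unknown provider: {provider!r}") from None
    return (low + high) // 2


def apply_concurrency(provider):
    """Set zarr's async concurrency for the given provider (requires zarr 3)."""
    import zarr

    value = starting_concurrency(provider)
    zarr.config.set({"async.concurrency": value})
    return value


for name in CONCURRENCY_RANGES:
    print(name, starting_concurrency(name))
```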

5. Cloud Data Catalogs

  • Microsoft Planetary Computer
  • AWS Registry of Open Data (filter by "Zarr")
  • Google Cloud marketplace datasets
  • Pangeo-Forge Data Catalog
  • How to discover and access public Zarr datasets

6. Security Patterns

  • IAM roles and policies for S3
  • Service accounts for GCS
  • Managed identity for Azure
  • Pre-signed URLs for temporary access
  • Cross-region access considerations and costs
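
For the IAM item above, a least-privilege read-only policy for a single Zarr store needs only object reads under the store's prefix plus listing restricted to that prefix. A minimal sketch that builds such a policy document; `read_only_policy`, the bucket name, and the prefix are hypothetical:

```python
import json


def read_only_policy(bucket, prefix):
    """Build an IAM policy allowing read-only access to one Zarr prefix."""
    prefix = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is a bucket-level action, so it is scoped
                # to the prefix with a condition instead of a resource path.
                "Sid": "ListZarrPrefix",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                # Chunk and metadata reads are object-level actions.
                "Sid": "ReadZarrObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}/*"],
            },
        ],
    }


policy = read_only_policy("my-bucket", "/datasets/data.zarr/")
print(json.dumps(policy, indent=2))
```

A write-capable variant would add `s3:PutObject` and `s3:DeleteObject` to the second statement.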

Acceptance Criteria

  • Agent file is 500-800+ lines
  • Covers AWS S3, GCS, and Azure Blob in depth with configuration examples
  • Compares fsspec vs obstore vs Icechunk with clear guidance on when to use each
  • Includes authentication patterns for all three cloud providers
  • Covers metadata consolidation for cloud performance
  • Documents concurrency tuning per cloud provider
  • Includes caching layer configuration
  • Follows the agent pattern from existing plugins

Dependencies

Labels: agent (Agent definition), enhancement (New feature or request)
