Description
Create the cloud storage specialist agent for the zarr-data-format plugin. This agent provides expert guidance on integrating Zarr with AWS S3, Google Cloud Storage, and Azure Blob Storage.
File: plugins/zarr-data-format/agents/zarr-cloud-architect.md
Research Reference
Full research document: .agents/research-zarr-chunk-optimization-and-zarr-plugin.md
Agent Frontmatter
name: zarr-cloud-architect
description: |
Specialist in integrating Zarr with cloud object stores (AWS S3, Google Cloud Storage, Azure Blob Storage). Expert in storage backend selection (fsspec, obstore, Icechunk), authentication configuration, metadata consolidation for cloud performance, and cloud-specific Zarr optimization.
Use this agent when the user asks to "store zarr on S3", "read zarr from GCS", "configure azure blob for zarr", "set up cloud zarr store", "optimize zarr for cloud", "use obstore with zarr", "configure icechunk", or needs cloud-specific Zarr guidance.
<example>
Context: User needs to set up S3 access
user: "I need to read a public Zarr dataset from S3 and write processed results to my own S3 bucket"
assistant: "I'll use the zarr-cloud-architect to set up both anonymous read access and authenticated write access to S3."
<commentary>
Cloud Zarr access requires proper backend configuration, credentials, and potentially different stores for read vs write.
</commentary>
</example>
<example>
Context: User choosing between storage backends
user: "Should I use fsspec or obstore to access my Zarr data on GCS?"
assistant: "I'll invoke the zarr-cloud-architect to compare the backends based on your performance and compatibility requirements."
<commentary>
Backend selection involves trade-offs between performance (obstore/Rust), ecosystem maturity (fsspec), and feature needs.
</commentary>
</example>
<example>
Context: User needs versioned Zarr storage
user: "I need ACID transactions and version control for my Zarr data on S3"
assistant: "I'll use the zarr-cloud-architect to guide you through setting up Icechunk as your storage engine."
<commentary>
Icechunk provides versioning, ACID transactions, and time-travel for Zarr data on cloud stores.
</commentary>
</example>
model: inherit
color: green
skills:
- cloud-storage-backends
- zarr-fundamentals

Agent Body Content Requirements (500-800+ lines)
1. Purpose
Cloud storage integration specialist for Zarr, covering all major cloud providers and storage backends.
2. Storage Backend Expertise
fsspec Ecosystem:
- `s3fs` — AWS S3 (most mature, widely used)
- `gcsfs` — Google Cloud Storage
- `adlfs` — Azure Data Lake / Blob Storage
- `aiohttp` — HTTP/HTTPS read-only access
- `FsspecStore` — Zarr's wrapper for fsspec filesystems
- Caching protocols: `simplecache::`, `filecache::`, `blockcache::`
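To make the fsspec path concrete, a minimal sketch (the bucket URL is a placeholder, and the network-touching calls are commented out since they assume `zarr` and `s3fs` are installed):

```python
# Build the storage_options dict that zarr forwards to the fsspec backend
# (here s3fs). The bucket URL below is a placeholder, not a real dataset.

def fsspec_s3_options(anonymous: bool = True) -> dict:
    """storage_options for zarr.open_group(..., storage_options=...)."""
    return {"anon": anonymous}

opts = fsspec_s3_options(anonymous=True)

# Requires `pip install zarr s3fs` plus network access:
#   import zarr
#   root = zarr.open_group("s3://example-bucket/data.zarr", mode="r",
#                          storage_options=opts)
```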
obstore (Rust-based):
- Built on Apache Arrow's `object_store` crate
- `obstore.store.S3Store`, `GCSStore`, `AzureStore`
- `zarr.storage.ObjectStore(obstore_store)` — Zarr integration
- Performance: can fully saturate EC2↔S3 network bandwidth
- Smaller ecosystem than fsspec but significantly faster
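As an illustration of how the pieces line up, a small sketch mapping URL schemes to the obstore classes above (the helper is illustrative, not part of obstore's API; the zarr/obstore wiring is commented out and the bucket name is a placeholder):

```python
# Map a URL scheme to the obstore store class named above (illustrative
# helper, not part of obstore's API).

def obstore_class_for(url: str) -> str:
    scheme = url.split("://", 1)[0]
    return {"s3": "S3Store", "gs": "GCSStore", "az": "AzureStore"}[scheme]

# Real wiring, assuming `pip install zarr obstore` and network access:
#   from obstore.store import S3Store
#   import zarr
#   s3 = S3Store("example-bucket", prefix="data.zarr", skip_signature=True)
#   store = zarr.storage.ObjectStore(s3, read_only=True)
#   root = zarr.open_group(store, mode="r")
```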
Icechunk:
- Versioned storage engine for Zarr (Rust-based)
- ACID transactions for concurrent writes
- Time-travel: read data as of any previous version
- `IcechunkStore.open_or_create(storage=StorageConfig.s3_from_env(...))`
- Released 1.0 in July 2025
- Integrates with VirtualiZarr for zero-copy ingestion
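To make the versioning semantics concrete, a toy in-memory sketch of snapshot commits and time-travel reads (this is deliberately not Icechunk's API, just an illustration of the behavior it guarantees on real object stores):

```python
# Toy illustration of commit/time-travel semantics. NOT Icechunk's API;
# real use goes through IcechunkStore / StorageConfig as listed above.

class ToyVersionedStore:
    def __init__(self):
        self._live = {}          # mutable working state (key -> chunk bytes)
        self._snapshots = []     # committed versions, oldest first

    def put(self, key, value):
        self._live[key] = value

    def commit(self, message: str) -> int:
        """Freeze the current state; return its version id."""
        self._snapshots.append((message, dict(self._live)))
        return len(self._snapshots) - 1

    def read_at(self, version: int) -> dict:
        """Time-travel read: the store as of an earlier commit."""
        return dict(self._snapshots[version][1])

store = ToyVersionedStore()
store.put("temp/c/0/0", b"\x00" * 4)
v0 = store.commit("initial write")
store.put("temp/c/0/0", b"\xff" * 4)
v1 = store.commit("update chunk")
assert store.read_at(v0)["temp/c/0/0"] == b"\x00" * 4   # old version intact
```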
Backend Selection Matrix:
| Need | Recommended Backend |
|---|---|
| Maximum throughput | obstore |
| Ecosystem compatibility | fsspec (s3fs/gcsfs) |
| Versioning / ACID | Icechunk |
| Simple URL access | fsspec URL shorthand |
| Caching for repeated reads | fsspec with simplecache:: |
3. Cloud Provider Configuration
AWS S3:
- Credentials: AWS CLI profile, environment variables, IAM roles, anonymous
- Regions: `endpoint_url` for non-default regions
- Anonymous access: `storage_options={'anon': True}`
- fsspec: `s3fs.S3FileSystem(anon=True, region_name='us-east-1')`
- obstore: `obstore.store.S3Store(bucket, prefix=..., skip_signature=True)`
Google Cloud Storage:
- Authentication: service account JSON, application default credentials, anonymous
- Project ID required for authenticated access
- `gcsfs.GCSFileSystem(project='my-project', token='anon')` for anonymous
- `obstore.store.GCSStore(bucket, prefix=..., skip_signature=True)` for anonymous
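A minimal sketch of the gcsfs side (the project id and bucket are placeholders, and the zarr/gcsfs call is commented out since it requires network access):

```python
# gcsfs storage_options: the 'anon' token for public data, a project id for
# authenticated access (project name below is a placeholder).

def gcs_options(anonymous: bool, project: str = "my-project") -> dict:
    if anonymous:
        return {"token": "anon"}
    return {"project": project}  # credentials resolved via ADC / service account

# With `pip install zarr gcsfs`:
#   import zarr
#   g = zarr.open_group("gs://example-bucket/data.zarr", mode="r",
#                       storage_options=gcs_options(anonymous=True))
```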
Azure Blob Storage:
- Connection strings, SAS tokens, managed identity, anonymous
- `adlfs.AzureBlobFileSystem(account_name='...', account_key='...')`
- Container + blob path structure
4. Performance Optimization
- Metadata consolidation — critical for cloud (reduces N metadata reads to 1):
  ```python
  zarr.consolidate_metadata(store)
  root = zarr.open_consolidated(store)  # v2
  ```
  Note: Not yet in v3 spec but functionally useful
- Concurrency tuning — per cloud provider:
  ```python
  zarr.config.set({'async.concurrency': 128})
  ```
  - S3: 64-128 typically optimal
  - GCS: 32-64 typically optimal
  - Azure: 32-64 typically optimal
- Caching layers — for repeated reads:
  ```python
  g = zarr.open_group(
      "simplecache::s3://bucket/data.zarr",
      storage_options={"s3": {"anon": True}},
  )
  ```
5. Cloud Data Catalogs
- Microsoft Planetary Computer
- AWS Registry of Open Data (filter by "Zarr")
- Google Cloud marketplace datasets
- Pangeo-Forge Data Catalog
- How to discover and access public Zarr datasets
6. Security Patterns
- IAM roles and policies for S3
- Service accounts for GCS
- Managed identity for Azure
- Pre-signed URLs for temporary access
- Cross-region access considerations and costs
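For the pre-signed URL pattern, a sketch (boto3's `generate_presigned_url` is the real API; the chunk-key helper and bucket name are illustrative assumptions, and the boto3 call is commented out since it needs credentials):

```python
# Build the object key for one Zarr v3 chunk ("c" plus "/"-separated coords,
# the default chunk key encoding), then pre-sign a GET for it.

def chunk_key(array_path: str, coords: tuple[int, ...]) -> str:
    """Illustrative helper: object key of a single chunk."""
    return f"{array_path}/c/" + "/".join(str(c) for c in coords)

key = chunk_key("data.zarr/temperature", (0, 0))

# With `pip install boto3` and AWS credentials configured:
#   import boto3
#   s3 = boto3.client("s3")
#   url = s3.generate_presigned_url(
#       "get_object",
#       Params={"Bucket": "my-bucket", "Key": key},
#       ExpiresIn=3600,  # temporary by design: link expires after an hour
#   )
```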
Acceptance Criteria
- Agent file is 500-800+ lines
- Covers AWS S3, GCS, and Azure Blob in depth with configuration examples
- Compares fsspec vs obstore vs Icechunk with clear guidance on when to use each
- Includes authentication patterns for all three cloud providers
- Covers metadata consolidation for cloud performance
- Documents concurrency tuning per cloud provider
- Includes caching layer configuration
- Follows the agent pattern from existing plugins
Dependencies
- Depends on feat: Create zarr-data-format plugin scaffold #67 (plugin scaffold)