Description
Create the skill covering data migration from HDF5/NetCDF to Zarr, VirtualiZarr for zero-copy access, and validation workflows for the zarr-data-format plugin.
Directory: plugins/zarr-data-format/skills/data-migration/
Research Reference
Full research document: .agents/research-zarr-chunk-optimization-and-zarr-plugin.md
Files to Create
1. SKILL.md (300+ lines)
Frontmatter:

```yaml
name: data-migration
description: |
  Use this skill when the user asks to "convert HDF5 to zarr", "migrate NetCDF to zarr",
  "use VirtualiZarr", "copy zarr data", "move data to cloud zarr", "convert legacy data
  to zarr", or needs guidance on data migration from HDF5/NetCDF/other formats to Zarr,
  using zarr copy operations, VirtualiZarr for zero-copy ingestion, Icechunk integration,
  or validating data integrity after migration.
```

Content must include:
- Quick Reference: Migration Options

| Source Format | Method | Rechunking? |
|---|---|---|
| HDF5 → Zarr | `zarr.copy()` or h5py + zarr | Yes (via rechunker) |
| NetCDF → Zarr | `xr.open_dataset().to_zarr()` | Yes (via encoding) |
| Multi-file NetCDF → Zarr | `xr.open_mfdataset().to_zarr()` | Yes (via encoding) |
| HDF5/NetCDF → Virtual Zarr | VirtualiZarr | No (zero-copy) |
| Zarr → Zarr (different chunks) | rechunker | Yes |
| Zarr v2 → Zarr v3 | `zarr.copy()` with `zarr_format=3` | Optional |
- Zarr Copy Operations:

```python
import zarr

# Individual array copy (decompresses/recompresses)
zarr.copy(source_array, dest_group)

# Group-level copy (all arrays in group)
zarr.copy_all(source_group, dest_group)

# Store-level copy (no decompression/recompression; fastest)
zarr.copy_store(source_store, dest_store)
```

- `copy()` / `copy_all()`: decompress and recompress; can change codecs and chunks via kwargs
- `copy_store()`: binary copy; fastest, but chunks and codecs cannot change
- Use `copy_store()` for format migration without chunk changes; use `rechunker` for actual rechunking
- HDF5 → Zarr Migration:

```python
import h5py
import zarr
import xarray as xr

# Direct copy (maintains HDF5 chunk layout)
source = h5py.File('data.h5', 'r')
dest = zarr.open_group('data.zarr', mode='w')
zarr.copy_all(source, dest)

# With rechunking (via xarray)
ds = xr.open_dataset('data.h5', engine='h5netcdf', chunks={})
ds.to_zarr('data.zarr', encoding={'temp': {'chunks': (365, 90, 180)}})
```
- NetCDF → Zarr Migration:

```python
# Single file
ds = xr.open_dataset('data.nc', chunks={})
ds.to_zarr('data.zarr', encoding={...})

# Multi-file (concatenating along time)
ds = xr.open_mfdataset('data_*.nc', chunks={}, combine='by_coords')
ds.to_zarr('combined.zarr', encoding={...})
```
- VirtualiZarr (Zero-Copy Ingestion):

```python
from virtualizarr import open_virtual_dataset
import xarray as xr

# Create virtual dataset from legacy files
vds = open_virtual_dataset('data.nc', indexes={})

# Or multiple files
vds_list = [open_virtual_dataset(f, indexes={}) for f in file_list]
combined = xr.combine_by_coords(vds_list)

# Persist to Icechunk for cloud-optimized access
combined.virtualize.to_icechunk(icechunk_store)

# Then read as regular Zarr
ds = xr.open_zarr(icechunk_store)
```
- What it does: Creates metadata-only representation with byte-range references to original files
- Benefit: No data duplication, cloud-optimized access patterns
- Limitation: Original files must remain accessible
- Best for: Large archival collections that can't be fully converted
- Icechunk Integration:

```python
from icechunk import IcechunkStore, StorageConfig

# Create Icechunk store on S3
store = IcechunkStore.open_or_create(
    storage=StorageConfig.s3_from_env("bucket", "prefix"),
)

# VirtualiZarr → Icechunk (zero-copy ingestion)
vds.virtualize.to_icechunk(store)

# Benefits:
# - ACID transactions for concurrent writes
# - Version history / time-travel
# - Zero-copy from archival formats
```
- Validation After Migration:

```python
import numpy as np
import xarray as xr

# 1. Shape verification
assert source.shape == target.shape

# 2. Random sample comparison
for _ in range(10):
    idx = tuple(np.random.randint(0, s) for s in source.shape)
    assert source[idx] == target[idx]

# 3. Metadata comparison
for key in source.attrs:
    assert target.attrs[key] == source.attrs[key]

# 4. Coordinate verification (xarray)
xr.testing.assert_identical(source_ds, target_ds)
```
- Large-Scale Migration Strategies:
- Chunked migration: process dimension-by-dimension
- Parallel migration: use Dask cluster for concurrent conversion
- Incremental migration: convert subsets over time
- Hybrid approach: VirtualiZarr for immediate access + gradual physical conversion
- Handling Common Source Format Issues:
- HDF5 compression filters not supported in Zarr → must decompress and recompress
- NetCDF unlimited dimensions → map to resizable Zarr dimensions
- Missing CF metadata → add during migration
- Fill values / missing data → configure `fill_value` in Zarr
2. assets/migration-template.py
Complete migration script template:
- Supports source formats: HDF5 (.h5, .hdf5), NetCDF (.nc, .nc4)
- Configurable target: local path or cloud store (S3/GCS)
- Configurable target chunks (or auto-calculate)
- Configurable compression (default: Zstd)
- Validates data integrity after migration (10 random sample checks)
- Reports: source format, source size, target size, compression ratio, elapsed time
- Handles errors gracefully
- Clean, well-commented, production-ready
3. references/PATTERNS.md (6+ patterns)
- Single HDF5 → Local Zarr — simple h5py + zarr.copy_all
- Multi-file NetCDF → Cloud Zarr — xr.open_mfdataset → to_zarr on S3
- VirtualiZarr Zero-Copy — create virtual dataset, persist to Icechunk
- Zarr v2 → v3 Format Migration — copy with format upgrade
- Zarr → Zarr Cloud Migration — local Zarr → S3/GCS with copy_store
- Incremental Migration — converting large archives in batches
4. references/EXAMPLES.md (4+ examples)
- Converting a Climate HDF5 Archive to Cloud Zarr — full workflow with rechunking
- Multi-file NetCDF → Single Zarr Store — concatenation + cloud write
- VirtualiZarr + Icechunk Pipeline — zero-copy ingestion of 100+ NetCDF files
- Validating a Large Migration — comprehensive validation beyond random sampling
5. references/COMMON_ISSUES.md (6+ issues)
- HDF5 compression filter incompatibility → must decompress and recompress
- Multi-file dimension conflicts → ensure consistent dimensions across files
- Memory overflow during migration → use chunked reads with Dask
- Missing coordinates after migration → explicitly include coords in xarray
- VirtualiZarr original files moved/deleted → references break, need re-virtualization
- Metadata loss during copy_store → use copy_all for metadata preservation
Acceptance Criteria
- SKILL.md is 300+ lines covering all migration pathways
- All zarr copy operations documented (copy, copy_all, copy_store)
- HDF5 and NetCDF migration paths covered with complete code
- VirtualiZarr documented with Icechunk integration
- Validation patterns included (shape, sample, metadata, xr.testing)
- migration-template.py is a working, runnable script
- Follows the skill pattern from existing plugins
Dependencies
- Depends on feat: Create zarr-data-format plugin scaffold #67 (plugin scaffold)