
feat: Create data-migration skill #74

@cdcore09

Description


Create the skill for the zarr-data-format plugin covering data migration from HDF5/NetCDF to Zarr, VirtualiZarr for zero-copy access, and post-migration validation workflows.

Directory: plugins/zarr-data-format/skills/data-migration/

Research Reference

Full research document: .agents/research-zarr-chunk-optimization-and-zarr-plugin.md

Files to Create

1. SKILL.md (300+ lines)

Frontmatter:

name: data-migration
description: |
  Use this skill when the user asks to "convert HDF5 to zarr", "migrate NetCDF to zarr",
  "use VirtualiZarr", "copy zarr data", "move data to cloud zarr", "convert legacy data
  to zarr", or needs guidance on data migration from HDF5/NetCDF/other formats to Zarr,
  using zarr copy operations, VirtualiZarr for zero-copy ingestion, Icechunk integration,
  or validating data integrity after migration.

Content must include:

  • Quick Reference: Migration Options

    | Source Format | Method | Rechunking? |
    | --- | --- | --- |
    | HDF5 → Zarr | zarr.copy() or h5py + zarr | Yes (via rechunker) |
    | NetCDF → Zarr | xr.open_dataset().to_zarr() | Yes (via encoding) |
    | Multi-file NetCDF → Zarr | xr.open_mfdataset().to_zarr() | Yes (via encoding) |
    | HDF5/NetCDF → Virtual Zarr | VirtualiZarr | No (zero-copy) |
    | Zarr → Zarr (different chunks) | rechunker | Yes |
    | Zarr v2 → Zarr v3 | zarr.copy() with zarr_format=3 | Optional |
  • Zarr Copy Operations:

    import zarr
    
    # Individual array copy (decompresses/recompresses)
    zarr.copy(source_array, dest_group)
    
    # Group-level copy (all arrays in group)
    zarr.copy_all(source_group, dest_group)
    
    # Store-level copy (no decompression/recompression — fastest)
    zarr.copy_store(source_store, dest_store)
    • copy() / copy_all(): decompresses, recompresses — can change codecs, chunks via kwargs
    • copy_store(): binary copy — fastest but no chunk/codec changes
    • Use copy_store() for format migration without chunk change; rechunker for actual rechunking
  • HDF5 → Zarr Migration:

    import h5py
    import zarr
    
    # Direct copy (preserves the HDF5 chunk layout); use a context
    # manager so the HDF5 file handle is closed after the copy
    with h5py.File('data.h5', 'r') as source:
        dest = zarr.open_group('data.zarr', mode='w')
        zarr.copy_all(source, dest)
    
    # With rechunking (via xarray)
    import xarray as xr
    ds = xr.open_dataset('data.h5', engine='h5netcdf', chunks={})
    ds.to_zarr('data.zarr', encoding={'temp': {'chunks': (365, 90, 180)}})
  • NetCDF → Zarr Migration:

    # Single file
    ds = xr.open_dataset('data.nc', chunks={})
    ds.to_zarr('data.zarr', encoding={...})
    
    # Multi-file (concatenating along time)
    ds = xr.open_mfdataset('data_*.nc', chunks={}, combine='by_coords')
    ds.to_zarr('combined.zarr', encoding={...})
  • VirtualiZarr (Zero-Copy Ingestion):

    from virtualizarr import open_virtual_dataset
    
    # Create virtual dataset from legacy files
    vds = open_virtual_dataset('data.nc', indexes={})
    
    # Or multiple files
    import xarray as xr
    vds_list = [open_virtual_dataset(f, indexes={}) for f in file_list]
    combined = xr.combine_by_coords(vds_list)
    
    # Persist to Icechunk for cloud-optimized access
    combined.virtualize.to_icechunk(icechunk_store)
    
    # Then read as regular Zarr
    ds = xr.open_zarr(icechunk_store)
    • What it does: Creates metadata-only representation with byte-range references to original files
    • Benefit: No data duplication, cloud-optimized access patterns
    • Limitation: Original files must remain accessible
    • Best for: Large archival collections that can't be fully converted
  • Icechunk Integration:

    from icechunk import IcechunkStore, StorageConfig
    
    # Create Icechunk store on S3
    store = IcechunkStore.open_or_create(
        storage=StorageConfig.s3_from_env("bucket", "prefix"),
    )
    
    # VirtualiZarr → Icechunk (zero-copy ingestion)
    vds.virtualize.to_icechunk(store)
    
    # Benefits:
    # - ACID transactions for concurrent writes
    # - Version history / time-travel
    # - Zero-copy from archival formats
  • Validation After Migration:

    # 1. Shape verification
    assert source.shape == target.shape
    
    # 2. Random sample comparison
    import numpy as np
    for _ in range(10):
        idx = tuple(np.random.randint(0, s) for s in source.shape)
        assert source[idx] == target[idx]
    
    # 3. Metadata comparison
    for key in source.attrs:
        assert target.attrs[key] == source.attrs[key]
    
    # 4. Coordinate verification (xarray)
    xr.testing.assert_identical(source_ds, target_ds)
  • Large-Scale Migration Strategies:

    • Chunked migration: process dimension-by-dimension
    • Parallel migration: use Dask cluster for concurrent conversion
    • Incremental migration: convert subsets over time
    • Hybrid approach: VirtualiZarr for immediate access + gradual physical conversion
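    The incremental strategy above can be sketched with xarray's append writes, assuming batches share a common concatenation dimension (dataset contents and the "time"/"lat" names are illustrative):

    ```python
    import os
    import tempfile

    import numpy as np
    import xarray as xr

    path = os.path.join(tempfile.mkdtemp(), "archive.zarr")

    # Hypothetical yearly batches sharing a "time" dimension.
    batches = [
        xr.Dataset(
            {"temp": (("time", "lat"), np.random.rand(4, 3))},
            coords={"time": np.arange(i * 4, (i + 1) * 4), "lat": [0.0, 1.0, 2.0]},
        )
        for i in range(3)
    ]

    # Write the first batch, then append each subsequent batch along "time",
    # so the archive is converted in bounded-memory increments.
    batches[0].to_zarr(path, mode="w")
    for batch in batches[1:]:
        batch.to_zarr(path, mode="a", append_dim="time")

    combined = xr.open_zarr(path)
    assert combined.sizes["time"] == 12  # three 4-step batches appended
    ```

    For fixed-size archives where the full extent is known up front, to_zarr's region= argument is an alternative that lets independent workers fill disjoint slices.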
  • Handling Common Source Format Issues:

    • HDF5 compression filters not supported in Zarr → decompress and recompress with a Zarr-supported codec
    • NetCDF unlimited dimensions → map to resizable Zarr dimensions
    • Missing CF metadata → add during migration
    • Fill values / missing data → configure fill_value in Zarr
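    The fill-value point can be sketched as follows: unwritten Zarr chunks read back as the configured fill_value, mirroring NetCDF's _FillValue / missing-data convention (array name and values are illustrative):

    ```python
    import os
    import tempfile

    import numpy as np
    import zarr

    path = os.path.join(tempfile.mkdtemp(), "filled.zarr")

    # Configure fill_value at creation time so missing regions are
    # well-defined without materializing every chunk on disk.
    z = zarr.open(
        path, mode="w", shape=(100, 100), chunks=(10, 10),
        dtype="f4", fill_value=np.nan,
    )
    z[:10, :10] = 1.0  # write a single chunk

    assert z[0, 0] == 1.0
    assert np.isnan(z[50, 50])  # untouched region returns the fill value
    ```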

2. assets/migration-template.py

Complete migration script template:

  • Supports source formats: HDF5 (.h5, .hdf5), NetCDF (.nc, .nc4)
  • Configurable target: local path or cloud store (S3/GCS)
  • Configurable target chunks (or auto-calculate)
  • Configurable compression (default: Zstd)
  • Validates data integrity after migration (10 random sample checks)
  • Reports: source format, source size, target size, compression ratio, elapsed time
  • Handles errors gracefully
  • Clean, well-commented, production-ready
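    The size and compression-ratio reporting could be sketched with a small stdlib helper (a minimal illustration, not the template itself):

    ```python
    import os

    def store_size_bytes(path: str) -> int:
        """Total on-disk size of a directory-backed store (or a single file)."""
        if os.path.isfile(path):
            return os.path.getsize(path)
        return sum(
            os.path.getsize(os.path.join(root, name))
            for root, _, files in os.walk(path)
            for name in files
        )

    def compression_ratio(source_path: str, target_path: str) -> float:
        """Ratio > 1 means the target store is smaller than the source."""
        return store_size_bytes(source_path) / store_size_bytes(target_path)
    ```

    For cloud targets the same idea applies, but sizes would come from object listings (e.g. fsspec's du) rather than os.walk.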

3. references/PATTERNS.md (6+ patterns)

  1. Single HDF5 → Local Zarr — simple h5py + zarr.copy_all
  2. Multi-file NetCDF → Cloud Zarr — xr.open_mfdataset → to_zarr on S3
  3. VirtualiZarr Zero-Copy — create virtual dataset, persist to Icechunk
  4. Zarr v2 → v3 Format Migration — copy with format upgrade
  5. Zarr → Zarr Cloud Migration — local Zarr → S3/GCS with copy_store
  6. Incremental Migration — converting large archives in batches

4. references/EXAMPLES.md (4+ examples)

  1. Converting a Climate HDF5 Archive to Cloud Zarr — full workflow with rechunking
  2. Multi-file NetCDF → Single Zarr Store — concatenation + cloud write
  3. VirtualiZarr + Icechunk Pipeline — zero-copy ingestion of 100+ NetCDF files
  4. Validating a Large Migration — comprehensive validation beyond random sampling

5. references/COMMON_ISSUES.md (6+ issues)

  1. HDF5 compression filter incompatibility → must decompress and recompress
  2. Multi-file dimension conflicts → ensure consistent dimensions across files
  3. Memory overflow during migration → use chunked reads with Dask
  4. Missing coordinates after migration → explicitly include coords in xarray
  5. VirtualiZarr original files moved/deleted → references break, need re-virtualization
  6. Metadata loss during copy_store → use copy_all for metadata preservation
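  Issue 2 (multi-file dimension conflicts) is commonly handled with open_mfdataset's preprocess hook, which normalizes each file before concatenation — a hedged sketch, with the dimension-name mapping purely hypothetical:

  ```python
  import numpy as np
  import xarray as xr

  def normalize_dims(ds: xr.Dataset) -> xr.Dataset:
      """Rename inconsistent dimension names so files concatenate cleanly."""
      aliases = {"latitude": "lat", "longitude": "lon"}  # hypothetical mapping
      return ds.rename({old: new for old, new in aliases.items() if old in ds.dims})

  # Typical use (file pattern is hypothetical):
  # ds = xr.open_mfdataset("data_*.nc", combine="by_coords",
  #                        preprocess=normalize_dims)

  # In-memory demonstration of the rename:
  demo = xr.Dataset({"t": (("latitude",), np.zeros(3))},
                    coords={"latitude": [0.0, 1.0, 2.0]})
  assert "lat" in normalize_dims(demo).dims
  ```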

Acceptance Criteria

  • SKILL.md is 300+ lines covering all migration pathways
  • All zarr copy operations documented (copy, copy_all, copy_store)
  • HDF5 and NetCDF migration paths covered with complete code
  • VirtualiZarr documented with Icechunk integration
  • Validation patterns included (shape, sample, metadata, xr.testing)
  • migration-template.py is a working, runnable script
  • Follows the skill pattern from existing plugins

Dependencies

Metadata

Labels

enhancement (New feature or request), skill (Skill creation or modification)
