
feat: Create data-migration skill #74

@cdcore09

Description


Create the skill for the zarr-data-format plugin covering data migration from HDF5/NetCDF to Zarr, VirtualiZarr for zero-copy access, and post-migration validation workflows.

Directory: plugins/zarr-data-format/skills/data-migration/

Research Reference

Full research document: .agents/research-zarr-chunk-optimization-and-zarr-plugin.md

Files to Create

1. SKILL.md (300+ lines)

Frontmatter:

name: data-migration
description: |
  Use this skill when the user asks to "convert HDF5 to zarr", "migrate NetCDF to zarr",
  "use VirtualiZarr", "copy zarr data", "move data to cloud zarr", "convert legacy data
  to zarr", or needs guidance on data migration from HDF5/NetCDF/other formats to Zarr,
  using zarr copy operations, VirtualiZarr for zero-copy ingestion, Icechunk integration,
  or validating data integrity after migration.

Content must include:

  • Quick Reference: Migration Options

    | Source Format | Method | Rechunking? |
    | --- | --- | --- |
    | HDF5 → Zarr | zarr.copy() or h5py + zarr | Yes (via rechunker) |
    | NetCDF → Zarr | xr.open_dataset().to_zarr() | Yes (via encoding) |
    | Multi-file NetCDF → Zarr | xr.open_mfdataset().to_zarr() | Yes (via encoding) |
    | HDF5/NetCDF → Virtual Zarr | VirtualiZarr | No (zero-copy) |
    | Zarr → Zarr (different chunks) | rechunker | Yes |
    | Zarr v2 → Zarr v3 | zarr.copy() with zarr_format=3 | Optional |
  • Zarr Copy Operations:

    import zarr
    
    # Individual array copy (decompresses/recompresses)
    zarr.copy(source_array, dest_group)
    
    # Group-level copy (all arrays in group)
    zarr.copy_all(source_group, dest_group)
    
    # Store-level copy (no decompression/recompression — fastest)
    zarr.copy_store(source_store, dest_store)
    • copy() / copy_all(): decompresses, recompresses — can change codecs, chunks via kwargs
    • copy_store(): binary copy — fastest but no chunk/codec changes
    • Use copy_store() for format migration without chunk change; rechunker for actual rechunking
  • HDF5 → Zarr Migration:

    import h5py
    import zarr
    
    # Direct copy (preserves the HDF5 chunk layout); use a context
    # manager so the HDF5 file handle is closed after the copy
    with h5py.File('data.h5', 'r') as source:
        dest = zarr.open_group('data.zarr', mode='w')
        zarr.copy_all(source, dest)
    
    # With rechunking (via xarray)
    import xarray as xr
    ds = xr.open_dataset('data.h5', engine='h5netcdf', chunks={})
    ds.to_zarr('data.zarr', encoding={'temp': {'chunks': (365, 90, 180)}})
  • NetCDF → Zarr Migration:

    # Single file
    ds = xr.open_dataset('data.nc', chunks={})
    ds.to_zarr('data.zarr', encoding={...})
    
    # Multi-file (concatenating along time)
    ds = xr.open_mfdataset('data_*.nc', chunks={}, combine='by_coords')
    ds.to_zarr('combined.zarr', encoding={...})
  • VirtualiZarr (Zero-Copy Ingestion):

    from virtualizarr import open_virtual_dataset
    
    # Create virtual dataset from legacy files
    vds = open_virtual_dataset('data.nc', indexes={})
    
    # Or multiple files
    import xarray as xr
    vds_list = [open_virtual_dataset(f, indexes={}) for f in file_list]
    combined = xr.combine_by_coords(vds_list)
    
    # Persist to Icechunk for cloud-optimized access
    combined.virtualize.to_icechunk(icechunk_store)
    
    # Then read as regular Zarr
    ds = xr.open_zarr(icechunk_store)
    • What it does: Creates metadata-only representation with byte-range references to original files
    • Benefit: No data duplication, cloud-optimized access patterns
    • Limitation: Original files must remain accessible
    • Best for: Large archival collections that can't be fully converted
  • Icechunk Integration:

    from icechunk import IcechunkStore, StorageConfig
    
    # Create Icechunk store on S3
    store = IcechunkStore.open_or_create(
        storage=StorageConfig.s3_from_env("bucket", "prefix"),
    )
    
    # VirtualiZarr → Icechunk (zero-copy ingestion)
    vds.virtualize.to_icechunk(store)
    
    # Benefits:
    # - ACID transactions for concurrent writes
    # - Version history / time-travel
    # - Zero-copy from archival formats
  • Validation After Migration:

    # 1. Shape verification
    assert source.shape == target.shape
    
    # 2. Random sample comparison
    import numpy as np
    for _ in range(10):
        idx = tuple(np.random.randint(0, s) for s in source.shape)
        assert source[idx] == target[idx]
    
    # 3. Metadata comparison
    for key in source.attrs:
        assert target.attrs[key] == source.attrs[key]
    
    # 4. Coordinate verification (xarray)
    xr.testing.assert_identical(source_ds, target_ds)
  • Large-Scale Migration Strategies:

    • Chunked migration: process dimension-by-dimension
    • Parallel migration: use Dask cluster for concurrent conversion
    • Incremental migration: convert subsets over time
    • Hybrid approach: VirtualiZarr for immediate access + gradual physical conversion
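    The incremental strategy above can be sketched with xarray's append writes, assuming batches share a common concatenation dimension (dataset contents and the "time"/"lat" names are illustrative):

    ```python
    import os
    import tempfile

    import numpy as np
    import xarray as xr

    path = os.path.join(tempfile.mkdtemp(), "archive.zarr")

    # Hypothetical yearly batches sharing a "time" dimension.
    batches = [
        xr.Dataset(
            {"temp": (("time", "lat"), np.random.rand(4, 3))},
            coords={"time": np.arange(i * 4, (i + 1) * 4), "lat": [0.0, 1.0, 2.0]},
        )
        for i in range(3)
    ]

    # Write the first batch, then append each subsequent batch along "time",
    # so the archive is converted in bounded-memory increments.
    batches[0].to_zarr(path, mode="w")
    for batch in batches[1:]:
        batch.to_zarr(path, mode="a", append_dim="time")

    combined = xr.open_zarr(path)
    assert combined.sizes["time"] == 12  # three 4-step batches appended
    ```

    For fixed-size archives where the full extent is known up front, to_zarr's region= argument is an alternative that lets independent workers fill disjoint slices.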
  • Handling Common Source Format Issues:

    • HDF5 compression filters not supported in Zarr → decompress and recompress with a Zarr-supported codec
    • NetCDF unlimited dimensions → map to resizable Zarr dimensions
    • Missing CF metadata → add during migration
    • Fill values / missing data → configure fill_value in Zarr
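    The fill-value point can be sketched as follows: unwritten Zarr chunks read back as the configured fill_value, mirroring NetCDF's _FillValue / missing-data convention (array name and values are illustrative):

    ```python
    import os
    import tempfile

    import numpy as np
    import zarr

    path = os.path.join(tempfile.mkdtemp(), "filled.zarr")

    # Configure fill_value at creation time so missing regions are
    # well-defined without materializing every chunk on disk.
    z = zarr.open(
        path, mode="w", shape=(100, 100), chunks=(10, 10),
        dtype="f4", fill_value=np.nan,
    )
    z[:10, :10] = 1.0  # write a single chunk

    assert z[0, 0] == 1.0
    assert np.isnan(z[50, 50])  # untouched region returns the fill value
    ```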

2. assets/migration-template.py

Complete migration script template:

  • Supports source formats: HDF5 (.h5, .hdf5), NetCDF (.nc, .nc4)
  • Configurable target: local path or cloud store (S3/GCS)
  • Configurable target chunks (or auto-calculate)
  • Configurable compression (default: Zstd)
  • Validates data integrity after migration (10 random sample checks)
  • Reports: source format, source size, target size, compression ratio, elapsed time
  • Handles errors gracefully
  • Clean, well-commented, production-ready
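    The size and compression-ratio reporting could be sketched with a small stdlib helper (a minimal illustration, not the template itself):

    ```python
    import os

    def store_size_bytes(path: str) -> int:
        """Total on-disk size of a directory-backed store (or a single file)."""
        if os.path.isfile(path):
            return os.path.getsize(path)
        return sum(
            os.path.getsize(os.path.join(root, name))
            for root, _, files in os.walk(path)
            for name in files
        )

    def compression_ratio(source_path: str, target_path: str) -> float:
        """Ratio > 1 means the target store is smaller than the source."""
        return store_size_bytes(source_path) / store_size_bytes(target_path)
    ```

    For cloud targets the same idea applies, but sizes would come from object listings (e.g. fsspec's du) rather than os.walk.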

3. references/PATTERNS.md (6+ patterns)

  1. Single HDF5 → Local Zarr — simple h5py + zarr.copy_all
  2. Multi-file NetCDF → Cloud Zarr — xr.open_mfdataset → to_zarr on S3
  3. VirtualiZarr Zero-Copy — create virtual dataset, persist to Icechunk
  4. Zarr v2 → v3 Format Migration — copy with format upgrade
  5. Zarr → Zarr Cloud Migration — local Zarr → S3/GCS with copy_store
  6. Incremental Migration — converting large archives in batches

4. references/EXAMPLES.md (4+ examples)

  1. Converting a Climate HDF5 Archive to Cloud Zarr — full workflow with rechunking
  2. Multi-file NetCDF → Single Zarr Store — concatenation + cloud write
  3. VirtualiZarr + Icechunk Pipeline — zero-copy ingestion of 100+ NetCDF files
  4. Validating a Large Migration — comprehensive validation beyond random sampling

5. references/COMMON_ISSUES.md (6+ issues)

  1. HDF5 compression filter incompatibility → must decompress and recompress
  2. Multi-file dimension conflicts → ensure consistent dimensions across files
  3. Memory overflow during migration → use chunked reads with Dask
  4. Missing coordinates after migration → explicitly include coords in xarray
  5. VirtualiZarr original files moved/deleted → references break, need re-virtualization
  6. Metadata loss during copy_store → use copy_all for metadata preservation
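  Issue 2 (multi-file dimension conflicts) is commonly handled with open_mfdataset's preprocess hook, which normalizes each file before concatenation — a hedged sketch, with the dimension-name mapping purely hypothetical:

  ```python
  import numpy as np
  import xarray as xr

  def normalize_dims(ds: xr.Dataset) -> xr.Dataset:
      """Rename inconsistent dimension names so files concatenate cleanly."""
      aliases = {"latitude": "lat", "longitude": "lon"}  # hypothetical mapping
      return ds.rename({old: new for old, new in aliases.items() if old in ds.dims})

  # Typical use (file pattern is hypothetical):
  # ds = xr.open_mfdataset("data_*.nc", combine="by_coords",
  #                        preprocess=normalize_dims)

  # In-memory demonstration of the rename:
  demo = xr.Dataset({"t": (("latitude",), np.zeros(3))},
                    coords={"latitude": [0.0, 1.0, 2.0]})
  assert "lat" in normalize_dims(demo).dims
  ```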

Acceptance Criteria

  • SKILL.md is 300+ lines covering all migration pathways
  • All zarr copy operations documented (copy, copy_all, copy_store)
  • HDF5 and NetCDF migration paths covered with complete code
  • VirtualiZarr documented with Icechunk integration
  • Validation patterns included (shape, sample, metadata, xr.testing)
  • migration-template.py is a working, runnable script
  • Follows the skill pattern from existing plugins

Dependencies

Metadata

Labels

enhancement (New feature or request), skill (Skill creation or modification)
