CMORPH in open data on AWS (VirtualiZarr → Icechunk → Pencil Chunking → Return Periods) #884
This is awesome! Thanks for sharing. One question: how come you appear to be using both Kerchunk's Parquet format and the Icechunk format? At one point you create a materialized Icechunk store, but you choose not to use Icechunk for serializing the virtual references. Why not just use Icechunk for both?
Thanks for raising this important question, which I missed in the note. I initially tried using the Icechunk virtual store directly, but ran into repeated OOM failures on the coordinator machine. The earlier attempts are documented in https://github.com/icpac-igad/ibf-thresholds-triggers/blob/xarray-method/thresholds/CMORPH/dev-test/cmorph_s3_to_gcs_icechunk_parallel.py, where I tried three different strategies to write virtual references directly to Icechunk.
All three hit the same wall: the coordinator accumulated ~85 KB of Kerchunk JSON per file in memory. At 236,688 files, that's ~20 GB, and the process got OOM-killed (exit code 137) at around 28,000 files every time. The append_dim='time' approach also failed because each append reads back O(n) metadata, where n is the current time dimension size, blowing up memory at ~87,000 timesteps.

The Parquet catalog approach solved this by using PyArrow's streaming ParquetWriter: each batch from the Coiled workers gets flushed to disk immediately and freed from memory, keeping coordinator RAM constant at ~10 MB regardless of total file count. The Parquet file also gives a simpler mental model: it's a flat table where each row says "this file exists at this S3 URL, at this timestamp, and here are its byte-range references." One can filter by year, month, or day with predicate pushdown, skip the heavy kerchunk_refs column entirely when only file listings are needed, and query it with standard pandas. It also decouples discovery from access: the same catalog can feed materialization for any region (East Africa or all of Africa), not just one predetermined spatial subset baked into a single virtual store.
Thanks to the VirtualiZarr library and Claude Code, I was able to process ~25 years of 30-min CMORPH (~0.2M NetCDF files) for East Africa using a scalable VirtualiZarr → Icechunk → Dask pipeline. The VirtualiZarr references are stored as Parquet at https://huggingface.co/datasets/E4DRR/virtualizarr-stores.
NOAA's CMORPH archive on S3 contains 236,688 NetCDF files spanning 1998-2024 at 30-minute, 8 km global resolution. The raw global archive is roughly ~400 GB on S3 (each file ~1.7 MB, global grid 1649 lat × 4948 lon). A fully materialized global Zarr/Icechunk store at the same resolution would be ~15 TB (473K timesteps × 8.16M grid cells × 4 bytes).
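The back-of-envelope numbers above can be checked directly. Assuming two 30-minute timesteps per hourly file (inferred from 236,688 files vs. ~473K timesteps) and float32 storage:

```python
files = 236_688            # hourly NetCDF files, 1998-2024
mb_per_file = 1.7          # average file size on S3
lat, lon = 1649, 4948      # global 8 km grid
timesteps = files * 2      # assumed: two 30-minute steps per hourly file

raw_gb = files * mb_per_file / 1000                 # raw archive on S3
dense_tb = timesteps * lat * lon * 4 / 1e12         # float32 = 4 bytes

print(f"raw archive  ~{raw_gb:.0f} GB")
print(f"materialized ~{dense_tb:.1f} TB")
```

This reproduces the ~400 GB raw and ~15 TB fully-materialized figures quoted above.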
For East Africa analysis, we only need 2.6% of the spatial domain (481 × 439 out of 1649 × 4948 grid cells) but the full 27-year temporal range. The traditional approach of downloading all files, converting, and subsetting locally would require downloading hundreds of GB and hours of single-machine processing.
The result is a three-phase cloud-native pipeline that generates a 372 GB analysis-ready Icechunk store in a GCS bucket for East Africa from the raw S3 archive, plus a pencil-chunked Zarr store for time-series access, for approximately $6 in total compute cost.
Workers are Coiled-managed Google Cloud Platform (GCP) VMs at ~$0.03/worker-hour. The fill phase uses n2-standard-4; the rechunk phase uses n2-highmem-4. The following steps were followed:
Parquet reference catalog
Generated row-wise reference entries using:
https://github.com/icpac-igad/ibf-thresholds-triggers/blob/xarray-method/thresholds/CMORPH/cmorph_parquet_vds_catalog.py
This avoids rewriting 0.2M files and creates a lightweight, queryable metadata layer as parquet.
East Africa Icechunk store
The East Africa region is subset and written into a versioned Icechunk store:
https://github.com/icpac-igad/ibf-thresholds-triggers/blob/xarray-method/thresholds/CMORPH/cmorph_east_africa_icechunk.py
This transitions from a virtual layout to a physically optimized, materialized data store.
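The spatial subsetting step amounts to turning a lat/lon bounding box into index slices on the global grid. A minimal numpy sketch, where the coordinate ranges and the East Africa box are illustrative approximations rather than the exact values used:

```python
import numpy as np

# Approximate global CMORPH grid: 1649 x 4948 cells at ~0.0727 deg (~8 km).
lats = np.linspace(-59.96, 59.96, 1649)
lons = np.linspace(0.04, 359.96, 4948)

def bbox_slices(lat_min, lat_max, lon_min, lon_max):
    """Map a bounding box to index slices on the ascending coord arrays."""
    i0, i1 = np.searchsorted(lats, [lat_min, lat_max])
    j0, j1 = np.searchsorted(lons, [lon_min, lon_max])
    return slice(i0, i1), slice(j0, j1)

# Hypothetical East Africa box: roughly 12S-20N, 21E-52E.
lat_sl, lon_sl = bbox_slices(-12.0, 20.0, 21.0, 52.0)
```

Applying these slices to every virtual reference is what shrinks the 15 TB global problem to the ~2.6% East Africa subset before any data is materialized.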
Pencil chunking (Dask P2P)
The store is then rechunked into a time-contiguous “pencil” layout using the Dask P2P shuffle, enabling efficient per-grid-cell time-series analysis. https://discourse.pangeo.io/t/rechunking-large-data-at-constant-memory-in-dask-experimental/3266
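Conceptually, the rechunk goes from map-like chunks (one timestep, full spatial extent) to pencils (full time axis, small spatial tiles). A toy dask sketch at reduced size; the real run uses the P2P shuffle method, which requires a distributed cluster:

```python
import dask.array as da

# Toy stand-in for the store: (time, lat, lon), one map per chunk.
arr = da.zeros((1000, 48, 44), chunks=(1, 48, 44), dtype="float32")

# Pencil layout: all timesteps together over small spatial tiles, so a
# full per-gridcell time series lives inside a single chunk.
pencils = arr.rechunk((1000, 4, 4))

# On a distributed cluster, setting
#   dask.config.set({"array.rechunk.method": "p2p"})
# routes this rechunk through the constant-memory P2P shuffle.
```

The lazy graph is identical either way; only the execution strategy (task-based vs. P2P shuffle) differs.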
Return periods
Extreme value analysis statistics are computed via:
https://github.com/icpac-igad/ibf-thresholds-triggers/blob/xarray-method/thresholds/CMORPH/cmorph_return_periods.py
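A common recipe for this step, shown here as an illustration rather than the exact method in the linked script, is to fit a GEV distribution to annual maxima and invert it at the desired exceedance probabilities:

```python
import numpy as np
from scipy.stats import genextreme

def return_levels(annual_max, periods=(10, 25, 50, 100)):
    """Fit a GEV to annual maxima and compute T-year return levels,
    i.e. the value exceeded with probability 1/T in any given year."""
    shape, loc, scale = genextreme.fit(annual_max)
    return {T: genextreme.isf(1.0 / T, shape, loc, scale) for T in periods}
```

This is applied independently per grid cell, which is exactly why the pencil layout matters: each cell's full 27-year series sits in one chunk and can be fitted without touching neighboring chunks.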
Role of Coiled/Dask:

All steps are embarrassingly parallel (per grid cell), but rechunk + shuffle require distributed execution as explained in https://github.com/icpac-igad/ibf-thresholds-triggers/blob/xarray-method/thresholds/CMORPH/DASK_OPERATIONS_NOTES.md#2-p2p-rechunking-for-phase-3-pencil-chunks. Coiled-backed Dask makes the pipeline feasible at this scale.