CMORPH in open data on AWS (VirtualiZarr → Icechunk → Pencil Chunking → Return Periods) #884
This is awesome! Thanks for sharing. One question: how come you appear to be using both Kerchunk's Parquet format and the Icechunk format? At one point you create a materialized Icechunk store, but you choose not to use Icechunk for serializing the virtual references. Why not just use Icechunk for both?
Thanks for raising this important question, which I missed in the note. I initially tried using the Icechunk virtual store directly, but ran into repeated OOM failures on the coordinator machine. The earlier attempts are documented in https://github.com/icpac-igad/ibf-thresholds-triggers/blob/xarray-method/thresholds/CMORPH/dev-test/cmorph_s3_to_gcs_icechunk_parallel.py, where I tried three different strategies to write virtual references directly to Icechunk.
All three hit the same wall: the coordinator accumulated ~85 KB of Kerchunk JSON per file in memory. At 236,688 files, that's ~20 GB, and the process got OOM-killed (exit code 137) at around 28,000 files every time. The append_dim='time' approach also failed because each append reads back O(n) metadata, where n is the current time dimension size, blowing up memory at ~87,000 timesteps.

The Parquet catalog approach solved this by using PyArrow's streaming ParquetWriter: each batch from the Coiled workers gets flushed to disk immediately and freed from memory, keeping coordinator RAM constant at ~10 MB regardless of total file count. The Parquet file also gives a simpler mental model: it's a flat table where each row says "this file exists at this S3 URL, at this timestamp, and here are its byte-range references." One can filter by year, month, or day with predicate pushdown, skip the heavy kerchunk_refs column entirely when only file listings are needed, and query it with standard pandas. It also decouples discovery from access: the same catalog can feed materialization for any region (East Africa or all of Africa), not just one predetermined spatial subset baked into a single virtual store.
Thanks to the VirtualiZarr library and Claude Code, I was able to process ~25 years of 30-min CMORPH (~0.2M NetCDF files) for East Africa using a scalable VirtualiZarr → Icechunk → Dask pipeline. The VirtualiZarr references are stored as Parquet at https://huggingface.co/datasets/E4DRR/virtualizarr-stores.
NOAA's CMORPH archive on S3 contains 236,688 NetCDF files spanning 1998-2024 at 30-minute, 8 km global resolution. The raw global archive is roughly ~400 GB on S3 (each file ~1.7 MB, global grid 1649 lat × 4948 lon). A fully materialized global Zarr/Icechunk store at the same resolution would be ~15 TB (473K timesteps × 8.16M grid cells × 4 bytes).
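The back-of-envelope numbers above can be checked directly. Assuming two 30-minute timesteps per hourly file (inferred from 236,688 files vs. ~473K timesteps) and float32 storage:

```python
files = 236_688            # hourly NetCDF files, 1998-2024
mb_per_file = 1.7          # average file size on S3
lat, lon = 1649, 4948      # global 8 km grid
timesteps = files * 2      # assumed: two 30-minute steps per hourly file

raw_gb = files * mb_per_file / 1000                 # raw archive on S3
dense_tb = timesteps * lat * lon * 4 / 1e12         # float32 = 4 bytes

print(f"raw archive  ~{raw_gb:.0f} GB")
print(f"materialized ~{dense_tb:.1f} TB")
```

This reproduces the ~400 GB raw and ~15 TB fully-materialized figures quoted above.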
For East Africa analysis, we only need 2.6% of the spatial domain (481 × 439 out of 1649 × 4948 grid cells) but the full 27-year temporal range. The traditional approach of downloading all files, converting, and subsetting locally would require downloading hundreds of GB and hours of single-machine processing.
The result is a three-phase cloud-native pipeline that generates a 372 GB analysis-ready Icechunk store in a GCS bucket for East Africa from the raw S3 archive, plus a pencil-chunked Zarr store for time-series access, for approximately $6 in total compute cost.
Workers are Coiled-managed Google Cloud Platform (GCP) VMs at ~$0.03/worker-hour. The fill phase uses n2-standard-4; the rechunk phase uses n2-highmem-4. The following steps were followed:
Parquet reference catalog
Generated row-wise reference entries using:
https://github.com/icpac-igad/ibf-thresholds-triggers/blob/xarray-method/thresholds/CMORPH/cmorph_parquet_vds_catalog.py
This avoids rewriting 0.2M files and creates a lightweight, queryable metadata layer as parquet.
East Africa Icechunk store
The East Africa region is subset and written into a versioned Icechunk store:
https://github.com/icpac-igad/ibf-thresholds-triggers/blob/xarray-method/thresholds/CMORPH/cmorph_east_africa_icechunk.py
This transitions from a virtual layout to a physically optimized, materialized data store.
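The spatial subsetting step amounts to turning a lat/lon bounding box into index slices on the global grid. A minimal numpy sketch, where the coordinate ranges and the East Africa box are illustrative approximations rather than the exact values used:

```python
import numpy as np

# Approximate global CMORPH grid: 1649 x 4948 cells at ~0.0727 deg (~8 km).
lats = np.linspace(-59.96, 59.96, 1649)
lons = np.linspace(0.04, 359.96, 4948)

def bbox_slices(lat_min, lat_max, lon_min, lon_max):
    """Map a bounding box to index slices on the ascending coord arrays."""
    i0, i1 = np.searchsorted(lats, [lat_min, lat_max])
    j0, j1 = np.searchsorted(lons, [lon_min, lon_max])
    return slice(i0, i1), slice(j0, j1)

# Hypothetical East Africa box: roughly 12S-20N, 21E-52E.
lat_sl, lon_sl = bbox_slices(-12.0, 20.0, 21.0, 52.0)
```

Applying these slices to every virtual reference is what shrinks the 15 TB global problem to the ~2.6% East Africa subset before any data is materialized.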
Pencil chunking (Dask P2P)
The store is then rechunked into a time-contiguous “pencil” layout using the Dask P2P shuffle, enabling efficient per-grid-cell time-series analysis. https://discourse.pangeo.io/t/rechunking-large-data-at-constant-memory-in-dask-experimental/3266
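Conceptually, the rechunk goes from map-like chunks (one timestep, full spatial extent) to pencils (full time axis, small spatial tiles). A toy dask sketch at reduced size; the real run uses the P2P shuffle method, which requires a distributed cluster:

```python
import dask.array as da

# Toy stand-in for the store: (time, lat, lon), one map per chunk.
arr = da.zeros((1000, 48, 44), chunks=(1, 48, 44), dtype="float32")

# Pencil layout: all timesteps together over small spatial tiles, so a
# full per-gridcell time series lives inside a single chunk.
pencils = arr.rechunk((1000, 4, 4))

# On a distributed cluster, setting
#   dask.config.set({"array.rechunk.method": "p2p"})
# routes this rechunk through the constant-memory P2P shuffle.
```

The lazy graph is identical either way; only the execution strategy (task-based vs. P2P shuffle) differs.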
Return periods
Extreme value analysis statistics are computed via:
https://github.com/icpac-igad/ibf-thresholds-triggers/blob/xarray-method/thresholds/CMORPH/cmorph_return_periods.py
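A common recipe for this step, shown here as an illustration rather than the exact method in the linked script, is to fit a GEV distribution to annual maxima and invert it at the desired exceedance probabilities:

```python
import numpy as np
from scipy.stats import genextreme

def return_levels(annual_max, periods=(10, 25, 50, 100)):
    """Fit a GEV to annual maxima and compute T-year return levels,
    i.e. the value exceeded with probability 1/T in any given year."""
    shape, loc, scale = genextreme.fit(annual_max)
    return {T: genextreme.isf(1.0 / T, shape, loc, scale) for T in periods}
```

This is applied independently per grid cell, which is exactly why the pencil layout matters: each cell's full 27-year series sits in one chunk and can be fitted without touching neighboring chunks.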
Role of Coiled/Dask:

All steps are embarrassingly parallel (per grid cell), but rechunk + shuffle require distributed execution as explained in https://github.com/icpac-igad/ibf-thresholds-triggers/blob/xarray-method/thresholds/CMORPH/DASK_OPERATIONS_NOTES.md#2-p2p-rechunking-for-phase-3-pencil-chunks. Coiled-backed Dask makes the pipeline feasible at this scale.