Description
I'm wondering how we can build virtualization pipelines as a map rather than a map-reduce process. The map paradigm would follow the structure described in Earthmover's excellent blog post on serverless datacube pipelines: the virtual dataset from each worker would be written directly to an Icechunk Virtual Store rather than being transferred back to the coordination node for a concatenation step. As in their post, parallelization across workers during virtualization could use a serverless framework like lithops or Coiled, and Dask concurrency could speed up virtualization within each worker. I think the main feature needed for this to work is an analogue to xr.Dataset.to_zarr(region="..."), complementary to xr.Dataset.to_zarr(append_dim="...") (documented in https://docs.xarray.dev/en/latest/generated/xarray.Dataset.to_zarr.html) (xref #21, #272).
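To make the difference concrete, here is a toy, standard-library-only sketch of the two patterns. It is not VirtualiZarr or Icechunk code: the store is a plain Python list, and `virtualize`, `write_region`, `map_only`, and `map_reduce` are hypothetical names invented for illustration. It also assumes every file contributes the same, known number of steps along the concatenation dimension, so each worker can compute its region up front.

```python
from concurrent.futures import ThreadPoolExecutor

STEPS_PER_FILE = 2  # assumption: every file holds the same number of steps


def virtualize(path):
    """Pretend to scan one file; return one (path, offset, length) ref per step."""
    return [(path, step * 100, 100) for step in range(STEPS_PER_FILE)]


def write_region(store, region, refs):
    """Toy analogue of a region write: fill a pre-allocated slice of the store."""
    if region.stop - region.start != len(refs):
        raise ValueError("refs do not match the size of the target region")
    store[region] = refs


def map_only(paths):
    """Each worker writes its own region; nothing is sent back for concatenation."""
    store = [None] * (len(paths) * STEPS_PER_FILE)  # pre-allocated up front

    def worker(indexed_path):
        i, path = indexed_path
        region = slice(i * STEPS_PER_FILE, (i + 1) * STEPS_PER_FILE)
        write_region(store, region, virtualize(path))

    with ThreadPoolExecutor() as pool:
        list(pool.map(worker, enumerate(paths)))  # consume to surface errors
    return store


def map_reduce(paths):
    """Workers return their refs; the coordinator concatenates them in order."""
    with ThreadPoolExecutor() as pool:
        per_file = list(pool.map(virtualize, paths))
    combined = []
    for refs in per_file:
        combined.extend(refs)
    return combined


paths = ["a.nc", "b.nc", "c.nc"]
assert map_only(paths) == map_reduce(paths)
```

Both functions produce the same store contents, but in `map_only` the coordinator never holds the per-file references. That is the property a region-style write for virtual datasets would provide, at the cost of needing the full shape to be known before any worker starts.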
I tried to check that this feature request doesn't already exist, but apologies if I missed something.
