Dask struggles with Parquet to Zarr conversion when pointing to compressed NetCDF data #563
Replies: 1 comment 11 replies
-
To be sure: it's the compression of the data we are talking about? The Dask dashboard, if you have it, might have some information about what's taking up resources. At a guess, it might have something to do with when references get dropped and released, which will not be the same for compressed versus uncompressed data. Question: if the end state is Zarr, does kerchunk make any improvement compared to xarray alone?
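For the last question in the reply above, the "xarray alone" path would skip the reference layer entirely, roughly as in the sketch below. The file pattern, engine, combine strategy and chunking are assumptions for illustration, not taken from the thread.

```python
import xarray as xr

# No kerchunk references: read the (compressed or uncompressed) NetCDF files
# directly and let dask write the Zarr store.
ds = xr.open_mfdataset(
    "rechunked/*.nc",      # hypothetical path to the uniform-chunk NetCDF files
    engine="h5netcdf",
    combine="by_coords",
    parallel=True,
    chunks={},             # keep the engine's preferred (on-disk) chunking
)
ds.to_zarr("output_direct.zarr", mode="w")
```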
-
I have the following workflow (a rough code sketch follows the list):
1. Rechunk raw NetCDF files with mixed chunk sizes [N1] into NetCDF files with uniform chunks [N2], using the h5netcdf or the netcdf4 driver:
   a. compressed (level 1, for example)
   b. uncompressed
2. Generate a single Parquet reference store for each NetCDF file [Px]:
   a. using the compressed input N2-a
   b. using the uncompressed input N2-b
3. Combine the many Parquet stores into a single Parquet reference store [P]
4. Convert the Parquet store to a Zarr store [Z]
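To make the steps concrete, here is a minimal sketch of such a pipeline with xarray, kerchunk and fsspec. All paths, chunk sizes, compression settings and the concat/identical dimensions are assumptions for illustration, not the poster's actual code, and exact kerchunk signatures may vary between versions.

```python
import glob
import os

import fsspec
import xarray as xr
from kerchunk.combine import MultiZarrToZarr
from kerchunk.df import refs_to_dataframe
from kerchunk.hdf import SingleHdf5ToZarr

# Step 1: rechunk N1 -> N2 with uniform chunks, compressed (a) or uncompressed (b).
# Chunk sizes and the compression level are illustrative; assumes 3-D (time, lat, lon) variables.
os.makedirs("rechunked", exist_ok=True)
for path in sorted(glob.glob("raw/*.nc")):
    ds = xr.open_dataset(path, engine="h5netcdf")
    encoding = {
        v: {"zlib": True, "complevel": 1, "chunksizes": (24, 128, 128)}   # variant a
        # v: {"zlib": False, "chunksizes": (24, 128, 128)}                # variant b
        for v in ds.data_vars
    }
    ds.to_netcdf(os.path.join("rechunked", os.path.basename(path)),
                 engine="h5netcdf", encoding=encoding)

# Step 2: one reference set per rechunked file (Px), each also saved as a Parquet store.
ref_sets = []
for path in sorted(glob.glob("rechunked/*.nc")):
    with fsspec.open(path, "rb") as f:
        refs = SingleHdf5ToZarr(f, path, inline_threshold=300).translate()
    ref_sets.append(refs)
    refs_to_dataframe(refs, path + ".parq")

# Step 3: combine the per-file reference sets into a single Parquet store (P).
combined = MultiZarrToZarr(
    ref_sets,                       # combining from the in-memory refs here
    remote_protocol="file",
    concat_dims=["time"],           # assumed concatenation dimension
    identical_dims=["lat", "lon"],  # assumed identical coordinates
).translate()
refs_to_dataframe(combined, "combined.parq")

# Step 4: open the combined reference store and materialise a real Zarr store (Z).
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {"fo": "combined.parq", "remote_protocol": "file"},
    },
    chunks={},                      # let dask use the on-disk chunking
)
ds.to_zarr("output.zarr", mode="w")
```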
Workflow variations
N1 ➡ N2 b ➡ Px ➡ P b ➡ Z b: works nicely.
N1 ➡ N2 a ➡ Px ➡ P a ➡ Z a: it does work, but it requires far more resources and therefore takes much longer.
Question
Why does the process (Dask?) struggle so much with compressed input data? Or is this not the right question?