Dask struggles with Parquet to Zarr conversion when pointing to compressed NetCDF data #563
Replies: 1 comment 11 replies
-
To be sure: it's the compression of the data we are talking about? The Dask dashboard, if you have it, might have some information about what's taking up resources. At a guess, it might have something to do with when references get dropped and released, which will not be the same for compressed versus uncompressed data. Question: if the end state is Zarr, does kerchunk make any improvement compared to xarray alone?
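For the last question in the reply above, the "xarray alone" path would skip the reference layer entirely, roughly as in the sketch below. The file pattern, engine, combine strategy and chunking are assumptions for illustration, not taken from the thread.

```python
import xarray as xr

# No kerchunk references: read the (compressed or uncompressed) NetCDF files
# directly and let dask write the Zarr store.
ds = xr.open_mfdataset(
    "rechunked/*.nc",      # hypothetical path to the uniform-chunk NetCDF files
    engine="h5netcdf",
    combine="by_coords",
    parallel=True,
    chunks={},             # keep the engine's preferred (on-disk) chunking
)
ds.to_zarr("output_direct.zarr", mode="w")
```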
-
I have the following workflow (a rough code sketch follows the list):
1. Rechunk raw NetCDF files with mixed chunk sizes [N1] into NetCDF files with uniform chunks [N2], using the h5netcdf or the netcdf4 driver:
   a. compressed (level 1, for example)
   b. uncompressed
2. Generate a single Parquet reference store for each NetCDF file [Px]:
   a. using the compressed input N2-a
   b. using the uncompressed input N2-b
3. Combine the many Parquet stores into a single Parquet reference store [P]
4. Convert the Parquet store to a Zarr store [Z]
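To make the steps concrete, here is a minimal sketch of such a pipeline with xarray, kerchunk and fsspec. All paths, chunk sizes, compression settings and the concat/identical dimensions are assumptions for illustration, not the poster's actual code, and exact kerchunk signatures may vary between versions.

```python
import glob
import os

import fsspec
import xarray as xr
from kerchunk.combine import MultiZarrToZarr
from kerchunk.df import refs_to_dataframe
from kerchunk.hdf import SingleHdf5ToZarr

# Step 1: rechunk N1 -> N2 with uniform chunks, compressed (a) or uncompressed (b).
# Chunk sizes and the compression level are illustrative; assumes 3-D (time, lat, lon) variables.
os.makedirs("rechunked", exist_ok=True)
for path in sorted(glob.glob("raw/*.nc")):
    ds = xr.open_dataset(path, engine="h5netcdf")
    encoding = {
        v: {"zlib": True, "complevel": 1, "chunksizes": (24, 128, 128)}   # variant a
        # v: {"zlib": False, "chunksizes": (24, 128, 128)}                # variant b
        for v in ds.data_vars
    }
    ds.to_netcdf(os.path.join("rechunked", os.path.basename(path)),
                 engine="h5netcdf", encoding=encoding)

# Step 2: one reference set per rechunked file (Px), each also saved as a Parquet store.
ref_sets = []
for path in sorted(glob.glob("rechunked/*.nc")):
    with fsspec.open(path, "rb") as f:
        refs = SingleHdf5ToZarr(f, path, inline_threshold=300).translate()
    ref_sets.append(refs)
    refs_to_dataframe(refs, path + ".parq")

# Step 3: combine the per-file reference sets into a single Parquet store (P).
combined = MultiZarrToZarr(
    ref_sets,                       # combining from the in-memory refs here
    remote_protocol="file",
    concat_dims=["time"],           # assumed concatenation dimension
    identical_dims=["lat", "lon"],  # assumed identical coordinates
).translate()
refs_to_dataframe(combined, "combined.parq")

# Step 4: open the combined reference store and materialise a real Zarr store (Z).
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {"fo": "combined.parq", "remote_protocol": "file"},
    },
    chunks={},                      # let dask use the on-disk chunking
)
ds.to_zarr("output.zarr", mode="w")
```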
Workflow variations
N1 ➡ N2 b ➡ Px ➡ P b ➡ Z b: works nicely.
N1 ➡ N2 a ➡ Px ➡ P a ➡ Z a: it does work, but it requires far more resources and therefore takes much longer.
Question
Why does the process (Dask?) struggle so much with compressed input data? Or is this not the right question?