Description
I pointed out this possible bug in another, more complex issue yesterday, but wanted to isolate it in a smaller reproducer.
The basic issue is that, with all defaults, the `parallel='dask'` option does not actually parallelize the read at all! But when I manually initialize a distributed client beforehand, it works as expected.
This example uses public CMIP6 netCDF data, so it should be reproducible anywhere. The machine I used had 16 cores, so if parallelization worked perfectly, opening the 4 files used here should theoretically take no longer than opening a single file.
```python
from obstore.store import S3Store
from virtualizarr.parsers import HDFParser
from virtualizarr import open_virtual_dataset, open_virtual_mfdataset

bucket = "s3://esgf-world/"
store = S3Store.from_url(url=bucket, skip_signature=True, region="us-east-2")
files = [
    "uo_Omon_CESM2_historical_r10i1p1f1_gn_185001-189912.nc",
    "uo_Omon_CESM2_historical_r10i1p1f1_gn_190001-194912.nc",
    "uo_Omon_CESM2_historical_r10i1p1f1_gn_195001-199912.nc",
    "uo_Omon_CESM2_historical_r10i1p1f1_gn_200001-201412.nc",
]
urls = [f"{bucket}CMIP6/CMIP/NCAR/CESM2/historical/r10i1p1f1/Omon/uo/gn/v20190313/{file}" for file in files]
```

When I time a single file read
```python
# baseline timing, single url
vds_single = open_virtual_dataset(urls[0], object_store=store, parser=HDFParser())
```

it takes

```
Wall time: 24.5 s
```
But for all 4 files
```python
# Using the default dask option for open_virtual_mfdataset
vds_dask_default_distributed = open_virtual_mfdataset(
    urls,
    object_store=store,
    parser=HDFParser(),
    parallel='dask',
    concat_dim="time",
    combine="nested",
    data_vars="minimal",
    coords="minimal",
    compat="override",
    join="exact",
)
vds_dask_default_distributed
```

I get

```
Wall time: 1min 16s
```
This tells me that the files are opened serially.
Interestingly, if I modify the first code cell to set up a distributed client (and with that a LocalCluster):
```python
from obstore.store import S3Store
from virtualizarr.parsers import HDFParser
from virtualizarr import open_virtual_dataset, open_virtual_mfdataset
from dask.distributed import Client

# ✨ the only line changed ✨
client = Client()

bucket = "s3://esgf-world/"
store = S3Store.from_url(url=bucket, skip_signature=True, region="us-east-2")
files = [
    "uo_Omon_CESM2_historical_r10i1p1f1_gn_185001-189912.nc",
    "uo_Omon_CESM2_historical_r10i1p1f1_gn_190001-194912.nc",
    "uo_Omon_CESM2_historical_r10i1p1f1_gn_195001-199912.nc",
    "uo_Omon_CESM2_historical_r10i1p1f1_gn_200001-201412.nc",
]
urls = [f"{bucket}CMIP6/CMIP/NCAR/CESM2/historical/r10i1p1f1/Omon/uo/gn/v20190313/{file}" for file in files]
```

I get 22 s and 26 s respectively, which I interpret as nicely parallel opening plus some minor overhead, e.g. for the concat.
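My understanding of why the single extra line changes things (an assumption, not checked against the virtualizarr code) is that creating a `Client` registers it as dask's global default scheduler, so every subsequent compute call routes to the LocalCluster. A minimal sketch:

```python
import dask
from dask.base import get_scheduler
from dask.distributed import Client

d = dask.delayed(lambda: 41 + 1)()

client = Client(processes=False)  # in-process LocalCluster, cheap to start
# With a Client active, dask resolves the default scheduler to the client,
# so compute() below runs on the cluster without any explicit scheduler arg.
print(get_scheduler())
result = d.compute()
client.close()
print(result)
```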
My hunch is that somewhere in the `DaskDelayedExecutor`, something is picking up a default from the environment, and by accident that default happens to be a serial executor. Just an idea.
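If that hunch is right, it should be visible by probing which scheduler dask resolves for a delayed graph when no client is active. This is my reading of dask's scheduler resolution (worth double-checking against the installed version):

```python
import dask
import dask.threaded
from dask.base import get_scheduler

# With no distributed Client and no dask.config override, get_scheduler
# falls back to the collection's own __dask_scheduler__; for Delayed
# objects that is the threaded scheduler rather than a serial one.
d = dask.delayed(lambda: 1)()
sched = get_scheduler(collections=[d])
print(sched)
```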
This is not my main priority to solve right now (since lithops works great out of the box), but I wanted to record it here, and I can probably look into it further in the coming days if I find some time.
Versions

```
virtualizarr==1.3.3.dev73+g7143f40 (from git+https://github.com/zarr-developers/VirtualiZarr.git@7143f40cbc5dc32cdd8958bec130933f912cf092)
obstore==0.7.0
dask==2025.5.1
distributed==2025.5.1
```