Hi all,
Dean here, working out of PO.DAAC. We currently host virtual reference files, made with kerchunk, for several of our data sets, and we're excited to start making the reference files with virtualizarr instead. Love the easy syntax! I did want to bring something to your attention, though, and hear your thoughts.
Before the code, long story short: I created two reference files for our OSTIA-UKMO-L4-GLOB-REP-v2.0 data set (global gridded data, ~15,000 netCDF files, ~250 GB on disk, ~11 TB uncompressed), one using kerchunk and the other with virtualizarr. Lazy loading the data with the kerchunk file takes a few seconds, while using the virtualizarr file takes 4-6 minutes. Somehow, getting a subset of the data and plotting it takes about the same amount of time with either reference file. Still, it seems like it should be doable to lazy load the data in a shorter amount of time if kerchunk can?
For a minimal working example I can't create the reference files for the full time series here, so I opted for 400 files (it ends up taking about 5 minutes to create the reference file with each of virtualizarr and kerchunk; no parallel computing, since I'm trying to keep the example minimal). Also, I use the earthaccess package to manage NASA credentials, find S3 endpoints, etc., so hopefully someone on the virtualizarr team has an earthaccess account?
Here is the minimal working environment
##----------------
## Minimal install (using Python 3.12)
##----------------
# kerchunk==0.2.7
# xarray==2025.1.2
# h5py==3.13.0
# ujson==5.10.0
# earthaccess==0.14.0
# matplotlib==3.10.0
# virtualizarr==1.2.0
import os
import ujson
import fsspec
import earthaccess
import xarray as xr
from kerchunk.hdf import SingleHdf5ToZarr
from kerchunk.combine import MultiZarrToZarr
from virtualizarr import open_virtual_dataset
import matplotlib.pyplot as plt
Next, use earthaccess to access data set metadata on PO.DAAC
# Get Earthdata creds
earthaccess.login()
# Get AWS creds
fs_data = earthaccess.get_s3_filesystem(daac="PODAAC")
# Locate file information / metadata:
granule_info = earthaccess.search_data(
short_name="OSTIA-UKMO-L4-GLOB-REP-v2.0",
count=400
)
s3_endpoints = [g.data_links(access="direct")[0] for g in granule_info]
fobjs = earthaccess.open(granule_info) # Gets file-like objects for use with kerchunk
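A quick sanity check one could run here (not part of the timing comparison), just to confirm the search returned the expected granules and direct-access S3 URLs:
# Optional sanity check on the search results
print(len(s3_endpoints))  # expect 400
print(s3_endpoints[0])    # should be a direct-access s3:// URL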
Create ref file with kerchunk
# Individual references:
single_refs_kerchunk = [
SingleHdf5ToZarr(fobj, ep, inline_threshold=0).translate()
for fobj, ep in zip(fobjs, s3_endpoints)
]
# Combined reference:
kwargs_mzz = {
    'remote_protocol': 's3',
    'remote_options': fs_data.storage_options,
    'concat_dims': ['time'],
}
combined_ref_kerchunk = MultiZarrToZarr(single_refs_kerchunk, **kwargs_mzz).translate()
# Save reference to JSON:
with open("ref_combined_kerchunk.json", 'wb') as outf:
outf.write(ujson.dumps(combined_ref_kerchunk).encode())
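Optionally, a rough way to sanity-check the combined reference before moving on; kerchunk's version-1 reference dict keeps its chunk references under a "refs" key:
# Optional: peek at the combined kerchunk reference structure
print(combined_ref_kerchunk["version"])    # kerchunk reference spec version (1)
print(len(combined_ref_kerchunk["refs"]))  # total number of reference entries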
Create ref file with virtualizarr
# Create single ref files:
single_refs_virtualizarr = [
open_virtual_dataset(ep, indexes={}, reader_options={"storage_options": fs_data.storage_options})
for ep in s3_endpoints
]
# Create combined ref file and save to json:
combined_ref_virtualizarr = xr.combine_nested(
    single_refs_virtualizarr, concat_dim='time', coords='minimal',
    compat='override', combine_attrs='drop_conflicts'
)
combined_ref_virtualizarr.virtualize.to_kerchunk("ref_combined_virtualizarr.json", format='json')
##-----------------------------------------------------------------------------
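As an aside (not timed here): for the full ~15,000-file series I would parallelize the per-file open_virtual_dataset calls rather than loop serially. A minimal sketch using only the standard library, assuming the same fs_data and s3_endpoints as above and that thread-based concurrency is acceptable for these I/O-bound calls:
from concurrent.futures import ThreadPoolExecutor

def open_one_virtual(ep):
    # Same call as in the serial list comprehension above, one file per task
    return open_virtual_dataset(
        ep, indexes={}, reader_options={"storage_options": fs_data.storage_options}
    )

with ThreadPoolExecutor(max_workers=16) as pool:
    # pool.map preserves input order, so concatenation along time stays correct
    single_refs_parallel = list(pool.map(open_one_virtual, s3_endpoints))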
Last, open the data using each reference file
def opendf_withref(ref, fs_data):
"""
Wrapper function to open data with xarray, using a ref file. "ref" is a reference file or object.
"fs_data" is a filesystem with credentials to access the data files.
"""
storage_opts = {"fo": ref, "remote_protocol": "s3", "remote_options": fs_data.storage_options}
m = fsspec.filesystem('reference', **storage_opts).get_mapper('')
return xr.open_dataset(
m, engine="zarr", chunks={},
backend_kwargs={"consolidated": False}
)
data_kerchunk = opendf_withref("ref_combined_kerchunk.json", fs_data)
print(data_kerchunk)
data_virtualizarr = opendf_withref("ref_combined_virtualizarr.json", fs_data)
print(data_virtualizarr)
I don't have the code that timed the lazy loading in this post, but you can see the results in my Jupyter notebook here (using %%time):
https://github.com/DeanHenze/cloud_earthdata/blob/main/virtualizarr/kerchunk_vs_virtualizarr.ipynb
It takes a few hundred milliseconds for the kerchunk file, vs ~5 seconds for the virtualizarr file.
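For completeness, the notebook's %%time cells are roughly equivalent to the following sketch (hypothetical; using time.perf_counter in place of the cell magic):
import time

for name, ref_file in [("kerchunk", "ref_combined_kerchunk.json"),
                       ("virtualizarr", "ref_combined_virtualizarr.json")]:
    t0 = time.perf_counter()
    ds = opendf_withref(ref_file, fs_data)  # lazy open only; no data values are read
    print(f"{name}: opened in {time.perf_counter() - t0:.1f} s")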
I did my best to create a minimal working example, but now that I'm typing it all out I can see it's quite long. Still, I hope this can stimulate some discussion. Also, I'm trying to figure out whether this is even a relevant issue going forward, e.g., would saving the references in icechunk format solve this?
Thanks,
- Dean