
Lazy loading speed of NASA Earthdata with kerchunk vs virtualizarr reference files #511

@DeanHenze

Description


Hi all,

Dean here, working out of PO.DAAC. We currently host several virtual reference files, made with kerchunk, for some of our data sets, and we're excited to start making the reference files with virtualizarr instead. Love the easy syntax! I did want to bring something to your attention, though, and hear your thoughts.

Before the code, the long story short is that I created two reference files for our OSTIA-UKMO-L4-GLOB-REP-v2.0 data set (global gridded data, ~15,000 netCDF files, ~250 GB on disk, ~11 TB uncompressed), one with kerchunk and the other with virtualizarr. Lazy loading the data with the kerchunk file takes a few seconds, while with the virtualizarr file it takes 4-6 minutes. Oddly, getting a subset of the data and plotting it takes the same amount of time with either reference file. Still, it seems like lazy loading the data in a shorter amount of time should be totally doable if kerchunk can do it?

For a minimal working example I can't create the reference files for the full time series here, so I opted for 400 files (creating the reference file takes about 5 minutes each with virtualizarr and kerchunk; no parallel computing, since I'm trying to stick to a minimal working example). I also use the earthaccess package to manage NASA credentials, find S3 endpoints, etc., so hopefully someone on the virtualizarr team has an Earthdata Login account to use with earthaccess?

Here is the minimal working environment

##----------------
## Minimal install (using Python 3.12)
##----------------
# kerchunk==0.2.7
# xarray==2025.1.2
# h5py==3.13.0
# ujson==5.10.0
# earthaccess==0.14.0
# matplotlib==3.10.0
# virtualizarr==1.2.0

import os
import ujson

import fsspec
import earthaccess

import xarray as xr
from kerchunk.hdf import SingleHdf5ToZarr
from kerchunk.combine import MultiZarrToZarr
from virtualizarr import open_virtual_dataset

import matplotlib.pyplot as plt

Next, use earthaccess to access data set metadata on PO.DAAC

# Get Earthdata creds
earthaccess.login()

# Get AWS creds
fs_data = earthaccess.get_s3_filesystem(daac="PODAAC")

# Locate file information / metadata:
granule_info = earthaccess.search_data(
    short_name="OSTIA-UKMO-L4-GLOB-REP-v2.0",
    count=400
    )

s3_endpoints = [g.data_links(access="direct")[0] for g in granule_info]

fobjs = earthaccess.open(granule_info) # Gets file-like objects for use with kerchunk

Create ref file with kerchunk

# Individual references:
single_refs_kerchunk = [
    SingleHdf5ToZarr(fobj, ep, inline_threshold=0).translate() 
    for fobj, ep in zip(fobjs, s3_endpoints)
    ]

# Combined reference:
kwargs_mzz = {'remote_protocol':"s3", 'remote_options':fs_data.storage_options, 'concat_dims':["time"]}
combined_ref_kerchunk = MultiZarrToZarr(single_refs_kerchunk, **kwargs_mzz).translate()

# Save reference to JSON:
with open("ref_combined_kerchunk.json", 'wb') as outf:
    outf.write(ujson.dumps(combined_ref_kerchunk).encode())

Create ref file with virtualizarr

# Create single ref files:
single_refs_virtualizarr = [
    open_virtual_dataset(ep, indexes={}, reader_options={"storage_options": fs_data.storage_options})
    for ep in s3_endpoints
    ]

# Create combined ref file and save to json:
combined_ref_virtualizarr = xr.combine_nested(
    single_refs_virtualizarr, concat_dim='time', coords='minimal',
    compat='override', combine_attrs='drop_conflicts'
    )
combined_ref_virtualizarr.virtualize.to_kerchunk("ref_combined_virtualizarr.json", format='json')
##-----------------------------------------------------------------------------

Last, open the data using each reference file

def opendf_withref(ref, fs_data):
    """
    Wrapper function to open data with xarray, using a ref file. "ref" is a reference file or object. 
    "fs_data" is a filesystem with credentials to access the data files. 
    """
    storage_opts = {"fo": ref, "remote_protocol": "s3", "remote_options": fs_data.storage_options}
    m = fsspec.filesystem('reference', **storage_opts).get_mapper('')
    return xr.open_dataset(
        m, engine="zarr", chunks={},
        backend_kwargs={"consolidated": False}
        )

data_kerchunk = opendf_withref("ref_combined_kerchunk.json", fs_data)
print(data_kerchunk)

data_virtualizarr = opendf_withref("ref_combined_virtualizarr.json", fs_data)
print(data_virtualizarr)
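
For completeness, the subset-and-plot step I mentioned above looks roughly like this (a sketch, not the exact notebook code; I'm assuming the analysed_sst variable and arbitrary index ranges here):

# Rough sketch of the subset + plot step (indices are arbitrary):
# pull one time step over a small lat/lon box; .plot() triggers the actual S3 reads.
subset = data_virtualizarr["analysed_sst"].isel(time=0, lat=slice(0, 500), lon=slice(0, 500))
subset.plot()
plt.show()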

I don't have the code that timed the lazy loading in this post, but you can see the results in my Jupyter notebook here (using %%time):
https://github.com/DeanHenze/cloud_earthdata/blob/main/virtualizarr/kerchunk_vs_virtualizarr.ipynb

For the 400-file example, the lazy load takes a few hundred milliseconds with the kerchunk file vs ~5 seconds with the virtualizarr file.
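
In a plain script, the timing boils down to something like this (a sketch roughly equivalent to the notebook's %%time cells, not the exact code I used):

import time

# Time only the lazy open for each combined reference file; no data is read here.
for label, ref in [("kerchunk", "ref_combined_kerchunk.json"),
                   ("virtualizarr", "ref_combined_virtualizarr.json")]:
    t0 = time.perf_counter()
    ds = opendf_withref(ref, fs_data)
    print(f"{label}: lazy open took {time.perf_counter() - t0:.1f} s")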

I did my best to create a minimal working example, but now that I'm typing it all out I can see it's quite long. Still, I hope this can stimulate some discussion. I'm also trying to figure out whether this is even a relevant issue going forward, e.g. would saving the references in icechunk format solve this?
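
In case it helps the discussion, here is roughly what I picture the icechunk route looking like. This is an untested sketch; I'm assuming recent virtualizarr/icechunk versions, and the exact API (Repository/session handling, authorizing the s3:// virtual chunk locations) may well differ:

import icechunk

# Local icechunk repo holding the virtual references; the chunk data itself stays on S3.
# (Authorizing the s3:// virtual chunk locations may need extra icechunk configuration
# that I'm not showing here.)
storage = icechunk.local_filesystem_storage("./ostia_icechunk")
repo = icechunk.Repository.create(storage)
session = repo.writable_session("main")

# Write the combined virtual dataset into the store and commit it.
combined_ref_virtualizarr.virtualize.to_icechunk(session.store)
session.commit("Add OSTIA-UKMO-L4-GLOB-REP-v2.0 virtual references")

# Lazy-open from icechunk to compare against the JSON reference files.
ds_icechunk = xr.open_zarr(repo.readonly_session(branch="main").store, consolidated=False)
print(ds_icechunk)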

Thanks,

Dean

Labels: Kerchunk, bug, performance, usage example
