
open_virtual_dataset fails when there is a subgroup #336

@rabernat

Description

(This issue is inspired by the NAS GES DISC GPM_3IMERGHH_07 dataset, which has this same structure. cc @abarciauskas-bgse)

Consider a NetCDF dataset that has a valid NetCDF group at one level of the hierarchy and then a sub-group beneath that. We can make one like this:

import xarray as xr
ds = xr.DataArray([1, 2, 3], name="foo").to_dataset()
ds.to_netcdf("test.nc")
# store the same data in a sub group
ds.to_netcdf("test.nc", group="subgroup", mode="a")

Xarray can open either group fine.

xr.open_dataset("test.nc")
xr.open_dataset("test.nc", group="subgroup")

For the root group, it just ignores the sub group.
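A quick check (sketch) that each call returns only the variables of the requested group:

root = xr.open_dataset("test.nc")
sub = xr.open_dataset("test.nc", group="subgroup")
print(list(root.data_vars))  # ['foo'] -- nothing from the subgroup leaks in
print(list(sub.data_vars))   # ['foo'] -- the copy stored under "subgroup"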

However, VirtualiZarr doesn't like it:

from virtualizarr import open_virtual_dataset
open_virtual_dataset("test.nc", group='')
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[7], line 1
----> 1 dsv = open_virtual_dataset("test.nc", group='')

File /srv/conda/envs/notebook/lib/python3.12/site-packages/virtualizarr/backend.py:200, in open_virtual_dataset(filepath, filetype, group, drop_variables, loadable_variables, decode_times, cftime_variables, indexes, virtual_array_class, virtual_backend_kwargs, reader_options, backend)
    197 if backend_cls is None:
    198     raise NotImplementedError(f"Unsupported file type: {filetype.name}")
--> 200 vds = backend_cls.open_virtual_dataset(
    201     filepath,
    202     group=group,
    203     drop_variables=drop_variables,
    204     loadable_variables=loadable_variables,
    205     decode_times=decode_times,
    206     indexes=indexes,
    207     virtual_backend_kwargs=virtual_backend_kwargs,
    208     reader_options=reader_options,
    209 )
    211 return vds

File /srv/conda/envs/notebook/lib/python3.12/site-packages/virtualizarr/readers/hdf5.py:48, in HDF5VirtualBackend.open_virtual_dataset(filepath, group, drop_variables, loadable_variables, decode_times, indexes, virtual_backend_kwargs, reader_options)
     42 refs = SingleHdf5ToZarr(
     43     filepath, inline_threshold=0, **reader_options
     44 ).translate()
     46 refs = extract_group(refs, group)
---> 48 virtual_vars, attrs, coord_names = virtual_vars_and_metadata_from_kerchunk_refs(
     49     refs,
     50     loadable_variables,
     51     drop_variables,
     52     fs_root=Path.cwd().as_uri(),
     53 )
     55 loadable_vars, indexes = open_loadable_vars_and_indexes(
     56     filepath,
     57     loadable_variables=loadable_variables,
   (...)
     62     decode_times=decode_times,
     63 )
     65 return construct_virtual_dataset(
     66     virtual_vars=virtual_vars,
     67     loadable_vars=loadable_vars,
   (...)
     70     attrs=attrs,
     71 )

File /srv/conda/envs/notebook/lib/python3.12/site-packages/virtualizarr/translators/kerchunk.py:34, in virtual_vars_and_metadata_from_kerchunk_refs(vds_refs, loadable_variables, drop_variables, virtual_array_class, fs_root)
     17 def virtual_vars_and_metadata_from_kerchunk_refs(
     18     vds_refs: KerchunkStoreRefs,
     19     loadable_variables,
   (...)
     22     fs_root: str | None = None,
     23 ) -> tuple[Mapping[str, Variable], dict[str, Any], list[str]]:
     24     """
     25     Parses all useful information from a set kerchunk references (for a single group).
     26 
   (...)
     31         Required if any paths are relative in order to turn them into absolute paths (which virtualizarr requires).
     32     """
---> 34     virtual_vars = virtual_vars_from_kerchunk_refs(
     35         vds_refs,
     36         drop_variables=drop_variables + loadable_variables,
     37         virtual_array_class=virtual_array_class,
     38         fs_root=fs_root,
     39     )
     40     ds_attrs = fully_decode_arr_refs(vds_refs["refs"]).get(".zattrs", {})
     41     coord_names = ds_attrs.pop("coordinates", [])

File /srv/conda/envs/notebook/lib/python3.12/site-packages/virtualizarr/translators/kerchunk.py:110, in virtual_vars_from_kerchunk_refs(refs, drop_variables, virtual_array_class, fs_root)
    104     drop_variables = []
    105 var_names_to_keep = [
    106     var_name for var_name in var_names if var_name not in drop_variables
    107 ]
    109 vars = {
--> 110     var_name: variable_from_kerchunk_refs(
    111         refs, var_name, virtual_array_class, fs_root=fs_root
    112     )
    113     for var_name in var_names_to_keep
    114 }
    115 return vars

File /srv/conda/envs/notebook/lib/python3.12/site-packages/virtualizarr/translators/kerchunk.py:164, in variable_from_kerchunk_refs(refs, var_name, virtual_array_class, fs_root)
    161 """Create a single xarray Variable by reading specific keys of a kerchunk references dict."""
    163 arr_refs = extract_array_refs(refs, var_name)
--> 164 chunk_dict, zarray, zattrs = parse_array_refs(arr_refs)
    165 # we want to remove the _ARRAY_DIMENSIONS from the final variables' .attrs
    166 dims = zattrs.pop("_ARRAY_DIMENSIONS")

File /srv/conda/envs/notebook/lib/python3.12/site-packages/virtualizarr/translators/kerchunk.py:258, in parse_array_refs(arr_refs)
    255 def parse_array_refs(
    256     arr_refs: KerchunkArrRefs,
    257 ) -> tuple[dict, ZArray, ZAttrs]:
--> 258     zarray = ZArray.from_kerchunk_refs(arr_refs.pop(".zarray"))
    259     zattrs = arr_refs.pop(".zattrs", {})
    260     chunk_dict = arr_refs

KeyError: '.zarray'

It looks like VirtualiZarr is assuming that all child nodes in the hierarchy are arrays, not groups.
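One way to see where it trips up is to look at the kerchunk references directly (a sketch, assuming kerchunk is installed; the listed keys are indicative):

from kerchunk.hdf import SingleHdf5ToZarr

refs = SingleHdf5ToZarr("test.nc", inline_threshold=0).translate()
print(sorted(refs["refs"]))
# Contains keys for both levels of the hierarchy, e.g. 'foo/.zarray',
# 'subgroup/.zgroup', 'subgroup/foo/.zarray', ...
# The translator presumably treats the leading path component "subgroup"
# as a variable name, and parse_array_refs then looks for a
# 'subgroup/.zarray' key that doesn't exist, hence KeyError: '.zarray'.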


I'm also curious why we are going through the kerchunk backend rather than the new non-kerchunk backend for HDF5/netCDF4 files. How do you turn that on? (I'm on 1.2.0.)
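For what it's worth, the open_virtual_dataset signature in the traceback above shows a backend parameter, so I'd guess something like the following would select the non-kerchunk reader explicitly (the import path is my assumption, not verified):

from virtualizarr import open_virtual_dataset
from virtualizarr.readers import HDFVirtualBackend  # assumed location of the non-kerchunk HDF reader

vds = open_virtual_dataset("test.nc", backend=HDFVirtualBackend)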

Related to #84

Labels

Kerchunk (Relating to the kerchunk library / specification itself), bug (Something isn't working), references generation (Reading byte ranges from archival files)
