Closed
Labels
Kerchunk (Relating to the kerchunk library / specification itself) · bug (Something isn't working) · references generation (Reading byte ranges from archival files)
Description
(This issue is inspired by the NAS GES DISC GPM_3IMERGHH_07 dataset, which has this same structure. cc @abarciauskas-bgse)
Consider a NetCDF dataset that has a valid NetCDF group at one level of the hierarchy and then a sub-group beneath that. We can make one like this:
import xarray as xr
ds = xr.DataArray([1, 2, 3], name="foo").to_dataset()
ds.to_netcdf("test.nc")
# store the same data in a sub group
ds.to_netcdf("test.nc", group="subgroup", mode="a")
Xarray can open either group fine.
xr.open_dataset("test.nc")
xr.open_dataset("test.nc", group="subgroup")
For the root group, it just ignores the sub group.
However, VirtualiZarr doesn't like it:
from virtualizarr import open_virtual_dataset
open_virtual_dataset("test.nc", group='')
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Cell In[7], line 1
----> 1 dsv = open_virtual_dataset("test.nc", group='')
File /srv/conda/envs/notebook/lib/python3.12/site-packages/virtualizarr/backend.py:200, in open_virtual_dataset(filepath, filetype, group, drop_variables, loadable_variables, decode_times, cftime_variables, indexes, virtual_array_class, virtual_backend_kwargs, reader_options, backend)
197 if backend_cls is None:
198 raise NotImplementedError(f"Unsupported file type: {filetype.name}")
--> 200 vds = backend_cls.open_virtual_dataset(
201 filepath,
202 group=group,
203 drop_variables=drop_variables,
204 loadable_variables=loadable_variables,
205 decode_times=decode_times,
206 indexes=indexes,
207 virtual_backend_kwargs=virtual_backend_kwargs,
208 reader_options=reader_options,
209 )
211 return vds
File /srv/conda/envs/notebook/lib/python3.12/site-packages/virtualizarr/readers/hdf5.py:48, in HDF5VirtualBackend.open_virtual_dataset(filepath, group, drop_variables, loadable_variables, decode_times, indexes, virtual_backend_kwargs, reader_options)
42 refs = SingleHdf5ToZarr(
43 filepath, inline_threshold=0, **reader_options
44 ).translate()
46 refs = extract_group(refs, group)
---> 48 virtual_vars, attrs, coord_names = virtual_vars_and_metadata_from_kerchunk_refs(
49 refs,
50 loadable_variables,
51 drop_variables,
52 fs_root=Path.cwd().as_uri(),
53 )
55 loadable_vars, indexes = open_loadable_vars_and_indexes(
56 filepath,
57 loadable_variables=loadable_variables,
(...)
62 decode_times=decode_times,
63 )
65 return construct_virtual_dataset(
66 virtual_vars=virtual_vars,
67 loadable_vars=loadable_vars,
(...)
70 attrs=attrs,
71 )
File /srv/conda/envs/notebook/lib/python3.12/site-packages/virtualizarr/translators/kerchunk.py:34, in virtual_vars_and_metadata_from_kerchunk_refs(vds_refs, loadable_variables, drop_variables, virtual_array_class, fs_root)
17 def virtual_vars_and_metadata_from_kerchunk_refs(
18 vds_refs: KerchunkStoreRefs,
19 loadable_variables,
(...)
22 fs_root: str | None = None,
23 ) -> tuple[Mapping[str, Variable], dict[str, Any], list[str]]:
24 """
25 Parses all useful information from a set kerchunk references (for a single group).
26
(...)
31 Required if any paths are relative in order to turn them into absolute paths (which virtualizarr requires).
32 """
---> 34 virtual_vars = virtual_vars_from_kerchunk_refs(
35 vds_refs,
36 drop_variables=drop_variables + loadable_variables,
37 virtual_array_class=virtual_array_class,
38 fs_root=fs_root,
39 )
40 ds_attrs = fully_decode_arr_refs(vds_refs["refs"]).get(".zattrs", {})
41 coord_names = ds_attrs.pop("coordinates", [])
File /srv/conda/envs/notebook/lib/python3.12/site-packages/virtualizarr/translators/kerchunk.py:110, in virtual_vars_from_kerchunk_refs(refs, drop_variables, virtual_array_class, fs_root)
104 drop_variables = []
105 var_names_to_keep = [
106 var_name for var_name in var_names if var_name not in drop_variables
107 ]
109 vars = {
--> 110 var_name: variable_from_kerchunk_refs(
111 refs, var_name, virtual_array_class, fs_root=fs_root
112 )
113 for var_name in var_names_to_keep
114 }
115 return vars
File /srv/conda/envs/notebook/lib/python3.12/site-packages/virtualizarr/translators/kerchunk.py:164, in variable_from_kerchunk_refs(refs, var_name, virtual_array_class, fs_root)
161 """Create a single xarray Variable by reading specific keys of a kerchunk references dict."""
163 arr_refs = extract_array_refs(refs, var_name)
--> 164 chunk_dict, zarray, zattrs = parse_array_refs(arr_refs)
165 # we want to remove the _ARRAY_DIMENSIONS from the final variables' .attrs
166 dims = zattrs.pop("_ARRAY_DIMENSIONS")
File /srv/conda/envs/notebook/lib/python3.12/site-packages/virtualizarr/translators/kerchunk.py:258, in parse_array_refs(arr_refs)
255 def parse_array_refs(
256 arr_refs: KerchunkArrRefs,
257 ) -> tuple[dict, ZArray, ZAttrs]:
--> 258 zarray = ZArray.from_kerchunk_refs(arr_refs.pop(".zarray"))
259 zattrs = arr_refs.pop(".zattrs", {})
260 chunk_dict = arr_refs
KeyError: '.zarray'
It looks like VirtualiZarr assumes that every child node in the hierarchy is an array, not a group.
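The failure mode can be sketched without VirtualiZarr at all. This is a simplified model, not the library's actual code: kerchunk's `SingleHdf5ToZarr` emits one flat refs dict for the whole file, so a sub-group appears as keys like `"subgroup/foo/.zarray"`, and any translator that treats every top-level key prefix as a variable name will try to look up `"subgroup/.zarray"`, which doesn't exist:

```python
# Simplified model of the suspected bug (pure Python, no virtualizarr).
# A flat kerchunk refs dict for the file created above might look like this:
refs = {
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        "foo/.zarray": '{"shape": [3]}',           # real array in the root group
        "subgroup/.zgroup": '{"zarr_format": 2}',  # a child *group*, not an array
        "subgroup/foo/.zarray": '{"shape": [3]}',
    }
}

# Naive variable discovery: take the first path component of every key.
var_names = {key.split("/")[0] for key in refs["refs"] if "/" in key}
print(sorted(var_names))  # "subgroup" is wrongly listed alongside "foo"

# Parsing "subgroup" as an array then fails, matching the traceback's KeyError.
assert "subgroup/.zarray" not in refs["refs"]
```

Filtering out prefixes that own a `.zgroup` key (i.e. actual sub-groups) before building the variable list would avoid the `KeyError`.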
I'm also curious why we aren't going through the new non-kerchunk backend for HDF5/netCDF4 files instead of the kerchunk backend. How do you turn that on? (I'm on 1.2.0.)
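For what it's worth, the `open_virtual_dataset` signature in the traceback above does show a `backend` parameter, so passing a reader class explicitly may select the non-kerchunk path. This is only a sketch: the `HDFVirtualBackend` import path is an assumption for 1.2.0 and may differ between releases:

```python
# Sketch only: `backend` is visible in the open_virtual_dataset signature in
# the traceback above; the HDFVirtualBackend import path is an assumption for
# virtualizarr 1.2.0 and may not match the installed release.
try:
    from virtualizarr import open_virtual_dataset
    from virtualizarr.readers.hdf import HDFVirtualBackend

    vds = open_virtual_dataset("test.nc", backend=HDFVirtualBackend)
except Exception:
    # virtualizarr (or the HDF reader's optional deps) isn't installed here,
    # or test.nc is absent
    vds = None
```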
Related to #84