hdf parser limited to level-0 h5py.Dataset #664

@ilan-gold

I think there is a limitation in the HDF5 parser: it only handles "shallow" groups, i.e., groups whose datasets are direct children, and it does not descend into nested groups.

Some example data:

wget https://datasets.cellxgene.cziscience.com/4cc9a6b9-ae3a-4084-b3f0-f19578eb30ac.h5ad -O sc_dataset.h5ad

And then:

import obstore.store
import virtualizarr
import zarr

store = virtualizarr.parsers.HDFParser()("./sc_dataset.h5ad", object_store=obstore.store.LocalStore('./'))
zarr.open_group(store, mode="r", zarr_format=3) # evidently I need to specify v3, otherwise this crashes without finding a group?

This outputs:

<Group ManifestStore(group=
ManifestGroup(
    arrays={},
    groups={},
    metadata=GroupMetadata(attributes={'encoding-type': 'anndata', 'encoding-version': '0.1.0'}, zarr_format=3, consolidated_metadata=None, node_type='group'),
)
, stores=<virtualizarr.manifests.store.ObjectStoreRegistry object at 0x15b4e2ab0>)>

which lacks the nested group structure:

import h5py

with h5py.File("./sc_dataset.h5ad") as f:
    print(list(f.keys()))
['X', 'layers', 'obs', 'obsm', 'obsp', 'raw', 'uns', 'var', 'varm', 'varp']

I think the culprit is here:

def _construct_manifest_group(
    filepath: str,
    reader: ObstoreReader,
    *,
    group: str | None = None,
    drop_variables: Iterable[str] | None = None,
) -> ManifestGroup:
    """
    Construct a virtual Group from a HDF dataset.
    """
    import h5py

    with h5py.File(reader, mode="r") as f:
        if not isinstance(g := f.get(group or "/"), h5py.Group):
            raise ValueError(f"Group {group!r} is not an HDF Group")

        # Several of our test fixtures which use xr.tutorial data have
        # non coord dimensions serialized using big endian dtypes which are not
        # yet supported in zarr-python v3. We'll drop these variables for the
        # moment until big endian support is included upstream.
        non_coordinate_dimension_vars = _find_non_coord_dimension_vars(group=g)
        drop_variables = set(drop_variables or ()) | set(non_coordinate_dimension_vars)
        group_name = str(g.name)  # NOTE: this will always include leading "/"
        arrays = {
            key: _construct_manifest_array(filepath, dataset, group_name)
            for key in g.keys()
            if key not in drop_variables and isinstance(dataset := g[key], h5py.Dataset)
        }
        attributes = _extract_attrs(g)

    return ManifestGroup(arrays=arrays, attributes=attributes)

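This function is called by HDFParser, but it only collects h5py.Dataset objects that are direct children of the target group. Children that are themselves h5py.Group objects fail the isinstance check and are skipped, so nested groups never make it into the resulting ManifestGroup.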
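To illustrate the difference, here is a minimal sketch that contrasts the shallow filter with a recursive walk. It deliberately uses plain nested dicts as a stand-in for the HDF5 hierarchy so that it runs without any file. The names shallow_arrays, nested_tree, and toy_file are hypothetical, the toy layout is only loosely modeled on the keys printed above, and none of this is VirtualiZarr's API. The printed ManifestGroup does show a groups field, but whether the constructor accepts nested groups built this way is an assumption I have not checked.

```python
from collections.abc import Mapping
from typing import Any


def shallow_arrays(group: Mapping[str, Any]) -> dict[str, Any]:
    """Mimic the current filter: keep only direct children that are not groups."""
    return {key: child for key, child in group.items() if not isinstance(child, Mapping)}


def nested_tree(group: Mapping[str, Any]) -> dict[str, dict]:
    """Recursively split a group into its arrays and its child groups."""
    arrays: dict[str, Any] = {}
    groups: dict[str, dict] = {}
    for key, child in group.items():
        if isinstance(child, Mapping):
            # Child is a group: descend instead of dropping it.
            groups[key] = nested_tree(child)
        else:
            # Child is a leaf, standing in for an h5py.Dataset.
            arrays[key] = child
    return {"arrays": arrays, "groups": groups}


# Hypothetical stand-in for an .h5ad layout: every top-level entry is a group.
toy_file = {
    "X": {"data": "dataset", "indices": "dataset", "indptr": "dataset"},
    "obs": {"_index": "dataset", "cell_type": "dataset"},
    "raw": {"var": {"_index": "dataset"}},
}

shallow = shallow_arrays(toy_file)  # nothing survives the shallow filter
tree = nested_tree(toy_file)        # nested structure is preserved
print(shallow)
print(sorted(tree["groups"]))
```

With real files, the same walk would presumably test isinstance(child, h5py.Group) rather than Mapping, and build a ManifestArray for each dataset instead of keeping a placeholder string.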

Maybe this is intentional, not sure! Thanks for the library, very cool!

Metadata

Labels: enhancement (New feature or request)
