Labels: enhancement (New feature or request)
I think the hdf5 parser has a limitation: it only looks at "shallow" groups, i.e., it picks up datasets that sit directly inside the selected group and does not descend into nested groups.
Some example data:
wget https://datasets.cellxgene.cziscience.com/4cc9a6b9-ae3a-4084-b3f0-f19578eb30ac.h5ad -O sc_dataset.h5ad
And then:
import obstore.store
import virtualizarr.parsers
import zarr

store = virtualizarr.parsers.HDFParser()("./sc_dataset.h5ad", object_store=obstore.store.LocalStore("./"))
zarr.open_group(store, mode="r", zarr_format=3)  # evidently I need to specify v3, otherwise this crashes without finding a group?
outputs
<Group ManifestStore(group=
ManifestGroup(
arrays={},
groups={},
metadata=GroupMetadata(attributes={'encoding-type': 'anndata', 'encoding-version': '0.1.0'}, zarr_format=3, consolidated_metadata=None, node_type='group'),
)
, stores=<virtualizarr.manifests.store.ObjectStoreRegistry object at 0x15b4e2ab0>)>
which lacks the nested group structure:
import h5py

with h5py.File("./sc_dataset.h5ad") as f:
    print(list(f.keys()))
['X', 'layers', 'obs', 'obsm', 'obsp', 'raw', 'uns', 'var', 'varm', 'varp']
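To see the whole hierarchy rather than only the top-level keys, a small h5py walk can help. This is only a sketch: `list_hdf5_tree` is a hypothetical helper name, and the in-memory demo file is a made-up stand-in for `sc_dataset.h5ad`, not a copy of its real layout.

```python
import h5py


def list_hdf5_tree(root: h5py.Group) -> list[tuple[str, str]]:
    """Return (path, kind) pairs for every object below `root`."""
    entries: list[tuple[str, str]] = []

    def visitor(name: str, obj) -> None:
        # visititems calls this once per object, with a path relative to `root`.
        if isinstance(obj, h5py.Group):
            kind = "group"
        elif isinstance(obj, h5py.Dataset):
            kind = "dataset"
        else:
            kind = "other"
        entries.append((name, kind))

    root.visititems(visitor)
    return entries


# Hypothetical stand-in file, kept in memory so nothing is written to disk.
with h5py.File("demo.h5", "w", driver="core", backing_store=False) as f:
    f.create_dataset("X", data=[1.0, 2.0])
    f.create_group("obs").create_dataset("n_genes", data=[10, 20])
    tree = sorted(list_hdf5_tree(f))

for path, kind in tree:
    print(f"{kind:8s}{path}")
```

Run against the real file (opened with `h5py.File("./sc_dataset.h5ad")`), the same helper should list the datasets nested under groups such as `obs`, which is the structure missing from the parser output.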
I think the culprit is here
VirtualiZarr/virtualizarr/parsers/hdf/hdf.py
Lines 94 to 126 in ce44d1f
def _construct_manifest_group(
    filepath: str,
    reader: ObstoreReader,
    *,
    group: str | None = None,
    drop_variables: Iterable[str] | None = None,
) -> ManifestGroup:
    """
    Construct a virtual Group from a HDF dataset.
    """
    import h5py

    with h5py.File(reader, mode="r") as f:
        if not isinstance(g := f.get(group or "/"), h5py.Group):
            raise ValueError(f"Group {group!r} is not an HDF Group")

        # Several of our test fixtures which use xr.tutorial data have
        # non coord dimensions serialized using big endian dtypes which are not
        # yet supported in zarr-python v3. We'll drop these variables for the
        # moment until big endian support is included upstream.
        non_coordinate_dimension_vars = _find_non_coord_dimension_vars(group=g)
        drop_variables = set(drop_variables or ()) | set(non_coordinate_dimension_vars)
        group_name = str(g.name)  # NOTE: this will always include leading "/"
        arrays = {
            key: _construct_manifest_array(filepath, dataset, group_name)
            for key in g.keys()
            if key not in drop_variables and isinstance(dataset := g[key], h5py.Dataset)
        }
        attributes = _extract_attrs(g)

    return ManifestGroup(arrays=arrays, attributes=attributes)
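This function is called in HDFParser, but it only collects the datasets directly under the selected group (level 0). Any child that is itself a group is skipped by the isinstance check, so nested groups never make it into the result.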
Maybe this is intentional, not sure! Thanks for the library, very cool!
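For illustration, here is roughly the kind of recursion I have in mind. It is a minimal sketch under assumptions, not a patch: `SketchGroup` and `build_group` are hypothetical names, `SketchGroup` only mimics the `arrays` / `groups` / attributes layout visible in the printed `ManifestGroup` above, and it records dataset shapes instead of building real manifest arrays.

```python
from dataclasses import dataclass, field

import h5py


@dataclass
class SketchGroup:
    """Hypothetical stand-in for ManifestGroup (arrays, child groups, attributes)."""

    arrays: dict[str, tuple[int, ...]] = field(default_factory=dict)
    groups: dict[str, "SketchGroup"] = field(default_factory=dict)
    attributes: dict[str, object] = field(default_factory=dict)


def build_group(g: h5py.Group) -> SketchGroup:
    """Mirror an HDF5 group, descending into nested groups."""
    node = SketchGroup(attributes=dict(g.attrs))
    for key in g.keys():
        child = g[key]
        if isinstance(child, h5py.Dataset):
            # The real parser would construct a manifest array here;
            # this sketch only records the dataset's shape.
            node.arrays[key] = child.shape
        elif isinstance(child, h5py.Group):
            # The step the level-0 loop lacks: recurse into the subgroup.
            node.groups[key] = build_group(child)
    return node


# Made-up in-memory file with one nested group, standing in for an .h5ad file.
with h5py.File("demo.h5", "w", driver="core", backing_store=False) as f:
    f.attrs["encoding-type"] = "anndata"
    f.create_dataset("X", data=[[1.0, 2.0], [3.0, 4.0]])
    f.create_group("obs").create_dataset("n_genes", data=[10, 20])
    root = build_group(f)

print(sorted(root.arrays), sorted(root.groups), root.groups["obs"].arrays)
```

The sketch ignores `drop_variables` and the big-endian workaround from the original function, and it assumes every link in the file resolves; a real change would need to decide how those apply to nested groups.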