Commit b47e71c

HDF5 support (#7690)
* initial hdf5 support
* handle zero dims
* add tests
* refactor type inference
* refactor vlen, drop ragged, add complex/compound
* update tests
* explicit h5py dependency
* allow mismatched lengths if ignored
* Update setup.py
* Sequence -> List
* Sequence -> List cont.
* Fix features.List and typing.List conflict
* Use `Features(dict)` for complex and compound
* Hardcode .hdf5 and .h5 extensions to point to hdf5
* Update type hints from Any to Features
* Unsized List in zero dim case
1 parent 985c9be commit b47e71c

5 files changed: +1221 -0 lines changed

setup.py

Lines changed: 1 addition & 0 deletions
@@ -166,6 +166,7 @@
 "aiohttp",
 "elasticsearch>=7.17.12,<8.0.0", # 8.0 asks users to provide hosts or cloud_id when instantiating ElasticSearch(); 7.9.1 has legacy numpy.float_ which was fixed in https://github.com/elastic/elasticsearch-py/pull/2551.
 "faiss-cpu>=1.8.0.post1", # Pins numpy < 2
+"h5py",
 "jax>=0.3.14; sys_platform != 'win32'",
 "jaxlib>=0.3.14; sys_platform != 'win32'",
 "lz4",

src/datasets/packaged_modules/__init__.py

Lines changed: 4 additions & 0 deletions
@@ -8,6 +8,7 @@
 from .audiofolder import audiofolder
 from .cache import cache
 from .csv import csv
+from .hdf5 import hdf5
 from .imagefolder import imagefolder
 from .json import json
 from .pandas import pandas
@@ -47,6 +48,7 @@ def _hash_python_lines(lines: list[str]) -> str:
 "pdffolder": (pdffolder.__name__, _hash_python_lines(inspect.getsource(pdffolder).splitlines())),
 "webdataset": (webdataset.__name__, _hash_python_lines(inspect.getsource(webdataset).splitlines())),
 "xml": (xml.__name__, _hash_python_lines(inspect.getsource(xml).splitlines())),
+"hdf5": (hdf5.__name__, _hash_python_lines(inspect.getsource(hdf5).splitlines())),
 }

 # get importable module names and hash for caching
@@ -76,6 +78,8 @@ def _hash_python_lines(lines: list[str]) -> str:
 ".txt": ("text", {}),
 ".tar": ("webdataset", {}),
 ".xml": ("xml", {}),
+".hdf5": ("hdf5", {}),
+".h5": ("hdf5", {}),
 }
 _EXTENSION_TO_MODULE.update({ext: ("imagefolder", {}) for ext in imagefolder.ImageFolder.EXTENSIONS})
 _EXTENSION_TO_MODULE.update({ext.upper(): ("imagefolder", {}) for ext in imagefolder.ImageFolder.EXTENSIONS})
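
Note: the two new `_EXTENSION_TO_MODULE` entries are what let a file ending in `.hdf5` or `.h5` resolve to the `hdf5` packaged module. The snippet below is a minimal, standalone sketch of that kind of suffix lookup. The names `EXTENSION_TO_MODULE` and `infer_module` are hypothetical, the mapping is a trimmed copy of the entries visible in this hunk, and the library's real resolution logic may well differ.

```python
from pathlib import Path

# Hypothetical, trimmed-down copy of the entries visible in the hunk above.
EXTENSION_TO_MODULE = {
    ".txt": ("text", {}),
    ".tar": ("webdataset", {}),
    ".xml": ("xml", {}),
    ".hdf5": ("hdf5", {}),
    ".h5": ("hdf5", {}),
}


def infer_module(path):
    """Return (module_name, default_kwargs) for a file path, or None if unknown.

    Illustrative helper only; it is not part of the library.
    """
    suffix = Path(path).suffix  # e.g. ".h5" for "data/train.h5"
    return EXTENSION_TO_MODULE.get(suffix)


if __name__ == "__main__":
    for name in ("data/train.h5", "data/train.hdf5", "data/train.bin"):
        print(name, "->", infer_module(name))
```

The lookup in this sketch is case-sensitive, so `train.H5` would not match. In the lines visible here, upper-case variants are registered only for the `imagefolder` extensions.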

src/datasets/packaged_modules/hdf5/__init__.py

Whitespace-only changes.
