HDF5 support #7690

klamike · 2025-07-18T21:09:41Z

This PR adds support for tabular HDF5 file(s) by converting each row to an Arrow table. It supports columns with the usual dtypes, including arrays of up to 5 dimensions, and handles complex/compound types via Features(dict). Every dataset within the HDF5 file should have rows on its first dimension (groups/subgroups are still allowed). Closes #3113.

Replaces #7625 which only supports a relatively small subset of HDF5.
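The layout requirement in the description (every dataset has rows on its first dimension, with groups and subgroups allowed) can be illustrated with a small sketch. This is not the PR's code: a plain nested dict stands in for an HDF5 file, shape tuples stand in for datasets, and the names `collect_datasets` and `check_row_alignment` are hypothetical. Every shape is assumed to have at least one dimension.

```python
from typing import Dict, Iterator, Tuple, Union

# Hypothetical stand-in for an HDF5 file:
# a dict is a group, a tuple is the shape of a dataset.
Shape = Tuple[int, ...]
Tree = Dict[str, Union[Shape, "Tree"]]


def collect_datasets(tree: Tree, prefix: str = "") -> Iterator[Tuple[str, Shape]]:
    """Yield (path, shape) for every leaf dataset, descending into groups."""
    for name, node in tree.items():
        path = f"{prefix}/{name}" if prefix else name
        if isinstance(node, dict):
            yield from collect_datasets(node, path)
        else:
            yield path, node


def check_row_alignment(tree: Tree) -> int:
    """Return the shared row count, or raise if first dimensions disagree."""
    lengths = {path: shape[0] for path, shape in collect_datasets(tree)}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Mismatched row counts: {lengths}")
    return next(iter(lengths.values()), 0)


example = {
    "id": (100,),                    # one scalar per row
    "image": (100, 32, 32, 3),       # one 3-dimensional array per row
    "meta": {                        # a group holding two more columns
        "label": (100,),
        "embedding": (100, 16),
    },
}
num_rows = check_row_alignment(example)
```

Here every leaf has 100 entries along its first axis, so the whole tree can be read as a 100-row table whose columns are the leaf datasets.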

klamike · 2025-07-23T02:11:02Z

A few to-dos which I think can be left for future PRs (which I am happy to do/help with -- just this one is already huge 😄 ):

Enum types
HDF5 io
dataset-viewer support (not sure if changes are needed with the way it is written now)

setup.py

src/datasets/packaged_modules/hdf5/hdf5.py

klamike · 2025-07-25T15:22:21Z

@lhoestq any interest in merging this? Let me know if I can do anything to make reviewing it easier!

lhoestq · 2025-08-11T10:59:24Z

Sorry for the delay, I'll review your PR soon :)

HuggingFaceDocBuilderDev · 2025-08-11T11:02:05Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

lhoestq

This is great ! I left a few comments :)

Btw feel free to run make style to fix code formatting

src/datasets/packaged_modules/hdf5/hdf5.py

tests/packaged_modules/test_hdf5.py

src/datasets/packaged_modules/__init__.py

src/datasets/packaged_modules/hdf5/hdf5.py

lhoestq · 2025-08-11T15:36:25Z

src/datasets/packaged_modules/hdf5/hdf5.py

+    features = {}
+    for field_name in field_names:
+        field_dtype = dset.dtype[field_name]
+        field_path = f"{base_path}_{field_name}"


maybe it should create nested features like this instead ?

Features({
    "position": {
        "x": List(Value("int64")),
        "y": List(Value("int64")),
    },
    "velocity": {
        "vx": List(Value("int64")),
        "vy": List(Value("int64")),
    },
})

klamike · 2025-08-18

Does db2e76a look good? I'm not sure it will be as fast as the separate columns, but it does clean up the collision checks.

lhoestq · 2025-08-19

Looks good ! btw with ds = ds.flatten() you can get a flat structure with columns named "position.x", "position.y" etc.
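To make the naming discussion in this thread concrete, here is a minimal sketch in plain Python. It is not the PR's implementation and does not call the datasets library: the dtype strings are stand-ins for feature types, and `underscore_name` and `flatten_columns` are hypothetical helpers. It shows why names joined with an underscore need collision checks, and how a nested mapping maps to dotted column names like the ones mentioned above.

```python
def underscore_name(base_path: str, field_name: str) -> str:
    """Join a dataset path and a compound field name with an underscore."""
    return f"{base_path}_{field_name}"


def flatten_columns(nested: dict, parent: str = "") -> dict:
    """Flatten a nested mapping into dotted column names."""
    flat = {}
    for name, value in nested.items():
        key = f"{parent}.{name}" if parent else name
        if isinstance(value, dict):
            flat.update(flatten_columns(value, key))
        else:
            flat[key] = value
    return flat


# Two different (dataset, field) pairs produce the same underscore-joined name,
# which is the kind of collision the flat scheme has to detect.
collides = underscore_name("a_b", "c") == underscore_name("a", "b_c")

# With nesting, each compound dataset keeps its fields under its own key.
nested = {
    "position": {"x": "int64", "y": "int64"},
    "velocity": {"vx": "int64", "vy": "int64"},
}
flat = flatten_columns(nested)
```

In this sketch `flat` has the keys "position.x", "position.y", "velocity.vx" and "velocity.vy", while the nested form never has to merge a dataset name and a field name into one string.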

klamike · 2025-08-11T17:59:54Z

Thanks for the review @lhoestq! Rebased on main and incorporated most of your suggestions.

I believe the only one left is the zero-dim handling with table_cast...

klamike · 2025-08-12T19:47:56Z

@lhoestq is 2c4bfba what you meant?

lhoestq

Yay ! LGTM :)

Let's document this now, this is big !
Would you like to open a PR for the docs ?

also cc @georgiachanning for viz

klamike · 2025-08-19T14:20:30Z

Awesome! Yes, I'm happy to help with the docs. Would appreciate any pointers, we can discuss in #7740.

It does look like there was a CI test failure, though it seems unrelated?

FAILED tests/test_dataset_dict.py::test_dummy_datasetdict_serialize_fs - ValueError: Protocol not known: mock
FAILED tests/test_arrow_dataset.py::test_dummy_dataset_serialize_fs - ValueError: Protocol not known: mock

Also, what do you think of the todos in #7690 (comment) ? In particular I think support in dataset-viewer would be nice.

lhoestq · 2025-08-19T15:18:58Z

Cool ! Yeah the failure is unrelated

Regarding the Viewer, it should work out of the box when it's updated with the next version of datasets :)

klamike mentioned this pull request Jul 18, 2025

Add documentation for PGLearn AI4OPT/ML4OPF#34

Open

klamike marked this pull request as ready for review July 19, 2025 03:52

klamike commented Jul 23, 2025

setup.py

klamike commented Jul 23, 2025

src/datasets/packaged_modules/hdf5/hdf5.py

lhoestq reviewed Aug 11, 2025

klamike and others added 13 commits August 11, 2025 13:35

initial hdf5 support

e4d5bec

handle zero dims

d3ebf93

add tests

be4adb4

refactor type inference

2408242

refactor vlen, drop ragged, add complex/compound

1c6b1f9

update tests

1e74de6

explicit h5py dependency

f9c7cf3

allow mismatched lengths if ignored

d0315a3

Update setup.py

c650c6f

Sequence -> List

c3c567d

Sequence -> List cont.

babd919

Fix features.List and typing.List conflict

f709dae

Use Features(dict) for complex and compound

db2e76a

klamike force-pushed the mk/hdf5 branch from 2ec6db6 to db2e76a Compare August 11, 2025 17:54

Hardcode .hdf5 and .h5 extensions to point to hdf5

b2e1894

klamike added 2 commits August 11, 2025 14:02

Update type hints from Any to Features

9b1550a

Unsized List in zero dim case

2c4bfba

lhoestq approved these changes Aug 19, 2025

lhoestq merged commit b47e71c into huggingface:main Aug 19, 2025
6 of 14 checks passed

klamike mentioned this pull request Aug 19, 2025

Document HDF5 support #7740

Draft

1 task

This was referenced Aug 19, 2025

Preserve tree structure when loading HDF5 #7741

Closed

Refactor HDF5 and preserve tree structure #7743

Merged
