Skip to content

Conversation

klamike
Copy link
Contributor

@klamike klamike commented Jul 18, 2025

This PR adds support for tabular HDF5 file(s) by converting each row to an Arrow table. It supports columns with the usual dtypes including up to 5-dimensional arrays as well as support for complex/compound types by using Features(dict). All datasets within the HDF5 file should have rows on the first dimension (groups/subgroups are still allowed). Closes #3113.

Replaces #7625 which only supports a relatively small subset of HDF5.

@klamike klamike marked this pull request as ready for review July 19, 2025 03:52
@klamike
Copy link
Contributor Author

klamike commented Jul 23, 2025

A few to-dos which I think can be left for future PRs (which I am happy to do/help with -- just this one is already huge 😄 ):

@klamike
Copy link
Contributor Author

klamike commented Jul 25, 2025

@lhoestq any interest in merging this? Let me know if I can do anything to make reviewing it easier!

@lhoestq
Copy link
Member

lhoestq commented Aug 11, 2025

Sorry for the delay, I'll review your PR soon :)

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great ! I left a few comments :)

Btw feel free to run make style to fix code formatting

features = {}
for field_name in field_names:
field_dtype = dset.dtype[field_name]
field_path = f"{base_path}_{field_name}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe it should create nested features like this instead ?

Features({
    "position": {
        "x": List(Value("int64")),
        "y": List(Value("int64")),
    },
    "veolocity": {
        "vx": List(Value("int64")),
        "vy": List(Value("int64")),
    },
})

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does db2e76a look good? I'm not sure it will be as fast as the separate columns but it does clean up the collision checks.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good ! btw with ds = ds.flatten() you can get a flat structure with columns named "position.x", "position.y" etc.

@klamike
Copy link
Contributor Author

klamike commented Aug 11, 2025

Thanks for the review @lhoestq! Rebased on main and incorporated most of your suggestions.

I believe the only one left is the zero-dim handling with table_cast...

@klamike
Copy link
Contributor Author

klamike commented Aug 12, 2025

@lhoestq is 2c4bfba what you meant?

Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yay ! LGTM :)

Let's document this now, this is big !
Would you like to to open a PR for the docs ?

also cc @georgiachanning for viz

@lhoestq lhoestq merged commit b47e71c into huggingface:main Aug 19, 2025
6 of 14 checks passed
@klamike
Copy link
Contributor Author

klamike commented Aug 19, 2025

Awesome! Yes, I'm happy to help with the docs. Would appreciate any pointers, we can discuss in #7740.

It does look like there was a CI test failure, though it seems unrelated?

FAILED tests/test_dataset_dict.py::test_dummy_datasetdict_serialize_fs - ValueError: Protocol not known: mock
FAILED tests/test_arrow_dataset.py::test_dummy_dataset_serialize_fs - ValueError: Protocol not known: mock

Also, what do you think of the todos in #7690 (comment) ? In particular I think support in dataset-viewer would be nice.

@klamike klamike mentioned this pull request Aug 19, 2025
1 task
@lhoestq
Copy link
Member

lhoestq commented Aug 19, 2025

Cool ! Yeah the failure is unrelated

Regarding the Viewer, it should work out of the box when it's updated with the next version of datasets :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Loading Data from HDF files
3 participants