@klamike klamike commented Aug 21, 2025

Closes #7741. Follow-up to #7690.

  • Recursive parsing and feature inference, which preserves the tree structure of the file. Note that this means we now visit every link in the file. It also means we have to call combine_chunks on any large non-root datasets.
  • Support for complex64 (stored as two float32s; previously converted to two float64s)
  • Support for n-dimensional complex and compound data, and for more field types in compound data (because the main parser is reused, compound types are treated like groups)
  • Cleaned up variable-length (varlen) support
  • Always run feature inference and always cast to features (previously cast to schema)
  • Updated tests to use load_dataset instead of internal APIs
  • Removed columns from the config. To filter, you now have to pass Features (i.e., you must specify types)
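The recursive inference described in the first bullets can be pictured with a toy sketch. This is not the PR's implementation: plain nested dicts stand in for HDF5 groups, dtype strings stand in for datasets, and the function name `infer_features` and the `real`/`imag` field names are assumptions made for illustration only.

```python
def infer_features(node):
    """Mirror a group tree: dicts are groups, strings are dataset dtypes.

    Hypothetical stand-in for the real parser, which walks h5py objects.
    """
    if isinstance(node, dict):
        # A group becomes a nested mapping, so the tree shape is preserved.
        return {name: infer_features(child) for name, child in node.items()}
    # Assumed layout: a complex value is split into two real-valued fields
    # of matching precision (complex64 -> two float32s).
    if node == "complex64":
        return {"real": "float32", "imag": "float32"}
    if node == "complex128":
        return {"real": "float64", "imag": "float64"}
    return node


# Toy file layout: two groups, one of which holds a complex64 dataset.
tree = {
    "meta": {"id": "int64"},
    "signal": {"iq": "complex64", "gain": "float32"},
}
features = infer_features(tree)
```

Because every child is visited, the result has the same nesting as the input, which is the sense in which the tree structure is kept rather than flattened.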

@klamike klamike marked this pull request as ready for review August 21, 2025 19:17
@klamike klamike changed the title Preserve tree structure in HDF5 Refactor HDF5 and preserve tree structure Aug 21, 2025
@klamike klamike mentioned this pull request Aug 21, 2025
Comment on lines +286 to +295
should_chunk, keys, values = False, [], []
for k, v in batch_dict.items():
    if isinstance(v, pa.ChunkedArray):
        should_chunk = True
        v = v.combine_chunks()
    keys.append(k)
    values.append(v)

sarr = pa.StructArray.from_arrays(values, names=keys)
return pa.chunked_array(sarr) if should_chunk else sarr
@klamike klamike Aug 22, 2025

I had to add this when trying a large dataset. All groups are now represented by a StructArray, and constructing one requires non-chunked arrays.

klamike commented Aug 25, 2025

@lhoestq this is ready for you now!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@lhoestq lhoestq left a comment

LGTM ! 🙌

@lhoestq lhoestq merged commit 910fab2 into huggingface:main Aug 26, 2025
6 of 14 checks passed
Successfully merging this pull request may close these issues.

Preserve tree structure when loading HDF5