Allow use of filters that expand chunks by a large factor #5939

fortnern · 2025-10-23T21:37:45Z

When using file format 2.0 or above, always encode the size of a filtered chunk using a 64-bit (size of lengths) integer. Chunks are still limited to 2^32 - 1 bytes, but this restriction could be lifted in the future without API or file format changes.
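
To make the effect of the encoding change concrete, here is a minimal sketch of the two length rules, modeled on the macro quoted in the review discussion further down. The names `floor_log2`, `legacy_size_len`, `new_size_len`, and `max_encodable` are hypothetical; `floor_log2` stands in for the library's `H5VM_log2_gen`, and the 8-byte "size of lengths" is an assumption (in the library the value comes from the file). Both rules only matter when the dataset has at least one filter.

```c
#include <assert.h>
#include <stdint.h>

/* Floor of log2(n) for n > 0; stands in for the library's H5VM_log2_gen(). */
static unsigned floor_log2(uint64_t n)
{
    unsigned r = 0;

    assert(n > 0);
    while (n >>= 1)
        r++;
    return r;
}

/* Older rule: the number of bytes used to encode a filtered chunk's size
 * is derived from the unfiltered chunk size and capped at 8. */
static unsigned legacy_size_len(uint64_t chunk_size)
{
    unsigned len = 1 + (floor_log2(chunk_size) + 8) / 8;

    return len > 8 ? 8 : len;
}

/* New rule (sketch): always use the file's "size of lengths".
 * ASSUMPTION: 8 bytes; the real value is a property of the file. */
static unsigned new_size_len(void)
{
    return 8;
}

/* Largest value that fits in `len` bytes. */
static uint64_t max_encodable(unsigned len)
{
    return len >= 8 ? UINT64_MAX : (((uint64_t)1 << (8 * len)) - 1);
}
```

For a 1 KiB chunk the older rule yields 3 bytes, so a filtered chunk larger than 2^24 - 1 bytes could not be recorded. Under the new rule the encoding width no longer depends on the unfiltered size, and the 2^32 - 1 byte chunk limit mentioned above becomes the effective bound.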

Most of the changes are due to changing H5D_chk_idx_info_t to contain an H5O_layout_t * instead of an H5O_layout_chunk_t * and an H5O_storage_chunk_t *. This was done so the index clients could access the version of the layout message.
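
As a rough illustration of that refactoring, the sketch below uses simplified stand-in types. Every name ending in `_sketch_t`, plus `idx_info_before_t`, `idx_info_after_t`, and `uses_fixed_width_sizes`, is hypothetical and far smaller than the real HDF5 structures. It shows only the idea: once the index info points at the whole layout, an index client can read the layout version alongside the chunk and storage fields.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-ins for the real HDF5 structures (hypothetical). */
typedef struct {
    uint32_t size; /* unfiltered chunk size in bytes */
} chunk_layout_sketch_t; /* cf. H5O_layout_chunk_t */

typedef struct {
    uint64_t idx_addr; /* address of the chunk index */
} chunk_storage_sketch_t; /* cf. H5O_storage_chunk_t */

typedef struct {
    unsigned               version; /* layout message version */
    chunk_layout_sketch_t  chunk;
    chunk_storage_sketch_t storage;
} layout_sketch_t; /* cf. H5O_layout_t */

/* Before: two separate pointers, so the version is not reachable. */
typedef struct {
    chunk_layout_sketch_t  *layout;
    chunk_storage_sketch_t *storage;
} idx_info_before_t;

/* After: one pointer to the enclosing layout. */
typedef struct {
    layout_sketch_t *layout;
} idx_info_after_t;

/* An index client can now branch on the layout version. The threshold is
 * passed in because the real version constant is not modeled here. */
static int uses_fixed_width_sizes(const idx_info_after_t *info, unsigned last_legacy_version)
{
    assert(info && info->layout);
    return info->layout->version > last_legacy_version;
}
```

The cost of this design is that every place that previously took the two narrower pointers has to be updated, which is consistent with the description's note that most of the diff comes from this change.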

Fixes #108

Important

Update chunked dataset file format to use 64-bit encoding for filtered chunk sizes in file format 2.0+, with refactoring and new tests.

Behavior:
- Update chunked dataset file format to use 64 bits for encoding size of filtered chunks in file format 2.0 or above.
- Chunks still limited to 2^32 - 1 bytes.
Code Refactoring:
- Modify assertions and update macros for chunk size calculations in H5Dbtree.c, H5Dbtree2.c, H5Dearray.c, H5Dfarray.c, H5Dchunk.c, and H5Dint.c.
- Refactor H5D_chk_idx_info_t to use H5O_layout_t instead of separate layout and storage.
Testing:
- Add test_chunk_expand2() in dsets.c to test handling of filters that expand chunks too much.
- Register and unregister H5Z_FILTER_EXPAND2 for testing purposes.
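
For readers unfamiliar with how a filter can grow a chunk, the following is a minimal, library-free sketch. `expand_sketch` and its `factor` parameter are hypothetical; the function only mimics the general shape of an HDF5 filter callback (buffer passed by pointer, byte count returned, 0 on failure) and is not the `H5Z_FILTER_EXPAND2` implementation added to dsets.c.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical expanding "filter": the forward pass writes each input byte
 * `factor` times; the reverse pass keeps every factor-th byte.
 * Returns the number of valid bytes in *buf, or 0 on failure. On success the
 * old buffer is freed and *buf / *buf_size describe the new one. */
static size_t expand_sketch(int reverse, size_t factor, size_t nbytes,
                            size_t *buf_size, void **buf)
{
    unsigned char *in = (unsigned char *)*buf;
    unsigned char *out;
    size_t         out_bytes;
    size_t         i;

    if (factor == 0 || nbytes == 0)
        return 0;

    if (reverse) {
        if (nbytes % factor != 0)
            return 0;
        out_bytes = nbytes / factor;
    }
    else {
        if (nbytes > (size_t)-1 / factor) /* guard against overflow */
            return 0;
        out_bytes = nbytes * factor;
    }

    out = (unsigned char *)malloc(out_bytes);
    if (out == NULL)
        return 0;

    for (i = 0; i < out_bytes; i++)
        out[i] = reverse ? in[i * factor] : in[i / factor];

    free(in);
    *buf      = out;
    *buf_size = out_bytes;
    return out_bytes;
}
```

With a large enough factor, the filtered size exceeds what a variable-width size field derived from the unfiltered chunk size can hold, which is the situation this PR is meant to allow.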



fortnern · 2025-10-24T04:20:43Z

The failure looks unrelated to this change and may be a transient issue

jhendersonHDF · 2025-10-24T15:09:17Z

> The failure looks unrelated to this change and may be a transient issue

The MacOS failure is a known issue that happens from time to time

mattjala · 2025-10-24T21:33:43Z

test/dsets.c

+} /* end filter_expand2() */
+
+/*-------------------------------------------------------------------------
+ * Function: test_chunk_expand2


It looks like the main case - a filter that expands chunks by a large amount that wasn't previously allowed - isn't tested.

I don't follow - that's what this does. I verified it failed before I implemented the changes in the library. test_chunk_expand is a preexisting test that tests cases where it's expected to fail, but I didn't want to copy that directly because it would have resulted in files that were too big.

In that case, I think it's just the comment here that's wrong - it says this test is evaluating error handling on expected failure, but the test itself expects every operation to succeed.

Oh sorry I'll fix that

brtnfld · 2025-10-24T21:37:39Z

src/H5Dpkg.h

You could replace the similar macros:
H5D_BT2_COMPUTE_CHUNK_SIZE_LEN
H5D_EARRAY_COMPUTE_CHUNK_SIZE_LEN
H5D_FARRAY_COMPUTE_CHUNK_SIZE_LEN

with a single macro:

#define H5D_COMPUTE_CHUNK_SIZE_LEN(chunk_size_len, idx_info) \
    do { \
        if ((idx_info)->pline->nused > 0) { \
            if ((idx_info)->layout->version > H5O_LAYOUT_VERSION_4) \
                (chunk_size_len) = H5F_SIZEOF_SIZE((idx_info)->f); \
            else { \
                (chunk_size_len) = 1 + ((H5VM_log2_gen((uint64_t)(idx_info)->layout->u.chunk.size) + 8) / 8); \
                if ((chunk_size_len) > 8) \
                    (chunk_size_len) = 8; \
            } \
        } \
        else \
            (chunk_size_len) = 0; \
    } while(0)

The earray and farray ones are necessarily split into two. I can try to unify them on Monday though.

I'm actually inclined to leave it how it is, since this only applies to earray, farray, and btree2; not btree1, single, or none. It would then be inelegant to create a shared macro that only applies to some indices. There is a spot in H5Dchunk.c that does the calculation independently that would ideally be unified with the calculations in the index codes, but the proper way to do that would be to add an index callback to retrieve the chunk size encoding length. We don't really have time for that right now though, and it only applies to obsolete file formats so it's probably not worth it.

test/dsets.c

mattjala

The 2.0+ file format docs should be updated to replace appropriate mentions of Chunk Size (variable size; at most 8 bytes). This will require updating a couple of byte-layout tables.

fortnern · 2025-10-27T19:29:17Z

> The 2.0+ file format docs should be updated to replace appropriate mentions of Chunk Size (variable size; at most 8 bytes). This will require updating a couple of byte-layout tables.

Yes I'm planning to do that, but since IIRC docs changes can go in after the code freeze I'm prioritizing other things right now

fortnern added 3 commits October 23, 2025 16:22

Change new chunk indexing methods to always encode chunk size as a 64 bit (size of lengths) integer, when using the 2.0 file format.

b38cdb4

Merge branch 'develop' into inefficient_compressors

fb67bce

Add CHANGELOG.md note

06346d3

fortnern requested review from bmribler, brtnfld, byrnHDF, derobins, glennsong09, jhendersonHDF, lrknox, mattjala, qkoziol and vchoi-hdfgroup as code owners October 23, 2025 21:37

github-project-automation bot added this to HDF5 - TRIAGE & TRACK Oct 23, 2025

github-project-automation bot moved this to To be triaged in HDF5 - TRIAGE & TRACK Oct 23, 2025

fortnern and others added 8 commits October 23, 2025 16:47

Spelling

e109a46

Fix errors in parallel build

b0942b7

Committing clang-format changes

d813ee7

More parallel fixes.

12ebf32

Merge branch 'inefficient_compressors' of github.com:fortnern/hdf5 into inefficient_compressors

0902abb

Committing clang-format changes

b111ac5

Another parallel fix

4e1c054

Fix parallel for real this time I hope

b1c6064

lrknox added the Component - C Library Core C library issues (usually in the src directory) label Oct 24, 2025

mattjala reviewed Oct 24, 2025

brtnfld reviewed Oct 24, 2025

Update function descriptions in dsets.c

1d5efd7

fortnern commented Oct 27, 2025

Fix spelling

22dfdd4

mattjala requested changes Oct 27, 2025

github-project-automation bot moved this from To be triaged to In progress in HDF5 - TRIAGE & TRACK Oct 27, 2025

mattjala approved these changes Oct 27, 2025

jhendersonHDF self-assigned this Oct 28, 2025

nbagha1 added this to the Release 2.0.0 milestone Oct 29, 2025

jhendersonHDF approved these changes Oct 30, 2025

lrknox merged commit 0441a4b into HDFGroup:develop Oct 30, 2025
91 checks passed

github-project-automation bot moved this from In progress to Done in HDF5 - TRIAGE & TRACK Oct 30, 2025
