@fortnern
Member

@fortnern fortnern commented Oct 23, 2025

When using file format 2.0 or above, always encode the size of a filtered chunk using a 64-bit (size of lengths) integer. Chunks are still limited to 2^32 - 1 bytes, but this restriction could be lifted in the future without API or file format changes.

Most of the changes are due to changing H5D_chk_idx_info_t to contain an H5O_layout_t * instead of an H5O_layout_chunk_t * and an H5O_storage_chunk_t *. This was done so the index clients could access the version of the layout message.
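
To illustrate the shape of that refactoring, here is a minimal sketch with simplified stand-in types. Every `demo_*` and `DEMO_*` name is hypothetical and only mirrors the idea; the real `H5O_layout_t` and `H5D_chk_idx_info_t` carry many more fields. The point is that a single pointer to the whole layout lets an index client see the layout version as well as the chunk and storage information.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for H5O_LAYOUT_VERSION_4 */
#define DEMO_LAYOUT_VERSION_4 4u

/* Simplified stand-ins for the chunk layout and chunk storage structs */
typedef struct { uint64_t size; } demo_layout_chunk_t;
typedef struct { int idx_type; } demo_storage_chunk_t;

/* Simplified stand-in for the full layout message, which also has a version */
typedef struct {
    unsigned version;
    struct { demo_layout_chunk_t chunk; } u;
    struct { struct { demo_storage_chunk_t chunk; } u; } storage;
} demo_layout_t;

/* Before: two separate pointers, no access to the layout version */
typedef struct {
    const demo_layout_chunk_t  *layout;
    const demo_storage_chunk_t *storage;
} demo_idx_info_before_t;

/* After: one pointer to the whole layout; chunk and storage are reached through it */
typedef struct {
    const demo_layout_t *layout;
} demo_idx_info_after_t;

/* An index client can now branch on the layout version */
static int demo_uses_fixed_size_len(const demo_idx_info_after_t *info)
{
    assert(info != NULL && info->layout != NULL);
    return info->layout->version > DEMO_LAYOUT_VERSION_4;
}
```

With the "after" struct, the chunk size is reached as `info->layout->u.chunk.size` and the version as `info->layout->version`, which matches the field paths used in the macro suggested later in this review.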

Fixes #108


Important

Update chunked dataset file format to use 64-bit encoding for filtered chunk sizes in file format 2.0+, with refactoring and new tests.

  • Behavior:
    • Update chunked dataset file format to use 64 bits for encoding size of filtered chunks in file format 2.0 or above.
    • Chunks still limited to 2^32 - 1 bytes.
  • Code Refactoring:
    • Modify assertions and update macros for chunk size calculations in H5Dbtree.c, H5Dbtree2.c, H5Dearray.c, H5Dfarray.c, H5Dchunk.c, and H5Dint.c.
    • Refactor H5D_chk_idx_info_t to use H5O_layout_t instead of separate layout and storage.
  • Testing:
    • Add test_chunk_expand2() in dsets.c to test handling of filters that expand chunks too much.
    • Register and unregister H5Z_FILTER_EXPAND2 for testing purposes.
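
For readers unfamiliar with expanding filters, the following is a rough sketch of what such a callback can look like. It is not the test filter from this PR: the function name, the growth factor, and the `DEMO_FLAG_REVERSE` constant (a stand-in for `H5Z_FLAG_REVERSE`) are assumptions, and plain `malloc`/`free` are used so the sketch builds without HDF5. Only the parameter list follows the shape of an HDF5 filter callback.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_FLAG_REVERSE  0x0100u /* stand-in for H5Z_FLAG_REVERSE */
#define DEMO_EXPAND_FACTOR 4u      /* hypothetical growth factor */

/* Returns the number of valid bytes in *buf, or 0 on failure */
static size_t
demo_filter_expand(unsigned flags, size_t cd_nelmts, const unsigned cd_values[],
                   size_t nbytes, size_t *buf_size, void **buf)
{
    (void)cd_nelmts; /* no client data is used in this sketch */
    (void)cd_values;
    assert(buf != NULL && *buf != NULL && buf_size != NULL);

    if (flags & DEMO_FLAG_REVERSE) {
        /* Read path: the original data is the leading part of the chunk */
        return nbytes / DEMO_EXPAND_FACTOR;
    }

    /* Write path: grow the chunk, keeping the data at the front */
    if (nbytes == 0 || nbytes > SIZE_MAX / DEMO_EXPAND_FACTOR)
        return 0;
    size_t         new_size = nbytes * DEMO_EXPAND_FACTOR;
    unsigned char *out      = calloc(1, new_size); /* zero-filled padding */
    if (out == NULL)
        return 0;
    memcpy(out, *buf, nbytes);
    free(*buf);
    *buf      = out;
    *buf_size = new_size;
    return new_size;
}
```

A filter like this makes the stored chunk several times larger than the in-memory chunk, which is the situation where the width of the encoded chunk-size field matters.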

This description was created by Ellipsis for 06346d3. It will automatically update as commits are pushed.

@fortnern
Member Author

The failure looks unrelated to this change and may be a transient issue.

@jhendersonHDF
Collaborator

> The failure looks unrelated to this change and may be a transient issue

The macOS failure is a known issue that happens from time to time.

@lrknox added the "Component - C Library" label (Core C library issues, usually in the src directory) Oct 24, 2025
test/dsets.c Outdated
} /* end filter_expand2() */

/*-------------------------------------------------------------------------
* Function: test_chunk_expand2
Contributor

It looks like the main case - a filter that expands chunks by a large amount that wasn't previously allowed - isn't tested.

Member Author

I don't follow - that's what this does. I verified it failed before I implemented the changes in the library. test_chunk_expand is a preexisting test that tests cases where it's expected to fail, but I didn't want to copy that directly because it would have resulted in files that were too big.

Contributor

In that case, I think it's just the comment here that's wrong - it says this test is evaluating error handling on expected failure, but the test itself expects every operation to succeed.

Member Author

Oh, sorry, I'll fix that.

Collaborator

You could replace the similar macros:
H5D_BT2_COMPUTE_CHUNK_SIZE_LEN
H5D_EARRAY_COMPUTE_CHUNK_SIZE_LEN
H5D_FARRAY_COMPUTE_CHUNK_SIZE_LEN

with a single macro:

#define H5D_COMPUTE_CHUNK_SIZE_LEN(chunk_size_len, idx_info) \
    do { \
        if ((idx_info)->pline->nused > 0) { \
            if ((idx_info)->layout->version > H5O_LAYOUT_VERSION_4) \
                (chunk_size_len) = H5F_SIZEOF_SIZE((idx_info)->f); \
            else { \
                (chunk_size_len) = 1 + ((H5VM_log2_gen((uint64_t)(idx_info)->layout->u.chunk.size) + 8) / 8); \
                if ((chunk_size_len) > 8) \
                    (chunk_size_len) = 8; \
            } \
        } \
        else \
            (chunk_size_len) = 0; \
    } while(0)
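
To make the arithmetic in that macro concrete, here is a standalone sketch of the same computation as a plain function. It assumes `H5VM_log2_gen` returns floor(log2(n)); the `demo_*` names and the `fixed_width` flag (standing in for the `layout->version > H5O_LAYOUT_VERSION_4` test) are hypothetical, and `sizeof_size` stands in for `H5F_SIZEOF_SIZE(f)`.

```c
#include <assert.h>
#include <stdint.h>

/* floor(log2(n)) for n > 0; assumed equivalent of H5VM_log2_gen */
static unsigned demo_floor_log2(uint64_t n)
{
    unsigned r = 0;
    assert(n > 0);
    while (n >>= 1)
        r++;
    return r;
}

/* Number of bytes used to encode the size of a chunk in the index */
static unsigned demo_chunk_size_len(int has_filters, int fixed_width,
                                    uint64_t chunk_size, unsigned sizeof_size)
{
    if (!has_filters)
        return 0; /* unfiltered chunks do not store a per-chunk size */
    if (fixed_width)
        return sizeof_size; /* newer layout: always "size of lengths" bytes */

    /* Older layout: width derived from the unfiltered chunk size, capped at 8 */
    unsigned len = 1 + (demo_floor_log2(chunk_size) + 8) / 8;
    return len > 8 ? 8 : len;
}
```

For a 1 MiB chunk (2^20 bytes), the older branch yields 1 + (20 + 8) / 8 = 4 bytes, while the fixed-width branch yields 8 bytes when the size of lengths is 8.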

Member Author

The earray and farray ones are necessarily split into two. I can try to unify them on Monday though.

Member Author

@fortnern fortnern Oct 27, 2025

I'm actually inclined to leave it how it is, since this only applies to earray, farray, and btree2; not btree1, single, or none. It would then be inelegant to create a shared macro that only applies to some indices. There is a spot in H5Dchunk.c that does the calculation independently that would ideally be unified with the calculations in the index codes, but the proper way to do that would be to add an index callback to retrieve the chunk size encoding length. We don't really have time for that right now though, and it only applies to obsolete file formats so it's probably not worth it.

Contributor

@mattjala mattjala left a comment

The 2.0+ file format docs should be updated to replace appropriate mentions of Chunk Size (variable size; at most 8 bytes). This will require updating a couple of byte-layout tables.

@github-project-automation github-project-automation bot moved this from To be triaged to In progress in HDF5 - TRIAGE & TRACK Oct 27, 2025
@fortnern
Member Author

> The 2.0+ file format docs should be updated to replace appropriate mentions of Chunk Size (variable size; at most 8 bytes). This will require updating a couple of byte-layout tables.

Yes, I'm planning to do that, but since, IIRC, docs changes can go in after the code freeze, I'm prioritizing other things right now.

@jhendersonHDF jhendersonHDF self-assigned this Oct 28, 2025
@nbagha1 nbagha1 added this to the Release 2.0.0 milestone Oct 29, 2025
@lrknox lrknox merged commit 0441a4b into HDFGroup:develop Oct 30, 2025
91 checks passed
@github-project-automation github-project-automation bot moved this from In progress to Done in HDF5 - TRIAGE & TRACK Oct 30, 2025

Labels

Component - C Library Core C library issues (usually in the src directory)

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

HDF5 crashes with inefficient compressors

6 participants