LibBlosc2: New codec #54

Open: wants to merge 16 commits into base: main

eschnett (Contributor) commented Jun 9, 2025

This is a first stab at implementing a Blosc2 codec. I believe the implementation is correct. I am looking for feedback.

codecov bot commented Jun 9, 2025

Codecov Report

Attention: Patch coverage is 86.55914% with 25 lines in your changes missing coverage. Please review.

Project coverage is 86.55%. Comparing base (4343c2a) to head (1a4b310).
Report is 4 commits behind head on main.

Files with missing lines      Patch %   Lines
LibBlosc2/src/encode.jl       82.69%    18 Missing ⚠️
LibBlosc2/src/decode.jl       88.46%    6 Missing ⚠️
LibBlosc2/src/libblosc2.jl    96.29%    1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main      #54       +/-   ##
===========================================
- Coverage   98.24%   86.55%   -11.69%     
===========================================
  Files           5        4        -1     
  Lines         456      186      -270     
===========================================
- Hits          448      161      -287     
- Misses          8       25       +17     


nhz2 (Member) commented Jun 10, 2025

If I understand correctly, the goal is to implement the HDF5 filter 32026 in https://github.com/silx-kit/hdf5plugin/blob/v5.1.0/src/PyTables/hdf5-blosc2/src/blosc2_filter.c

According to my reading of https://github.com/Blosc/c-blosc2/blob/v2.17.1/README_EXTENSION_FILENAMES.rst, there are also .b2frame and .b2nd formats. Therefore, the format here should be called Blosc2HDF5 to distinguish it from these.

eschnett (Contributor, Author) commented

My goal is to implement a stand-alone blosc2 compressor/decompressor. I did not intend to connect it to HDF5, although that should be possible.

The format I am implementing uses "super-chunks", which were introduced in blosc2. They allow compressing more than 2 GByte of data. Blosc2 still supports the compression methods used by blosc1, with their size limit. It would be possible to add support for these in LibBlosc2, e.g. by allowing a choice of format when compressing and detecting the format automatically when decompressing.

eschnett marked this pull request as draft June 10, 2025 16:52

eschnett (Contributor, Author) commented

The b2nd format is for storing multi-dimensional arrays. The ChunkCodecs API doesn't easily give access to this information (everything is a stream of bytes) and thus I don't think this format is interesting here. The format I'm implementing is the cframe format, a "contiguous frame" holding the compressed data.

eschnett marked this pull request as ready for review June 10, 2025 17:15

eschnett (Contributor, Author) commented

ping


# There's more unused/unchecked data
c[end-50] = 0x40
# BROKEN @test_throws Blosc2DecodingError decode(Blosc2DecodeOptions(), c)

nhz2 (Member) commented

It is okay for a format not to checksum everything.

Comment on lines +99 to +106
# Finally, this corruption has an effect
c[end-100] = 0x40
# Windows segfaults in this call with exit code 3221226356,
# indicating a heap corruption. That's clearly a bug in c-blosc2.
# It seems c-blosc2 does not checksum its compressed data.
if !Sys.iswindows()
    @test_throws Blosc2DecodingError decode(Blosc2DecodeOptions(), c)
end

nhz2 (Member) commented

Here is the file that is causing the segfault.
bad_file.txt

It would be good to see if this crashes https://github.com/Blosc/c-blosc2/blob/main/examples/decompress_file.c as well.

Until this is resolved, the documentation for this package should have warnings not to use the package with potentially invalid inputs.

eschnett (Contributor, Author) commented

Unfortunately, decompress_file does segfault on this input.

nhz2 (Member) commented

This looks like a bug in blosc2 or in the example. The issue isn't with checksums; blosc2 is probably missing a bounds check somewhere. Can you report this upstream?
