-
Notifications
You must be signed in to change notification settings - Fork 102
Add streaming decompression for ZSTD_CONTENTSIZE_UNKNOWN case #707
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
d-v-b
merged 15 commits into
zarr-developers:main
from
mkitti:mkitti-zstd-stream-decompress
Jul 10, 2025
Merged
Changes from 8 commits
Commits
Show all changes
15 commits
Select commit
Hold shift + click to select a range
d04e536
Add streaming decompression for ZSTD_CONTENTSIZE_UNKNOWN case
mkitti 1096df7
Add tests and documentation for streaming Zstd
mkitti a903ee5
Add better tests from numcodecs.js
mkitti ccddf41
Merge branch 'main' into mkitti-zstd-stream-decompress
mkitti d3049ca
Adapt zstd.pyx streaming for Py_buffer
mkitti 834b74d
Formatting with ruff
mkitti a2c7fb0
Fix zstd comparison of different signedness
mkitti 0dcffe6
Undo change to unrelated test
mkitti 4edc73a
Add zstd cli tests
mkitti 16cc9b3
Test Zstd against pyzstd
mkitti 308caf3
Apply ruff
mkitti 1f5670f
Fix zstd tests, coverage
mkitti dcf2989
Make imports not optional
mkitti 78f26fc
Add EndlessZstdDecompressor tests
mkitti 313d534
Add docstrings to test_pyzstd.py
mkitti File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where did these bytes come from? Ideally we would have a test that generated a streaming output from another zstd tool, and used that as an input. Is this particularly onerous to test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
alternatively, include explanatory reference information from the zstd spec as a comment. basically imagine someone else coming to do maintenance on this test -- how will they know where to look to decipher
bytes_val
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we compare it the
zstd
CLI?or should we break down the frame format:
https://github.com/facebook/zstd/blob/dev/doc/zstd_compression_format.md
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
whatever's easiest for you. the goal is to upgrade from a blob of (apparently) random bytes in our test to something that has a clear tie to the zstd spec.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The hard part for me is figuring out how to manage the test dependency and what the basic requirements are there. If we need to cobble this together relying on PyPI alone, then it probably makes sense to pull in python-zstandard or pyzstd as a test dependency
If we could do a conda package, then the conda-forge zstd package would probably the way to go. In this case, we would not need to rely on another 3d party wrapper but could just depend on the 1st party command line interface.
https://anaconda.org/conda-forge/zstd
Alternatively, we could also use the system package manager to install the zstd CLI.
The byte strings are there to decouple the dependency. They are the same as the bytes in the numcodecs.js test implementation.
https://github.com/manzt/numcodecs.js/blob/main/test%2Fzstd.test.js
I suppose what we could do is just add a bunch of tests that are conditional on the available tools or packages which would be optional dependencies.
I still think we should keep the byte strings there for the case when the no other tools are available. We could of course better document how to generate those bytes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is also fine, which is why I said earlier that a comment explaining where the bytes came from would be sufficient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which one?
https://pypi.org/project/pyzstd/
https://pypi.org/project/zstd/
https://pypi.org/project/zstandard/
I'm leaning towards pyzstd since it is relatively complete and well maintained.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i don't care which one 🤣
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added pyzstd tests. I even included a test which decompressing multiple frames in a buffer, which currently fails. I marked it as a xfail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The multiple frames issue can be resolved by #757