feat: use numcodecs as an optional backend for LZMA, ZSTD, ZLIB #1574
feat: use numcodecs as an optional backend for LZMA, ZSTD, ZLIB #1574Rachit931 wants to merge 21 commits intoscikit-hep:mainfrom
Conversation
|
Looking at the diff, I think it's best to add |
|
Thanks for the suggestion! Routing numcodecs through uproot.extras would definitely be more consistent with how other optional dependencies are handled. And it looks like there’s a CI failure related to a type mismatch in the numcodecs path. I’m working on a small fix to normalize the output and will update shortly for this as of now. |
|
I'm sorry for the ci failures, I will take a deeper look at it today. |
|
@ikrommyd Following up on this, you were right about routing numcodecs through uproot.extras, Maybe I was overthinking before. I’ll refactor the current changes to centralize all numcodecs access there, consistent with how other optional dependencies are handled. And the CI failures were caused by numcodecs returning lists in some encode/decode paths (surfaced by the fsspec + ssh tests). I’m normalizing the output at both boundaries and will include that fix as part of the uproot.extras refactor. I’ll push an updated commit shortly once this is cleaned up. |
|
I think I’m missing something fundamental in the write path and I’d really appreciate guidance here. I’ve normalized This makes me think there’s another write path where a list buffer bypasses |
|
CI showed a |
|
@Rachit931 the fsspec issue is unrelated to what you're doing. They broke something with their latest release. I'm just waiting a couple of days to see if they patch it or if we have to pin to an older version. |
|
Thanks for the clarification! Separate question : Once this settles, I wanted to ask if the extending the numcodecs backend to other ready codecs (eg LZ4 and zlib) or keeping this PR limited to LZMA/ZSTD ? |
ariostas
left a comment
There was a problem hiding this comment.
Thank you, @Rachit931. I left a couple of comments.
I wanted to ask if the extending the numcodecs backend to other ready codecs (eg LZ4 and zlib) or keeping this PR limited to LZMA/ZSTD ?
It would be good to extend them to the full set of codecs.
To make sure that this is being tested by the CI, we should add the numcodecs dependency to the test group in pyproject.toml. Then, in the future when we're ready to fully switch we can make it a strict dependency and remove cramjam.
Lasty, it would be better if you can use commit messages where you describe what changed in that commit instead of using the same message every time. This makes it easier to go back and see how things evolved.
|
… Remove ValueError handling for numcodecs backend, Remove unnecessary bytes() conversions for decoded output
ariostas
left a comment
There was a problem hiding this comment.
When I added numcodecs as a backend for ZLIB, LZ4, multiple tests kept failing with the error : local variable 'out' referenced before assignment and RuntimeError : LZ4 decompression error.
The first error, it sounds like there was something wrong with your code. You should look for why it might be unassigned. For the second one, try to find an example where that happens. If possible, try to find a minimal reproducer of a byte-string that works with cramjam but not with numcodecs. If that's the case, we should report that to them so they can fix it.
To the extent I could find out : For ZLIB and LZ4,uproot already knows how large the data should be after compression and expects the decompression library to check this while decoding.
stdlib and cramjam both allow uproot to pass in this expected size and fails immediately if the decoded data does not match it which helps in catching corrupted or mismatched data early.
Whereas, numcodecs does not provide a way to pass in or check the expected output size during decoding.
Maybe there's a modification which I can't this of, would love to know your opinion.
You are already doing the right approach of just explicitly verifying that the length matches the expected one.
|
Appreciate the feedback ! I’m currently looking into a minimal reproducer for a byte-string for the And I also have a PR ready with the cleanup related to dependency issue, will include that once the changes are settled here. |
0f47c22 to
efbb484
Compare
|
Added the missing changes as per the feedback. |
|
You were right ! My initial code for specifically ZLIB was wrong, after a little modification I am currently not getting any failed tests due to ZLIB but the issue for LZ4 still persists and I will share anything if I can find anything as I dig into it more. |
|
@ariostas This is the reproducer I could find : Output : This proves that the same valid LZ4 byte-string can be decompressed by |
|
Thank you, @Rachit931! That's a good finding. Here is some code that compares three different libraries. It shows that import cramjam
import numcodecs
import os
import lz4.frame
initial_data = os.urandom(128)
compressed = {}
compressed["lz4"] = lz4.frame.compress(initial_data)
compressed["cramjam"] = bytes(cramjam.lz4.compress(initial_data))
compressed["numcodecs"] = numcodecs.LZ4().encode(initial_data)
libraries = ["lz4", "cramjam", "numcodecs"]
for compressor in libraries:
for decompressor in libraries:
print(f"Decompressing data compressed by {compressor} using {decompressor}... ", end="")
try:
if decompressor == "lz4":
decompressed = lz4.frame.decompress(compressed[compressor])
elif decompressor == "cramjam":
decompressed = bytes(cramjam.lz4.decompress(compressed[compressor]))
elif decompressor == "numcodecs":
decompressed = numcodecs.LZ4().decode(compressed[compressor])
assert initial_data == decompressed, f"Decompression failed for {decompressor} with data compressed by {compressor}"
print("Success!")
except Exception as e:
print(f"Failed: {e}")outputs |
|
Yeah, I'll open this issue for |
|
@ariostas The above tests shows that TEST : import uproot
import numcodecs
file = uproot.open(""scikit-hep-testdata/src/skhep_testdata/data/uproot-issue79.root"")
tree = file["taus;1"]
branch = tree["pt"]
basket = branch.basket(0)
compressed = basket.data
expected_size = basket.uncompressed_bytes
print("Compression:",file.file.compression)
codec = numcodecs.LZ4()
try:
decoded = codec.decode(compressed)
print("Decoded size:", len(decoded))
print("Matches expected:", len(decoded) == expected_size)
except Exception as e:
print("Decode failed:", type(e), e)Output : |
|
Thank you, @Rachit931! I'd say we should wait a bit before moving on with this. I'm getting a bit worried because they still haven't commented on the issue you opened in the |
|
Thanks for the feedback ! I'll wait until there's any response on the numcodecs issue before moving forward. |
This PR is towards evaluating numcodecs as an alternative backend for compression in uproot, as discussed in #1568.
It introduces numcodecs as an optional backend for LZMA and ZSTD compression and decompression within compression.py, while preserving existing interfaces, error handling and fallback behavior. The existing cramjam (and stdlib, where applicable) implementations remains unchanged and are used as fallbacks if numcodecs is unavailable or unsuitable.
As of now, I have intentionally limited the scope of this PR to LZMA and ZSTD. While reading through the existng code, it seemed to me that for zlib and LZ4 paths rely on mroe specific behavior and error handling, and I wasn't confident that a straightforward swap to numcodecs would preserve semantics there. To avoid guesing and to keep this change minimal, those codecs are left unchanged for now.