
feat: use numcodecs as an optional backend for LZMA, ZSTD, ZLIB #1574

Open
Rachit931 wants to merge 21 commits into scikit-hep:main from Rachit931:fix-fifth-uproot-issue

Conversation

@Rachit931
Contributor

This PR is towards evaluating numcodecs as an alternative backend for compression in uproot, as discussed in #1568.

It introduces numcodecs as an optional backend for LZMA and ZSTD compression and decompression within compression.py, while preserving the existing interfaces, error handling, and fallback behavior. The existing cramjam (and stdlib, where applicable) implementations remain unchanged and are used as fallbacks if numcodecs is unavailable or unsuitable.

As of now, I have intentionally limited the scope of this PR to LZMA and ZSTD. While reading through the existing code, it seemed to me that the zlib and LZ4 paths rely on more specific behavior and error handling, and I wasn't confident that a straightforward swap to numcodecs would preserve semantics there. To avoid guessing and to keep this change minimal, those codecs are left unchanged for now.
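As a rough sketch of the pattern described above (illustrative names, not the actual compression.py code), the optional-backend selection looks something like:

```python
# Illustrative sketch of the optional-backend pattern; the real
# compression.py code differs. numcodecs is preferred when importable,
# with the stdlib (here, lzma) as the fallback.
import lzma

try:
    import numcodecs
    HAS_NUMCODECS = True
except ImportError:
    HAS_NUMCODECS = False

def decompress_lzma(data: bytes) -> bytes:
    """Decode an XZ/LZMA payload, preferring numcodecs when available."""
    if HAS_NUMCODECS:
        # numcodecs.LZMA defaults to the same container format as
        # lzma.compress (FORMAT_XZ), so the two paths are interchangeable.
        return bytes(numcodecs.LZMA().decode(data))
    return lzma.decompress(data)
```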

@ikrommyd
Collaborator

ikrommyd commented Feb 8, 2026

Looking at the diff, I think it's best to add numcodecs to uproot.extras. All these try except imports can be grouped into that place similarly to how it's done for every other optional dependency.
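For reference, the uproot.extras pattern the reviewer describes boils down to a small import-on-demand helper per optional dependency. A hypothetical version for numcodecs (the actual function name and error text in uproot.extras may differ) might look like:

```python
# Hypothetical sketch of an uproot.extras-style helper for numcodecs;
# the real helpers in uproot.extras may use different names and messages.
def numcodecs_module():
    """Import and return numcodecs, raising a helpful error if missing."""
    try:
        import numcodecs
    except ModuleNotFoundError as err:
        raise ModuleNotFoundError(
            "install the 'numcodecs' package with:\n\n"
            "    pip install numcodecs"
        ) from err
    return numcodecs
```

Callers then do `numcodecs = uproot.extras.numcodecs_module()` at the point of use, so the try/except imports are not scattered through compression.py.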

@Rachit931 Rachit931 changed the title Use numcodecs as an optional backend for LZMA and ZSTD compression. feat: use numcodecs as an optional backend for LZMA and ZSTD Feb 8, 2026
@Rachit931
Contributor Author

Thanks for the suggestion! Routing numcodecs through uproot.extras would definitely be more consistent with how other optional dependencies are handled.
@ariostas I’m happy to do the uproot.extras refactor for numcodecs, just wanted to ask.

And it looks like there’s a CI failure related to a type mismatch in the numcodecs path. I’m working on a small fix to normalize the output and will update shortly.

@Rachit931
Contributor Author

I'm sorry for the CI failures; I will take a deeper look at them today.

@Rachit931
Contributor Author

Rachit931 commented Feb 9, 2026

@ikrommyd Following up on this: you were right about routing numcodecs through uproot.extras; I was overthinking it before. I’ll refactor the current changes to centralize all numcodecs access there, consistent with how other optional dependencies are handled.

And the CI failures were caused by numcodecs returning lists in some encode/decode paths (surfaced by the fsspec + ssh tests). I’m normalizing the output at both boundaries and will include that fix as part of the uproot.extras refactor.

I’ll push an updated commit shortly once this is cleaned up.
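The boundary normalization described above could be as simple as a coercion helper (a sketch, not the PR's actual code; it assumes only that a backend may return bytes, a buffer-like object, or a list of chunks):

```python
# Sketch of a normalization helper for codec outputs; assumes the backend
# may return bytes, a buffer-like object, or a list of byte chunks.
def to_bytes(out) -> bytes:
    """Coerce a codec's encode/decode output to a plain bytes object."""
    if isinstance(out, bytes):
        return out
    if isinstance(out, (bytearray, memoryview)):
        return bytes(out)
    if isinstance(out, list):
        # e.g. a list of byte chunks: join them in order
        return b"".join(bytes(chunk) for chunk in out)
    # last resort: anything exposing the buffer protocol (numpy arrays, etc.)
    return bytes(out)
```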

@Rachit931
Contributor Author

I think I’m missing something fundamental in the write path and I’d really appreciate guidance here.

I’ve normalized list[bytes] outputs in the numcodecs encode/decode paths and also flattened list-like inputs at the start of compress(), but CI is still failing with
TypeError: expected string or bytes-like object, got 'list' in the fsspec/ssh write tests.

This makes me think there’s another write path where a list buffer bypasses compress() entirely, or a better place where this invariant should be enforced.

@Rachit931
Contributor Author

CI showed a list was still reaching the fsspec write path in test_fsspec_writing_ssh_simple, even after I added a small normalization before block splitting in compress() so the internal logic always works on bytes-like data.
I still can't figure out where the leak is coming from.

@ariostas
Copy link
Member

ariostas commented Feb 9, 2026

@Rachit931 the fsspec issue is unrelated to what you're doing. They broke something with their latest release. I'm just waiting a couple of days to see if they patch it or if we have to pin to an older version.

@Rachit931
Contributor Author

Thanks for the clarification!
In that case, I am happy to drop the additional normalization I added while chasing the fsspec failure, if that's preferred.
I'll wait for the fsspec situation to settle before making any further updates.

Separate question: once this settles, should I extend the numcodecs backend to the other ready codecs (e.g. LZ4 and zlib), or keep this PR limited to LZMA/ZSTD?

Member

@ariostas ariostas left a comment


Thank you, @Rachit931. I left a couple of comments.

Once this settles, should I extend the numcodecs backend to the other ready codecs (e.g. LZ4 and zlib), or keep this PR limited to LZMA/ZSTD?

It would be good to extend them to the full set of codecs.

To make sure that this is being tested by the CI, we should add the numcodecs dependency to the test group in pyproject.toml. Then, in the future when we're ready to fully switch we can make it a strict dependency and remove cramjam.
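Assuming uproot's pyproject.toml declares its test extras in the usual [project.optional-dependencies] layout, the addition would look roughly like this (hypothetical fragment; the actual test-group contents differ):

```toml
# Hypothetical fragment; the real test group in uproot's pyproject.toml
# contains other entries as well.
[project.optional-dependencies]
test = [
    "numcodecs",
    # ...existing test dependencies...
]
```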

Lastly, it would be better if you could use commit messages that describe what changed in each commit instead of using the same message every time. This makes it easier to go back and see how things evolved.

@Rachit931
Contributor Author

Rachit931 commented Feb 10, 2026

@ariostas

  1. When I added numcodecs as a backend for ZLIB and LZ4, multiple tests kept failing with the errors: local variable 'out' referenced before assignment and RuntimeError: LZ4 decompression error.
    To the extent I could find out: for ZLIB and LZ4, uproot already knows how large the data should be after decompression and expects the decompression library to check this while decoding.
    stdlib and cramjam both allow uproot to pass in this expected size and fail immediately if the decoded data does not match it, which helps catch corrupted or mismatched data early.
    Whereas numcodecs does not provide a way to pass in or check the expected output size during decoding.
    Maybe there's a modification I can't think of; I would love to know your opinion.

  2. For the changes as of now: I made slight changes per the feedback: removed the ValueError, dropped the attempt to convert the codec.encode output into bytes (which I had pursued while chasing the fsspec failure), and added numcodecs to the test group in pyproject.toml.
    And I apologize for reusing the same commit message every time. That's on me; it won't happen again.

Rachit931 and others added 2 commits February 10, 2026 19:42
… Remove ValueError handling for numcodecs backend, Remove unnecessary bytes() conversions for decoded output
Member

@ariostas ariostas left a comment


When I added numcodecs as a backend for ZLIB and LZ4, multiple tests kept failing with the errors: local variable 'out' referenced before assignment and RuntimeError: LZ4 decompression error.

The first error, it sounds like there was something wrong with your code. You should look for why it might be unassigned. For the second one, try to find an example where that happens. If possible, try to find a minimal reproducer of a byte-string that works with cramjam but not with numcodecs. If that's the case, we should report that to them so they can fix it.

To the extent I could find out: for ZLIB and LZ4, uproot already knows how large the data should be after decompression and expects the decompression library to check this while decoding.
stdlib and cramjam both allow uproot to pass in this expected size and fail immediately if the decoded data does not match it, which helps catch corrupted or mismatched data early.
Whereas numcodecs does not provide a way to pass in or check the expected output size during decoding.
Maybe there's a modification I can't think of; I would love to know your opinion.

You are already doing the right approach of just explicitly verifying that the length matches the expected one.
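The approach being endorsed here, verifying the decoded length yourself since numcodecs takes no expected-size argument, can be sketched with a stdlib codec standing in for the numcodecs one:

```python
# Sketch of an explicit post-decode size check; zlib stands in here for a
# numcodecs codec, which likewise returns data without a size guarantee.
import zlib

def decompress_checked(data: bytes, expected_size: int) -> bytes:
    """Decompress and fail fast if the output size is not as expected."""
    out = zlib.decompress(data)
    if len(out) != expected_size:
        raise ValueError(
            f"decompressed {len(out)} bytes, expected {expected_size}"
        )
    return out
```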

@Rachit931
Contributor Author

Rachit931 commented Feb 10, 2026

Appreciate the feedback! I’m currently looking into a minimal byte-string reproducer for the numcodecs case. In the meantime, I’ve simplified the code per the feedback and pushed those changes.

I also have a PR ready with the cleanup related to the dependency issue; I will include that once the changes here are settled.

@Rachit931 Rachit931 force-pushed the fix-fifth-uproot-issue branch from 0f47c22 to efbb484 Compare February 10, 2026 17:22
@Rachit931
Contributor Author

Rachit931 commented Feb 10, 2026

Added the missing changes as per the feedback.

@Rachit931
Contributor Author

You were right! My initial code for ZLIB specifically was wrong. After a small modification I am no longer getting any failed tests for ZLIB, but the LZ4 issue still persists; I will share whatever I find as I dig into it more.

@Rachit931
Contributor Author

Rachit931 commented Feb 11, 2026

@ariostas This is the reproducer I could find:

import os

import cramjam
import numcodecs

raw = os.urandom(1024)
compressed = cramjam.lz4.compress(raw)
expected_size = len(raw)

# Test cramjam
out1 = cramjam.lz4.decompress(compressed, expected_size)
print("cramjam is functional:", len(out1))

# Test numcodecs
codec = numcodecs.LZ4()
try:
    out2 = codec.decode(compressed)
    print("numcodecs is functional", len(out2))
except Exception as e:
    print("numcodecs failed:", e)

Output:

cramjam is functional: 1024
numcodecs failed: LZ4 decompression error: -7

This proves that the same valid LZ4 byte-string can be decompressed by cramjam but fails with numcodecs.
Please let me know if I am missing something or misusing the API.

@ariostas
Member

Thank you, @Rachit931! That's a good finding. Here is some code that compares three different libraries. It shows that lz4 and cramjam are compatible, whereas numcodecs doesn't work with the other two. You should open an issue in the numcodecs repo reporting this.

import cramjam
import numcodecs
import os
import lz4.frame

initial_data = os.urandom(128)

compressed = {}

compressed["lz4"] = lz4.frame.compress(initial_data)
compressed["cramjam"] = bytes(cramjam.lz4.compress(initial_data))
compressed["numcodecs"] = numcodecs.LZ4().encode(initial_data)

libraries = ["lz4", "cramjam", "numcodecs"]

for compressor in libraries:
    for decompressor in libraries:
        print(f"Decompressing data compressed by {compressor} using {decompressor}... ", end="")
        
        try:
            if decompressor == "lz4":
                decompressed = lz4.frame.decompress(compressed[compressor])
            elif decompressor == "cramjam":
                decompressed = bytes(cramjam.lz4.decompress(compressed[compressor]))
            elif decompressor == "numcodecs":
                decompressed = numcodecs.LZ4().decode(compressed[compressor])
            assert initial_data == decompressed, f"Decompression failed for {decompressor} with data compressed by {compressor}"
            print("Success!")
        except Exception as e:
            print(f"Failed: {e}")

outputs

Decompressing data compressed by lz4 using lz4... Success!
Decompressing data compressed by lz4 using cramjam... Success!
Decompressing data compressed by lz4 using numcodecs... Failed: LZ4 decompression error: -13
Decompressing data compressed by cramjam using lz4... Success!
Decompressing data compressed by cramjam using cramjam... Success!
Decompressing data compressed by cramjam using numcodecs... Failed: LZ4 decompression error: -9
Decompressing data compressed by numcodecs using lz4... Failed: LZ4F_getFrameInfo failed with code: ERROR_frameType_unknown
Decompressing data compressed by numcodecs using cramjam... Failed: LZ4 error: ERROR_frameType_unknown
Decompressing data compressed by numcodecs using numcodecs... Success!

@Rachit931
Contributor Author

Yeah, I'll open this issue for LZ4 in numcodecs, as it is failing at the compression-format level as well.
I've confirmed format-level compatibility for ZLIB, and the tests pass with pytest.
But I'll get back to you with confirmation for ZLIB after running more thorough comparisons against corrupted files and real ROOT files, to check whether ZLIB works consistently with numcodecs.

@Rachit931
Contributor Author

@ariostas The above tests show that numcodecs.LZ4 is not compatible with lz4.frame/cramjam.
I was testing decompression of test ROOT files with solely numcodecs.LZ4. When reading existing ROOT files that are LZ4 compressed, decoding fails.
This suggests that numcodecs.LZ4 may not match the LZ4 block format used by ROOT either.
This seems consistent with the failures being due to backend compatibility.

Test:

import uproot
import numcodecs

file = uproot.open("scikit-hep-testdata/src/skhep_testdata/data/uproot-issue79.root")
tree = file["taus;1"]
branch = tree["pt"]

basket = branch.basket(0)

compressed = basket.data
expected_size = basket.uncompressed_bytes

print("Compression:", file.file.compression)

codec = numcodecs.LZ4()

try:
    decoded = codec.decode(compressed)
    print("Decoded size:", len(decoded))
    print("Matches expected:", len(decoded) == expected_size)
except Exception as e:
    print("Decode failed:", type(e), e)

Output:
Decode failed: <class 'RuntimeError'> LZ4 decompression error: -8

@Rachit931 Rachit931 changed the title feat: use numcodecs as an optional backend for LZMA and ZSTD feat: use numcodecs as an optional backend for LZMA, ZSTD, ZLIB Feb 14, 2026
@ariostas
Member

Thank you, @Rachit931! I'd say we should wait a bit before moving on with this. I'm getting a bit worried because they still haven't commented on the issue you opened in the numcodecs repo. So I'm not sure how actively maintained it is.

@Rachit931
Contributor Author

Thanks for the feedback! I'll wait until there's a response on the numcodecs issue before moving forward.

