
feat: use numcodecs as an optional backend for LZMA, ZSTD, ZLIB #1574

Open
Rachit931 wants to merge 21 commits into scikit-hep:main from Rachit931:fix-fifth-uproot-issue

Conversation

@Rachit931
Contributor

This PR is towards evaluating numcodecs as an alternative backend for compression in uproot, as discussed in #1568.

It introduces numcodecs as an optional backend for LZMA and ZSTD compression and decompression within compression.py, while preserving the existing interfaces, error handling, and fallback behavior. The existing cramjam (and stdlib, where applicable) implementations remain unchanged and are used as fallbacks if numcodecs is unavailable or unsuitable.

As of now, I have intentionally limited the scope of this PR to LZMA and ZSTD. While reading through the existing code, it seemed to me that the zlib and LZ4 paths rely on more specific behavior and error handling, and I wasn't confident that a straightforward swap to numcodecs would preserve semantics there. To avoid guessing and to keep this change minimal, those codecs are left unchanged for now.
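As a rough sketch of the pattern described above (illustrative names, not the actual compression.py code), the optional-backend selection looks something like:

```python
# Illustrative sketch of the optional-backend pattern; the real
# compression.py code differs. numcodecs is preferred when importable,
# with the stdlib (here, lzma) as the fallback.
import lzma

try:
    import numcodecs
    HAS_NUMCODECS = True
except ImportError:
    HAS_NUMCODECS = False

def decompress_lzma(data: bytes) -> bytes:
    """Decode an XZ/LZMA payload, preferring numcodecs when available."""
    if HAS_NUMCODECS:
        # numcodecs.LZMA defaults to the same container format as
        # lzma.compress (FORMAT_XZ), so the two paths are interchangeable.
        return bytes(numcodecs.LZMA().decode(data))
    return lzma.decompress(data)
```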

@ikrommyd
Collaborator

ikrommyd commented Feb 8, 2026

Looking at the diff, I think it's best to add numcodecs to uproot.extras. All these try except imports can be grouped into that place similarly to how it's done for every other optional dependency.
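For reference, the uproot.extras pattern the reviewer describes boils down to a small import-on-demand helper per optional dependency. A hypothetical version for numcodecs (the actual function name and error text in uproot.extras may differ) might look like:

```python
# Hypothetical sketch of an uproot.extras-style helper for numcodecs;
# the real helpers in uproot.extras may use different names and messages.
def numcodecs_module():
    """Import and return numcodecs, raising a helpful error if missing."""
    try:
        import numcodecs
    except ModuleNotFoundError as err:
        raise ModuleNotFoundError(
            "install the 'numcodecs' package with:\n\n"
            "    pip install numcodecs"
        ) from err
    return numcodecs
```

Callers then do `numcodecs = uproot.extras.numcodecs_module()` at the point of use, so the try/except imports are not scattered through compression.py.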

@Rachit931 Rachit931 changed the title Use numcodecs as an optional backend for LZMA and ZSTD compression. feat: use numcodecs as an optional backend for LZMA and ZSTD Feb 8, 2026
@Rachit931
Contributor Author

Thanks for the suggestion! Routing numcodecs through uproot.extras would definitely be more consistent with how other optional dependencies are handled.
@ariostas I’m happy to do the uproot.extras refactor for numcodecs, just wanted to ask.

And it looks like there’s a CI failure related to a type mismatch in the numcodecs path. I’m working on a small fix to normalize the output and will update shortly.

@Rachit931
Contributor Author

I'm sorry for the CI failures; I will take a deeper look at them today.

@Rachit931
Contributor Author

Rachit931 commented Feb 9, 2026

@ikrommyd Following up on this: you were right about routing numcodecs through uproot.extras; I was overthinking it before. I’ll refactor the current changes to centralize all numcodecs access there, consistent with how other optional dependencies are handled.

And the CI failures were caused by numcodecs returning lists in some encode/decode paths (surfaced by the fsspec + ssh tests). I’m normalizing the output at both boundaries and will include that fix as part of the uproot.extras refactor.

I’ll push an updated commit shortly once this is cleaned up.
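The boundary normalization described above could be as simple as a coercion helper (a sketch, not the PR's actual code; it assumes only that a backend may return bytes, a buffer-like object, or a list of chunks):

```python
# Sketch of a normalization helper for codec outputs; assumes the backend
# may return bytes, a buffer-like object, or a list of byte chunks.
def to_bytes(out) -> bytes:
    """Coerce a codec's encode/decode output to a plain bytes object."""
    if isinstance(out, bytes):
        return out
    if isinstance(out, (bytearray, memoryview)):
        return bytes(out)
    if isinstance(out, list):
        # e.g. a list of byte chunks: join them in order
        return b"".join(bytes(chunk) for chunk in out)
    # last resort: anything exposing the buffer protocol (numpy arrays, etc.)
    return bytes(out)
```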

@Rachit931
Contributor Author

I think I’m missing something fundamental in the write path and I’d really appreciate guidance here.

I’ve normalized list[bytes] outputs in the numcodecs encode/decode paths and also flattened list-like inputs at the start of compress(), but CI is still failing with
TypeError: expected string or bytes-like object, got 'list' in the fsspec/ssh write tests.

This makes me think there’s another write path where a list buffer bypasses compress() entirely, or a better place where this invariant should be enforced.

@Rachit931
Contributor Author

CI showed a list was still reaching the fsspec write path in test_fsspec_writing_ssh_simple, even after I added a small normalization before block splitting in compress() so the internal logic always works on bytes-like data.
I still can't figure out where the leak is coming from.

@ariostas
Copy link
Member

ariostas commented Feb 9, 2026

@Rachit931 the fsspec issue is unrelated to what you're doing. They broke something with their latest release. I'm just waiting a couple of days to see if they patch it or if we have to pin to an older version.

@Rachit931
Contributor Author

Thanks for the clarification!
In that case, I am happy to drop the additional normalization I added while chasing the fsspec failure, if that's preferred.
I'll wait for the fsspec situation to settle before making any further updates.

Separate question: once this settles, should I extend the numcodecs backend to the other ready codecs (e.g. LZ4 and zlib), or keep this PR limited to LZMA/ZSTD?

Member

@ariostas ariostas left a comment


Thank you, @Rachit931. I left a couple of comments.

Once this settles, should I extend the numcodecs backend to the other ready codecs (e.g. LZ4 and zlib), or keep this PR limited to LZMA/ZSTD?

It would be good to extend them to the full set of codecs.

To make sure that this is being tested by the CI, we should add the numcodecs dependency to the test group in pyproject.toml. Then, in the future when we're ready to fully switch we can make it a strict dependency and remove cramjam.
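Assuming uproot's pyproject.toml declares its test extras in the usual [project.optional-dependencies] layout, the addition would look roughly like this (hypothetical fragment; the actual test-group contents differ):

```toml
# Hypothetical fragment; the real test group in uproot's pyproject.toml
# contains other entries as well.
[project.optional-dependencies]
test = [
    "numcodecs",
    # ...existing test dependencies...
]
```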

Lastly, it would be better if you could use commit messages that describe what changed in each commit instead of using the same message every time. This makes it easier to go back and see how things evolved.

@Rachit931
Contributor Author

Rachit931 commented Feb 10, 2026

@ariostas

  1. When I added numcodecs as a backend for ZLIB and LZ4, multiple tests kept failing with the errors: local variable 'out' referenced before assignment and RuntimeError: LZ4 decompression error.
    To the extent I could find out: for ZLIB and LZ4, uproot already knows how large the data should be after decompression and expects the decompression library to check this while decoding.
    stdlib and cramjam both allow uproot to pass in this expected size and fail immediately if the decoded data does not match it, which helps catch corrupted or mismatched data early.
    Whereas numcodecs does not provide a way to pass in or check the expected output size during decoding.
    Maybe there's a modification I can't think of; I would love to know your opinion.

  2. For the changes as of now: I made slight changes per the feedback: removed the ValueError, dropped the attempt to convert the codec.encode output into bytes (which I had pursued while chasing the fsspec failure), and added numcodecs to the test group in pyproject.toml.
    And I apologize for reusing the same commit message every time. That's on me; it won't happen again.

Rachit931 and others added 2 commits February 10, 2026 19:42
… Remove ValueError handling for numcodecs backend, Remove unnecessary bytes() conversions for decoded output
Member

@ariostas ariostas left a comment


When I added numcodecs as a backend for ZLIB and LZ4, multiple tests kept failing with the errors: local variable 'out' referenced before assignment and RuntimeError: LZ4 decompression error.

The first error, it sounds like there was something wrong with your code. You should look for why it might be unassigned. For the second one, try to find an example where that happens. If possible, try to find a minimal reproducer of a byte-string that works with cramjam but not with numcodecs. If that's the case, we should report that to them so they can fix it.

To the extent I could find out: for ZLIB and LZ4, uproot already knows how large the data should be after decompression and expects the decompression library to check this while decoding.
stdlib and cramjam both allow uproot to pass in this expected size and fail immediately if the decoded data does not match it, which helps catch corrupted or mismatched data early.
Whereas numcodecs does not provide a way to pass in or check the expected output size during decoding.
Maybe there's a modification I can't think of; I would love to know your opinion.

You are already doing the right approach of just explicitly verifying that the length matches the expected one.
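The approach being endorsed here, verifying the decoded length yourself since numcodecs takes no expected-size argument, can be sketched with a stdlib codec standing in for the numcodecs one:

```python
# Sketch of an explicit post-decode size check; zlib stands in here for a
# numcodecs codec, which likewise returns data without a size guarantee.
import zlib

def decompress_checked(data: bytes, expected_size: int) -> bytes:
    """Decompress and fail fast if the output size is not as expected."""
    out = zlib.decompress(data)
    if len(out) != expected_size:
        raise ValueError(
            f"decompressed {len(out)} bytes, expected {expected_size}"
        )
    return out
```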

@Rachit931
Contributor Author

Rachit931 commented Feb 10, 2026

Appreciate the feedback! I’m currently looking into a minimal byte-string reproducer for the numcodecs case. In the meantime, I’ve simplified the code per the feedback and pushed those changes.

I also have a PR ready with the cleanup related to the dependency issue; I will include that once the changes here are settled.

@Rachit931 Rachit931 force-pushed the fix-fifth-uproot-issue branch from 0f47c22 to efbb484 Compare February 10, 2026 17:22
@Rachit931
Contributor Author

Rachit931 commented Feb 10, 2026

Added the missing changes as per the feedback.

@Rachit931
Contributor Author

You were right! My initial code for ZLIB specifically was wrong. After a small modification I am no longer getting any failed tests for ZLIB, but the LZ4 issue still persists; I will share whatever I find as I dig into it more.

@Rachit931
Contributor Author

Rachit931 commented Feb 11, 2026

@ariostas This is the reproducer I could find:

import os

import cramjam
import numcodecs

raw = os.urandom(1024)
compressed = cramjam.lz4.compress(raw)
expected_size = len(raw)

# Test cramjam
out1 = cramjam.lz4.decompress(compressed, expected_size)
print("cramjam is functional:", len(out1))

# Test numcodecs
codec = numcodecs.LZ4()
try:
    out2 = codec.decode(compressed)
    print("numcodecs is functional", len(out2))
except Exception as e:
    print("numcodecs failed:", e)

Output:

cramjam is functional: 1024
numcodecs failed: LZ4 decompression error: -7

This proves that the same valid LZ4 byte-string can be decompressed by cramjam but fails with numcodecs.
Please let me know if I am missing something or misusing the API.

@ariostas
Member

Thank you, @Rachit931! That's a good finding. Here is some code that compares three different libraries. It shows that lz4 and cramjam are compatible, whereas numcodecs doesn't work with the other two. You should open an issue in the numcodecs repo reporting this.

import cramjam
import numcodecs
import os
import lz4.frame

initial_data = os.urandom(128)

compressed = {}

compressed["lz4"] = lz4.frame.compress(initial_data)
compressed["cramjam"] = bytes(cramjam.lz4.compress(initial_data))
compressed["numcodecs"] = numcodecs.LZ4().encode(initial_data)

libraries = ["lz4", "cramjam", "numcodecs"]

for compressor in libraries:
    for decompressor in libraries:
        print(f"Decompressing data compressed by {compressor} using {decompressor}... ", end="")
        
        try:
            if decompressor == "lz4":
                decompressed = lz4.frame.decompress(compressed[compressor])
            elif decompressor == "cramjam":
                decompressed = bytes(cramjam.lz4.decompress(compressed[compressor]))
            elif decompressor == "numcodecs":
                decompressed = numcodecs.LZ4().decode(compressed[compressor])
            assert initial_data == decompressed, f"Decompression failed for {decompressor} with data compressed by {compressor}"
            print("Success!")
        except Exception as e:
            print(f"Failed: {e}")

outputs

Decompressing data compressed by lz4 using lz4... Success!
Decompressing data compressed by lz4 using cramjam... Success!
Decompressing data compressed by lz4 using numcodecs... Failed: LZ4 decompression error: -13
Decompressing data compressed by cramjam using lz4... Success!
Decompressing data compressed by cramjam using cramjam... Success!
Decompressing data compressed by cramjam using numcodecs... Failed: LZ4 decompression error: -9
Decompressing data compressed by numcodecs using lz4... Failed: LZ4F_getFrameInfo failed with code: ERROR_frameType_unknown
Decompressing data compressed by numcodecs using cramjam... Failed: LZ4 error: ERROR_frameType_unknown
Decompressing data compressed by numcodecs using numcodecs... Success!

@Rachit931
Contributor Author

Yeah, I'll open this issue for LZ4 in numcodecs, as it is failing at the compression-format level as well.
I've confirmed format-level compatibility for ZLIB, and the tests pass with pytest.
But I'll get back to you with confirmation for ZLIB after running more thorough comparisons against corrupted files and real ROOT files, to check whether ZLIB works consistently with numcodecs.

@Rachit931
Contributor Author

@ariostas The above tests show that numcodecs.LZ4 is not compatible with lz4.frame/cramjam.
I was testing decompression of test ROOT files with solely numcodecs.LZ4. When reading existing ROOT files that are LZ4 compressed, decoding fails.
This suggests that numcodecs.LZ4 may not match the LZ4 block format used by ROOT either.
This seems consistent with the failures being due to backend compatibility.

Test:

import uproot
import numcodecs

file = uproot.open("scikit-hep-testdata/src/skhep_testdata/data/uproot-issue79.root")
tree = file["taus;1"]
branch = tree["pt"]

basket = branch.basket(0)

compressed = basket.data
expected_size = basket.uncompressed_bytes

print("Compression:", file.file.compression)

codec = numcodecs.LZ4()

try:
    decoded = codec.decode(compressed)
    print("Decoded size:", len(decoded))
    print("Matches expected:", len(decoded) == expected_size)
except Exception as e:
    print("Decode failed:", type(e), e)

Output:
Decode failed: <class 'RuntimeError'> LZ4 decompression error: -8

@Rachit931 Rachit931 changed the title feat: use numcodecs as an optional backend for LZMA and ZSTD feat: use numcodecs as an optional backend for LZMA, ZSTD, ZLIB Feb 14, 2026
@ariostas
Member

Thank you, @Rachit931! I'd say we should wait a bit before moving on with this. I'm getting a bit worried because they still haven't commented on the issue you opened in the numcodecs repo. So I'm not sure how actively maintained it is.

@Rachit931
Contributor Author

Thanks for the feedback! I'll wait until there's a response on the numcodecs issue before moving forward.

