Contributor

@d-v-b d-v-b commented Aug 13, 2025

This PR brings in all the codecs defined in numcodecs.zarr3. After this PR is merged, we can safely replace the numcodecs.zarr3 module with reexports from zarr python, or remove numcodecs.zarr3 entirely, thereby fixing our circular dependency problem.

This PR also changes the default config to ensure that the locally-defined codecs take priority over the same codec found in the numcodecs registry.
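As a rough illustration of what "taking priority" can mean when two classes are registered under one codec name, here is a toy model. Every name in it (`toy_register`, `toy_resolve`, `LocalBlosc`, `ExternalBlosc`, the `"numcodecs.blosc"` key) is hypothetical, and the fallback rule is an assumption made for this sketch; it is not zarr's actual registry code.

```python
# Toy model: every name here is hypothetical; this is not zarr's real registry code.
from typing import Dict, List


class LocalBlosc:
    """Stands in for a codec class defined inside zarr-python."""


class ExternalBlosc:
    """Stands in for the same codec as provided by another package."""


# One codec name can map to several registered classes.
_registry: Dict[str, List[type]] = {}


def toy_register(name: str, cls: type) -> None:
    """Record cls as one of the implementations available under name."""
    _registry.setdefault(name, []).append(cls)


def qualified_name(cls: type) -> str:
    """Build the dotted path used as the config value for a class."""
    return f"{cls.__module__}.{cls.__qualname__}"


def toy_resolve(name: str, config: Dict[str, str]) -> type:
    """Return the class the config names, else the most recently registered one."""
    candidates = _registry[name]
    preferred = config.get(name)
    for cls in candidates:
        if qualified_name(cls) == preferred:
            return cls
    # Assumption for this sketch: with no matching config entry, the last registration wins.
    return candidates[-1]


toy_register("numcodecs.blosc", LocalBlosc)
toy_register("numcodecs.blosc", ExternalBlosc)

# A default config entry pointing at the local class makes it win the lookup.
default_config = {"numcodecs.blosc": qualified_name(LocalBlosc)}
```

In this toy, `toy_resolve("numcodecs.blosc", default_config)` picks the local class even though the external one was registered later, which is the effect the config change is meant to have.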

@github-actions github-actions bot added the "needs release notes" label (automatically applied to PRs which haven't added release notes) Aug 13, 2025

codecov bot commented Aug 13, 2025

Codecov Report

❌ Patch coverage is 31.73077% with 142 lines in your changes missing coverage. Please review.
✅ Project coverage is 60.53%. Comparing base (e76b1e0) to head (3732501).
⚠️ Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/zarr/codecs/numcodecs/_codecs.py | 38.46% | 104 Missing ⚠️ |
| src/zarr/codecs/__init__.py | 0.00% | 33 Missing ⚠️ |
| src/zarr/codecs/numcodecs/__init__.py | 0.00% | 3 Missing ⚠️ |
| src/zarr/codecs/sharding.py | 0.00% | 1 Missing ⚠️ |
| src/zarr/registry.py | 50.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3376      +/-   ##
==========================================
- Coverage   60.60%   60.53%   -0.08%     
==========================================
  Files          79       81       +2     
  Lines        9506     9694     +188     
==========================================
+ Hits         5761     5868     +107     
- Misses       3745     3826      +81     
| Files with missing lines | Coverage Δ |
| --- | --- |
| src/zarr/codecs/blosc.py | 39.39% <ø> (+0.78%) ⬆️ |
| src/zarr/codecs/bytes.py | 54.09% <ø> (+2.53%) ⬆️ |
| src/zarr/codecs/crc32c_.py | 43.75% <ø> (+2.57%) ⬆️ |
| src/zarr/codecs/gzip.py | 30.30% <ø> (+1.73%) ⬆️ |
| src/zarr/codecs/transpose.py | 47.27% <ø> (+1.65%) ⬆️ |
| src/zarr/codecs/vlen_utf8.py | 28.07% <ø> (+1.40%) ⬆️ |
| src/zarr/codecs/zstd.py | 36.00% <ø> (+1.38%) ⬆️ |
| src/zarr/core/config.py | 29.16% <ø> (+4.16%) ⬆️ |
| src/zarr/codecs/sharding.py | 59.07% <0.00%> (+0.18%) ⬆️ |
| src/zarr/registry.py | 63.63% <50.00%> (ø) |
... and 3 more

... and 4 files with indirect coverage changes


def _encode(self, chunk_data: Buffer, prototype: BufferPrototype) -> Buffer:
    encoded = self._codec.encode(chunk_data.as_array_like())
    if isinstance(encoded, np.ndarray):  # Required for checksum codecs
Contributor

Can we know statically which are checksum codecs without the isinstance check?

Contributor Author

n.b., this was copy + pasted from numcodecs, but I think the answer is "no"

codec_name: str
codec_config: dict[str, JSON]

def __init_subclass__(cls, *, codec_name: str | None = None, **kwargs: Any) -> None:
Contributor

What would a codec definition look like without this magic? I'd be fine with repeating a few things if it meant we could avoid this (and IIUC some of the complexity in __repr__ and __init__ would go away too?).

Contributor Author

now that I have rebased #3332 off of this PR, I am 100% going to demagic these codecs in that effort.
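To make the trade-off in this thread concrete, here is a minimal sketch comparing the `__init_subclass__` keyword style with a plain class attribute. The class names are hypothetical, and the value of `CODEC_PREFIX` is assumed for illustration; this is not the code that will land in #3332.

```python
from typing import Any, Optional

CODEC_PREFIX = "numcodecs."  # assumed value, for illustration only


class MagicBase:
    codec_name: str

    def __init_subclass__(cls, *, codec_name: Optional[str] = None, **kwargs: Any) -> None:
        # Runs once for every subclass, at class-definition time.
        super().__init_subclass__(**kwargs)
        if codec_name is not None:
            cls.codec_name = f"{CODEC_PREFIX}{codec_name}"


# "Magic" style: the keyword in the class statement sets the attribute.
class MagicZlib(MagicBase, codec_name="zlib"):
    pass


class PlainBase:
    codec_name: str


# Explicit style: the subclass repeats the full name as an ordinary class attribute.
class PlainZlib(PlainBase):
    codec_name = "numcodecs.zlib"
```

Both subclasses end up with the same `codec_name`. The explicit version repeats the prefix in each class but needs no hook on the base class, which is roughly the repetition the reviewer says would be acceptable.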

@d-v-b
Contributor Author

d-v-b commented Aug 17, 2025

> Can you explain the different cases here?

Spinning this question out into the main thread -- from me, the general answer to questions like this will be "no", since I am only copy+pasting stuff from numcodecs. I haven't spent too much time figuring out what this code is doing. I do think @normanrz and @TomNicholas might be able to answer some of these questions though.

@TomAugspurger
Contributor

Ah, I didn't realize this was mostly from numcodecs. I think that moots most of my comments aside from where in the public API we put these.

@d-v-b
Contributor Author

d-v-b commented Aug 17, 2025

yeah I should have made more clear that this is nearly all directly copy + pasted from numcodecs.zarr3

@github-actions github-actions bot removed the "needs release notes" label (automatically applied to PRs which haven't added release notes) Aug 21, 2025
@d-v-b
Contributor Author

d-v-b commented Aug 21, 2025

I think this is ready to go in (and it's necessary for #3332)

@d-v-b
Contributor Author

d-v-b commented Aug 21, 2025

an important recent addition: I moved all of the invocations of register_codec to src/zarr/codecs/__init__.py. This ensures that all codecs get registered, regardless of whether they are part of zarr.codecs.__all__.

@maxrjones
Member

@d-v-b have you tested this with different numcodecs versions to make sure there's no unexpected issues with clobbering of codec registration?

@d-v-b
Contributor Author

d-v-b commented Aug 21, 2025

> @d-v-b have you tested this with different numcodecs versions to make sure there's no unexpected issues with clobbering of codec registration?

I don't expect any behavior to depend on numcodecs versions, happy to be corrected though. My understanding is that the codecs in numcodecs.zarr3 are not registered with the numcodecs registry, and instead are exposed via the entrypoints framework. That means we don't have to worry about anything interacting with numcodecs' own registry.

we have a test that checks our compatibility with these codecs, defined as dicts. In main this test will pick up the codec class from numcodecs, but in this PR the version of the codec defined in zarr python is used instead. Is that the kind of thing you are worried about?

Contributor

@rabernat rabernat left a comment

Fantastic, great work Davis!

Comment on lines +70 to +85
@dataclass(frozen=True)
class _NumcodecsCodec(Metadata):
    codec_name: str
    codec_config: dict[str, JSON]

    def __init_subclass__(cls, *, codec_name: str | None = None, **kwargs: Any) -> None:
        """To be used only when creating the actual public-facing codec class."""
        super().__init_subclass__(**kwargs)
        if codec_name is not None:
            namespace = codec_name

            cls_name = f"{CODEC_PREFIX}{namespace}.{cls.__name__}"
            cls.codec_name = f"{CODEC_PREFIX}{namespace}"
            cls.__doc__ = f"""
            See :class:`{cls_name}` for more details and parameters.
            """
Contributor

I vaguely remember a discussion a few months back about classes initialized this way having challenges with serialization...but I can't track down the issue.

Contributor Author

those issues would have been resolved by zarr-developers/numcodecs#745, and this PR uses the changes from that PR

Comment on lines +104 to +118
def test_generic_compressor(codec_class: type[_numcodecs._NumcodecsBytesBytesCodec]) -> None:
    data = np.arange(0, 256, dtype="uint16").reshape((16, 16))

    with pytest.warns(ZarrUserWarning, match=EXPECTED_WARNING_STR):
        a = create_array(
            {},
            shape=data.shape,
            chunks=(16, 16),
            dtype=data.dtype,
            fill_value=0,
            compressors=[codec_class()],
        )

    a[:, :] = data.copy()
    np.testing.assert_array_equal(data, a[:, :])
Contributor

Love this test.

@d-v-b d-v-b enabled auto-merge (squash) September 13, 2025 16:35
@d-v-b d-v-b merged commit bce30dd into zarr-developers:main Sep 13, 2025
29 checks passed
@d-v-b d-v-b deleted the chore/handle-numcodecs-codecs branch September 13, 2025 18:43
@d-v-b d-v-b mentioned this pull request Sep 16, 2025
26 tasks