add numcodec protocol #3318

Merged
merged 42 commits into from
Aug 13, 2025
Commits
a367268
add numcodec protocol
d-v-b Jul 31, 2025
1d424c0
add tests for numcodecs compatibility
d-v-b Jul 31, 2025
41dd6ff
changelog
d-v-b Jul 31, 2025
c435a59
ignore unknown key
d-v-b Jul 31, 2025
8e50ef8
remove re-implementation of get_codec
d-v-b Aug 1, 2025
ef31c5b
Merge branch 'main' into feat/numcodecs-protocol
d-v-b Aug 1, 2025
4ba7914
Merge branch 'main' into feat/numcodecs-protocol
d-v-b Aug 4, 2025
ab52539
Merge branch 'main' into feat/numcodecs-protocol
d-v-b Aug 4, 2025
95c9c8b
Merge branch 'main' into feat/numcodecs-protocol
d-v-b Aug 4, 2025
fcf84b3
Merge branch 'main' into feat/numcodecs-protocol
d-v-b Aug 4, 2025
5b0c3ac
Merge branch 'main' into feat/numcodecs-protocol
d-v-b Aug 5, 2025
84c9780
avoid circular imports by importing lower-level routines exactly wher…
d-v-b Aug 5, 2025
9a2f35b
push numcodec protocol into abcs; remove all numcodecs.abc.Codec type…
d-v-b Aug 5, 2025
0d0712f
add tests for codecjson typeguard
d-v-b Aug 5, 2025
931bf2f
avoid using zarr's buffer / ndbuffer for numcodec encode / decode
d-v-b Aug 5, 2025
01bd4b7
use Any to model input / output types of numcodec protocol
d-v-b Aug 5, 2025
f06c6aa
add numcodec protocol
d-v-b Jul 31, 2025
b71e8ac
add tests for numcodecs compatibility
d-v-b Jul 31, 2025
bcaa9ee
changelog
d-v-b Jul 31, 2025
7e49f39
ignore unknown key
d-v-b Jul 31, 2025
4b53f5d
remove re-implementation of get_codec
d-v-b Aug 1, 2025
b35e6c9
avoid circular imports by importing lower-level routines exactly wher…
d-v-b Aug 5, 2025
deef94a
push numcodec protocol into abcs; remove all numcodecs.abc.Codec type…
d-v-b Aug 5, 2025
f057525
add tests for codecjson typeguard
d-v-b Aug 5, 2025
190e1b2
avoid using zarr's buffer / ndbuffer for numcodec encode / decode
d-v-b Aug 5, 2025
82992c5
use Any to model input / output types of numcodec protocol
d-v-b Aug 5, 2025
7ea7e91
Merge branch 'feat/numcodecs-protocol' of github.com:d-v-b/zarr-pytho…
d-v-b Aug 5, 2025
413573a
Merge branch 'main' of github.com:zarr-developers/zarr-python into fe…
d-v-b Aug 6, 2025
cee4389
Merge branch 'main' into feat/numcodecs-protocol
d-v-b Aug 6, 2025
76f666c
Merge branch 'main' into feat/numcodecs-protocol
d-v-b Aug 6, 2025
c86be01
Update src/zarr/abc/numcodec.py
d-v-b Aug 10, 2025
dba39f5
Update src/zarr/abc/numcodec.py
d-v-b Aug 10, 2025
a857fc2
Update src/zarr/abc/numcodec.py
d-v-b Aug 10, 2025
a082222
Update src/zarr/abc/numcodec.py
d-v-b Aug 10, 2025
ccaaa65
Update src/zarr/abc/numcodec.py
d-v-b Aug 10, 2025
c1991e4
Merge branch 'feat/numcodecs-protocol' of github.com:d-v-b/zarr-pytho…
d-v-b Aug 13, 2025
bb28d1d
fix docstrings
d-v-b Aug 13, 2025
eedea84
revert changes to store imports
d-v-b Aug 13, 2025
fcc010b
remove whitespace
d-v-b Aug 13, 2025
0166d44
fix docstring
d-v-b Aug 13, 2025
ab19c46
Merge branch 'main' into feat/numcodecs-protocol
d-v-b Aug 13, 2025
194e70d
Merge branch 'main' into feat/numcodecs-protocol
d-v-b Aug 13, 2025
2 changes: 2 additions & 0 deletions changes/3318.misc.rst
@@ -0,0 +1,2 @@
Define a ``Protocol`` to model the ``numcodecs.abc.Codec`` interface. This is groundwork toward
making ``numcodecs`` an optional dependency for ``zarr-python``.
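The changelog entry relies on structural typing: a class satisfies a ``Protocol`` by having the right shape, without importing or subclassing anything from ``numcodecs``. A minimal sketch of that idea follows. The names ``CodecLike``, ``Reverse`` and ``roundtrip`` are hypothetical and are not part of zarr-python or this PR.

```python
from typing import Any, Protocol


class CodecLike(Protocol):
    """Hypothetical, reduced stand-in for the protocol added in this PR."""

    codec_id: str

    def encode(self, buf: Any) -> Any: ...

    def decode(self, buf: Any, out: Any = None) -> Any: ...


class Reverse:
    """Toy codec that neither imports nor subclasses anything from numcodecs."""

    codec_id = "reverse"

    def encode(self, buf: Any) -> Any:
        # Reverse the byte order of the payload.
        return bytes(buf)[::-1]

    def decode(self, buf: Any, out: Any = None) -> Any:
        # Reversing twice restores the original payload.
        return bytes(buf)[::-1]


def roundtrip(codec: CodecLike, payload: bytes) -> bytes:
    # A static type checker accepts Reverse here because its shape matches.
    return codec.decode(codec.encode(payload))
```

Because the annotation refers to the protocol rather than to ``numcodecs.abc.Codec``, code written this way can be type checked even when ``numcodecs`` is not installed.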
28 changes: 26 additions & 2 deletions src/zarr/abc/codec.py
@@ -1,11 +1,14 @@
from __future__ import annotations

from abc import abstractmethod
from typing import TYPE_CHECKING, Generic, TypeVar
from collections.abc import Mapping
from typing import TYPE_CHECKING, Generic, TypeGuard, TypeVar

from typing_extensions import ReadOnly, TypedDict

from zarr.abc.metadata import Metadata
from zarr.core.buffer import Buffer, NDBuffer
from zarr.core.common import ChunkCoords, concurrent_map
from zarr.core.common import ChunkCoords, NamedConfig, concurrent_map
from zarr.core.config import config

if TYPE_CHECKING:
@@ -34,6 +37,27 @@
CodecInput = TypeVar("CodecInput", bound=NDBuffer | Buffer)
CodecOutput = TypeVar("CodecOutput", bound=NDBuffer | Buffer)

TName = TypeVar("TName", bound=str, covariant=True)


class CodecJSON_V2(TypedDict, Generic[TName]):
"""The JSON representation of a codec for Zarr V2"""

id: ReadOnly[TName]


def _check_codecjson_v2(data: object) -> TypeGuard[CodecJSON_V2[str]]:
return isinstance(data, Mapping) and "id" in data and isinstance(data["id"], str)


CodecJSON_V3 = str | NamedConfig[str, Mapping[str, object]]
"""The JSON representation of a codec for Zarr V3."""

# The widest type we will *accept* for a codec JSON
# This covers v2 and v3
CodecJSON = str | Mapping[str, object]
"""The widest type of JSON-like input that could specify a codec."""


class BaseCodec(Metadata, Generic[CodecInput, CodecOutput]):
"""Generic base class for codecs.
5 changes: 2 additions & 3 deletions src/zarr/api/asynchronous.py
@@ -46,9 +46,8 @@
if TYPE_CHECKING:
from collections.abc import Iterable

import numcodecs.abc

from zarr.abc.codec import Codec
from zarr.codecs._v2 import Numcodec
from zarr.core.buffer import NDArrayLikeOrScalar
from zarr.core.chunk_key_encodings import ChunkKeyEncoding
from zarr.storage import StoreLike
@@ -871,7 +870,7 @@ async def create(
overwrite: bool = False,
path: PathLike | None = None,
chunk_store: StoreLike | None = None,
filters: Iterable[dict[str, JSON] | numcodecs.abc.Codec] | None = None,
filters: Iterable[dict[str, JSON] | Numcodec] | None = None,
cache_metadata: bool | None = None,
cache_attrs: bool | None = None,
read_only: bool | None = None,
4 changes: 2 additions & 2 deletions src/zarr/api/synchronous.py
@@ -14,12 +14,12 @@
if TYPE_CHECKING:
from collections.abc import Iterable

import numcodecs.abc
import numpy as np
import numpy.typing as npt

from zarr.abc.codec import Codec
from zarr.api.asynchronous import ArrayLike, PathLike
from zarr.codecs._v2 import Numcodec
from zarr.core.array import (
CompressorsLike,
FiltersLike,
@@ -609,7 +609,7 @@ def create(
overwrite: bool = False,
path: PathLike | None = None,
chunk_store: StoreLike | None = None,
filters: Iterable[dict[str, JSON] | numcodecs.abc.Codec] | None = None,
filters: Iterable[dict[str, JSON] | Numcodec] | None = None,
cache_metadata: bool | None = None,
cache_attrs: bool | None = None,
read_only: bool | None = None,
30 changes: 30 additions & 0 deletions src/zarr/codecs/_numcodecs.py
@@ -0,0 +1,30 @@
from zarr.abc.codec import CodecJSON_V2
from zarr.codecs._v2 import Numcodec


def get_numcodec(data: CodecJSON_V2[str]) -> Numcodec:
"""
Resolve a numcodec codec from the numcodecs registry.

This requires the Numcodecs package to be installed.

Parameters
----------
data : CodecJSON_V2
The JSON metadata for the codec.

Returns
-------
codec : Numcodec

Examples
--------

>>> codec = get_numcodec({'id': 'zlib', 'level': 1})
>>> codec
Zlib(level=1)
"""

from numcodecs.registry import get_codec

return get_codec(data) # type: ignore[no-any-return]
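``get_numcodec`` imports ``numcodecs.registry`` inside the function body, so importing the module that defines it does not itself require ``numcodecs``. A minimal sketch of the same deferred-import pattern is below. ``load_optional`` is a hypothetical helper, and wrapping the ``ImportError`` in a clearer message is an addition for illustration that the PR does not do.

```python
import importlib
from types import ModuleType


def load_optional(name: str) -> ModuleType:
    """Import ``name`` only when called, not when this module is imported."""
    try:
        return importlib.import_module(name)
    except ImportError as exc:
        # Assumption: a clearer message is wanted; the PR lets the error propagate as-is.
        raise ImportError(f"{name!r} is required for this feature but is not installed") from exc
```

Callers that never touch the optional feature never trigger the import, which is the property that makes a dependency optional in practice.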
77 changes: 64 additions & 13 deletions src/zarr/codecs/_v2.py
@@ -2,26 +2,77 @@

import asyncio
from dataclasses import dataclass
from typing import TYPE_CHECKING
from typing import TYPE_CHECKING, ClassVar, Self, TypeGuard

import numcodecs
import numpy as np
from numcodecs.compat import ensure_bytes, ensure_ndarray_like
from typing_extensions import Protocol

from zarr.abc.codec import ArrayBytesCodec
from zarr.abc.codec import ArrayBytesCodec, CodecJSON_V2
from zarr.registry import get_ndbuffer_class

if TYPE_CHECKING:
import numcodecs.abc

from zarr.core.array_spec import ArraySpec
from zarr.core.buffer import Buffer, NDBuffer


class Numcodec(Protocol):
"""
A protocol that models the ``numcodecs.abc.Codec`` interface.
"""

codec_id: ClassVar[str]

def encode(self, buf: Buffer | NDBuffer) -> Buffer | NDBuffer: ...

def decode(
self, buf: Buffer | NDBuffer, out: Buffer | NDBuffer | None = None
) -> Buffer | NDBuffer: ...

def get_config(self) -> CodecJSON_V2[str]: ...

@classmethod
def from_config(cls, config: CodecJSON_V2[str]) -> Self: ...


def _is_numcodec(obj: object) -> TypeGuard[Numcodec]:
"""
Check if the given object implements the Numcodec protocol.

The @runtime_checkable decorator does not allow issubclass checks for protocols with non-method
members (i.e., attributes), so we use this function to manually check for the presence of the
required attributes and methods on a given object.
"""
return _is_numcodec_cls(type(obj))


def _is_numcodec_cls(obj: object) -> TypeGuard[type[Numcodec]]:
"""
Check if the given object is a class that implements the Numcodec protocol.

The @runtime_checkable decorator does not allow issubclass checks for protocols with non-method
members (i.e., attributes), so we use this function to manually check for the presence of the
required attributes and methods on a given object.
"""
return (
isinstance(obj, type)
and hasattr(obj, "codec_id")
and isinstance(obj.codec_id, str)
and hasattr(obj, "encode")
and callable(obj.encode)
and hasattr(obj, "decode")
and callable(obj.decode)
and hasattr(obj, "get_config")
and callable(obj.get_config)
and hasattr(obj, "from_config")
and callable(obj.from_config)
)
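The docstrings above refer to a limitation of ``typing.runtime_checkable``: ``isinstance`` works for protocols with data members, but ``issubclass`` raises ``TypeError``. A small sketch that reproduces the behaviour is below. ``HasId`` and ``Toy`` are hypothetical names used only for this illustration.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class HasId(Protocol):
    """Hypothetical protocol with one data member and one method."""

    codec_id: str

    def encode(self, buf: bytes) -> bytes: ...


class Toy:
    codec_id = "toy"

    def encode(self, buf: bytes) -> bytes:
        return buf


# isinstance() works: it only checks that the named members are present.
instance_ok = isinstance(Toy(), HasId)

# issubclass() is rejected because HasId has a non-method member.
try:
    issubclass(Toy, HasId)
    subclass_error = None
except TypeError as exc:
    subclass_error = exc
```

This is why the diff checks classes with explicit ``hasattr`` and ``callable`` tests instead of ``issubclass``.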


@dataclass(frozen=True)
class V2Codec(ArrayBytesCodec):
filters: tuple[numcodecs.abc.Codec, ...] | None
compressor: numcodecs.abc.Codec | None
filters: tuple[Numcodec, ...] | None
compressor: Numcodec | None

is_fixed_size = False

@@ -33,9 +84,9 @@ async def _decode_single(
cdata = chunk_bytes.as_array_like()
# decompress
if self.compressor:
chunk = await asyncio.to_thread(self.compressor.decode, cdata)
chunk = await asyncio.to_thread(self.compressor.decode, cdata) # type: ignore[arg-type]
else:
chunk = cdata
chunk = cdata # type: ignore[assignment]

# apply filters
if self.filters:
@@ -56,7 +107,7 @@ async def _decode_single(
# is an object array. In this case, we need to convert the object
# array to the correct dtype.

chunk = np.array(chunk).astype(chunk_spec.dtype.to_native_dtype())
chunk = np.array(chunk).astype(chunk_spec.dtype.to_native_dtype()) # type: ignore[assignment]

elif chunk.dtype != object:
# If we end up here, someone must have hacked around with the filters.
@@ -85,17 +136,17 @@ async def _encode_single(
# apply filters
if self.filters:
for f in self.filters:
chunk = await asyncio.to_thread(f.encode, chunk)
chunk = await asyncio.to_thread(f.encode, chunk) # type: ignore[arg-type]

# check object encoding
if ensure_ndarray_like(chunk).dtype == object:
raise RuntimeError("cannot write object array without object codec")

# compress
if self.compressor:
cdata = await asyncio.to_thread(self.compressor.encode, chunk)
cdata = await asyncio.to_thread(self.compressor.encode, chunk) # type: ignore[arg-type]
else:
cdata = chunk
cdata = chunk # type: ignore[assignment]

cdata = ensure_bytes(cdata)
return chunk_spec.prototype.buffer.from_bytes(cdata)
16 changes: 8 additions & 8 deletions src/zarr/core/array.py
@@ -27,7 +27,7 @@
import zarr
from zarr.abc.codec import ArrayArrayCodec, ArrayBytesCodec, BytesBytesCodec, Codec
from zarr.abc.store import Store, set_or_delete
from zarr.codecs._v2 import V2Codec
from zarr.codecs._v2 import Numcodec, V2Codec
from zarr.codecs.bytes import BytesCodec
from zarr.codecs.vlen_utf8 import VLenBytesCodec, VLenUTF8Codec
from zarr.codecs.zstd import ZstdCodec
@@ -607,7 +607,7 @@ async def _create(
chunks: ShapeLike | None = None,
dimension_separator: Literal[".", "/"] | None = None,
order: MemoryOrder | None = None,
filters: Iterable[dict[str, JSON] | numcodecs.abc.Codec] | None = None,
filters: Iterable[dict[str, JSON] | Numcodec] | None = None,
compressor: CompressorLike = "auto",
# runtime
overwrite: bool = False,
@@ -818,7 +818,7 @@ def _create_metadata_v2(
order: MemoryOrder,
dimension_separator: Literal[".", "/"] | None = None,
fill_value: Any | None = DEFAULT_FILL_VALUE,
filters: Iterable[dict[str, JSON] | numcodecs.abc.Codec] | None = None,
filters: Iterable[dict[str, JSON] | Numcodec] | None = None,
compressor: CompressorLikev2 = None,
attributes: dict[str, JSON] | None = None,
) -> ArrayV2Metadata:
@@ -856,7 +856,7 @@ async def _create_v2(
config: ArrayConfig,
dimension_separator: Literal[".", "/"] | None = None,
fill_value: Any | None = DEFAULT_FILL_VALUE,
filters: Iterable[dict[str, JSON] | numcodecs.abc.Codec] | None = None,
filters: Iterable[dict[str, JSON] | Numcodec] | None = None,
compressor: CompressorLike = "auto",
attributes: dict[str, JSON] | None = None,
overwrite: bool = False,
@@ -3898,7 +3898,7 @@ def _build_parents(


FiltersLike: TypeAlias = (
Iterable[dict[str, JSON] | ArrayArrayCodec | numcodecs.abc.Codec]
Iterable[dict[str, JSON] | ArrayArrayCodec | Numcodec]
| ArrayArrayCodec
| Iterable[numcodecs.abc.Codec]
| numcodecs.abc.Codec
@@ -3911,10 +3911,10 @@ def _build_parents(
)

CompressorsLike: TypeAlias = (
Iterable[dict[str, JSON] | BytesBytesCodec | numcodecs.abc.Codec]
Iterable[dict[str, JSON] | BytesBytesCodec | Numcodec]
| dict[str, JSON]
| BytesBytesCodec
| numcodecs.abc.Codec
| Numcodec
| Literal["auto"]
| None
)
@@ -4944,7 +4944,7 @@ def _parse_deprecated_compressor(
# "no compression"
compressors = ()
else:
compressors = (compressor,)
compressors = (compressor,) # type: ignore[assignment]
elif zarr_format == 2 and compressor == compressors == "auto":
compressors = ({"id": "blosc"},)
return compressors
2 changes: 1 addition & 1 deletion tests/test_api.py
@@ -1282,7 +1282,7 @@ def test_gpu_basic(store: Store, zarr_format: ZarrFormat | None) -> None:
dtype=src.dtype,
overwrite=True,
zarr_format=zarr_format,
compressors=compressors,
compressors=compressors, # type: ignore[arg-type]
Contributor

There's a lot of type: ignore comments currently dotted throughout - what's the reason for that? Is it perhaps because numcodecs.abc.Codec doesn't currently conform to the new protocol?

Contributor Author

This particular example is a test. A few lines above, compressors is set to either None or the string "auto", and mypy models this as str | None, leading to this error:

tests/test_api.py:1285: error: Argument "compressors" to "create_array" has incompatible type "str | None"; expected "CompressorsLike" [arg-type]

I suspect the other type: ignore statements were added for different reasons.
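The widening described in this reply can be avoided with an explicit annotation instead of a type: ignore. The sketch below illustrates that under stated assumptions: ``AutoOrNone``, ``create`` and ``choose`` are hypothetical, and the real ``CompressorsLike`` alias is wider than the type used here.

```python
from typing import Literal, Optional

AutoOrNone = Optional[Literal["auto"]]  # hypothetical alias for this sketch


def create(compressors: AutoOrNone) -> str:
    """Hypothetical stand-in for an API that accepts only "auto" or None."""
    return "default codecs" if compressors == "auto" else "no compression"


def choose(flag: bool) -> AutoOrNone:
    # Left unannotated, `value = "auto" if flag else None` would be widened to
    # `str | None`, as the reply above describes, and would not match AutoOrNone.
    # The explicit annotation keeps the literal type.
    value: AutoOrNone = "auto" if flag else None
    return value
```

Whether annotating is preferable to an ignore comment in a test is a judgment call; the sketch only shows that the error comes from inference, not from the protocol itself.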

)
z[:10, :10] = src[:10, :10]

2 changes: 1 addition & 1 deletion tests/test_array.py
@@ -1684,7 +1684,7 @@ def test_roundtrip_numcodecs() -> None:
shape=(720, 1440),
chunks=(720, 1440),
dtype="float64",
compressors=compressors,
compressors=compressors, # type: ignore[arg-type]
filters=filters,
fill_value=-9.99,
dimension_names=["lat", "lon"],
24 changes: 24 additions & 0 deletions tests/test_codecs/test_numcodecs.py
@@ -0,0 +1,24 @@
from __future__ import annotations

from numcodecs import GZip

from zarr.codecs._numcodecs import get_numcodec
from zarr.codecs._v2 import _is_numcodec, _is_numcodec_cls


def test_get_numcodec() -> None:
assert get_numcodec({"id": "gzip", "level": 2}) == GZip(level=2) # type: ignore[typeddict-unknown-key]


def test_is_numcodec() -> None:
"""
Test the _is_numcodec function
"""
assert _is_numcodec(GZip())


def test_is_numcodec_cls() -> None:
"""
Test the _is_numcodec_cls function
"""
assert _is_numcodec_cls(GZip)
2 changes: 1 addition & 1 deletion tests/test_codecs/test_vlen.py
@@ -40,7 +40,7 @@ def test_vlen_string(
chunks=data.shape,
dtype=data.dtype,
fill_value="",
compressors=compressor,
compressors=compressor, # type: ignore[arg-type]
)
assert isinstance(a.metadata, ArrayV3Metadata) # needed for mypy
