Merged
Changes from 6 commits
Commits
53 commits
c1d1550
add int2 example, and expand dtype docs
d-v-b Jun 19, 2025
6e4a938
specify zarr with a direct local file reference for the dtype example
d-v-b Jun 19, 2025
8d18eed
add comment on pep-723 metadata
d-v-b Jun 19, 2025
bfb2088
ignore future warning in docs
d-v-b Jun 19, 2025
60a1e30
Merge branch 'main' into docs/dtype-docs
d-v-b Jun 19, 2025
5b2a601
Merge branch 'main' into docs/dtype-docs
d-v-b Jun 20, 2025
893540f
re-export vlen-bytes
d-v-b Jun 22, 2025
15ebfa6
make examples stand-alone and testable via script dependency modifica…
d-v-b Jun 22, 2025
383acfc
docstrings
d-v-b Jun 23, 2025
46e80ec
oMerge branch 'docs/dtype-docs' of github.com:d-v-b/zarr-python into …
d-v-b Jun 23, 2025
2e96eca
changelog
d-v-b Jun 23, 2025
b9a510a
docstring style
d-v-b Jun 24, 2025
e34d18e
add docstrings and polish interfaces
d-v-b Jun 24, 2025
c4031dc
fixup
d-v-b Jun 24, 2025
ae268b9
prose
d-v-b Jun 24, 2025
eec3ec3
gamble on a new pytest version fixing windows CI failure
d-v-b Jun 24, 2025
f942508
gamble on a new pytest version fixing windows CI failure
d-v-b Jun 24, 2025
9d3dc48
revert change to pytest dep
d-v-b Jun 24, 2025
532ae1e
skip example tests on windows
d-v-b Jun 24, 2025
620749b
unexclude api from exclude_patterns
d-v-b Jun 24, 2025
45aab29
harmonize docstrings
d-v-b Jun 24, 2025
bf15d71
numpy -> np
d-v-b Jun 24, 2025
27cccdd
restructure list of dtypes
d-v-b Jun 24, 2025
a68751a
code block
d-v-b Jun 24, 2025
f3c44db
prose
d-v-b Jun 24, 2025
3045e9a
revert ectopic change
d-v-b Jun 24, 2025
84b572e
remove trailing underscore from np.void
d-v-b Jun 24, 2025
f35d4c1
remove methods section, correct attributes
d-v-b Jun 24, 2025
669afd3
resolve docs build error my re-ordering plugins. great stuff, sphinx
d-v-b Jun 24, 2025
9af24ed
numpy -> np
d-v-b Jun 24, 2025
3453af2
fix doctests
d-v-b Jun 24, 2025
efb767f
add pytest to docs env, because this resolves a warning about a missi…
d-v-b Jun 24, 2025
6e6d337
put return types in double backticks
d-v-b Jun 24, 2025
cee23aa
escape piped return types
d-v-b Jun 24, 2025
3172095
fix internal link
d-v-b Jun 24, 2025
f6d67b3
Merge branch 'main' into docs/dtype-docs
d-v-b Jun 24, 2025
0c20603
Merge branch 'main' into docs/dtype-docs
d-v-b Jun 24, 2025
432d975
Merge branch 'main' of github.com:zarr-developers/zarr-python into do…
d-v-b Jun 25, 2025
b4a05ba
Merge branch 'docs/dtype-docs' of github.com:d-v-b/zarr-python into d…
d-v-b Jun 25, 2025
bd7e9fc
Merge branch 'main' into docs/dtype-docs
d-v-b Jun 25, 2025
a785e35
Update examples/custom_dtype.py
d-v-b Jun 27, 2025
bec9512
Merge branch 'main' of github.com:zarr-developers/zarr-python into do…
d-v-b Jul 1, 2025
d408b9d
make datatype configuration typeddict readonly
d-v-b Jul 1, 2025
b4a114d
document namedconfig
d-v-b Jul 1, 2025
0639696
document and export typeddicts, move dtype docs to an advanced section
d-v-b Jul 2, 2025
f608c12
remove added features from list of missing features
d-v-b Jul 2, 2025
80bd097
fix accidental copy + paste breakage
d-v-b Jul 2, 2025
c44f1a2
use anonymous rst links
d-v-b Jul 2, 2025
2540eab
Merge branch 'main' into docs/dtype-docs
d-v-b Jul 3, 2025
368145f
normalize typerror when check_scalar fails, and add tests for it
d-v-b Jul 3, 2025
87c71fa
prose
d-v-b Jul 3, 2025
3cfaa0d
Merge branch 'main' of github.com:zarr-developers/zarr-python into do…
d-v-b Jul 3, 2025
48000fc
improve coverage the hard way and the easy way
d-v-b Jul 3, 2025
97 changes: 97 additions & 0 deletions docs/user-guide/data_types.rst
@@ -170,3 +170,100 @@ Deserialize a scalar value from JSON:

    >>> scalar_value = int8.from_json_scalar(42, zarr_format=3)
    >>> assert scalar_value == np.int8(42)

Adding new data types
~~~~~~~~~~~~~~~~~~~~~

Each Zarr data type is a separate Python class that inherits from
`ZDType <../api/zarr/dtype/index.html#zarr.dtype.ZDType>`_. You can define a custom data type by
writing your own subclass of `ZDType <../api/zarr/dtype/index.html#zarr.dtype.ZDType>`_ and adding
your data type to the data type registry. A complete example of this process is included below.

The source code for this example can be found in the ``examples/custom_dtype.py`` file in the Zarr
Python project directory.

.. literalinclude:: ../../examples/custom_dtype.py
    :language: python


Data type resolution
~~~~~~~~~~~~~~~~~~~~

Although Zarr Python uses a different data type model from NumPy, you can still define a Zarr array
with a NumPy data type object:

.. code-block:: python

    >>> from zarr import create_array
    >>> import numpy as np
    >>> a = create_array({}, shape=(10,), dtype=np.dtype('int'))
    >>> a
    <Array memory:... shape=(10,) dtype=int64>

Or a string representation of a NumPy data type:

.. code-block:: python

    >>> a = create_array({}, shape=(10,), dtype='<i8')
    >>> a
    <Array memory:... shape=(10,) dtype=int64>

The ``Array`` object presents itself like a NumPy array, including exposing a NumPy
data type as its ``dtype`` attribute:

.. code-block:: python

    >>> type(a.dtype)
    <class 'numpy.dtypes.Int64DType'>

But if we inspect the metadata for the array, we can see the Zarr data type object:

.. code-block:: python

    >>> type(a.metadata.data_type)
    <class 'zarr.core.dtype.npy.int.Int64'>

This example illustrates a general problem Zarr Python has to solve -- how can we allow users to
specify a data type as a string, or a NumPy ``dtype`` object, and produce the right Zarr data type
from that input? We call this process "data type resolution". Zarr Python also performs data type
resolution when reading stored arrays, although in this case the input is a ``JSON`` value instead
of a NumPy data type.
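
As a rough illustration of resolution from JSON, the sketch below checks a parsed JSON value against
the two forms that ``to_json`` returns in ``examples/custom_dtype.py``: the bare string ``"int2"``
for Zarr v3, and a dict with ``name`` and ``object_codec_id`` keys for Zarr v2. The function
``matches_int2`` is a hypothetical helper written for this illustration; it is not part of the Zarr
Python API.

```python
import json


def matches_int2(data: object, zarr_format: int) -> bool:
    """Return True if ``data`` is the JSON form of int2 for the given Zarr format."""
    if zarr_format == 3:
        # v3 form used in the example: the bare string "int2"
        return data == "int2"
    if zarr_format == 2:
        # v2 form used in the example: a dict with a name and an object codec id of None
        return (
            isinstance(data, dict)
            and data.get("name") == "int2"
            and "object_codec_id" in data
            and data["object_codec_id"] is None
        )
    raise ValueError(f"zarr_format must be 2 or 3, got {zarr_format}")


v3_doc = json.loads('"int2"')
v2_doc = json.loads('{"name": "int2", "object_codec_id": null}')

print(matches_int2(v3_doc, 3))
print(matches_int2(v2_doc, 2))
print(matches_int2(v3_doc, 2))
```

The same input can be valid for one Zarr format and invalid for the other, which is why the check
takes the format as an explicit argument.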

For simple data types like ``int`` the solution could be extremely simple: just
maintain a lookup table that relates a NumPy data type to the Zarr data type equivalent. But not all
data types are so simple. Consider this case:

.. code-block:: python

    >>> from zarr import create_array
    >>> import warnings
    >>> import numpy as np
    >>> warnings.simplefilter("ignore", category=FutureWarning)
    >>> a = create_array({}, shape=(10,), dtype=[('a', 'f8'), ('b', 'i8')])
    >>> a.dtype # this is the NumPy data type
    dtype([('a', '<f8'), ('b', '<i8')])
    >>> a.metadata.data_type # this is the Zarr data type
    Structured(fields=(('a', Float64(endianness='little')), ('b', Int64(endianness='little'))))

In this example, we created a
`NumPy structured data type <https://numpy.org/doc/stable/user/basics.rec.html#structured-datatypes>`_.
This data type is a container that can contain any NumPy data type, which makes it recursive. It is
not possible to make a lookup table that relates all NumPy structured data types to their Zarr
equivalents, as there is a nearly unbounded number of different structured data types. So instead of
a static lookup table, Zarr Python relies on a dynamic approach to data type resolution.
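
To illustrate why a recursive container type rules out a static table, here is a minimal sketch in
plain Python. The ``Scalar`` and ``Struct`` classes, the ``SIMPLE`` table, and the ``resolve``
function are hypothetical names used only for illustration; they are not part of Zarr Python or
NumPy, and the sketch assumes a data type is described either by a string or by a list of
``(name, description)`` pairs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scalar:
    name: str


@dataclass(frozen=True)
class Struct:
    fields: tuple


# A static lookup table is enough for the finite set of simple types ...
SIMPLE = {"f8": Scalar("float64"), "i8": Scalar("int64")}


def resolve(spec):
    """Resolve a spec that is either a type string or a list of (name, spec) pairs."""
    if isinstance(spec, str):
        return SIMPLE[spec]
    # ... but a structured spec has to be resolved field by field, recursing into
    # nested structures, because the set of possible structures is not finite.
    return Struct(tuple((name, resolve(sub)) for name, sub in spec))


flat = resolve([("a", "f8"), ("b", "i8")])
nested = resolve([("point", [("x", "f8"), ("y", "f8")]), ("id", "i8")])
print(flat)
print(nested)
```

Only the leaves of the structure are looked up in the table; everything above them is built
dynamically from the input.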

Zarr Python defines a collection of Zarr data types. This collection, called a "data type registry",
is essentially a dict where the keys are strings (a canonical name for each data type), and the values are
the data type classes themselves. Dynamic data type resolution entails iterating over these data
type classes, invoking a special class constructor defined on each one, and returning a concrete
data type instance if and only if exactly one of those constructor invocations was successful.

In plain language, we take some user input (such as a NumPy data type), offer it to all the known
data type classes, and return an instance of the one data type class that could accept that user input.

We want to avoid a situation where the same NumPy data type matches multiple Zarr data types; that is,
a NumPy data type should uniquely specify a single Zarr data type. But because data type resolution is
dynamic, this uniqueness cannot be guaranteed in advance. Instead, we attempt data type
resolution against every data type class, and if a NumPy data type matches multiple
Zarr data types, we treat this as an error and raise an exception.
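
The resolution procedure described above can be sketched with a toy registry. Everything in this
sketch is hypothetical and simplified: the classes, the ``from_native`` constructor, the
``ValidationError`` exception, and the ``resolve`` function are illustrative stand-ins, not the
actual Zarr Python implementation, and native data types are represented as plain strings.

```python
class ValidationError(ValueError):
    """Raised by a data type class that cannot represent the given input."""


class Int64:
    accepted = ("int64", "<i8")

    @classmethod
    def from_native(cls, native):
        if native in cls.accepted:
            return cls()
        raise ValidationError(f"{cls.__name__} cannot represent {native!r}")


class Float64(Int64):
    accepted = ("float64", "<f8")


class AlsoInt64(Int64):
    """A deliberately conflicting class that accepts the same input as Int64."""


def resolve(native, registry):
    """Offer the input to every registered class; succeed only on exactly one match."""
    matches = []
    for dtype_cls in registry.values():
        try:
            matches.append(dtype_cls.from_native(native))
        except ValidationError:
            continue  # this class does not accept the input; try the next one
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise ValueError(f"No data type in the registry matches {native!r}")
    names = [type(m).__name__ for m in matches]
    raise ValueError(f"{native!r} is ambiguous; it matches {names}")


registry = {"int64": Int64, "float64": Float64}
print(type(resolve("<i8", registry)).__name__)

# Registering a second class that accepts the same input makes resolution fail.
registry["also_int64"] = AlsoInt64
try:
    resolve("<i8", registry)
except ValueError as err:
    print(err)
```

Checking every class, instead of returning the first match, is what lets the ambiguity surface as an
error rather than depend on registration order.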

170 changes: 170 additions & 0 deletions examples/custom_dtype.py
@@ -0,0 +1,170 @@
# /// script
# requires-python = ">=3.11"
# dependencies = [
# "zarr @ {root}",
# "ml_dtypes==0.5.1",
# "pytest==8.4.1"
# ]
# ///
#
# Note: the zarr version must be changed in order to run this outside of the
# zarr source tree. For example, to make this script truly stand-alone, specify the zarr
# dependency as just "zarr"

"""
Demonstrate how to extend Zarr Python by defining a new data type
"""

import json
import sys
from pathlib import Path
from typing import ClassVar, Literal, Self, TypeGuard

import ml_dtypes # necessary to add extra dtypes to NumPy
import numpy as np
import pytest

import zarr
from zarr.core.common import JSON, ZarrFormat
from zarr.core.dtype import ZDType, data_type_registry
from zarr.core.dtype.common import (
    DataTypeValidationError,
    DTypeConfig_V2,
    DTypeJSON,
    check_dtype_spec_v2,
)

int2_dtype_cls = type(np.dtype("int2"))
int2_scalar_cls = ml_dtypes.int2


class Int2(ZDType[int2_dtype_cls, int2_scalar_cls]):
    """
    This class provides a Zarr compatibility layer around the int2 data type and the int2
Contributor
Is there a nice link explaining the difference between these? I think I've inferred it but would be nice to make it explicit.

Contributor Author
no I don't actually think there is a nice link that explains the data type / scalar type difference. The numpy docs should explain this, but they don't. I can add something to our docs.

Contributor Author
I put something about this in the data type guide

    scalar type.
    """

    # This field is used as the key for the data type in the internal data type registry, and also
    # as the identifier for the data type when serializing the data type to disk for zarr v3
    _zarr_v3_name: ClassVar[Literal["int2"]] = "int2"
    # this field will be used internally
    _zarr_v2_name: ClassVar[Literal["int2"]] = "int2"

    # we bind a class variable to the native data type class so we can create instances of it
    dtype_cls = int2_dtype_cls

    @classmethod
    def from_native_dtype(cls, dtype: np.dtype) -> Self:
        """Create an instance of this ZDType from a native dtype."""
        if cls._check_native_dtype(dtype):
            return cls()
        raise DataTypeValidationError(
            f"Invalid data type: {dtype}. Expected an instance of {cls.dtype_cls}"
        )

    def to_native_dtype(self: Self) -> int2_dtype_cls:
        """Create an int2 dtype instance from this ZDType"""
        return self.dtype_cls()

    @classmethod
    def _check_json_v2(cls, data: DTypeJSON) -> TypeGuard[DTypeConfig_V2[Literal["int2"], None]]:
        """Type check for Zarr v2-flavored JSON"""
        return (
            check_dtype_spec_v2(data) and data["name"] == "int2" and data["object_codec_id"] is None
        )

    @classmethod
    def _check_json_v3(cls, data: DTypeJSON) -> TypeGuard[Literal["int2"]]:
        """Type check for Zarr v3-flavored JSON"""
        return data == cls._zarr_v3_name

    @classmethod
    def _from_json_v2(cls, data: DTypeJSON) -> Self:
        """
        Create an instance of this ZDType from zarr v2-flavored JSON.
        """
        if cls._check_json_v2(data):
            return cls()
        msg = f"Invalid JSON representation of {cls.__name__}. Got {data!r}, expected the string {cls._zarr_v2_name!r}"
        raise DataTypeValidationError(msg)

    @classmethod
    def _from_json_v3(cls: type[Self], data: DTypeJSON) -> Self:
        """
        Create an instance of this ZDType from zarr v3-flavored JSON.
        """
        if cls._check_json_v3(data):
            return cls()
        msg = f"Invalid JSON representation of {cls.__name__}. Got {data!r}, expected the string {cls._zarr_v3_name!r}"
        raise DataTypeValidationError(msg)

    def to_json(
        self, zarr_format: ZarrFormat
    ) -> DTypeConfig_V2[Literal["int2"], None] | Literal["int2"]:
        """Serialize this ZDType to v2- or v3-flavored JSON"""
        if zarr_format == 2:
            return {"name": "int2", "object_codec_id": None}
        elif zarr_format == 3:
            return self._zarr_v3_name
        raise ValueError(f"zarr_format must be 2 or 3, got {zarr_format}")  # pragma: no cover

    def _check_scalar(self, data: object) -> TypeGuard[int]:
        """Check if a python object is a valid scalar"""
        return isinstance(data, (int, int2_scalar_cls))

    def cast_scalar(self, data: object) -> ml_dtypes.int2:
        """
        Attempt to cast a python object to an int2. Raises a TypeError if the type check fails.
        """
        if self._check_scalar(data):
            return ml_dtypes.int2(data)
        msg = f"Cannot convert object with type {type(data)} to a 2-bit integer."
        raise TypeError(msg)

    def default_scalar(self) -> ml_dtypes.int2:
        """Get the default scalar value"""
        return ml_dtypes.int2(0)

    def to_json_scalar(self, data: object, *, zarr_format: ZarrFormat) -> int:
        """
        Convert a python object to a JSON representation of an int2 scalar. The value is
        returned as a plain python int so that it is compatible with JSON.
        """
        return int(data)
Contributor
Can this be more specific to the example? e.g. explain something to the effect of "needs to be int to be compatible with json." and mention int2 somewhere.

Contributor Author
I added something to this effect


    def from_json_scalar(self, data: JSON, *, zarr_format: ZarrFormat) -> ml_dtypes.int2:
        """
        Read a JSON-serializable value as a scalar. The base definition of this method
        requires that it take a zarr_format parameter, because some data types serialize scalars
        differently in zarr v2 and v3
        """
        if self._check_scalar(data):
            return ml_dtypes.int2(data)
        raise TypeError(f"Invalid type: {data}. Expected an int.")


# after defining dtype class, it must be registered with the data type registry so zarr can use it
data_type_registry.register(Int2._zarr_v3_name, Int2)


# this parametrized function will create arrays in zarr v2 and v3 using our new data type
@pytest.mark.parametrize("zarr_format", [2, 3])
def test_custom_dtype(tmp_path: Path, zarr_format: Literal[2, 3]) -> None:
    # create array and write values
    z_w = zarr.create_array(
        store=tmp_path, shape=(4,), dtype="int2", zarr_format=zarr_format, compressors=None
    )
    z_w[:] = [-1, -2, 0, 1]

    # open the array
    z_r = zarr.open_array(tmp_path, mode="r")

    print(z_r.info_complete())

    # look at the array metadata
    if zarr_format == 2:
        meta_file = tmp_path / ".zarray"
    else:
        meta_file = tmp_path / "zarr.json"
    print(json.dumps(json.loads(meta_file.read_text()), indent=2))


if __name__ == "__main__":
    sys.exit(pytest.main(["-s", __file__, f"-c {__file__}"]))