Commit e8fd72c

committed
start design doc
1 parent 6a7857b commit e8fd72c

4 files changed

+181
-23
lines changed

docs/user-guide/data_types.rst

Lines changed: 156 additions & 0 deletions
@@ -0,0 +1,156 @@
1+
Data types
2+
==========
3+
4+
Zarr's data type model
5+
----------------------
6+
7+
Every Zarr array has a "data type", which defines the meaning and physical layout of the
8+
array's elements. Zarr is heavily influenced by `NumPy <https://numpy.org/doc/stable/>`_, and
9+
Zarr arrays can use many of the same data types as NumPy arrays::
10+
>>> import zarr
11+
>>> import numpy as np
12+
>>> z = zarr.create_array(store={}, shape=(10,), dtype=np.dtype('uint8'))
13+
>>> z
14+
<Array memory://126225407345920 shape=(10,) dtype=uint8>
15+
16+
But Zarr data types and Numpy data types differ in one key respect:
17+
Zarr arrays are designed to be persisted to storage and later read, possibly by Zarr implementations in different programming languages.
18+
So in addition to defining a memory layout for array elements, each Zarr data type defines a procedure for
19+
reading and writing that data type to Zarr array metadata, and also reading and writing **instances** of that data type to
20+
array metadata.
21+
22+
Data types in Zarr version 2
23+
-----------------------------
24+
25+
Version 2 of the Zarr format defined its data types relative to `Numpy's data types <https://numpy.org/doc/2.1/reference/arrays.dtypes.html#data-type-objects-dtype>`_, and added a few non-Numpy data types as well.
26+
Thus the JSON identifier for a Numpy-compatible data type is just the Numpy ``str`` attribute of that dtype:
27+
28+
>>> import zarr
29+
>>> import numpy as np
30+
>>> import json
31+
>>> np_dtype = np.dtype('int64')
32+
>>> store = {}
>>> z = zarr.create_array(store=store, shape=(1,), dtype=np_dtype, zarr_format=2)
33+
>>> dtype_meta = json.loads(store['.zarray'].to_bytes())["dtype"]
34+
>>> assert dtype_meta == np_dtype.str # True
35+
>>> dtype_meta
36+
'<i8'
37+
38+
.. note::
39+
The ``<`` character in the data type metadata encodes the `endianness <https://numpy.org/doc/2.2/reference/generated/numpy.dtype.byteorder.html>`_, or "byte order", of the data type. Following Numpy's example,
40+
Zarr version 2 data types associate each data type with an endianness where applicable. Zarr version 3 data types do not store endianness information.
41+
42+
In addition to defining a representation of the data type itself (which in the example above was just a simple string ``"<i8"``), Zarr also
43+
defines a metadata representation of scalars associated with that data type. Integers are stored as ``JSON`` numbers,
44+
as are floats, with the caveat that ``NaN``, positive infinity, and negative infinity are stored as special strings.
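
The float caveat above can be sketched in plain Python. This is a minimal illustration, not Zarr-Python's implementation: the helper names ``float_to_json`` and ``float_from_json`` are hypothetical, and the strings ``"NaN"``, ``"Infinity"`` and ``"-Infinity"`` are assumed here to be the special strings the specification refers to.

```python
from __future__ import annotations

import json
import math

# Assumed spellings of the special strings for non-finite floats.
_SPECIAL_FLOATS = {"NaN": math.nan, "Infinity": math.inf, "-Infinity": -math.inf}


def float_to_json(value: float) -> float | str:
    """Encode a float for JSON metadata, using strings for non-finite values."""
    if math.isnan(value):
        return "NaN"
    if math.isinf(value):
        return "Infinity" if value > 0 else "-Infinity"
    return value


def float_from_json(data: float | int | str) -> float:
    """Decode a value produced by float_to_json back into a Python float."""
    if isinstance(data, str):
        return _SPECIAL_FLOATS[data]
    return float(data)


encoded = [float_to_json(v) for v in (1.5, math.nan, math.inf, -math.inf)]
print(json.dumps(encoded))  # [1.5, "NaN", "Infinity", "-Infinity"]
```

Encoding non-finite values as strings keeps the metadata document valid JSON, since JSON itself has no literal for them.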
45+
46+
Data types in Zarr version 3
47+
----------------------------
48+
49+
* No endianness
50+
* Data type can be encoded as a string or a ``JSON`` object with the structure ``{"name": <string identifier>, "configuration": {...}}``
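
Because a V3 data type may appear either as a bare string or as a ``{"name": ..., "configuration": ...}`` object, a reader has to normalize both forms. The sketch below shows one way to do that; ``parse_v3_data_type`` and the ``example_dtype`` name are hypothetical and are not part of Zarr-Python.

```python
from __future__ import annotations

from typing import Any


def parse_v3_data_type(data: str | dict[str, Any]) -> tuple[str, dict[str, Any]]:
    """Normalize a V3 data type declaration to a (name, configuration) pair."""
    if isinstance(data, str):
        # The short form carries no configuration.
        return data, {}
    if isinstance(data, dict) and isinstance(data.get("name"), str):
        return data["name"], dict(data.get("configuration", {}))
    raise ValueError(f"Invalid data type metadata: {data!r}")


print(parse_v3_data_type("int16"))
# ('int16', {})
print(parse_v3_data_type({"name": "example_dtype", "configuration": {"length": 4}}))
# ('example_dtype', {'length': 4})
```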
51+
52+
Data types in Zarr-Python
53+
-------------------------
54+
55+
Zarr-Python supports two different Zarr formats, and those two formats specify data types in rather different ways:
56+
data types in Zarr version 2 are encoded as Numpy-compatible strings, while data types in Zarr version 3 are encoded as either strings or ``JSON`` objects,
57+
and the Zarr V3 data types don't have any associated endianness information, unlike Zarr V2 data types.
58+
59+
On top of that, we want Zarr-Python to support data types beyond what's available in Numpy. So it's crucial that we have a
60+
model of array data types that can adapt to the differences between Zarr V2 and V3 and doesn't over-fit to Numpy.
61+
62+
Here are the operations we need to perform on data types in Zarr-Python:
63+
64+
* Round-trip native data types to fields in array metadata documents.
65+
For example, the Numpy data type ``np.dtype('>i2')`` should be saved as ``{..., "dtype" : ">i2"}`` in Zarr V2 metadata.
66+
67+
In Zarr V3 metadata, the same Numpy data type would be saved as ``{..., "data_type": "int16", "codecs": [..., {"name": "bytes", "configuration": {"endian": "big"}}, ...]}``
68+
69+
* Define a default fill value. This is not mandated by the Zarr specifications, but it's convenient for users
70+
to have a useful default. For numeric types like integers and floats the default can be statically set to 0, but for
71+
parametric data types like fixed-length strings the default can only be generated after the data type has been parametrized at runtime.
72+
73+
* Round-trip scalars to the ``fill_value`` field in Zarr V2 and V3 array metadata documents. The Zarr V2 and V3 specifications
74+
define how scalars of each data type should be stored as JSON in array metadata documents, and in principle each data type
75+
can define this encoding separately.
76+
77+
* Do all of the above for *user-defined data types*. Zarr-Python should support data types added as extensions, so we cannot
78+
hard-code the list of data types. We need to ensure that users can easily (or easily enough) define a Python object
79+
that models their custom data type and register this object with Zarr-Python, so that the above operations all succeed for their
80+
custom data type.
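
The first operation in the list above, round-tripping ``>i2`` between its V2 and V3 metadata forms, can be sketched without Numpy. The lookup tables and the functions ``v2_to_v3`` and ``v3_to_v2`` are hypothetical and cover only a handful of integer types; they illustrate how byte order moves out of the data type name in V3, and are not how Zarr-Python implements it.

```python
from __future__ import annotations

# Hypothetical lookup tables covering only a few integer types.
V3_NAME_BY_V2_CODE = {"i1": "int8", "i2": "int16", "i4": "int32", "i8": "int64", "u1": "uint8"}
V2_CODE_BY_V3_NAME = {name: code for code, name in V3_NAME_BY_V2_CODE.items()}

# "|" marks types for which byte order is not applicable.
ENDIAN_BY_PREFIX = {"<": "little", ">": "big", "|": None}
PREFIX_BY_ENDIAN = {endian: prefix for prefix, endian in ENDIAN_BY_PREFIX.items()}


def v2_to_v3(v2_dtype: str) -> tuple[str, str | None]:
    """Split a V2 dtype string into a V3 name and a byte order for the bytes codec."""
    prefix, code = v2_dtype[0], v2_dtype[1:]
    return V3_NAME_BY_V2_CODE[code], ENDIAN_BY_PREFIX[prefix]


def v3_to_v2(name: str, endian: str | None) -> str:
    """Rebuild the V2 dtype string from a V3 name and a byte order."""
    return PREFIX_BY_ENDIAN[endian] + V2_CODE_BY_V3_NAME[name]


name, endian = v2_to_v3(">i2")
print(name, endian)             # int16 big
print(v3_to_v2(name, endian))   # >i2
```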
81+
82+
To achieve these goals, Zarr-Python uses a class called :class:`zarr.core.dtype.DTypeWrapper` to wrap native data types. Each data type
83+
supported by Zarr-Python is modeled by a subclass of ``DTypeWrapper``, which has the following structure:
84+
85+
(attribute) ``dtype_cls``
86+
^^^^^^^^^^^^^^^^^^^^^^^^^
87+
The ``dtype_cls`` attribute is a **class variable** that is bound to a class that can produce
88+
an instance of a native data type. For example, on the ``DTypeWrapper`` used to model the boolean
89+
data type, the ``dtype_cls`` attribute is bound to the numpy bool data type class: ``np.dtypes.BoolDType``.
90+
This attribute is used when we need to create an instance of the native data type, for example when
91+
defining a Numpy array that will contain Zarr data.
92+
93+
It might seem odd that ``DTypeWrapper.dtype_cls`` binds to a *class* that produces a native data type instead of an instance of that native data type --
94+
why not have a ``DTypeWrapper.dtype`` attribute that binds to ``np.dtypes.BoolDType()``? ``DTypeWrapper``
95+
doesn't wrap a concrete data type instance because data type instances may have endianness information, but Zarr V3
96+
data types do not. To model Zarr V3 data types, we need endianness to be an **instance variable** which is
97+
defined when creating an instance of the ``DTypeWrapper``. Subclasses of ``DTypeWrapper`` that model data types with
98+
byte order semantics thus have ``endianness`` as an instance variable, and this value can be set when creating an instance of the wrapper.
99+
100+
101+
(attribute) ``_zarr_v3_name``
102+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
103+
The ``_zarr_v3_name`` attribute encodes the canonical name for a data type for Zarr V3. For many data types these names
104+
are defined in the `Zarr V3 specification <https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#data-types>`_. For nearly all of the
105+
data types defined in Zarr V3, this name can be used to uniquely specify a data type. The one exception is the ``r*`` data type,
106+
which is parametrized by a number of bits, and so may take the form ``r8``, ``r16``, etc.
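
Since the ``r*`` type has no single fixed name, matching it requires a pattern rather than a string comparison. The sketch below uses a hypothetical helper, ``parse_raw_bits_name``, and assumes that the bit count must be a positive multiple of 8; check that constraint against the specification before relying on it.

```python
from __future__ import annotations

import re

# Matches names such as "r8" or "r16" and captures the bit count.
_RAW_BITS_PATTERN = re.compile(r"r(\d+)")


def parse_raw_bits_name(name: str) -> int | None:
    """Return the bit count encoded in an r* name, or None if the name does not match."""
    match = _RAW_BITS_PATTERN.fullmatch(name)
    if match is None:
        return None
    bits = int(match.group(1))
    # Assumption: only positive multiples of 8 are valid bit counts.
    if bits == 0 or bits % 8 != 0:
        return None
    return bits


print(parse_raw_bits_name("r16"))    # 16
print(parse_raw_bits_name("int16"))  # None
```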
107+
108+
(class method) ``from_dtype(cls, dtype) -> Self``
109+
^^^^^^^^^
110+
This method defines a procedure for safely converting a native dtype instance into an instance of ``DTypeWrapper``. It should perform
111+
validation of its input to ensure that the native dtype is an instance of the ``dtype_cls`` class attribute, for example. For some
112+
data types, additional checks are needed -- in Numpy, "structured" data types and "void" data types use the same class, with different properties.
113+
A ``DTypeWrapper`` that wraps Numpy structured data types must do additional checks to ensure that the input ``dtype`` is actually a structured data type.
114+
If input validation succeeds, this method will call ``_from_dtype_unsafe``.
115+
116+
(class method) ``_from_dtype_unsafe(cls, dtype) -> Self``
117+
^^^^^^^^^^
118+
This method defines the procedure for converting a native data type instance, like ``np.dtype('uint8')``,
119+
into a wrapper class instance. The ``unsafe`` suffix on the method name denotes that this method should not
120+
perform any input validation. Input validation should be done by the routine that calls this method.
121+
122+
For many data types, creating the wrapper class takes no arguments and so this method can just return ``cls()``.
123+
But for data types with runtime attributes like endianness or length (for fixed-size strings), ``_from_dtype_unsafe``
124+
ensures that those attributes of ``dtype`` are mapped onto the correct parameters in the ``DTypeWrapper`` class constructor.
125+
126+
(method) ``to_dtype(self) -> dtype``
127+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
128+
This method produces a native data type consistent with the properties of the ``DTypeWrapper``. Together
129+
with ``from_dtype``, this method allows round-trip conversion of a native data type into a wrapper class and then out again.
130+
131+
That is, for some ``DTypeWrapper`` class ``FooWrapper`` that wraps a native data type called ``foo``, ``FooWrapper.from_dtype(instance_of_foo).to_dtype() == instance_of_foo`` should be true.
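
The round-trip property can be illustrated with a self-contained sketch. ``NativeInt16`` is a hypothetical stand-in for a native dtype such as ``np.dtype('>i2')``, and ``Int16Wrapper`` only mimics the shape described in this document (a ``dtype_cls`` class variable, an ``endianness`` instance variable, validation in ``from_dtype``); it is not the real ``zarr.core.dtype`` code, and it raises ``TypeError`` where the library uses its own error type.

```python
from dataclasses import dataclass
from typing import ClassVar


@dataclass(frozen=True)
class NativeInt16:
    """Hypothetical stand-in for a native dtype that carries a byte order."""

    byteorder: str = "<"


@dataclass(frozen=True)
class Int16Wrapper:
    """Illustrative wrapper with class-level and instance-level state."""

    # Class variables: shared by every instance of the wrapper.
    dtype_cls: ClassVar[type] = NativeInt16
    _zarr_v3_name: ClassVar[str] = "int16"
    # Instance variable: set when the wrapper is created.
    endianness: str = "little"

    @classmethod
    def from_dtype(cls, dtype: object) -> "Int16Wrapper":
        # Validate first, then delegate to the unchecked conversion.
        if type(dtype) is not cls.dtype_cls:
            raise TypeError(f"Invalid dtype: {dtype!r}. Expected an instance of {cls.dtype_cls}.")
        return cls._from_dtype_unsafe(dtype)

    @classmethod
    def _from_dtype_unsafe(cls, dtype: NativeInt16) -> "Int16Wrapper":
        # No validation here: map runtime attributes onto constructor parameters.
        return cls(endianness="big" if dtype.byteorder == ">" else "little")

    def to_dtype(self) -> NativeInt16:
        return self.dtype_cls(byteorder=">" if self.endianness == "big" else "<")


native = NativeInt16(byteorder=">")
print(Int16Wrapper.from_dtype(native).to_dtype() == native)  # True
```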
132+
133+
(method) ``to_dict(self) -> dict``
134+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
135+
This method generates a JSON-serializable representation of the wrapped data type which can be stored in
136+
Zarr metadata.
137+
138+
(method) ``cast_value(self, value: object) -> scalar``
139+
^^^^^
140+
Cast a Python object to an instance of the wrapped data type. This is used for generating the default
141+
value associated with this data type.
142+
143+
144+
(method) ``default_value(self) -> scalar``
145+
^^^^
146+
Return the default value for the wrapped data type. Zarr-Python uses this method to generate a default fill value
147+
for an array when a user has not requested one.
148+
149+
Why is this a method and not a static attribute? Although some data types
150+
can have a static default value, parametrized data types like fixed-length strings or structured data types cannot. For these data types,
151+
a default value must be calculated based on the attributes of the wrapped data type.
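
The difference between a static and a computed default can be sketched as follows. Both classes are hypothetical, and the zero-filled bytes default is an assumption chosen only to show a value that depends on a runtime parameter; the defaults Zarr-Python actually uses may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Float64Wrapper:
    """Hypothetical wrapper whose default never changes."""

    def default_value(self) -> float:
        return 0.0


@dataclass(frozen=True)
class FixedLengthBytesWrapper:
    """Hypothetical wrapper parametrized by a length chosen at runtime."""

    length: int

    def default_value(self) -> bytes:
        # The default can only be built once the length is known.
        return b"\x00" * self.length


print(Float64Wrapper().default_value())                  # 0.0
print(FixedLengthBytesWrapper(length=4).default_value())  # b'\x00\x00\x00\x00'
```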
152+
153+
(method) ``
154+
155+
156+

docs/user-guide/index.rst

Lines changed: 1 addition & 0 deletions
@@ -8,6 +8,7 @@ User guide
 
    installation
    arrays
+   data_types
    groups
    attributes
    storage

src/zarr/core/dtype/_numpy.py

Lines changed: 3 additions & 3 deletions
@@ -569,7 +569,7 @@ def check_dtype(cls: type[Self], dtype: TDType) -> TypeGuard[np.dtypes.VoidDType
         return super().check_dtype(dtype) and dtype.fields is None
 
     @classmethod
-    def check_json(cls, data: dict[str, JSON]) -> TypeGuard[dict[str, JSON]]:
+    def check_dict(cls, data: dict[str, JSON]) -> TypeGuard[dict[str, JSON]]:
         # Overriding the base class implementation because the r* dtype
         # does not have a name that will can appear in array metadata
         # Instead, array metadata will contain names like "r8", "r16", etc
@@ -787,7 +787,7 @@ def to_dict(self) -> dict[str, JSON]:
         return base_dict
 
     @classmethod
-    def check_json(cls, data: JSON) -> bool:
+    def check_dict(cls, data: JSON) -> bool:
         return (
             isinstance(data, dict)
             and "name" in data
@@ -797,7 +797,7 @@ def check_json(cls, data: JSON) -> bool:
 
     @classmethod
     def from_dict(cls, data: dict[str, JSON]) -> Self:
-        if cls.check_json(data):
+        if cls.check_dict(data):
             from zarr.core.dtype import get_data_type_from_dict
 
             fields = tuple(

src/zarr/core/dtype/wrapper.py

Lines changed: 21 additions & 20 deletions
@@ -39,24 +39,6 @@ class DTypeWrapper(Generic[TDType, TScalar], ABC, Metadata):
     dtype_cls: ClassVar[type[TDType]]  # type: ignore[misc]
     _zarr_v3_name: ClassVar[str]
 
-    @classmethod
-    @abstractmethod
-    def _from_dtype_unsafe(cls: type[Self], dtype: TDType) -> Self:
-        """
-        Wrap a native dtype without checking.
-
-        Parameters
-        ----------
-        dtype : TDType
-            The native dtype to wrap.
-
-        Returns
-        -------
-        Self
-            The wrapped dtype.
-        """
-        raise NotImplementedError
-
     @classmethod
     def from_dtype(cls: type[Self], dtype: TDType) -> Self:
         """
@@ -83,6 +65,25 @@ def from_dtype(cls: type[Self], dtype: TDType) -> Self:
             f"Invalid dtype: {dtype}. Expected an instance of {cls.dtype_cls}."
         )
 
+
+    @classmethod
+    @abstractmethod
+    def _from_dtype_unsafe(cls: type[Self], dtype: TDType) -> Self:
+        """
+        Wrap a native dtype without checking.
+
+        Parameters
+        ----------
+        dtype : TDType
+            The native dtype to wrap.
+
+        Returns
+        -------
+        Self
+            The wrapped dtype.
+        """
+        raise NotImplementedError
+
     @abstractmethod
     def to_dtype(self: Self) -> TDType:
         """
@@ -158,7 +159,7 @@ def check_dtype(cls: type[Self], dtype: TDType) -> TypeGuard[TDType]:
         return type(dtype) is cls.dtype_cls
 
     @classmethod
-    def check_json(cls: type[Self], data: dict[str, JSON]) -> TypeGuard[dict[str, JSON]]:
+    def check_dict(cls: type[Self], data: dict[str, JSON]) -> TypeGuard[dict[str, JSON]]:
         """
         Check that a JSON representation of a data type matches the dtype_cls class attribute. Used
         as a type guard. This base implementation checks that the input is a dictionary,
@@ -192,7 +193,7 @@ def from_dict(cls: type[Self], data: dict[str, JSON]) -> Self:
         Self
             The wrapped data type.
         """
-        if cls.check_json(data):
+        if cls.check_dict(data):
             return cls._from_json_unsafe(data)
         raise DataTypeValidationError(f"Invalid JSON representation of data type {cls}.")
 