Commit b22f324

more design doc
1 parent e8fd72c commit b22f324

1 file changed

+45
-45
lines changed

docs/user-guide/data_types.rst

Lines changed: 45 additions & 45 deletions
@@ -4,19 +4,19 @@ Data types
44
Zarr's data type model
55
----------------------
66

7-
Every Zarr array has a "data type", which defines the meaning and physical layout of the
7+
Every Zarr array has a "data type", which defines the meaning and physical layout of the
88
array's elements. Zarr is heavily influenced by `NumPy <https://numpy.org/doc/stable/>`_, and
99
Zarr arrays can use many of the same data types as numpy arrays::
1010
>>> import zarr
1111
>>> import numpy as np
1212
>>> z = zarr.create_array(store={}, shape=(10,), dtype=np.dtype('uint8'))
1313
>>> z
14-
<Array memory://126225407345920 shape=(10,) dtype=uint8>
14+
<Array memory://126225407345920 shape=(10,) dtype=uint8>
1515

16-
But Zarr data types and Numpy data types are also very different in one key respect:
17-
Zarr arrays are designed to be persisted to storage and later read, possibly by Zarr implementations in different programming languages.
18-
So in addition to defining a memory layout for array elements, each Zarr data type defines a procedure for
19-
reading and writing that data type to Zarr array metadata, and also reading and writing **instances** of that data type to
16+
But Zarr data types and Numpy data types are also very different in one key respect:
17+
Zarr arrays are designed to be persisted to storage and later read, possibly by Zarr implementations in different programming languages.
18+
So in addition to defining a memory layout for array elements, each Zarr data type defines a procedure for
19+
reading and writing that data type to Zarr array metadata, and also reading and writing **instances** of that data type to
2020
array metadata.
2121

2222
Data types in Zarr version 2
@@ -35,11 +35,11 @@ Thus the JSON identifer for a Numpy-compatible data type is just the Numpy ``str
3535
>>> dtype_meta
3636
<i8
3737

38-
.. note::
38+
.. note::
3939
The ``<`` character in the data type metadata encodes the `endianness <https://numpy.org/doc/2.2/reference/generated/numpy.dtype.byteorder.html>`_, or "byte order", of the data type. Following Numpy's example,
4040
Zarr version 2 data types associate each data type with an endianness where applicable. Zarr version 3 data types do not store endianness information.
4141

42-
In addition to defining a representation of the data type itself (which in the example above was just a simple string ``"<i8"``, Zarr also
42+
In addition to defining a representation of the data type itself (which in the example above was just a simple string ``"<i8"``), Zarr also
4343
defines a metadata representation of scalars associated with that data type. Integers are stored as ``JSON`` numbers,
4444
as are floats, with the caveat that ``NaN``, positive infinity, and negative infinity are stored as special strings.
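
The scalar encoding described above can be sketched in a few lines. This is a minimal illustration, not Zarr-Python's implementation: the helper names ``float_to_json`` and ``float_from_json`` are hypothetical, and the special strings ``"NaN"``, ``"Infinity"`` and ``"-Infinity"`` are assumed to be the ones the Zarr specifications use for float fill values.

```python
import json
import math


def float_to_json(value: float):
    """Encode a float so it can be stored in JSON array metadata.

    Hypothetical helper: finite floats stay JSON numbers, while NaN and
    the two infinities become special strings.
    """
    if math.isnan(value):
        return "NaN"
    if math.isinf(value):
        return "Infinity" if value > 0 else "-Infinity"
    return value


def float_from_json(data) -> float:
    """Decode a JSON number or special string back to a Python float."""
    special = {"NaN": math.nan, "Infinity": math.inf, "-Infinity": -math.inf}
    if isinstance(data, str):
        return special[data]
    return float(data)


# A fill value of negative infinity is serialized as a string, not a number.
encoded = json.dumps({"fill_value": float_to_json(float("-inf"))})
```

Decoding reverses the mapping, so ``float_from_json(json.loads(encoded)["fill_value"])`` gives back negative infinity.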
4545

@@ -52,105 +52,105 @@ Data types in Zarr version 3
5252
Data types in Zarr-Python
5353
-------------------------
5454

55-
Zarr-Python supports two different Zarr formats, and those two formats specify data types in rather different ways:
56-
data types in Zarr version 2 are encoded as Numpy-compatible strings, while data types in Zarr version 3 are encoded as either strings or ``JSON`` objects,
57-
and the Zarr V3 data types don't have any associated endianness information, unlike Zarr V2 data types.
55+
Zarr-Python supports two different Zarr formats, and those two formats specify data types in rather different ways:
56+
data types in Zarr version 2 are encoded as Numpy-compatible strings, while data types in Zarr version 3 are encoded as either strings or ``JSON`` objects,
57+
and the Zarr V3 data types don't have any associated endianness information, unlike Zarr V2 data types.
5858

59-
If that wasn't enough, we want Zarr-Python to support data types beyond what's available in Numpy. So it's crucial that we have a
59+
If that wasn't enough, we want Zarr-Python to support data types beyond what's available in Numpy. So it's crucial that we have a
6060
model of array data types that can adapt to the differences between Zarr V2 and V3 and doesn't over-fit to Numpy.
6161

6262
Here are the operations we need to perform on data types in Zarr-Python:
6363

6464
* Round-trip native data types to fields in array metadata documents.
65-
For example, the Numpy data type ``np.dtype('>i2')`` should be saved as ``{..., "dtype" : ">i2"}`` in Zarr V2 metadata.
66-
65+
For example, the Numpy data type ``np.dtype('>i2')`` should be saved as ``{..., "dtype" : ">i2"}`` in Zarr V2 metadata.
66+
6767
In Zarr V3 metadata, the same Numpy data type would be saved as ``{..., "data_type": "int16", "codecs": [..., {"name": "bytes", "configuration": {"endian": "big"}}, ...]}``
6868

69-
* Define a default fill value. This is not mandated by the Zarr specifications, but it's convenient for users
70-
to have a useful default. For numeric types like integers and floats the default can be statically set to 0, but for
69+
* Define a default fill value. This is not mandated by the Zarr specifications, but it's convenient for users
70+
to have a useful default. For numeric types like integers and floats the default can be statically set to 0, but for
7171
parametric data types like fixed-length strings the default can only be generated after the data type has been parametrized at runtime.
7272

7373
* Round-trip scalars to the ``fill_value`` field in Zarr V2 and V3 array metadata documents. The Zarr V2 and V3 specifications
7474
define how scalars of each data type should be stored as JSON in array metadata documents, and in principle each data type
7575
can define this encoding separately.
7676

77-
* Do all of the above for *user-defined data types*. Zarr-Python should support data types added as extensions,so we cannot
78-
hard-code the list of data types. We need to ensure that users can easily (or easily enough) define a python object
79-
that models their custom data type and register this object with Zarr-Python, so that the above operations all succeed for their
77+
* Do all of the above for *user-defined data types*. Zarr-Python should support data types added as extensions, so we cannot
78+
hard-code the list of data types. We need to ensure that users can easily (or easily enough) define a Python object
79+
that models their custom data type and register this object with Zarr-Python, so that the above operations all succeed for their
8080
custom data type.
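
The first operation in the list above (saving the same native data type to V2 and V3 metadata) can be sketched as follows. This is only an illustration under stated assumptions: the functions ``v2_metadata`` and ``v3_metadata`` and the two lookup tables are hypothetical, and they handle only multi-byte integer and float type strings such as ``">i2"`` or ``"<f8"``.

```python
# Hypothetical lookup tables; they cover only a few fixed-size numeric types.
_BYTE_ORDER = {"<": "little", ">": "big"}
_KIND_PREFIX = {"i": "int", "u": "uint", "f": "float"}


def v2_metadata(dtype_str: str) -> dict:
    """In Zarr V2 the Numpy-style string is stored directly, byte order included."""
    return {"dtype": dtype_str}


def v3_metadata(dtype_str: str) -> dict:
    """In Zarr V3 the name carries no byte order; it moves to the bytes codec."""
    order, kind, size = dtype_str[0], dtype_str[1], int(dtype_str[2:])
    name = f"{_KIND_PREFIX[kind]}{size * 8}"  # e.g. "i" and 2 bytes -> "int16"
    bytes_codec = {"name": "bytes", "configuration": {"endian": _BYTE_ORDER[order]}}
    return {"data_type": name, "codecs": [bytes_codec]}


v2_fragment = v2_metadata(">i2")
v3_fragment = v3_metadata(">i2")
```

Here ``v2_fragment`` keeps the string ``">i2"`` as is, while ``v3_fragment`` splits the same information into the name ``"int16"`` and a big-endian ``bytes`` codec configuration.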
8181

82-
To achieve these goals, Zarr Python uses a class called :class:`zarr.core.dtype.DTypeWrapper` to wrap native data types. Each data type
83-
supported by Zarr Python is modeled by a subclass of `DTypeWrapper`, which has the following structure:
82+
To achieve these goals, Zarr-Python uses a class called :class:`zarr.core.dtype.DTypeWrapper` to wrap native data types. Each data type
83+
supported by Zarr-Python is modeled by a subclass of ``DTypeWrapper``, which has the following structure:
8484

8585
(attribute) ``dtype_cls``
8686
^^^^^^^^^^^^^^^^^^^^^^^^^
8787
The ``dtype_cls`` attribute is a **class variable** that is bound to a class that can produce
88-
an instance of a native data type. For example, on the ``DTypeWrapper`` used to model the boolean
89-
data type, the ``dtype_cls`` attribute is bound to the numpy bool data type class: ``np.dtypes.BoolDType``.
90-
This attribute is used when we need to create an instance of the native data type, for example when
91-
defining a Numpy array that will contain Zarr data.
88+
an instance of a native data type. For example, on the ``DTypeWrapper`` used to model the boolean
89+
data type, the ``dtype_cls`` attribute is bound to the numpy bool data type class: ``np.dtypes.BoolDType``.
90+
This attribute is used when we need to create an instance of the native data type, for example when
91+
defining a Numpy array that will contain Zarr data.
9292

93-
It might seem odd that ``DTypeWrapper.dtype_cls`` binds to a *class* that produces a native data type instead of an instance of that native data type --
93+
It might seem odd that ``DTypeWrapper.dtype_cls`` binds to a *class* that produces a native data type instead of an instance of that native data type --
9494
why not have a ``DTypeWrapper.dtype`` attribute that binds to ``np.dtypes.BoolDType()``? The reason why ``DTypeWrapper``
95-
doesn't wrap a concrete data type instance is because data type instances may have endianness information, but Zarr V3
96-
data types do not. To model Zarr V3 data types, we need endianness to be an **instance variable** which is
97-
defined when creating an instance of the ```DTypeWrapper``. Subclasses of ``DTypeWrapper`` that model data types with
95+
doesn't wrap a concrete data type instance is that data type instances may have endianness information, but Zarr V3
96+
data types do not. To model Zarr V3 data types, we need endianness to be an **instance variable** that is
97+
defined when creating an instance of the ``DTypeWrapper``. Subclasses of ``DTypeWrapper`` that model data types with
9898
byte order semantics thus have ``endianness`` as an instance variable, and this value can be set when creating an instance of the wrapper.
9999

100100

101101
(attribute) ``_zarr_v3_name``
102102
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
103-
The ``_zarr_v3_name`` attribute encodes the canonical name for a data type for Zarr V3. For many data types these names
103+
The ``_zarr_v3_name`` attribute encodes the canonical name for a data type for Zarr V3. For many data types these names
104104
are defined in the `Zarr V3 specification <https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#data-types>`_. For nearly all of the
105105
data types defined in Zarr V3, this name can be used to uniquely specify a data type. The one exception is the ``r*`` data type,
106-
which is parametrized by a number of bits, and so may take the form ``r8``, ``r16``, ... etc.
106+
which is parametrized by a number of bits, and so may take the form ``r8``, ``r16``, etc.
107107

108108
(class method) ``from_dtype(cls, dtype) -> Self``
109109
^^^^^^^^^
110110
This method defines a procedure for safely converting a native dtype instance into an instance of ``DTypeWrapper``. It should perform
111-
validation of its input to ensure that the native dtype is an instance of the ``dtype_cls`` class attribute, for example. For some
112-
data types, additional checks are needed -- in Numpy "structured" data types and "void" data types use the same class, with different properties.
111+
validation of its input to ensure that the native dtype is an instance of the ``dtype_cls`` class attribute, for example. For some
112+
data types, additional checks are needed -- in Numpy, "structured" data types and "void" data types use the same class, with different properties.
113113
A ``DTypeWrapper`` that wraps Numpy structured data types must do additional checks to ensure that the input ``dtype`` is actually a structured data type.
114-
If input validation succeeds, this method will call ``_from_dtype_unsafe``.
114+
If input validation succeeds, this method will call ``_from_dtype_unsafe``.
115115

116116
(class method) ``_from_dtype_unsafe(cls, dtype) -> Self``
117117
^^^^^^^^^^
118118
This method defines the procedure for converting a native data type instance, like ``np.dtype('uint8')``,
119-
into a wrapper class instance. The ``unsafe`` prefix on the method name denotes that this method should not
120-
perform any input validation. Input validation should be done by the routine that calls this method.
119+
into a wrapper class instance. The ``unsafe`` prefix on the method name denotes that this method should not
120+
perform any input validation. Input validation should be done by the routine that calls this method.
121121

122122
For many data types, creating the wrapper class takes no arguments and so this method can just return ``cls()``.
123-
But for data types with runtime attributes like endianness or length (for fixed-size strings), this ``_from_dtype_unsafe``
123+
But for data types with runtime attributes like endianness or length (for fixed-size strings), this ``_from_dtype_unsafe``
124124
ensures that those attributes of ``dtype`` are mapped onto the correct parameters in the ``DTypeWrapper`` class constructor.
125125

126126
(method) ``to_dtype(self) -> dtype``
127127
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
128-
This method produces a native data type consistent with the properties of the ``DTypeWrapper``. Together
128+
This method produces a native data type consistent with the properties of the ``DTypeWrapper``. Together
129129
with ``from_dtype``, this method allows round-trip conversion of a native data type into a wrapper class and then out again.
130130

131131
That is, for some ``DTypeWrapper`` class ``FooWrapper`` that wraps a native data type called ``foo``, ``FooWrapper.from_dtype(instance_of_foo).to_dtype() == instance_of_foo`` should be true.
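
The round-trip property above, together with the split between ``from_dtype`` and ``_from_dtype_unsafe``, can be sketched as follows. This is a dependency-free illustration and not the actual ``DTypeWrapper`` code: ``NativeInt16`` is a hypothetical stand-in for a native data type (in practice this would be something like a Numpy dtype class), and ``Int16Wrapper`` is a hypothetical wrapper written only for this example.

```python
from dataclasses import dataclass
from typing import ClassVar


@dataclass(frozen=True)
class NativeInt16:
    """Hypothetical stand-in for a native data type with a byte order."""

    byteorder: str = "<"  # "<" is little-endian, ">" is big-endian


@dataclass(frozen=True)
class Int16Wrapper:
    """Hypothetical wrapper following the structure described in this section."""

    # Class variables: shared by every instance of the wrapper.
    dtype_cls: ClassVar[type] = NativeInt16
    _zarr_v3_name: ClassVar[str] = "int16"

    # Instance variable: set when the wrapper is created.
    endianness: str = "little"

    @classmethod
    def from_dtype(cls, dtype: object) -> "Int16Wrapper":
        # Validate the input first, then delegate to the unsafe constructor.
        if not isinstance(dtype, cls.dtype_cls):
            raise TypeError(f"expected {cls.dtype_cls.__name__}, got {type(dtype).__name__}")
        return cls._from_dtype_unsafe(dtype)

    @classmethod
    def _from_dtype_unsafe(cls, dtype: NativeInt16) -> "Int16Wrapper":
        # No validation here: map the native attribute onto the constructor.
        return cls(endianness="big" if dtype.byteorder == ">" else "little")

    def to_dtype(self) -> NativeInt16:
        # Build a native data type consistent with this wrapper's properties.
        return self.dtype_cls(byteorder=">" if self.endianness == "big" else "<")


native = NativeInt16(byteorder=">")
round_tripped = Int16Wrapper.from_dtype(native).to_dtype()
```

With these definitions ``round_tripped == native`` holds, and passing anything that is not a ``NativeInt16`` to ``from_dtype`` raises ``TypeError`` before ``_from_dtype_unsafe`` is reached.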
132132

133-
(method) ``to_dict(self) -> dict``
133+
(method) ``to_dict(self) -> dict``
134134
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
135-
This method generates a JSON-serialiazable representation of the wrapped data type which can be stored in
135+
This method generates a JSON-serializable representation of the wrapped data type which can be stored in
136136
Zarr metadata.
137137

138138
(method) ``cast_value(self, value: object) -> scalar``
139139
^^^^^
140-
Cast a python object to an instance of the wrapped data type. This is used for generating the default
140+
Cast a Python object to an instance of the wrapped data type. This is used for generating the default
141141
value associated with this data type.
142142

143143

144144
(method) ``default_value(self) -> scalar``
145145
^^^^
146-
Return the default value for the wrapped data type. Zarr-Python uses this method to generate a default fill value
147-
for an array when a user has not requested one.
146+
Return the default value for the wrapped data type. Zarr-Python uses this method to generate a default fill value
147+
for an array when a user has not requested one.
148148

149-
Why is this a method and not a static attribute? Although some data types
149+
Why is this a method and not a static attribute? Although some data types
150150
can have a static default value, parametrized data types like fixed-length strings or structured data types cannot. For these data types,
151151
a default value must be calculated based on the attributes of the wrapped data type.
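
The difference between a static default and a computed one can be sketched like this. Both classes are hypothetical and exist only for this example, and the choice of zero bytes as the default for a fixed-length type is an assumption, not something this document specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Int32Sketch:
    """Hypothetical wrapper for a type with no runtime parameters."""

    def default_value(self) -> int:
        # The default never varies, so it could just as well be a constant.
        return 0


@dataclass(frozen=True)
class FixedLengthBytesSketch:
    """Hypothetical wrapper for a type parametrized by a length."""

    length: int

    def default_value(self) -> bytes:
        # The default depends on the length chosen at runtime
        # (zero-filled bytes are an assumption made for this sketch).
        return b"\x00" * self.length


short_default = FixedLengthBytesSketch(length=2).default_value()
long_default = FixedLengthBytesSketch(length=4).default_value()
```

Because ``short_default`` and ``long_default`` differ in size, no single class-level constant could serve every instance, which is why the default is exposed as a method.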
152152

153-
(method) ``
153+
(method) ``check_dtype(cls, dtype)``
154154

155155

156156
