|
| 1 | +Data types |
| 2 | +========== |
| 3 | + |
| 4 | +Zarr's data type model |
| 5 | +---------------------- |
| 6 | + |
| 7 | +Every Zarr array has a "data type", which defines the meaning and physical layout of the array's elements.
| 8 | +Zarr is heavily influenced by `NumPy <https://numpy.org/doc/stable/>`_, and Zarr arrays can use many of the same data types as NumPy arrays::
| 9 | +
| 10 | + >>> import zarr |
| 11 | + >>> import numpy as np |
| 12 | + >>> z = zarr.create_array(store={}, shape=(10,), dtype=np.dtype('uint8'))
| 13 | + >>> z |
| 14 | + <Array memory://126225407345920 shape=(10,) dtype=uint8> |
| 15 | + |
| 16 | +But Zarr data types and NumPy data types differ in one key respect:
| 17 | +Zarr arrays are designed to be persisted to storage and later read, possibly by Zarr implementations in other programming languages.
| 18 | +So in addition to defining a memory layout for array elements, each Zarr data type defines a procedure for
| 19 | +reading and writing that data type to Zarr array metadata, and for reading and writing **instances** (scalars) of that data type to
| 20 | +array metadata.
| 21 | + |
| 22 | +Data types in Zarr version 2 |
| 23 | +----------------------------- |
| 24 | + |
| 25 | +Version 2 of the Zarr format defined its data types relative to `NumPy's data types <https://numpy.org/doc/2.1/reference/arrays.dtypes.html#data-type-objects-dtype>`_, and added a few non-NumPy data types as well.
| 26 | +Thus the JSON identifier for a NumPy-compatible data type is just the NumPy ``str`` attribute of that dtype:
| 27 | + |
| 28 | + >>> import json
| 29 | + >>> import numpy as np
| 30 | + >>> import zarr
| 31 | + >>> store = {}
| 32 | + >>> np_dtype = np.dtype('int64')
| 33 | + >>> z = zarr.create_array(store=store, shape=(1,), dtype=np_dtype, zarr_format=2)
| 34 | + >>> dtype_meta = json.loads(store['.zarray'].to_bytes())["dtype"]
| 35 | + >>> dtype_meta, dtype_meta == np_dtype.str
| 36 | + ('<i8', True)
| 37 | + |
| 38 | +.. note:: |
| 39 | + The ``<`` character in the data type metadata encodes the `endianness <https://numpy.org/doc/2.2/reference/generated/numpy.dtype.byteorder.html>`_, or "byte order", of the data type. Following NumPy's example,
| 40 | + Zarr version 2 data types associate each data type with an endianness where applicable. Zarr version 3 data types do not store endianness information.
| 41 | + |
| 42 | +In addition to defining a representation of the data type itself (which in the example above was just the simple string ``"<i8"``), Zarr also
| 43 | +defines a metadata representation of scalars associated with that data type. Integers are stored as ``JSON`` numbers,
| 44 | +as are floats, with the caveat that ``NaN``, positive infinity, and negative infinity are stored as special strings.
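
A minimal sketch of the scalar encoding described above, using only the standard library. The helper names ``encode_float`` and ``decode_float`` are hypothetical and not part of Zarr-Python; the string spellings ``"NaN"``, ``"Infinity"``, and ``"-Infinity"`` are the ones used by the Zarr specifications.

```python
from __future__ import annotations

import json
import math


def encode_float(value: float) -> float | str:
    """Encode a float for JSON metadata; non-finite values become strings."""
    if math.isnan(value):
        return "NaN"
    if math.isinf(value):
        return "Infinity" if value > 0 else "-Infinity"
    return value


def decode_float(data: float | int | str) -> float:
    """Invert ``encode_float``."""
    special = {"NaN": math.nan, "Infinity": math.inf, "-Infinity": -math.inf}
    if isinstance(data, str):
        return special[data]
    return float(data)


# Finite floats stay JSON numbers; the three special values become strings.
encoded = [encode_float(v) for v in (1.5, math.nan, math.inf, -math.inf)]
as_json = json.dumps(encoded)
```

Encoding the special values as strings keeps the metadata document valid JSON, since JSON itself has no literal for ``NaN`` or infinity.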
| 45 | + |
| 46 | +Data types in Zarr version 3 |
| 47 | +---------------------------- |
| 48 | + |
| 49 | +* Data types carry no endianness information. Byte order is instead configured on the ``bytes`` codec.
| 50 | +* A data type is encoded either as a string or as a ``JSON`` object with the structure ``{"name": <string identifier>, "configuration": {...}}``.
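
The two encodings can be normalized into a single shape before further processing. The sketch below is one way to do that with the standard library; ``parse_v3_data_type`` is a hypothetical helper, not a Zarr-Python function, and ``"example_dtype"`` is a made-up name used only for illustration.

```python
from __future__ import annotations

from typing import Any


def parse_v3_data_type(data: str | dict[str, Any]) -> tuple[str, dict[str, Any]]:
    """Return ``(name, configuration)`` for either V3 data type encoding."""
    if isinstance(data, str):
        # Plain string form: the name alone identifies the data type.
        return data, {}
    if isinstance(data, dict) and isinstance(data.get("name"), str):
        # Object form: a name plus an optional configuration mapping.
        config = data.get("configuration", {})
        if not isinstance(config, dict):
            raise TypeError("'configuration' must be a JSON object")
        return data["name"], dict(config)
    raise TypeError(f"Cannot interpret {data!r} as a data type")


simple = parse_v3_data_type("int16")
parametrized = parse_v3_data_type(
    {"name": "example_dtype", "configuration": {"length": 4}}
)
```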
| 51 | + |
| 52 | +Data types in Zarr-Python |
| 53 | +------------------------- |
| 54 | + |
| 55 | +Zarr-Python supports two different Zarr formats, and those two formats specify data types in rather different ways: |
| 56 | +data types in Zarr version 2 are encoded as Numpy-compatible strings, while data types in Zarr version 3 are encoded as either strings or ``JSON`` objects, |
| 57 | +and the Zarr V3 data types don't have any associated endianness information, unlike Zarr V2 data types. |
| 58 | + |
| 59 | +On top of that, we want Zarr-Python to support data types beyond what's available in NumPy. So it's crucial that we have a
| 60 | +model of array data types that can adapt to the differences between Zarr V2 and V3 and doesn't over-fit to NumPy.
| 61 | + |
| 62 | +Here are the operations we need to perform on data types in Zarr-Python: |
| 63 | + |
| 64 | +* Round-trip native data types to fields in array metadata documents. |
| 65 | + For example, the NumPy data type ``np.dtype('>i2')`` should be saved as ``{..., "dtype" : ">i2"}`` in Zarr V2 metadata.
| 66 | +
| 67 | + In Zarr V3 metadata, the same NumPy data type would be saved as ``{..., "data_type": "int16", "codecs": [..., {"name": "bytes", "configuration": {"endian": "big"}}, ...]}``.
| 68 | + |
| 69 | +* Define a default fill value. This is not mandated by the Zarr specifications, but it's convenient for users |
| 70 | + to have a useful default. For numeric types like integers and floats the default can be statically set to 0, but for |
| 71 | + parametric data types like fixed-length strings the default can only be generated after the data type has been parametrized at runtime. |
| 72 | + |
| 73 | +* Round-trip scalars to the ``fill_value`` field in Zarr V2 and V3 array metadata documents. The Zarr V2 and V3 specifications |
| 74 | + define how scalars of each data type should be stored as JSON in array metadata documents, and in principle each data type |
| 75 | + can define this encoding separately. |
| 76 | + |
| 77 | +* Do all of the above for *user-defined data types*. Zarr-Python should support data types added as extensions, so we cannot
| 78 | + hard-code the list of data types. We need to ensure that users can easily (or easily enough) define a Python object
| 79 | + that models their custom data type and register this object with Zarr-Python, so that the above operations all succeed for their |
| 80 | + custom data type. |
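
The first requirement in the list above, round-tripping ``np.dtype('>i2')`` through both metadata formats, can be sketched as follows. This assumes NumPy is installed; the helper names are hypothetical, the fragments contain only the fields discussed here, and the V3 name is taken from ``dtype.name``, which coincides with the V3 identifier for simple numeric types such as ``int16``.

```python
import numpy as np

# Map the first character of ``dtype.str`` to the ``bytes`` codec setting.
ENDIAN_NAMES = {"<": "little", ">": "big"}


def to_v2_fragment(dtype: np.dtype) -> dict:
    """V2 stores the NumPy ``str`` attribute, byte order included."""
    return {"dtype": dtype.str}


def to_v3_fragment(dtype: np.dtype) -> dict:
    """V3 stores a name; byte order moves to the ``bytes`` codec."""
    bytes_codec: dict = {"name": "bytes"}
    endian = ENDIAN_NAMES.get(dtype.str[0])  # "|" means byte order is not applicable
    if endian is not None:
        bytes_codec["configuration"] = {"endian": endian}
    return {"data_type": dtype.name, "codecs": [bytes_codec]}


def from_v3_fragment(fragment: dict) -> np.dtype:
    """Rebuild the native dtype from the name and the codec configuration."""
    dtype = np.dtype(fragment["data_type"])
    endian = fragment["codecs"][0].get("configuration", {}).get("endian")
    if endian is None:
        return dtype
    return dtype.newbyteorder("<" if endian == "little" else ">")


big = np.dtype(">i2")
v2 = to_v2_fragment(big)
v3 = to_v3_fragment(big)
```

In V2 the data type string alone is enough to rebuild the native dtype, while in V3 the data type name and the codec configuration must be read together.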
| 81 | + |
| 82 | +To achieve these goals, Zarr-Python uses a class called :class:`zarr.core.dtype.DTypeWrapper` to wrap native data types. Each data type
| 83 | +supported by Zarr-Python is modeled by a subclass of ``DTypeWrapper``, which has the following structure:
| 84 | + |
| 85 | +(attribute) ``dtype_cls`` |
| 86 | +^^^^^^^^^^^^^^^^^^^^^^^^^
| 87 | +The ``dtype_cls`` attribute is a **class variable** that is bound to a class that can produce |
| 88 | +an instance of a native data type. For example, on the ``DTypeWrapper`` used to model the boolean |
| 89 | +data type, the ``dtype_cls`` attribute is bound to the NumPy bool data type class: ``np.dtypes.BoolDType``.
| 90 | +This attribute is used when we need to create an instance of the native data type, for example when
| 91 | +defining a NumPy array that will contain Zarr data.
| 92 | + |
| 93 | +It might seem odd that ``DTypeWrapper.dtype_cls`` binds to a *class* that produces a native data type instead of an instance of that native data type -- |
| 94 | +why not have a ``DTypeWrapper.dtype`` attribute that binds to ``np.dtypes.BoolDType()``? The reason ``DTypeWrapper``
| 95 | +doesn't wrap a concrete data type instance is that data type instances may have endianness information, but Zarr V3
| 96 | +data types do not. To model Zarr V3 data types, we need endianness to be an **instance variable** which is
| 97 | +defined when creating an instance of the ``DTypeWrapper``. Subclasses of ``DTypeWrapper`` that model data types with
| 98 | +byte order semantics thus have ``endianness`` as an instance variable, and this value can be set when creating an instance of the wrapper. |
| 99 | + |
| 100 | + |
| 101 | +(attribute) ``_zarr_v3_name`` |
| 102 | +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| 103 | +The ``_zarr_v3_name`` attribute encodes the canonical name for a data type for Zarr V3. For many data types these names |
| 104 | +are defined in the `Zarr V3 specification <https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#data-types>`_. For nearly all of the
| 105 | +data types defined in Zarr V3, this name can be used to uniquely specify a data type. The one exception is the ``r*`` data type, |
| 106 | +which is parametrized by a number of bits, and so may take the form ``r8``, ``r16``, ... etc. |
| 107 | + |
| 108 | +(class method) ``from_dtype(cls, dtype) -> Self`` |
| 109 | +^^^^^^^^^ |
| 110 | +This method defines a procedure for safely converting a native dtype instance into an instance of ``DTypeWrapper``. It should perform |
| 111 | +validation of its input to ensure that the native dtype is an instance of the ``dtype_cls`` class attribute, for example. For some |
| 112 | +data types, additional checks are needed -- in NumPy, "structured" data types and "void" data types use the same class, with different properties.
| 113 | +A ``DTypeWrapper`` that wraps NumPy structured data types must do additional checks to ensure that the input ``dtype`` is actually a structured data type.
| 114 | +If input validation succeeds, this method will call ``_from_dtype_unsafe``. |
| 115 | + |
| 116 | +(class method) ``_from_dtype_unsafe(cls, dtype) -> Self`` |
| 117 | +^^^^^^^^^^ |
| 118 | +This method defines the procedure for converting a native data type instance, like ``np.dtype('uint8')``, |
| 119 | +into a wrapper class instance. The ``_unsafe`` suffix on the method name denotes that this method should not
| 120 | +perform any input validation. Input validation should be done by the routine that calls this method. |
| 121 | + |
| 122 | +For many data types, creating the wrapper class takes no arguments and so this method can just return ``cls()``. |
| 123 | +But for data types with runtime attributes like endianness or length (for fixed-size strings), this ``_from_dtype_unsafe`` |
| 124 | +ensures that those attributes of ``dtype`` are mapped on to the correct parameters in the ``DTypeWrapper`` class constructor. |
| 125 | + |
| 126 | +(method) ``to_dtype(self) -> dtype`` |
| 127 | +^^^^^^^ |
| 128 | +This method produces a native data type consistent with the properties of the ``DTypeWrapper``. Together |
| 129 | +with ``from_dtype``, this method allows round-trip conversion of a native data type into a wrapper class and then out again.
| 130 | + |
| 131 | +That is, for some ``DTypeWrapper`` class ``FooWrapper`` that wraps a native data type called ``foo``, ``FooWrapper.from_dtype(instance_of_foo).to_dtype() == instance_of_foo`` should be true. |
| 132 | + |
| 133 | +(method) ``to_dict(self) -> dict`` |
| 134 | +^^^^^ |
| 135 | +This method generates a JSON-serializable representation of the wrapped data type which can be stored in
| 136 | +Zarr metadata. |
| 137 | + |
| 138 | +(method) ``cast_value(self, value: object) -> scalar`` |
| 139 | +^^^^^ |
| 140 | +Cast a python object to an instance of the wrapped data type. This is used for generating the default |
| 141 | +value associated with this data type. |
| 142 | + |
| 143 | + |
| 144 | +(method) ``default_value(self) -> scalar`` |
| 145 | +^^^^ |
| 146 | +Return the default value for the wrapped data type. Zarr-Python uses this method to generate a default fill value |
| 147 | +for an array when a user has not requested one. |
| 148 | + |
| 149 | +Why is this a method and not a static attribute? Although some data types |
| 150 | +can have a static default value, parametrized data types like fixed-length strings or structured data types cannot. For these data types, |
| 151 | +a default value must be calculated based on the attributes of the wrapped data type. |
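
To make the distinction concrete, here is a stdlib-only sketch with two hypothetical classes that are not part of Zarr-Python. The choice of null bytes as the default for a fixed-length byte string is an assumption made for illustration, not something the Zarr specifications or this document mandate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Int64Sketch:
    """A non-parametric type: the default never depends on instance state."""

    def default_value(self) -> int:
        return 0


@dataclass(frozen=True)
class FixedLengthBytesSketch:
    """A parametric type: the default depends on the ``length`` parameter."""

    length: int

    def default_value(self) -> bytes:
        # Assumed default: ``length`` null bytes.
        return b"\x00" * self.length


int_default = Int64Sketch().default_value()
short_default = FixedLengthBytesSketch(length=2).default_value()
long_default = FixedLengthBytesSketch(length=4).default_value()
```

Because ``default_value`` is a method, both classes expose the same interface even though only the second one needs instance state to compute its result.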
| 152 | + |
| 153 | +(method) `` |
| 154 | + |
| 155 | + |
| 156 | + |