refactor v3 data types #2874
Changes from 33 commits
@@ -0,0 +1,183 @@
Data types
==========

Zarr's data type model
----------------------

Every Zarr array has a "data type", which defines the meaning and physical layout of the
array's elements. Zarr is heavily influenced by `NumPy <https://numpy.org/doc/stable/>`_, and
Zarr-Python supports creating arrays with Numpy data types::

    >>> import zarr
    >>> import numpy as np
    >>> z = zarr.create_array(store={}, shape=(10,), dtype=np.dtype('uint8'))
    >>> z
    <Array memory://126225407345920 shape=(10,) dtype=uint8>

Unlike Numpy arrays, Zarr arrays are designed to be persisted to storage and read by Zarr implementations in different programming languages.
This means Zarr data types must be interpreted correctly when clients read an array. So each Zarr data type defines a procedure for
encoding / decoding that data type to / from Zarr array metadata, and also for encoding / decoding **instances** of that data type to / from
array metadata. These serialization procedures depend on the Zarr format.
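To make the format dependence concrete, here is a small sketch using plain NumPy (not Zarr-Python's actual serialization code): the same in-memory data type is identified differently in Zarr v2 metadata (NumPy's ``str`` form, with a byte-order prefix) than in Zarr v3 metadata (an endianness-free name).

```python
import numpy as np

# Sketch: one in-memory dtype, two metadata spellings.
dt = np.dtype('int16')
v2_name = dt.str   # '<i2' on little-endian machines, '>i2' on big-endian
v3_name = dt.name  # 'int16' everywhere

print(v2_name, v3_name)
```

The v2 identifier therefore varies with the machine's byte order, while the v3 identifier does not.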
Data types in Zarr version 2
----------------------------

Version 2 of the Zarr format defined its data types relative to `Numpy's data types <https://numpy.org/doc/2.1/reference/arrays.dtypes.html#data-type-objects-dtype>`_, and added a few non-Numpy data types as well.
Thus the JSON identifier for a Numpy-compatible data type is just the Numpy ``str`` attribute of that dtype::

    >>> import zarr
    >>> import numpy as np
    >>> import json
    >>> store = {}
    >>> np_dtype = np.dtype('int64')
    >>> z = zarr.create_array(store=store, shape=(1,), dtype=np_dtype, zarr_format=2)
    >>> dtype_meta = json.loads(store['.zarray'].to_bytes())["dtype"]
    >>> assert dtype_meta == np_dtype.str
    >>> dtype_meta
    '<i8'

.. note::
   The ``<`` character in the data type metadata encodes the `endianness <https://numpy.org/doc/2.2/reference/generated/numpy.dtype.byteorder.html>`_, or "byte order", of the data type. Following Numpy's example,
   in Zarr version 2 each data type has an endianness where applicable. However, Zarr version 3 data types do not store endianness information.
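The byte-order prefix can be inspected and manipulated directly with NumPy. A small illustration (plain NumPy, not Zarr-Python code):

```python
import numpy as np

# '<' = little-endian, '>' = big-endian, '|' = byte order not applicable
little = np.dtype('<i8')
big = little.newbyteorder('>')

assert little.str == '<i8'
assert big.str == '>i8'
# Byte order is irrelevant for single-byte types, so NumPy reports '|'
assert np.dtype('uint8').byteorder == '|'
```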
In addition to defining a representation of the data type itself (which in the example above was just a simple string ``"<i8"``), Zarr also
defines a metadata representation for scalars associated with that data type. Integers are stored as JSON numbers,
as are floats, with the caveat that ``NaN``, positive infinity, and negative infinity are stored as special strings.
Data types in Zarr version 3
----------------------------

* Data type names are different -- Zarr V2 represented the 16-bit signed integer data type as ``>i2``; Zarr V3 represents the same data type as ``int16``.
* No endianness
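Since NumPy can parse the v2 identifiers, the renaming can be sketched with a hypothetical helper (not the PR's actual API):

```python
import numpy as np

def v2_to_v3_name(v2_str: str) -> str:
    # Hypothetical sketch: NumPy resolves the byte-order prefix, and
    # ``.name`` yields an endianness-free name matching the Zarr v3 style.
    return np.dtype(v2_str).name

assert v2_to_v3_name('>i2') == 'int16'
assert v2_to_v3_name('<f8') == 'float64'
```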
Suggested change:

    - * No endianness
    + * No endianness; endianness is instead defined as part of the codec pipeline (see below).
the v3 section is very placeholder right now. eventually it will get a proper prose form that explains the endianness thing
I think it would be useful to define "native data types"
I added a definition in 4f3381f, let me know what you think
If developers are to subclass the DtypeWrapper class, perhaps we drop the Wrapper and just call it a Dtype? Or DtypeABC?
Other alternative names:
- ZarrDType
- CanonicalDType
- AbstractDType
- LocalDType
- UniversalDType
- HarmonizedDType
- DTypeSpec
- CrossLibraryDType
I don't like terms like Wrapper or ABC because they are vague computery terms. It would be good to use a descriptive term (in the vein of the above list) about what the wrapper is doing.
> I don't like terms like Wrapper or ABC because they are vague computery terms. It would be good to use a descriptive term (in the vein of the above list) about what the wrapper is doing.
The DTypeWrapper class is wrapping / abstracting over / managing creation of a dtype used by the library responsible for creating the in-memory arrays used by zarr-python for reading and writing data. I don't think any of your suggested names capture this behavior.
Maybe it's better to avoid attempting to convey the behavior of the class. I like ZarrDtype or ZDtype or DTypeABC. And I think we can ask people who choose to dig into our data type API to tolerate some "computery terms" :)
as of a2da99a I'm going with ZDType, how does that work for yall
I am against unnecessary abbreviations FWIW - ZarrDtype is my favorite
what is the thought behind making this a private attribute? If it is required to be implemented, should we make it public?
the reason it's private is because there's a public method for getting the name of a dtype instance (get_name), which takes a zarr_format parameter. The _zarr_v3_name is the name of the class, but at least in the case of the wonky r* dtype, the name of the class will never be the name of an actual dtype instance. r* is the name of the class, but r8, r16, etc would be the names of the data type instances. I would love to remove support for the r* dtype, but even if we did, zarr v2 dtypes like U4 would still require us to compute the name based on instance attributes.
Why even have an unsafe version? Can the check ever be expensive?
these docs are stale by now, but the idea was that from_dtype does input validation, but _from_dtype_unsafe does not.
here's an example for int32:
```python
@classmethod
def from_dtype(cls: type[Self], dtype: _BaseDType) -> Self:
    # We override the base implementation to address a windows-specific, pre-numpy 2 issue where
    # ``np.dtype('i')`` is an instance of ``np.dtypes.IntDType`` that acts like ``int32`` instead of ``np.dtype('int32')``.
    # In this case, ``type(np.dtype('i')) == np.dtypes.Int32DType`` will evaluate to ``True``,
    # despite the two classes being different. Thus we will create an instance of ``cls`` with the
    # latter dtype, after pulling in the byte order of the input.
    if dtype == np.dtypes.Int32DType():
        return cls._from_dtype_unsafe(np.dtypes.Int32DType().newbyteorder(dtype.byteorder))
    else:
        return super().from_dtype(dtype)

@classmethod
def _from_dtype_unsafe(cls, dtype: _BaseDType) -> Self:
    byte_order = cast("EndiannessNumpy", dtype.byteorder)
    return cls(endianness=endianness_from_numpy_str(byte_order))
```
from_dtype has to do some platform-specific input validation to ensure that the dtype instance is actually correct, and _from_dtype_unsafe just creates an instance of the data type
> these docs are stale by now, but the idea was that from_dtype does input validation, but _from_dtype_unsafe does not.
I guess my question was more along the lines of why provide the option of doing this if the check is cheap?
The two operations are logically separable, so my default approach is to separate them. This allows us to write subclasses that only override the input validation step without needing to also override the object creation step.
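That separation can be sketched generically (illustrative names only, not the actual ZDType API): a public constructor that validates and then delegates to a private, unchecked creation method.

```python
class Example:
    def __init__(self, value: int) -> None:
        self.value = value

    @classmethod
    def from_value(cls, value: int) -> "Example":
        # Public constructor: validation step; subclasses can override
        # just this method to change validation behavior.
        if value < 0:
            raise ValueError("value must be non-negative")
        return cls._from_value_unsafe(value)

    @classmethod
    def _from_value_unsafe(cls, value: int) -> "Example":
        # Private constructor: object creation only, no checks.
        return cls(value)

assert Example.from_value(3).value == 3
```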
Listed twice
```diff
@@ -8,6 +8,7 @@ User guide

    installation
    arrays
+   data_types
    groups
    attributes
    storage
```
```diff
@@ -139,11 +139,15 @@ def evolve_from_array_spec(self, array_spec: ArraySpec) -> Self:
     dtype = array_spec.dtype

     new_codec = self
     if new_codec.typesize is None:
-        new_codec = replace(new_codec, typesize=dtype.itemsize)
+        new_codec = replace(new_codec, typesize=dtype.to_dtype().itemsize)
     if new_codec.shuffle is None:
         new_codec = replace(
             new_codec,
-            shuffle=(BloscShuffle.bitshuffle if dtype.itemsize == 1 else BloscShuffle.shuffle),
+            shuffle=(
+                BloscShuffle.bitshuffle
+                if dtype.to_dtype().itemsize == 1
+                else BloscShuffle.shuffle
+            ),
         )

     return new_codec
```
```diff
@@ -10,6 +10,7 @@
 from zarr.abc.codec import ArrayBytesCodec
 from zarr.core.buffer import Buffer, NDArrayLike, NDBuffer
 from zarr.core.common import JSON, parse_enum, parse_named_configuration
+from zarr.core.dtype.common import endianness_to_numpy_str
 from zarr.registry import register_codec

 if TYPE_CHECKING:

@@ -56,7 +57,7 @@ def to_dict(self) -> dict[str, JSON]:
     return {"name": "bytes", "configuration": {"endian": self.endian.value}}

 def evolve_from_array_spec(self, array_spec: ArraySpec) -> Self:
-    if array_spec.dtype.itemsize == 0:
+    if array_spec.dtype.to_dtype().itemsize == 1:
         if self.endian is not None:
             return replace(self, endian=None)
     elif self.endian is None:

@@ -71,14 +72,9 @@ async def _decode_single(
     chunk_spec: ArraySpec,
 ) -> NDBuffer:
     assert isinstance(chunk_bytes, Buffer)
-    if chunk_spec.dtype.itemsize > 0:
-        if self.endian == Endian.little:
-            prefix = "<"
-        else:
-            prefix = ">"
-        dtype = np.dtype(f"{prefix}{chunk_spec.dtype.str[1:]}")
-    else:
-        dtype = np.dtype(f"|{chunk_spec.dtype.str[1:]}")
+    # TODO: remove endianness enum in favor of literal union
+    endian_str = self.endian.value if self.endian is not None else None
+    dtype = chunk_spec.dtype.to_dtype().newbyteorder(endianness_to_numpy_str(endian_str))
```
Suggested change:

    - dtype = chunk_spec.dtype.to_dtype().newbyteorder(endianness_to_numpy_str(endian_str))
    + dtype = chunk_spec.dtype.to_dtype()
    + # Set endianness
    + dtype = dtype.newbyteorder(endianness_to_numpy_str(endian_str))
The first line (getting the dtype) could even move up above the endian_str line too
I cleaned this up a bit in e386c2b, let me know if this is better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file is a super useful read. I'm wondering what to do with it though. Were you thinking it would go under the Advanced Topics section in the user guide?
No strong opinion from me. IMO our docs right now are not the most logically organized, so I anticipate some churn there in any case.