Merged
Changes from 33 commits
164 commits
f5e3f78
modernize typing
d-v-b Feb 21, 2025
b4e71e2
Merge branch 'main' of https://github.com/zarr-developers/zarr-python…
d-v-b Feb 24, 2025
3c50f54
lint
d-v-b Feb 24, 2025
d74e7a4
new dtypes
d-v-b Feb 26, 2025
5000dcb
rename base dtype, change type to kind
d-v-b Feb 26, 2025
9cd5c51
start working on JSON serialization
d-v-b Feb 27, 2025
042fac1
get json de/serialization largely working, and start making tests pass
d-v-b Feb 27, 2025
556e390
tweak json type guards
d-v-b Feb 27, 2025
b588f70
fix dtype sizes, adjust fill value parsing in from_dict, fix tests
d-v-b Feb 27, 2025
4ed41c6
mid-refactor commit
d-v-b Mar 2, 2025
1b2c773
working form for dtype classes
d-v-b Mar 2, 2025
24930b3
remove unused code
d-v-b Mar 2, 2025
703e0e1
use wrap / unwrap instead of to_dtype / from_dtype; push into v2 code…
d-v-b Mar 2, 2025
3c232a4
push into v2
d-v-b Mar 3, 2025
b7fe986
remove endianness kwarg to methods, make it an instance variable instead
d-v-b Mar 3, 2025
d9b44b4
make wrapping safe by default
d-v-b Mar 4, 2025
bf24d69
Merge branch 'main' of github.com:zarr-developers/zarr-python into fe…
d-v-b Mar 4, 2025
c1a8566
dtype-specific tests
d-v-b Mar 4, 2025
2868994
more tests, fix void type default value logic
d-v-b Mar 5, 2025
9ab0b1e
fix dtype mechanics in bytescodec
d-v-b Mar 5, 2025
e9f5e26
Merge branch 'main' into feat/fixed-length-strings
d-v-b Mar 5, 2025
6df84a9
Merge branch 'main' of https://github.com/zarr-developers/zarr-python…
d-v-b Mar 7, 2025
e14279d
remove __post_init__ magic in favor of more explicit declaration
d-v-b Mar 7, 2025
381a264
fix tests
d-v-b Mar 9, 2025
6a7857b
refactor data types
d-v-b Mar 12, 2025
e8fd72c
start design doc
d-v-b Mar 13, 2025
b22f324
more design doc
d-v-b Mar 13, 2025
b7a231e
update docs
d-v-b Mar 13, 2025
7dfcd0f
fix sphinx warnings
d-v-b Mar 13, 2025
706e6b6
tweak docs
d-v-b Mar 13, 2025
8fbf673
info about v3 data types
d-v-b Mar 13, 2025
e9aff64
adjust note
d-v-b Mar 13, 2025
44e78f5
fix: use unparametrized types in direct assignment
d-v-b Mar 13, 2025
60cac04
start fixing config
d-v-b Mar 17, 2025
120df57
Update src/zarr/core/_info.py
d-v-b Mar 17, 2025
0d9922b
add placeholder disclaimer to v3 data types summary
d-v-b Mar 17, 2025
2075952
make example runnable
d-v-b Mar 17, 2025
44369d6
placeholder section for adding a custom dtype
d-v-b Mar 17, 2025
4f3381f
define native data type and native scalar
d-v-b Mar 17, 2025
c8d7680
update data type names
d-v-b Mar 17, 2025
2a7b5a8
fix config test failures
d-v-b Mar 17, 2025
e855e54
call to_dtype once in blosc evolve_from_array_spec
d-v-b Mar 17, 2025
a2da99a
refactor dtypewrapper -> zdtype
d-v-b Mar 19, 2025
5ea3fa4
Merge branch 'main' into feat/fixed-length-strings
d-v-b Mar 19, 2025
cbb159d
update code examples in docs; remove native endianness
d-v-b Mar 19, 2025
c506d09
Merge branch 'feat/fixed-length-strings' of github.com:d-v-b/zarr-pyt…
d-v-b Mar 19, 2025
bb11867
adjust type annotations
d-v-b Mar 20, 2025
7a619e0
fix info tests to use zdtype
d-v-b Mar 20, 2025
ea2d0bf
remove dead code and add code coverage exemption to zarr format checks
d-v-b Mar 20, 2025
042c9e5
fix: add special check for resolving int32 on windows
d-v-b Mar 20, 2025
def5eb2
add dtype entry point test
d-v-b Mar 20, 2025
1b7273b
remove default parameters for parametric dtypes; add mixin classes fo…
d-v-b Mar 21, 2025
60b2e9d
Merge branch 'main' into feat/fixed-length-strings
d-v-b Mar 21, 2025
83f508c
Update docs/user-guide/data_types.rst
d-v-b Mar 24, 2025
4ceb6ed
refactor: use inheritance to remove boilerplate in dtype definitions
d-v-b Mar 24, 2025
5b9cff0
Merge branch 'feat/fixed-length-strings' of github.com:d-v-b/zarr-pyt…
d-v-b Mar 24, 2025
65f0453
Merge branch 'main' into feat/fixed-length-strings
d-v-b Mar 24, 2025
cb0a7d4
update data types documentation, and expose core/dtype module to autodoc
d-v-b Mar 24, 2025
40f0063
Merge branch 'feat/fixed-length-strings' of github.com:d-v-b/zarr-pyt…
d-v-b Mar 24, 2025
9989c64
add failing endianness round-trip test
d-v-b Mar 24, 2025
a276c84
fix endianness
d-v-b Mar 24, 2025
6285739
additional check in test_explicit_endianness
d-v-b Mar 24, 2025
e9241b9
Merge branch 'main' of github.com:zarr-developers/zarr-python into fe…
d-v-b Mar 24, 2025
2bffe1a
add failing test for round-tripping vlen strings
d-v-b Mar 24, 2025
aa32271
route object dtype arrays to vlen string dtype when numpy > 2
d-v-b Mar 25, 2025
617d3f0
relax endianness mismatch to a warning instead of an error
d-v-b Mar 25, 2025
2b5fd8f
use public dtype module for docs instead of special-casing the core d…
d-v-b Mar 25, 2025
1831f20
use public dtype module for docs instead of special-casing the core d…
d-v-b Mar 25, 2025
a427a16
silence mypy error about array indexing
d-v-b Mar 25, 2025
41d7e58
add release note
d-v-b Mar 25, 2025
c08ffd9
fix doctests, excluding config tests
d-v-b Mar 25, 2025
778d740
revert addition of linkage between dtype endianness and bytes codec e…
d-v-b Mar 26, 2025
269215e
remove Any types
d-v-b Mar 26, 2025
8af0ce4
add docstring for wrapper module
d-v-b Mar 26, 2025
df60d05
simplify config and docs
d-v-b Mar 26, 2025
7f54bbf
update config test
d-v-b Mar 26, 2025
be83f03
fix S dtype test for v2
d-v-b Mar 26, 2025
3979746
Merge branch 'main' of github.com:zarr-developers/zarr-python into fe…
d-v-b Apr 28, 2025
a210f9f
fully remove v3jsonencoder
d-v-b Apr 28, 2025
8fbd29a
refactor dtype module structure
d-v-b Apr 29, 2025
afc9872
add timedelta64
d-v-b Apr 29, 2025
e1bf901
refactor time dtypes
d-v-b Apr 30, 2025
45f0c88
Merge branch 'main' of https://github.com/zarr-developers/zarr-python…
d-v-b May 1, 2025
890077e
widen dtype test strategies
d-v-b May 1, 2025
a3f05f0
modify structured dtype fill value rt to avoid to_dict
d-v-b May 2, 2025
4788f05
wip: begin creating isomorphic test suite for dtypes
d-v-b May 2, 2025
d3f9204
finish common tests
d-v-b May 2, 2025
fdf17e3
wip: test infrastructure for dtypes
d-v-b May 7, 2025
4afa42a
wip: use class-based tests for all dtypes
d-v-b May 7, 2025
4990803
Merge branch 'main' of https://github.com/zarr-developers/zarr-python…
d-v-b May 7, 2025
1458aad
fill out more tests, and adjust sized dtypes
d-v-b May 8, 2025
9673997
Merge branch 'main' of https://github.com/zarr-developers/zarr-python…
d-v-b May 8, 2025
aa11df4
wip: json schema test
d-v-b May 12, 2025
f706b46
Merge branch 'main' of https://github.com/zarr-developers/zarr-python…
d-v-b May 12, 2025
52518c2
add casting tests
d-v-b May 13, 2025
4ab1c58
use relative link for changes
d-v-b May 13, 2025
e4c89f3
typo
d-v-b May 13, 2025
e386c2b
make bytes codec dtype logic a bit more literate
d-v-b May 13, 2025
703192c
increase deadline to 500ms
d-v-b May 13, 2025
0fab5e5
fewer commented sections of problematic lru_store_cache section of th…
d-v-b May 13, 2025
2f945bf
add link to gh issue about lru_cache for sharding codec
d-v-b May 13, 2025
63a6af4
attempt to speed up hypothesis tests by reducing max array size
d-v-b May 13, 2025
56e7c84
clean up docs
d-v-b May 13, 2025
eee0d7b
remove placeholder
d-v-b May 13, 2025
1dc8e72
make final example section doctested and more readable
d-v-b May 13, 2025
13ca230
revert change to auto chunking
d-v-b May 13, 2025
2a42205
revert quotation of literal type
d-v-b May 13, 2025
3f775c8
lint
d-v-b May 13, 2025
5320a77
Merge branch 'main' of https://github.com/zarr-developers/zarr-python…
d-v-b May 13, 2025
b525b8e
fix broken code block
d-v-b May 13, 2025
ec94878
specialize test to handle stringdtype changes coming in numpy 2.3
d-v-b May 13, 2025
3af98aa
add docstring to _TestZDType class
d-v-b May 13, 2025
6388203
Merge branch 'main' of https://github.com/zarr-developers/zarr-python…
d-v-b May 15, 2025
6ef7924
Merge branch 'main' of https://github.com/zarr-developers/zarr-python…
d-v-b May 15, 2025
1329c69
Merge branch 'main' of https://github.com/zarr-developers/zarr-python…
d-v-b May 15, 2025
d8c3672
type hints
d-v-b May 15, 2025
3f4d87a
Merge branch 'main' of https://github.com/zarr-developers/zarr-python…
d-v-b May 16, 2025
d8a382a
expand changelog
d-v-b May 16, 2025
9aa751b
tweak docstring
d-v-b May 16, 2025
e4a0372
support v3 nan strings in JSON for float dtypes
d-v-b May 19, 2025
8a976d6
revert removal of metadata chunk grid attribute
d-v-b May 21, 2025
be0d2df
use none to denote default fill value; remove old structured tests; u…
d-v-b May 22, 2025
8c90d2c
add item size abstraction
d-v-b May 22, 2025
0fc653f
Merge branch 'main' of github.com:zarr-developers/zarr-python into fe…
d-v-b May 22, 2025
7c58f7a
rename fixed-length string dtypes, and be strict about the numpy obje…
d-v-b May 22, 2025
3a21845
remove vestigial use of to_dtype().itemsize()
d-v-b May 22, 2025
ce0afe3
remove another vestigial use of to_dtype().itemsize()
d-v-b May 22, 2025
e67d4dc
emit warning about unstable dtype when serializing Structured dtype t…
d-v-b May 23, 2025
4e2a157
put string dtypes in the strings module
d-v-b May 24, 2025
a1deda6
Merge branch 'main' of https://github.com/zarr-developers/zarr-python…
d-v-b May 24, 2025
528a942
make tests isomorphic to source code
d-v-b May 24, 2025
c9c8181
remove old string logic
d-v-b May 25, 2025
1cb7734
Merge branch 'main' of https://github.com/zarr-developers/zarr-python…
d-v-b May 26, 2025
d80d565
use scale_factor and unit in cast_value for datetime
d-v-b May 26, 2025
7806563
add regression testing against v2.18
d-v-b May 27, 2025
39219fa
truncate U and S scalars in _cast_value_unsafe
d-v-b May 27, 2025
4a7a550
docstrings and simplification for regression tests
d-v-b May 27, 2025
807c585
changes necessary for linting with regression tests
d-v-b May 27, 2025
5150d60
improve method names, refactor type hints with typeddictionaries, fix…
d-v-b May 29, 2025
9ddbe97
Merge branch 'main' of github.com:zarr-developers/zarr-python into fe…
d-v-b May 29, 2025
d6535d6
fix storage info discrepancy in docs
d-v-b May 29, 2025
42e14ef
fix docstring that was troubling sphinx
d-v-b May 29, 2025
3991406
wip: add vlen-bytes
d-v-b May 29, 2025
d7da3d9
add vlen-bytes
d-v-b May 29, 2025
c3c3288
Merge branch 'main' of github.com:zarr-developers/zarr-python into fe…
d-v-b Jun 2, 2025
d1feaee
Merge branch 'main' into feat/fixed-length-strings
d-v-b Jun 5, 2025
3ef138a
Merge branch 'main' of https://github.com/zarr-developers/zarr-python…
d-v-b Jun 6, 2025
1f767e4
replace placeholder text with links to a github issue
d-v-b Jun 6, 2025
cf55041
refactor fixed-length bytes dtypes
d-v-b Jun 6, 2025
24b6b35
more v3 unstable dtype warnings, and their exemptions from tests
d-v-b Jun 6, 2025
7f099a2
Merge branch 'main' into feat/fixed-length-strings
d-v-b Jun 7, 2025
bf7e2c5
Merge branch 'main' into feat/fixed-length-strings
d-v-b Jun 7, 2025
cbb0b0d
clean up typeddicts
d-v-b Jun 7, 2025
8f3aa68
Merge branch 'main' into feat/fixed-length-strings
d-v-b Jun 7, 2025
e885869
update docstrings
d-v-b Jun 9, 2025
63de7c4
Update docs/user-guide/data_types.rst
d-v-b Jun 11, 2025
b069d36
refactor wrapper to allow subclasses to freely define their own type …
d-v-b Jun 13, 2025
ae36dbf
Merge branch 'main' of https://github.com/zarr-developers/zarr-python…
d-v-b Jun 13, 2025
a1f2c94
Merge branch 'feat/fixed-length-strings' of https://github.com/d-v-b/…
d-v-b Jun 13, 2025
b2e56c8
make method definition order consistent
d-v-b Jun 14, 2025
d26b695
allow structured scalars to be np.void
d-v-b Jun 14, 2025
49f0062
use a common function signature for from_json by packing the object_c…
d-v-b Jun 15, 2025
70da4da
fix dtype doc example
d-v-b Jun 15, 2025
16b4ac6
Merge branch 'main' into feat/fixed-length-strings
d-v-b Jun 16, 2025
183 changes: 183 additions & 0 deletions docs/user-guide/data_types.rst
@@ -0,0 +1,183 @@
Data types
Member

This file is a super useful read. I'm wondering what to do with it though. Were you thinking it would go under the Advanced Topics section in the user guide?

Contributor Author

No strong opinion from me. IMO our docs right now are not the most logically organized, so I anticipate some churn there in any case.

==========

Zarr's data type model
----------------------

Every Zarr array has a "data type", which defines the meaning and physical layout of the
array's elements. Zarr is heavily influenced by `NumPy <https://numpy.org/doc/stable/>`_, and
Zarr-Python supports creating arrays with Numpy data types::

    >>> import zarr
    >>> import numpy as np
    >>> z = zarr.create_array(store={}, shape=(10,), dtype=np.dtype('uint8'))
    >>> z
    <Array memory://126225407345920 shape=(10,) dtype=uint8>

Unlike Numpy arrays, Zarr arrays are designed to be persisted to storage and read by Zarr implementations in different programming languages.
This means Zarr data types must be interpreted correctly when clients read an array. Each Zarr data type therefore defines a procedure for
encoding and decoding the data type itself to and from Zarr array metadata, and a procedure for encoding and decoding **instances** (scalars)
of that data type to and from array metadata. These serialization procedures depend on the Zarr format.

Data types in Zarr version 2
-----------------------------

Version 2 of the Zarr format defined its data types relative to `Numpy's data types <https://numpy.org/doc/2.1/reference/arrays.dtypes.html#data-type-objects-dtype>`_, and added a few non-Numpy data types as well.
Thus the JSON identifier for a Numpy-compatible data type is just the Numpy ``str`` attribute of that dtype:

>>> import zarr
>>> import numpy as np
>>> import json
>>> store = {}
>>> np_dtype = np.dtype('int64')
>>> z = zarr.create_array(store=store, shape=(1,), dtype=np_dtype, zarr_format=2)
>>> dtype_meta = json.loads(store['.zarray'].to_bytes())["dtype"]
>>> assert dtype_meta == np_dtype.str
>>> dtype_meta
'<i8'

.. note::
    The ``<`` character in the data type metadata encodes the `endianness <https://numpy.org/doc/2.2/reference/generated/numpy.dtype.byteorder.html>`_, or "byte order", of the data type. Following Numpy's example,
    in Zarr version 2 each data type has an endianness where applicable. However, Zarr version 3 data types do not store endianness information.
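
To make the byte-order point concrete, here is a small sketch that uses only the standard-library ``struct`` module rather than Zarr or Numpy. It packs the same 64-bit integer with the ``<`` (little-endian) and ``>`` (big-endian) prefixes, which are the same prefix characters that appear in Numpy-style strings such as ``<i8``. The variable names are illustrative only.

```python
import struct

value = 1

# "<q" is a little-endian signed 64-bit integer; ">q" is the big-endian form.
little = struct.pack("<q", value)
big = struct.pack(">q", value)

# Same value and same size, but the bytes are stored in the opposite order.
assert len(little) == len(big) == 8
assert little == big[::-1]

# Decoding with the wrong byte order silently yields a different number,
# which is why a reader must know the byte order of the stored data.
wrong = struct.unpack(">q", little)[0]
assert wrong != value
```

A reader that assumes the wrong byte order gets no error, only wrong values, so the byte order has to be recorded somewhere in the array metadata.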

In addition to defining a representation of the data type itself (which in the example above was just a simple string, ``"<i8"``), Zarr also
defines a metadata representation of scalars associated with that data type. Integers are stored as ``JSON`` numbers,
as are floats, with the caveat that ``NaN``, positive infinity, and negative infinity are stored as special strings.
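
The float encoding described above can be sketched with the standard library alone. This is a minimal illustration, not Zarr-Python's implementation: ``float_to_json`` and ``float_from_json`` are hypothetical helper names, and the special strings used are ``"NaN"``, ``"Infinity"`` and ``"-Infinity"``.

```python
from __future__ import annotations

import json
import math


def float_to_json(value: float) -> float | str:
    """Encode a float as a JSON-compatible value (hypothetical helper)."""
    if math.isnan(value):
        return "NaN"
    if math.isinf(value):
        return "Infinity" if value > 0 else "-Infinity"
    return value


def float_from_json(data: float | int | str) -> float:
    """Invert ``float_to_json`` (hypothetical helper)."""
    special = {"NaN": math.nan, "Infinity": math.inf, "-Infinity": -math.inf}
    if isinstance(data, str):
        # Unknown strings raise KeyError rather than being guessed at.
        return special[data]
    return float(data)


# Ordinary floats stay JSON numbers; non-finite floats become strings.
encoded = json.dumps({"fill_value": float_to_json(float("inf"))})
decoded = float_from_json(json.loads(encoded)["fill_value"])
```

The strings are needed because strict JSON has no literal for ``NaN`` or the infinities.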

Data types in Zarr version 3
----------------------------

* Data type names are different -- Zarr V2 represented the big-endian 16-bit signed integer data type as ``>i2``; Zarr V3 represents the same data type as ``int16``, without the byte order.
* No endianness
Contributor

Is there a document somewhere that explains this decision and/or how endianness should be handled in zarr v3? If so, it should be linked here; if not, perhaps a paragraph or two are warranted in this doc?

Contributor

Edit: I see it's dealt with below. It would have been good to have something like:

Suggested change
- * No endianness
+ * No endianness; endianness is instead defined as part of the codec pipeline (see below).

Contributor Author

the v3 section is very placeholder right now. eventually it will get a proper prose form that explains the endianness thing

* A data type can be encoded in metadata as a string or a ``JSON`` object with the structure ``{"name": <string identifier>, "configuration": {...}}``
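
As a rough sketch of the last point, a reader of V3 metadata has to accept both the string form and the object form of a data type. The helper below is hypothetical (it is not part of Zarr-Python), and the ``example_dtype`` name and its ``length`` parameter are invented purely for illustration.

```python
from __future__ import annotations


def parse_v3_data_type(data: object) -> tuple[str, dict]:
    """Normalize a V3 ``data_type`` field to ``(name, configuration)``.

    Hypothetical helper: accepts either a bare string or an object of the
    form ``{"name": ..., "configuration": {...}}``.
    """
    if isinstance(data, str):
        # Short form: the name alone, with no configuration.
        return data, {}
    if isinstance(data, dict):
        name = data.get("name")
        config = data.get("configuration", {})
        if isinstance(name, str) and isinstance(config, dict):
            return name, config
    raise TypeError(f"Invalid data type metadata: {data!r}")


short_form = parse_v3_data_type("int16")
long_form = parse_v3_data_type(
    {"name": "example_dtype", "configuration": {"length": 4}}
)
```

Normalizing both forms to a single ``(name, configuration)`` pair keeps the rest of the decoding logic independent of which form was written.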

Data types in Zarr-Python
-------------------------

Zarr-Python supports two different Zarr formats, and those two formats specify data types in rather different ways:
data types in Zarr version 2 are encoded as Numpy-compatible strings, while data types in Zarr version 3 are encoded as either strings or ``JSON`` objects,
and the Zarr V3 data types don't have any associated endianness information, unlike Zarr V2 data types.

We also want Zarr-Python to support data types beyond what's available in Numpy. So it's crucial that we have a
model of array data types that can adapt to the differences between Zarr V2 and V3 and doesn't over-fit to Numpy.

Here are the operations we need to perform on data types in Zarr-Python:

* Round-trip native data types to fields in array metadata documents.
Member

I think it would be useful to define "native data types"

Contributor Author

I added a definition in 4f3381f, let me know what you think

  For example, the Numpy data type ``np.dtype('>i2')`` should be saved as ``{..., "dtype" : ">i2"}`` in Zarr V2 metadata.

  In Zarr V3 metadata, the same Numpy data type would be saved as ``{..., "data_type": "int16", "codecs": [..., {"name": "bytes", "configuration": {"endian": "big"}}, ...]}``.

* Define a default fill value. This is not mandated by the Zarr specifications, but it's convenient for users
  to have a useful default. For numeric types like integers and floats the default can be statically set to 0, but for
  parametric data types like fixed-length strings the default can only be generated after the data type has been parametrized at runtime.

* Round-trip scalars to the ``fill_value`` field in Zarr V2 and V3 array metadata documents. The Zarr V2 and V3 specifications
  define how scalars of each data type should be stored as JSON in array metadata documents, and in principle each data type
  can define this encoding separately.

* Do all of the above for *user-defined data types*. Zarr-Python should support data types added as extensions, so we cannot
  hard-code the list of data types. We need to ensure that users can easily (or easily enough) define a Python object
  that models their custom data type and register this object with Zarr-Python, so that the above operations all succeed for their
  custom data type.
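
The first operation in the list can be sketched as a pair of functions that turn a Numpy-style data type string into the corresponding metadata fragments. This is an illustration under stated assumptions, not Zarr-Python's code: the function names and the small lookup tables are hypothetical and cover only a handful of types.

```python
from __future__ import annotations

# Illustrative subset only: Numpy-style type codes mapped to Zarr V3 names.
V3_NAMES = {"i2": "int16", "i4": "int32", "i8": "int64", "f4": "float32", "f8": "float64"}
ENDIAN = {"<": "little", ">": "big"}


def v2_fragment(numpy_str: str) -> dict:
    """Zarr V2 stores the Numpy-style string, byte order included."""
    return {"dtype": numpy_str}


def v3_fragment(numpy_str: str) -> dict:
    """Zarr V3 stores the name only; byte order moves to the bytes codec."""
    byte_order, code = numpy_str[0], numpy_str[1:]
    return {
        "data_type": V3_NAMES[code],
        "codecs": [{"name": "bytes", "configuration": {"endian": ENDIAN[byte_order]}}],
    }


v2_meta = v2_fragment(">i2")
v3_meta = v3_fragment(">i2")
```

The same input yields two differently shaped fragments, which is why the data type model cannot simply store a Numpy dtype string.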

To achieve these goals, Zarr-Python uses a class called :class:`zarr.core.dtype.DTypeWrapper` to wrap native data types. Each data type
supported by Zarr-Python is modeled by a subclass of ``DTypeWrapper``, which has the following structure:
Member

If developers are to subclass the DtypeWrapper class, perhaps we drop the Wrapper and just call it a Dtype? Or DtypeABC?

Contributor

Other alternative names:

  • ZarrDType
  • CanonicalDType
  • AbstractDType
  • LocalDType
  • UniversalDType
  • HarmonizedDType
  • DTypeSpec
  • CrossLibraryDType

I don't like terms like Wrapper or ABC because they are vague computery terms. It would be good to use a descriptive term (in the vein of the above list) about what the wrapper is doing.

Contributor Author

> I don't like terms like Wrapper or ABC because they are vague computery terms. It would be good to use a descriptive term (in the vein of the above list) about what the wrapper is doing.

The DTypeWrapper class is wrapping / abstracting over / managing creation of a dtype used by the library responsible for creating the in-memory arrays used by zarr-python for reading and writing data. I don't think any of your suggested names capture this behavior.

Maybe it's better to avoid attempting to convey the behavior of the class. I like ZarrDtype or ZDtype or DTypeABC. And I think we can ask people who choose to dig into our data type API to tolerate some "computery terms" :)

Contributor Author

as of a2da99a I'm going with ZDType, how does that work for yall

Contributor

I am against unnecessary abbreviations FWIW - ZarrDtype is my favorite


(attribute) ``dtype_cls``
^^^^^^^^^^^^^^^^^^^^^^^^^
The ``dtype_cls`` attribute is a **class variable** that is bound to a class that can produce
an instance of a native data type. For example, on the ``DTypeWrapper`` used to model the boolean
data type, the ``dtype_cls`` attribute is bound to the Numpy bool data type class: ``np.dtypes.BoolDType``.
This attribute is used when we need to create an instance of the native data type, for example when
defining a Numpy array that will contain Zarr data.

It might seem odd that ``DTypeWrapper.dtype_cls`` binds to a *class* that produces a native data type instead of an instance of that native data type --
why not have a ``DTypeWrapper.dtype`` attribute that binds to ``np.dtypes.BoolDType()``? The reason ``DTypeWrapper``
doesn't wrap a concrete data type instance is that data type instances may have endianness information, but Zarr V3
data types do not. To model Zarr V3 data types, we need endianness to be an **instance variable** which is
defined when creating an instance of the ``DTypeWrapper``. Subclasses of ``DTypeWrapper`` that model data types with
byte order semantics thus have ``endianness`` as an instance variable, and this value can be set when creating an instance of the wrapper.
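
The split between a class-level ``dtype_cls`` and an instance-level ``endianness`` can be sketched with plain dataclasses. Everything here is hypothetical: ``NativeInt16`` is a stand-in for a native data type class (such as a Numpy dtype class), and ``Int16Wrapper`` only mimics the shape of the design described above.

```python
from dataclasses import dataclass
from typing import ClassVar, Literal


@dataclass(frozen=True)
class NativeInt16:
    """Stand-in for a native data type whose instances carry a byte order."""

    byteorder: Literal["<", ">"]


@dataclass(frozen=True)
class Int16Wrapper:
    # Class variable: shared by every wrapper instance, not a dataclass field.
    dtype_cls: ClassVar[type[NativeInt16]] = NativeInt16
    # Instance variable: chosen when a particular wrapper is created.
    endianness: Literal["little", "big"] = "little"

    def to_dtype(self) -> NativeInt16:
        # The native instance is built on demand from the class plus the
        # instance-level byte order.
        return self.dtype_cls("<" if self.endianness == "little" else ">")


big = Int16Wrapper(endianness="big")
little = Int16Wrapper()
```

Both wrappers share one ``dtype_cls`` but produce native instances that differ only in byte order.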


(attribute) ``_zarr_v3_name``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The ``_zarr_v3_name`` attribute encodes the canonical name for a data type for Zarr V3. For many data types these names
Member

what is the thought behind making this a private attribute? If it is required to be implemented, should we make it public?

Contributor Author

the reason it's private is because there's a public method for getting the name of a dtype instance (get_name), which takes a zarr_format parameter. The _zarr_v3_name is the name of the class, but at least in the case of the wonky r* dtype, the name of the class will never be the name of an actual dtype instance. r* is the name of the class, but r8, r16, etc would be the names of the data type instances. I would love to remove support for the r* dtype, but even if we did, zarr v2 dtypes like U4 would still require us to compute the name based on instance attributes.

are defined in the `Zarr V3 specification <https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#data-types>`_. For nearly all of the
data types defined in Zarr V3, this name can be used to uniquely specify a data type. The one exception is the ``r*`` data type,
which is parametrized by a number of bits, and so may take the form ``r8``, ``r16``, etc.

(class method) ``from_dtype(cls, dtype) -> Self``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This method defines a procedure for safely converting a native dtype instance into an instance of ``DTypeWrapper``. It should
validate its input, for example by ensuring that the native dtype is an instance of the ``dtype_cls`` class attribute. For some
data types, additional checks are needed -- in Numpy, "structured" data types and "void" data types use the same class, with different properties.
A ``DTypeWrapper`` that wraps Numpy structured data types must do additional checks to ensure that the input ``dtype`` is actually a structured data type.
If input validation succeeds, this method will call ``_from_dtype_unsafe``.
Contributor

Why even have an unsafe version? Can the check ever be expensive?

Contributor Author

these docs are stale by now, but the idea was that from_dtype does input validation, but _from_dtype_unsafe does not.

Contributor Author

here's an example for int32:

    @classmethod
    def from_dtype(cls: type[Self], dtype: _BaseDType) -> Self:
        # We override the base implementation to address a windows-specific, pre-numpy 2 issue where
        # ``np.dtype('i')`` is an instance of ``np.dtypes.IntDType`` that acts like `int32` instead of ``np.dtype('int32')``
        # In this case, ``type(np.dtype('i')) == np.dtypes.Int32DType``  will evaluate to ``True``,
        # despite the two classes being different. Thus we will create an instance of `cls` with the
        # latter dtype, after pulling in the byte order of the input
        if dtype == np.dtypes.Int32DType():
            return cls._from_dtype_unsafe(np.dtypes.Int32DType().newbyteorder(dtype.byteorder))
        else:
            return super().from_dtype(dtype)

    @classmethod
    def _from_dtype_unsafe(cls, dtype: _BaseDType) -> Self:
        byte_order = cast("EndiannessNumpy", dtype.byteorder)
        return cls(endianness=endianness_from_numpy_str(byte_order))

from_dtype has to do some platform-specific input validation to ensure that the dtype instance is actually correct, and _from_dtype_unsafe just creates an instance of the data type

Contributor

> these docs are stale by now, but the idea was that from_dtype does input validation, but _from_dtype_unsafe does not.

I guess my question was more along the lines of why provide the option of doing this if the check is cheap?

Contributor Author

the two operations are logically separable. so my default approach is to separate them. this allows us to write subclasses that only override the input validation step without needing to also override the object creation step.


(method) ``to_dtype(self) -> dtype``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This method produces a native data type consistent with the properties of the ``DTypeWrapper``. Together
with ``from_dtype``, this method allows round-trip conversion of a native data type into a wrapper class and then out again.

That is, for some ``DTypeWrapper`` class ``FooWrapper`` that wraps a native data type called ``foo``, ``FooWrapper.from_dtype(instance_of_foo).to_dtype() == instance_of_foo`` should be true.
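
The round-trip property, together with the split between validation and construction, can be sketched as follows. ``NativeFoo`` and ``NativeBar`` are hypothetical stand-ins for native data types, and ``FooWrapper`` is a simplified illustration rather than the real base class.

```python
from dataclasses import dataclass
from typing import ClassVar


@dataclass(frozen=True)
class NativeFoo:
    """Hypothetical native data type with a byte order ("<" or ">")."""

    byteorder: str


@dataclass(frozen=True)
class NativeBar:
    """A different hypothetical native data type, used to show rejection."""

    byteorder: str


@dataclass(frozen=True)
class FooWrapper:
    dtype_cls: ClassVar[type] = NativeFoo
    endianness: str

    @classmethod
    def from_dtype(cls, dtype: object) -> "FooWrapper":
        # Validation step: reject anything that is not the wrapped class.
        if type(dtype) is not cls.dtype_cls:
            raise TypeError(f"Expected {cls.dtype_cls.__name__}, got {dtype!r}")
        return cls._from_dtype_unsafe(dtype)

    @classmethod
    def _from_dtype_unsafe(cls, dtype: NativeFoo) -> "FooWrapper":
        # Construction step: assumes the input has already been validated.
        return cls(endianness="little" if dtype.byteorder == "<" else "big")

    def to_dtype(self) -> NativeFoo:
        return self.dtype_cls("<" if self.endianness == "little" else ">")


instance_of_foo = NativeFoo(">")
round_tripped = FooWrapper.from_dtype(instance_of_foo).to_dtype()
```

Because the two steps are separate, a subclass can override only the validation in ``from_dtype`` and reuse ``_from_dtype_unsafe`` unchanged.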

(method) ``to_dict(self) -> dict``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This method generates a JSON-serializable representation of the wrapped data type which can be stored in
Zarr metadata.

(method) ``cast_value(self, value: object) -> scalar``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This method converts a Python object to an instance of the wrapped data type. It is used for generating the default
value associated with this data type.


(method) ``default_value(self) -> scalar``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This method returns the default value for the wrapped data type. Zarr-Python uses this method to generate a default fill value
for an array when a user has not requested one.

Why is this a method and not a static attribute? Although some data types
can have a static default value, parametrized data types like fixed-length strings or structured data types cannot. For these data types,
a default value must be calculated based on the attributes of the wrapped data type.
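
A small sketch shows why a method is needed here. Both classes are hypothetical, and the choice of zero bytes as the default for the fixed-length type is an assumption made for illustration, not a statement about what Zarr-Python returns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Int32Sketch:
    """Non-parametric type: the default never varies."""

    def default_value(self) -> int:
        return 0


@dataclass(frozen=True)
class FixedLengthBytesSketch:
    """Parametric type: the default depends on the ``length`` parameter."""

    length: int

    def default_value(self) -> bytes:
        # Assumed default for illustration: ``length`` zero bytes.
        return b"\x00" * self.length


short_default = FixedLengthBytesSketch(length=2).default_value()
long_default = FixedLengthBytesSketch(length=4).default_value()
```

A static attribute could serve ``Int32Sketch``, but only a method can return a value that depends on ``length``.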

(class method) ``check_dtype(cls, dtype) -> bool``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This class method checks if a native dtype is compatible with the ``DTypeWrapper`` class. It returns ``True``
if ``dtype`` is compatible with the wrapper class, and ``False`` otherwise. For many data types, this check is as simple
as checking that ``cls.dtype_cls`` matches ``type(dtype)``, i.e. checking that the data type class wrapped
by the ``DTypeWrapper`` is the same as the class of ``dtype``. But there are some data types where this check alone is not sufficient,
in which case this method is overridden so that additional properties of ``dtype`` can be inspected and compared with
the expectations of ``cls``.

(class method) ``from_dict(cls, dtype) -> Self``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This class method creates a ``DTypeWrapper`` from an appropriately structured dictionary. The default
implementation first checks that the dictionary has the correct structure, and then uses its data
to instantiate the ``DTypeWrapper`` instance.

(method) ``to_dict(self) -> dict[str, JSON]``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Returns a dictionary form of the wrapped data type. This is used prior to writing array metadata.
Contributor

Listed twice


(method) ``get_name(self, zarr_format: Literal[2, 3]) -> str``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This method generates a name for the wrapped data type, depending on the Zarr format. If ``zarr_format`` is
2 and the wrapped data type is a Numpy data type, then the Numpy string representation of that data type is returned.
If ``zarr_format`` is 3, then the Zarr V3 name for the wrapped data type is returned. For most data types
the Zarr V3 name will be stored as the ``_zarr_v3_name`` class attribute, but for parametric data types the
name must be computed at runtime based on the parameters of the data type.
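
The difference between a static name and a computed name can be sketched like this. The classes are hypothetical; ``|b1`` and ``|V2`` are the strings Numpy reports for its bool and 2-byte void data types, and mapping the ``r*`` type to a Numpy void string for Zarr V2 is an assumption made only for this illustration.

```python
from dataclasses import dataclass
from typing import ClassVar, Literal


@dataclass(frozen=True)
class BoolSketch:
    _zarr_v3_name: ClassVar[str] = "bool"

    def get_name(self, zarr_format: Literal[2, 3]) -> str:
        # Non-parametric: the V3 name is the class attribute itself.
        return "|b1" if zarr_format == 2 else self._zarr_v3_name


@dataclass(frozen=True)
class RawBitsSketch:
    # "r*" names the family of types; no instance is ever called "r*".
    _zarr_v3_name: ClassVar[str] = "r*"
    bits: int

    def __post_init__(self) -> None:
        if self.bits <= 0 or self.bits % 8 != 0:
            raise ValueError("bits must be a positive multiple of 8")

    def get_name(self, zarr_format: Literal[2, 3]) -> str:
        # Parametric: the name is computed from the instance's parameters.
        if zarr_format == 2:
            return f"|V{self.bits // 8}"
        return f"r{self.bits}"


raw16 = RawBitsSketch(bits=16)
```

For ``RawBitsSketch`` the class attribute is only a template, so callers should always ask ``get_name`` for the name of a concrete instance.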


(method) ``to_json_value(self, data: scalar, zarr_format: Literal[2, 3]) -> JSON``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This method converts a scalar instance of the data type into a JSON-serializable value.
For some data types like bool and integers this conversion is simple -- just return a JSON boolean
or number -- but other data types define a JSON serialization for scalars that is a bit more involved.
This JSON serialization also depends on the Zarr format.

(method) ``from_json_value(self, data: JSON, zarr_format: Literal[2, 3]) -> scalar``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Convert a JSON-serialized scalar to a native scalar. This inverts the operation of ``to_json_value``.


1 change: 1 addition & 0 deletions docs/user-guide/index.rst
@@ -8,6 +8,7 @@ User guide

installation
arrays
data_types
groups
attributes
storage
28 changes: 17 additions & 11 deletions src/zarr/api/asynchronous.py
@@ -9,7 +9,13 @@
import numpy.typing as npt
from typing_extensions import deprecated

from zarr.core.array import Array, AsyncArray, create_array, get_array_metadata
from zarr.core.array import (
Array,
AsyncArray,
_get_default_chunk_encoding_v2,
create_array,
get_array_metadata,
)
from zarr.core.array_spec import ArrayConfig, ArrayConfigLike, ArrayConfigParams
from zarr.core.buffer import NDArrayLike
from zarr.core.common import (
@@ -21,16 +27,15 @@
_default_zarr_format,
_warn_order_kwarg,
_warn_write_empty_chunks_kwarg,
parse_dtype,
)
from zarr.core.dtype import get_data_type_from_numpy
from zarr.core.group import (
AsyncGroup,
ConsolidatedMetadata,
GroupMetadata,
create_hierarchy,
)
from zarr.core.metadata import ArrayMetadataDict, ArrayV2Metadata, ArrayV3Metadata
from zarr.core.metadata.v2 import _default_compressor, _default_filters
from zarr.errors import NodeTypeValidationError
from zarr.storage._common import make_store_path

@@ -428,11 +433,12 @@
shape = arr.shape
chunks = getattr(arr, "chunks", None) # for array-likes with chunks attribute
overwrite = kwargs.pop("overwrite", None) or _infer_overwrite(mode)
zarr_dtype = get_data_type_from_numpy(arr.dtype)

Codecov / codecov/patch: added line #L436 in src/zarr/api/asynchronous.py was not covered by tests.
new = await AsyncArray._create(
store_path,
zarr_format=zarr_format,
shape=shape,
dtype=arr.dtype,
dtype=zarr_dtype,
chunks=chunks,
overwrite=overwrite,
**kwargs,
@@ -978,15 +984,15 @@
_handle_zarr_version_or_format(zarr_version=zarr_version, zarr_format=zarr_format)
or _default_zarr_format()
)

dtype_wrapped = get_data_type_from_numpy(dtype)

Codecov / codecov/patch: added line #L987 in src/zarr/api/asynchronous.py was not covered by tests.
if zarr_format == 2:
if chunks is None:
chunks = shape
dtype = parse_dtype(dtype, zarr_format)
if not filters:
filters = _default_filters(dtype)
if not compressor:
compressor = _default_compressor(dtype)
default_filters, default_compressor = _get_default_chunk_encoding_v2(dtype_wrapped)
if filters is None:
filters = default_filters
if compressor is None:
compressor = default_compressor

Codecov / codecov/patch: added lines #L991 - L995 in src/zarr/api/asynchronous.py were not covered by tests.
elif zarr_format == 3 and chunk_shape is None: # type: ignore[redundant-expr]
if chunks is not None:
chunk_shape = chunks
@@ -1051,7 +1057,7 @@
store_path,
shape=shape,
chunks=chunks,
dtype=dtype,
dtype=dtype_wrapped,
compressor=compressor,
fill_value=fill_value,
overwrite=overwrite,
6 changes: 3 additions & 3 deletions src/zarr/codecs/_v2.py
@@ -48,15 +48,15 @@
# segfaults and other bad things happening
if chunk_spec.dtype != object:
try:
chunk = chunk.view(chunk_spec.dtype)
chunk = chunk.view(chunk_spec.dtype.to_dtype())
except TypeError:
# this will happen if the dtype of the chunk
# does not match the dtype of the array spec i.g. if
# the dtype of the chunk_spec is a string dtype, but the chunk
# is an object array. In this case, we need to convert the object
# array to the correct dtype.

chunk = np.array(chunk).astype(chunk_spec.dtype)
chunk = np.array(chunk).astype(chunk_spec.dtype.to_dtype())

Codecov / codecov/patch: added line #L59 in src/zarr/codecs/_v2.py was not covered by tests.

elif chunk.dtype != object:
# If we end up here, someone must have hacked around with the filters.
@@ -80,7 +80,7 @@
chunk = chunk_array.as_ndarray_like()

# ensure contiguous and correct order
chunk = chunk.astype(chunk_spec.dtype, order=chunk_spec.order, copy=False)
chunk = chunk.astype(chunk_spec.dtype.to_dtype(), order=chunk_spec.order, copy=False)

# apply filters
if self.filters:
8 changes: 6 additions & 2 deletions src/zarr/codecs/blosc.py
@@ -139,11 +139,15 @@ def evolve_from_array_spec(self, array_spec: ArraySpec) -> Self:
dtype = array_spec.dtype
Member

calling to_dtype here would help avoid having to call it twice below.

new_codec = self
if new_codec.typesize is None:
new_codec = replace(new_codec, typesize=dtype.itemsize)
new_codec = replace(new_codec, typesize=dtype.to_dtype().itemsize)
if new_codec.shuffle is None:
new_codec = replace(
new_codec,
shuffle=(BloscShuffle.bitshuffle if dtype.itemsize == 1 else BloscShuffle.shuffle),
shuffle=(
BloscShuffle.bitshuffle
if dtype.to_dtype().itemsize == 1
else BloscShuffle.shuffle
),
)

return new_codec
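
The review suggestion above (convert the wrapped data type once, then reuse the result) could look roughly like the following. This is a self-contained sketch with hypothetical stand-ins (``WrappedDType``, ``ShuffleSketch``, ``BloscSketch``) for the real classes; it is not the code that was merged.

```python
from __future__ import annotations

from dataclasses import dataclass, replace
from enum import Enum

import numpy as np


class ShuffleSketch(Enum):
    shuffle = "shuffle"
    bitshuffle = "bitshuffle"


@dataclass(frozen=True)
class WrappedDType:
    """Hypothetical wrapper that exposes the NumPy dtype via to_dtype()."""

    numpy_name: str

    def to_dtype(self) -> np.dtype:
        return np.dtype(self.numpy_name)


@dataclass(frozen=True)
class BloscSketch:
    typesize: int | None = None
    shuffle: ShuffleSketch | None = None

    def evolve_from_dtype(self, dtype: WrappedDType) -> BloscSketch:
        # Convert once and reuse the itemsize for both defaults.
        itemsize = dtype.to_dtype().itemsize
        new_codec = self
        if new_codec.typesize is None:
            new_codec = replace(new_codec, typesize=itemsize)
        if new_codec.shuffle is None:
            chosen = ShuffleSketch.bitshuffle if itemsize == 1 else ShuffleSketch.shuffle
            new_codec = replace(new_codec, shuffle=chosen)
        return new_codec
```

Values that were set explicitly on the codec are left untouched; only the ``None`` fields are filled in.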
14 changes: 5 additions & 9 deletions src/zarr/codecs/bytes.py
@@ -10,6 +10,7 @@
from zarr.abc.codec import ArrayBytesCodec
from zarr.core.buffer import Buffer, NDArrayLike, NDBuffer
from zarr.core.common import JSON, parse_enum, parse_named_configuration
from zarr.core.dtype.common import endianness_to_numpy_str
from zarr.registry import register_codec

if TYPE_CHECKING:
@@ -56,7 +57,7 @@ def to_dict(self) -> dict[str, JSON]:
return {"name": "bytes", "configuration": {"endian": self.endian.value}}

def evolve_from_array_spec(self, array_spec: ArraySpec) -> Self:
if array_spec.dtype.itemsize == 0:
if array_spec.dtype.to_dtype().itemsize == 1:
Contributor

What's the reasoning for this change?

Contributor Author

array_spec.dtype is a ZDType, which does not have an itemsize attribute. We have to create the wrapped numpy dtype to get its itemsize.

if self.endian is not None:
return replace(self, endian=None)
elif self.endian is None:
@@ -71,14 +72,9 @@ async def _decode_single(
chunk_spec: ArraySpec,
) -> NDBuffer:
assert isinstance(chunk_bytes, Buffer)
if chunk_spec.dtype.itemsize > 0:
if self.endian == Endian.little:
prefix = "<"
else:
prefix = ">"
dtype = np.dtype(f"{prefix}{chunk_spec.dtype.str[1:]}")
else:
dtype = np.dtype(f"|{chunk_spec.dtype.str[1:]}")
# TODO: remove endianness enum in favor of literal union
endian_str = self.endian.value if self.endian is not None else None
dtype = chunk_spec.dtype.to_dtype().newbyteorder(endianness_to_numpy_str(endian_str))
Contributor

For clarity and to reduce the amount of chaining I would do this in two steps

Suggested change
dtype = chunk_spec.dtype.to_dtype().newbyteorder(endianness_to_numpy_str(endian_str))
dtype = chunk_spec.dtype.to_dtype()
# Set endianess
dtype = dtype.newbyteorder(endianness_to_numpy_str(endian_str))

Contributor

The first line (getting the dtype) could even move up above the endian_str line too

Contributor Author

I cleaned this up a bit in e386c2b, let me know if this is better?
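
For readers following this thread, the two-step form under discussion can be tried in isolation with plain NumPy. The helper below is a hypothetical stand-in for ``endianness_to_numpy_str`` (its real signature and behavior are not shown in this diff); ``np.dtype.newbyteorder`` is the actual NumPy method.

```python
from __future__ import annotations

from typing import Literal

import numpy as np


def endianness_to_numpy_str_sketch(
    endianness: Literal["little", "big"] | None,
) -> Literal["<", ">", "|"]:
    """Assumed mapping from an endianness label to a NumPy byte-order character."""
    mapping: dict[str | None, Literal["<", ">", "|"]] = {
        "little": "<",
        "big": ">",
        None: "|",  # "|" tells NumPy to leave the byte order unchanged
    }
    return mapping[endianness]


# Step 1: get the native dtype.
dtype = np.dtype("int32")
# Step 2: set the endianness as a separate, readable step.
big_endian = dtype.newbyteorder(endianness_to_numpy_str_sketch("big"))
little_endian = dtype.newbyteorder(endianness_to_numpy_str_sketch("little"))

# Single-byte types have no byte order, so None maps to "|".
single_byte = np.dtype("uint8").newbyteorder(endianness_to_numpy_str_sketch(None))
```

This replaces the string-slicing approach (``f"{prefix}{dtype.str[1:]}"``) with a call that NumPy itself validates.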


as_array_like = chunk_bytes.as_array_like()
if isinstance(as_array_like, NDArrayLike):