refactor v3 data types #2874
Changes from 33 commits
@@ -0,0 +1,183 @@
Data types
==========

Zarr's data type model
----------------------

Every Zarr array has a "data type", which defines the meaning and physical layout of the
array's elements. Zarr is heavily influenced by `NumPy <https://numpy.org/doc/stable/>`_, and
Zarr-Python supports creating arrays with Numpy data types::

    >>> import zarr
    >>> import numpy as np
    >>> z = zarr.create_array(store={}, shape=(10,), dtype=np.dtype('uint8'))
    >>> z
    <Array memory://126225407345920 shape=(10,) dtype=uint8>

Unlike Numpy arrays, Zarr arrays are designed to be persisted to storage and read by Zarr implementations in different programming languages.
This means Zarr data types must be interpreted correctly when clients read an array. So each Zarr data type defines a procedure for
encoding / decoding that data type to / from Zarr array metadata, and also for encoding / decoding **instances** of that data type to / from
array metadata. These serialization procedures depend on the Zarr format.
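To make the format dependence concrete, here is a small sketch using plain NumPy (not Zarr-Python's actual serialization code): the same in-memory data type is identified differently in Zarr v2 metadata (NumPy's ``str`` form, with a byte-order prefix) than in Zarr v3 metadata (an endianness-free name).

```python
import numpy as np

# Sketch: one in-memory dtype, two metadata spellings.
dt = np.dtype('int16')
v2_name = dt.str   # '<i2' on little-endian machines, '>i2' on big-endian
v3_name = dt.name  # 'int16' everywhere

print(v2_name, v3_name)
```

The v2 identifier therefore varies with the machine's byte order, while the v3 identifier does not.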
Data types in Zarr version 2
----------------------------

Version 2 of the Zarr format defined its data types relative to `Numpy's data types <https://numpy.org/doc/2.1/reference/arrays.dtypes.html#data-type-objects-dtype>`_, and added a few non-Numpy data types as well.
Thus the JSON identifier for a Numpy-compatible data type is just the Numpy ``str`` attribute of that dtype::

    >>> import zarr
    >>> import numpy as np
    >>> import json
    >>> store = {}
    >>> np_dtype = np.dtype('int64')
    >>> z = zarr.create_array(store=store, shape=(1,), dtype=np_dtype, zarr_format=2)
    >>> dtype_meta = json.loads(store['.zarray'].to_bytes())["dtype"]
    >>> assert dtype_meta == np_dtype.str
    >>> dtype_meta
    '<i8'

.. note::
   The ``<`` character in the data type metadata encodes the `endianness <https://numpy.org/doc/2.2/reference/generated/numpy.dtype.byteorder.html>`_, or "byte order", of the data type. Following Numpy's example,
   in Zarr version 2 each data type has an endianness where applicable. However, Zarr version 3 data types do not store endianness information.
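The byte-order prefix can be inspected and manipulated directly with NumPy. A small illustration (plain NumPy, not Zarr-Python code):

```python
import numpy as np

# '<' = little-endian, '>' = big-endian, '|' = byte order not applicable
little = np.dtype('<i8')
big = little.newbyteorder('>')

assert little.str == '<i8'
assert big.str == '>i8'
# Byte order is irrelevant for single-byte types, so NumPy reports '|'
assert np.dtype('uint8').byteorder == '|'
```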
In addition to defining a representation of the data type itself (which in the example above was just a simple string ``"<i8"``), Zarr also
defines a metadata representation for scalars associated with that data type. Integers are stored as JSON numbers,
as are floats, with the caveat that ``NaN``, positive infinity, and negative infinity are stored as special strings.
Data types in Zarr version 3
----------------------------

* Data type names are different -- Zarr V2 represented the 16-bit signed integer data type as ``>i2``; Zarr V3 represents the same data type as ``int16``.
* No endianness
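Since NumPy can parse the v2 identifiers, the renaming can be sketched with a hypothetical helper (not the PR's actual API):

```python
import numpy as np

def v2_to_v3_name(v2_str: str) -> str:
    # Hypothetical sketch: NumPy resolves the byte-order prefix, and
    # ``.name`` yields an endianness-free name matching the Zarr v3 style.
    return np.dtype(v2_str).name

assert v2_to_v3_name('>i2') == 'int16'
assert v2_to_v3_name('<f8') == 'float64'
```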
Suggested change:

    - * No endianness
    + * No endianness; endianness is instead defined as part of the codec pipeline (see below).
the v3 section is very placeholder right now. eventually it will get a proper prose form that explains the endianness thing
I think it would be useful to define "native data types"
I added a definition in 4f3381f, let me know what you think
If developers are to subclass the DtypeWrapper class, perhaps we drop the Wrapper and just call it a Dtype? Or DtypeABC?
Other alternative names:
- ZarrDType
- CanonicalDType
- AbstractDType
- LocalDType
- UniversalDType
- HarmonizedDType
- DTypeSpec
- CrossLibraryDType
I don't like terms like Wrapper or ABC because they are vague computery terms. It would be good to use a descriptive term (in the vein of the above list) about what the wrapper is doing.
> I don't like terms like Wrapper or ABC because they are vague computery terms. It would be good to use a descriptive term (in the vein of the above list) about what the wrapper is doing.
The DTypeWrapper class is wrapping / abstracting over / managing creation of a dtype used by the library responsible for creating the in-memory arrays used by zarr-python for reading and writing data. I don't think any of your suggested names capture this behavior.
Maybe it's better to avoid attempting to convey the behavior of the class. I like ZarrDtype or ZDtype or DTypeABC. And I think we can ask people who choose to dig into our data type API to tolerate some "computery terms" :)
as of a2da99a I'm going with ZDType, how does that work for yall
I am against unnecessary abbreviations FWIW - ZarrDtype is my favorite
what is the thought behind making this a private attribute? If it is required to be implemented, should we make it public?
the reason it's private is because there's a public method for getting the name of a dtype instance (get_name), which takes a zarr_format parameter. The _zarr_v3_name is the name of the class, but at least in the case of the wonky r* dtype, the name of the class will never be the name of an actual dtype instance. r* is the name of the class, but r8, r16, etc would be the names of the data type instances. I would love to remove support for the r* dtype, but even if we did, zarr v2 dtypes like U4 would still require us to compute the name based on instance attributes.
Why even have an unsafe version? Can the check ever be expensive?
these docs are stale by now, but the idea was that from_dtype does input validation, but _from_dtype_unsafe does not.
here's an example for int32:
```python
@classmethod
def from_dtype(cls: type[Self], dtype: _BaseDType) -> Self:
    # We override the base implementation to address a windows-specific, pre-numpy 2 issue where
    # ``np.dtype('i')`` is an instance of ``np.dtypes.IntDType`` that acts like ``int32`` instead of ``np.dtype('int32')``.
    # In this case, ``type(np.dtype('i')) == np.dtypes.Int32DType`` will evaluate to ``True``,
    # despite the two classes being different. Thus we will create an instance of ``cls`` with the
    # latter dtype, after pulling in the byte order of the input.
    if dtype == np.dtypes.Int32DType():
        return cls._from_dtype_unsafe(np.dtypes.Int32DType().newbyteorder(dtype.byteorder))
    else:
        return super().from_dtype(dtype)

@classmethod
def _from_dtype_unsafe(cls, dtype: _BaseDType) -> Self:
    byte_order = cast("EndiannessNumpy", dtype.byteorder)
    return cls(endianness=endianness_from_numpy_str(byte_order))
```
from_dtype has to do some platform-specific input validation to ensure that the dtype instance is actually correct, and _from_dtype_unsafe just creates an instance of the data type
> these docs are stale by now, but the idea was that from_dtype does input validation, but _from_dtype_unsafe does not.
I guess my question was more along the lines of why provide the option of doing this if the check is cheap?
The two operations are logically separable, so my default approach is to separate them. This allows us to write subclasses that only override the input validation step without needing to also override the object creation step.
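That separation can be sketched generically (illustrative names only, not the actual ZDType API): a public constructor that validates and then delegates to a private, unchecked creation method.

```python
class Example:
    def __init__(self, value: int) -> None:
        self.value = value

    @classmethod
    def from_value(cls, value: int) -> "Example":
        # Public constructor: validation step; subclasses can override
        # just this method to change validation behavior.
        if value < 0:
            raise ValueError("value must be non-negative")
        return cls._from_value_unsafe(value)

    @classmethod
    def _from_value_unsafe(cls, value: int) -> "Example":
        # Private constructor: object creation only, no checks.
        return cls(value)

assert Example.from_value(3).value == 3
```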
Listed twice
```diff
@@ -8,6 +8,7 @@ User guide

    installation
    arrays
+   data_types
    groups
    attributes
    storage
```
```diff
@@ -139,11 +139,15 @@ def evolve_from_array_spec(self, array_spec: ArraySpec) -> Self:
     dtype = array_spec.dtype

     new_codec = self
     if new_codec.typesize is None:
-        new_codec = replace(new_codec, typesize=dtype.itemsize)
+        new_codec = replace(new_codec, typesize=dtype.to_dtype().itemsize)
     if new_codec.shuffle is None:
         new_codec = replace(
             new_codec,
-            shuffle=(BloscShuffle.bitshuffle if dtype.itemsize == 1 else BloscShuffle.shuffle),
+            shuffle=(
+                BloscShuffle.bitshuffle
+                if dtype.to_dtype().itemsize == 1
+                else BloscShuffle.shuffle
+            ),
         )

     return new_codec
```
```diff
@@ -10,6 +10,7 @@
 from zarr.abc.codec import ArrayBytesCodec
 from zarr.core.buffer import Buffer, NDArrayLike, NDBuffer
 from zarr.core.common import JSON, parse_enum, parse_named_configuration
+from zarr.core.dtype.common import endianness_to_numpy_str
 from zarr.registry import register_codec

 if TYPE_CHECKING:

@@ -56,7 +57,7 @@ def to_dict(self) -> dict[str, JSON]:
     return {"name": "bytes", "configuration": {"endian": self.endian.value}}

 def evolve_from_array_spec(self, array_spec: ArraySpec) -> Self:
-    if array_spec.dtype.itemsize == 0:
+    if array_spec.dtype.to_dtype().itemsize == 1:
         if self.endian is not None:
             return replace(self, endian=None)
     elif self.endian is None:

@@ -71,14 +72,9 @@ async def _decode_single(
     chunk_spec: ArraySpec,
 ) -> NDBuffer:
     assert isinstance(chunk_bytes, Buffer)
-    if chunk_spec.dtype.itemsize > 0:
-        if self.endian == Endian.little:
-            prefix = "<"
-        else:
-            prefix = ">"
-        dtype = np.dtype(f"{prefix}{chunk_spec.dtype.str[1:]}")
-    else:
-        dtype = np.dtype(f"|{chunk_spec.dtype.str[1:]}")
+    # TODO: remove endianness enum in favor of literal union
+    endian_str = self.endian.value if self.endian is not None else None
+    dtype = chunk_spec.dtype.to_dtype().newbyteorder(endianness_to_numpy_str(endian_str))
```
Suggested change:

    - dtype = chunk_spec.dtype.to_dtype().newbyteorder(endianness_to_numpy_str(endian_str))
    + dtype = chunk_spec.dtype.to_dtype()
    + # Set endianness
    + dtype = dtype.newbyteorder(endianness_to_numpy_str(endian_str))
The first line (getting the dtype) could even move up above the endian_str line too
I cleaned this up a bit in e386c2b, let me know if this is better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file is a super useful read. I'm wondering what to do with it though. Were you thinking it would go under the Advanced Topics section in the user guide?
No strong opinion from me. IMO our docs right now are not the most logically organized, so I anticipate some churn there in any case.