Commit e889419

Merge pull request #135 from grlee77/more-dtype-extensions
Add minimal drafts of a few more dtype extensions
2 parents b38f4a2 + b73be77 commit e889419

File tree

4 files changed

+371
-0
lines changed

docs/protocol/extensions.rst

Lines changed: 3 additions & 0 deletions
@@ -11,6 +11,9 @@ Under construction.
11 11
extensions/filters/v1.0
12 12
extensions/complex-dtypes/v1.0
13 13
extensions/datetime-dtypes/v1.0
14+
extensions/object-dtypes/v1.0
15+
extensions/string-dtypes/v1.0
16+
extensions/struct-dtypes/v1.0
14 17

15 18

16 19
A number of other features might be included in the core protocol v3, but are currently considered as extensions.
Lines changed: 105 additions & 0 deletions
@@ -0,0 +1,105 @@
1+
===================================
2+
Object data types (version 1.0)
3+
===================================
4+
-----------------------------
5+
Editor's draft 2 March 2022
6+
-----------------------------
7+
8+
Specification URI:
9+
http://purl.org/zarr/spec/protocol/extensions/object-dtypes/1.0
10+
Issue tracking:
11+
`GitHub issues <https://github.com/zarr-developers/zarr-specs/labels/object-dtypes-v1.0>`_
12+
Suggest an edit for this spec:
13+
`GitHub editor <https://github.com/zarr-developers/zarr-specs/blob/core-protocol-v3.0-dev/docs/protocol/extension/object-dtypes/v1.0.rst>`_
14+
15+
Copyright 2022 `Zarr core development
16+
team <https://github.com/orgs/zarr-developers/teams/core-devs>`_ (@@TODO
17+
list institutions?). This work is licensed under a `Creative Commons
18+
Attribution 3.0 Unported
19+
License <https://creativecommons.org/licenses/by/3.0/>`_.
20+
21+
----
22+
23+
24+
Abstract
25+
========
26+
27+
This specification is a Zarr protocol extension defining a data type where each
28+
element is an arbitrary Python object.
29+
30+
31+
Status of this document
32+
=======================
33+
34+
This document is a **Work in Progress**. It may be updated, replaced
35+
or obsoleted by other documents at any time. It is inappropriate to
36+
cite this document as other than work in progress.
37+
38+
Comments, questions or contributions to this document are very
39+
welcome. Comments and questions should be raised via `GitHub issues
40+
<https://github.com/zarr-developers/zarr-specs/labels/object-dtypes-v1.0>`_. When
41+
raising an issue, please add the label "object-dtypes-v1.0".
42+
43+
This document was produced by the `Zarr core development team
44+
<https://github.com/orgs/zarr-developers/teams/core-devs>`_.
45+
46+
47+
Document conventions
48+
====================
49+
50+
Conformance requirements are expressed with a combination of
51+
descriptive assertions and [RFC2119]_ terminology. The key words
52+
"MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD",
53+
"SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in the normative
54+
parts of this document are to be interpreted as described in
55+
[RFC2119]_. However, for readability, these words do not appear in all
56+
uppercase letters in this specification.
57+
58+
All of the text of this specification is normative except sections
59+
explicitly marked as non-normative, examples, and notes. Examples in
60+
this specification are introduced with the words "for example".
61+
62+
63+
Object data types
64+
=================
65+
NumPy's object arrays are arrays where each element is an arbitrary Python
66+
object. The array elements correspond to the
67+
`numpy.object_ <https://numpy.org/doc/1.22/reference/arrays.scalars.html#numpy.object_>`_
68+
type which has character code ``'O'``. A common concrete use case for this type
69+
is to have an array where each element is another array (and each array can
70+
have a different length). Another use case is to store an array of variable
71+
length strings. It is important to note that such an array actually just stores the references to the Python objects and not the objects themselves. Accessing
72+
an element of the array returns the Python object it refers to.
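
The two use cases above, and the fact that elements are references, can be illustrated with a minimal NumPy sketch. The variable names are illustrative only, and nothing here prescribes how a Zarr implementation would encode such an array on disk.

```python
import numpy as np

# Use case 1: each element is itself an array, and lengths may differ.
ragged = np.empty(3, dtype=object)
ragged[0] = np.array([1, 2, 3])
ragged[1] = np.array([4])
ragged[2] = np.array([5, 6])

# Use case 2: variable-length Python strings.
words = np.array(["a", "bcd", "efgh"], dtype=object)

# The array stores references, so indexing returns the original object.
s = "some string"
refs = np.array([s, s], dtype=object)

print(ragged.dtype.char)          # character code of the object dtype
print([len(x) for x in ragged])   # per-element lengths
print(refs[0] is s)               # identity check, not just equality
```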
73+
74+
Data Types added by this extension
75+
==================================
76+
77+
.. list-table:: Data types
78+
   :header-rows: 1
79+
80+
   * - Identifier
81+
     - Numerical type
82+
     - Size (no. bytes)
83+
     - Byte order
84+
   * - ``O`` (uppercase letter o)
85+
     - address of a Python object
86+
     - 8 (TODO: I assume this is actually a hardware-dependent memory address size?)
87+
     - None
88+
89+
90+
References
91+
==========
92+
93+
.. [NumPy] NumPy Data type objects. NumPy version 1.22.0
94+
documentation. URL:
95+
https://numpy.org/doc/1.22/reference/arrays.dtypes.html
96+
97+
.. [H5Py variable length strings] Variable length strings
98+
documentation. URL:
99+
https://docs.h5py.org/en/stable/special.html#variable-length-strings
100+
101+
Change log
102+
==========
103+
104+
@@TODO
105+
Lines changed: 136 additions & 0 deletions
@@ -0,0 +1,136 @@
1+
===================================
2+
String data types (version 1.0)
3+
===================================
4+
-----------------------------
5+
Editor's draft 2 March 2022
6+
-----------------------------
7+
8+
Specification URI:
9+
http://purl.org/zarr/spec/protocol/extensions/string-dtypes/1.0
10+
Issue tracking:
11+
`GitHub issues <https://github.com/zarr-developers/zarr-specs/labels/string-dtypes-v1.0>`_
12+
Suggest an edit for this spec:
13+
`GitHub editor <https://github.com/zarr-developers/zarr-specs/blob/core-protocol-v3.0-dev/docs/protocol/extension/string-dtypes/v1.0.rst>`_
14+
15+
Copyright 2022 `Zarr core development
16+
team <https://github.com/orgs/zarr-developers/teams/core-devs>`_ (@@TODO
17+
list institutions?). This work is licensed under a `Creative Commons
18+
Attribution 3.0 Unported
19+
License <https://creativecommons.org/licenses/by/3.0/>`_.
20+
21+
----
22+
23+
24+
Abstract
25+
========
26+
27+
This specification is a Zarr protocol extension defining data types
28+
for strings. It is an early draft and currently just describes existing support
29+
for NumPy string types that have already worked with zarr-python, but are not
30+
part of the core v3 spec.
31+
32+
33+
Status of this document
34+
=======================
35+
36+
This document is a **Work in Progress**. It may be updated, replaced
37+
or obsoleted by other documents at any time. It is inappropriate to
38+
cite this document as other than work in progress.
39+
40+
Comments, questions or contributions to this document are very
41+
welcome. Comments and questions should be raised via `GitHub issues
42+
<https://github.com/zarr-developers/zarr-specs/labels/string-dtypes-v1.0>`_. When
43+
raising an issue, please add the label "string-dtypes-v1.0".
44+
45+
This document was produced by the `Zarr core development team
46+
<https://github.com/orgs/zarr-developers/teams/core-devs>`_.
47+
48+
49+
Document conventions
50+
====================
51+
52+
Conformance requirements are expressed with a combination of
53+
descriptive assertions and [RFC2119]_ terminology. The key words
54+
"MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD",
55+
"SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in the normative
56+
parts of this document are to be interpreted as described in
57+
[RFC2119]_. However, for readability, these words do not appear in all
58+
uppercase letters in this specification.
59+
60+
All of the text of this specification is normative except sections
61+
explicitly marked as non-normative, examples, and notes. Examples in
62+
this specification are introduced with the words "for example".
63+
64+
65+
Extension data types
66+
====================
67+
68+
Two extension data types are defined: fixed-width (zero-terminated) byte strings and
69+
fixed-width Unicode strings stored as UTF-32.
70+
71+
Fixed length byte strings (zero-terminated)
72+
-------------------------------------------
73+
74+
These are fixed width strings corresponding to NumPy dtypes with ``kind`` ``'S'``.
75+
For backward compatibility with Python 2's ``str`` these are zero-terminated
76+
bytes and correspond to
77+
`numpy.bytes_ <https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.bytes_>`_
78+
(or its alias
79+
`numpy.string_ <https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.bytes_>`_).
80+
81+
For example, ``a = np.array(["a", "bcd", "efgh"], dtype="S4")`` creates an array where ``a.dtype.kind`` is ``'S'`` and ``a.data.tobytes()`` is ``b'a\x00\x00\x00bcd\x00efgh'``. Elements shorter than 4 bytes are padded with zero bytes so that every array element uses exactly 4 bytes (as
82+
indicated by ``dtype="S4"``); an element that fills all 4 bytes, such as ``efgh``, has no trailing zero byte.
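
The byte layout described above can be checked with a short NumPy sketch. It also shows that the raw buffer can be reinterpreted once the ``S4`` identifier is known; the variable names are illustrative.

```python
import numpy as np

a = np.array(["a", "bcd", "efgh"], dtype="S4")

# Every element occupies exactly 4 bytes; shorter values are NUL-padded.
raw = a.tobytes()
print(a.dtype.kind, a.dtype.itemsize)
print(raw)

# Given only the raw bytes and the identifier "S4", the array can be rebuilt.
# NumPy strips the trailing NUL padding when elements are accessed.
restored = np.frombuffer(raw, dtype="S4")
print(restored.tolist())
```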
83+
84+
85+
Fixed width unicode strings
86+
---------------------------
87+
88+
These are fixed width strings corresponding to NumPy dtypes with ``kind`` ``'U'``.
89+
Each element is a sequence of UTF-32 code points, padded with zeros to the fixed width, and corresponds to
90+
`numpy.str_ <https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.str_>`_
91+
(or its alias
92+
`numpy.unicode_ <https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.unicode_>`_).
93+
94+
For example, ``a = np.array(["a", "bcd", "efgh"], dtype="U4")`` creates an array where ``a.dtype.kind`` is ``'U'`` and each element is a sequence of 4 characters where each character occupies 4 bytes (UTF-32), so each element occupies 16 bytes.
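
As a rough illustration of the UTF-32 layout, the sketch below builds the same array with an explicit little-endian dtype and then converts it to big-endian. The explicit ``<``/``>`` prefixes are used so that the result does not depend on the host byte order; the variable names are illustrative.

```python
import numpy as np

a = np.array(["a", "bcd", "efgh"], dtype="<U4")
itemsize = a.dtype.itemsize  # 4 characters * 4 bytes per character

# Raw bytes of the first element in little-endian order.
first_le = a.tobytes()[:itemsize]

# The same values stored with big-endian code points.
big = a.astype(">U4")
first_be = big.tobytes()[:itemsize]

print(a.dtype.kind, itemsize)
print(first_le)
print(first_be)
```

The first element is the UTF-32 encoding of ``"a"`` followed by zero padding, and only the byte order within each 4-byte code point differs between the two variants.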
95+
96+
97+
Data Types added by this extension
98+
==================================
99+
100+
.. list-table:: Data types
101+
   :header-rows: 1
102+
103+
   * - Identifier
104+
     - Numerical type
105+
     - Size (no. bytes)
106+
     - Byte order
107+
   * - ``Sn``
108+
     - ``n`` character fixed-width byte string
109+
     - n
110+
     - None
111+
   * - ``<Un``
112+
     - ``n`` character fixed-width unicode string (UTF-32)
113+
     - 4n
114+
     - little-endian
115+
   * - ``>Un``
116+
     - ``n`` character fixed-width unicode string (UTF-32)
117+
     - 4n
118+
     - big-endian
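
The sizes in the table can be cross-checked against NumPy, assuming the identifiers are passed directly to ``np.dtype`` (here with ``n = 7`` as an arbitrary example). This is only a sanity check of the table, not a statement about how implementations must parse identifiers.

```python
import numpy as np

# Map each identifier to (kind, size in bytes, NumPy's canonical string).
info = {}
for identifier in ("S7", "<U7", ">U7"):
    dt = np.dtype(identifier)
    info[identifier] = (dt.kind, dt.itemsize, dt.str)
    print(identifier, dt.kind, dt.itemsize, dt.str)
```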
119+
120+
121+
References
122+
==========
123+
124+
.. [UTF-32] UTF-32 on Wikipedia.
125+
URL:
126+
https://en.wikipedia.org/wiki/UTF-32
127+
128+
.. [NumPy] NumPy Data type objects. NumPy version 1.22.0
129+
documentation. URL:
130+
https://numpy.org/doc/1.22/reference/arrays.dtypes.html
131+
132+
133+
Change log
134+
==========
135+
136+
@@TODO
Lines changed: 127 additions & 0 deletions
@@ -0,0 +1,127 @@
1+
===================================
2+
Struct data types (version 1.0)
3+
===================================
4+
-----------------------------
5+
Editor's draft 2 March 2022
6+
-----------------------------
7+
8+
Specification URI:
9+
http://purl.org/zarr/spec/protocol/extensions/struct-dtypes/1.0
10+
Issue tracking:
11+
`GitHub issues <https://github.com/zarr-developers/zarr-specs/labels/struct-dtypes-v1.0>`_
12+
Suggest an edit for this spec:
13+
`GitHub editor <https://github.com/zarr-developers/zarr-specs/blob/core-protocol-v3.0-dev/docs/protocol/extension/struct-dtypes/v1.0.rst>`_
14+
15+
Copyright 2022 `Zarr core development
16+
team <https://github.com/orgs/zarr-developers/teams/core-devs>`_ (@@TODO
17+
list institutions?). This work is licensed under a `Creative Commons
18+
Attribution 3.0 Unported
19+
License <https://creativecommons.org/licenses/by/3.0/>`_.
20+
21+
----
22+
23+
24+
Abstract
25+
========
26+
27+
This specification is a Zarr protocol extension defining data types
28+
for structured arrays. It is an early draft and currently just describes existing support for NumPy-style `structured arrays`_ that already have support in
29+
zarr-python, but are not part of the core Zarr v3 spec.
30+
31+
32+
Status of this document
33+
=======================
34+
35+
This document is a **Work in Progress**. It may be updated, replaced
36+
or obsoleted by other documents at any time. It is inappropriate to
37+
cite this document as other than work in progress.
38+
39+
Comments, questions or contributions to this document are very
40+
welcome. Comments and questions should be raised via `GitHub issues
41+
<https://github.com/zarr-developers/zarr-specs/labels/struct-dtypes-v1.0>`_. When
42+
raising an issue, please add the label "struct-dtypes-v1.0".
43+
44+
This document was produced by the `Zarr core development team
45+
<https://github.com/orgs/zarr-developers/teams/core-devs>`_.
46+
47+
48+
Document conventions
49+
====================
50+
51+
Conformance requirements are expressed with a combination of
52+
descriptive assertions and [RFC2119]_ terminology. The key words
53+
"MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD",
54+
"SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in the normative
55+
parts of this document are to be interpreted as described in
56+
[RFC2119]_. However, for readability, these words do not appear in all
57+
uppercase letters in this specification.
58+
59+
All of the text of this specification is normative except sections
60+
explicitly marked as non-normative, examples, and notes. Examples in
61+
this specification are introduced with the words "for example".
62+
63+
64+
Extension data types
65+
====================
66+
67+
NumPy allows representation of `Structured Arrays`_ where each element of the
68+
array is actually some combination of fields, each of which may have its own
69+
unique data type. NumPy's Record Arrays (``numpy.recarray``) also use this data type. The actual data is stored as an opaque sequence of bytes
70+
(i.e. a structure), represented by ``numpy.void``, and thus the string
71+
representation of this dtype in NumPy is ``'|Vn'`` where ``n`` is some integer
72+
number of bytes. In order to be able to properly interpret data of this type,
73+
it is necessary to store information on the fields.
74+
75+
A concrete example of such an array from the NumPy docs is::
76+
77+
    dogs = np.array([('Rex', 9, 81.0), ('Fido', 3, 27.0)],
78+
                    dtype=[('name', 'U10'), ('age', 'i4'), ('weight', 'f4')])
79+
80+
where ``dogs.dtype.kind`` is ``'V'`` and ``dogs.dtype.str`` is ``'|V48'``,
81+
indicating that 48 bytes are needed to store each element (4 bytes each for
82+
``age`` and ``weight`` and 4 * 10 = 40 bytes for a 10-character UTF-32
83+
``name``). If we were to read such a sequence of bytes from a Zarr array, we
84+
need the dtype description to know how to properly interpret this sequence of
85+
48 bytes. The NumPy dtype object has a ``descr`` attribute that describes this.
86+
In this case ``dogs.dtype.descr`` is ``[('name', '<U10'), ('age', '<i4'), ('weight', '<f4')]``.
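
A short sketch of why the field description is needed: the same raw bytes can be read either as opaque 48-byte records or, given ``descr``, as named fields. Explicit little-endian field types are used so that the result does not depend on the host byte order; the variable names are illustrative.

```python
import numpy as np

dogs = np.array(
    [("Rex", 9, 81.0), ("Fido", 3, 27.0)],
    dtype=[("name", "<U10"), ("age", "<i4"), ("weight", "<f4")],
)
descr = dogs.dtype.descr
raw = dogs.tobytes()  # 2 elements * 48 bytes

# Without the field description the bytes are only opaque 48-byte records.
opaque = np.frombuffer(raw, dtype="V48")

# With the description the fields can be interpreted again.
restored = np.frombuffer(raw, dtype=descr)

print(dogs.dtype.kind, dogs.dtype.str, dogs.dtype.itemsize)
print(restored["name"].tolist(), restored["age"].tolist())
```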
87+
88+
89+
Data Types added by this extension
90+
==================================
91+
92+
.. list-table:: Data types
93+
   :header-rows: 1
94+
95+
   * - Identifier
96+
     - Numerical type
97+
     - Size (no. bytes)
98+
     - Byte order
99+
   * - list of (<name>, <type>) tuples
100+
     - structure with named fields, each with possibly unique data type
101+
     - sum over the size of the dtypes in the identifier
102+
     - None
103+
104+
Here <name> is the name of the struct field and <type> is any of the
105+
scalar dtypes supported by the core Zarr v3 spec or the available extensions.
106+
In the case of NumPy's structured arrays this identifier is simply
107+
the array's ``.dtype.descr`` attribute.
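
As an illustration only, the sketch below round-trips such an identifier through JSON. This draft does not specify a metadata encoding, so the use of ``json`` here is an assumption. The sketch handles flat fields only; nested structures or fields with sub-array shapes would need extra handling.

```python
import json

import numpy as np

dtype = np.dtype([("name", "<U10"), ("age", "<i4"), ("weight", "<f4")])

# JSON has no tuple type, so each (name, type) pair becomes a 2-item list.
encoded = json.dumps(dtype.descr)

# np.dtype expects a list of tuples, so convert the inner lists back.
decoded = np.dtype([tuple(field) for field in json.loads(encoded)])

# Element size as the sum over the sizes of the field dtypes.
size = sum(np.dtype(field_type).itemsize for _, field_type in dtype.descr)

print(encoded)
print(decoded == dtype, size)
```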
108+
109+
110+
References
111+
==========
112+
113+
.. [NumPy] NumPy Data type objects. NumPy version 1.22.0
114+
documentation. URL:
115+
https://numpy.org/doc/1.22/reference/arrays.dtypes.html
116+
117+
.. [NumPy Structured] Structured Arrays.
118+
NumPy version 1.22 documentation. URL:
119+
https://numpy.org/doc/1.22/user/basics.rec.html
120+
121+
Change log
122+
==========
123+
124+
@@TODO
125+
126+
127+
.. _structured arrays: https://numpy.org/doc/1.22/user/basics.rec.html
