Commit 177c552

add draft of string dtypes extension
1 parent cf207ae commit 177c552

2 files changed: +137 -0 lines changed

docs/protocol/extensions.rst

Lines changed: 1 addition & 0 deletions
@@ -11,6 +11,7 @@ Under construction.
 extensions/filters/v1.0
 extensions/complex-dtypes/v1.0
 extensions/datetime-dtypes/v1.0
+extensions/string-dtypes/v1.0
 
 
 A number of other features might be included in the core protocol v3, but are currently considered as extensions.
Lines changed: 136 additions & 0 deletions
@@ -0,0 +1,136 @@
===================================
String data types (version 1.0)
===================================
-----------------------------
Editor's draft 2 March 2022
-----------------------------

Specification URI:
    http://purl.org/zarr/spec/protocol/extensions/string-dtypes/1.0
Issue tracking:
    `GitHub issues <https://github.com/zarr-developers/zarr-specs/labels/string-dtypes-v1.0>`_
Suggest an edit for this spec:
    `GitHub editor <https://github.com/zarr-developers/zarr-specs/blob/core-protocol-v3.0-dev/docs/protocol/extension/string-dtypes/v1.0.rst>`_

Copyright 2022 `Zarr core development
team <https://github.com/orgs/zarr-developers/teams/core-devs>`_ (@@TODO
list institutions?). This work is licensed under a `Creative Commons
Attribution 3.0 Unported
License <https://creativecommons.org/licenses/by/3.0/>`_.

----


Abstract
========

This specification is a Zarr protocol extension defining data types
for strings. It is an early draft and currently describes only the
NumPy string types that already work with zarr-python but are not
part of the core v3 specification.


Status of this document
=======================

This document is a **Work in Progress**. It may be updated, replaced
or obsoleted by other documents at any time. It is inappropriate to
cite this document as anything other than a work in progress.

Comments, questions or contributions to this document are very
welcome. Comments and questions should be raised via `GitHub issues
<https://github.com/zarr-developers/zarr-specs/labels/string-dtypes-v1.0>`_. When
raising an issue, please add the label "string-dtypes-v1.0".

This document was produced by the `Zarr core development team
<https://github.com/orgs/zarr-developers/teams/core-devs>`_.


Document conventions
====================

Conformance requirements are expressed with a combination of
descriptive assertions and [RFC2119]_ terminology. The key words
"MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD",
"SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in the normative
parts of this document are to be interpreted as described in
[RFC2119]_. However, for readability, these words do not appear in all
uppercase letters in this specification.

All of the text of this specification is normative except sections
explicitly marked as non-normative, examples, and notes. Examples in
this specification are introduced with the words "for example".


Extension data types
====================

Two extension data types are defined: fixed-length, zero-terminated byte
strings and fixed-length Unicode strings stored as 32-bit code units (UTF-32).

Fixed length byte strings (zero-terminated)
-------------------------------------------

These are fixed-width strings corresponding to NumPy dtypes with ``kind`` ``'S'``.
For backward compatibility with Python 2's ``str``, these are zero-terminated
bytes and correspond to
`numpy.bytes_ <https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.bytes_>`_
(or its alias
`numpy.string_ <https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.bytes_>`_).

For example, ``a = np.array(["a", "bcd", "efgh"], dtype="S4")`` creates an array
where ``a.dtype.kind`` is ``'S'`` and ``a.data.tobytes()`` is
``b'a\x00\x00\x00bcd\x00efgh'``. Elements shorter than 4 bytes are padded with
zero bytes so that each array element occupies exactly 4 bytes (as indicated by
``dtype="S4"``). An element that fills the full width, such as ``efgh``, carries
no terminating zero byte.
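
The padding described above can be checked with a short, non-normative NumPy sketch. It assumes NumPy is importable as ``np``. The names ``raw`` and ``decoded`` are illustrative only and are not part of this specification.

```python
import numpy as np

# Three ASCII strings stored in a fixed-width array, 4 bytes per element.
a = np.array(["a", "bcd", "efgh"], dtype="S4")
print(a.dtype.kind, a.dtype.itemsize)  # S 4

# The raw buffer is 3 elements x 4 bytes; short elements are zero-padded.
raw = a.tobytes()
print(raw)  # b'a\x00\x00\x00bcd\x00efgh'

# Reading the buffer back with the same dtype recovers the elements.
# NumPy drops the trailing zero bytes when an element is accessed.
decoded = np.frombuffer(raw, dtype="S4")
print(decoded.tolist())  # [b'a', b'bcd', b'efgh']
```

Because trailing zero bytes are dropped on access, a value that itself ends in a zero byte cannot be distinguished from a shorter, padded value.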


Fixed width unicode strings
---------------------------

These are fixed-width strings corresponding to NumPy dtypes with ``kind`` ``'U'``.
Each element is a sequence of 32-bit Unicode code points, zero-padded to the
fixed width, and corresponds to
`numpy.str_ <https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.str_>`_
(or its alias
`numpy.unicode_ <https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.unicode_>`_).

For example, ``a = np.array(["a", "bcd", "efgh"], dtype="U4")`` creates an array
where ``a.dtype.kind`` is ``'U'`` and each element is a sequence of 4 characters,
each of which occupies 4 bytes (UTF-32; see [UTF-32]_).
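
As a non-normative illustration of the layout described above, the following sketch compares the little-endian and big-endian forms of the same array. It assumes NumPy is importable as ``np``. The names ``words``, ``little``, ``big`` and ``first`` are illustrative only.

```python
import numpy as np

words = ["a", "bcd", "efgh"]
little = np.array(words, dtype="<U4")  # UTF-32, little-endian
big = np.array(words, dtype=">U4")     # UTF-32, big-endian

# Each element holds 4 code points of 4 bytes each.
print(little.dtype.kind, little.dtype.itemsize)  # U 16

# The first code point of the first element is "a" (U+0061); only the
# byte order of the 4-byte unit differs between the two arrays.
print(little.tobytes()[:4])  # b'a\x00\x00\x00'
print(big.tobytes()[:4])     # b'\x00\x00\x00a'

# Decoding one element's bytes as UTF-32 shows the zero padding explicitly.
first = little.tobytes()[:16].decode("utf-32-le")
print(repr(first))  # 'a\x00\x00\x00'
```

Both arrays compare equal element by element; the byte order affects only the stored representation.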


Data type identifiers
=====================

.. list-table:: Data types
   :header-rows: 1

   * - Identifier
     - Data type
     - Size (no. bytes)
     - Byte order
   * - ``Sn``
     - ``n`` character fixed-width byte string
     - n
     - None
   * - ``<Un``
     - ``n`` character fixed-width unicode string (UTF-32)
     - 4n
     - little-endian
   * - ``>Un``
     - ``n`` character fixed-width unicode string (UTF-32)
     - 4n
     - big-endian
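
The sizes in the table can be cross-checked against NumPy with a small, non-normative sketch. It assumes the identifiers are passed unchanged to ``np.dtype``. The helper ``describe`` is hypothetical and is not part of this specification or of any Zarr implementation.

```python
import numpy as np


def describe(identifier):
    """Return (kind, n, size_in_bytes) for an identifier such as "S4" or "<U4".

    Hypothetical helper for illustration only.
    """
    dt = np.dtype(identifier)
    # Byte strings use 1 byte per character; UTF-32 strings use 4.
    n = dt.itemsize if dt.kind == "S" else dt.itemsize // 4
    return dt.kind, n, dt.itemsize


for identifier in ("S4", "<U4", ">U4"):
    print(identifier, describe(identifier))
```

For ``n = 4`` this reports a size of 4 bytes for ``S4`` and 16 bytes for both ``<U4`` and ``>U4``, matching the ``n`` and ``4n`` entries in the table.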


References
==========

.. [RFC2119] S. Bradner. Key words for use in RFCs to Indicate
   Requirement Levels. March 1997. Best Current Practice. URL:
   https://tools.ietf.org/html/rfc2119

.. [UTF-32] UTF-32. Wikipedia. URL:
   https://en.wikipedia.org/wiki/UTF-32

.. [NumPy] NumPy Data type objects. NumPy version 1.22.0
   documentation. URL:
   https://numpy.org/doc/1.22/reference/arrays.dtypes.html


Change log
==========

@@TODO
