Commit 89f4107

Merge pull request #215 from alimanfoo/object-convenience-20171206
Object dtype convenience API; datetime64/timedelta64 support
2 parents 36db483 + f272f88 commit 89f4107

File tree

11 files changed

+332
-105
lines changed

docs/release.rst

Lines changed: 11 additions & 9 deletions
Original file line number | Diff line number | Diff line change
@@ -1,11 +1,16 @@
11
Release notes
22
=============
33

4-
.. release_2.2.0rc1
4+
.. _release_2.2.0rc1:
55

66
2.2.0rc1
77
--------
88

9+
To install the release candidate version::
10+
11+
$ pip install --pre zarr==2.2.0rc1
12+
13+
914
Enhancements
1015
~~~~~~~~~~~~
1116

@@ -119,6 +124,9 @@ Enhancements
119124
continue to work, however a warning will be raised to encourage use of the
120125
``object_codec`` parameter. :issue:`208`, :issue:`212`.
121126

127+
* **Added support for ``datetime64`` and ``timedelta64`` data types**;
128+
:issue:`85`, :issue:`215`.
129+
122130
Bug fixes
123131
~~~~~~~~~
124132

@@ -146,14 +154,8 @@ Documentation
146154
* Some changes have been made to the :ref:`spec_v2` document to clarify
147155
ambiguities and add some missing information. These changes do not break compatibility
148156
with any of the material as previously implemented, and so the changes have been made
149-
in-place in the document without incrementing the document version number. The
150-
specification now describes how bytes fill values should be encoded and
151-
decoded for arrays with a fixed-length byte string data type (:issue:`165`,
152-
:issue:`176`). The specification now clarifies that datetime64 and
153-
timedelta64 data types are not supported in this version (:issue:`85`). The
154-
specification now clarifies that the '.zattrs' key does not have to be present for
155-
either arrays or groups, and if absent then custom attributes should be treated as
156-
empty.
157+
in-place in the document without incrementing the document version number. See the
158+
section on :ref:`spec_v2_changes` in the specification document for more information.
157159
* A new :ref:`tutorial_indexing` section has been added to the tutorial.
158160
* A new :ref:`tutorial_strings` section has been added to the tutorial
159161
(:issue:`135`, :issue:`175`).

docs/spec/v2.rst

Lines changed: 39 additions & 10 deletions
@@ -15,6 +15,8 @@ Status
1515
This specification is the latest version. See :ref:`spec` for previous
1616
versions.
1717

18+
.. _spec_v2_storage:
19+
1820
Storage
1921
-------
2022

@@ -32,9 +34,13 @@ resources can be read, written or deleted via HTTP.
3234

3335
Below an "array store" refers to any system implementing this interface.
3436

37+
.. _spec_v2_array:
38+
3539
Arrays
3640
------
3741

42+
.. _spec_v2_array_metadata:
43+
3844
Metadata
3945
~~~~~~~~
4046

@@ -105,6 +111,8 @@ using the Blosc compression library prior to storage::
105111
"zarr_format": 2
106112
}
107113

114+
.. _spec_v2_array_dtype:
115+
108116
Data type encoding
109117
~~~~~~~~~~~~~~~~~~
110118

@@ -117,17 +125,20 @@ consists of 3 parts:
117125
``">"``: big-endian; ``"|"``: not-relevant)
118126
* One character code giving the basic type of the array (``"b"``: Boolean (integer
119127
type where all values are only True or False); ``"i"``: integer; ``"u"``: unsigned
120-
integer; ``"f"``: floating point; ``"c"``: complex floating point; ``"S"``: string
121-
(fixed-length sequence of char); ``"U"``: unicode (fixed-length sequence of
122-
Py_UNICODE); ``"V"``: other (void * – each item is a fixed-size chunk of memory))
128+
integer; ``"f"``: floating point; ``"c"``: complex floating point; ``"m"``: timedelta;
129+
``"M"``: datetime; ``"S"``: string (fixed-length sequence of char); ``"U"``: unicode
130+
(fixed-length sequence of Py_UNICODE); ``"V"``: other (void * – each item is a
131+
fixed-size chunk of memory))
123132
* An integer specifying the number of bytes the type uses.
124133

125134
The byte order MUST be specified. E.g., ``"<f8"``, ``">i4"``, ``"|b1"`` and
126135
``"|S12"`` are valid data type encodings.
127136

128-
Please note that NumPy's datetime64 ("M") and timedelta64 ("m") data types are **not**
129-
currently supported. Please store data using an appropriate physical data type instead,
130-
e.g., 64-bit integer.
137+
For datetime64 ("M") and timedelta64 ("m") data types, these MUST also include the
138+
units within square brackets. A list of valid units and their definitions is given in
139+
the `NumPy documentation on Datetimes and Timedeltas
140+
<https://docs.scipy.org/doc/numpy/reference/arrays.datetime.html#datetime-units>`_.
141+
For example, ``"<M8[ns]"`` specifies a datetime64 data type with nanosecond time units.
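The three-part encoding described above, plus the bracketed units required for datetime64 and timedelta64, can be sketched as a small parser. This is only an illustration: ``parse_typestr`` and ``_TYPESTR`` are hypothetical names, not part of Zarr or NumPy, and the sketch does not check the units against NumPy's list of valid units.

```python
import re

# Hypothetical helper (not part of Zarr): splits a simple, non-structured
# data type encoding into byte order, type code, size and optional units.
_TYPESTR = re.compile(
    r"^(?P<byteorder>[<>|])"        # byte order MUST be specified
    r"(?P<kind>[biufcmMSUV])"       # one-character basic type code
    r"(?P<size>\d+)"                # number of bytes the type uses
    r"(?:\[(?P<units>[^\]]+)\])?$"  # units, only for "m" and "M"
)


def parse_typestr(typestr):
    match = _TYPESTR.match(typestr)
    if match is None:
        raise ValueError("invalid data type encoding: %r" % typestr)
    parts = match.groupdict()
    parts["size"] = int(parts["size"])
    is_time = parts["kind"] in ("m", "M")
    if is_time and parts["units"] is None:
        # datetime64/timedelta64 MUST include units in square brackets
        raise ValueError("units are required for %r" % typestr)
    if not is_time and parts["units"] is not None:
        raise ValueError("units are only valid for 'm' and 'M': %r" % typestr)
    return parts


print(parse_typestr("<M8[ns]"))
print(parse_typestr("|S12"))
```

Under these assumptions, ``"<M8"`` (no units) and ``"f8"`` (no byte order) are both rejected with a ``ValueError``.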
131142

132143
Structured data types (i.e., with multiple named fields) are encoded as a list
133144
of two-element lists, following `NumPy array protocol type descriptions (descr)
@@ -136,6 +147,8 @@ example, the JSON list ``[["r", "|u1"], ["g", "|u1"], ["b", "|u1"]]`` defines a
136147
data type composed of three single-byte unsigned integers labelled "r", "g" and
137148
"b".
138149

150+
.. _spec_v2_array_fill_value:
151+
139152
Fill value encoding
140153
~~~~~~~~~~~~~~~~~~~
141154

@@ -154,6 +167,8 @@ If an array has a fixed length byte string data type (e.g., ``"|S12"``), or a
154167
structured data type, and if the fill value is not null, then the fill value
155168
MUST be encoded as an ASCII string using the standard Base64 alphabet.
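As a minimal sketch of the Base64 rule above, the following uses only the Python standard library. The function names ``encode_fill_value`` and ``decode_fill_value`` are hypothetical and chosen for illustration; a null fill value is passed through unchanged, on the assumption that it is stored as JSON ``null``.

```python
import base64


def encode_fill_value(value):
    """Encode a bytes fill value as an ASCII string (standard Base64 alphabet)."""
    if value is None:
        return None  # assumption: a null fill value is stored as JSON null
    return base64.standard_b64encode(value).decode("ascii")


def decode_fill_value(text):
    """Decode a Base64 ASCII string back to the raw bytes fill value."""
    if text is None:
        return None
    return base64.standard_b64decode(text)


encoded = encode_fill_value(b"Hello")
print(encoded)                     # SGVsbG8=
print(decode_fill_value(encoded))  # b'Hello'
```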
156169

170+
.. _spec_v2_array_chunks:
171+
157172
Chunks
158173
~~~~~~
159174

@@ -187,6 +202,8 @@ array dimension is not exactly divisible by the length of the corresponding
187202
chunk dimension then some chunks will overhang the edge of the array. The
188203
contents of any chunk region falling outside the array are undefined.
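To illustrate the overhang described above, the sketch below computes how many chunks are needed along each dimension and how far the last chunk extends past the array edge. It assumes a regular chunk grid; ``chunk_grid_shape`` and ``overhang`` are hypothetical helper names, not part of the specification.

```python
def chunk_grid_shape(shape, chunks):
    """Number of chunks along each dimension (ceiling division)."""
    return tuple(-(-s // c) for s, c in zip(shape, chunks))


def overhang(shape, chunks):
    """Number of items by which the last chunk extends past the array edge."""
    grid = chunk_grid_shape(shape, chunks)
    return tuple(n * c - s for n, c, s in zip(grid, chunks, shape))


print(chunk_grid_shape((10, 10), (3, 3)))  # (4, 4)
print(overhang((10, 10), (3, 3)))          # (2, 2)
```

When every dimension is exactly divisible by its chunk length, the overhang is zero in every dimension.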
189204

205+
.. _spec_v2_array_filters:
206+
190207
Filters
191208
~~~~~~~
192209

@@ -197,9 +214,13 @@ the primary compressor. When retrieving data, stored chunk data are
197214
decompressed by the primary compressor then decoded using filters in the
198215
reverse order.
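The ordering rule above can be sketched with two toy filters and ``zlib`` standing in for the primary compressor. The ``Delta`` and ``Reverse`` classes and the ``encode_chunk``/``decode_chunk`` functions are hypothetical and written only for this illustration; they are not the codecs shipped with Zarr or numcodecs.

```python
import zlib


class Delta:
    """Toy filter: store byte-wise differences modulo 256."""

    def encode(self, data):
        prev, out = 0, bytearray()
        for b in data:
            out.append((b - prev) % 256)
            prev = b
        return bytes(out)

    def decode(self, data):
        total, out = 0, bytearray()
        for b in data:
            total = (total + b) % 256
            out.append(total)
        return bytes(out)


class Reverse:
    """Toy filter: reverse the byte order of the chunk."""

    def encode(self, data):
        return data[::-1]

    def decode(self, data):
        return data[::-1]


def encode_chunk(data, filters):
    for f in filters:            # apply filters in the order given
        data = f.encode(data)
    return zlib.compress(data)   # then the (stand-in) primary compressor


def decode_chunk(stored, filters):
    data = zlib.decompress(stored)  # decompress first
    for f in reversed(filters):     # then undo filters in reverse order
        data = f.decode(data)
    return data


filters = [Delta(), Reverse()]
chunk = bytes([1, 2, 4, 8])
assert decode_chunk(encode_chunk(chunk, filters), filters) == chunk
```

These two filters do not commute, so undoing them in the original order rather than the reverse order would not recover the chunk.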
199216

217+
.. _spec_v2_hierarchy:
218+
200219
Hierarchies
201220
-----------
202221

222+
.. _spec_v2_hierarchy_paths:
223+
203224
Logical storage paths
204225
~~~~~~~~~~~~~~~~~~~~~
205226

@@ -235,6 +256,8 @@ treat all keys as opaque ASCII strings; equally, an array store could map
235256
logical paths onto some kind of hierarchical storage (e.g., directories on a
236257
file system).
237258

259+
.. _spec_v2_hierarchy_groups:
260+
238261
Groups
239262
~~~~~~
240263

@@ -269,6 +292,8 @@ under the logical paths "foo" and "foo/bar" and an array exists at logical path
269292
"foo/baz" then the members of the group at path "foo" are the group at path
270293
"foo/bar" and the array at path "foo/baz".
271294

295+
.. _spec_v2_attrs:
296+
272297
Attributes
273298
----------
274299

@@ -287,6 +312,8 @@ For example, the JSON object below encodes three attributes named
287312
"baz": [1, 2, 3, 4]
288313
}
289314

315+
.. _spec_v2_examples:
316+
290317
Examples
291318
--------
292319

@@ -463,6 +490,8 @@ What has been stored::
463490
foo/bar/1.0
464491
foo/bar/1.1
465492

493+
.. _spec_v2_changes:
494+
466495
Changes
467496
-------
468497

@@ -476,16 +505,16 @@ initially published to clarify ambiguities and add some missing information.
476505
decoded for arrays with a fixed-length byte string data type (:issue:`165`,
477506
:issue:`176`).
478507

479-
* The specification now clarifies that datetime64 and timedelta64 data types are not
480-
supported in this version (:issue:`85`).
508+
* The specification now clarifies that units must be specified for datetime64 and
509+
timedelta64 data types (:issue:`85`, :issue:`215`).
481510

482511
* The specification now clarifies that the '.zattrs' key does not have to be present for
483512
either arrays or groups, and if absent then custom attributes should be treated as
484513
empty.
485514

486515

487-
Changes in version 2
488-
~~~~~~~~~~~~~~~~~~~~
516+
Changes from version 1 to version 2
517+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
489518

490519
The following changes were made between version 1 and version 2 of this specification:
491520

docs/tutorial.rst

Lines changed: 93 additions & 31 deletions
@@ -779,6 +779,8 @@ If your strings are all ASCII strings, and you know the maximum length of the st
779779
your dataset, then you can use an array with a fixed-length bytes dtype. E.g.::
780780

781781
>>> z = zarr.zeros(10, dtype='S6')
782+
>>> z
783+
<zarr.core.Array (10,) |S6>
782784
>>> z[0] = b'Hello'
783785
>>> z[1] = b'world!'
784786
>>> z[:]
@@ -793,37 +795,68 @@ A fixed-length unicode dtype is also available, e.g.::
793795
... 'เฮลโลเวิลด์']
794796
>>> text_data = greetings * 10000
795797
>>> z = zarr.array(text_data, dtype='U20')
798+
>>> z
799+
<zarr.core.Array (120000,) <U20>
796800
>>> z[:]
797801
array(['¡Hola mundo!', 'Hej Världen!', 'Servus Woid!', ...,
798802
'Helló, világ!', 'Zdravo svete!', 'เฮลโลเวิลด์'],
799803
dtype='<U20')
800804

801-
For variable-length strings, the "object" dtype can be used, but a codec must be
805+
For variable-length strings, the ``object`` dtype can be used, but a codec must be
802806
provided to encode the data (see also :ref:`tutorial_objects` below). At the time of
803-
writing there are three codecs available that can encode variable length string
804-
objects, :class:`numcodecs.JSON`, :class:`numcodecs.MsgPack`. and
805-
:class:`numcodecs.Pickle`. E.g. using JSON::
807+
writing there are four codecs available that can encode variable length string
808+
objects: :class:`numcodecs.VLenUTF8`, :class:`numcodecs.JSON`, :class:`numcodecs.MsgPack`,
809+
and :class:`numcodecs.Pickle`. E.g. using ``VLenUTF8``::
806810

807811
>>> import numcodecs
808-
>>> z = zarr.array(text_data, dtype=object, object_codec=numcodecs.JSON())
812+
>>> z = zarr.array(text_data, dtype=object, object_codec=numcodecs.VLenUTF8())
813+
>>> z
814+
<zarr.core.Array (120000,) object>
815+
>>> z.filters
816+
[VLenUTF8()]
809817
>>> z[:]
810818
array(['¡Hola mundo!', 'Hej Världen!', 'Servus Woid!', ...,
811819
'Helló, világ!', 'Zdravo svete!', 'เฮลโลเวิลด์'], dtype=object)
812820

813-
...or alternatively using msgpack (requires `msgpack-python
814-
<https://github.com/msgpack/msgpack-python>`_ to be installed)::
821+
As a convenience, ``dtype=str`` (or ``dtype=unicode`` on Python 2.7) can be used, which
822+
is a short-hand for ``dtype=object, object_codec=numcodecs.VLenUTF8()``, e.g.::
815823

816-
>>> z = zarr.array(text_data, dtype=object, object_codec=numcodecs.MsgPack())
824+
>>> z = zarr.array(text_data, dtype=str)
825+
>>> z
826+
<zarr.core.Array (120000,) object>
827+
>>> z.filters
828+
[VLenUTF8()]
817829
>>> z[:]
818830
array(['¡Hola mundo!', 'Hej Världen!', 'Servus Woid!', ...,
819831
'Helló, világ!', 'Zdravo svete!', 'เฮลโลเวิลด์'], dtype=object)
820832

821-
If you know ahead of time all the possible string values that can occur, then you could
822-
also use the :class:`numcodecs.Categorize` codec to encode each unique value as an
833+
Variable-length byte strings are also supported via ``dtype=object``. Again an
834+
``object_codec`` is required, which can be one of :class:`numcodecs.VLenBytes` or
835+
:class:`numcodecs.Pickle`. For convenience, ``dtype=bytes`` (or ``dtype=str`` on Python
836+
2.7) can be used as a short-hand for ``dtype=object, object_codec=numcodecs.VLenBytes()``,
837+
e.g.::
838+
839+
>>> bytes_data = [g.encode('utf-8') for g in greetings] * 10000
840+
>>> z = zarr.array(bytes_data, dtype=bytes)
841+
>>> z
842+
<zarr.core.Array (120000,) object>
843+
>>> z.filters
844+
[VLenBytes()]
845+
>>> z[:]
846+
array([b'\xc2\xa1Hola mundo!', b'Hej V\xc3\xa4rlden!', b'Servus Woid!',
847+
..., b'Hell\xc3\xb3, vil\xc3\xa1g!', b'Zdravo svete!',
848+
b'\xe0\xb9\x80\xe0\xb8\xae\xe0\xb8\xa5\xe0\xb9\x82\xe0\xb8\xa5\xe0\xb9\x80\xe0\xb8\xa7\xe0\xb8\xb4\xe0\xb8\xa5\xe0\xb8\x94\xe0\xb9\x8c'], dtype=object)
849+
850+
If you know ahead of time all the possible string values that can occur, you could
851+
also use the :class:`numcodecs.Categorize` codec to encode each unique string value as an
823852
integer. E.g.::
824853

825854
>>> categorize = numcodecs.Categorize(greetings, dtype=object)
826855
>>> z = zarr.array(text_data, dtype=object, object_codec=categorize)
856+
>>> z
857+
<zarr.core.Array (120000,) object>
858+
>>> z.filters
859+
[Categorize(dtype='|O', astype='|u1', labels=['¡Hola mundo!', 'Hej Världen!', 'Servus Woid!', ...])]
827860
>>> z[:]
828861
array(['¡Hola mundo!', 'Hej Världen!', 'Servus Woid!', ...,
829862
'Helló, világ!', 'Zdravo svete!', 'เฮลโลเวิลด์'], dtype=object)
@@ -835,13 +868,14 @@ Object arrays
835868
-------------
836869

837870
Zarr supports arrays with an "object" dtype. This allows arrays to contain any type of
838-
object, such as variable length unicode strings, or variable length lists, or other
839-
possibilities. When creating an object array, a codec must be provided via the
871+
object, such as variable length unicode strings, or variable length arrays of numbers, or
872+
other possibilities. When creating an object array, a codec must be provided via the
840873
``object_codec`` argument. This codec handles encoding (serialization) of Python objects.
841-
At the time of writing there are three codecs available that can serve as a
842-
general purpose object codec and support encoding of a variety of
843-
object types: :class:`numcodecs.JSON`, :class:`numcodecs.MsgPack`. and
844-
:class:`numcodecs.Pickle`.
874+
The best codec to use will depend on what type of objects are present in the array.
875+
876+
At the time of writing there are three codecs available that can serve as a general
877+
purpose object codec and support encoding of a mixture of object types:
878+
:class:`numcodecs.JSON`, :class:`numcodecs.MsgPack`, and :class:`numcodecs.Pickle`.
845879

846880
For example, using the JSON codec::
847881

@@ -861,6 +895,40 @@ code can be embedded within pickled data. The JSON and MsgPack codecs do not hav
861895
security issues and support encoding of unicode strings, lists and dictionaries.
862896
MsgPack is usually faster for both encoding and decoding.
863897

898+
Ragged arrays
899+
~~~~~~~~~~~~~
900+
901+
If you need to store an array of arrays, where each member array can be of any length
902+
and stores the same primitive type (a.k.a. a ragged array), the
903+
:class:`numcodecs.VLenArray` codec can be used, e.g.::
904+
905+
>>> z = zarr.empty(4, dtype=object, object_codec=numcodecs.VLenArray(int))
906+
>>> z
907+
<zarr.core.Array (4,) object>
908+
>>> z.filters
909+
[VLenArray(dtype='<i8')]
910+
>>> z[0] = np.array([1, 3, 5])
911+
>>> z[1] = np.array([4])
912+
>>> z[2] = np.array([7, 9, 14])
913+
>>> z[:]
914+
array([array([1, 3, 5]), array([4]), array([ 7, 9, 14]),
915+
array([], dtype=int64)], dtype=object)
916+
917+
As a convenience, ``dtype='array:T'`` can be used as a short-hand for
918+
``dtype=object, object_codec=numcodecs.VLenArray('T')``, where 'T' can be any NumPy
919+
primitive dtype such as 'i4' or 'f8'. E.g.::
920+
921+
>>> z = zarr.empty(4, dtype='array:i8')
922+
>>> z
923+
<zarr.core.Array (4,) object>
924+
>>> z.filters
925+
[VLenArray(dtype='<i8')]
926+
>>> z[0] = np.array([1, 3, 5])
927+
>>> z[1] = np.array([4])
928+
>>> z[2] = np.array([7, 9, 14])
929+
>>> z[:]
930+
array([array([1, 3, 5]), array([4]), array([ 7, 9, 14]),
931+
array([], dtype=int64)], dtype=object)
864932

865933
.. _tutorial_chunks:
866934

@@ -1079,25 +1147,19 @@ E.g., pickle/unpickle an array stored on disk::
10791147
Datetimes and timedeltas
10801148
------------------------
10811149

1082-
Please note that NumPy's ``datetime64`` and ``timedelta64`` dtypes are **not** currently
1083-
supported for Zarr arrays. If you would like to store datetime or timedelta data, you
1084-
can store the data in an array with an integer dtype, e.g.::
1150+
NumPy's ``datetime64`` ('M8') and ``timedelta64`` ('m8') dtypes are supported for Zarr
1151+
arrays, as long as the units are specified. E.g.::
10851152

1086-
>>> a = np.array(['2007-07-13', '2006-01-13', '2010-08-13'], dtype='datetime64[D]')
1087-
>>> z = zarr.array(a.view('i8'))
1153+
>>> z = zarr.array(['2007-07-13', '2006-01-13', '2010-08-13'], dtype='M8[D]')
10881154
>>> z
1089-
<zarr.core.Array (3,) int64>
1155+
<zarr.core.Array (3,) datetime64[D]>
10901156
>>> z[:]
1091-
array([13707, 13161, 14834])
1092-
>>> z[:].view(a.dtype)
1093-
array(['2007-07-13', '2006-01-13', '2010-08-13'], dtype='datetime64[D]')
1094-
1095-
If you would like a convenient way to retrieve the data from this array viewed as the
1096-
original datetime64 dtype, try the :func:`zarr.core.Array.astype` method, e.g.::
1097-
1098-
>>> zv = z.astype(a.dtype)
1099-
>>> zv[:]
11001157
array(['2007-07-13', '2006-01-13', '2010-08-13'], dtype='datetime64[D]')
1158+
>>> z[0]
1159+
numpy.datetime64('2007-07-13')
1160+
>>> z[0] = '1999-12-31'
1161+
>>> z[:]
1162+
array(['1999-12-31', '2006-01-13', '2010-08-13'], dtype='datetime64[D]')
11011163

11021164
.. _tutorial_tips:
11031165

requirements_dev.txt

Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@ mccabe==0.6.1
1515
monotonic==1.3
1616
msgpack-python==0.4.8
1717
nose==1.3.7
18-
numcodecs==0.4.1
18+
numcodecs==0.5.2
1919
numpy==1.13.3
2020
packaging==16.8
2121
pkginfo==1.4.1
