zarr-developers
diff --git a/docs/release.rst b/docs/release.rst
Lines changed: 11 additions & 9 deletions
diff --git a/docs/spec/v2.rst b/docs/spec/v2.rst
Lines changed: 39 additions & 10 deletions
diff --git a/docs/tutorial.rst b/docs/tutorial.rst
Lines changed: 93 additions & 31 deletions
diff --git a/requirements_dev.txt b/requirements_dev.txt
Lines changed: 1 addition & 1 deletion
@@ -1,11 +1,16 @@
 Release notes
 =============
 
-.. release_2.2.0rc1
+.. _release_2.2.0rc1:
 
 2.2.0rc1
 --------
 
+To install the release candidate version::
+
+    $ pip install --pre zarr==2.2.0rc1
+
+
 Enhancements
 ~~~~~~~~~~~~
 
@@ -119,6 +124,9 @@ Enhancements
   continue to work, however a warning will be raised to encourage use of the
   ``object_codec`` parameter. :issue:`208`, :issue:`212`.
 
+* **Added support for ``datetime64`` and ``timedelta64`` data types**;
+  :issue:`85`, :issue:`215`.
+
 Bug fixes
 ~~~~~~~~~
 
@@ -146,14 +154,8 @@ Documentation
 * Some changes have been made to the :ref:`spec_v2` document to clarify
   ambiguities and add some missing information. These changes do not break compatibility
   with any of the material as previously implemented, and so the changes have been made
-  in-place in the document without incrementing the document version number. The
-  specification now describes how bytes fill values should be encoded and
-  decoded for arrays with a fixed-length byte string data type (:issue:`165`,
-  :issue:`176`). The specification now clarifies that datetime64 and
-  timedelta64 data types are not supported in this version (:issue:`85`). The
-  specification now clarifies that the '.zattrs' key does not have to be present for
-  either arrays or groups, and if absent then custom attributes should be treated as
-  empty.
+  in-place in the document without incrementing the document version number. See the
+  section on :ref:`spec_v2_changes` in the specification document for more information.
 * A new :ref:`tutorial_indexing` section has been added to the tutorial.
 * A new :ref:`tutorial_strings` section has been added to the tutorial
   (:issue:`135`, :issue:`175`).
 
@@ -15,6 +15,8 @@ Status
 This specification is the latest version. See :ref:`spec` for previous
 versions.
 
+.. _spec_v2_storage:
+
 Storage
 -------
 
@@ -32,9 +34,13 @@ resources can be read, written or deleted via HTTP.
 
 Below an "array store" refers to any system implementing this interface.
 
+.. _spec_v2_array:
+
 Arrays
 ------
 
+.. _spec_v2_array_metadata:
+
 Metadata
 ~~~~~~~~
 
@@ -105,6 +111,8 @@ using the Blosc compression library prior to storage::
         "zarr_format": 2
     }
 
+.. _spec_v2_array_dtype:
+
 Data type encoding
 ~~~~~~~~~~~~~~~~~~
 
@@ -117,17 +125,20 @@ consists of 3 parts:
   ``">"``: big-endian; ``"|"``: not-relevant)
 * One character code giving the basic type of the array (``"b"``: Boolean (integer
   type where all values are only True or False); ``"i"``: integer; ``"u"``: unsigned
-  integer; ``"f"``: floating point; ``"c"``: complex floating point; ``"S"``: string
-  (fixed-length sequence of char); ``"U"``: unicode (fixed-length sequence of
-  Py_UNICODE); ``"V"``: other (void * – each item is a fixed-size chunk of memory))
+  integer; ``"f"``: floating point; ``"c"``: complex floating point; ``"m"``: timedelta;
+  ``"M"``: datetime; ``"S"``: string (fixed-length sequence of char); ``"U"``: unicode
+  (fixed-length sequence of Py_UNICODE); ``"V"``: other (void * – each item is a
+  fixed-size chunk of memory))
 * An integer specifying the number of bytes the type uses.
 
 The byte order MUST be specified. E.g., ``"<f8"``, ``">i4"``, ``"|b1"`` and
 ``"|S12"`` are valid data type encodings.
 
-Please note that NumPy's datetime64 ("M") and timedelta64 ("m") data types are **not**
-currently supported. Please store data using an appropriate physical data type instead,
-e.g., 64-bit integer.
+The encodings for datetime64 ("M") and timedelta64 ("m") data types MUST also include the
+units within square brackets. A list of valid units and their definitions is given in
+the `NumPy documentation on Datetimes and Timedeltas
+<https://docs.scipy.org/doc/numpy/reference/arrays.datetime.html#datetime-units>`_.
+For example, ``"<M8[ns]"`` specifies a datetime64 data type with nanosecond time units.
 
 Structured data types (i.e., with multiple named fields) are encoded as a list
 of two-element lists, following `NumPy array protocol type descriptions (descr)
@@ -136,6 +147,8 @@ example, the JSON list ``[["r", "|u1"], ["g", "|u1"], ["b", "|u1"]]`` defines a
 data type composed of three single-byte unsigned integers labelled "r", "g" and
 "b".
 
+.. _spec_v2_array_fill_value:
+
 Fill value encoding
 ~~~~~~~~~~~~~~~~~~~
 
@@ -154,6 +167,8 @@ If an array has a fixed length byte string data type (e.g., ``"|S12"``), or a
 structured data type, and if the fill value is not null, then the fill value
 MUST be encoded as an ASCII string using the standard Base64 alphabet.
 
+.. _spec_v2_array_chunks:
+
 Chunks
 ~~~~~~
 
@@ -187,6 +202,8 @@ array dimension is not exactly divisible by the length of the corresponding
 chunk dimension then some chunks will overhang the edge of the array. The
 contents of any chunk region falling outside the array are undefined.
 
+.. _spec_v2_array_filters:
+
 Filters
 ~~~~~~~
 
@@ -197,9 +214,13 @@ the primary compressor. When retrieving data, stored chunk data are
 decompressed by the primary compressor then decoded using filters in the
 reverse order.
 
+.. _spec_v2_hierarchy:
+
 Hierarchies
 -----------
 
+.. _spec_v2_hierarchy_paths:
+
 Logical storage paths
 ~~~~~~~~~~~~~~~~~~~~~
 
@@ -235,6 +256,8 @@ treat all keys as opaque ASCII strings; equally, an array store could map
 logical paths onto some kind of hierarchical storage (e.g., directories on a
 file system).
 
+.. _spec_v2_hierarchy_groups:
+
 Groups
 ~~~~~~
 
@@ -269,6 +292,8 @@ under the logical paths "foo" and "foo/bar" and an array exists at logical path
 "foo/baz" then the members of the group at path "foo" are the group at path
 "foo/bar" and the array at path "foo/baz".
 
+.. _spec_v2_attrs:
+
 Attributes
 ----------
 
@@ -287,6 +312,8 @@ For example, the JSON object below encodes three attributes named
         "baz": [1, 2, 3, 4]
     }
 
+.. _spec_v2_examples:
+
 Examples
 --------
 
@@ -463,6 +490,8 @@ What has been stored::
     foo/bar/1.0
     foo/bar/1.1
 
+.. _spec_v2_changes:
+
 Changes
 -------
 
@@ -476,16 +505,16 @@ initially published to clarify ambiguities and add some missing information.
   decoded for arrays with a fixed-length byte string data type (:issue:`165`,
   :issue:`176`).
 
-* The specification now clarifies that datetime64 and timedelta64 data types are not
-  supported in this version (:issue:`85`).
+* The specification now clarifies that units must be specified for datetime64 and
+  timedelta64 data types (:issue:`85`, :issue:`215`).
 
 * The specification now clarifies that the '.zattrs' key does not have to be present for
   either arrays or groups, and if absent then custom attributes should be treated as
   empty.
 
 
-Changes in version 2
-~~~~~~~~~~~~~~~~~~~~
+Changes from version 1 to version 2
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 The following changes were made between version 1 and version 2 of this specification:
 
 
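Note on the data type encoding hunk above: the rule that "M" and "m" encodings must carry units in square brackets can be checked mechanically. The sketch below is illustrative only. ``parse_typestr`` and ``TYPESTR_PATTERN`` are hypothetical names that do not exist in Zarr or NumPy, and the pattern covers only simple (non-structured) encodings as described in the spec text.

```python
import re

# Hypothetical helper for illustration; not part of Zarr or NumPy.
# Pattern: byte order, one-character type code, byte count, optional [units].
TYPESTR_PATTERN = re.compile(r"^([<>|])([biufcmMSUV])(\d+)(?:\[([^\]]+)\])?$")


def parse_typestr(typestr):
    """Split a simple data type encoding into (byteorder, kind, size, units)."""
    match = TYPESTR_PATTERN.match(typestr)
    if match is None:
        # Covers a missing byte order, which the spec says MUST be specified.
        raise ValueError("invalid data type encoding: %r" % typestr)
    byteorder, kind, size, units = match.groups()
    is_time = kind in ("m", "M")
    if is_time and units is None:
        raise ValueError("units are required for %r" % typestr)
    if not is_time and units is not None:
        raise ValueError("units are only valid for 'm' and 'M': %r" % typestr)
    return byteorder, kind, int(size), units


print(parse_typestr("<M8[ns]"))  # datetime64 with nanosecond units
print(parse_typestr("|S12"))     # fixed-length byte string, no units
```

Under these assumptions ``"<M8[ns]"`` and ``"<m8[s]"`` are accepted, while ``"<M8"`` (no units) and ``"f8"`` (no byte order) are rejected.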
@@ -779,6 +779,8 @@ If your strings are all ASCII strings, and you know the maximum length of the st
 your dataset, then you can use an array with a fixed-length bytes dtype. E.g.::
 
     >>> z = zarr.zeros(10, dtype='S6')
+    >>> z
+    <zarr.core.Array (10,) |S6>
     >>> z[0] = b'Hello'
     >>> z[1] = b'world!'
     >>> z[:]
@@ -793,37 +795,68 @@ A fixed-length unicode dtype is also available, e.g.::
     ...              'เฮลโลเวิลด์']
     >>> text_data = greetings * 10000
     >>> z = zarr.array(text_data, dtype='U20')
+    >>> z
+    <zarr.core.Array (120000,) <U20>
     >>> z[:]
     array(['¡Hola mundo!', 'Hej Världen!', 'Servus Woid!', ...,
            'Helló, világ!', 'Zdravo svete!', 'เฮลโลเวิลด์'],
           dtype='<U20')
 
-For variable-length strings, the "object" dtype can be used, but a codec must be
+For variable-length strings, the ``object`` dtype can be used, but a codec must be
 provided to encode the data (see also :ref:`tutorial_objects` below). At the time of
-writing there are three codecs available that can encode variable length string
-objects, :class:`numcodecs.JSON`, :class:`numcodecs.MsgPack`. and
-:class:`numcodecs.Pickle`. E.g. using JSON::
+writing there are four codecs available that can encode variable-length string
+objects: :class:`numcodecs.VLenUTF8`, :class:`numcodecs.JSON`, :class:`numcodecs.MsgPack`
+and :class:`numcodecs.Pickle`. E.g. using ``VLenUTF8``::
 
     >>> import numcodecs
-    >>> z = zarr.array(text_data, dtype=object, object_codec=numcodecs.JSON())
+    >>> z = zarr.array(text_data, dtype=object, object_codec=numcodecs.VLenUTF8())
+    >>> z
+    <zarr.core.Array (120000,) object>
+    >>> z.filters
+    [VLenUTF8()]
     >>> z[:]
     array(['¡Hola mundo!', 'Hej Världen!', 'Servus Woid!', ...,
            'Helló, világ!', 'Zdravo svete!', 'เฮลโลเวิลด์'], dtype=object)
 
-...or alternatively using msgpack (requires `msgpack-python
-<https://github.com/msgpack/msgpack-python>`_ to be installed)::
+As a convenience, ``dtype=str`` (or ``dtype=unicode`` on Python 2.7) can be used, which
+is a short-hand for ``dtype=object, object_codec=numcodecs.VLenUTF8()``, e.g.::
 
-    >>> z = zarr.array(text_data, dtype=object, object_codec=numcodecs.MsgPack())
+    >>> z = zarr.array(text_data, dtype=str)
+    >>> z
+    <zarr.core.Array (120000,) object>
+    >>> z.filters
+    [VLenUTF8()]
     >>> z[:]
     array(['¡Hola mundo!', 'Hej Världen!', 'Servus Woid!', ...,
            'Helló, világ!', 'Zdravo svete!', 'เฮลโลเวิลด์'], dtype=object)
 
-If you know ahead of time all the possible string values that can occur, then you could
-also use the :class:`numcodecs.Categorize` codec to encode each unique value as an
+Variable-length byte strings are also supported via ``dtype=object``. Again, an
+``object_codec`` is required, which can be one of :class:`numcodecs.VLenBytes` or
+:class:`numcodecs.Pickle`. For convenience, ``dtype=bytes`` (or ``dtype=str`` on Python
+2.7) can be used as a short-hand for ``dtype=object, object_codec=numcodecs.VLenBytes()``,
+e.g.::
+
+    >>> bytes_data = [g.encode('utf-8') for g in greetings] * 10000
+    >>> z = zarr.array(bytes_data, dtype=bytes)
+    >>> z
+    <zarr.core.Array (120000,) object>
+    >>> z.filters
+    [VLenBytes()]
+    >>> z[:]
+    array([b'\xc2\xa1Hola mundo!', b'Hej V\xc3\xa4rlden!', b'Servus Woid!',
+           ..., b'Hell\xc3\xb3, vil\xc3\xa1g!', b'Zdravo svete!',
+           b'\xe0\xb9\x80\xe0\xb8\xae\xe0\xb8\xa5\xe0\xb9\x82\xe0\xb8\xa5\xe0\xb9\x80\xe0\xb8\xa7\xe0\xb8\xb4\xe0\xb8\xa5\xe0\xb8\x94\xe0\xb9\x8c'], dtype=object)
+
+If you know ahead of time all the possible string values that can occur, you could
+also use the :class:`numcodecs.Categorize` codec to encode each unique string value as an
 integer. E.g.::
 
     >>> categorize = numcodecs.Categorize(greetings, dtype=object)
     >>> z = zarr.array(text_data, dtype=object, object_codec=categorize)
+    >>> z
+    <zarr.core.Array (120000,) object>
+    >>> z.filters
+    [Categorize(dtype='|O', astype='|u1', labels=['¡Hola mundo!', 'Hej Världen!', 'Servus Woid!', ...])]
     >>> z[:]
     array(['¡Hola mundo!', 'Hej Världen!', 'Servus Woid!', ...,
            'Helló, világ!', 'Zdravo svete!', 'เฮลโลเวิลด์'], dtype=object)
@@ -835,13 +868,14 @@ Object arrays
 -------------
 
 Zarr supports arrays with an "object" dtype. This allows arrays to contain any type of
-object, such as variable length unicode strings, or variable length lists, or other
-possibilities. When creating an object array, a codec must be provided via the
+object, such as variable length unicode strings, or variable length arrays of numbers, or
+other possibilities. When creating an object array, a codec must be provided via the
 ``object_codec`` argument. This codec handles encoding (serialization) of Python objects.
-At the time of writing there are three codecs available that can serve as a
-general purpose object codec and support encoding of a variety of
-object types: :class:`numcodecs.JSON`, :class:`numcodecs.MsgPack`. and
-:class:`numcodecs.Pickle`.
+The best codec to use will depend on what type of objects are present in the array.
+
+At the time of writing there are three codecs available that can serve as a general
+purpose object codec and support encoding of a mixture of object types:
+:class:`numcodecs.JSON`, :class:`numcodecs.MsgPack` and :class:`numcodecs.Pickle`.
 
 For example, using the JSON codec::
 
@@ -861,6 +895,40 @@ code can be embedded within pickled data. The JSON and MsgPack codecs do not hav
 security issues and support encoding of unicode strings, lists and dictionaries.
 MsgPack is usually faster for both encoding and decoding.
 
+Ragged arrays
+~~~~~~~~~~~~~
+
+If you need to store an array of arrays, where each member array can be of any length
+and stores the same primitive type (a.k.a. a ragged array), the
+:class:`numcodecs.VLenArray` codec can be used, e.g.::
+
+    >>> z = zarr.empty(4, dtype=object, object_codec=numcodecs.VLenArray(int))
+    >>> z
+    <zarr.core.Array (4,) object>
+    >>> z.filters
+    [VLenArray(dtype='<i8')]
+    >>> z[0] = np.array([1, 3, 5])
+    >>> z[1] = np.array([4])
+    >>> z[2] = np.array([7, 9, 14])
+    >>> z[:]
+    array([array([1, 3, 5]), array([4]), array([ 7,  9, 14]),
+           array([], dtype=int64)], dtype=object)
+
+As a convenience, ``dtype='array:T'`` can be used as a short-hand for
+``dtype=object, object_codec=numcodecs.VLenArray('T')``, where 'T' can be any NumPy
+primitive dtype such as 'i4' or 'f8'. E.g.::
+
+    >>> z = zarr.empty(4, dtype='array:i8')
+    >>> z
+    <zarr.core.Array (4,) object>
+    >>> z.filters
+    [VLenArray(dtype='<i8')]
+    >>> z[0] = np.array([1, 3, 5])
+    >>> z[1] = np.array([4])
+    >>> z[2] = np.array([7, 9, 14])
+    >>> z[:]
+    array([array([1, 3, 5]), array([4]), array([ 7,  9, 14]),
+           array([], dtype=int64)], dtype=object)
 
 .. _tutorial_chunks:
 
@@ -1079,25 +1147,19 @@ E.g., pickle/unpickle an array stored on disk::
 Datetimes and timedeltas
 ------------------------
 
-Please note that NumPy's ``datetime64`` and ``timedelta64`` dtypes are **not** currently
-supported for Zarr arrays. If you would like to store datetime or timedelta data, you
-can store the data in an array with an integer dtype, e.g.::
+NumPy's ``datetime64`` ('M8') and ``timedelta64`` ('m8') dtypes are supported for Zarr
+arrays, as long as the units are specified. E.g.::
 
-    >>> a = np.array(['2007-07-13', '2006-01-13', '2010-08-13'], dtype='datetime64[D]')
-    >>> z = zarr.array(a.view('i8'))
+    >>> z = zarr.array(['2007-07-13', '2006-01-13', '2010-08-13'], dtype='M8[D]')
     >>> z
-    <zarr.core.Array (3,) int64>
+    <zarr.core.Array (3,) datetime64[D]>
     >>> z[:]
-    array([13707, 13161, 14834])
-    >>> z[:].view(a.dtype)
-    array(['2007-07-13', '2006-01-13', '2010-08-13'], dtype='datetime64[D]')
-
-If you would like a convenient way to retrieve the data from this array viewed as the
-original datetime64 dtype, try the :func:`zarr.core.Array.astype` method, e.g.::
-
-    >>> zv = z.astype(a.dtype)
-    >>> zv[:]
     array(['2007-07-13', '2006-01-13', '2010-08-13'], dtype='datetime64[D]')
+    >>> z[0]
+    numpy.datetime64('2007-07-13')
+    >>> z[0] = '1999-12-31'
+    >>> z[:]
+    array(['1999-12-31', '2006-01-13', '2010-08-13'], dtype='datetime64[D]')
 
 .. _tutorial_tips:
 
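Note on the variable-length string hunks above: an ``object_codec`` is needed because strings of differing lengths have no fixed-size memory layout, so they must be serialized into a single buffer before a chunk can be stored. The sketch below shows one possible length-prefixed layout using only the standard library. It is an assumption made for illustration and is not claimed to be the format used by ``numcodecs.VLenUTF8``; ``encode_strings`` and ``decode_strings`` are hypothetical names.

```python
import struct


def encode_strings(strings):
    """Pack strings as: item count, then (byte length, UTF-8 bytes) per item."""
    parts = [struct.pack("<I", len(strings))]
    for text in strings:
        data = text.encode("utf-8")
        parts.append(struct.pack("<I", len(data)))
        parts.append(data)
    return b"".join(parts)


def decode_strings(buf):
    """Reverse encode_strings, returning a list of str."""
    (count,) = struct.unpack_from("<I", buf, 0)
    offset = 4
    items = []
    for _ in range(count):
        (length,) = struct.unpack_from("<I", buf, offset)
        offset += 4
        items.append(buf[offset:offset + length].decode("utf-8"))
        offset += length
    return items


greetings = ["¡Hola mundo!", "Hej Världen!", "เฮลโลเวิลด์"]
assert decode_strings(encode_strings(greetings)) == greetings
```

The prefix records the byte length rather than the character count, since the two differ for non-ASCII text such as the greetings above.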
 
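Note on the datetime hunk above: the removed workaround stored ``datetime64[D]`` values as 64-bit integers, and the integers it printed (13707, 13161, 14834) are day counts since 1970-01-01. The standard-library sketch below reproduces those numbers, which may help when checking data written with the old integer approach against arrays written with the new ``M8[D]`` dtype. ``to_days`` and ``from_days`` are hypothetical helper names, and ``date.fromisoformat`` assumes Python 3.7 or later.

```python
from datetime import date, timedelta

# Day-resolution datetime64 values count days from this epoch.
EPOCH = date(1970, 1, 1)


def to_days(iso_date):
    """Convert 'YYYY-MM-DD' to the number of days since 1970-01-01."""
    return (date.fromisoformat(iso_date) - EPOCH).days


def from_days(days):
    """Convert a day count since 1970-01-01 back to 'YYYY-MM-DD'."""
    return (EPOCH + timedelta(days=days)).isoformat()


dates = ["2007-07-13", "2006-01-13", "2010-08-13"]
print([to_days(d) for d in dates])  # [13707, 13161, 14834]
```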
@@ -15,7 +15,7 @@ mccabe==0.6.1
 monotonic==1.3
 msgpack-python==0.4.8
 nose==1.3.7
-numcodecs==0.4.1
+numcodecs==0.5.2
 numpy==1.13.3
 packaging==16.8
 pkginfo==1.4.1