Commit f380374

edit tutorial on strings and objects; bump numcodecs
1 parent 2bb9f79 commit f380374

3 files changed, with 91 additions and 18 deletions.

docs/tutorial.rst

Lines changed: 89 additions & 16 deletions
@@ -779,6 +779,8 @@ If your strings are all ASCII strings, and you know the maximum length of the st
 your dataset, then you can use an array with a fixed-length bytes dtype. E.g.::

 >>> z = zarr.zeros(10, dtype='S6')
+>>> z
+<zarr.core.Array (10,) |S6>
 >>> z[0] = b'Hello'
 >>> z[1] = b'world!'
 >>> z[:]
@@ -793,37 +795,68 @@ A fixed-length unicode dtype is also available, e.g.::
 ... 'เฮลโลเวิลด์']
 >>> text_data = greetings * 10000
 >>> z = zarr.array(text_data, dtype='U20')
+>>> z
+<zarr.core.Array (120000,) <U20>
 >>> z[:]
 array(['¡Hola mundo!', 'Hej Världen!', 'Servus Woid!', ...,
 'Helló, világ!', 'Zdravo svete!', 'เฮลโลเวิลด์'],
 dtype='<U20')

-For variable-length strings, the "object" dtype can be used, but a codec must be
+For variable-length strings, the ``object`` dtype can be used, but a codec must be
 provided to encode the data (see also :ref:`tutorial_objects` below). At the time of
-writing there are three codecs available that can encode variable length string
-objects, :class:`numcodecs.JSON`, :class:`numcodecs.MsgPack`. and
-:class:`numcodecs.Pickle`. E.g. using JSON::
+writing there are four codecs available that can encode variable length string
+objects: :class:`numcodecs.VLenUTF8`, :class:`numcodecs.JSON`, :class:`numcodecs.MsgPack`.
+and :class:`numcodecs.Pickle`. E.g. using ``VLenUTF8``::

 >>> import numcodecs
->>> z = zarr.array(text_data, dtype=object, object_codec=numcodecs.JSON())
+>>> z = zarr.array(text_data, dtype=object, object_codec=numcodecs.VLenUTF8())
+>>> z
+<zarr.core.Array (120000,) object>
+>>> z.filters
+[VLenUTF8()]
 >>> z[:]
 array(['¡Hola mundo!', 'Hej Världen!', 'Servus Woid!', ...,
 'Helló, világ!', 'Zdravo svete!', 'เฮลโลเวิลด์'], dtype=object)

-...or alternatively using msgpack (requires `msgpack-python
-<https://github.com/msgpack/msgpack-python>`_ to be installed)::
+As a convenience, ``dtype=str`` (or ``dtype=unicode`` on Python 2.7) can be used, which
+is a short-hand for ``dtype=object, object_codec=numcodecs.VLenUTF8()``, e.g.::

->>> z = zarr.array(text_data, dtype=object, object_codec=numcodecs.MsgPack())
+>>> z = zarr.array(text_data, dtype=str)
+>>> z
+<zarr.core.Array (120000,) object>
+>>> z.filters
+[VLenUTF8()]
 >>> z[:]
 array(['¡Hola mundo!', 'Hej Världen!', 'Servus Woid!', ...,
 'Helló, világ!', 'Zdravo svete!', 'เฮลโลเวิลด์'], dtype=object)

-If you know ahead of time all the possible string values that can occur, then you could
-also use the :class:`numcodecs.Categorize` codec to encode each unique value as an
+Variable-length byte strings are also supported via ``dtype=object``. Again an
+``object_codec`` is required, which can be one of :class:`numcodecs.VLenBytes` or
+:class:`numcodecs.Pickle`. For convenience, ``dtype=bytes`` (or ``dtype=str`` on Python
+2.7) can be used as a short-hand for ``dtype=object, object_codec=numcodecs.VLenBytes()``,
+e.g.::
+
+>>> bytes_data = [g.encode('utf-8') for g in greetings] * 10000
+>>> z = zarr.array(bytes_data, dtype=bytes)
+>>> z
+<zarr.core.Array (120000,) object>
+>>> z.filters
+[VLenBytes()]
+>>> z[:]
+array([b'\xc2\xa1Hola mundo!', b'Hej V\xc3\xa4rlden!', b'Servus Woid!',
+..., b'Hell\xc3\xb3, vil\xc3\xa1g!', b'Zdravo svete!',
+b'\xe0\xb9\x80\xe0\xb8\xae\xe0\xb8\xa5\xe0\xb9\x82\xe0\xb8\xa5\xe0\xb9\x80\xe0\xb8\xa7\xe0\xb8\xb4\xe0\xb8\xa5\xe0\xb8\x94\xe0\xb9\x8c'], dtype=object)
+
+If you know ahead of time all the possible string values that can occur, you could
+also use the :class:`numcodecs.Categorize` codec to encode each unique string value as an
 integer. E.g.::

 >>> categorize = numcodecs.Categorize(greetings, dtype=object)
 >>> z = zarr.array(text_data, dtype=object, object_codec=categorize)
+>>> z
+<zarr.core.Array (120000,) object>
+>>> z.filters
+[Categorize(dtype='|O', astype='|u1', labels=['¡Hola mundo!', 'Hej Världen!', 'Servus Woid!', ...])]
 >>> z[:]
 array(['¡Hola mundo!', 'Hej Världen!', 'Servus Woid!', ...,
 'Helló, világ!', 'Zdravo svete!', 'เฮลโลเวิลด์'], dtype=object)
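
The idea behind encoding each known string as an integer, as the tutorial text above describes for ``Categorize``, can be sketched in plain Python. This is only an illustration of the concept, not the numcodecs implementation: the helper names are hypothetical, and reserving code 0 for values outside the label list is an assumption made for this sketch.

```python
def build_label_index(labels):
    """Map each known label to a positive integer code (hypothetical helper).

    Code 0 is reserved for values not in the label list; this is an
    assumption of the sketch, not a statement about numcodecs.
    """
    return {label: code for code, label in enumerate(labels, start=1)}


def categorize_encode(values, index):
    """Replace each string by its integer code, or 0 if it is unknown."""
    return [index.get(value, 0) for value in values]


def categorize_decode(codes, labels):
    """Invert categorize_encode; unknown codes decode to an empty string."""
    return [labels[code - 1] if code > 0 else '' for code in codes]


greetings = ['¡Hola mundo!', 'Hej Världen!', 'Servus Woid!']
index = build_label_index(greetings)

codes = categorize_encode(greetings * 2 + ['unknown'], index)
print(codes)  # [1, 2, 3, 1, 2, 3, 0]

decoded = categorize_decode(codes, greetings)
```

Because only small integers are stored, a highly repetitive string array such as ``text_data`` needs one small integer per element rather than a full string.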
@@ -835,13 +868,14 @@ Object arrays
 -------------

 Zarr supports arrays with an "object" dtype. This allows arrays to contain any type of
-object, such as variable length unicode strings, or variable length lists, or other
-possibilities. When creating an object array, a codec must be provided via the
+object, such as variable length unicode strings, or variable length arrays of numbers, or
+other possibilities. When creating an object array, a codec must be provided via the
 ``object_codec`` argument. This codec handles encoding (serialization) of Python objects.
-At the time of writing there are three codecs available that can serve as a
-general purpose object codec and support encoding of a variety of
-object types: :class:`numcodecs.JSON`, :class:`numcodecs.MsgPack`. and
-:class:`numcodecs.Pickle`.
+The best codec to use will depend on what type of objects are present in the array.
+
+At the time of writing there are three codecs available that can serve as a general
+purpose object codec and support encoding of a mixture of object types:
+:class:`numcodecs.JSON`, :class:`numcodecs.MsgPack`. and :class:`numcodecs.Pickle`.

 For example, using the JSON codec::

@@ -861,6 +895,40 @@ code can be embedded within pickled data. The JSON and MsgPack codecs do not hav
 security issues and support encoding of unicode strings, lists and dictionaries.
 MsgPack is usually faster for both encoding and decoding.

+Ragged arrays
+~~~~~~~~~~~~~
+
+If you need to store an array of arrays, where each member array can be of any length
+and stores the same primitive type (a.k.a. a ragged array), the
+:class:`numcodecs.VLenArray` codec can be used, e.g.::
+
+>>> z = zarr.empty(4, dtype=object, object_codec=numcodecs.VLenArray(int))
+>>> z
+<zarr.core.Array (4,) object>
+>>> z.filters
+[VLenArray(dtype='<i8')]
+>>> z[0] = np.array([1, 3, 5])
+>>> z[1] = np.array([4])
+>>> z[2] = np.array([7, 9, 14])
+>>> z[:]
+array([array([1, 3, 5]), array([4]), array([ 7, 9, 14]),
+array([], dtype=int64)], dtype=object)
+
+As a convenience, ``dtype='array:T'`` can be used as a short-hand for
+``dtype=object, object_codec=numcodecs.VLenArray('T')``, where 'T' can be any NumPy
+primitive dtype such as 'i4' or 'f8'. E.g.::
+
+>>> z = zarr.empty(4, dtype='array:i8')
+>>> z
+<zarr.core.Array (4,) object>
+>>> z.filters
+[VLenArray(dtype='<i8')]
+>>> z[0] = np.array([1, 3, 5])
+>>> z[1] = np.array([4])
+>>> z[2] = np.array([7, 9, 14])
+>>> z[:]
+array([array([1, 3, 5]), array([4]), array([ 7, 9, 14]),
+array([], dtype=int64)], dtype=object)

 .. _tutorial_chunks:

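To make the ragged-array idea in the hunk above more concrete, here is a minimal standard-library sketch of how member arrays of differing lengths could be packed into one buffer using length prefixes. The byte layout (a 4-byte item count, then a 4-byte length header before each little-endian int64 payload) and the function names are assumptions chosen for illustration; they are not claimed to match the format ``numcodecs.VLenArray`` writes.

```python
import struct


def encode_ragged(rows):
    """Pack a list of integer lists into bytes (illustrative layout only).

    Layout assumed for this sketch: uint32 row count, then for each row a
    uint32 length followed by that many little-endian int64 values.
    """
    parts = [struct.pack('<I', len(rows))]
    for row in rows:
        parts.append(struct.pack('<I', len(row)))
        parts.append(struct.pack('<%dq' % len(row), *row))
    return b''.join(parts)


def decode_ragged(buf):
    """Invert encode_ragged, returning a list of integer lists."""
    (count,) = struct.unpack_from('<I', buf, 0)
    offset = 4
    rows = []
    for _ in range(count):
        (length,) = struct.unpack_from('<I', buf, offset)
        offset += 4
        rows.append(list(struct.unpack_from('<%dq' % length, buf, offset)))
        offset += 8 * length
    return rows


# Same values as the tutorial example, including the empty last member.
rows = [[1, 3, 5], [4], [7, 9, 14], []]
buf = encode_ragged(rows)
restored = decode_ragged(buf)
```

The length headers are what allow each member to have its own size while all members share one primitive type, which is the constraint the tutorial text states for ragged arrays.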
@@ -1087,6 +1155,11 @@ arrays, as long as the units are specified. E.g.::
 <zarr.core.Array (3,) datetime64[D]>
 >>> z[:]
 array(['2007-07-13', '2006-01-13', '2010-08-13'], dtype='datetime64[D]')
+>>> z[0]
+numpy.datetime64('2007-07-13')
+>>> z[0] = '1999-12-31'
+>>> z[:]
+array(['1999-12-31', '2006-01-13', '2010-08-13'], dtype='datetime64[D]')

 .. _tutorial_tips:

requirements_dev.txt

Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@ mccabe==0.6.1
 monotonic==1.3
 msgpack-python==0.4.8
 nose==1.3.7
-numcodecs==0.5.1
+numcodecs==0.5.2
 numpy==1.13.3
 packaging==16.8
 pkginfo==1.4.1

setup.py

Lines changed: 1 addition & 1 deletion
@@ -26,7 +26,7 @@
 'asciitree',
 'numpy>=1.7',
 'fasteners',
-'numcodecs>=0.5.1',
+'numcodecs>=0.5.2',
 ],
 package_dir={'': '.'},
 packages=['zarr', 'zarr.tests'],

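The effect of raising the minimum from ``numcodecs>=0.5.1`` to ``numcodecs>=0.5.2`` can be illustrated with a small version comparison. This is a simplified sketch with hypothetical helper names: it only handles purely numeric dotted versions and is not how pip or setuptools resolve requirements.

```python
def parse_version(text):
    """Split a purely numeric dotted version such as '0.5.2' into a tuple of ints."""
    return tuple(int(part) for part in text.split('.'))


def satisfies_minimum(installed, minimum):
    """Return True if the installed version is at least the required minimum."""
    return parse_version(installed) >= parse_version(minimum)


print(satisfies_minimum('0.5.2', '0.5.2'))   # True
print(satisfies_minimum('0.5.1', '0.5.2'))   # False
print(satisfies_minimum('0.10.0', '0.5.2'))  # True
```

Comparing integer tuples rather than raw strings matters for cases like ``0.10.0``, which would sort before ``0.5.2`` in a plain string comparison.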