@@ -779,6 +779,8 @@ If your strings are all ASCII strings, and you know the maximum length of the st
 your dataset, then you can use an array with a fixed-length bytes dtype. E.g.::
 
     >>> z = zarr.zeros(10, dtype='S6')
+    >>> z
+    <zarr.core.Array (10,) |S6>
     >>> z[0] = b'Hello'
     >>> z[1] = b'world!'
     >>> z[:]
@@ -793,37 +795,68 @@ A fixed-length unicode dtype is also available, e.g.::
     ...              'เฮลโลเวิลด์']
     >>> text_data = greetings * 10000
     >>> z = zarr.array(text_data, dtype='U20')
+    >>> z
+    <zarr.core.Array (120000,) <U20>
     >>> z[:]
     array(['¡Hola mundo!', 'Hej Världen!', 'Servus Woid!', ...,
            'Helló, világ!', 'Zdravo svete!', 'เฮลโลเวิลด์'],
           dtype='<U20')
 
-For variable-length strings, the "object" dtype can be used, but a codec must be
+For variable-length strings, the ``object`` dtype can be used, but a codec must be
 provided to encode the data (see also :ref:`tutorial_objects` below). At the time of
-writing there are three codecs available that can encode variable length string
-objects, :class:`numcodecs.JSON`, :class:`numcodecs.MsgPack`. and
-:class:`numcodecs.Pickle`. E.g. using JSON::
+writing there are four codecs available that can encode variable length string
+objects: :class:`numcodecs.VLenUTF8`, :class:`numcodecs.JSON`, :class:`numcodecs.MsgPack`,
+and :class:`numcodecs.Pickle`. E.g. using ``VLenUTF8``::
 
     >>> import numcodecs
-    >>> z = zarr.array(text_data, dtype=object, object_codec=numcodecs.JSON())
+    >>> z = zarr.array(text_data, dtype=object, object_codec=numcodecs.VLenUTF8())
+    >>> z
+    <zarr.core.Array (120000,) object>
+    >>> z.filters
+    [VLenUTF8()]
     >>> z[:]
     array(['¡Hola mundo!', 'Hej Världen!', 'Servus Woid!', ...,
            'Helló, világ!', 'Zdravo svete!', 'เฮลโลเวิลด์'], dtype=object)
 
-...or alternatively using msgpack (requires `msgpack-python
-<https://github.com/msgpack/msgpack-python>`_ to be installed)::
+As a convenience, ``dtype=str`` (or ``dtype=unicode`` on Python 2.7) can be used, which
+is a short-hand for ``dtype=object, object_codec=numcodecs.VLenUTF8()``, e.g.::
 
-    >>> z = zarr.array(text_data, dtype=object, object_codec=numcodecs.MsgPack())
+    >>> z = zarr.array(text_data, dtype=str)
+    >>> z
+    <zarr.core.Array (120000,) object>
+    >>> z.filters
+    [VLenUTF8()]
     >>> z[:]
     array(['¡Hola mundo!', 'Hej Världen!', 'Servus Woid!', ...,
            'Helló, világ!', 'Zdravo svete!', 'เฮลโลเวิลด์'], dtype=object)
 
-If you know ahead of time all the possible string values that can occur, then you could
-also use the :class:`numcodecs.Categorize` codec to encode each unique value as an
+Variable-length byte strings are also supported via ``dtype=object``. Again an
+``object_codec`` is required, which can be one of :class:`numcodecs.VLenBytes` or
+:class:`numcodecs.Pickle`. For convenience, ``dtype=bytes`` (or ``dtype=str`` on Python
+2.7) can be used as a short-hand for ``dtype=object, object_codec=numcodecs.VLenBytes()``,
+e.g.::
+
+    >>> bytes_data = [g.encode('utf-8') for g in greetings] * 10000
+    >>> z = zarr.array(bytes_data, dtype=bytes)
+    >>> z
+    <zarr.core.Array (120000,) object>
+    >>> z.filters
+    [VLenBytes()]
+    >>> z[:]
+    array([b'\xc2\xa1Hola mundo!', b'Hej V\xc3\xa4rlden!', b'Servus Woid!',
+           ..., b'Hell\xc3\xb3, vil\xc3\xa1g!', b'Zdravo svete!',
+           b'\xe0\xb9\x80\xe0\xb8\xae\xe0\xb8\xa5\xe0\xb9\x82\xe0\xb8\xa5\xe0\xb9\x80\xe0\xb8\xa7\xe0\xb8\xb4\xe0\xb8\xa5\xe0\xb8\x94\xe0\xb9\x8c'], dtype=object)
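
Variable-length string and bytes codecs can be pictured as length-prefixed buffers: each item is stored together with its own byte length, so items of different sizes can share one chunk. The following is a minimal sketch in plain Python, not the numcodecs implementation. The helper names ``encode_vlen_utf8`` and ``decode_vlen_utf8`` are hypothetical, and the byte layout used here (a 4-byte item count, then a 4-byte length before each item) is an assumption made for illustration.

```python
import struct


def encode_vlen_utf8(strings):
    """Pack strings as: item count, then (byte length, UTF-8 bytes) per item.

    Hypothetical helper; the layout is an illustrative assumption.
    """
    parts = [struct.pack('<I', len(strings))]
    for s in strings:
        data = s.encode('utf-8')
        parts.append(struct.pack('<I', len(data)))
        parts.append(data)
    return b''.join(parts)


def decode_vlen_utf8(buf):
    """Reverse encode_vlen_utf8 by walking the buffer item by item."""
    (count,) = struct.unpack_from('<I', buf, 0)
    offset = 4
    out = []
    for _ in range(count):
        (length,) = struct.unpack_from('<I', buf, offset)
        offset += 4
        out.append(buf[offset:offset + length].decode('utf-8'))
        offset += length
    return out


sample = ['¡Hola mundo!', 'Hej Världen!', 'เฮลโลเวิลด์']
encoded = encode_vlen_utf8(sample)
roundtrip = decode_vlen_utf8(encoded)
```

Because lengths are counted in bytes rather than characters, non-ASCII strings occupy only as much space as their UTF-8 encoding needs, in contrast to a fixed-width dtype such as ``'U20'``.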
+
+If you know ahead of time all the possible string values that can occur, you could
+also use the :class:`numcodecs.Categorize` codec to encode each unique string value as an
 integer. E.g.::
 
     >>> categorize = numcodecs.Categorize(greetings, dtype=object)
     >>> z = zarr.array(text_data, dtype=object, object_codec=categorize)
+    >>> z
+    <zarr.core.Array (120000,) object>
+    >>> z.filters
+    [Categorize(dtype='|O', astype='|u1', labels=['¡Hola mundo!', 'Hej Världen!', 'Servus Woid!', ...])]
     >>> z[:]
     array(['¡Hola mundo!', 'Hej Världen!', 'Servus Woid!', ...,
            'Helló, világ!', 'Zdravo svete!', 'เฮลโลเวิลด์'], dtype=object)
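
The idea behind categorical encoding can be sketched in a few lines of plain Python. This is an illustration only, not the numcodecs code: the function names are hypothetical, and the conventions used here (codes start at 1, code 0 stands for a value that is not among the labels and decodes to an empty string, fewer than 256 labels so that each code fits in one byte) are assumptions.

```python
def categorize_encode(values, labels):
    """Map each value to 1 + its index in labels; 0 marks an unknown value.

    Hypothetical helper; assumes fewer than 256 labels (one byte per code).
    """
    lookup = {label: i + 1 for i, label in enumerate(labels)}
    return bytes(lookup.get(v, 0) for v in values)


def categorize_decode(codes, labels):
    """Map codes back to labels; code 0 becomes an empty string."""
    return [labels[c - 1] if c else '' for c in codes]


labels = ['¡Hola mundo!', 'Hej Världen!', 'Servus Woid!']
values = ['Servus Woid!', '¡Hola mundo!', 'not a label']
codes = categorize_encode(values, labels)
decoded = categorize_decode(codes, labels)
```

Each string costs a single byte in this sketch, regardless of its length, which is the appeal of this approach when the set of possible values is small and known in advance.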
@@ -835,13 +868,14 @@ Object arrays
 -------------
 
 Zarr supports arrays with an "object" dtype. This allows arrays to contain any type of
-object, such as variable length unicode strings, or variable length lists, or other
-possibilities. When creating an object array, a codec must be provided via the
+object, such as variable length unicode strings, or variable length arrays of numbers, or
+other possibilities. When creating an object array, a codec must be provided via the
 ``object_codec`` argument. This codec handles encoding (serialization) of Python objects.
-At the time of writing there are three codecs available that can serve as a
-general purpose object codec and support encoding of a variety of
-object types: :class:`numcodecs.JSON`, :class:`numcodecs.MsgPack`. and
-:class:`numcodecs.Pickle`.
+The best codec to use will depend on what type of objects are present in the array.
+
+At the time of writing there are three codecs available that can serve as a general
+purpose object codec and support encoding of a mixture of object types:
+:class:`numcodecs.JSON`, :class:`numcodecs.MsgPack`, and :class:`numcodecs.Pickle`.
 
 For example, using the JSON codec::
 
@@ -861,6 +895,40 @@ code can be embedded within pickled data. The JSON and MsgPack codecs do not hav
 security issues and support encoding of unicode strings, lists and dictionaries.
 MsgPack is usually faster for both encoding and decoding.
 
+Ragged arrays
+~~~~~~~~~~~~~
+
+If you need to store an array of arrays, where each member array can be of any length
+and stores the same primitive type (a.k.a. a ragged array), the
+:class:`numcodecs.VLenArray` codec can be used, e.g.::
+
+    >>> z = zarr.empty(4, dtype=object, object_codec=numcodecs.VLenArray(int))
+    >>> z
+    <zarr.core.Array (4,) object>
+    >>> z.filters
+    [VLenArray(dtype='<i8')]
+    >>> z[0] = np.array([1, 3, 5])
+    >>> z[1] = np.array([4])
+    >>> z[2] = np.array([7, 9, 14])
+    >>> z[:]
+    array([array([1, 3, 5]), array([4]), array([ 7,  9, 14]),
+           array([], dtype=int64)], dtype=object)
+
+As a convenience, ``dtype='array:T'`` can be used as a short-hand for
+``dtype=object, object_codec=numcodecs.VLenArray('T')``, where 'T' can be any NumPy
+primitive dtype such as 'i4' or 'f8'. E.g.::
+
+    >>> z = zarr.empty(4, dtype='array:i8')
+    >>> z
+    <zarr.core.Array (4,) object>
+    >>> z.filters
+    [VLenArray(dtype='<i8')]
+    >>> z[0] = np.array([1, 3, 5])
+    >>> z[1] = np.array([4])
+    >>> z[2] = np.array([7, 9, 14])
+    >>> z[:]
+    array([array([1, 3, 5]), array([4]), array([ 7,  9, 14]),
+           array([], dtype=int64)], dtype=object)
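
To make the notion of a ragged array concrete, the sketch below flattens rows of different lengths into one list of values plus a list of row boundaries, then rebuilds the rows. It is plain Python for illustration only and says nothing about how ``VLenArray`` lays out bytes internally; the names ``ragged_pack`` and ``ragged_unpack`` are hypothetical.

```python
def ragged_pack(rows):
    """Flatten rows into (flat values, offsets); row i is flat[offsets[i]:offsets[i+1]]."""
    flat, offsets = [], [0]
    for row in rows:
        flat.extend(row)
        offsets.append(len(flat))
    return flat, offsets


def ragged_unpack(flat, offsets):
    """Rebuild the original rows from the flat values and offsets."""
    return [flat[a:b] for a, b in zip(offsets, offsets[1:])]


# Same shape of data as the example above, including one empty member.
rows = [[1, 3, 5], [4], [7, 9, 14], []]
flat, offsets = ragged_pack(rows)
restored = ragged_unpack(flat, offsets)
```

An empty member simply repeats the previous offset, which mirrors how the unassigned fourth element reads back as an empty array in the example above.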
 
 .. _tutorial_chunks:
 
@@ -1079,25 +1147,19 @@ E.g., pickle/unpickle an array stored on disk::
 Datetimes and timedeltas
 ------------------------
 
-Please note that NumPy's ``datetime64`` and ``timedelta64`` dtypes are **not** currently
-supported for Zarr arrays. If you would like to store datetime or timedelta data, you
-can store the data in an array with an integer dtype, e.g.::
+NumPy's ``datetime64`` ('M8') and ``timedelta64`` ('m8') dtypes are supported for Zarr
+arrays, as long as the units are specified. E.g.::
 
-    >>> a = np.array(['2007-07-13', '2006-01-13', '2010-08-13'], dtype='datetime64[D]')
-    >>> z = zarr.array(a.view('i8'))
+    >>> z = zarr.array(['2007-07-13', '2006-01-13', '2010-08-13'], dtype='M8[D]')
     >>> z
-    <zarr.core.Array (3,) int64>
+    <zarr.core.Array (3,) datetime64[D]>
     >>> z[:]
-    array([13707, 13161, 14834])
-    >>> z[:].view(a.dtype)
-    array(['2007-07-13', '2006-01-13', '2010-08-13'], dtype='datetime64[D]')
-
-If you would like a convenient way to retrieve the data from this array viewed as the
-original datetime64 dtype, try the :func:`zarr.core.Array.astype` method, e.g.::
-
-    >>> zv = z.astype(a.dtype)
-    >>> zv[:]
     array(['2007-07-13', '2006-01-13', '2010-08-13'], dtype='datetime64[D]')
+    >>> z[0]
+    numpy.datetime64('2007-07-13')
+    >>> z[0] = '1999-12-31'
+    >>> z[:]
+    array(['1999-12-31', '2006-01-13', '2010-08-13'], dtype='datetime64[D]')
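
The unit matters because a ``datetime64`` value is an integer count of that unit since 1970-01-01. For the ``[D]`` unit the count is in days, which is where the integers ``13707``, ``13161`` and ``14834`` in the removed integer-dtype workaround came from. The sketch below reproduces that arithmetic with the standard library only; the helper names ``to_days`` and ``from_days`` are hypothetical.

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)


def to_days(iso):
    """Number of days between the Unix epoch and an ISO date string."""
    return (date.fromisoformat(iso) - EPOCH).days


def from_days(n):
    """ISO date string for a day count relative to the Unix epoch."""
    return (EPOCH + timedelta(days=n)).isoformat()


dates = ['2007-07-13', '2006-01-13', '2010-08-13']
days = [to_days(s) for s in dates]
restored = [from_days(n) for n in days]
```

Without a unit there is no way to interpret the stored integers, which is consistent with the requirement above that units be specified.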
 
 .. _tutorial_tips:
 