@@ -178,8 +178,8 @@ print some diagnostics, e.g.::
: blocksize=0)
Store type : builtins.dict
No. bytes : 400000000 (381.5M)
- No. bytes stored : 4565053 (4.4M)
- Storage ratio : 87.6
+ No. bytes stored : 3702484 (3.5M)
+ Storage ratio : 108.0
Chunks initialized : 100/100

If you don't specify a compressor, by default Zarr uses the Blosc
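This default can be confirmed by inspecting a newly created array's ``compressor``
attribute; a minimal sketch, assuming current defaults (the shape, chunks and dtype
below are arbitrary)::

>>> import zarr
>>> z = zarr.zeros((10000, 10000), chunks=(1000, 1000), dtype='i4')
>>> z.compressor  # default codec; exact defaults may differ across versions
Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)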
@@ -270,8 +270,8 @@ Here is an example using a delta filter with the Blosc compressor::
Compressor : Blosc(cname='zstd', clevel=1, shuffle=SHUFFLE, blocksize=0)
Store type : builtins.dict
No. bytes : 400000000 (381.5M)
- No. bytes stored : 648605 (633.4K)
- Storage ratio : 616.7
+ No. bytes stored : 328085 (320.4K)
+ Storage ratio : 1219.2
Chunks initialized : 100/100

For more information about available filter codecs, see the `Numcodecs
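As a minimal sketch, a delta filter can be combined with a Zstd-based Blosc compressor
roughly as follows (the input data, a large monotonically increasing integer array, is
an assumption chosen only to match the sizes reported above)::

>>> import numpy as np
>>> import zarr
>>> from numcodecs import Blosc, Delta
>>> data = np.arange(100000000, dtype='i4').reshape(10000, 10000)  # assumed input
>>> filters = [Delta(dtype='i4')]
>>> compressor = Blosc(cname='zstd', clevel=1, shuffle=Blosc.SHUFFLE)
>>> z = zarr.array(data, chunks=(1000, 1000), filters=filters, compressor=compressor)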
@@ -394,8 +394,8 @@ property. E.g.::
Compressor : Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
Store type : zarr.storage.DictStore
No. bytes : 8000000 (7.6M)
- No. bytes stored : 37480 (36.6K)
- Storage ratio : 213.4
+ No. bytes stored : 34840 (34.0K)
+ Storage ratio : 229.6
Chunks initialized : 10/10

>>> baz.info
@@ -409,8 +409,8 @@ property. E.g.::
Compressor : Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
Store type : zarr.storage.DictStore
No. bytes : 4000000 (3.8M)
- No. bytes stored : 23243 (22.7K)
- Storage ratio : 172.1
+ No. bytes stored : 20443 (20.0K)
+ Storage ratio : 195.7
Chunks initialized : 100/100

Groups also have the :func:`zarr.hierarchy.Group.tree` method, e.g.::
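A minimal sketch of calling it on a small hierarchy (the group and array names below
are illustrative only)::

>>> import zarr
>>> root = zarr.group()
>>> foo = root.create_group('foo')
>>> bar = foo.create_group('bar')
>>> baz = bar.zeros('baz', shape=(10000, 10000), chunks=(1000, 1000), dtype='i4')
>>> root.tree()  # displays a text rendering of the nested groups and arrays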
@@ -768,7 +768,6 @@ Here is an example using S3Map to read an array created previously::
b'Hello from the cloud!'

-
.. _tutorial_strings:

String arrays
@@ -788,40 +787,80 @@ your dataset, then you can use an array with a fixed-length bytes dtype. E.g.::
A fixed-length unicode dtype is also available, e.g.::

- >>> z = zarr.zeros(12, dtype='U20')
>>> greetings = ['¡Hola mundo!', 'Hej Världen!', 'Servus Woid!', 'Hei maailma!',
... 'Xin chào thế giới', 'Njatjeta Botë!', 'Γεια σου κόσμε!',
... 'こんにちは世界', '世界,你好!', 'Helló, világ!', 'Zdravo svete!',
... 'เฮลโลเวิลด์']
- >>> z[:] = greetings
+ >>> text_data = greetings * 10000
+ >>> z = zarr.array(text_data, dtype='U20')
>>> z[:]
- array(['¡Hola mundo!', 'Hej Världen!', 'Servus Woid!', 'Hei maailma!',
- 'Xin chào thế giới', 'Njatjeta Botë!', 'Γεια σου κόσμε!', 'こんにちは世界',
- '世界,你好!', 'Helló, világ!', 'Zdravo svete!', 'เฮลโลเวิลด์'],
+ array(['¡Hola mundo!', 'Hej Världen!', 'Servus Woid!', ...,
+ 'Helló, világ!', 'Zdravo svete!', 'เฮลโลเวิลด์'],
dtype='<U20')

- For variable-length strings, the "object" dtype can be used, but a filter must be
- provided to encode the data. There are currently two codecs available that can encode
- variable length string objects, :class:`numcodecs.Pickle` and :class:`numcodecs.MsgPack`.
- E.g. using pickle::
+ For variable-length strings, the "object" dtype can be used, but a codec must be
+ provided to encode the data (see also :ref:`tutorial_objects` below). At the time of
+ writing there are three codecs available that can encode variable length string
+ objects, :class:`numcodecs.JSON`, :class:`numcodecs.MsgPack`, and
+ :class:`numcodecs.Pickle`. E.g. using JSON::

>>> import numcodecs
- >>> z = zarr.zeros(12, dtype=object, filters=[numcodecs.Pickle()])
- >>> z[:] = greetings
+ >>> z = zarr.array(text_data, dtype=object, object_codec=numcodecs.JSON())
>>> z[:]
- array(['¡Hola mundo!', 'Hej Världen!', 'Servus Woid!', 'Hei maailma!',
- 'Xin chào thế giới', 'Njatjeta Botë!', 'Γεια σου κόσμε!', 'こんにちは世界',
- '世界,你好!', 'Helló, világ!', 'Zdravo svete!', 'เฮลโลเวิลด์'], dtype=object)
+ array(['¡Hola mundo!', 'Hej Världen!', 'Servus Woid!', ...,
+ 'Helló, világ!', 'Zdravo svete!', 'เฮลโลเวิลด์'], dtype=object)

...or alternatively using msgpack (requires `msgpack-python
<https://github.com/msgpack/msgpack-python>`_ to be installed)::

- >>> z = zarr.zeros(12, dtype=object, filters=[numcodecs.MsgPack()])
- >>> z[:] = greetings
+ >>> z = zarr.array(text_data, dtype=object, object_codec=numcodecs.MsgPack())
+ >>> z[:]
+ array(['¡Hola mundo!', 'Hej Världen!', 'Servus Woid!', ...,
+ 'Helló, világ!', 'Zdravo svete!', 'เฮลโลเวิลด์'], dtype=object)
+
+ If you know ahead of time all the possible string values that can occur, then you could
+ also use the :class:`numcodecs.Categorize` codec to encode each unique value as an
+ integer. E.g.::
+
+ >>> categorize = numcodecs.Categorize(greetings, dtype=object)
+ >>> z = zarr.array(text_data, dtype=object, object_codec=categorize)
+ >>> z[:]
+ array(['¡Hola mundo!', 'Hej Världen!', 'Servus Woid!', ...,
+ 'Helló, világ!', 'Zdravo svete!', 'เฮลโลเวิลด์'], dtype=object)
+
+
+ .. _tutorial_objects:
+
+ Object arrays
+ -------------
+
+ Zarr supports arrays with an "object" dtype. This allows arrays to contain any type of
+ object, such as variable length unicode strings, or variable length lists, or other
+ possibilities. When creating an object array, a codec must be provided via the
+ ``object_codec`` argument. This codec handles encoding (serialization) of Python objects.
+ At the time of writing there are three codecs available that can serve as a
+ general purpose object codec and support encoding of a variety of
+ object types: :class:`numcodecs.JSON`, :class:`numcodecs.MsgPack`, and
+ :class:`numcodecs.Pickle`.
+
+ For example, using the JSON codec::
+
+ >>> z = zarr.empty(5, dtype=object, object_codec=numcodecs.JSON())
+ >>> z[0] = 42
+ >>> z[1] = 'foo'
+ >>> z[2] = ['bar', 'baz', 'qux']
+ >>> z[3] = {'a': 1, 'b': 2.2}
>>> z[:]
- array(['¡Hola mundo!', 'Hej Världen!', 'Servus Woid!', 'Hei maailma!',
- 'Xin chào thế giới', 'Njatjeta Botë!', 'Γεια σου κόσμε!', 'こんにちは世界',
- '世界,你好!', 'Helló, világ!', 'Zdravo svete!', 'เฮลโลเวิลด์'], dtype=object)
+ array([42, 'foo', list(['bar', 'baz', 'qux']), {'a': 1, 'b': 2.2}, None], dtype=object)
+
+ Not all codecs support encoding of all object types. The
+ :class:`numcodecs.Pickle` codec is the most flexible, supporting encoding of any type
+ of Python object. However, if you are sharing data with anyone other than yourself, then
+ Pickle is not recommended as it is a potential security risk. This is because malicious
+ code can be embedded within pickled data. The JSON and MsgPack codecs do not have any
+ security issues and support encoding of unicode strings, lists and dictionaries.
+ MsgPack is usually faster for both encoding and decoding.
+
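As a minimal sketch of the trade-off described above, an instance of a user-defined
class (``Point`` is hypothetical, not part of the tutorial) can be round-tripped with
the Pickle codec, whereas the JSON and MsgPack codecs would typically fail to encode
it::

>>> import zarr
>>> import numcodecs
>>> class Point(object):          # hypothetical user-defined type
...     def __init__(self, x, y):
...         self.x, self.y = x, y
>>> z = zarr.empty(2, dtype=object, object_codec=numcodecs.Pickle())
>>> z[0] = Point(1, 2)
>>> z[1] = Point(3, 4)
>>> p = z[0]                      # unpickled back into a Point instance
>>> p.x, p.y
(1, 2)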
.. _tutorial_chunks:
@@ -898,8 +937,8 @@ ratios, depending on the correlation structure within the data. E.g.::
Compressor : Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
Store type : builtins.dict
No. bytes : 400000000 (381.5M)
- No. bytes stored : 26805735 (25.6M)
- Storage ratio : 14.9
+ No. bytes stored : 15857834 (15.1M)
+ Storage ratio : 25.2
Chunks initialized : 100/100
>>> f = zarr.array(a, chunks=(1000, 1000), order='F')
>>> f.info
@@ -912,8 +951,8 @@ ratios, depending on the correlation structure within the data. E.g.::
Compressor : Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
Store type : builtins.dict
No. bytes : 400000000 (381.5M)
- No. bytes stored : 9633601 (9.2M)
- Storage ratio : 41.5
+ No. bytes stored : 7233241 (6.9M)
+ Storage ratio : 55.3
Chunks initialized : 100/100

In the above example, Fortran order gives a better compression ratio. This is an
@@ -1014,7 +1053,7 @@ E.g., pickle/unpickle an in-memory array::
>>> import pickle
>>> z1 = zarr.array(np.arange(100000))
>>> s = pickle.dumps(z1)
- >>> len(s) > 10000 # relatively large because data have been pickled
+ >>> len(s) > 5000 # relatively large because data have been pickled
True
>>> z2 = pickle.loads(s)
>>> z1 == z2