
Commit 607f4c5

fix tutorial for object arrays
1 parent 601510a commit 607f4c5

File tree

3 files changed: +77 −33 lines

docs/tutorial.rst

Lines changed: 72 additions & 33 deletions
@@ -178,8 +178,8 @@ print some diagnostics, e.g.::
              : blocksize=0)
 Store type : builtins.dict
 No. bytes : 400000000 (381.5M)
-No. bytes stored : 4565053 (4.4M)
-Storage ratio : 87.6
+No. bytes stored : 3702484 (3.5M)
+Storage ratio : 108.0
 Chunks initialized : 100/100

 If you don't specify a compressor, by default Zarr uses the Blosc
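The updated diagnostics in this hunk are internally consistent: the storage ratio zarr reports is the uncompressed size (``No. bytes``) divided by the stored size (``No. bytes stored``), shown to one decimal place. A quick stdlib check of the new numbers:

```python
# Storage ratio = uncompressed bytes / stored (compressed) bytes,
# rounded to one decimal, matching zarr's array info display.
nbytes = 400_000_000       # "No. bytes" from the hunk above
nbytes_stored = 3_702_484  # "No. bytes stored" after this commit
ratio = round(nbytes / nbytes_stored, 1)
assert ratio == 108.0      # agrees with the "Storage ratio" line
```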
@@ -270,8 +270,8 @@ Here is an example using a delta filter with the Blosc compressor::
 Compressor : Blosc(cname='zstd', clevel=1, shuffle=SHUFFLE, blocksize=0)
 Store type : builtins.dict
 No. bytes : 400000000 (381.5M)
-No. bytes stored : 648605 (633.4K)
-Storage ratio : 616.7
+No. bytes stored : 328085 (320.4K)
+Storage ratio : 1219.2
 Chunks initialized : 100/100

 For more information about available filter codecs, see the `Numcodecs
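The delta filter referenced in this hunk improves compression by storing differences between adjacent values rather than the values themselves. A minimal pure-Python sketch of the idea (an illustration, not the numcodecs implementation):

```python
def delta_encode(values):
    """Keep the first value, then store successive differences."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(encoded):
    """Reverse the transform by cumulative summation."""
    out = [encoded[0]]
    for d in encoded[1:]:
        out.append(out[-1] + d)
    return out

# Slowly varying data turns into a run of identical small values,
# which a downstream compressor such as Blosc/zstd handles very well.
data = list(range(1000, 1010))
encoded = delta_encode(data)
assert encoded == [1000] + [1] * 9
assert delta_decode(encoded) == data
```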
@@ -394,8 +394,8 @@ property. E.g.::
 Compressor : Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
 Store type : zarr.storage.DictStore
 No. bytes : 8000000 (7.6M)
-No. bytes stored : 37480 (36.6K)
-Storage ratio : 213.4
+No. bytes stored : 34840 (34.0K)
+Storage ratio : 229.6
 Chunks initialized : 10/10

 >>> baz.info
@@ -409,8 +409,8 @@ property. E.g.::
 Compressor : Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
 Store type : zarr.storage.DictStore
 No. bytes : 4000000 (3.8M)
-No. bytes stored : 23243 (22.7K)
-Storage ratio : 172.1
+No. bytes stored : 20443 (20.0K)
+Storage ratio : 195.7
 Chunks initialized : 100/100

 Groups also have the :func:`zarr.hierarchy.Group.tree` method, e.g.::
@@ -768,7 +768,6 @@ Here is an example using S3Map to read an array created previously::
 b'Hello from the cloud!'


-
 .. _tutorial_strings:

 String arrays
@@ -788,40 +787,80 @@ your dataset, then you can use an array with a fixed-length bytes dtype. E.g.::

 A fixed-length unicode dtype is also available, e.g.::

->>> z = zarr.zeros(12, dtype='U20')
 >>> greetings = ['¡Hola mundo!', 'Hej Världen!', 'Servus Woid!', 'Hei maailma!',
 ... 'Xin chào thế giới', 'Njatjeta Botë!', 'Γεια σου κόσμε!',
 ... 'こんにちは世界', '世界,你好!', 'Helló, világ!', 'Zdravo svete!',
 ... 'เฮลโลเวิลด์']
->>> z[:] = greetings
+>>> text_data = greetings * 10000
+>>> z = zarr.array(text_data, dtype='U20')
 >>> z[:]
-array(['¡Hola mundo!', 'Hej Världen!', 'Servus Woid!', 'Hei maailma!',
-'Xin chào thế giới', 'Njatjeta Botë!', 'Γεια σου κόσμε!', 'こんにちは世界',
-'世界,你好!', 'Helló, világ!', 'Zdravo svete!', 'เฮลโลเวิลด์'],
+array(['¡Hola mundo!', 'Hej Världen!', 'Servus Woid!', ...,
+'Helló, világ!', 'Zdravo svete!', 'เฮลโลเวิลด์'],
 dtype='<U20')

-For variable-length strings, the "object" dtype can be used, but a filter must be
-provided to encode the data. There are currently two codecs available that can encode
-variable length string objects, :class:`numcodecs.Pickle` and :class:`numcodecs.MsgPack`.
-E.g. using pickle::
+For variable-length strings, the "object" dtype can be used, but a codec must be
+provided to encode the data (see also :ref:`tutorial_objects` below). At the time of
+writing there are three codecs available that can encode variable length string
+objects, :class:`numcodecs.JSON`, :class:`numcodecs.MsgPack`, and
+:class:`numcodecs.Pickle`. E.g. using JSON::

 >>> import numcodecs
->>> z = zarr.zeros(12, dtype=object, filters=[numcodecs.Pickle()])
->>> z[:] = greetings
+>>> z = zarr.array(text_data, dtype=object, object_codec=numcodecs.JSON())
 >>> z[:]
-array(['¡Hola mundo!', 'Hej Världen!', 'Servus Woid!', 'Hei maailma!',
-'Xin chào thế giới', 'Njatjeta Botë!', 'Γεια σου κόσμε!', 'こんにちは世界',
-'世界,你好!', 'Helló, világ!', 'Zdravo svete!', 'เฮลโลเวิลด์'], dtype=object)
+array(['¡Hola mundo!', 'Hej Världen!', 'Servus Woid!', ...,
+'Helló, világ!', 'Zdravo svete!', 'เฮลโลเวิลด์'], dtype=object)

 ...or alternatively using msgpack (requires `msgpack-python
 <https://github.com/msgpack/msgpack-python>`_ to be installed)::

->>> z = zarr.zeros(12, dtype=object, filters=[numcodecs.MsgPack()])
->>> z[:] = greetings
+>>> z = zarr.array(text_data, dtype=object, object_codec=numcodecs.MsgPack())
+>>> z[:]
+array(['¡Hola mundo!', 'Hej Världen!', 'Servus Woid!', ...,
+'Helló, világ!', 'Zdravo svete!', 'เฮลโลเวิลด์'], dtype=object)
+
+If you know ahead of time all the possible string values that can occur, then you could
+also use the :class:`numcodecs.Categorize` codec to encode each unique value as an
+integer. E.g.::
+
+>>> categorize = numcodecs.Categorize(greetings, dtype=object)
+>>> z = zarr.array(text_data, dtype=object, object_codec=categorize)
+>>> z[:]
+array(['¡Hola mundo!', 'Hej Världen!', 'Servus Woid!', ...,
+'Helló, világ!', 'Zdravo svete!', 'เฮลโลเวิลด์'], dtype=object)
+
+
+.. _tutorial_objects:
+
+Object arrays
+-------------
+
+Zarr supports arrays with an "object" dtype. This allows arrays to contain any type of
+object, such as variable length unicode strings, or variable length lists, or other
+possibilities. When creating an object array, a codec must be provided via the
+``object_codec`` argument. This codec handles encoding (serialization) of Python objects.
+At the time of writing there are three codecs available that can serve as a
+general purpose object codec and support encoding of a variety of
+object types: :class:`numcodecs.JSON`, :class:`numcodecs.MsgPack`, and
+:class:`numcodecs.Pickle`.
+
+For example, using the JSON codec::
+
+>>> z = zarr.empty(5, dtype=object, object_codec=numcodecs.JSON())
+>>> z[0] = 42
+>>> z[1] = 'foo'
+>>> z[2] = ['bar', 'baz', 'qux']
+>>> z[3] = {'a': 1, 'b': 2.2}
 >>> z[:]
-array(['¡Hola mundo!', 'Hej Världen!', 'Servus Woid!', 'Hei maailma!',
-'Xin chào thế giới', 'Njatjeta Botë!', 'Γεια σου κόσμε!', 'こんにちは世界',
-'世界,你好!', 'Helló, világ!', 'Zdravo svete!', 'เฮลโลเวิลด์'], dtype=object)
+array([42, 'foo', list(['bar', 'baz', 'qux']), {'a': 1, 'b': 2.2}, None], dtype=object)
+
+Not all codecs support encoding of all object types. The
+:class:`numcodecs.Pickle` codec is the most flexible, supporting encoding of any type
+of Python object. However, if you are sharing data with anyone other than yourself then
+Pickle is not recommended as it is a potential security risk, because malicious code can
+be embedded within pickled data. The JSON and MsgPack codecs support encoding of unicode
+strings, lists and dictionaries, with MsgPack usually faster for both encoding and
+decoding.
+

 .. _tutorial_chunks:

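The new ``object_codec`` argument shown above hands serialization of each chunk's objects to a codec. The division of labour can be sketched with a stdlib-only stand-in (a hypothetical ``JSONCodec`` class for illustration, not numcodecs' actual implementation):

```python
import json

class JSONCodec:
    """Toy object codec: turns a chunk's worth of Python objects
    into bytes and back, playing the role numcodecs.JSON plays
    for a zarr object array."""

    def encode(self, objects):
        return json.dumps(objects, ensure_ascii=False).encode('utf-8')

    def decode(self, buf):
        return json.loads(buf.decode('utf-8'))

codec = JSONCodec()
# Mirrors the tutorial's object array: mixed types, with None
# standing in for the never-written fifth element.
chunk = [42, 'foo', ['bar', 'baz', 'qux'], {'a': 1, 'b': 2.2}, None]
buf = codec.encode(chunk)
assert codec.decode(buf) == chunk  # lossless for JSON-representable types
```

As the diff's new prose notes, such a codec only round-trips JSON-representable types; arbitrary objects need Pickle, with its attendant security caveat.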
@@ -898,8 +937,8 @@ ratios, depending on the correlation structure within the data. E.g.::
 Compressor : Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
 Store type : builtins.dict
 No. bytes : 400000000 (381.5M)
-No. bytes stored : 26805735 (25.6M)
-Storage ratio : 14.9
+No. bytes stored : 15857834 (15.1M)
+Storage ratio : 25.2
 Chunks initialized : 100/100
 >>> f = zarr.array(a, chunks=(1000, 1000), order='F')
 >>> f.info
@@ -912,8 +951,8 @@ ratios, depending on the correlation structure within the data. E.g.::
 Compressor : Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
 Store type : builtins.dict
 No. bytes : 400000000 (381.5M)
-No. bytes stored : 9633601 (9.2M)
-Storage ratio : 41.5
+No. bytes stored : 7233241 (6.9M)
+Storage ratio : 55.3
 Chunks initialized : 100/100

 In the above example, Fortran order gives a better compression ratio. This is an
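The two hunks above show Fortran order roughly doubling the compression ratio on this dataset. The effect comes entirely from which values end up adjacent in memory; a pure-Python illustration with assumed toy data (not the tutorial's array):

```python
# A small 2-D dataset where values vary smoothly down each column,
# mimicking the correlation structure in the tutorial's example.
rows, cols = 4, 5
data = [[r + 100 * c for c in range(cols)] for r in range(rows)]

# C (row-major) order: walk each row left to right.
c_order = [data[r][c] for r in range(rows) for c in range(cols)]
# Fortran (column-major) order: walk each column top to bottom.
f_order = [data[r][c] for c in range(cols) for r in range(rows)]

# In F order, adjacent elements come from the same column, so smoothly
# varying columns yield long runs of similar values, which compress better.
assert c_order[:5] == [0, 100, 200, 300, 400]  # big jumps between neighbours
assert f_order[:4] == [0, 1, 2, 3]             # small deltas between neighbours
```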
@@ -1014,7 +1053,7 @@ E.g., pickle/unpickle an in-memory array::
 >>> import pickle
 >>> z1 = zarr.array(np.arange(100000))
 >>> s = pickle.dumps(z1)
->>> len(s) > 10000 # relatively large because data have been pickled
+>>> len(s) > 5000 # relatively large because data have been pickled
 True
 >>> z2 = pickle.loads(s)
 >>> z1 == z2
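The threshold in this doctest drops from 10000 to 5000 because the commit's other changes make the array compress better, so the pickle shrinks; the pickle size still scales with the stored data. The general point can be shown with stdlib pickle alone (a plain list stands in for the zarr array):

```python
import pickle

data = list(range(100_000))
s = pickle.dumps(data)
# The serialized form embeds every element, so its size grows with the
# data; pickling an in-memory zarr array similarly carries along the
# array's (compressed) chunk data, not just its metadata.
assert len(s) > 5000
assert pickle.loads(s) == data  # round trip restores the values exactly
```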

pytest.ini

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+[pytest]
+doctest_optionflags = NORMALIZE_WHITESPACE ELLIPSIS IGNORE_EXCEPTION_DETAIL
+
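The ``ELLIPSIS`` flag enabled here is what lets the tutorial's updated doctests elide the middle of long arrays with ``...``. A small stdlib demonstration of the flag's effect:

```python
import doctest

def sample():
    """
    >>> list(range(20))
    [0, 1, 2, ..., 19]
    """

# With ELLIPSIS set (as pytest.ini's doctest_optionflags now does globally),
# the '...' in the expected output matches the elided middle of the list.
finder = doctest.DocTestFinder()
runner = doctest.DocTestRunner(
    optionflags=doctest.ELLIPSIS | doctest.NORMALIZE_WHITESPACE)
for test in finder.find(sample):
    runner.run(test)
assert runner.tries == 1 and runner.failures == 0
```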
requirements_dev.txt

Lines changed: 2 additions & 0 deletions
@@ -24,6 +24,8 @@ py==1.4.34
 pycodestyle==2.3.1
 pyflakes==1.6.0
 pyparsing==2.2.0
+pytest==3.2.3
+pytest-cov==2.5.1
 requests==2.18.4
 requests-toolbelt==0.8.0
 setuptools-scm==1.15.6
