
Commit 216b35e

Merge pull request #212 from alimanfoo/object_encoding
Object encoding
2 parents: 1c7efb8 + 53968f8

17 files changed: +683 / -90 lines

appveyor.yml

Lines changed: 1 addition & 8 deletions
@@ -40,18 +40,11 @@ environment:
 
 install:
   - "SET PATH=%PYTHON%;%PYTHON%\\Scripts;%PATH%"
-  - git submodule update --init --recursive
 
 build: off
 
 test_script:
   - "%CMD_IN_ENV% python -m pip install -U pip setuptools wheel"
   - "%CMD_IN_ENV% python -m pip install -rrequirements_dev.txt"
-  - "%CMD_IN_ENV% python setup.py build_ext --inplace"
-  - "%CMD_IN_ENV% python -m nose -v"
-
-after_test:
   - "%CMD_IN_ENV% python setup.py bdist_wheel"
-
-artifacts:
-  - path: dist\*
+  - "%CMD_IN_ENV% python -m pytest -v zarr"

docs/release.rst

Lines changed: 11 additions & 0 deletions
@@ -107,6 +107,17 @@ Enhancements
 * **New Array.hexdigest() method** computes an ``Array``'s hash with ``hashlib``.
   By :user:`John Kirkham <jakirkham>`, :issue:`98`, :issue:`203`.
 
+* **Improved support for object arrays**. In previous versions of Zarr,
+  creating an array with ``dtype=object`` was possible but could under certain
+  circumstances lead to unexpected errors and/or segmentation faults. To make it easier
+  to properly configure an object array, a new ``object_codec`` parameter has been
+  added to array creation functions. See the tutorial section on :ref:`tutorial_objects`
+  for more information and examples. Also, runtime checks have been added in both Zarr
+  and Numcodecs so that segmentation faults are no longer possible, even with a badly
+  configured array. This API change is backwards compatible: previous code that created
+  an object array and provided an object codec via the ``filters`` parameter will
+  continue to work; however, a warning will be raised to encourage use of the
+  ``object_codec`` parameter. :issue:`208`, :issue:`212`.
 
 Bug fixes
 ~~~~~~~~~
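In practice, the change described above amounts to passing the object codec directly
instead of wrapping it in ``filters``. A minimal sketch, assuming zarr and numcodecs are
installed (the names ``z_new`` and ``z_old`` are illustrative, and the warning text is
not reproduced here)::

    >>> import numcodecs
    >>> import zarr
    >>> # New style introduced by this change: pass the codec via object_codec.
    >>> z_new = zarr.empty(10, dtype=object, object_codec=numcodecs.MsgPack())
    >>> # Old style: codec supplied as the first filter. Per the release note
    >>> # above this still works, but now raises a warning encouraging use of
    >>> # the object_codec parameter instead.
    >>> z_old = zarr.zeros(10, dtype=object, filters=[numcodecs.MsgPack()])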

docs/tutorial.rst

Lines changed: 72 additions & 33 deletions
@@ -178,8 +178,8 @@ print some diagnostics, e.g.::
                        : blocksize=0)
     Store type         : builtins.dict
     No. bytes          : 400000000 (381.5M)
-    No. bytes stored   : 4565053 (4.4M)
-    Storage ratio      : 87.6
+    No. bytes stored   : 3702484 (3.5M)
+    Storage ratio      : 108.0
     Chunks initialized : 100/100
 
 If you don't specify a compressor, by default Zarr uses the Blosc
@@ -270,8 +270,8 @@ Here is an example using a delta filter with the Blosc compressor::
     Compressor         : Blosc(cname='zstd', clevel=1, shuffle=SHUFFLE, blocksize=0)
     Store type         : builtins.dict
     No. bytes          : 400000000 (381.5M)
-    No. bytes stored   : 648605 (633.4K)
-    Storage ratio      : 616.7
+    No. bytes stored   : 328085 (320.4K)
+    Storage ratio      : 1219.2
     Chunks initialized : 100/100
 
 For more information about available filter codecs, see the `Numcodecs
@@ -394,8 +394,8 @@ property. E.g.::
     Compressor         : Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
     Store type         : zarr.storage.DictStore
     No. bytes          : 8000000 (7.6M)
-    No. bytes stored   : 37480 (36.6K)
-    Storage ratio      : 213.4
+    No. bytes stored   : 34840 (34.0K)
+    Storage ratio      : 229.6
     Chunks initialized : 10/10
 
     >>> baz.info
@@ -409,8 +409,8 @@ property. E.g.::
     Compressor         : Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
     Store type         : zarr.storage.DictStore
     No. bytes          : 4000000 (3.8M)
-    No. bytes stored   : 23243 (22.7K)
-    Storage ratio      : 172.1
+    No. bytes stored   : 20443 (20.0K)
+    Storage ratio      : 195.7
     Chunks initialized : 100/100
 
 Groups also have the :func:`zarr.hierarchy.Group.tree` method, e.g.::
@@ -768,7 +768,6 @@ Here is an example using S3Map to read an array created previously::
     b'Hello from the cloud!'
 
 
-
 .. _tutorial_strings:
 
 String arrays
@@ -788,40 +787,80 @@ your dataset, then you can use an array with a fixed-length bytes dtype. E.g.::
 
 A fixed-length unicode dtype is also available, e.g.::
 
-    >>> z = zarr.zeros(12, dtype='U20')
     >>> greetings = ['¡Hola mundo!', 'Hej Världen!', 'Servus Woid!', 'Hei maailma!',
     ...              'Xin chào thế giới', 'Njatjeta Botë!', 'Γεια σου κόσμε!',
     ...              'こんにちは世界', '世界,你好!', 'Helló, világ!', 'Zdravo svete!',
     ...              'เฮลโลเวิลด์']
-    >>> z[:] = greetings
+    >>> text_data = greetings * 10000
+    >>> z = zarr.array(text_data, dtype='U20')
     >>> z[:]
-    array(['¡Hola mundo!', 'Hej Världen!', 'Servus Woid!', 'Hei maailma!',
-           'Xin chào thế giới', 'Njatjeta Botë!', 'Γεια σου κόσμε!', 'こんにちは世界',
-           '世界,你好!', 'Helló, világ!', 'Zdravo svete!', 'เฮลโลเวิลด์'],
+    array(['¡Hola mundo!', 'Hej Världen!', 'Servus Woid!', ...,
+           'Helló, világ!', 'Zdravo svete!', 'เฮลโลเวิลด์'],
           dtype='<U20')
 
-For variable-length strings, the "object" dtype can be used, but a filter must be
-provided to encode the data. There are currently two codecs available that can encode
-variable length string objects, :class:`numcodecs.Pickle` and :class:`numcodecs.MsgPack`.
-E.g. using pickle::
+For variable-length strings, the "object" dtype can be used, but a codec must be
+provided to encode the data (see also :ref:`tutorial_objects` below). At the time of
+writing there are three codecs available that can encode variable length string
+objects, :class:`numcodecs.JSON`, :class:`numcodecs.MsgPack` and
+:class:`numcodecs.Pickle`. E.g. using JSON::
 
     >>> import numcodecs
-    >>> z = zarr.zeros(12, dtype=object, filters=[numcodecs.Pickle()])
-    >>> z[:] = greetings
+    >>> z = zarr.array(text_data, dtype=object, object_codec=numcodecs.JSON())
     >>> z[:]
-    array(['¡Hola mundo!', 'Hej Världen!', 'Servus Woid!', 'Hei maailma!',
-           'Xin chào thế giới', 'Njatjeta Botë!', 'Γεια σου κόσμε!', 'こんにちは世界',
-           '世界,你好!', 'Helló, világ!', 'Zdravo svete!', 'เฮลโลเวิลด์'], dtype=object)
+    array(['¡Hola mundo!', 'Hej Världen!', 'Servus Woid!', ...,
+           'Helló, világ!', 'Zdravo svete!', 'เฮลโลเวิลด์'], dtype=object)
 
 ...or alternatively using msgpack (requires `msgpack-python
 <https://github.com/msgpack/msgpack-python>`_ to be installed)::
 
-    >>> z = zarr.zeros(12, dtype=object, filters=[numcodecs.MsgPack()])
-    >>> z[:] = greetings
+    >>> z = zarr.array(text_data, dtype=object, object_codec=numcodecs.MsgPack())
+    >>> z[:]
+    array(['¡Hola mundo!', 'Hej Världen!', 'Servus Woid!', ...,
+           'Helló, világ!', 'Zdravo svete!', 'เฮลโลเวิลด์'], dtype=object)
+
+If you know ahead of time all the possible string values that can occur, then you could
+also use the :class:`numcodecs.Categorize` codec to encode each unique value as an
+integer. E.g.::
+
+    >>> categorize = numcodecs.Categorize(greetings, dtype=object)
+    >>> z = zarr.array(text_data, dtype=object, object_codec=categorize)
+    >>> z[:]
+    array(['¡Hola mundo!', 'Hej Världen!', 'Servus Woid!', ...,
+           'Helló, világ!', 'Zdravo svete!', 'เฮลโลเวิลด์'], dtype=object)
+
+
+.. _tutorial_objects:
+
+Object arrays
+-------------
+
+Zarr supports arrays with an "object" dtype. This allows arrays to contain any type of
+object, such as variable length unicode strings, or variable length lists, or other
+possibilities. When creating an object array, a codec must be provided via the
+``object_codec`` argument. This codec handles encoding (serialization) of Python objects.
+At the time of writing there are three codecs available that can serve as a
+general purpose object codec and support encoding of a variety of
+object types: :class:`numcodecs.JSON`, :class:`numcodecs.MsgPack` and
+:class:`numcodecs.Pickle`.
+
+For example, using the JSON codec::
+
+    >>> z = zarr.empty(5, dtype=object, object_codec=numcodecs.JSON())
+    >>> z[0] = 42
+    >>> z[1] = 'foo'
+    >>> z[2] = ['bar', 'baz', 'qux']
+    >>> z[3] = {'a': 1, 'b': 2.2}
     >>> z[:]
-    array(['¡Hola mundo!', 'Hej Världen!', 'Servus Woid!', 'Hei maailma!',
-           'Xin chào thế giới', 'Njatjeta Botë!', 'Γεια σου κόσμε!', 'こんにちは世界',
-           '世界,你好!', 'Helló, világ!', 'Zdravo svete!', 'เฮลโลเวิลด์'], dtype=object)
+    array([42, 'foo', list(['bar', 'baz', 'qux']), {'a': 1, 'b': 2.2}, None], dtype=object)
+
+Not all codecs support encoding of all object types. The
+:class:`numcodecs.Pickle` codec is the most flexible, supporting encoding of any type
+of Python object. However, if you are sharing data with anyone other than yourself, then
+Pickle is not recommended as it is a potential security risk. This is because malicious
+code can be embedded within pickled data. The JSON and MsgPack codecs do not have any
+security issues and support encoding of unicode strings, lists and dictionaries.
+MsgPack is usually faster for both encoding and decoding.
+
 
 .. _tutorial_chunks:
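To illustrate the flexibility noted above: a value that JSON cannot represent, such as a
Python set, can still be round-tripped when :class:`numcodecs.Pickle` is used as the
``object_codec``. A minimal sketch, not part of this commit; the printed output is
indicative only::

    >>> import numcodecs
    >>> import zarr
    >>> z = zarr.empty(2, dtype=object, object_codec=numcodecs.Pickle())
    >>> z[0] = {'tags': {'spam', 'eggs'}}  # nested set: fine for Pickle, not encodable as JSON
    >>> z[0]
    {'tags': {'spam', 'eggs'}}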

@@ -898,8 +937,8 @@ ratios, depending on the correlation structure within the data. E.g.::
     Compressor         : Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
     Store type         : builtins.dict
     No. bytes          : 400000000 (381.5M)
-    No. bytes stored   : 26805735 (25.6M)
-    Storage ratio      : 14.9
+    No. bytes stored   : 15857834 (15.1M)
+    Storage ratio      : 25.2
     Chunks initialized : 100/100
     >>> f = zarr.array(a, chunks=(1000, 1000), order='F')
     >>> f.info
@@ -912,8 +951,8 @@ ratios, depending on the correlation structure within the data. E.g.::
     Compressor         : Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
     Store type         : builtins.dict
     No. bytes          : 400000000 (381.5M)
-    No. bytes stored   : 9633601 (9.2M)
-    Storage ratio      : 41.5
+    No. bytes stored   : 7233241 (6.9M)
+    Storage ratio      : 55.3
     Chunks initialized : 100/100
 
 In the above example, Fortran order gives a better compression ratio. This is an
@@ -1014,7 +1053,7 @@ E.g., pickle/unpickle an in-memory array::
     >>> import pickle
    >>> z1 = zarr.array(np.arange(100000))
     >>> s = pickle.dumps(z1)
-    >>> len(s) > 10000  # relatively large because data have been pickled
+    >>> len(s) > 5000  # relatively large because data have been pickled
     True
     >>> z2 = pickle.loads(s)
     >>> z1 == z2
