
Commit 3d183dd

Merge branch 'master' into cache-attrs-20171224b
2 parents 26d7366 + c4e2e96 commit 3d183dd

16 files changed: +1930 / -59 lines

docs/api/convenience.rst

Lines changed: 4 additions & 0 deletions
@@ -6,3 +6,7 @@ Convenience functions (``zarr.convenience``)
 .. autofunction:: load
 .. autofunction:: save_array
 .. autofunction:: save_group
+.. autofunction:: copy
+.. autofunction:: copy_all
+.. autofunction:: copy_store
+.. autofunction:: tree

docs/api/storage.rst

Lines changed: 12 additions & 0 deletions
@@ -21,6 +21,18 @@ Storage (``zarr.storage``)
     .. automethod:: close
     .. automethod:: flush
 
+.. autoclass:: LRUStoreCache
+
+    .. automethod:: invalidate
+    .. automethod:: invalidate_values
+    .. automethod:: invalidate_keys
+
 .. autofunction:: init_array
 .. autofunction:: init_group
+.. autofunction:: contains_array
+.. autofunction:: contains_group
+.. autofunction:: listdir
+.. autofunction:: rmdir
+.. autofunction:: getsize
+.. autofunction:: rename
 .. autofunction:: migrate_1to2
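The caching behaviour behind the new ``LRUStoreCache`` class and its ``invalidate`` method can be sketched in plain Python. This is an illustrative sketch of the LRU idea only, not zarr's actual implementation; the class name ``LRUCacheMapping`` and the ``max_items`` parameter are invented here for the example (zarr's class caches by total size via ``max_size``):

```python
from collections import OrderedDict


class LRUCacheMapping:
    """Minimal sketch of an LRU read cache over a backing mapping.

    Illustrative only -- not zarr's LRUStoreCache implementation.
    Keys are moved to the end on access; the oldest entry is evicted
    when the cache holds more than ``max_items`` entries.
    """

    def __init__(self, store, max_items):
        self._store = store          # slow backing store (any mapping)
        self._max_items = max_items
        self._cache = OrderedDict()  # insertion order tracks recency

    def __getitem__(self, key):
        if key in self._cache:
            self._cache.move_to_end(key)  # mark as most recently used
            return self._cache[key]
        value = self._store[key]          # cache miss: hit the slow store
        self._cache[key] = value
        if len(self._cache) > self._max_items:
            self._cache.popitem(last=False)  # evict least recently used
        return value

    def invalidate(self):
        """Drop all cached values (cf. LRUStoreCache.invalidate)."""
        self._cache.clear()


store = {'a': 1, 'b': 2, 'c': 3}
cache = LRUCacheMapping(store, max_items=2)
cache['a'], cache['b'], cache['c']  # third access evicts 'a'
```

The ``invalidate_values``/``invalidate_keys`` methods listed above refine the same idea by dropping only cached chunk data or only the cached key listing, respectively.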

docs/release.rst

Lines changed: 12 additions & 0 deletions
@@ -131,6 +131,18 @@ Enhancements
   slow stores, e.g., stores accessing data via the network; :issue:`220`, :issue:`218`,
   :issue:`204`.
 
+* **New LRUStoreCache class**. The class :class:`zarr.storage.LRUStoreCache` has been
+  added and provides a means to locally cache data in memory from a store that may be
+  slow, e.g., a store that retrieves data from a remote server via the network;
+  :issue:`223`.
+
+* **New copy functions**. The new functions :func:`zarr.convenience.copy` and
+  :func:`zarr.convenience.copy_all` provide a way to copy groups and/or arrays
+  between HDF5 and Zarr, or between two Zarr groups. The
+  :func:`zarr.convenience.copy_store` function provides a more efficient way to copy
+  data directly between two Zarr stores. :issue:`87`, :issue:`113`,
+  :issue:`137`, :issue:`217`.
+
 Bug fixes
 ~~~~~~~~~
 

docs/tutorial.rst

Lines changed: 138 additions & 0 deletions
@@ -729,6 +729,9 @@ group (requires `lmdb <http://lmdb.readthedocs.io/>`_ to be installed)::
     >>> z[:] = 42
     >>> store.close()
 
+Distributed/cloud storage
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
 It is also possible to use distributed storage systems. The Dask project has
 implementations of the ``MutableMapping`` interface for Amazon S3 (`S3Map
 <http://s3fs.readthedocs.io/en/latest/api.html#s3fs.mapping.S3Map>`_), Hadoop
@@ -767,6 +770,141 @@ Here is an example using S3Map to read an array created previously::
     >>> z[:].tostring()
     b'Hello from the cloud!'
 
+Note that retrieving data from a remote service via the network can be significantly
+slower than retrieving data from a local file system, and will depend on network latency
+and bandwidth between the client and server systems. If you are experiencing poor
+performance, there are several things you can try. One option is to increase the array
+chunk size, which will reduce the number of chunks and thus reduce the number of network
+round-trips required to retrieve data for an array (and thus reduce the impact of network
+latency). Another option is to try to increase the compression ratio by changing
+compression options or trying a different compressor (which will reduce the impact of
+limited network bandwidth). As of version 2.2, Zarr also provides the
+:class:`zarr.storage.LRUStoreCache` which can be used to implement a local in-memory cache
+layer over a remote store. E.g.::
+
+    >>> s3 = s3fs.S3FileSystem(anon=True, client_kwargs=dict(region_name='eu-west-2'))
+    >>> store = s3fs.S3Map(root='zarr-demo/store', s3=s3, check=False)
+    >>> cache = zarr.LRUStoreCache(store, max_size=2**28)
+    >>> root = zarr.group(store=cache)
+    >>> z = root['foo/bar/baz']
+    >>> from timeit import timeit
+    >>> # first data access is relatively slow, retrieved from store
+    ... timeit('print(z[:].tostring())', number=1, globals=globals())  # doctest: +SKIP
+    b'Hello from the cloud!'
+    0.1081731989979744
+    >>> # second data access is faster, uses cache
+    ... timeit('print(z[:].tostring())', number=1, globals=globals())  # doctest: +SKIP
+    b'Hello from the cloud!'
+    0.0009490990014455747
+
+If you are still experiencing poor performance with distributed/cloud storage, please
+raise an issue on the GitHub issue tracker with any profiling data you can provide, as
+there may be opportunities to optimise further either within Zarr or within the mapping
+interface to the storage.
+
+.. _tutorial_copy:
+
+Copying/migrating data
+----------------------
+
+If you have some data in an HDF5 file and would like to copy some or all of it
+into a Zarr group, or vice versa, the :func:`zarr.convenience.copy` and
+:func:`zarr.convenience.copy_all` functions can be used. Here's an example
+copying a group named 'foo' from an HDF5 file to a Zarr group::
+
+    >>> import h5py
+    >>> import zarr
+    >>> import numpy as np
+    >>> source = h5py.File('data/example.h5', mode='w')
+    >>> foo = source.create_group('foo')
+    >>> baz = foo.create_dataset('bar/baz', data=np.arange(100), chunks=(50,))
+    >>> spam = source.create_dataset('spam', data=np.arange(100, 200), chunks=(30,))
+    >>> zarr.tree(source)
+    /
+     ├── foo
+     │   └── bar
+     │       └── baz (100,) int64
+     └── spam (100,) int64
+    >>> dest = zarr.open_group('data/example.zarr', mode='w')
+    >>> from sys import stdout
+    >>> zarr.copy(source['foo'], dest, log=stdout)
+    copy /foo
+    copy /foo/bar
+    copy /foo/bar/baz (100,) int64
+    all done: 3 copied, 0 skipped, 800 bytes copied
+    (3, 0, 800)
+    >>> dest.tree()  # N.B., no spam
+    /
+     └── foo
+         └── bar
+             └── baz (100,) int64
+    >>> source.close()
+
+If rather than copying a single group or dataset you would like to copy all
+groups and datasets, use :func:`zarr.convenience.copy_all`, e.g.::
+
+    >>> source = h5py.File('data/example.h5', mode='r')
+    >>> dest = zarr.open_group('data/example2.zarr', mode='w')
+    >>> zarr.copy_all(source, dest, log=stdout)
+    copy /foo
+    copy /foo/bar
+    copy /foo/bar/baz (100,) int64
+    copy /spam (100,) int64
+    all done: 4 copied, 0 skipped, 1,600 bytes copied
+    (4, 0, 1600)
+    >>> dest.tree()
+    /
+     ├── foo
+     │   └── bar
+     │       └── baz (100,) int64
+     └── spam (100,) int64
+
+If you need to copy data between two Zarr groups, the
+:func:`zarr.convenience.copy` and :func:`zarr.convenience.copy_all` functions can
+be used and provide the most flexibility. However, if you want to copy data
+in the most efficient way possible, without changing any configuration options,
+the :func:`zarr.convenience.copy_store` function can be used. This function
+copies data directly between the underlying stores, without any decompression or
+re-compression, and so should be faster. E.g.::
+
+    >>> import zarr
+    >>> import numpy as np
+    >>> store1 = zarr.DirectoryStore('data/example.zarr')
+    >>> root = zarr.group(store1, overwrite=True)
+    >>> baz = root.create_dataset('foo/bar/baz', data=np.arange(100), chunks=(50,))
+    >>> spam = root.create_dataset('spam', data=np.arange(100, 200), chunks=(30,))
+    >>> root.tree()
+    /
+     ├── foo
+     │   └── bar
+     │       └── baz (100,) int64
+     └── spam (100,) int64
+    >>> from sys import stdout
+    >>> store2 = zarr.ZipStore('data/example.zip', mode='w')
+    >>> zarr.copy_store(store1, store2, log=stdout)
+    copy .zgroup
+    copy foo/.zgroup
+    copy foo/bar/.zgroup
+    copy foo/bar/baz/.zarray
+    copy foo/bar/baz/0
+    copy foo/bar/baz/1
+    copy spam/.zarray
+    copy spam/0
+    copy spam/1
+    copy spam/2
+    copy spam/3
+    all done: 11 copied, 0 skipped, 1,138 bytes copied
+    (11, 0, 1138)
+    >>> new_root = zarr.group(store2)
+    >>> new_root.tree()
+    /
+     ├── foo
+     │   └── bar
+     │       └── baz (100,) int64
+     └── spam (100,) int64
+    >>> new_root['foo/bar/baz'][:]
+    array([ 0,  1,  2, ..., 97, 98, 99])
+    >>> store2.close()  # zip stores need to be closed
 
 .. _tutorial_strings:
requirements_dev.txt

Lines changed: 1 addition & 0 deletions
@@ -10,6 +10,7 @@ Cython==0.27.2
 docopt==0.6.2
 fasteners==0.14.1
 flake8==3.5.0
+h5py==2.7.1
 idna==2.6
 mccabe==0.6.1
 monotonic==1.3
zarr/__init__.py

Lines changed: 4 additions & 2 deletions
@@ -7,9 +7,11 @@
 from zarr.creation import (empty, zeros, ones, full, array, empty_like, zeros_like,
                            ones_like, full_like, open_array, open_like, create)
 from zarr.storage import (DictStore, DirectoryStore, ZipStore, TempStore,
-                          NestedDirectoryStore, DBMStore, LMDBStore)
+                          NestedDirectoryStore, DBMStore, LMDBStore, LRUStoreCache)
 from zarr.hierarchy import group, open_group, Group
 from zarr.sync import ThreadSynchronizer, ProcessSynchronizer
 from zarr.codecs import *
-from zarr.convenience import open, save, save_array, save_group, load
+from zarr.convenience import (open, save, save_array, save_group, load, copy_store,
+                              copy, copy_all, tree)
+from zarr.errors import CopyError, MetadataError, PermissionError
 from zarr.version import version as __version__

zarr/compat.py

Lines changed: 7 additions & 0 deletions
@@ -16,9 +16,16 @@
     class PermissionError(Exception):
         pass
 
+    def OrderedDict_move_to_end(od, key):
+        od[key] = od.pop(key)
+
+
 else:  # pragma: py2 no cover
 
     text_type = str
     binary_type = bytes
     from functools import reduce
     PermissionError = PermissionError
+
+    def OrderedDict_move_to_end(od, key):
+        od.move_to_end(key)
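The compat shim above exists because ``OrderedDict.move_to_end`` is Python 3 only; on Python 2, popping a key and re-inserting it achieves the same reordering. A quick demonstration that the two branches are equivalent (standalone sketch, not the zarr module itself):

```python
from collections import OrderedDict


def OrderedDict_move_to_end_py2(od, key):
    # Python 2 fallback: pop and re-insert shifts the key to the end
    od[key] = od.pop(key)


def OrderedDict_move_to_end_py3(od, key):
    # Python 3: OrderedDict has a native move_to_end method
    od.move_to_end(key)


a = OrderedDict([('x', 1), ('y', 2), ('z', 3)])
b = OrderedDict([('x', 1), ('y', 2), ('z', 3)])
OrderedDict_move_to_end_py2(a, 'x')
OrderedDict_move_to_end_py3(b, 'x')
assert list(a) == list(b) == ['y', 'z', 'x']
```

This recency-tracking operation is what the ``LRUStoreCache`` added in this commit relies on to decide which cached entry to evict first.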
