Commit 4a29648

Merge branch 'master' into rework-requirements-alimanfoo-20181108
2 parents: c8bb007 + d193a78

16 files changed: +422 −50 lines

docs/api/convenience.rst
Lines changed: 2 additions & 0 deletions

@@ -10,3 +10,5 @@ Convenience functions (``zarr.convenience``)
 .. autofunction:: copy_all
 .. autofunction:: copy_store
 .. autofunction:: tree
+.. autofunction:: consolidate_metadata
+.. autofunction:: open_consolidated

docs/api/storage.rst
Lines changed: 2 additions & 0 deletions

@@ -27,6 +27,8 @@ Storage (``zarr.storage``)
 .. automethod:: invalidate_values
 .. automethod:: invalidate_keys
 
+.. autoclass:: ConsolidatedMetadataStore
+
 .. autofunction:: init_array
 .. autofunction:: init_group
 .. autofunction:: contains_array

docs/release.rst
Lines changed: 7 additions & 0 deletions

@@ -9,6 +9,13 @@ Release notes
 Enhancements
 ~~~~~~~~~~~~
 
+* Add "consolidated" metadata as an experimental feature: use
+  :func:`zarr.convenience.consolidate_metadata` to copy all metadata from the various
+  metadata keys within a dataset hierarchy under a single key, and
+  :func:`zarr.convenience.open_consolidated` to use this single key. This can greatly
+  cut down the number of calls to the storage backend, and so remove a lot of overhead
+  for reading remote data. By :user:`Martin Durant <martindurant>`, :issue:`268`.
+
 * Support has been added for structured arrays with sub-array shape and/or nested fields. By
   :user:`Tarik Onalan <onalant>`, :issue:`111`, :issue:`296`.

docs/tutorial.rst
Lines changed: 47 additions & 7 deletions

@@ -778,9 +778,11 @@ chunk size, which will reduce the number of chunks and thus reduce the number of
 round-trips required to retrieve data for an array (and thus reduce the impact of network
 latency). Another option is to try to increase the compression ratio by changing
 compression options or trying a different compressor (which will reduce the impact of
-limited network bandwidth). As of version 2.2, Zarr also provides the
-:class:`zarr.storage.LRUStoreCache` which can be used to implement a local in-memory cache
-layer over a remote store. E.g.::
+limited network bandwidth).
+
+As of version 2.2, Zarr also provides the :class:`zarr.storage.LRUStoreCache`
+which can be used to implement a local in-memory cache layer over a remote
+store. E.g.::
 
     >>> s3 = s3fs.S3FileSystem(anon=True, client_kwargs=dict(region_name='eu-west-2'))
     >>> store = s3fs.S3Map(root='zarr-demo/store', s3=s3, check=False)

@@ -797,13 +799,51 @@ layer over a remote store. E.g.::
     b'Hello from the cloud!'
     0.0009490990014455747
 
-If you are still experiencing poor performance with distributed/cloud storage, please
-raise an issue on the GitHub issue tracker with any profiling data you can provide, as
-there may be opportunities to optimise further either within Zarr or within the mapping
-interface to the storage.
+If you are still experiencing poor performance with distributed/cloud storage,
+please raise an issue on the GitHub issue tracker with any profiling data you
+can provide, as there may be opportunities to optimise further either within
+Zarr or within the mapping interface to the storage.
 
 .. _tutorial_copy:
 
+Consolidating metadata
+~~~~~~~~~~~~~~~~~~~~~~
+
+(This is an experimental feature.)
+
+Since there is a significant overhead for every connection to a cloud object
+store such as S3, the pattern described in the previous section may incur
+significant latency while scanning the metadata of the dataset hierarchy, even
+though each individual metadata object is small. For cases such as these, once
+the data are static and can be regarded as read-only, at least for the
+metadata/structure of the dataset hierarchy, the many metadata objects can be
+consolidated into a single one via
+:func:`zarr.convenience.consolidate_metadata`. Doing this can greatly increase
+the speed of reading the dataset metadata, e.g.::
+
+    >>> zarr.consolidate_metadata(store)  # doctest: +SKIP
+
+This creates a special key with a copy of all of the metadata from all of the
+metadata objects in the store.
+
+Later, to open a Zarr store with consolidated metadata, use
+:func:`zarr.convenience.open_consolidated`, e.g.::
+
+    >>> root = zarr.open_consolidated(store)  # doctest: +SKIP
+
+This uses the special key to read all of the metadata in a single call to the
+backend storage.
+
+Note that the hierarchy could still be opened in the normal way and altered,
+causing the consolidated metadata to become out of sync with the real state of
+the dataset hierarchy. In this case,
+:func:`zarr.convenience.consolidate_metadata` would need to be called again.
+
+To protect against consolidated metadata accidentally getting out of sync, the
+root group returned by :func:`zarr.convenience.open_consolidated` is read-only
+for the metadata, meaning that no new groups or arrays can be created, and
+arrays cannot be resized. However, data values within arrays can still be
+updated.
+
 Copying/migrating data
 ----------------------
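The saving this feature targets can be illustrated without a real cloud store. Below is a minimal sketch (not zarr itself): `CountingStore` is a hypothetical dict-backed store that counts reads, standing in for a remote object store where each access is a round trip. Scanning the hierarchy costs one read per metadata key, while a consolidated reader needs only one.

```python
import json

# Hypothetical dict-backed store that counts every read, standing in for a
# remote object store where each key access is a network round trip.
class CountingStore(dict):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.reads = 0

    def __getitem__(self, key):
        self.reads += 1
        return super().__getitem__(key)

store = CountingStore({
    '.zgroup': b'{"zarr_format": 2}',
    'foo/.zgroup': b'{"zarr_format": 2}',
    'foo/bar/.zarray': b'{"zarr_format": 2, "shape": [10]}',
})

# Scanning the hierarchy key by key costs one read per metadata object ...
for key in list(store):
    json.loads(store[key])
print(store.reads)  # 3

# ... whereas after consolidation, later readers need only a single read.
store['.zmetadata'] = json.dumps({
    'zarr_consolidated_format': 1,
    'metadata': {k: json.loads(store[k]) for k in list(store)
                 if k != '.zmetadata'},
}).encode()

store.reads = 0
meta = json.loads(store['.zmetadata'])['metadata']
print(store.reads)  # 1
```

The second read count is independent of how many groups and arrays the hierarchy contains, which is the point of the feature.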

zarr/__init__.py
Lines changed: 2 additions & 1 deletion

@@ -12,6 +12,7 @@
 from zarr.sync import ThreadSynchronizer, ProcessSynchronizer
 from zarr.codecs import *
 from zarr.convenience import (open, save, save_array, save_group, load, copy_store,
-                              copy, copy_all, tree)
+                              copy, copy_all, tree, consolidate_metadata,
+                              open_consolidated)
 from zarr.errors import CopyError, MetadataError, PermissionError
 from zarr.version import version as __version__

zarr/attrs.py
Lines changed: 2 additions & 2 deletions

@@ -4,8 +4,8 @@
 from collections import MutableMapping
 
 
-from zarr.compat import text_type
 from zarr.errors import PermissionError
+from zarr.meta import parse_metadata
 
 
 class Attributes(MutableMapping):

@@ -43,7 +43,7 @@ def _get_nosync(self):
         except KeyError:
             d = dict()
         else:
-            d = json.loads(text_type(data, 'ascii'))
+            d = parse_metadata(data)
         return d
 
     def asdict(self):
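
The switch from `json.loads` to `parse_metadata` matters because a store may now hand back metadata that is already parsed, as `ConsolidatedMetadataStore` does. A minimal sketch of the idea (this is not the actual `zarr.meta` implementation):

```python
import json
from collections.abc import Mapping

def parse_metadata(data):
    # A store may return metadata either as raw JSON bytes or, as with a
    # consolidated metadata store, as an already-parsed mapping.
    if isinstance(data, Mapping):
        return data
    if isinstance(data, bytes):
        data = data.decode('ascii')
    return json.loads(data)

print(parse_metadata(b'{"zarr_format": 2}'))  # {'zarr_format': 2}
print(parse_metadata({'zarr_format': 2}))     # {'zarr_format': 2}
```

Accepting both forms lets `Attributes` stay oblivious to which kind of store backs it.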

zarr/compat.py
Lines changed: 4 additions & 0 deletions

@@ -19,6 +19,8 @@ class PermissionError(Exception):
     def OrderedDict_move_to_end(od, key):
         od[key] = od.pop(key)
 
+    from collections import Mapping
+
 
 else:  # pragma: py2 no cover

@@ -29,3 +31,5 @@ def OrderedDict_move_to_end(od, key):
 
     def OrderedDict_move_to_end(od, key):
         od.move_to_end(key)
+
+    from collections.abc import Mapping
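
The split import exists because the abstract base classes moved: on Python 2, `Mapping` lives in `collections`, while on Python 3 it lives in `collections.abc`. A common guarded-import sketch of the same pattern:

```python
import sys

if sys.version_info[0] == 2:  # pragma: py3 no cover
    from collections import Mapping  # noqa: F401
else:
    from collections.abc import Mapping

# Mapping can then be used for duck-typed checks regardless of version,
# e.g. to decide whether store metadata is already parsed.
print(issubclass(dict, Mapping))  # True
```

Placing the import inside the existing version branches in `compat.py` keeps all such shims in one module.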

zarr/convenience.py
Lines changed: 120 additions & 6 deletions

@@ -15,28 +15,34 @@
 from zarr.errors import err_path_not_found, CopyError
 from zarr.util import normalize_storage_path, TreeViewer, buffer_size
 from zarr.compat import PY2, text_type
+from zarr.meta import ensure_str, json_dumps
 
 
 # noinspection PyShadowingBuiltins
-def open(store, mode='a', **kwargs):
+def open(store=None, mode='a', **kwargs):
     """Convenience function to open a group or array using file-mode-like semantics.
 
     Parameters
     ----------
-    store : MutableMapping or string
+    store : MutableMapping or string, optional
         Store or path to directory in file system or name of zip file.
     mode : {'r', 'r+', 'a', 'w', 'w-'}, optional
         Persistence mode: 'r' means read only (must exist); 'r+' means
         read/write (must exist); 'a' means read/write (create if doesn't
         exist); 'w' means create (overwrite if exists); 'w-' means create
         (fail if exists).
     **kwargs
-        Additional parameters are passed through to :func:`zarr.open_array` or
-        :func:`zarr.open_group`.
+        Additional parameters are passed through to :func:`zarr.creation.open_array` or
+        :func:`zarr.hierarchy.open_group`.
+
+    Returns
+    -------
+    z : :class:`zarr.core.Array` or :class:`zarr.hierarchy.Group`
+        Array or group, depending on what exists in the given store.
 
     See Also
     --------
-    zarr.open_array, zarr.open_group
+    zarr.creation.open_array, zarr.hierarchy.open_group
 
     Examples
     --------

@@ -68,7 +74,8 @@ def open(store, mode='a', **kwargs):
 
     path = kwargs.get('path', None)
     # handle polymorphic store arg
-    store = normalize_store_arg(store, clobber=(mode == 'w'))
+    clobber = mode == 'w'
+    store = normalize_store_arg(store, clobber=clobber)
     path = normalize_storage_path(path)
 
     if mode in {'w', 'w-', 'x'}:

@@ -1069,3 +1076,110 @@ def copy_all(source, dest, shallow=False, without_attrs=False, log=None,
     _log_copy_summary(log, dry_run, n_copied, n_skipped, n_bytes_copied)
 
     return n_copied, n_skipped, n_bytes_copied
+
+
+def consolidate_metadata(store, metadata_key='.zmetadata'):
+    """
+    Consolidate all metadata for groups and arrays within the given store
+    into a single resource and put it under the given key.
+
+    This produces a single object in the backend store, containing all the
+    metadata read from all the zarr-related keys that can be found. After
+    metadata have been consolidated, use :func:`open_consolidated` to open
+    the root group in optimised, read-only mode, using the consolidated
+    metadata to reduce the number of read operations on the backend store.
+
+    Note that if the metadata in the store is changed after this
+    consolidation, then the metadata read by :func:`open_consolidated`
+    would be incorrect unless this function is called again.
+
+    .. note:: This is an experimental feature.
+
+    Parameters
+    ----------
+    store : MutableMapping or string
+        Store or path to directory in file system or name of zip file.
+    metadata_key : str
+        Key to put the consolidated metadata under.
+
+    Returns
+    -------
+    g : :class:`zarr.hierarchy.Group`
+        Group instance, opened with the new consolidated metadata.
+
+    See Also
+    --------
+    open_consolidated
+
+    """
+    import json
+
+    store = normalize_store_arg(store)
+
+    def is_zarr_key(key):
+        return (key.endswith('.zarray') or key.endswith('.zgroup') or
+                key.endswith('.zattrs'))
+
+    out = {
+        'zarr_consolidated_format': 1,
+        'metadata': {
+            key: json.loads(ensure_str(store[key]))
+            for key in store if is_zarr_key(key)
+        }
+    }
+    store[metadata_key] = json_dumps(out).encode()
+    return open_consolidated(store, metadata_key=metadata_key)
+
+
+def open_consolidated(store, metadata_key='.zmetadata', mode='r+'):
+    """Open group using metadata previously consolidated into a single key.
+
+    This is an optimised method for opening a Zarr group, where instead of
+    traversing the group/array hierarchy by accessing the metadata keys at
+    each level, a single key contains all of the metadata for everything.
+    This is useful for remote data sources, where the overhead of accessing
+    a key is large compared to the time to read data.
+
+    The group accessed must have already had its metadata consolidated into a
+    single key using the function :func:`consolidate_metadata`.
+
+    This optimised method only works in modes which do not change the
+    metadata, although the data may still be written/updated.
+
+    Parameters
+    ----------
+    store : MutableMapping or string
+        Store or path to directory in file system or name of zip file.
+    metadata_key : str
+        Key to read the consolidated metadata from. The default (.zmetadata)
+        corresponds to the default used by :func:`consolidate_metadata`.
+    mode : {'r', 'r+'}, optional
+        Persistence mode: 'r' means read only (must exist); 'r+' means
+        read/write (must exist), although only writes to data are allowed;
+        changes to metadata, including creation of new arrays or groups,
+        are not allowed.
+
+    Returns
+    -------
+    g : :class:`zarr.hierarchy.Group`
+        Group instance, opened with the consolidated metadata.
+
+    See Also
+    --------
+    consolidate_metadata
+
+    """
+
+    from .storage import ConsolidatedMetadataStore
+
+    # normalize parameters
+    store = normalize_store_arg(store)
+    if mode not in {'r', 'r+'}:
+        raise ValueError("invalid mode, expected either 'r' or 'r+'; found {!r}"
+                         .format(mode))
+
+    # setup metadata store
+    meta_store = ConsolidatedMetadataStore(store, metadata_key=metadata_key)

+    # pass through
+    return open(store=meta_store, chunk_store=store, mode=mode)
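
The shape of the consolidated object that `consolidate_metadata` produces can be sketched with a plain dict standing in for the store (the hierarchy keys here are illustrative); this mirrors the `is_zarr_key` filter above, which picks up only `.zarray`, `.zgroup` and `.zattrs` keys and skips chunk data:

```python
import json

def is_zarr_key(key):
    # only hierarchy metadata is consolidated; chunk data keys are skipped
    return key.endswith(('.zarray', '.zgroup', '.zattrs'))

store = {
    '.zgroup': b'{"zarr_format": 2}',
    'foo/.zarray': b'{"zarr_format": 2, "shape": [100], "chunks": [10]}',
    'foo/.zattrs': b'{"units": "m"}',
    'foo/0': b'\x00' * 80,  # chunk data, not metadata
}

out = {
    'zarr_consolidated_format': 1,
    'metadata': {key: json.loads(store[key].decode())
                 for key in store if is_zarr_key(key)},
}
store['.zmetadata'] = json.dumps(out).encode()

print(sorted(out['metadata']))  # ['.zgroup', 'foo/.zarray', 'foo/.zattrs']
```

The `zarr_consolidated_format` field versions the layout of the consolidated document, so readers can reject formats they do not understand.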

zarr/core.py
Lines changed: 3 additions & 0 deletions

@@ -165,6 +165,9 @@ def _load_metadata_nosync(self):
         if config is None:
             self._compressor = None
         else:
+            # temporary workaround for
+            # https://github.com/zarr-developers/numcodecs/issues/78
+            config = dict(config)
             self._compressor = get_codec(config)
 
         # setup filters
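
The `config = dict(config)` line guards against a consumer that mutates its argument, which is what the referenced numcodecs issue describes. A minimal sketch of why copying first protects a shared mapping (`get_codec_like` is a hypothetical stand-in, not the numcodecs API):

```python
def get_codec_like(config):
    # stands in for a consumer that destructively pops keys from its argument
    codec_id = config.pop('id')
    return codec_id, config

# compressor configuration that the array caches and may reuse later
cached = {'id': 'zlib', 'level': 1}

# passing a shallow copy leaves the cached configuration intact
codec_id, rest = get_codec_like(dict(cached))
print(cached)  # {'id': 'zlib', 'level': 1}
```

Without the copy, the popped key would silently vanish from the cached metadata, breaking any later reuse of it.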

zarr/creation.py
Lines changed: 16 additions & 10 deletions

@@ -346,15 +346,15 @@ def array(data, **kwargs):
     return z
 
 
-def open_array(store, mode='a', shape=None, chunks=True, dtype=None, compressor='default',
-               fill_value=0, order='C', synchronizer=None, filters=None,
-               cache_metadata=True, cache_attrs=True, path=None, object_codec=None,
-               **kwargs):
+def open_array(store=None, mode='a', shape=None, chunks=True, dtype=None,
+               compressor='default', fill_value=0, order='C', synchronizer=None,
+               filters=None, cache_metadata=True, cache_attrs=True, path=None,
+               object_codec=None, chunk_store=None, **kwargs):
     """Open an array using file-mode-like semantics.
 
     Parameters
     ----------
-    store : MutableMapping or string
+    store : MutableMapping or string, optional
         Store or path to directory in file system or name of zip file.
     mode : {'r', 'r+', 'a', 'w', 'w-'}, optional
         Persistence mode: 'r' means read only (must exist); 'r+' means

@@ -391,6 +391,8 @@ def open_array(store, mode='a', shape=None, chunks=True, dtype=None, compressor=
         Array path within store.
     object_codec : Codec, optional
         A codec to encode object arrays, only needed if dtype=object.
+    chunk_store : MutableMapping or string, optional
+        Store or path to directory in file system or name of zip file.
 
     Returns
     -------

@@ -426,7 +428,10 @@ def open_array(store, mode='a', shape=None, chunks=True, dtype=None, compressor=
     # a : read/write if exists, create otherwise (default)
 
     # handle polymorphic store arg
-    store = normalize_store_arg(store, clobber=(mode == 'w'))
+    clobber = mode == 'w'
+    store = normalize_store_arg(store, clobber=clobber)
+    if chunk_store is not None:
+        chunk_store = normalize_store_arg(chunk_store, clobber=clobber)
     path = normalize_storage_path(path)
 
     # API compatibility with h5py

@@ -448,7 +453,7 @@
         init_array(store, shape=shape, chunks=chunks, dtype=dtype,
                    compressor=compressor, fill_value=fill_value,
                    order=order, filters=filters, overwrite=True, path=path,
-                   object_codec=object_codec)
+                   object_codec=object_codec, chunk_store=chunk_store)
 
     elif mode == 'a':
         if contains_group(store, path=path):

@@ -457,7 +462,7 @@
         init_array(store, shape=shape, chunks=chunks, dtype=dtype,
                    compressor=compressor, fill_value=fill_value,
                    order=order, filters=filters, path=path,
-                   object_codec=object_codec)
+                   object_codec=object_codec, chunk_store=chunk_store)
 
     elif mode in ['w-', 'x']:
         if contains_group(store, path=path):

@@ -468,14 +473,15 @@
         init_array(store, shape=shape, chunks=chunks, dtype=dtype,
                    compressor=compressor, fill_value=fill_value,
                    order=order, filters=filters, path=path,
-                   object_codec=object_codec)
+                   object_codec=object_codec, chunk_store=chunk_store)
 
     # determine read only status
     read_only = mode == 'r'
 
     # instantiate array
-    z = Array(store, read_only=read_only, synchronizer=synchronizer,
-              cache_metadata=cache_metadata, cache_attrs=cache_attrs, path=path)
+    z = Array(store, read_only=read_only, synchronizer=synchronizer,
+              cache_metadata=cache_metadata, cache_attrs=cache_attrs, path=path,
+              chunk_store=chunk_store)
 
     return z
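
The new `chunk_store` argument lets metadata and chunk data live in separate mappings, which is what `open_consolidated` relies on: it passes the consolidated metadata store for metadata and the original store for chunks. A dict-based sketch of the routing idea (key names illustrative, not the zarr storage layer itself):

```python
# metadata resolved from one mapping (e.g. consolidated, in-memory) ...
meta_store = {'foo/.zarray': {'zarr_format': 2, 'shape': [10], 'chunks': [10]}}
# ... while chunk data still comes from the original (e.g. remote) store
chunk_store = {'foo/0': b'\x01\x02\x03'}

def read(key):
    # zarr metadata keys all have a basename starting with '.z'
    if key.rsplit('/', 1)[-1].startswith('.z'):
        return meta_store[key]
    return chunk_store[key]

print(read('foo/.zarray')['shape'])  # [10]
print(read('foo/0'))                 # b'\x01\x02\x03'
```

Because only metadata lookups are redirected, data reads and writes still go straight to the backend.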
