Commit 3964eab

Zarr-v3 Consolidated Metadata (#2113)
* Fixed MemoryStore.list_dir
  Ensures that nested children are listed properly.
* fixup s3
* recursive Group.members
  This PR adds a recursive=True flag to Group.members, for recursively listing the members of some hierarchy. This is useful for Consolidated Metadata, which needs to recursively inspect children. IMO, it's useful (and simple) enough to include in the public API.
* Zarr-v3 Consolidated Metadata
  Implements the optional Consolidated Metadata feature of zarr-v3.
* fixup
* read zarr-v2 consolidated metadata
* check writable
* Handle non-root paths
* Some error handling
* cleanup
* refactor open
* remove dupe file
* v2 getitem
* fixup
* Optimized members
* Impl flatten
* Fixups
* doc
* nest the tests
* fixup
* Fixups
* fixup
* fixup
* fixup
* fixup
* consistent open_consolidated handling
* fixup
* make clear that flat_to_nested mutates
* fixup
* fixup
* Fixup
* fixup
* fixup
* fixup
* fixup
* added docs
* fixup
* Ensure empty dict
* fixed name
* fixup nested
* removed dupe tests
* fixup
* doc fix
* fixups
* fixup
* fixup
* v2 writer
* fixup
* fixup
* path fix
* Fixed v2 use_consolidated=False
* fixup
* Special case object dtype
  Closes #2315
* fixup
* docs
* pr review
* must_understand
* Updated from_dict checking
* cleanup
* cleanup
* Fixed fill_value
* fixup
1 parent 6b11bb8 commit 3964eab

14 files changed, +1732 -67 lines changed
docs/consolidated_metadata.rst

Lines changed: 74 additions & 0 deletions
@@ -0,0 +1,74 @@
+Consolidated Metadata
+=====================
+
+Zarr-Python implements the `Consolidated Metadata`_ extension to the Zarr Spec.
+Consolidated metadata can reduce the time needed to load the metadata for an
+entire hierarchy, especially when the metadata is being served over a network.
+Consolidated metadata essentially stores all the metadata for a hierarchy in the
+metadata of the root Group.
+
+Usage
+-----
+
+If consolidated metadata is present in a Zarr Group's metadata then it is used
+by default. The initial read to open the group will need to communicate with
+the store (reading from a file for a :class:`zarr.store.LocalStore`, making a
+network request for a :class:`zarr.store.RemoteStore`). After that, any subsequent
+metadata reads to get child Group or Array nodes will *not* require reads from the store.
+
+In Python, the consolidated metadata is available on the ``.consolidated_metadata``
+attribute of the ``GroupMetadata`` object.
+
+.. code-block:: python
+
+   >>> import zarr
+   >>> store = zarr.store.MemoryStore({}, mode="w")
+   >>> group = zarr.open_group(store=store)
+   >>> group.create_array(shape=(1,), name="a")
+   >>> group.create_array(shape=(2, 2), name="b")
+   >>> group.create_array(shape=(3, 3, 3), name="c")
+   >>> zarr.consolidate_metadata(store)
+
+If we open that group, the Group's metadata has a :class:`zarr.ConsolidatedMetadata`
+that can be used.
+
+.. code-block:: python
+
+   >>> consolidated = zarr.open_group(store=store)
+   >>> consolidated.metadata.consolidated_metadata.metadata
+   {'b': ArrayV3Metadata(shape=(2, 2), fill_value=np.float64(0.0), ...),
+    'a': ArrayV3Metadata(shape=(1,), fill_value=np.float64(0.0), ...),
+    'c': ArrayV3Metadata(shape=(3, 3, 3), fill_value=np.float64(0.0), ...)}
+
+Operations on the group to get children automatically use the consolidated metadata.
+
+.. code-block:: python
+
+   >>> consolidated["a"]  # no read / HTTP request to the Store is required
+   <Array memory://.../a shape=(1,) dtype=float64>
+
+With nested groups, the consolidated metadata is available on the children, recursively.
+
+.. code-block:: python
+
+   >>> child = group.create_group("child", attributes={"kind": "child"})
+   >>> grandchild = child.create_group("child", attributes={"kind": "grandchild"})
+   >>> consolidated = zarr.consolidate_metadata(store)
+
+   >>> consolidated["child"].metadata.consolidated_metadata
+   ConsolidatedMetadata(metadata={'child': GroupMetadata(attributes={'kind': 'grandchild'}, zarr_format=3, )}, ...)
+
+Synchronization and Concurrency
+-------------------------------
+
+Consolidated metadata is intended for read-heavy use cases on slowly changing
+hierarchies. For hierarchies where new nodes are constantly being added,
+removed, or modified, consolidated metadata may not be desirable.
+
+1. It will add some overhead to each update operation, since the metadata
+   would need to be re-consolidated to keep it in sync with the store.
+2. Readers using consolidated metadata will regularly see a "past" version
+   of the metadata, at the time they read the root node with its consolidated
+   metadata.
+
+.. _Consolidated Metadata: https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#consolidated-metadata
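
The read-saving and staleness behaviour described in this new page can be illustrated with a toy model. The sketch below is not zarr code: `CountingStore` and `consolidate` are hypothetical names, and the `consolidated_metadata` key layout is simplified for illustration. It only assumes that each node keeps its metadata under `<path>/zarr.json`.

```python
import json


class CountingStore:
    """Toy in-memory store that counts reads (hypothetical, not a zarr class)."""

    def __init__(self) -> None:
        self.data: dict[str, str] = {}
        self.reads = 0

    def get(self, key: str) -> str:
        self.reads += 1
        return self.data[key]


def consolidate(store: CountingStore, children: list[str]) -> None:
    """Copy each child's metadata document into the root document."""
    root = json.loads(store.data["zarr.json"])
    root["consolidated_metadata"] = {
        name: json.loads(store.data[f"{name}/zarr.json"]) for name in children
    }
    store.data["zarr.json"] = json.dumps(root)


store = CountingStore()
store.data["zarr.json"] = json.dumps({"node_type": "group"})
for name, shape in [("a", [1]), ("b", [2, 2]), ("c", [3, 3, 3])]:
    store.data[f"{name}/zarr.json"] = json.dumps({"node_type": "array", "shape": shape})

# Without consolidation: one read for the root plus one per child.
store.get("zarr.json")
unconsolidated = {name: json.loads(store.get(f"{name}/zarr.json")) for name in "abc"}
reads_without = store.reads

# With consolidation: a single read of the root answers every child lookup.
consolidate(store, ["a", "b", "c"])
store.reads = 0
root = json.loads(store.get("zarr.json"))
consolidated = root["consolidated_metadata"]
reads_with = store.reads

# A later write that skips re-consolidation is invisible to this reader.
store.data["d/zarr.json"] = json.dumps({"node_type": "array", "shape": [4]})
stale = "d" not in consolidated
```

In this model the unconsolidated path needs four reads and the consolidated path needs one, while the reader keeps seeing the hierarchy as it was when the root was read.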

docs/index.rst

Lines changed: 1 addition & 0 deletions
@@ -10,6 +10,7 @@ Zarr-Python
 
     getting_started
     tutorial
+    consolidated_metadata
     api/index
     spec
     release

src/zarr/api/asynchronous.py

Lines changed: 104 additions & 8 deletions
@@ -1,6 +1,7 @@
 from __future__ import annotations
 
 import asyncio
+import dataclasses
 import warnings
 from typing import TYPE_CHECKING, Any, Literal, cast
 
@@ -9,9 +10,17 @@
 
 from zarr.abc.store import Store
 from zarr.core.array import Array, AsyncArray, get_array_metadata
-from zarr.core.common import JSON, AccessModeLiteral, ChunkCoords, MemoryOrder, ZarrFormat
+from zarr.core.buffer import NDArrayLike
+from zarr.core.chunk_key_encodings import ChunkKeyEncoding
+from zarr.core.common import (
+    JSON,
+    AccessModeLiteral,
+    ChunkCoords,
+    MemoryOrder,
+    ZarrFormat,
+)
 from zarr.core.config import config
-from zarr.core.group import AsyncGroup
+from zarr.core.group import AsyncGroup, ConsolidatedMetadata, GroupMetadata
 from zarr.core.metadata import ArrayMetadataDict, ArrayV2Metadata, ArrayV3Metadata
 from zarr.errors import NodeTypeValidationError
 from zarr.storage import (
@@ -132,8 +141,64 @@ def _default_zarr_version() -> ZarrFormat:
     return cast(ZarrFormat, int(config.get("default_zarr_version", 3)))
 
 
-async def consolidate_metadata(*args: Any, **kwargs: Any) -> AsyncGroup:
-    raise NotImplementedError
+async def consolidate_metadata(
+    store: StoreLike,
+    path: str | None = None,
+    zarr_format: ZarrFormat | None = None,
+) -> AsyncGroup:
+    """
+    Consolidate the metadata of all nodes in a hierarchy.
+
+    Upon completion, the metadata of the root node in the Zarr hierarchy will be
+    updated to include all the metadata of child nodes.
+
+    Parameters
+    ----------
+    store: StoreLike
+        The store-like object whose metadata you wish to consolidate.
+    path: str, optional
+        A path to a group in the store to consolidate at. Only children
+        below that group will be consolidated.
+
+        By default, the root node is used so all the metadata in the
+        store is consolidated.
+    zarr_format : {2, 3, None}, optional
+        The zarr format of the hierarchy. By default the zarr format
+        is inferred.
+
+    Returns
+    -------
+    group: AsyncGroup
+        The group, with the ``consolidated_metadata`` field set to include
+        the metadata of each child node.
+    """
+    store_path = await make_store_path(store)
+
+    if path is not None:
+        store_path = store_path / path
+
+    group = await AsyncGroup.open(store_path, zarr_format=zarr_format, use_consolidated=False)
+    group.store_path.store._check_writable()
+
+    members_metadata = {k: v.metadata async for k, v in group.members(max_depth=None)}
+
+    # While consolidating, we want to be explicit about when child groups
+    # are empty by inserting an empty dict for consolidated_metadata.metadata
+    for k, v in members_metadata.items():
+        if isinstance(v, GroupMetadata) and v.consolidated_metadata is None:
+            v = dataclasses.replace(v, consolidated_metadata=ConsolidatedMetadata(metadata={}))
+            members_metadata[k] = v
+
+    ConsolidatedMetadata._flat_to_nested(members_metadata)
+
+    consolidated_metadata = ConsolidatedMetadata(metadata=members_metadata)
+    metadata = dataclasses.replace(group.metadata, consolidated_metadata=consolidated_metadata)
+    group = dataclasses.replace(
+        group,
+        metadata=metadata,
+    )
+    await group._save_metadata()
+    return group
 
 
 async def copy(*args: Any, **kwargs: Any) -> tuple[int, int, int]:
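
The consolidation steps in `consolidate_metadata` (collect members keyed by path, mark empty groups explicitly, then nest) can be sketched with plain dictionaries. This is an illustration only: `flat_to_nested` below is a hypothetical helper, plain dicts stand in for `GroupMetadata` and array metadata, and the `"children"` key is an invented placeholder for the nested consolidated metadata. The commit log notes that the real `_flat_to_nested` mutates its argument, whereas this sketch returns a new dict. It also assumes every parent path is present and is a group.

```python
from typing import Any


def flat_to_nested(flat: dict[str, dict[str, Any]]) -> dict[str, dict[str, Any]]:
    """Nest path-keyed metadata under each parent's "children" dict.

    Hypothetical helper: plain dicts stand in for the real metadata classes.
    """
    nested: dict[str, dict[str, Any]] = {}
    # Visit shallow paths first so a parent exists before its children.
    for path in sorted(flat, key=lambda p: p.count("/")):
        node = dict(flat[path])
        if node["node_type"] == "group":
            # Make empty groups explicit, mirroring the empty-dict step.
            node.setdefault("children", {})
        *parents, name = path.split("/")
        level = nested
        for parent in parents:
            level = level[parent]["children"]
        level[name] = node
    return nested


flat = {
    "a": {"node_type": "array", "shape": [1]},
    "child": {"node_type": "group"},
    "child/child": {"node_type": "group"},
}
nested = flat_to_nested(flat)
```

After the call, the grandchild is reachable as `nested["child"]["children"]["child"]` and carries an empty `"children"` dict, so a reader can tell an empty group apart from one whose members were never consolidated.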
@@ -256,8 +321,18 @@ async def open(
         return await open_group(store=store_path, zarr_format=zarr_format, **kwargs)
 
 
-async def open_consolidated(*args: Any, **kwargs: Any) -> AsyncGroup:
-    raise NotImplementedError
+async def open_consolidated(
+    *args: Any, use_consolidated: Literal[True] = True, **kwargs: Any
+) -> AsyncGroup:
+    """
+    Alias for :func:`open_group` with ``use_consolidated=True``.
+    """
+    if use_consolidated is not True:
+        raise TypeError(
+            "'use_consolidated' must be 'True' in 'open_consolidated'. Use 'open' with "
+            "'use_consolidated=False' to bypass consolidated metadata."
+        )
+    return await open_group(*args, use_consolidated=use_consolidated, **kwargs)
 
 
 async def save(
@@ -549,6 +624,7 @@ async def open_group(
     zarr_format: ZarrFormat | None = None,
     meta_array: Any | None = None,  # not used
     attributes: dict[str, JSON] | None = None,
+    use_consolidated: bool | str | None = None,
 ) -> AsyncGroup:
     """Open a group using file-mode-like semantics.
 
@@ -589,6 +665,22 @@ async def open_group(
         to users. Use `numpy.empty(())` by default.
     attributes : dict
         A dictionary of JSON-serializable values with user-defined attributes.
+    use_consolidated : bool or str, default None
+        Whether to use consolidated metadata.
+
+        By default, consolidated metadata is used if it's present in the
+        store (in the ``zarr.json`` for Zarr v3 and in the ``.zmetadata`` file
+        for Zarr v2).
+
+        To explicitly require consolidated metadata, set ``use_consolidated=True``,
+        which will raise an exception if consolidated metadata is not found.
+
+        To explicitly *not* use consolidated metadata, set ``use_consolidated=False``,
+        which will fall back to using the regular, non-consolidated metadata.
+
+        Zarr v2 allowed configuring the key storing the consolidated metadata
+        (``.zmetadata`` by default). Specify the custom key as ``use_consolidated``
+        to load consolidated metadata from a non-default key.
 
     Returns
     -------
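
The four modes of `use_consolidated` documented in this docstring (`None`, `True`, `False`, or a custom key) can be summarised as a small decision function. The sketch below is not the library's implementation: `resolve_consolidated_key` is a hypothetical name, it models only the Zarr v2 case where consolidated metadata lives under its own key, and the use of `ValueError` (including for a missing custom key) is an assumption, since the docstring only says that an exception is raised.

```python
from __future__ import annotations


def resolve_consolidated_key(
    use_consolidated: bool | str | None,
    available_keys: set[str],
    default_key: str = ".zmetadata",
) -> str | None:
    """Return the key to read consolidated metadata from, or None to skip it."""
    if use_consolidated is False:
        # Explicit opt-out: always read the regular per-node metadata.
        return None
    key = use_consolidated if isinstance(use_consolidated, str) else default_key
    if key in available_keys:
        return key
    if use_consolidated is None:
        # Default mode: silently fall back to per-node metadata.
        return None
    # True or an explicit key was requested, but nothing is stored there.
    raise ValueError(f"consolidated metadata requested but {key!r} was not found")
```

For example, `resolve_consolidated_key(None, set())` returns `None` (fall back quietly), while `resolve_consolidated_key(True, set())` raises.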
@@ -615,7 +707,9 @@ async def open_group(
         attributes = {}
 
     try:
-        return await AsyncGroup.open(store_path, zarr_format=zarr_format)
+        return await AsyncGroup.open(
+            store_path, zarr_format=zarr_format, use_consolidated=use_consolidated
+        )
     except (KeyError, FileNotFoundError):
         return await AsyncGroup.from_store(
             store_path,
@@ -777,7 +871,9 @@ async def create(
             )
         else:
             warnings.warn(
-                "dimension_separator is not yet implemented", RuntimeWarning, stacklevel=2
+                "dimension_separator is not yet implemented",
+                RuntimeWarning,
+                stacklevel=2,
             )
     if write_empty_chunks:
         warnings.warn("write_empty_chunks is not yet implemented", RuntimeWarning, stacklevel=2)

src/zarr/api/synchronous.py

Lines changed: 7 additions & 3 deletions
@@ -1,6 +1,6 @@
 from __future__ import annotations
 
-from typing import TYPE_CHECKING, Any
+from typing import TYPE_CHECKING, Any, Literal
 
 import zarr.api.asynchronous as async_api
 from zarr._compat import _deprecate_positional_args
@@ -90,8 +90,10 @@ def open(
         return Group(obj)
 
 
-def open_consolidated(*args: Any, **kwargs: Any) -> Group:
-    return Group(sync(async_api.open_consolidated(*args, **kwargs)))
+def open_consolidated(*args: Any, use_consolidated: Literal[True] = True, **kwargs: Any) -> Group:
+    return Group(
+        sync(async_api.open_consolidated(*args, use_consolidated=use_consolidated, **kwargs))
+    )
 
 
 def save(
@@ -208,6 +210,7 @@ def open_group(
     zarr_format: ZarrFormat | None = None,
     meta_array: Any | None = None,  # not used in async api
     attributes: dict[str, JSON] | None = None,
+    use_consolidated: bool | str | None = None,
 ) -> Group:
     return Group(
         sync(
@@ -223,6 +226,7 @@ def open_group(
                 zarr_format=zarr_format,
                 meta_array=meta_array,
                 attributes=attributes,
+                use_consolidated=use_consolidated,
             )
         )
     )

src/zarr/core/array.py

Lines changed: 4 additions & 0 deletions
@@ -73,6 +73,7 @@
     ArrayV3MetadataDict,
     T_ArrayMetadata,
 )
+from zarr.core.metadata.v3 import parse_node_type_array
 from zarr.core.sync import collect_aiterator, sync
 from zarr.errors import MetadataValidationError
 from zarr.registry import get_pipeline_class
@@ -165,6 +166,9 @@ async def get_array_metadata(
         # V3 arrays are comprised of a zarr.json object
         assert zarr_json_bytes is not None
         metadata_dict = json.loads(zarr_json_bytes.to_bytes())
+
+        parse_node_type_array(metadata_dict.get("node_type"))
+
     return metadata_dict
 
 
170174

src/zarr/core/common.py

Lines changed: 2 additions & 0 deletions
@@ -24,13 +24,15 @@
 ZARRAY_JSON = ".zarray"
 ZGROUP_JSON = ".zgroup"
 ZATTRS_JSON = ".zattrs"
+ZMETADATA_V2_JSON = ".zmetadata"
 
 ByteRangeRequest = tuple[int | None, int | None]
 BytesLike = bytes | bytearray | memoryview
 ShapeLike = tuple[int, ...] | int
 ChunkCoords = tuple[int, ...]
 ChunkCoordsLike = Iterable[int]
 ZarrFormat = Literal[2, 3]
+NodeType = Literal["array", "group"]
 JSON = None | str | int | float | Mapping[str, "JSON"] | tuple["JSON", ...]
 MemoryOrder = Literal["C", "F"]
 AccessModeLiteral = Literal["r", "r+", "a", "w", "w-"]
