-
-
Notifications
You must be signed in to change notification settings - Fork 366
Fixed consolidated Group getitem with multi-part key #2363
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 1 commit
f4f8200
05cf652
488d57f
cbfc197
b08917c
a0c0ef6
912e582
1cf5bd4
5533ac7
259e09e
446c893
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -97,6 +97,24 @@ def _parse_async_node( | |
| raise TypeError(f"Unknown node type, got {type(node)}") | ||
|
|
||
|
|
||
| class _MixedConsolidatedMetadataException(Exception): | ||
| """ | ||
| A custom, *internal* exception for when we encounter mixed consolidated metadata. | ||
|
|
||
| Typically, Consolidated Metadata will explicitly indicate that there are no | ||
| additional children under a group with ``ConsolidatedMetadata(metadata={})``, | ||
| as opposed to ``metadata=None``. This is the behavior of ``consolidated_metadata``. | ||
| We rely on that "fact" to do I/O-free getitem: when a group's consolidated metadata | ||
| doesn't contain a child we can raise a ``KeyError`` without consulting the backing | ||
| store. | ||
|
|
||
| Users can potentially get themselves in situations where there's "mixed" consolidated | ||
| metadata. For now, we'll raise this error, catch it internally, and silently fall back | ||
| to the store (which will either succeed or raise its own KeyError, slowly). We might | ||
| want to expose this in the future, in which case rename it add it to zarr.errors. | ||
| """ | ||
|
|
||
|
|
||
| @dataclass(frozen=True) | ||
| class ConsolidatedMetadata: | ||
| """ | ||
|
|
@@ -571,7 +589,10 @@ def _from_bytes_v2( | |
|
|
||
| @classmethod | ||
| def _from_bytes_v3( | ||
| cls, store_path: StorePath, zarr_json_bytes: Buffer, use_consolidated: bool | None | ||
| cls, | ||
| store_path: StorePath, | ||
| zarr_json_bytes: Buffer, | ||
| use_consolidated: bool | None, | ||
| ) -> AsyncGroup: | ||
| group_metadata = json.loads(zarr_json_bytes.to_bytes()) | ||
| if use_consolidated and group_metadata.get("consolidated_metadata") is None: | ||
|
|
@@ -604,14 +625,22 @@ async def getitem( | |
|
|
||
| # Consolidated metadata lets us avoid some I/O operations so try that first. | ||
| if self.metadata.consolidated_metadata is not None: | ||
| return self._getitem_consolidated(store_path, key, prefix=self.name) | ||
| try: | ||
| return self._getitem_consolidated(store_path, key, prefix=self.name) | ||
| except _MixedConsolidatedMetadataException: | ||
| logger.info( | ||
| "Mixed consolidated and unconsolidated metadata. key=%s, store_path=%s", | ||
| key, | ||
| store_path, | ||
| ) | ||
| # now fall back to the non-consolidated variant | ||
|
|
||
| # Note: | ||
| # in zarr-python v2, we first check if `key` references an Array, else if `key` references | ||
| # a group,using standalone `contains_array` and `contains_group` functions. These functions | ||
| # are reusable, but for v3 they would perform redundant I/O operations. | ||
| # Not clear how much of that strategy we want to keep here. | ||
| elif self.metadata.zarr_format == 3: | ||
| if self.metadata.zarr_format == 3: | ||
| zarr_json_bytes = await (store_path / ZARR_JSON).get() | ||
| if zarr_json_bytes is None: | ||
| raise KeyError(key) | ||
|
|
@@ -661,18 +690,39 @@ def _getitem_consolidated( | |
| # getitem, in the special case where we have consolidated metadata. | ||
| # Note that this is a regular def (non async) function. | ||
| # This shouldn't do any additional I/O. | ||
| # All callers *must* catch _MixedConsolidatedMetadataException to ensure | ||
| # that we correctly handle the case where we need to fall back to doing | ||
| # additional I/O. | ||
|
|
||
| # the caller needs to verify this! | ||
| assert self.metadata.consolidated_metadata is not None | ||
|
|
||
| try: | ||
| metadata = self.metadata.consolidated_metadata.metadata[key] | ||
| except KeyError as e: | ||
| # The Group Metadata has consolidated metadata, but the key | ||
| # isn't present. We trust this to mean that the key isn't in | ||
| # the hierarchy, and *don't* fall back to checking the store. | ||
| msg = f"'{key}' not found in consolidated metadata." | ||
| raise KeyError(msg) from e | ||
| # we support nested getitems like group/subgroup/array | ||
| indexers = key.split("/") | ||
| indexers.reverse() | ||
| metadata: ArrayV2Metadata | ArrayV3Metadata | GroupMetadata = self.metadata | ||
|
|
||
| while indexers: | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. wouldn't the flattened form of the consolidated metadata make this a lot simpler?
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. maybe the problem with that is that it requires that the consolidated metadata be complete? whereas the iterative approach can handle a group with 'live' metadata.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sorry I missed this earlier. Flattened metadata would make this specific section simpler, but I think would complicate things a later since we'd when we need to "unflatten" it to put all of its children in its |
||
| indexer = indexers.pop() | ||
| if isinstance(metadata, ArrayV2Metadata | ArrayV3Metadata): | ||
| # we've indexed into an array with group["array/subarray"]. Invalid. | ||
| raise KeyError(key) | ||
| try: | ||
| if metadata.consolidated_metadata is None: | ||
| # we've indexed into a group without consolidated metadata. | ||
| # Note that the `None` case is different from `metadata={}` | ||
| # where we explicitly know we have no children. In the None | ||
| # case we have to fall back to non-consolidated metadata. | ||
| raise _MixedConsolidatedMetadataException(key) | ||
| assert metadata.consolidated_metadata is not None | ||
TomAugspurger marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
|
|
||
| metadata = metadata.consolidated_metadata.metadata[indexer] | ||
| except KeyError as e: | ||
| # The Group Metadata has consolidated metadata, but the key | ||
| # isn't present. We trust this to mean that the key isn't in | ||
| # the hierarchy, and *don't* fall back to checking the store. | ||
| msg = f"'{key}' not found in consolidated metadata." | ||
| raise KeyError(msg) from e | ||
|
|
||
| # update store_path to ensure that AsyncArray/Group.name is correct | ||
| if prefix != "/": | ||
|
|
@@ -1087,7 +1137,8 @@ async def members( | |
| self, | ||
| max_depth: int | None = 0, | ||
| ) -> AsyncGenerator[ | ||
| tuple[str, AsyncArray[ArrayV2Metadata] | AsyncArray[ArrayV3Metadata] | AsyncGroup], None | ||
| tuple[str, AsyncArray[ArrayV2Metadata] | AsyncArray[ArrayV3Metadata] | AsyncGroup], | ||
| None, | ||
| ]: | ||
| """ | ||
| Returns an AsyncGenerator over the arrays and groups contained in this group. | ||
|
|
@@ -1118,15 +1169,20 @@ async def members( | |
| async def _members( | ||
| self, max_depth: int | None, current_depth: int | ||
| ) -> AsyncGenerator[ | ||
| tuple[str, AsyncArray[ArrayV2Metadata] | AsyncArray[ArrayV3Metadata] | AsyncGroup], None | ||
| tuple[str, AsyncArray[ArrayV2Metadata] | AsyncArray[ArrayV3Metadata] | AsyncGroup], | ||
| None, | ||
| ]: | ||
| if self.metadata.consolidated_metadata is not None: | ||
| # we should be able to do members without any additional I/O | ||
| members = self._members_consolidated(max_depth, current_depth) | ||
|
|
||
| for member in members: | ||
| yield member | ||
| return | ||
| try: | ||
| members = self._members_consolidated(max_depth, current_depth) | ||
| except _MixedConsolidatedMetadataException: | ||
| # we've already logged this. We'll fall back to the non-consolidated version. | ||
| pass | ||
| else: | ||
| for member in members: | ||
| yield member | ||
| return | ||
|
|
||
| if not self.store_path.store.supports_listing: | ||
| msg = ( | ||
|
|
@@ -1177,17 +1233,28 @@ async def _members( | |
| def _members_consolidated( | ||
| self, max_depth: int | None, current_depth: int, prefix: str = "" | ||
| ) -> Generator[ | ||
| tuple[str, AsyncArray[ArrayV2Metadata] | AsyncArray[ArrayV3Metadata] | AsyncGroup], None | ||
| tuple[str, AsyncArray[ArrayV2Metadata] | AsyncArray[ArrayV3Metadata] | AsyncGroup], | ||
| None, | ||
| ]: | ||
| consolidated_metadata = self.metadata.consolidated_metadata | ||
|
|
||
| # we kind of just want the top-level keys. | ||
| if consolidated_metadata is not None: | ||
| for key in consolidated_metadata.metadata.keys(): | ||
| obj = self._getitem_consolidated( | ||
| self.store_path, key, prefix=self.name | ||
| ) # Metadata -> Group/Array | ||
| key = f"{prefix}/{key}".lstrip("/") | ||
| try: | ||
| obj = self._getitem_consolidated( | ||
| self.store_path, key, prefix=self.name | ||
| ) # Metadata -> Group/Array | ||
| key = f"{prefix}/{key}".lstrip("/") | ||
| except _MixedConsolidatedMetadataException: | ||
| logger.info( | ||
| "Mixed consolidated and unconsolidated metadata. key=%s, depth=%d, prefix=%s", | ||
| key, | ||
| current_depth, | ||
| prefix, | ||
| ) | ||
| # This isn't an async def function so we need to re-raise up one more level. | ||
| raise | ||
| yield key, obj | ||
|
|
||
| if ((max_depth is None) or (current_depth < max_depth)) and isinstance( | ||
|
|
@@ -1262,7 +1329,11 @@ async def full( | |
| self, *, name: str, shape: ChunkCoords, fill_value: Any | None, **kwargs: Any | ||
| ) -> AsyncArray[ArrayV2Metadata] | AsyncArray[ArrayV3Metadata]: | ||
| return await async_api.full( | ||
| shape=shape, fill_value=fill_value, store=self.store_path, path=name, **kwargs | ||
| shape=shape, | ||
| fill_value=fill_value, | ||
| store=self.store_path, | ||
| path=name, | ||
| **kwargs, | ||
| ) | ||
|
|
||
| async def empty_like( | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me know if this
MixConsolidatedMetadataExceptionstuff makes sense. The crux of the issue is thatGroup.__getitem__on a Group with consolidated metadata can be IO-free. To make that explicit, we define it as a plain (non-async) function.But this
group['subgroup']['subarray']presents a new challenge. What ifgrouphas consolidated metadata, butsubgroupdoesn't? Its consolidated metadata might beNone, meaning we need to fall back to the non-consolidated metadata. We discover this fact inAsyncGroup._getitem_consolidated, which isn't an async function, so it can't call the async non-consolidated getitem.To break through that tangle, I've added this custom exception. The expectation is that users never see it; we require (through docs & being careful?) that all the callers of
_getitem_consolidatedhandle this case by catching the error and falling back if needed.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I forgot to mention: this "mixed" case isn't going to be common. Under normal operation, where users call
consolidate_metadataand don't mutate the group afterwards we consolidate everything and so won't hit this. I believe the only time this can happen through the public API is when users add a new group to group with consolidated metadata, and don't then re-consolidate.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to support this case? I would think that users could expect that, when using consolidated metadata, then metadata lookups (beyond the first) will never trigger IO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some discussion on this here: #2363 (comment). I think there isn't really a clear answer. I'm sympathetic to users who want to know for certain that they're done with I/O after loading a Group with consolidated metadata.
If we decide not to support this, we do still have a decision to make: when we try to index into a nested Group and reach a node with non-consolidated metadata, do we raise a
KeyError(saying we know this key isn't there) or some public version of this MixedConsolidatedMetadataError (saying we don't know whether or not this key is there)?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a user of consolidated metadata, so take this with a grain of salt, but I would default to being strict: if people open a group in consolidated metadata mode, then IMO the contents of the consolidated metadata should purport to provide the complete model of the hierarchy, even if that model happens to be wrong :) But I'd prefer to temper my strict default with whatever expectations consolidated metadata users have.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to be clear: what does "strict" mean to you here?
I actually am having trouble constructing a hierarchy that hits this using the public API:
I think the semantics around what exact state in after mutating a consolidated hierarchy, but before reconsolidating, that we can do whatever we want.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for me, "strict" would mean that if a user opens a group with
use_consolidated=True, then the consolidated metadata for that group will be the single source of truth about the hierarchy, and so attempting to access a sub-node that actually exists, but isn't in that document, would raise aKeyError. Not sure if this is actually a reasonable position