ZEP10: Generic extensions proposal #67
Conversation
Hi @joshmoore, just a process question: it would seem beneficial to get this PR merged asap so it becomes visible as a draft ZEP on the ZEPs website. Who needs to approve that, and what checks would need to be done at this stage to allow merging? E.g., does someone just need to check that the document has the right structure for a ZEP? If so, I'd be happy to approve.
I know we've done that in the past for ZEPs, but then it is actually harder to comment on it: I'd need to open a separate issue for each comment.
For merging in the "Draft", yes, that suffices. From https://zarr.dev/zeps/active/ZEP0000.html#submitting-a-zep
I'm certainly all for leaving it open for a bit, especially for the discussion of the material that is only here (as @jbms has done above). I can manage having it open and synchronizing with the specs PR. That being said, if possible, I'd like to get it merged as a "Draft" and then will also keep updating it as necessary to stay in step with discussions on zarr-developers/zarr-specs#344
Seconding @jbms: I rate the ability to discuss the ZEP as a single PR much higher than seeing it listed on the ZEP web site, so I would rather we keep this PR open until it's clear that all the questions have been answered.
Note that in this example the extension is ``must_understand=true``, meaning an implementation which does not support the ``example.offset`` extension should raise an error.
when should that error be raised? when reading metadata, or when reading chunks?
If the impl doesn't know the `example.offset` extension, it must fail when parsing the metadata. It may fail with an out-of-bounds error when reading/writing data outside the domain. But that would be up to the specification for this extension to define.
> If the impl doesn't know the `example.offset` extension, it must fail when parsing the metadata.
It seems to me that a zarr-compatible application should be able to say, for example, "this is an array with shape `<shape>`, but I can't load chunks for you because of `<unknown extension>`". Your suggestion that the metadata document should be effectively unreadable prevents this.
> It seems to me that a zarr-compatible application should be able to say, for example, "this is an array with shape `<shape>`, but I can't load chunks for you because of `<unknown extension>`".
I think that would be a good implementation.
> I think that would be a good implementation.
Since the behavior I described relies on reading the metadata without an error, this PR should clarify the distinction between reading metadata documents and other IO operations (e.g., reading chunks, in this example).
If you are purely displaying information to a user and including a warning that an unknown extension was encountered, then displaying whatever information can be heuristically extracted from the metadata successfully may be reasonable.
In general though if there is an unknown extension, you can't really make any assumptions about the meaning of the metadata and any programmatic use is problematic.
For example, the `offset` extension may mean that the upper bound of the array is no longer indicated by `shape` but by `offset + shape`, and the chunk grid starts at `offset` rather than `(0, ...)`. Maybe there is some program that partitions zarr arrays according to the chunking and then hands off those zarr arrays to worker processes. If the partition program does not support the `offset` extension, but the worker program does, then the partition program will perform the partitioning incorrectly, yet the worker processes may process the partitions without errors, just not correctly aligned to the chunk grid.
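To make that example concrete, array metadata carrying such a hypothetical `example.offset` extension might look roughly like the following. This is a minimal sketch only: the exact layout of the proposed top-level `extensions` list is defined in zarr-developers/zarr-specs#344, and the surrounding fields are abbreviated.

```json
{
  "zarr_format": 3,
  "node_type": "array",
  "shape": [100, 100],
  "extensions": [
    {
      "name": "example.offset",
      "configuration": { "offset": [10, 10] },
      "must_understand": true
    }
  ]
}
```

An implementation that does not recognize `example.offset` cannot safely assume that `shape` and the chunk grid mean what they usually do, which is exactly the partitioning hazard described above.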
Concretely, I'd say that if there is an unknown `must_understand=true` extension, `zarr.open` and similar interfaces should not appear to succeed and allow querying properties like the chunk grid, dtype, etc. unless the user explicitly opts into ignoring unknown extensions.
> In general though if there is an unknown extension, you can't really make any assumptions about the meaning of the metadata and any programmatic use is problematic.
I find this outcome concerning, as it amounts to fragmenting the zarr ecosystem.
I think this document should explain why the pre-existing ...
For ...
I've only just scanned over it, but GeoZarr looks to be a metadata/layout standard for Zarr (like OME-Zarr), rather than a "Zarr extension". Such standards require no support from implementations like ...
Indeed, it is not. Just use `attributes`.
@LDeakin I'm really not sure it's so cut and dry. Things like GeoZarr do require support from other, more domain specific implementations like GDAL, Xarray, etc. Having worked on both GeoZarr and ZEP4, I'm not convinced that just saying "put whatever you want in attributes" is the right path forward. There needs to be a way to declare more explicitly that a dataset is conforming to one of these models, and a way to register them centrally.
Could not agree more! STAC has a domain-agnostic and battle-tested extension mechanism that could almost be directly copy/pasted into Zarr. If STAC took a similar approach of saying "just use ..." ...

Furthermore, the Zarr spec defines ...

I'm not sure why this is so controversial. A robust extension mechanism that provides 3rd parties the ability to extend the Zarr spec (exactly what ZEP10 describes itself as) doesn't hurt the Zarr community. It's a win-win for everyone involved. A weak extension mechanism (just use attributes) is a lose-lose for everyone involved.
Of course, just like OME-Zarr. But does GeoZarr need to change the core data model of Zarr? If not, it is well suited to ... GeoZarr could use ...

@jbms brought up the registration of attributes earlier in this thread. Is that not sufficient?
A recurring discussion around this ZEP is the ill-defined distinction between ... Anything standards-driven could be registered and part of ...
Generally speaking, I worry about two things here:
This is an implementation detail. The only real requirement is that the type of attributes be assignable to ...

```python
class ArraySpec(Generic[TAttr]):
    attributes: TAttr
    ...

class GroupSpec(Generic[TAttr, TItem]):
    attributes: TAttr
    members: Mapping[str, TItem]
```

The array object takes the type of its attributes as a type parameter. The group object takes the type of its attributes, and the type of its members, as type parameters. This allows pydantic to do runtime type checking of the attributes field + group members, and is largely sufficient for "statically typed" zarr hierarchies. For example, we implemented OME-Zarr using this approach. STAC could use pydantic-zarr to define an attributes model that performs runtime JSONschema validation of attributes elements that conform to a certain structure. I'm new to STAC so maybe there's an even better idea. But nothing needs to change about the spec for this to work.
I'll be disappearing for two weeks of wilderness shortly, and apologize now that I'm very unlikely to check in on this thread after this weekend. (I wish everyone at least as much respite.) A few quick responses to the above conversation and from geospatial-jeff/zarr-python#1:
From my side, (a) very happy for improved wording both on ZEP9 and ZEP10, (b) it might be as I mentioned previously that a community/spec/ZEP call to talk this through via voice would be easier. Alternatively, I'll try to piece apart the discussions above into separate issues and try to reach consensus on each in turn, including:
The current ZEP10 proposal indeed blurs the distinction between extensions that require changes to the zarr implementation, and extensions like OME-zarr that merely build on top of zarr, and may affect the higher-level interpretation of the array but require no change to the zarr implementation itself and do not affect how the array data itself is read or written.

I would propose instead that we define "registered attributes" also in the zarr-extensions repo, where registered attributes are included in `attributes` ...

For generic extensions that do require changes to the zarr implementation, I'd propose that they just be included as top-level attributes, but with a suitable prefix to avoid conflict with any future additions to the core zarr spec. The ...
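To illustrate that split, here is a hypothetical sketch (placeholder names only, not wording from any spec): a registered attribute lives under the ordinary `attributes` key and can be ignored by implementations that do not know it, while a generic extension needing implementation support sits at the top level under a reserved prefix:

```json
{
  "zarr_format": 3,
  "node_type": "group",
  "attributes": {
    "ome": { "version": "0.5" }
  },
  "ext:consolidated_metadata": { "kind": "inline", "metadata": {} }
}
```

Here `ome` stands in for some registered attribute and `ext:` for whatever prefix the core spec would reserve; both names, and the contents shown, are placeholders for illustration only.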
Thanks @jbms! I like the idea of registered attributes that get their space in the zarr-extensions repo. That might bring some backwards incompatibility issues, because we haven't reserved that namespace so far. However, I am not too concerned about that. I am also onboard with the idea of moving this proposal from an ... In any case, I think it would be useful to have a meeting to discuss this and potentially take a decision to move forward.
This evening during the community meeting, @jbms had the impression that there's generally more interest in the ...

Edit: There's no way to just add your own emojis. 🤦🏽
💯, I think taking this approach would resolve a lot of the concerns and questions I raised about this PR.
Thanks everyone for weighing in with some productive suggestions. It sounds like we are converging on a path forward for this concept. I think a synchronous meeting could be very useful to finalize a consensus. I'd love to include at least @joshmoore, @normanrz, @jbms, @d-v-b, @LDeakin, and myself in that meeting. The challenge is that we span the globe almost completely, and finding a comfortable time zone may be very difficult. In my opinion, the least painful option looks like this:
(Obviously this is most uncomfortable for Berlin and Canberra.) So my question for the group: could we make this timing work some time this week? I am free any day of this week (Aug 11 - Aug 14) at this time. Alternatively, feel free to propose a different time which you think could work better.
Thanks for kicking off the calendar-ing @rabernat, I'm also free any day this week at 11:00 pm Berlin time.
The week has almost passed, so I think we're looking at next week instead. Here's a Doodle Poll with some options in this time range for the rest of this week and next week. Please fill it out if you're interested in attending: https://doodle.com/group-poll/participate/azlpXj7d
I filled out the poll, but mostly to learn from the discussion, so please don't put much weight on my votes.
Right now the best options are looking like ...

@jbms, @joshmoore, & @normanrz: would you be available to join at either time?
Preference for Wed but I'll make either work.
I would like to observe.
I am traveling this week and likely won't be able to join.
If anyone is interested, I put together some slides that summarise the generic extension options I've seen discussed + my view on the pros/cons of each.
Thanks for that writeup @LDeakin. I have one new proposal for ...

Supposing that we agree that "must understand for writing" is always implicitly true, and that we only need to represent the single bit "must understand for reading", we could say that properties prefixed with an underscore are not required for reading. For example:

```json
{
  "zarr_format": 3,
  ...
  "_ext:consolidated_metadata": ...
}
```

or for a codec:

```json
{
  "name": "jpeg",
  "configuration": {
    "_quality": 80
  }
}
```

Pros: ...

Cons: ...
Last week a group of us met to discuss the concept of generic extensions. By focusing on specific use cases for this extension mechanism, we reached a somewhat surprising conclusion: we may not need ZEP10-style extensions right now. Instead, nearly all of the use cases we had in mind might conceivably be implemented using existing extension points OR user-level attributes (plus the “registered attributes” concept described above). In the latter case, the data are ultimately readable by simple Zarr implementations without any knowledge of the “extension.” Some example use cases we discussed were:
This leaves the question of consolidated metadata, which is currently implemented in Zarr Python but completely out of spec for Zarr 3. Our proposal for this is:
Most of the remaining questions about how these extensions should work are ultimately about consistency. Examples of consistency-related questions are:
Ultimately, we concluded that specs (including specs for registered attributes which enable behaviours such as described above) should simply try to describe the valid state of data at rest, and not attempt to resolve every consistency issue that might arise in getting into that state. This is consistent with usage today; in practice, most Zarr users today already assume there are relationships between multiple arrays in a group defined by node names and attributes (e.g. Xarray's coordinate variables), without the spec having to explicitly spell out how updates should work. For applications that require strong consistency for updates across multiple nodes, there are already implementations which offer transactions, such as Icechunk and Tensorstore. Of course there are many remaining questions about how these features should work, but we are determined to start experimenting with implementations which simply use attributes to define new behaviors, rather than introducing a new, hypothetical extension mechanism at this time. We can leave this ZEP open for whenever someone comes along who absolutely does need this mechanism for their use case and is motivated to continue work on it.
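As a rough illustration of "simply use attributes to define new behaviors" (a hypothetical sketch; the attribute names and structure here are placeholders, not registered names, and the rest of the metadata document is omitted), higher-level semantics such as units or multiscale layout would live entirely inside `attributes`, so a plain Zarr implementation can still read the data while an aware application layers the extra behavior on top:

```json
{
  "attributes": {
    "units": "micrometer",
    "multiscales": [
      { "datasets": [ { "path": "0" }, { "path": "1" } ] }
    ]
  }
}
```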
Thanks for the summary @rabernat! I guess the next steps would be to write up a (small) ZEP to define the "registered attributes" and, in parallel, prepare the zarr-extensions repo for registered attributes.
Agreed, thanks @rabernat! IIRC one motivation for pivoting towards "registered attributes" was to promote an implementation-first approach, such that the registered attributes ZEP could be accompanied by one or more concrete examples based on the proposed use-cases. Would anyone from the steering council have time to lead the process for one of the example use-cases, such that the broader community could follow that successful model for the other use-cases? FWIW I am most interested ...
I don't think unevenly chunked arrays would be best expressed as a registered attribute. Rather, the ...
I think a natural first registered attribute would be ...

Other attributes might want to document their spec directly in the zarr-extensions repo.
Big 👍 to this. I really like how OME puts all its metadata under a single top-level attribute. I sometimes wish we had decided to do that for Xarray.
This is a big challenge with a potentially big payoff: refactoring stuff like multiscales, units, etc. to be composable rather than part of a big monolithic standard like CF or OME.
FWIW, I think we should not do any ZEP work until we have this actually working.
An addendum comment from the Zoom conversation: I'm not in favor of leaving this PR open indefinitely. Instead, I will offer (when time permits) to add some changes and then merge it, marking it "withdrawn" or "rejected". When and if we come back to top-level extensions, I'd then suggest that we start with a new ZEP with the lessons learned from the attributes work.
This is a follow-on to ZEP9 (#65), since #66 limits the scope of ZEP9 solely to phase 1 so that it can be moved to accepted (since zarr-developers/zarr-specs#330 is merged and v3.1 released). This ZEP is equivalent to phase 2 of the original ZEP9 draft and introduces a top-level generic `extensions` field.

This ZEP will follow the process laid out in ZEP0 and invites votes from the newly refreshed @zarr-developers/implementation-council. This PR may be proactively merged as a draft, but will not be moved to "accepted" until the related PR on zarr-specs is voted on, merged, and v3.2 released.
Please see zarr-developers/zarr-specs#344 for detailed changes.