diff --git a/draft/ZEP0009.md b/draft/ZEP0009.md new file mode 100644 index 0000000..38f6027 --- /dev/null +++ b/draft/ZEP0009.md @@ -0,0 +1,746 @@ +--- +layout: default +title: ZEP0009 +description: This ZEP proposes a reworking of the Zarr v3 extension naming mechanism. +parent: draft ZEPs +nav_order: 9 +--- + +# ZEP 9 — Zarr Extension Naming + +Authors: + +- [Norman Rzepka](https://github.com/normanrz), scalable minds +- [Josh Moore](https://github.com/joshmoore), German BioImaging + +Status: Draft + +Type: Specification + +Created: 2025-01-21 + +## Abstract + +The Zarr v3 specification is unclear on several matters regarding the +definition of extensions in the `zarr.json` metadata. This proposal defines +immediate clarifications for the specification which will allow the centralized +creation of "raw" names as well as the decentralized use of URIs for +extensions. At the same time, we propose a longer-term strategy for evolving +extensions and extension points for discussion and potential approval by the +community to be rolled out to implementations at a future time. + +## Introduction + +The intention of the Zarr v3 spec was to provide a framework where +community-driven extensions can be created and managed by the community itself. +However, the process for that was unclear and, therefore, implicitly linked to +the ZEP process. Issues have been surfaced by users, who wanted to create and +use new extensions. For example, users wanted to +[create new codecs](https://github.com/zarr-developers/zarr-specs/pull/256), +[data types](https://github.com/zarr-developers/zarr-specs/pull/257) and +[an extension for consolidated metadata](https://github.com/zarr-developers/zarr-specs/pull/309). +We thank the members of the community for their efforts in discovering and communicating +these issues. + +The underlying problem is that the extension mechanism for v3 lacks detail in +its definition. While several extension points are defined in the core spec, +there is no advice about selecting names for these extensions or how naming +conflicts are avoided. Additionally, there is no mechanism for defining new +extensions that do not fit into the existing extension points. In fact, there +are contradictions in what is entirely permissible, including for example +whether codecs are within or required by the core v3 spec. + +Implementations have started to use the v3 spec and are making use of +extensions (e.g `numcodecs.*` codecs and `zstd` in zarr-python), which means +that any changes made to the v3 spec need to be compatible with the current +reality in the community. + +## Proposal + +This proposal has three phases. + +The [first phase](#Phase-1-Immediate-clarifications) clarifies that extensions +can be created by the community without the ZEP process. Extensions with +URI-based names can be created without any further coordination and extensions +with raw names only need to get the name registered in the +[`zarr-extensions` Github repository](https://github.com/zarr-developers/zarr-extensions). +The process for registering will be a lightweight PR-based process. + +The naming mechanism as well as clarifications in the core spec documented will +be implemented by the ZSC through +[PR330](https://github.com/zarr-developers/zarr-specs/pull/330) in the zarr-specs repository for +discussion concurrently with this ZEP. The changes we are proposing, marked +with a "🛠️" below, are interpretations of the current Zarr v3 core spec as well +as additions that are in-line with the spec evolution policy of the Zarr core +spec. The Zarr core spec document will be bumped to version 3.1. The metadata +in the `zarr.json` files will remain unchanged `zarr_format=3`. + +The [second phase](#Phase-2-New-extension-points) defines new extension points +and therefore will follow the active ZEP process. The goal is to encourage more +exploration by the Zarr community outside of what's currently defined with the +core specification. + +The [third phase](#Phase-3-Future-name-evolution) defines a possible future +evolution of the naming strategy, largely as a justification for decisions made +in the first two phases. It is non-normative and can be further discussed in +future ZEPs. + +Revisions to ZEP 0 will be paused until we are clear about the extension +mechanism and the implications for the role of the ZEP process in the future. +The ZEP process remains in place for changes to the Zarr v3 core spec including +the adoption of extensions into the core spec. + +## Definitions + +- "Core" refers to the Zarr v3 core specification as defined in the document + and not + necessarily if something is a MUST for implementations. +- "Metadata" is the Zarr metadata as stored in `zarr.json` files for arrays and + groups. +- "Extensions" as used in this document are components used in the metadata to + define and configure how metadata are interpreted by implementations. These + components include codecs, data types, chunk key encodings, chunk grids and + storage transformers. +- "Extension points" are locations within metadata where extension-related + metadata can be found. Current extension points are listed in the core spec, + e.g. `codecs`, `data_type`. As part of the [second phase](#Phase-2-New-extension-points), + we intend to add another general-purpose extension point called [`extensions`](#Extension-points). +- "Extension maintainers" are a person or a team that is responsible for + creating and maintaining an extension. + +## Phase 1: Immediate clarifications + +### Extension naming + +🛠️ We propose defining two categories of names for immediate use by extensions: +raw names and URI-based. + +1. **Raw names** MUST be assigned within a central repository and follow the + compatibility and versioning [v3 stability + policy](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#stability-policy). + The name assignment is managed through the [`zarr-extensions` Github + repository](https://github.com/zarr-developers/zarr-extensions), where + each extension is listed and either contains a spec document or links to + a spec document. Names are never unassigned or reassigned. The ZSC or by + delegation a maintainer team reserves the right to refuse name assignment + at its own discretion. + + - **Example:** `zstd` + - **Acceptd regex:** `^[a-z0-9-_.]+$` + +2. **URI-based names** can be used by anyone without further coordination + though the assumption is that users reasonably "own" the URI. Users MAY + make use of a persistent redirecting URL like [PURL](https://purl.archive.org/). + URIs have been chosen due to their potential for being self-documenting and + *strongly recommend* that the URL SHOULD resolve to a human-readable explanation of the extension, but + implementations SHOULD NOT attempt to resolve the URL during processing. + There are no guarantees in terms of versioning or compatibility. However, + preserving backwards-compatibility is strongly encouraged. See the + [versioning section](#Versioning-and-spec-evolution) below. + + - **Example:** `https://example.com/zarr3/consolidated-metadata` + - **Accepted regex:** `^https?:\/\/[^/?#]+[^?#]*$` + + +All names currently listed in the v3 specification are raw names. URIs were +mentioned in previous drafts of the v3 specification and are still referenced +under the codecs section: + +> "Each codec must be defined via a separate specification. In order to refer +> to codecs in array metadata documents, each codec must have a unique +> identifier, which is a URI that dereferences to a human-readable +> specification of the codec." +> (). + +🛠️ This proposal would drop the MUST requirement on the dereferencing of the +URI to a SHOULD and more widely specify that implemenations MAY use URI-based +key names throughout the specification. (See +[community registry](#Community-registry) below for a proposal on coordinating the use of +URIs.) + +### Extension definitions + +Extensions are defined in the `zarr.json` metadata either as objects or as +short-hand names. + +🛠️ In order to unify the general design of future extensions we would add a +MUST requirement for future extension objects to adhere to the following +definition. + +Objects have the following keys: +```json +{ + "name": "", + "configuration": { ... } # optional +} +``` + +If such an object is present, the field `must_understand` is implicitly set to +`True`. Implementations which encounter a `must_understand` extension that they +have not implemented MUST raise an exception. +An extension object can explicitly set `must_understand=False` if +implementations can ignore its presence, following the current guidelines in +the v3 specification. + +Instead of extension objects, short-hand names MAY continue to be used if no +configuration metadata is required. They would be equivalent to extension +objects with just a `name` key. This is in-line with +[the current wording of the spec](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#id11). + +### Extension specifications + +There is no strict requirement for extensions to have a formal specification. +However, for adoption in the community it is STRONGLY RECOMMENDED to write a +specification. + +For an extension with a raw name, the +[zarr-extensions](https://github.com/zarr-developers/zarr-extensions) +repository SHOULD be used to either publish the specification +directly or link to another location which does so. For extensions with URI-based names, it +is RECOMMENDED to publish the specification under the URI of the extension. + +### Extension points + +The v3 core spec defines a number of extension points for arrays and groups +that can hold extensions following the above recommendations: + +- `data_type` (array only) +- `codecs` (array only) +- `chunk_grids` (array only) +- `chunk_key_encoding` (array only) +- `storage_transformers` (array only) // extendable to groups + +The Zarr v3 core spec was originally designed to not contain any extensions, +such as data types, codecs etc. However, over time and through changing +authorship, some extensions were, in fact, listed in +[the core spec document](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#data-types). + +🛠️ We propose to acknowledge that the Zarr v3 core spec includes a few +extensions that are expected to be supported by all implementations and fix the +wording in the core spec document accordingly. Wording in each section will be +updated to clarify whether or not an implementation `MUST` support the +extension point. Other extensions become "core" by being listed in the core +spec document through a ZEP. The use of a "raw name" does **NOT** automatically +make an extension "core". + +The following extensions are (currently) listed or referred to in the core spec +and would be declared as "core" extensions under this proposal. + + * Data types: `(u)int{8,16,32,64}`, `float{32,64}`, `complex{64,128}`, `r*` + * Codecs: `bytes`, `transpose` + * Chunk grids: `regular` + * Chunk key encoding: `default`, `v2` + * Storage transformers: (none) + +### Example + +The following example represents an Array showing many of the proposed changes +described above: + +```javascript +{ + "zarr_format": 3, + "data_type": "https://example.com/zarr/string", // URI-based name, short-hand name + "chunk_key_encoding": { + "name": "default", // core + "configuration": { "separator": "." } + }, + "codecs": [ + { + "name": "https://numcodecs.dev/vlen-utf8" // URI-based name + }, + { + "name": "zstd", // raw name + "configuration": { ... } + } + ], + "chunk_grid": { + "name": "regular", // core + "configuration": { "chunk_shape": [ 32 ] } + }, + "shape": [ 128 ], + "dimension_names": [ "x" ], + "attributes": { ... }, + "storage_transformers": [] +} + +``` + +### Discussion + +#### Community registry + +Externally to this ZEP, we will work towards unification of these extension +points. We propose that community register and discuss their extensions on the +[zarr-extensions](https://github.com/zarr-developers/zarr-extensions) repository. +Eventually, we recommend that a maturity ranking be included in those listings +as in other plugin ecosystems. + +#### Reassigning or unassigning names + +We consider it a design goal to not allow datasets written with one set of +naming expectations to be unintentionally interpreted with *other* names by a +future version of an implementation. Without further mechanics, this means that +raw and URI-based names, once assigned, cannot be changed. (For a possible +evolution of this naming scheme, see [phase 3](#Phase-3-name-evolution) below.) +With this limitation, current implementations either already or quickly can be +updated to support the above proposal since no additional logic is necessary +beyond checking `must_understand`. + +#### Name assignment + +Raw names will be assigned through the [`zarr-extension`](https://github.com/zarr-developers/zarr-extensions) repository. While the Zarr steering council will initially maintain +this repository, it is intended that a community team will be formed to maintain +the repository long-term. + +An alternative would be to use the `zarr-specs` repository to deposit spec +documents for every assigned name. The process would be that extension +maintainers would open a PR (with a template) to register their desired name. +Extension maintainers could choose to publish their extensions spec in the +repository directly or link to an externally hosted spec. Updates of specs +hosted in `zarr-specs`, would be updated through PRs. A zarr-specs maintainer +team would be required to ensure a timely assignment of names and updates to +the specs. + +A third option would be to only store the names of the extensions in +`zarr-specs` repository and always link out to an externally hosted spec. In +that case, externally hosted could also be another repo under `zarr-developers` +that manages some extensions through its own maintenance structure. + +#### Review process + +To register an extension with a raw name, a community member needs to open a new +PR in the [zarr-extensions](https://github.com/zarr-developers/zarr-extensions) +repository. + +Each extension MUST have a README.md file that describes the extension and its metadata specification. Extensions SHOULD have a schema.json file that contains the JSON schema for the metadata, if the README.md does not provide a link to an external schema. Please note that all extensions documents will be licensed under the Creative Commons Attribution 3.0 Unported License. Only open a PR if you are willing to license your extension under this license. + +The PR will be reviewed by the Zarr steering council. We aim to be very open about registering extensions. The review will be done largely based on avoiding confusing extension names and preventing malicious activity as well as maintaining the formal requirements of the extensions. Extension maintainers are responsible for their extensions. Updates to the extensions will also be reviewed by the steering council. + +## Phase 2: New extension points + +Beyond the immediate clarifications that are outlined in phase 1, we believe +that additional extension points will further improve the Zarr v3 +specification. Since these introduce new interfaces that implementations SHOULD +be aware, we discuss them here for consideration as a ZEP. + +#### Using `must_understand` + +The v3 core spec allows for the addition of new metadata keys as part of spec +evolution. This is defined in the +["Extension Points" section](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#extension-points) +of the core spec. Currently, implementations MUST parse the entire metadata and +fail, if they find keys they cannot parse unless those are marked by +`must_understand=False`. We propose using this mechanism to evolve the spec and +add the keys that are necessary to achieve a well-defined extension mechanism. + +New keys in the metadata MAY contain objects that have a `must_understand` key +with value `false`. In that case, the key may be ignored by implementations +that cannot parse it. This is useful for extensions that aren't strictly +required for interaction with the data. If the new key holds a scalar value or +an array or doesn't not contain the `must_understand` key, it is implictily +`must_understand=True`. In that case, implementations MUST fail if they cannot +parse the key. + +### "extensions" extension point + +🛠️ To provide for more flexible, immediate, and de-centralized use cases, we +propose to also add another general-purpose extension point `extensions` on +both arrays and groups into which extensions MAY be added. + +The `extensions` object holds an array of extension definitions. The held array +MUST either have one or more extensions or the object MUST be omitted entirely. + +The key itself is implicitly `must_understand=True`. Implementations MAY set +`must_understand=False` if they can reliably determine that all extensions are +also `must_understand=False`. + +### Additional extension points + +We support the creation of additional extension points in the future but their +introduction should follow the ZEP process. In general, the overall number of +"core" extension points should be well-maintained and provide clear APIs which +can be implemented by a large number of libraries. ZEPs are the appropriate +mechanism to encourage a wide variety of opinions and consensus building. + +### Versioning and spec evolution + +We propose leaving the versioning of the core spec unchanged. That means the +value of `zarr_format=3` and new keys to the metadata MUST be understood by +implementations, i.e. they MUST fail if they find a key they don't know, and +all changes to the core spec MUST go through the ZEP process. + +Extensions SHOULD follow the +[stability policy defined in the Zarr v3 core spec](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#stability-policy). +However, this is not a strict requirement and extension maintainers can define +their own evolution processes. + +It is recommended that extensions evolve in a backwards-compatible manner +without explicitly stored versions, meaning (credit to Jeremy Maitin-Shepard): + +- Any metadata compatible with a previous version of the extension continues to + be correctly interpreted by implementations of the new version. +- New metadata written according to the new version of the extension either: + (a) is correctly interpreted by existing implementations of previous versions of the extension, or + (b) causes existing implementations of previous versions of the extension to report an error and not load it. +- New metadata written according to the new version of the extension MUST not + be successfully loaded by existing implementations of previous versions of + the extension with an incorrect interpretation. + +While it is recommended to maximize backwards-compatible changes, it is also +possible to evolve the extension in an intentionally backwards-incompatible +way, e.g.: + +- Choose a new name, e.g. append a version number to the existing name. +- Add a new key to the configuration (like `version`) that was disallowed by + the previous version of the extension, such that existing implementations + will fail when they encounter it. + +### Example + +The following example represents an Array showing many of the proposed changes +described above: + +```javascript +{ + "zarr_format": 3, + ..., + "extensions": [ // new general-purpose extension point + { + "name": "https://example.com/zarr/offset", // uri-based name + "configuration": { "offset": [ 12 ] } + }, + { + "name": "https://example.com/zarr/array-statistics", // uri-based name + "configuration": { + "min": 5, + "max": 12 + }, + "must_understand": false // optional extension + }, + { + "name": "https://example.com/zarr/consolidated-metadata", // uri-based name + "configuration": { ... } + "must_understand": false // optional extension + } + ], +} + +``` + +### Discussion + +#### Alternatives for the `extensions` extension point + +This proposal contains a new general-purpose extension point `extensions`, +which holds an array of extensions. This design allows to have the same +extension definition syntax across all extension points. It also avoids using +URIs as keys in JSON metadata and reduces pollution of the top-level namespace +in a `zarr.json`. Thus, the addition of top-level metadata keys remains +reserved to changes in the core spec. This MAY happen as part of the core spec +adopting functionality of an extension. + +There are alternative designs to be considered: + +##### Top-level metadata keys + +Instead of a general-purpose extension point, we could also add new top-level +extension keys to the metadata. + +```javascript +{ + "zarr_format": 3, + ... + "https://example.com/zarr/offset": { "offset": [ 12 ] }, + "https://example.com/zarr/array-statistics": { + "min": 5, + "max": 12 + }, + "https://example.com/zarr/consolidated-metadata": { + "must_understand": false, + ... + }, // optional extension + ... +} +``` + +In this case, there would be no explicit `configuration` key within an +extension definition, but instead all the keys of such a configuration would be +in the object itself. This would mean that there are two separate types of +extension definitions, i.e. `{"name":"", "configuration": {...}}` in +specialized extension points (e.g. `codecs`) and `"": {...}` for other +extensions. + +It would still be recommended to use an object with keys for the extension +definition to allow for evolution of the extension. + +In case an extension becomes adopted into the core spec, implementations +wouldn't need to be changed (only when changing the name from URI-based to raw +name). + +There has been some controversy about using URLs as keys in JSON metadata. +However, it has also been used effectively in formats such as JSON-LD (see +below). + +##### `extensions` object + +Instead of an array that holds the extension definitions, we could also use an object. + +```javascript +{ + "zarr_format": 3, + ... + "extensions": { + "https://example.com/zarr/offset": { "offset": [ 12 ] }, + "https://example.com/zarr/array-statistics": { + "min": 5, + "max": 12 + }, + "https://example.com/zarr/consolidated-metadata": { + "must_understand": false, + ... + } // optional extension + }, + ... +} +``` + +This alternative is similar to the top-level keys, with mostly the same implications. + +However, this alternative would reserve the top-level namespace for changes to +the core spec and, therefore, reduce pollution of the top-level namespace. + + +#### Applicaton to subnodes + +Conceptually, we propose that extensions defined on groups should be valid for +their child nodes. However, the details of how an implementation should +identify which extensions are active within an hierarchy are unclear. Relying +on traversing the hierarchy towards the root node is undesirable from a +performance point of view. By writing *some* metadata within the contained +subgroups and arrays this could be made easier. Options for what this metadata +could be include: + +1. A copy of the metadata + +```javascript +{ + "extensions": [ + { + "name": "https://example.com/my-extension", + "configuration": { ... full copy of the metadata ...} + } + ] + +} +``` + +2. A reference to the metadata as part of the extension itself + +```javascript +{ + "extensions": [ + { + "name": "https://example.com/my-extension", + "configuration": { + "reference": "../.." + } + } + ] + +} +``` + +3. A complimentary reference extension + +```javascript +{ + "extensions": [ + { + "name": "https://example.com/my-extension-ref", + "configuration": { + "reference": "../.." + } + } + ] + +} +``` + +4. A shared or even core reference extension + +```javascript +{ + "extensions": [ + { + "name": "https://zarr.dev/extensions/parent-ref", + "configuration": { + "reference": "../.." + } + } + ] +} +``` + +## Phase 3: Future name evolution + +The phase 1 proposal above prioritizes changes that can be made to the +specification in line with the current text, aware that there are +contradictions and a lack of clarity. + +It does not provide, however, a mechanism for evolving the assigned names. + +An extension that begins life as a URI that eventually migrates to having a raw +name will require implementations to be updated to check for both values. + +This section outlines why choices above were made (e.g., use of full URIs +rather than a shorter alternative) as the basis for a possible pathway to +evolving the naming strategy. These changes **would** require updating +implementations and therefore would require an additional ZEP but would be +backwards compatible with the clarifications above. + +### URIs everywhere + +By having a mechanism which maps all extension names to a global URI, the two +names (URI and raw) associated with an extensinon. In fact, the original name +assigned to the extension becomes its permanent internal representation even if +it can later be referred to by a shorter name. + +To demonstrate, under this representation the example above could be made +equivalent to: + +```javascript +{ + "https://zarr.dev/v3/array/data_type": "https://example.com/zarr/string", + "https://zarr.dev/v3/array/chunk_key_encoding": { + "name": "https://zarr.dev/v3/chunk_key_encodings/default", + }, + "https://zarr.dev/v3/array/codecs": [ + { + "name": "https://numcodecs.dev/vlen-utf8" + }, + { + "name": "https://zarr.dev/v3/codecs/zstd", + } + ], + "https://zarr.dev/v3/array/chunk_grid": { + "name": "https://zarr.dev/v3/chunk_grid/regular", + }, + "https://zarr.dev/v3/extensions": + { + "name": "https://example.com/zarr/offset", + }, + { + "name": "https://example.com/zarr/array-statistics", + }, + { + "name": "https://example.com/zarr/consolidated-metadata", + } + }, + "https://zarr.dev/v3/array/dimension_names": [ "x" ], + "https://zarr.dev/v3/attributes": { ... }, + "https://zarr.dev/v3/storage_transformers": [] +} +``` + +where all identifiers are now prefixed with a URI full scoping the value, +prefixed in this example with `https://zarr.dev/v3/`. The +specification would clearly list which prefix applies to all "raw" names within +the specification and any new "raw" names which are brought into the +specification **might** exist within URIs outside of `https://zarr.dev/` . + +### Implicit Naming Context + +For all intents and purposes, the immediate proposal defined in this document +defines a single, unversioned global naming context owned by the Zarr +organization which can provide such a mapping. This can be seen as a JSON +document of the form: + +``` +"@context": { + "bool": "https://zarr.dev/bool", // used for brevity and doesn't represent the final URI + ...etc... +} + +``` +This context is not written into each Zarr document but is implied. + +### Explicit Naming Context + +Through the ZEP process, we could make it possible to define an **explicit** +context within a zarr.json file which would override the default, unchaning +naming context. This would let us in the future slowly correct raw names as +necessary without a full breaking change to the format. + +``` +"@context": "https://zarr.dev/context/v3.1" +``` + +This explicit context reference would be written in the zarr.json itself. + + +### Relationship with JSON-LD + +Readers familiar with [JSON-LD](https://json-ld.org/) will recognize the +"@context" definition: + +- Each JSON-LD file MAY have a "@context" key which defines how name resolution + works. With that it's possible to either load a remote context field, or + define namespaces and fields inline. (For Zarr naming resolution, we would + likely want to avoid loading remote resources in favor of well-known contexts + which are cached within the implementations.) +- All keys within the JSON-LD are then resolved against this context. + *EVERYTHING* is a URI but can alternatively be referred to as + "https://example.com/field", "example:field" or for the default namespace + "field" + + +#### Prefixes and aliases + +As an existing and well-defined resolution process, other features of the +JSON-LD naming mechanism might be worth considering for future adoption. For +example, prefixes can be defined within the context, whether implicit or +explicit, which prevents the need to use the full URI. For example, with the +context: + +```javascript +"@context": { + "example": "https://example.com/some-prefix/", +``` + +```javascript +{ + "zarr_format": 3, + "data_type": "https://example.com/some-prefix/utf-16-string", +``` +becomes: +```javascript +{ + "zarr_format": 3, + "data_type": "example:utf-16-string", +``` + +A more advanced feature might be to introduce a short name inline with an +explicit context such that all uses of a given string can be replaced: + + +``` +"@context": { + "shrt": "https://example.com/some-prefix/long-name", +}, +... +"shrt": values +... +``` + +### Internal implications + +Under the hood, all name resolutions whether via implicit, explicit, or inline +context would convert identifer names to a URI which implementations could +reliably use for determining support. Once a URI is assigned, however, it +should never be changed. + + +## Copyright + +This proposal is licensed under [the Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).