@@ -79,88 +79,63 @@ at the intersection of both Zarr v2 and N5, and so aims to provide a
  common target that can be fully implemented across multiple
  programming environments and serve a wide range of applications.

+ In particular, we highlight the following areas motivating the
+ development of this specification.


+ Distributed storage
+ -------------------

- Main difference with v2
- =======================
+ The Zarr v2 specification was originally developed and implemented for
+ use with local filesystem storage only. It then became clear that the
+ same protocol could also be used with distributed storage systems,
+ including cloud object stores such as Amazon S3, Google Cloud Storage
+ or Azure Blob Storage. However, distributed storage systems have a
+ number of important differences from local file systems, both in terms
+ of the features they support and their performance
+ characteristics. For example, cloud stores have much greater latency
+ per request than local file systems, and this means that certain
+ operations such as exploring a hierarchy of arrays using the Zarr v2
+ protocol can be unacceptably slow. Workarounds have been developed,
+ such as the use of metadata consolidation, but there are opportunities
+ for modifications to the core protocol that address these issues
+ directly and perform better across a range of underlying
+ storage systems with varying features and latency characteristics. For
+ example, this protocol specification aims to minimise the number of
+ storage requests required when opening and exploring a hierarchy of
+ arrays.
+
+
+ Interoperability
+ ----------------
+
+ While the Zarr v2 and N5 specifications have each been implemented in
+ multiple programming languages, there is currently no feature parity
+ across all implementations. This is in part because the feature set
+ includes some features that are not easily translated or supported
+ across different programming languages. This specification aims to
+ define a set of core features that are useful and sufficient to
+ address a significant fraction of use cases, but are also
+ straightforward to implement fully across different programming
+ languages. Additional functionality can then be layered via
+ extensions, some of which may aim for wide adoption, some of which may
+ be more specialised and have more limited implementation.

- Zarr spec v2 was originally designed around local filesystem, but Zarr has
- grown and is now regularly deployed on cloud / object storage. Those kind of
- storage have characteristics, capabilities and usage patterns that can widely
- differ from the assumptions of spec v2. V3 is designed to consider online
- stores, in particular we want to achieve the following:
-
- - No assumption that the underlying store has locking ability.
- - Ability to do concurrent writes with the assumption that writes from clients will be consistent, but not atomic.
-
- Unlike Zarr spec v2, the spec v3 has mainly the following differences:
- - V3 is a flat key-value store instead of a hierarchical store. Hierarchy is implied.
- - V3 has an explicit root, while v2 roots and groups could not be distinguished.
- - Separation of the data and metadata key space.
- - Explicit support for extensions.
- - chunk separator is ``/ `` by default.
- - `".json" ` suffix for the metadata document by default.
-
- This means that a store cannot be opened at an arbitrary point, but needs to be
- opened at the root. User facing convenience functions could walk a given
- hierarchy and return a sub-group, but this is not part of the API.
-
- Goal and Non-Goal of v3 spec with respect to v2 spec
- ====================================================
-
- This section is informative and is present to help the reader familiar with
- previous version of zarr to find and understand the differences and the reasons
- behind them as well as guide the contributor during the draft and review
- period.
-
- Better suitability for HPC file systems and network stores
- ----------------------------------------------------------
-
- One goal of the spec v3 is to have a design that minimized the number of
- round-trip operations that must done in order to understand the structure of a
- Zarr store. Especially on highly parallel file system and network stores
- listing keys and accessing metadata can be an expensive – high latency
- – operation. Thus a nested hierarchy listing all available groups, datasets
- and chunks can be a time consuming operation.
-
- The v3 spec tries to separate the metadata, from group and dataset data
- using a prefix, as well as recommend a flatter way of storing keys in order to
- facilitate bulk operations. This should in particular allow to decrease the
- reliance on "metadata consolidation" seen with zarr v2.
-
- Another related changes is the notion of implicit groups created when a dataset
- or chunk can be written via its full path even when the intermediate groups do
- not exist. This allow lock-free write operation for non-contending
- applications without the need for extra operations and round trip to create or
- check existence of intermediate groups.
-
- Consideration of multiple programming languages
- -----------------------------------------------
-
- Zarr spec v3 has an explicit goal of having better compatibility and easier
- implementation with programming languages other then Python. Thus a number of
- core features in previous spec have been relegated to extensions for the time
- being. This include in particular a reduction of the number of datatypes that
- are available in core.
-
- Compatibility with the N5 project
- ---------------------------------
-
- The `N5 project <https://github.com/saalfeldlab/n5 >`_ and Zarr have similar
- goals. One of the goal of Zarr Spec v3 is to provide compatibility for Most of
- Zarr v2 and N5 users in order to allow consolidation under the v3 spec with the
- end goal of merging the two projects.

  Extensibility
  -------------

- One of the Non-goal of Zarr Spec V3 is to cover all use cases in the core, and
- to provide a path forward for extensibility and future standardisation of
- extensions without the need to rely on the Zarr core team. A challenge is to
- make sure implementations of the Zarr protocol for which used extension are not
- available can still give user access to data without triggering corruption when
- possible.
+ The development of systems for storage of very large array-like data
+ is a very active area of research and development, and there are many
+ possibilities that remain to be explored. A goal of this specification
+ is to define a protocol with a number of clear extension points and
+ mechanisms, in order to provide a framework for freely building on and
+ exploring these possibilities. We aim to make this possible, whilst
+ also providing pathways for a graceful degradation of functionality
+ where possible, in order to retain interoperability. We also aim to
+ provide a framework for community-defined extensions, which can be
+ developed and published independently without requiring centralised
+ coordination of all specifications.


  Questions that still need to be resolved
@@ -1523,6 +1498,40 @@ There are no group extensions as as Zarr v3.0
  See https://github.com/zarr-developers/zarr-specs/issues/49 for a list of potential extensions


+ Comparison with Zarr v2
+ =======================
+
+ This section is informative.
+
+ Below is a summary of the key differences between this specification
+ (v3) and Zarr v2.
+
+ - In v3 each hierarchy has an explicit root, and must be opened at the
+ root. In v2 there was no explicit root and a hierarchy could be
+ opened at its original root or at any sub-group.
+
+ - In v3 the storage keys have been redesigned to separate the space of
+ keys used for metadata and data, by using different prefixes. This
+ is intended to allow for more performant listing and querying of
+ metadata documents on high-latency stores. Other differences
+ include a change to the default separator used to construct chunk
+ keys, and the addition of a key suffix for metadata keys.
+
+ - v3 has explicit support for protocol extensions via defined
+ extension points and mechanisms.
+
+ - v3 allows for greater flexibility in how groups and arrays are
+ created. In particular, v3 supports implicit groups, which are
+ groups that do not have a metadata document but whose existence is
+ implied by descendant nodes. This change enables multiple arrays to
+ be created in parallel without generating any race conditions for
+ creating parent groups.
+
+ - The set of data types specified in v3 is smaller than in v2. Additional
+ data types will be defined via protocol extensions.
+
+
  References
  ==========
