Commit e8c0b45

Merge pull request #105 from alimanfoo/edit-intro-20201023
Intro edit
2 parents 6d6565d + 83b52a7 commit e8c0b45

1 file changed

+128
-89
lines changed

docs/protocol/core/v3.0.rst

Lines changed: 128 additions & 89 deletions
@@ -14,13 +14,12 @@ Issue tracking:
1414
`GitHub issues <https://github.com/zarr-developers/zarr-specs/labels/core-protocol-v3.0>`_
1515

1616
Suggest an edit for this spec:
17-
`GitHub editor <https://github.com/zarr-developers/zarr-specs/blob/core-protocol-v3.0-dev/docs/protocol/core/v3.0.rst>`_
17+
`GitHub editor <https://github.com/zarr-developers/zarr-specs/blob/master/docs/protocol/core/v3.0.rst>`_
1818

19-
Copyright 2019-Present `Zarr core development
20-
team <https://github.com/orgs/zarr-developers/teams/core-devs>`_ (@@TODO
21-
list institutions?). This work is licensed under a `Creative Commons
22-
Attribution 3.0 Unported
23-
License <https://creativecommons.org/licenses/by/3.0/>`_.
19+
Copyright 2019-Present `Zarr core development team
20+
<https://github.com/orgs/zarr-developers/teams/core-devs>`_. This work
21+
is licensed under a `Creative Commons Attribution 3.0 Unported License
22+
<https://creativecommons.org/licenses/by/3.0/>`_.
2423

2524
----
2625

@@ -36,101 +35,107 @@ Status of this document
3635
=======================
3736

3837
This document is a **Work in Progress**. It may be updated, replaced
39-
or obsoleted by other documents at any time. It is inappropriate to
40-
cite this document as other than work in progress.
38+
or obsoleted by other documents at any time.
4139

4240
Comments, questions or contributions to this document are very
4341
welcome. Comments and questions should be raised via `GitHub issues
44-
<https://github.com/zarr-developers/zarr-specs/labels/core-protocol-v3.0>`_. When
45-
raising an issue, please add the label
46-
"core-protocol-v3.0". Contributions and suggested edits can be made
47-
via GitHub via the `online editor
48-
<https://github.com/zarr-developers/zarr-specs/blob/core-protocol-v3.0-dev/docs/protocol/core/v3.0.rst>`_
49-
or by making a pull request against the
50-
`"core-protocol-v3.0-dev" branch <https://github.com/zarr-developers/zarr-specs/tree/core-protocol-v3.0-dev>`_.
42+
<https://github.com/zarr-developers/zarr-specs/labels/core-protocol-v3.0>`_.
5143

5244
This document was produced by the `Zarr core development team
5345
<https://github.com/orgs/zarr-developers/teams/core-devs>`_.
5446

55-
Main difference with v2
56-
=======================
5747

58-
Zarr spec v2 was originally designed around local filesystem, but Zarr has
59-
grown and is now regularly deployed on cloud / object storage. Those kind of
60-
storage have characteristics, capabilities and usage patterns that can widely
61-
differ from the assumptions of spec v2. V3 is designed to consider online
62-
stores, in particular we want to achieve the following:
63-
64-
- No assumption that the underlying store has locking ability.
65-
- Ability to do concurrent writes with the assumption that writes from clients will be consistent, but not atomic.
66-
67-
Unlike Zarr spec v2, the spec v3 has mainly the following differences:
68-
- V3 is a flat key-value store instead of a hierarchical store. Hierarchy is implied.
69-
- V3 has an explicit root, while v2 roots and groups could not be distinguished.
70-
- Separation of the data and metadata key space.
71-
- Explicit support for extensions.
72-
- chunk separator is ``/`` by default.
73-
- `".json"` suffix for the metadata document by default.
74-
75-
This means that a store cannot be opened at an arbitrary point, but needs to be
76-
opened at the root. User facing convenience functions could walk a given
77-
hierarchy and return a sub-group, but this is not part of the API.
78-
79-
Goal and Non-Goal of v3 spec with respect to v2 spec
80-
====================================================
81-
82-
This section is informative and is present to help the reader familiar with
83-
previous version of zarr to find and understand the differences and the reasons
84-
behind them as well as guide the contributor during the draft and review
85-
period.
86-
87-
Better suitability for HPC file systems and network stores
88-
----------------------------------------------------------
89-
90-
One goal of the spec v3 is to have a design that minimized the number of
91-
round-trip operations that must done in order to understand the structure of a
92-
Zarr store. Especially on highly parallel file system and network stores
93-
listing keys and accessing metadata can be an expensive – high latency
94-
– operation. Thus a nested hierarchy listing all available groups, datasets
95-
and chunks can be a time consuming operation.
96-
97-
The v3 spec tries to separate the metadata, from group and dataset data
98-
using a prefix, as well as recommend a flatter way of storing keys in order to
99-
facilitate bulk operations. This should in particular allow to decrease the
100-
reliance on "metadata consolidation" seen with zarr v2.
101-
102-
Another related changes is the notion of implicit groups created when a dataset
103-
or chunk can be written via its full path even when the intermediate groups do
104-
not exist. This allow lock-free write operation for non-contending
105-
applications without the need for extra operations and round trip to create or
106-
check existence of intermediate groups.
107-
108-
Consideration of multiple programming languages
109-
-----------------------------------------------
110-
111-
Zarr spec v3 has an explicit goal of having better compatibility and easier
112-
implementation with programming languages other then Python. Thus a number of
113-
core features in previous spec have been relegated to extensions for the time
114-
being. This include in particular a reduction of the number of datatypes that
115-
are available in core.
116-
117-
Compatibility with the N5 project
118-
---------------------------------
119-
120-
The `N5 project <https://github.com/saalfeldlab/n5>`_ and Zarr have similar
121-
goals. One of the goal of Zarr Spec v3 is to provide compatibility for Most of
122-
Zarr v2 and N5 users in order to allow consolidation under the v3 spec with the
123-
end goal of merging the two projects.
48+
Introduction
49+
============
50+
51+
This specification defines a protocol for storage and retrieval of
52+
data that is organised as one or more multidimensional arrays. This
53+
type of data is common in scientific and numerical computing
54+
applications. Many domains are facing computational challenges as
55+
increasingly large volumes of data are being generated, for example,
56+
via high resolution microscopy, remote sensing imagery, genome
57+
sequencing or numerical simulation. The primary motivation for the
58+
development of Zarr has been to help address this challenge by
59+
enabling the storage of large multidimensional arrays in a way that is
60+
compatible with parallel and/or distributed computing applications.
61+
62+
This protocol specification is intended to supersede the `Zarr storage
63+
specification version 2
64+
<https://zarr.readthedocs.io/en/stable/spec/v2.html>`_ (Zarr v2). The
65+
Zarr v2 specification has been implemented in several programming
66+
languages and has been used successfully to store and analyse large
67+
scientific datasets from a variety of domains. However, as experience
68+
has been gained, it has become clear that there are several
69+
opportunities for modest but useful improvements to be made in the
70+
protocol, and for establishing a foundation that allows for greater
71+
interoperability, whilst also enabling a variety of more advanced and
72+
specialised features to be explored and developed.
73+
74+
This protocol specification also draws heavily on the `N5 API and
75+
file-system specification <https://github.com/saalfeldlab/n5>`_, which
76+
was developed in parallel to Zarr v2 and has many of the same design
77+
goals and features. This specification defines a core set of features
78+
at the intersection of both Zarr v2 and N5, and so aims to provide a
79+
common target that can be fully implemented across multiple
80+
programming environments and serve a wide range of applications.
81+
82+
In particular, we highlight the following areas motivating the
83+
development of this specification.
84+
85+
86+
Distributed storage
87+
-------------------
88+
89+
The Zarr v2 specification was originally developed and implemented for
90+
use with local filesystem storage only. It then became clear that the
91+
same protocol could also be used with distributed storage systems,
92+
including cloud object stores such as Amazon S3, Google Cloud Storage
93+
or Azure Blob Storage. However, distributed storage systems have a
94+
number of important differences from local file systems, both in terms
95+
of the features they support and their performance
96+
characteristics. For example, cloud stores have much greater latency
97+
per request than local file systems, and this means that certain
98+
operations such as exploring a hierarchy of arrays using the Zarr v2
99+
protocol can be unacceptably slow. Workarounds have been developed,
100+
such as the use of metadata consolidation, but there are opportunities
101+
for modifications to the core protocol that address these issues
102+
directly and work more performantly across a range of underlying
103+
storage systems with varying features and latency characteristics. For
104+
example, this protocol specification aims to minimise the number of
105+
storage requests required when opening and exploring a hierarchy of
106+
arrays.
107+
108+
109+
Interoperability
110+
----------------
111+
112+
While the Zarr v2 and N5 specifications have each been implemented in
113+
multiple programming languages, there is currently no feature parity
114+
across all implementations. This is in part because the feature set
115+
includes some features that are not easily translated or supported
116+
across different programming languages. This specification aims to
117+
define a set of core features that are useful and sufficient to
118+
address a significant fraction of use cases, but are also
119+
straightforward to implement fully across different programming
120+
languages. Additional functionality can then be layered via
121+
extensions, some of which may aim for wide adoption, some of which may
122+
be more specialised and have more limited implementation.
123+
124124

125125
Extensibility
126126
-------------
127127

128-
One of the Non-goal of Zarr Spec V3 is to cover all use cases in the core, and
129-
to provide a path forward for extensibility and future standardisation of
130-
extensions without the need to rely on the Zarr core team. A challenge is to
131-
make sure implementations of the Zarr protocol for which used extension are not
132-
available can still give user access to data without triggering corruption when
133-
possible.
128+
The development of systems for storage of very large array-like data
129+
is a very active area of research and development, and there are many
130+
possibilities that remain to be explored. A goal of this specification
131+
is to define a protocol with a number of clear extension points and
132+
mechanisms, in order to provide a framework for freely building on and
133+
exploring these possibilities. We aim to make this possible, whilst
134+
also providing pathways for a graceful degradation of functionality
135+
where possible, in order to retain interoperability. We also aim to
136+
provide a framework for community-defined extensions, which can be
137+
developed and published independently without requiring centralised
138+
coordination of all specifications.
134139

135140

136141
Questions that still need to be resolved
@@ -1493,6 +1498,40 @@ There are no group extensions as as Zarr v3.0
14931498
See https://github.com/zarr-developers/zarr-specs/issues/49 for a list of potential extensions
14941499

14951500

1501+
Comparison with Zarr v2
1502+
=======================
1503+
1504+
This section is informative.
1505+
1506+
Below is a summary of the key differences between this specification
1507+
(v3) and Zarr v2.
1508+
1509+
- In v3 each hierarchy has an explicit root, and must be opened at the
1510+
root. In v2 there was no explicit root and a hierarchy could be
1511+
opened at its original root or at any sub-group.
1512+
1513+
- In v3 the storage keys have been redesigned to separate the space of
1514+
keys used for metadata and data, by using different prefixes. This
1515+
is intended to allow for more performant listing and querying of
1516+
metadata documents on high latency stores. There are also
1517+
differences including a change to the default separator used to
1518+
construct chunk keys, and the addition of a key suffix for metadata
1519+
keys.
1520+
1521+
- v3 has explicit support for protocol extensions via defined
1522+
extension points and mechanisms.
1523+
1524+
- v3 allows for greater flexibility in how groups and arrays are
1525+
created. In particular, v3 supports implicit groups, which are
1526+
groups that do not have a metadata document but whose existence is
1527+
implied by descendant nodes. This change enables multiple arrays to
1528+
be created in parallel without generating any race conditions for
1529+
creating parent groups.
1530+
1531+
- The set of data types specified in v3 is smaller than in v2. Additional
1532+
data types will be defined via protocol extensions.
1533+
1534+
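The following is an illustrative sketch only and is not part of the specification text in this commit. It shows how separate key prefixes for metadata and data, together with implicit groups, might let a client reconstruct a hierarchy from a flat listing of metadata keys without reading any chunk keys. The prefix ``meta/root/``, the suffixes ``.array.json`` and ``.group.json``, the example keys and the function name ``classify_nodes`` are all assumptions chosen for the example; consult the body of the specification for the actual key layout.

```python
# Hypothetical key layout, for illustration only: the prefix and the
# suffixes below are assumptions, not quoted from the specification.
META_PREFIX = "meta/root/"
ARRAY_SUFFIX = ".array.json"
GROUP_SUFFIX = ".group.json"


def classify_nodes(keys):
    """Return (arrays, explicit_groups, implicit_groups) as sets of node paths."""
    arrays, explicit_groups = set(), set()
    for key in keys:
        if not key.startswith(META_PREFIX):
            continue  # keys outside the metadata prefix (e.g. chunk data) are skipped
        path = key[len(META_PREFIX):]
        if path.endswith(ARRAY_SUFFIX):
            arrays.add(path[:-len(ARRAY_SUFFIX)])
        elif path.endswith(GROUP_SUFFIX):
            explicit_groups.add(path[:-len(GROUP_SUFFIX)])

    # An implicit group has no metadata document of its own; its existence
    # is implied by a descendant node that does have one.
    implicit_groups = set()
    for node in arrays | explicit_groups:
        parts = node.split("/")
        for i in range(1, len(parts)):
            ancestor = "/".join(parts[:i])
            if ancestor not in explicit_groups:
                implicit_groups.add(ancestor)
    return arrays, explicit_groups, implicit_groups


keys = [
    "meta/root/a/b/temperature.array.json",
    "meta/root/a/c.group.json",
    "data/root/a/b/temperature/c0/0",
]
arrays, explicit_groups, implicit_groups = classify_nodes(keys)
```

In this sketch the groups ``a`` and ``a/b`` are never written explicitly, so two writers creating arrays under the same parent would not need to coordinate on creating that parent first.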
14961535
References
14971536
==========
14981537
