Commit 5d7c953

finished first draft of sharding spec
1 parent a0d021b commit 5d7c953

2 files changed: +57 -63 lines changed

docs/protocol/core/v3.0.rst

Lines changed: 13 additions & 8 deletions
@@ -393,7 +393,9 @@ conceptual model underpinning the Zarr protocol.
393393
an array_ in the underlying store_. Upon retrieval the original data is
394394
restored within the transformer. Any number of `predefined storage
395395
transformers`_ can be registered and stacked.
396+
See the `storage transformers details`_ below.
396397

398+
.. _`storage transformers details`: #storage-transformers-1
397399

398400
Node names
399401
==========
@@ -1034,14 +1036,14 @@ The following names are optional:
10341036

10351037
``storage_transformers``
10361038

1037-
Specifies a codec to be used for encoding and decoding chunks. The
1038-
value must be an object containing the name ``codec`` whose value
1039-
is a URI that identifies a codec and dereferences to a human-readable
1040-
representation of the codec specification. The codec
1039+
Specifies a stack of `storage transformers`_. Each value in the list must
1040+
be an object containing the name ``storage_transformer`` whose value
1041+
is a URI that identifies a storage transformer and dereferences to a
1042+
human-readable representation of the storage transformer specification. The
10411043
object may also contain a ``configuration`` object which consists of the
1042-
parameter names and values as defined by the corresponding codec
1043-
specification. When the ``compressor`` name is absent, this means that no
1044-
compressor is used.
1044+
parameter names and values as defined by the corresponding storage transformer
1045+
specification. When the ``storage_transformers`` name is absent no storage
1046+
transformer is used, same for an empty list.
10451047

10461048

10471049
All other names within the array metadata object are reserved for
@@ -1197,6 +1199,8 @@ a store implementation to support all of these capabilities.
11971199

11981200
A **readable store** supports the following operation:
11991201

1202+
@@TODO add bundled & partial access
1203+
12001204
``get`` - Retrieve the `value` associated with a given `key`.
12011205

12021206
| Parameters: `key`
@@ -1528,7 +1532,8 @@ Storage transformers
15281532

15291533
A Zarr storage transformer allows changing the Zarr-compatible data before storing it.
15301534
The stored transformed data is restored to its original state whenever data is requested
1531-
by the Array.
1535+
by the Array. Storage transformers can be configured per array via the
1536+
``storage_transformers`` name in the `array metadata`_.
15321537

15331538
A storage transformer serves the same `Abstract store interface`_ as the store_.
15341539
However, it should not persistently store any information necessary to restore the original data,

docs/storage_transformers/sharding/v1.0.rst

Lines changed: 44 additions & 55 deletions
Original file line numberDiff line numberDiff line change
@@ -27,8 +27,8 @@ License <https://creativecommons.org/licenses/by/3.0/>`_.
2727
Abstract
2828
========
2929

30-
This specification defines an implementation of the Zarr abstract
31-
storage transformer API introducing sharding.
30+
This specification defines an implementation of the Zarr
31+
storage transformer protocol for sharding.
3232

3333

3434
Motivation
@@ -69,7 +69,7 @@ this specification are introduced with the words "for example".
6969
Configuration
7070
=============
7171

72-
:ref:`array-metadata`.
72+
Sharding can be configured per array in the :ref:`array-metadata`:
7373

7474
.. code-block::
7575
@@ -87,11 +87,49 @@ Configuration
8787
]
8888
}
8989
90+
``format``
9091

91-
Sharding Mechanism
92-
=========================
92+
Specifies a `Binary shard format`_. In this version, the only binary format is the
93+
``indexed`` format.
9394

94-
@@TODO
95+
``chunks_per_shard``
96+
97+
An array of integers providing the number of chunks that are combined in a shard
98+
for each dimension of the Zarr array, where each chunk may only start at a position
99+
that is divisible by ``chunks_per_shard`` per dimension, e.g. starting at the zero-origin.
100+
The length of the array must match the length of the array metadata ``shape`` entry.
101+
For example, a value ``[32, 2]`` indicates that 64 chunks are combined in one shard,
102+
32 along the first dimension, and for each of those 2 along the second dimension.
103+
Valid starting positions for a shard in the chunk-grid are therefore ``[0, 0]``,
104+
``[32, 2]``, ``[32, 4]``, ``[64, 2]`` or ``[96, 18]``.
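
As an illustration of the ``chunks_per_shard`` arithmetic, the following is a minimal Python sketch, not part of the specification. The function names ``chunks_in_shard`` and ``is_valid_shard_start`` are hypothetical and chosen only for this example; the sketch assumes chunk-grid positions are given as non-negative integers, one per dimension.

```python
from math import prod


def chunks_in_shard(chunks_per_shard):
    """Total number of chunks combined in one shard (hypothetical helper)."""
    return prod(chunks_per_shard)


def is_valid_shard_start(chunk_pos, chunks_per_shard):
    """Check whether a chunk-grid position can be the start of a shard.

    Assumption: a shard starts where every coordinate is divisible by the
    matching ``chunks_per_shard`` entry, as described in the text above.
    """
    if len(chunk_pos) != len(chunks_per_shard):
        raise ValueError("position and chunks_per_shard must have equal length")
    return all(p % n == 0 for p, n in zip(chunk_pos, chunks_per_shard))
```

For ``chunks_per_shard=[32, 2]`` this gives 64 chunks per shard and accepts positions such as ``[0, 0]`` or ``[96, 18]``, while a position such as ``[33, 2]`` is rejected.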
105+
106+
107+
Key & value transformation
108+
==========================
109+
110+
The storage transformer protocol defines the abstract interface to be the same
111+
as the `Abstract store interface`_.
112+
113+
The Zarr store interface is defined in terms of `keys` and `values`,
114+
where a `key` is a sequence of characters and a `value` is a sequence
115+
of bytes. A key-value pair is called an entry in the following.
116+
117+
This sharding transformer only adapts entries where the key starts
118+
with `data/root`, as they indicate data keys for array chunks. All other
119+
entries are simply passed on.
120+
121+
Entries starting with `data/root` are grouped by their common shard, assuming
122+
`Storage keys` from a regular chunk grid which may use a custom
123+
``chunk separator``:
124+
For all entries that are part of the same shard the key is changed to the
125+
shard-key and the values are combined in the `Binary shard format`_ described
126+
below. The new shard-key is the chunk key divided by ``chunks_per_shard`` and
127+
floored per dimension. E.g. for ``chunks_per_shard=[32, 2]``, the chunk grid
128+
position ``[96, 18]`` (e.g. key "data/root/foo/baz/c96/18") is transformed to
129+
the shard grid position ``[3, 9]`` and reassigned to the respective new key,
130+
honoring the original chunk separator (e.g. "data/root/foo/baz/c3/9").
131+
Chunk grid positions ``[96, 19]``, ``[97, 18]``, …, up to ``[127, 19]`` will
132+
also have the same shard grid position ``[3, 9]``.
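
The key transformation described above can be sketched in Python as follows. This is a hedged illustration under stated assumptions, not the normative algorithm: the function name ``shard_key_for_chunk_key`` is hypothetical, and the key layout (array path, then ``/c``, then coordinates joined by the chunk separator) is inferred from the example key in the text.

```python
def shard_key_for_chunk_key(chunk_key, chunks_per_shard, separator="/"):
    """Map a chunk key to the key of the shard containing it (sketch).

    Assumptions: data keys look like ``data/root/<array path>/c<i><sep><j>...``
    and keys outside ``data/root`` are passed on unchanged.
    """
    if not chunk_key.startswith("data/root"):
        return chunk_key  # non-chunk entries are simply passed on

    # Split off the coordinate part that follows the last "/c".
    prefix, marker, coords = chunk_key.rpartition("/c")
    if not marker:
        raise ValueError(f"no chunk coordinates found in key: {chunk_key!r}")

    chunk_pos = [int(part) for part in coords.split(separator)]
    if len(chunk_pos) != len(chunks_per_shard):
        raise ValueError("key dimensionality does not match chunks_per_shard")

    # Divide by chunks_per_shard and floor, per dimension.
    shard_pos = [p // n for p, n in zip(chunk_pos, chunks_per_shard)]

    # Reassemble the key, honoring the original chunk separator.
    return prefix + "/c" + separator.join(str(p) for p in shard_pos)
```

With ``chunks_per_shard=[32, 2]``, the key ``data/root/foo/baz/c96/18`` maps to ``data/root/foo/baz/c3/9``, and so does ``data/root/foo/baz/c127/19``.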
95133

96134

97135
Binary shard format
@@ -133,55 +171,6 @@ Any configuration parameters for the write strategy must not be part of the meta
133171
in a shard I'd propose to use Morton order, but this can easily be changed and customized, since any order can be read.
134172

135173

136-
Key translation
137-
===============
138-
139-
The Zarr store interface is defined in terms of `keys` and `values`,
140-
where a `key` is a sequence of characters and a `value` is a sequence
141-
of bytes.
142-
143-
@@TODO
144-
145-
146-
Store API implementation
147-
========================
148-
149-
@@TODO
150-
151-
The section below defines an implementation of the Zarr abstract store
152-
interface (@@TODO link) in terms of the native operations of this
153-
storage system. Below ``fspath_to_key()`` is a function that
154-
translates file system paths to store keys, and ``key_to_fspath()`` is
155-
a function that translates store keys to file system paths, as defined
156-
in the section above.
157-
158-
* ``get(key) -> value`` : Read and return the contents of the file at
159-
file system path ``key_to_fspath(key)``.
160-
161-
* ``set(key, value)`` : Write ``value`` as the contents of the file at
162-
file system path ``key_to_fspath(key)``.
163-
164-
* ``delete(key)`` : Delete the file or directory at file system path
165-
``key_to_fspath(key)``.
166-
167-
* ``list()`` : Recursively walk the file system from the base
168-
directory, returning an iterator over keys obtained by calling
169-
``fspath_to_key(fp)`` for each descendant file path ``fp``.
170-
171-
* ``list_prefix(prefix)`` : Obtain a file system path by calling
172-
``key_to_fspath(prefix)``. If the result is a directory path,
173-
recursively walk the file system from this directory, returning an
174-
iterator over keys obtained by calling ``fspath_to_key(fp)`` for
175-
each descendant file path ``fp``.
176-
177-
* ``list_dir(prefix)`` : Obtain a file system path by calling
178-
``key_to_fspath(prefix)``. If the result is a director path, list
179-
the directory children. Return a set of keys obtained by calling
180-
``fspath_to_key(fp)`` for each child file path ``fp``, and a set of
181-
prefixes obtained by calling ``fspath_to_key(dp)`` for each child
182-
directory path ``dp``.
183-
184-
185174
References
186175
==========
187176
