|
| 1 | +.. _sharding-storage-transformer-v1: |
| 2 | + |
| 3 | +========================================== |
| 4 | +Sharding storage transformer (version 1.0) |
| 5 | +========================================== |
| 6 | +----------------------------- |
| 7 | + Editor's draft 18 02 2022 |
| 8 | +----------------------------- |
| 9 | + |
| 10 | +Specification URI: |
| 11 | + @@TODO |
| 12 | + http://purl.org/zarr/spec/storage_transformers/sharding/1.0 |
| 13 | +Issue tracking: |
| 14 | + `GitHub issues <https://github.com/zarr-developers/zarr-specs/labels/storage_transformers-sharding-v1.0>`_ |
| 15 | +Suggest an edit for this spec: |
| 16 | + `GitHub editor <https://github.com/zarr-developers/zarr-specs/blob/core-protocol-v3.0-dev/docs/storage_transformers/sharding/v1.0.rst>`_ |
| 17 | + |
| 18 | +Copyright 2022 `Zarr core development |
| 19 | +team <https://github.com/orgs/zarr-developers/teams/core-devs>`_ (@@TODO |
| 20 | +list institutions?). This work is licensed under a `Creative Commons |
| 21 | +Attribution 3.0 Unported |
| 22 | +License <https://creativecommons.org/licenses/by/3.0/>`_. |
| 23 | + |
| 24 | +---- |
| 25 | + |
| 26 | + |
| 27 | +Abstract |
| 28 | +======== |
| 29 | + |
| 30 | +This specification defines an implementation of the Zarr abstract |
| 31 | +storage transformer API introducing sharding. |
| 32 | + |
| 33 | + |
| 34 | +Motivation |
| 35 | +========== |
| 36 | + |
| 37 | +Sharding decouples the concept of chunks from storage keys, which become shards. |
| 38 | +This is helpful when the requirements for those don't align: |
| 39 | + |
| 40 | +- Data often needs to be read and written in small, independently |
| 41 | + compressible chunks, whereas |
| 42 | +- storage often is optimized for larger data per entry and fewer entries, e.g. |
| 43 | + as restricted by the file block size and maximum inode number for typical |
| 44 | + file systems. |
| 45 | + |
| 46 | +Large storage entries do not necessarily fit the access patterns of the data, so |
| 47 | +chunks might need to be smaller than the value stored under one storage key. In those |
| 48 | +cases sharding decouples the two. One shard corresponds to one storage key, but can contain multiple chunks: |
| 49 | + |
| 50 | +.. image:: sharding.png |
| 51 | + |
| 52 | + |
| 53 | +Document conventions |
| 54 | +==================== |
| 55 | + |
| 56 | +Conformance requirements are expressed with a combination of |
| 57 | +descriptive assertions and [RFC2119]_ terminology. The key words |
| 58 | +"MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", |
| 59 | +"SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in the normative |
| 60 | +parts of this document are to be interpreted as described in |
| 61 | +[RFC2119]_. However, for readability, these words do not appear in all |
| 62 | +uppercase letters in this specification. |
| 63 | + |
| 64 | +All of the text of this specification is normative except sections |
| 65 | +explicitly marked as non-normative, examples, and notes. Examples in |
| 66 | +this specification are introduced with the words "for example". |
| 67 | + |
| 68 | + |
| 69 | +Configuration |
| 70 | +============= |
| 71 | + |
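| 72 | +The sharding storage transformer is configured in the ``storage_transformers`` list of the array metadata (see :ref:`array-metadata`), for example: |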
| 73 | + |
| 74 | +.. code-block:: |
| 75 | +
|
| 76 | +    { |
| 77 | +        "storage_transformers": [ |
| 78 | +            { |
| 79 | +                "storage_transformer": "https://purl.org/zarr/spec/storage_transformers/sharding/1.0", |
| 80 | +                "configuration": { |
| 81 | +                    "format": "indexed", |
| 82 | +                    "chunks_per_shard": [ |
| 83 | +                        2, 2 |
| 84 | +                    ] |
| 85 | +                } |
| 86 | +            } |
| 87 | +        ] |
| 88 | +    } |
| 89 | +
|
| 90 | +
|
| 91 | +Sharding mechanism |
| 92 | +================== |
| 93 | + |
| 94 | +@@TODO |
| 95 | + |
| 96 | + |
| 97 | +Binary shard format |
| 98 | +=================== |
| 99 | + |
| 100 | +The only binary format is the ``indexed`` format, as specified by the ``format`` |
| 101 | +configuration key. Other binary formats might be added in future versions. |
| 102 | + |
| 103 | +In the indexed binary format chunks are written successively in a shard, where |
| 104 | +unused space between them is allowed, followed by an index referencing them. |
| 105 | +The index holds an `offset, length` pair of little-endian uint64 per chunk. |
| 106 | +The chunks are ordered in the index in row-major (C) order; for example, for (2, 2) chunks |
| 107 | +per shard an index would look like: |
| 108 | + |
| 109 | +.. code-block:: |
| 110 | +
|
| 111 | +    | chunk (0, 0)    | chunk (0, 1)    | chunk (1, 0)    | chunk (1, 1)    | |
| 112 | +    | offset | length | offset | length | offset | length | offset | length | |
| 113 | +    | uint64 | uint64 | uint64 | uint64 | uint64 | uint64 | uint64 | uint64 | |
| 114 | +
|
| 115 | +
|
| 116 | +Empty chunks are denoted by setting both offset and length to ``2^64 - 1``. |
| 117 | +The index always has the full shape of all possible chunks per shard, |
| 118 | +even if they are outside of the array size. |
| 119 | + |
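The index layout described above can be sketched in Python as follows. This is a minimal illustration, not part of the specification: the helper names ``encode_index`` and ``decode_index`` are hypothetical, and the sketch assumes the caller already holds the raw index bytes.

```python
import struct

# Sentinel from the text: offset and length are both 2^64 - 1 for an empty chunk.
EMPTY = 2**64 - 1


def encode_index(entries):
    """Pack (offset, length) pairs, or None for empty chunks, as little-endian uint64.

    `entries` must already be in row-major (C) order (hypothetical helper).
    """
    flat = []
    for entry in entries:
        flat.extend((EMPTY, EMPTY) if entry is None else entry)
    return struct.pack(f"<{len(flat)}Q", *flat)


def decode_index(raw, chunks_per_shard):
    """Map chunk coordinates within a shard to (offset, length), or None if empty."""
    n_chunks = 1
    for n in chunks_per_shard:
        n_chunks *= n
    values = struct.unpack(f"<{2 * n_chunks}Q", raw)
    index = {}
    for i in range(n_chunks):
        # Unravel the linear position i into row-major (C order) coordinates.
        coords, rest = [], i
        for n in reversed(chunks_per_shard):
            rest, c = divmod(rest, n)
            coords.append(c)
        offset, length = values[2 * i], values[2 * i + 1]
        empty = offset == EMPTY and length == EMPTY
        index[tuple(reversed(coords))] = None if empty else (offset, length)
    return index


raw = encode_index([(0, 10), None, (10, 5), (15, 7)])
index = decode_index(raw, (2, 2))
```

For (2, 2) chunks per shard the encoded index occupies 64 bytes (four pairs of 8-byte integers), and the second entry, chunk (0, 1), decodes as empty.
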
| 120 | +The actual order of the chunk content is not fixed and may be chosen by the implementation, |
| 121 | +as all possible write orders are valid according to this specification and therefore can |
| 122 | +be read by any other implementation. When writing partial chunks into an existing shard, no |
| 123 | +specific order of the existing chunks may be expected. Some writing strategies might be: |
| 124 | + |
| 125 | +* **Fixed order**: Specify a fixed order (e.g. row-major, column-major or Morton order). |
| 126 | + When replacing an existing chunk, the new chunk may be written in place if the existing chunk is larger or of equal size, |
| 127 | + leaving unused space up to an upper limit which might possibly be specified. |
| 128 | + Please note that for regular-sized uncompressed data all chunks have the same size and |
| 129 | + can therefore be replaced in place. |
| 130 | +* **Append-only**: Any chunk to write is appended to the existing shard, followed by an updated index. |
| 131 | + |
| 132 | +Any configuration parameters for the write strategy must not be part of the metadata document. |
| 133 | +Within a shard, Morton order is proposed as the write order, but this can easily be changed and customized, since any order can be read. |
| 134 | + |
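The append-only strategy can be sketched as below. This is a hedged illustration under stated assumptions, not a normative algorithm: the function name ``append_chunk`` is hypothetical, the whole shard is held in memory as bytes, the index is assumed to occupy the last 16 bytes per chunk of the shard, and the previous index is dropped rather than left behind as unused space.

```python
import struct

EMPTY = 2**64 - 1  # marks an empty chunk in the index


def append_chunk(shard, n_chunks, position, chunk):
    """Return a new shard with `chunk` appended and its index entry updated.

    `shard` is the existing shard as bytes (b"" for a new shard), `n_chunks`
    is the number of chunks per shard, and `position` is the row-major
    position of the chunk in the index (hypothetical helper).
    """
    index_size = 16 * n_chunks  # one (offset, length) uint64 pair per chunk
    if shard:
        data, raw_index = shard[:-index_size], shard[-index_size:]
        index = list(struct.unpack(f"<{2 * n_chunks}Q", raw_index))
    else:
        data, index = b"", [EMPTY] * (2 * n_chunks)
    # Any earlier content of this chunk stays behind as unused space,
    # which the indexed format allows.
    index[2 * position] = len(data)
    index[2 * position + 1] = len(chunk)
    return data + chunk + struct.pack(f"<{2 * n_chunks}Q", *index)


shard = append_chunk(b"", 4, 0, b"abc")
shard = append_chunk(shard, 4, 3, b"de")
shard = append_chunk(shard, 4, 0, b"XY")  # rewrite chunk 0; b"abc" becomes unused space
```

A reader only needs the trailing index to find each chunk, so the order in which the chunk contents were appended does not matter.
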
| 135 | + |
| 136 | +Key translation |
| 137 | +=============== |
| 138 | + |
| 139 | +The Zarr store interface is defined in terms of `keys` and `values`, |
| 140 | +where a `key` is a sequence of characters and a `value` is a sequence |
| 141 | +of bytes. |
| 142 | + |
| 143 | +@@TODO |
| 144 | + |
| 145 | + |
| 146 | +Store API implementation |
| 147 | +======================== |
| 148 | + |
| 149 | +@@TODO |
| 150 | + |
| 151 | +The section below defines an implementation of the Zarr abstract store |
| 152 | +interface (@@TODO link) in terms of the native operations of this |
| 153 | +storage system. Below ``fspath_to_key()`` is a function that |
| 154 | +translates file system paths to store keys, and ``key_to_fspath()`` is |
| 155 | +a function that translates store keys to file system paths, as defined |
| 156 | +in the section above. |
| 157 | + |
| 158 | +* ``get(key) -> value`` : Read and return the contents of the file at |
| 159 | + file system path ``key_to_fspath(key)``. |
| 160 | + |
| 161 | +* ``set(key, value)`` : Write ``value`` as the contents of the file at |
| 162 | + file system path ``key_to_fspath(key)``. |
| 163 | + |
| 164 | +* ``delete(key)`` : Delete the file or directory at file system path |
| 165 | + ``key_to_fspath(key)``. |
| 166 | + |
| 167 | +* ``list()`` : Recursively walk the file system from the base |
| 168 | + directory, returning an iterator over keys obtained by calling |
| 169 | + ``fspath_to_key(fp)`` for each descendant file path ``fp``. |
| 170 | + |
| 171 | +* ``list_prefix(prefix)`` : Obtain a file system path by calling |
| 172 | + ``key_to_fspath(prefix)``. If the result is a directory path, |
| 173 | + recursively walk the file system from this directory, returning an |
| 174 | + iterator over keys obtained by calling ``fspath_to_key(fp)`` for |
| 175 | + each descendant file path ``fp``. |
| 176 | + |
| 177 | +* ``list_dir(prefix)`` : Obtain a file system path by calling |
| 178 | + ``key_to_fspath(prefix)``. If the result is a directory path, list |
| 179 | + the directory children. Return a set of keys obtained by calling |
| 180 | + ``fspath_to_key(fp)`` for each child file path ``fp``, and a set of |
| 181 | + prefixes obtained by calling ``fspath_to_key(dp)`` for each child |
| 182 | + directory path ``dp``. |
| 183 | + |
| 184 | + |
| 185 | +References |
| 186 | +========== |
| 187 | + |
| 188 | +.. [RFC2119] S. Bradner. Key words for use in RFCs to Indicate |
| 189 | + Requirement Levels. March 1997. Best Current Practice. URL: |
| 190 | + https://tools.ietf.org/html/rfc2119 |
| 191 | +
|
| 192 | +
|
| 193 | +Change log |
| 194 | +========== |
| 195 | + |
| 196 | +@@TODO |