diff --git a/docs/conf.py b/docs/conf.py
index 33351f98..9f74fa86 100644
--- a/docs/conf.py
+++ b/docs/conf.py
@@ -28,6 +28,7 @@
 # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
 # ones.
 extensions = [
+    'sphinxcontrib.mermaid'
 ]

 # Add any paths that contain templates here, relative to this directory.
diff --git a/docs/index.rst b/docs/index.rst
index 7a985d07..b243ce2c 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -10,8 +10,9 @@ Under construction.
    :caption: Contents:

    protocol
-   stores
    codecs
+   stores
+   storage_transformers

 Indices and tables
diff --git a/docs/protocol/core/v3.0.rst b/docs/protocol/core/v3.0.rst
index bb6858c9..5e4cf7df 100644
--- a/docs/protocol/core/v3.0.rst
+++ b/docs/protocol/core/v3.0.rst
@@ -383,6 +383,19 @@ conceptual model underpinning the Zarr protocol.
    interface`_ which is a common set of operations that stores may
    provide.

+.. _storage transformer:
+.. _storage transformers:
+
+*Storage transformer*
+
+   To enhance storage capabilities, storage transformers may change
+   the storage structure and behaviour of data coming from an array_
+   in the underlying store_. Upon retrieval, the original data is
+   restored within the transformer. Any number of `predefined storage
+   transformers`_ can be registered and stacked.
+   See the `storage transformers details`_ below.
+
+.. _`storage transformers details`: #storage-transformers-1

 Node names
 ==========
@@ -895,6 +908,8 @@ ignored if not understood::

     }

+.. _array-metadata:
+
 Array metadata
 --------------
@@ -1019,6 +1034,17 @@ The following names are optional:
    specification. When the ``compressor`` name is absent, this means
    that no compressor is used.

+``storage_transformers``
+
+   Specifies a stack of `storage transformers`_. Each value in the list
+   must be an object containing the name ``storage_transformer`` whose
+   value is a URI that identifies a storage transformer and dereferences
+   to a human-readable representation of the storage transformer
+   specification. The object may also contain a ``configuration`` object
+   which consists of the parameter names and values as defined by the
+   corresponding storage transformer specification. When the
+   ``storage_transformers`` name is absent or the list is empty, no
+   storage transformer is used.
+
 All other names within the array metadata object are reserved for
 future versions of this specification.
@@ -1141,6 +1167,9 @@ interface`_ subsection. The store interface can be
 implemented using a variety of underlying storage technologies,
 described in the subsection on `Store implementations`_.

+
+.. _abstract-store-interface:
+
 Abstract store interface
 ------------------------
@@ -1162,6 +1191,23 @@ one such pair for any given `key`. I.e., a store is a
 mapping from keys to values. It is also assumed that keys are case
 sensitive, i.e., the keys "foo" and "FOO" are different.

+To read and write partial values, a `range` specifies two integers,
+`range_start` and `range_length`, that select the part of the value
+starting at byte `range_start` (inclusive) and having a length of
+`range_length` bytes. `range_length` may be none, indicating all
+available data until the end of the referenced value. For example, the
+`range` ``[0, none]`` specifies the full value. Stores that do not
+support partial access can still answer such requests using cutouts
+of full values. It is recommended that the implementations of the
+``get_partial_values``, ``set_partial_values`` and ``erase_values``
+methods be made optional, providing fallbacks for them by default.
+However, it is recommended to supply those operations where possible
+for efficiency. Also, the ``get``, ``set`` and ``erase`` operations
+can easily be mapped onto their `partial_values` counterparts.
+Therefore, it is also recommended to supply fallbacks for those if the
+`partial_values` operations can be implemented. An entity containing
+those fallbacks could be named ``StoreWithPartialAccess``.
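+
+As a non-normative illustration, the following sketch shows how such
+fallbacks could be layered on top of a store that only implements
+``get``, ``set`` and ``erase``. The class name, error handling and
+zero-padding behaviour are assumptions made for the example, not part
+of this specification:
+
+.. code-block:: python
+
+    class StoreWithPartialAccess:
+        """Hypothetical mixin adding partial-value fallbacks."""
+
+        def get_partial_values(self, key_ranges):
+            # Fallback: answer range requests with cutouts of full values.
+            results = []
+            for key, (start, length) in key_ranges:
+                try:
+                    value = self.get(key)
+                except KeyError:  # assumed error type for missing keys
+                    results.append(None)
+                    continue
+                end = None if length is None else start + length
+                results.append(value[start:end])
+            return results
+
+        def set_partial_values(self, key_start_values):
+            # Fallback: read-modify-write the full value for each key.
+            for key, start, value in key_start_values:
+                try:
+                    old = bytearray(self.get(key))
+                except KeyError:
+                    old = bytearray()
+                if start > len(old):
+                    # Assumption: gaps before `start` are zero-padded.
+                    old.extend(b"\x00" * (start - len(old)))
+                old[start:start + len(value)] = value
+                self.set(key, bytes(old))
+
+        def erase_values(self, keys):
+            # Fallback: erase the given keys one by one.
+            for key in keys:
+                self.erase(key)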
+
 The store interface also defines some operations involving
 `prefixes`. In the context of this interface, a prefix is a string
 containing only characters that are valid for use in `keys` and ending
@@ -1173,11 +1219,20 @@ a store implementation to support all of these capabilities.

 A **readable store** supports the following operation:

+@@TODO add bundled & partial access
+
 ``get`` - Retrieve the `value` associated with a given `key`.

    | Parameters: `key`
    | Output: `value`

+``get_partial_values`` - Retrieve possibly partial `values` from given `key_ranges`.
+
+   | Parameters: `key_ranges`: ordered set of `key`, `range` pairs,
+   |   a `key` may occur multiple times with different `ranges`
+   | Output: list of `values`, in the order of the `key_ranges`,
+   |   may contain none for missing keys
+
 A **writeable store** supports the following operations:

 ``set`` - Store a (`key`, `value`) pair.
@@ -1185,11 +1240,25 @@ A **writeable store** supports the following operations:

    | Parameters: `key`, `value`
    | Output: none

+``set_partial_values`` - Store `values` at the given `keys`, each starting at byte `range_start`.
+
+   | Parameters: `key_start_values`: set of `key`, `range_start`,
+   |   `value` triples; a `key` may occur multiple times with
+   |   different `range_starts`; the ranges given by each `range_start`
+   |   and the length of the respective `value` must not overlap for
+   |   the same `key`
+   | Output: none
+
 ``erase`` - Erase the given key/value pair from the store.

    | Parameters: `key`
    | Output: none

+``erase_values`` - Erase the given key/value pairs from the store.
+
+   | Parameters: `keys`: set of `keys`
+   | Output: none
+
 ``erase_prefix`` - Erase all keys with the given prefix from the store:

    | Parameter: `prefix`
@@ -1298,6 +1367,8 @@ Note that any non-root hierarchy path will have
 ancestor paths that identify ancestor nodes in the hierarchy. For
 example, the path "/foo/bar" has ancestor paths "/foo" and "/".

+.. _storage-keys:
+
 Storage keys
 ------------
@@ -1499,6 +1570,42 @@ Let "+" be the string concatenation operator.

 For a listable store, ``list_dir(parent(P))`` can be an alternative.

+Storage transformers
+====================
+
+A Zarr storage transformer allows changing zarr-compatible data before
+it is stored. The stored transformed data is restored to its original
+state whenever data is requested by the array. Storage transformers
+can be configured per array via the ``storage_transformers`` name in
+the `array metadata`_. Storage transformers which do not change the
+storage layout (e.g. for caching) may be specified at runtime without
+adding them to the array metadata.
+
+A storage transformer serves the same `Abstract store interface`_ as
+the store_. However, it should not persistently store any information
+necessary to restore the original data, but instead propagates this to
+the next storage transformer or the final store. From the perspective
+of an array or a preceding storage transformer, stores and storage
+transformers follow the same protocol and are interchangeable as far
+as the protocol is concerned. The behaviour can still differ, e.g.
+requests may be cached or the form of the underlying data can change.
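+
+A minimal, non-normative sketch of this wrapping pattern is shown
+below; the class name and constructor are illustrative assumptions.
+A concrete transformer would override the relevant operations while
+forwarding everything else unchanged:
+
+.. code-block:: python
+
+    class PassThroughTransformer:
+        """Hypothetical storage transformer that changes nothing.
+
+        It exposes the same interface as a store and forwards all
+        operations to the next stage, which is either another
+        transformer or the final store.
+        """
+
+        def __init__(self, inner):
+            self.inner = inner  # next storage transformer or store
+
+        def get(self, key):
+            return self.inner.get(key)
+
+        def set(self, key, value):
+            self.inner.set(key, value)
+
+        def erase(self, key):
+            self.inner.erase(key)
+
+        # list, list_prefix, list_dir and the partial-value
+        # operations would be forwarded in the same way.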
+
+Storage transformers may be stacked to combine different
+functionalities:
+
+.. mermaid::
+
+    graph LR
+        Array --> t1
+        subgraph stack [Storage transformers]
+        t1[Transformer 1] --> t2[...] --> t3[Transformer N]
+        end
+        t3 --> Store
+
+A fixed set of predefined storage transformers is recommended for
+implementation with this protocol:
+
+
+Predefined storage transformers
+-------------------------------
+
+- :ref:`sharding-storage-transformer-v1`
+
 Protocol extensions
 ===================
diff --git a/docs/requirements.txt b/docs/requirements.txt
index 1095b98a..e2adc36d 100644
--- a/docs/requirements.txt
+++ b/docs/requirements.txt
@@ -1,3 +1,3 @@
 sphinx==2.0.1
 pydata-sphinx-theme
-
+sphinxcontrib-mermaid
diff --git a/docs/storage_transformers.rst b/docs/storage_transformers.rst
new file mode 100644
index 00000000..c86ce4ea
--- /dev/null
+++ b/docs/storage_transformers.rst
@@ -0,0 +1,11 @@
+====================
+Storage Transformers
+====================
+
+Under construction.
+
+.. toctree::
+   :maxdepth: 1
+   :caption: Contents:
+
+   storage_transformers/sharding/v1.0
diff --git a/docs/storage_transformers/sharding/sharding.png b/docs/storage_transformers/sharding/sharding.png
new file mode 100644
index 00000000..85a3c331
Binary files /dev/null and b/docs/storage_transformers/sharding/sharding.png differ
diff --git a/docs/storage_transformers/sharding/v1.0.rst b/docs/storage_transformers/sharding/v1.0.rst
new file mode 100644
index 00000000..66c7fe98
--- /dev/null
+++ b/docs/storage_transformers/sharding/v1.0.rst
@@ -0,0 +1,279 @@
+.. _sharding-storage-transformer-v1:
+
+==========================================
+Sharding storage transformer (version 1.0)
+==========================================
+-----------------------------
+ Editor's draft 18 02 2022
+-----------------------------
+
+Specification URI:
+    @@TODO
+    http://purl.org/zarr/spec/storage_transformers/sharding/1.0
+Issue tracking:
+    `GitHub issues `_
+Suggest an edit for this spec:
+    `GitHub editor `_
+
+Copyright 2022 `Zarr core development team `_ (@@TODO list
+institutions?). This work is licensed under a `Creative Commons
+Attribution 3.0 Unported License `_.
+
+----
+
+
+Abstract
+========
+
+This specification defines an implementation of the Zarr storage
+transformer protocol for sharding.
+
+Sharding co-locates multiple chunks within a storage object, bundling
+them in shards.
+
+
+Motivation
+==========
+
+In many cases it becomes inefficient or impractical to store a large
+number of chunks as single files or objects due to the design
+constraints of the underlying storage, for example the file block size
+and maximum inode number of typical file systems.
+
+Increasing the chunk size works only up to a certain point, as chunk
+sizes need to remain small to meet read-efficiency requirements, for
+example to stream data in browser-based visualization software.
+
+Therefore, chunks may need to be smaller than the minimum size of one
+storage key. In those cases it is efficient to store objects at a
+coarser granularity than the granularity at which chunks are read.
+Sharding solves this by allowing multiple chunks to be stored in one
+storage key, which is called a shard:
+
+.. image:: sharding.png
+
+
+Document conventions
+====================
+
+Conformance requirements are expressed with a combination of
+descriptive assertions and [RFC2119]_ terminology. The key words
+"MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD",
+"SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in the normative
+parts of this document are to be interpreted as described in
+[RFC2119]_. However, for readability, these words do not appear in all
+uppercase letters in this specification.
+
+All of the text of this specification is normative except sections
+explicitly marked as non-normative, examples, and notes. Examples in
+this specification are introduced with the words "for example".
+
+
+Configuration
+=============
+
+Sharding can be configured per array in the :ref:`array-metadata`:
+
+.. code-block::
+
+    {
+        "storage_transformers": [
+            {
+                "storage_transformer": "https://purl.org/zarr/spec/storage_transformers/sharding/1.0",
+                "configuration": {
+                    "format": "indexed",
+                    "chunks_per_shard": [
+                        2,
+                        2
+                    ]
+                }
+            }
+        ]
+    }
+
+``format``
+
+   Specifies a `Binary shard format`_. In this version, the only
+   binary format is the ``indexed`` format.
+
+``chunks_per_shard``
+
+   An array of integers providing the number of chunks that are
+   combined in a shard for each dimension of the Zarr array, where
+   each shard may only start at a chunk position that is divisible by
+   ``chunks_per_shard`` per dimension, e.g. starting at the
+   zero-origin. The length of the array must match the length of the
+   array metadata ``shape`` entry. For example, a value ``[32, 2]``
+   indicates that 64 chunks are combined in one shard, 32 along the
+   first dimension, and for each of those 2 along the second
+   dimension. Valid starting positions for a shard in the chunk-grid
+   are therefore ``[0, 0]``, ``[32, 2]``, ``[32, 4]``, ``[64, 2]`` or
+   ``[96, 18]``.
+
+
+Storage transformer implementation
+==================================
+
+Key & value transformation
+--------------------------
+
+The storage transformer protocol defines the abstract interface to be
+the same as the :ref:`abstract-store-interface`.
+
+The Zarr store interface is defined as a mapping of `keys` and
+`values`, where a `key` is a sequence of characters and a `value` is a
+sequence of bytes. A key-value pair is called an `entry` in the
+following.
+
+This sharding transformer only adapts entries where the key starts
+with ``data/root``, as those keys indicate data keys for array chunks,
+see :ref:`storage-keys`. All other entries are simply passed on.
+
+Entries starting with ``data/root`` are grouped by their common shard,
+assuming storage keys from a regular chunk grid, which may use a
+custom ``chunk separator``: For all entries that are part of the same
+shard, the key is changed to the shard-key and the values are combined
+in the `Binary shard format`_ described below. The new shard-key is
+the chunk key divided by ``chunks_per_shard`` and floored per
+dimension. For example, for ``chunks_per_shard=[32, 2]`` the chunk
+grid position ``[96, 18]`` (e.g. key "data/root/foo/baz/c96/18") is
+transformed to the shard grid position ``[3, 9]`` and reassigned to
+the respective new key, honoring the original chunk separator
+(e.g. "data/root/foo/baz/c3/9"). Chunk grid positions ``[96, 19]``,
+``[97, 18]``, …, up to ``[127, 19]`` will also have the same shard
+grid position ``[3, 9]``.
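+
+As a non-normative example, the following sketch computes the
+shard-key for a chunk key by floor-dividing the chunk-grid
+coordinates. The helper name and the assumption that the chunk-grid
+suffix is introduced by a "c" prefix are illustrative only:
+
+.. code-block:: python
+
+    def chunk_to_shard_key(chunk_key, chunks_per_shard, separator="/"):
+        """Map a chunk key to the key of the shard containing it."""
+        # Split off the chunk-grid coordinates after the last "/c".
+        prefix, _, grid = chunk_key.rpartition("/c")
+        coords = [int(c) for c in grid.split(separator)]
+        # Floor-divide each coordinate by chunks_per_shard.
+        shard = [c // n for c, n in zip(coords, chunks_per_shard)]
+        return prefix + "/c" + separator.join(str(s) for s in shard)
+
+    # Chunk [96, 18] with chunks_per_shard=[32, 2] lands in shard [3, 9]:
+    assert (chunk_to_shard_key("data/root/foo/baz/c96/18", [32, 2])
+            == "data/root/foo/baz/c3/9")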
+
+
+Binary shard format
+-------------------
+
+The only binary format is the ``indexed`` format, as specified by the
+``format`` configuration key. Other binary formats might be added in
+future versions.
+
+In the indexed binary format, chunks are written successively in a
+shard, where unused space between them is allowed, followed by an
+index referencing them. The index is placed at the end of the file and
+has a size of 16 bytes multiplied by the number of chunks in a shard,
+for example ``16 bytes * 64 = 1024 bytes`` for
+``chunks_per_shard=[32, 2]``. The index holds an `offset, nbytes` pair
+of little-endian uint64 values per chunk; the chunks are ordered in
+the index in row-major (C) order. For example, for
+``chunks_per_shard=[2, 2]`` an index would look like:
+
+.. code-block::
+
+    | chunk (0, 0)    | chunk (0, 1)    | chunk (1, 0)    | chunk (1, 1)    |
+    | offset | nbytes | offset | nbytes | offset | nbytes | offset | nbytes |
+    | uint64 | uint64 | uint64 | uint64 | uint64 | uint64 | uint64 | uint64 |
+
+Empty chunks are denoted by setting both offset and nbytes to
+``2^64 - 1``. The index always has the full shape of all possible
+chunks per shard, even if they are outside of the array size.
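+
+For example (non-normative), such an index could be parsed as follows;
+the function name and the dict-based return type are illustrative
+assumptions:
+
+.. code-block:: python
+
+    import itertools
+    import struct
+
+    MAX_UINT64 = 2**64 - 1  # sentinel marking an empty chunk
+
+    def parse_shard_index(index_bytes, chunks_per_shard):
+        """Map row-major chunk positions to (offset, nbytes) pairs.
+
+        Empty chunks are represented as None.
+        """
+        entries = {}
+        positions = itertools.product(*(range(n) for n in chunks_per_shard))
+        for i, pos in enumerate(positions):
+            # Each 16-byte entry is a little-endian uint64 pair.
+            offset, nbytes = struct.unpack_from("<QQ", index_bytes, i * 16)
+            if offset == MAX_UINT64 and nbytes == MAX_UINT64:
+                entries[pos] = None
+            else:
+                entries[pos] = (offset, nbytes)
+        return entries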
+
+The actual order of the chunk content within a shard is not fixed and
+may be chosen by the implementation, as all possible write orders are
+valid according to this specification and can therefore be read by any
+other implementation. When writing partial chunks into an existing
+shard, no specific order of the existing chunks may be expected. Some
+possible writing strategies are:
+
+* **Fixed order**: Specify a fixed order (e.g. row-, column-major or
+  Morton order). When replacing existing chunks, chunks of larger or
+  equal size may be replaced in-place, leaving unused space up to an
+  upper limit that might possibly be specified. Please note that for
+  regular-sized uncompressed data all chunks have the same size and
+  can therefore be replaced in-place.
+* **Append-only**: Any chunk to write is appended to the existing
+  shard, followed by an updated index (see the sketch below).
+
+Any configuration parameters for the write strategy must not be part
+of the metadata document; they need to be configured at runtime, as
+they are implementation-specific.
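+
+As a non-normative illustration of the append-only strategy, the
+sketch below adds one chunk to an existing shard and re-serializes the
+index at the new end of the shard. It reuses the index representation
+of the parsing sketch above; the helper names are assumptions:
+
+.. code-block:: python
+
+    import itertools
+    import struct
+
+    EMPTY = (2**64 - 1, 2**64 - 1)  # index entry for an empty chunk
+
+    def append_chunk(old_shard, index, chunk_pos, chunk_bytes, chunks_per_shard):
+        """Append-only update of a single chunk in an existing shard."""
+        n_chunks = 1
+        for n in chunks_per_shard:
+            n_chunks *= n
+        # Strip the trailing index; the old chunk data stays as-is,
+        # a replaced chunk simply becomes unused space.
+        body = old_shard[:-16 * n_chunks]
+        # The new chunk is appended directly after the existing data.
+        index[chunk_pos] = (len(body), len(chunk_bytes))
+        entries = b"".join(
+            struct.pack("<QQ", *(index[pos] or EMPTY))
+            for pos in itertools.product(*(range(n) for n in chunks_per_shard))
+        )
+        return body + chunk_bytes + entries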
+
+
+API implementation
+------------------
+
+The section below defines an implementation of the
+:ref:`abstract-store-interface` in terms of the operations of this
+storage transformer as a ``StoreWithPartialAccess``. The term
+`underlying store` refers to either the next storage transformer in
+the stack or the actual store if this transformer is the last one in
+the stack. Any operations with keys not starting with ``data/root``
+are simply relayed to the underlying store and not described
+explicitly.
+
+* ``get_partial_values(key_ranges) -> values``:
+  For each referenced key, request the indices from the underlying
+  store using ``get_partial_values``. For each `key`, `range` pair in
+  `key_ranges`, check whether the chunk exists; it is missing if the
+  index offset and nbytes are both ``2^64 - 1``. For existing keys,
+  request the actual chunks by their ranges as read from the index
+  using ``get_partial_values``. This operation should be implemented
+  using two ``get_partial_values`` operations on the underlying store,
+  one for retrieving the indices and one for retrieving existing
+  chunks (see the non-normative sketch after this list).
+
+* ``set_partial_values(key_start_values)``:
+  For each referenced key, check if all available chunks in a shard
+  are referenced. In this case a shard can be constructed according to
+  the `Binary shard format`_ directly. For all other keys, request the
+  indices from the underlying store using ``get_partial_values``. All
+  chunks that are not updated completely and exist according to the
+  index (index offset and nbytes are not both ``2^64 - 1``) need to be
+  read via ``get_partial_values`` from the underlying store. For
+  simplification purposes a shard may also be read completely,
+  combining the previous two `get` operations into one. Based on the
+  existing chunks and the value ranges that need to be updated, new
+  shards are constructed according to the `Binary shard format`_. All
+  shards that need to be updated must then be set via ``set`` or
+  ``set_partial_values(key_start_values)``, depending on the chosen
+  writing strategy provided by the implementation. Specialized store
+  implementations that allow appending to a storage object may only
+  need to read the index to update it.
+
+* ``erase_values(keys)``:
+  For each referenced key, check if all available chunks in a shard
+  are referenced. In this case the full shard is removed using
+  ``erase_values`` on the underlying store. For all other keys,
+  request the indices from the underlying store using
+  ``get_partial_values``. Update the index using an offset and nbytes
+  of ``2^64 - 1`` to mark missing chunks. The updated index may be
+  written in-place using ``set_partial_values(key_start_values)``, or
+  a larger rewrite of the shard may be done, including the index
+  update but also removing value ranges corresponding to the erased
+  chunks.
+
+* ``erase_prefix(prefix)``:
+  If the prefix contains a part of the chunk-grid key, this part is
+  translated to the referenced shard and contained chunks. For
+  affected shards where all contained chunks are erased, the prefix is
+  rewritten to the corresponding shard key and the operation is
+  relayed to the underlying store. For all shards where only some
+  chunks are erased, the affected chunks are removed by invoking the
+  operation ``erase_values`` on this storage transformer with the
+  respective chunk keys.
+
+* ``list()``: See ``list_prefix`` with the prefix ``/``.
+
+* ``list_prefix(prefix)``:
+  If the prefix contains a part of the chunk-grid key, this part is
+  translated to the referenced shard and contained chunks. Then,
+  ``list_prefix`` is called on the underlying store with the
+  translated prefix. For all listed shards, request the indices from
+  the underlying store using ``get_partial_values``. Existing chunks,
+  where the index offset and nbytes are not both ``2^64 - 1``, are
+  then listed by their original key.
+
+* ``list_dir(prefix)``:
+  If the prefix contains a part of the chunk-grid key, this part is
+  translated to the referenced shard and contained chunks. Then,
+  ``list_dir`` is called on the underlying store with the translated
+  prefix. For all *retrieved prefixes* (not full keys) with partial
+  shard keys, the corresponding original prefixes covering all
+  possible chunks in the shard are listed. For *retrieved full keys*,
+  the indices from the underlying store are requested using
+  ``get_partial_values``. Existing chunks, where the index offset and
+  nbytes are not both ``2^64 - 1``, are then listed by their original
+  key.
+
+  .. note::
+     Not all listed prefixes must necessarily contain keys, as shard
+     prefixes with partially available chunks return prefixes for all
+     possible chunks without verifying their existence for performance
+     reasons. Listing those prefixes is still safe, as some chunks in
+     their corresponding shard exist, but not necessarily in the
+     requested prefix, possibly leading to empty responses. Please
+     note that this only applies to returned prefixes, *not* to full
+     keys referencing storage objects. Returned full keys always
+     reflect the actually available chunks and are safe to request.
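+
+To make the two-step read concrete, the following non-normative sketch
+fetches a single chunk through the shard index with two
+``get_partial_values`` calls. The total shard length is assumed to be
+known here, since this draft expresses ranges relative to the start of
+a value; the function name and the tuple-based range encoding are
+illustrative assumptions:
+
+.. code-block:: python
+
+    import struct
+
+    def read_chunk(store, shard_key, chunk_pos, chunks_per_shard, shard_length):
+        """Read one chunk from a shard via its index, or None if empty."""
+        n_chunks = 1
+        for n in chunks_per_shard:
+            n_chunks *= n
+        # Row-major linear position of the chunk within the index.
+        linear = 0
+        for pos, n in zip(chunk_pos, chunks_per_shard):
+            linear = linear * n + pos
+        # Step 1: read this chunk's 16-byte index entry at the shard end.
+        index_start = shard_length - 16 * n_chunks + 16 * linear
+        (entry,) = store.get_partial_values([(shard_key, (index_start, 16))])
+        offset, nbytes = struct.unpack("<QQ", entry)
+        if offset == nbytes == 2**64 - 1:
+            return None  # empty chunk
+        # Step 2: read the chunk data at the referenced range.
+        (chunk,) = store.get_partial_values([(shard_key, (offset, nbytes))])
+        return chunk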
+
+
+References
+==========
+
+.. [RFC2119] S. Bradner. Key words for use in RFCs to Indicate
+   Requirement Levels. March 1997. Best Current Practice. URL:
+   https://tools.ietf.org/html/rfc2119
+
+
+Change log
+==========
+
+@@TODO
diff --git a/docs/stores/filesystem/v1.0.rst b/docs/stores/filesystem/v1.0.rst
index 79e303c4..df070709 100644
--- a/docs/stores/filesystem/v1.0.rst
+++ b/docs/stores/filesystem/v1.0.rst
@@ -149,8 +149,8 @@ directory path is "C:\\data", then the file system path

 Store API implementation
 ========================

-The section below defines an implementation of the Zarr abstract store
-interface (@@TODO link) in terms of the native operations of this
+The section below defines an implementation of the Zarr
+:ref:`abstract-store-interface` in terms of the native operations of this
 storage system. Below ``fspath_to_key()`` is a function that translates
 file system paths to store keys, and ``key_to_fspath()`` is a function
 that translates store keys to file system paths, as defined