Merge pull request #253 from scalableminds/sharding-index-codecs

jbms · web-flow · commit 36b483232a07 · 2023-07-26T15:53:02.000-07:00
Adds index_codecs to the sharding codec
diff --git a/docs/v3/codecs/crc32c/v1.0.rst b/docs/v3/codecs/crc32c/v1.0.rst
@@ -0,0 +1,95 @@
+.. _crc32c-codec-v1:
+
+=====================================
+ CRC32C checksum codec (version 1.0)
+=====================================
+
+  **Editor's draft 17 July 2023**
+
+Specification URI:
+    https://zarr-specs.readthedocs.io/en/latest/v3/codecs/crc32c/v1.0.html
+Corresponding ZEP:
+    `ZEP 2 — Sharding codec <https://zarr.dev/zeps/draft/ZEP0002.html>`_
+Issue tracking:
+    `GitHub issues <https://github.com/zarr-developers/zarr-specs/labels/codec>`_
+Suggest an edit for this spec:
+    `GitHub editor <https://github.com/zarr-developers/zarr-specs/blob/main/docs/v3/codecs/crc32c/v1.0.rst>`_
+
+Copyright 2022-Present `Zarr core development team
+<https://github.com/orgs/zarr-developers/teams/core-devs>`_. This work
+is licensed under a `Creative Commons Attribution 3.0 Unported License
+<https://creativecommons.org/licenses/by/3.0/>`_.
+
+----
+
+
+Abstract
+========
+
+Defines a ``bytes -> bytes`` codec that appends a CRC32C checksum of the input bytestream.
+
+
+Status of this document
+=======================
+
+.. warning::
+    This document is a draft for review and subject to changes.
+    It will become final when the `Zarr Enhancement Proposal (ZEP) 2 <https://zarr.dev/zeps/draft/ZEP0002.html>`_
+    is approved via the `ZEP process <https://zarr.dev/zeps/active/ZEP0000.html>`_.
+
+
+Document conventions
+====================
+
+Conformance requirements are expressed with a combination of
+descriptive assertions and [RFC2119]_ terminology. The key words
+"MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD",
+"SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in the normative
+parts of this document are to be interpreted as described in
+[RFC2119]_. However, for readability, these words do not appear in all
+uppercase letters in this specification.
+
+All of the text of this specification is normative except sections
+explicitly marked as non-normative, examples, and notes. Examples in
+this specification are introduced with the words "for example".
+
+
+Codec name
+==========
+
+The value of the ``name`` member in the codec object MUST be ``crc32c``.
+
+
+Configuration parameters
+========================
+
+None.
+
+
+Format and algorithm
+====================
+
+This is a ``bytes -> bytes`` codec.
+
+The codec computes the CRC32C checksum of the input bytestream, as defined in
+[RFC3720]_. The output bytestream consists of the unchanged input bytestream
+followed by the checksum. The checksum is encoded as a 32-bit unsigned integer
+in little-endian byte order.
+
+
+References
+==========
+
+.. [RFC2119] S. Bradner. Key words for use in RFCs to Indicate
+   Requirement Levels. March 1997. Best Current Practice. URL:
+   https://tools.ietf.org/html/rfc2119
+
+.. [RFC3720] J. Satran et al. Internet Small Computer Systems 
+   Interface (iSCSI). April 2004. Proposed Standard. URL:
+   https://tools.ietf.org/html/rfc3720
+
+
+Change log
+==========
+
+No changes yet.
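
The "Format and algorithm" section of the new crc32c spec can be illustrated with a minimal Python sketch. The function names ``crc32c``, ``crc32c_encode`` and ``crc32c_decode`` are hypothetical and not part of the spec. The spec text only states that the checksum is appended on encoding; verifying and stripping it on decoding is an assumption made here for illustration.

```python
import struct


def crc32c(data: bytes) -> int:
    """Bitwise CRC32C (Castagnoli) using the reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift out one bit; apply the polynomial when that bit was set.
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def crc32c_encode(data: bytes) -> bytes:
    """Return the unchanged input followed by its checksum as uint32, little endian."""
    return data + struct.pack("<I", crc32c(data))


def crc32c_decode(encoded: bytes) -> bytes:
    """Assumed decoding step: verify the trailing checksum and strip it."""
    if len(encoded) < 4:
        raise ValueError("input is too short to contain a checksum")
    payload = encoded[:-4]
    (stored,) = struct.unpack("<I", encoded[-4:])
    if crc32c(payload) != stored:
        raise ValueError("CRC32C checksum mismatch")
    return payload
```

The bitwise loop favors brevity over speed; a real implementation would more likely use a table-driven or hardware-accelerated CRC32C.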
diff --git a/docs/v3/codecs/sharding-indexed/v1.0.rst b/docs/v3/codecs/sharding-indexed/v1.0.rst
@@ -3,16 +3,15 @@
 ==========================================
 Sharding codec (version 1.0)
 ==========================================
------------------------------
- Editor's draft 23 03 2023
------------------------------
+
+  **Editor's draft 17 July 2023**
 
 Specification URI:
     https://zarr-specs.readthedocs.io/en/latest/v3/codecs/sharding-indexed/v1.0.html
-
+Corresponding ZEP:
+    `ZEP 2 — Sharding codec <https://zarr.dev/zeps/draft/ZEP0002.html>`_
 Issue tracking:
     `GitHub issues <https://github.com/zarr-developers/zarr-specs/labels/sharding-indexed-codec-v1.0>`_
-
 Suggest an edit for this spec:
     `GitHub editor <https://github.com/zarr-developers/zarr-specs/blob/main/docs/codecs/sharding-indexed/v1.0.rst>`_
 
@@ -91,12 +90,27 @@ Sharding can be configured per array in the :ref:`array-metadata` as follows::
           "configuration": {
             "chunk_shape": [32, 32],
             "codecs": [
+              {
+                "name": "endian",
+                "configuration": {
+                  "endian": "little"
+                }
+              },
               {
                 "name": "gzip",
                 "configuration": {
                   "level": 1
                 }
               }
+            ],
+            "index_codecs": [
+              {
+                "name": "endian",
+                "configuration": {
+                  "endian": "little"
+                }
+              },
+              { "name": "crc32c" }
             ]
           }
         }
@@ -105,85 +119,104 @@ Sharding can be configured per array in the :ref:`array-metadata` as follows::
 
 ``chunk_shape``
 
-    An array of integers specifying the size of the inner chunks in a shard
-    along each dimension of the outer array.  The length of the ``chunk_shape``
-    array must match the number of dimensions of the outer chunk to which this
-    sharding codec is applied, and the chunk size along each dimension must
-    evenly divide the size of the outer chunk.  For example, an inner chunk
-    shape of ``[32, 2]`` with an outer chunk shape ``[64, 64]`` indicates that
-    64 chunks are combined in one shard, 2 along the first dimension, and for
+    An array of integers specifying the shape of the inner chunks in a shard
+    along each dimension of the outer array. The length of the ``chunk_shape``
+    array must match the number of dimensions of the shard shape to which this
+    sharding codec is applied, and the inner chunk shape along each dimension must
+    evenly divide the size of the shard shape. For example, an inner chunk
+    shape of ``[32, 2]`` with a shard shape ``[64, 64]`` indicates that
+    64 inner chunks are combined in one shard, 2 along the first dimension, and for
     each of those 32 along the second dimension.
 
 ``codecs``
 
     Specifies a list of codecs to be used for encoding and decoding inner chunks. 
     The value must be an array of objects, as specified in the 
-    :ref:`array-metadata`. An absent ``codecs`` member is equivalent to 
-    specifying an empty list of codecs.
+    :ref:`array-metadata`. The ``codecs`` member is required and needs to contain
+    exactly one ``array -> bytes`` codec.
+
+``index_codecs``
+
+    Specifies a list of codecs to be used for encoding and decoding the shard index.
+    The value must be an array of objects, as specified in the
+    :ref:`array-metadata`. The ``index_codecs`` member is required and needs to
+    contain exactly one ``array -> bytes`` codec. Codecs that produce
+    variable-sized encoded representations, such as compression codecs, MUST NOT
+    be used as index codecs. It is RECOMMENDED to use a little-endian codec
+    followed by a crc32c checksum codec as index codecs.
+
+Definitions
+===========
+
+* **Shard** is a chunk of the outer array that corresponds to one storage object. 
+  As described in this document, shards MAY have multiple inner chunks.
+* **Inner chunk** is a chunk within the shard.
+* **Shard shape** is the chunk shape of the outer array.
+* **Inner chunk shape** is defined by the ``chunk_shape`` configuration of the codec.
+  The inner chunk shape needs to have the same dimensions as the shard shape and the 
+  inner chunk shape along each dimension must evenly divide the size of the shard shape.
+* **Chunks per shard** is the element-wise division of the shard shape by the 
+  inner chunk shape.
 
 
 Binary shard format
 ===================
 
 This is an ``array -> bytes`` codec.
 
-In the ``sharding_indexed`` binary format, chunks are written successively in a 
+In the ``sharding_indexed`` binary format, inner chunks are written successively in a 
 shard, where unused space between them is allowed, followed by an index 
 referencing them.
 
-The index is placed at the end of the file and has a size of ``16 * n + 4``
-bytes, where ``n`` is the number of chunks in the shard, i.e. the product of the
-sizes specified in ``chunk_shape``.  For example, ``16 * 4 + 4 = 68 bytes`` for a
-shard shape of ``[64, 64]`` and inner chunk shape of ``[32, 32]``.
-
-The index format is:
-
-- ``offset[0] : uint64le``
-- ``nbytes[0] : uint64le``
-- ``offset[1] : uint64le``
-- ``nbytes[1] : uint64le``
-- ...
-- ``offset[n-1] : uint64le``
-- ``nbytes[n-1] : uint64le``
-- ``checksum : uint32le``
-
-The final 4 bytes of the index is the CRC-32C checksum of the first ``16 * n``
-bytes of the index (everything except the final checksum).
-
-The chunks are listed in the index in row-major (C) order.
+The index is an array of 64-bit unsigned integers whose shape is the chunks per
+shard tuple with an appended dimension of size 2.
+For example, given a shard shape of ``[128, 128]`` and an inner chunk shape of
+``[32, 32]``, there are ``[4, 4]`` inner chunks in a shard. The corresponding
+shard index has a shape of ``[4, 4, 2]``.
 
+The index contains the ``offset`` and ``nbytes`` values for each inner chunk.
 The ``offset[i]`` specifies the byte offset within the shard at which the
 encoded representation of chunk ``i`` begins, and ``nbytes[i]`` specifies the
 encoded length in bytes.
 
-Given the example of 2x2 inner chunks in a shard, the index would look like::
+Empty inner chunks are denoted by setting both offset and nbytes to ``2^64 - 1``. 
+Empty inner chunks are interpreted as being filled with the fill value. The index 
+always has the full shape of all possible inner chunks per shard, even if they extend
+beyond the array shape.
+
+The index is placed at the end of the file and encoded into binary representations 
+using the specified index codecs. The byte size of the index is determined by the 
+number of inner chunks in the shard ``n``, i.e. the product of chunks per shard, and 
+the choice of index codecs.
+
+For example, consider a shard shape of ``[64, 64]``, an inner chunk shape of
+``[32, 32]`` and an index codec combination of a little-endian codec followed by 
+a crc32c checksum codec. The size of the corresponding index is 
+``16 (2x uint64) * 4 (chunks per shard) + 4 (crc32c checksum) = 68 bytes``.
+The index would look like::
 
     | chunk (0, 0)    | chunk (0, 1)    | chunk (1, 0)    | chunk (1, 1)    |          |
     | offset | nbytes | offset | nbytes | offset | nbytes | offset | nbytes | checksum |
     | uint64 | uint64 | uint64 | uint64 | uint64 | uint64 | uint64 | uint64 | uint32   |
 
-Empty chunks are denoted by setting both offset and nbytes to ``2^64 - 1``. 
-Empty chunks are interpreted as being filled with the fill value. The index 
-always has the full shape of all possible chunks per shard, even if they extend
-beyond the array shape.
 
 The actual order of the chunk content is not fixed and may be chosen by the
 implementation. All possible write orders are valid according to this
 specification and therefore can be read by any other implementation. When
-writing partial chunks into an existing shard, no specific order of the existing
-chunks may be expected. Some writing strategies might be
+writing partial inner chunks into an existing shard, no specific order of the existing
+inner chunks may be expected. Some writing strategies might be
 
 * **Fixed order**: Specify a fixed order (e.g. row-, column-major, or Morton
-  order). When replacing existing chunks larger or equal-sized chunks may be
+  order). When replacing existing inner chunks, larger or equal-sized inner chunks may be
   replaced in-place, leaving unused space up to an upper limit that might
   possibly be specified. Please note that, for regular-sized uncompressed data,
-  all chunks have the same size and can therefore be replaced in-place.
+  all inner chunks have the same size and can therefore be replaced in-place.
 * **Append-only**: Any chunk to write is appended to the existing shard,
-  followed by an updated index. If previous chunks are updated, their storage
+  followed by an updated index. If previous inner chunks are updated, their storage
   space becomes unused, as well as the previous index. This might be useful for
   storage that only allows append-only updates.
 * **Other formats**: Other formats that accept additional bytes at the end of
-  the file (such as HDF) could be used for storing shards, by writing the chunks
+  the file (such as HDF) could be used for storing shards, by writing the inner chunks
   in the order the format prescribes and appending a binary index derived from
   the byte offsets and lengths at the end of the file.
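
The index layout from the "Binary shard format" section can be sketched in Python for the recommended index codecs (little-endian encoding followed by a crc32c checksum). The names ``chunks_per_shard``, ``encode_index`` and ``EMPTY`` are hypothetical. The sketch also assumes that the index array is serialized in row-major (C) order, which the new text does not state explicitly.

```python
import itertools
import struct

EMPTY = 2**64 - 1  # offset and nbytes value that marks an empty inner chunk


def crc32c(data: bytes) -> int:
    """Bitwise CRC32C (Castagnoli) using the reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def chunks_per_shard(shard_shape, inner_chunk_shape):
    """Element-wise division of the shard shape by the inner chunk shape."""
    if len(shard_shape) != len(inner_chunk_shape):
        raise ValueError("shapes must have the same number of dimensions")
    if any(s % c for s, c in zip(shard_shape, inner_chunk_shape)):
        raise ValueError("inner chunk shape must evenly divide the shard shape")
    return tuple(s // c for s, c in zip(shard_shape, inner_chunk_shape))


def encode_index(entries, grid):
    """Encode {coords: (offset, nbytes)} as uint64le pairs plus a crc32c checksum."""
    flat = []
    # itertools.product varies the last coordinate fastest, i.e. C order (assumed).
    for coords in itertools.product(*(range(n) for n in grid)):
        flat.extend(entries.get(coords, (EMPTY, EMPTY)))
    raw = struct.pack(f"<{len(flat)}Q", *flat)
    return raw + struct.pack("<I", crc32c(raw))
```

For a shard shape of ``[64, 64]`` and an inner chunk shape of ``[32, 32]``, ``encode_index`` returns 68 bytes, matching the example size given in the text.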
 
@@ -198,7 +231,7 @@ Implementation notes
 This section suggests a non-normative implementation of the codec including
 common optimizations.
 
-* **Decoding**: A simple implementation to decode chunks in a shard would (a) 
+* **Decoding**: A simple implementation to decode inner chunks in a shard would (a) 
   read the entire value from the store into a byte buffer, (b) parse the shard
   index as specified above from the end of the buffer and (c) cut out the 
   relevant bytes that belong to the requested chunk. The relevant bytes are 
@@ -207,27 +240,32 @@ common optimizations.
   configuration applying the :ref:`decoding_procedure`. This is similar to how
   an implementation would access a sub-slice of a chunk.
 
-  When reading all chunks of a shard at once, a useful optimization would be to 
+  The size of the index can be determined by applying ``c.compute_encoded_size``
+  for each index codec in turn. The initial size is the byte size of the index
+  array, i.e. ``16 * n``, where ``n`` is the number of inner chunks in the shard.
+
+  When reading all inner chunks of a shard at once, a useful optimization would be to 
   read the entire shard once into a byte buffer and then cut out and decode all 
-  chunks from that buffer in one pass.
+  inner chunks from that buffer in one pass.
 
   If the underlying store supports partial reads, the decoding of single inner
   chunks can be optimized. In that case, the shard index can be read from the
-  store by requesting the ``n`` last bytes, where ``n`` is 16 bytes multiplied 
-  by the number of chunks in a shard. After parsing the shard index, single
-  chunks can be requested from the store by specifying the byte range. The 
-  bytestream, then, needs to be decoded as above. 
+  store by requesting the ``n`` last bytes, where ``n`` is the size of the index
+  as determined by the number of inner chunks in the shard and choice of index 
+  codecs. After parsing the shard index, single inner chunks can be requested from 
+  the store by specifying the byte range. The bytestream, then, needs to be 
+  decoded as above. 
 
 * **Encoding**: A simple implementation to encode a chunk in a shard would (a)
   encode the new chunk per :ref:`encoding_procedure` in a byte buffer using the 
   shard's inner codecs, (b) read an existing shard from the store, (c) create a 
-  new bytestream with all encoded chunks of that shard including the overwritten 
+  new bytestream with all encoded inner chunks of that shard including the overwritten 
   chunk, (d) generate a new shard index that is appended to the chunk bytestream 
   and (e) write the shard to the store. If there was no existing shard, an
-  empty shard is assumed. When writing entire chunks, reading the existing shard 
+  empty shard is assumed. When writing entire inner chunks, reading the existing shard 
   first may be skipped.
 
-  When working with chunks that have a fixed byte size (e.g., uncompressed) and 
+  When working with inner chunks that have a fixed byte size (e.g., uncompressed) and 
   a store that supports partial writes, an optimization would be to replace the
   new chunk by writing to the store at the specified byte range.
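
The simple decoding path from the implementation notes above might look roughly like the following Python sketch. It assumes the recommended index codecs (little-endian ``uint64`` values followed by a 4-byte crc32c checksum) and a row-major index order. ``index_size`` and ``read_inner_chunk`` are hypothetical names, and checksum verification is omitted for brevity.

```python
import math
import struct

EMPTY = 2**64 - 1   # offset and nbytes value that marks an empty inner chunk
CHECKSUM_SIZE = 4   # bytes added by the crc32c index codec


def index_size(grid):
    """Byte size of the encoded index: 16 bytes per inner chunk plus the checksum."""
    return 16 * math.prod(grid) + CHECKSUM_SIZE


def read_inner_chunk(shard: bytes, grid, coords):
    """Return the still-encoded bytes of one inner chunk, or None if it is empty."""
    # (b) The index sits at the end of the shard.
    index_bytes = shard[len(shard) - index_size(grid):]
    # Flatten the chunk coordinates in row-major (C) order (assumed).
    position = 0
    for c, extent in zip(coords, grid):
        position = position * extent + c
    offset, nbytes = struct.unpack_from("<2Q", index_bytes, 16 * position)
    if offset == EMPTY and nbytes == EMPTY:
        return None  # the caller would fill this chunk with the fill value
    # (c) Cut out the relevant bytes; they still need decoding with the inner codecs.
    return shard[offset:offset + nbytes]
```

With a store that supports partial reads, the same logic would apply to two range requests: one for the last ``index_size(grid)`` bytes and one for ``offset`` to ``offset + nbytes``.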
 
diff --git a/docs/v3/core/v3.0.rst b/docs/v3/core/v3.0.rst
@@ -1186,6 +1186,20 @@ of an additional operation:
   encoded representation is a byte string, then ``decoded_regions``
   specifies a list of byte ranges.
 
+- ``c.compute_encoded_size(input_size)``, a procedure that determines the 
+  size of the encoded representation given a size of the decoded representation.
+  This procedure cannot be implemented for codecs that produce variable-sized
+  encoded representations, such as compression algorithms. Depending on the 
+  type of the codec, the signature could differ:
+
+  - ``c.compute_encoded_size(array_size, data_type) -> (array_size, data_type)`` 
+    for ``array -> array`` codecs, where ``array_size`` is the number of items 
+    in the array, i.e., the product of the components of the array's shape;
+  - ``c.compute_encoded_size(array_size, data_type) -> byte_size`` 
+    for ``array -> bytes`` codecs;
+  - ``c.compute_encoded_size(byte_size) -> byte_size`` 
+    for ``bytes -> bytes`` codecs.
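
As a rough illustration of how ``compute_encoded_size`` could be chained for the shard index, here is a Python sketch. The function names are hypothetical, and the item size in bytes stands in for the ``data_type`` argument of the ``array -> bytes`` signature.

```python
def endian_encoded_size(array_size: int, item_size: int) -> int:
    """array -> bytes: a fixed-width encoding needs item_size bytes per item."""
    return array_size * item_size


def crc32c_encoded_size(byte_size: int) -> int:
    """bytes -> bytes: the crc32c codec appends a 4-byte checksum."""
    return byte_size + 4


def index_encoded_size(n_inner_chunks: int, bytes_codecs=(crc32c_encoded_size,)) -> int:
    """Chain the size computations for an index of uint64 (offset, nbytes) pairs."""
    # Two uint64 values (8 bytes each) per inner chunk.
    size = endian_encoded_size(2 * n_inner_chunks, 8)
    for compute_size in bytes_codecs:
        size = compute_size(size)
    return size
```

For 4 inner chunks per shard this yields ``16 * 4 + 4 = 68`` bytes, which agrees with the example in the sharding codec spec.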
+
 .. note::
 
    If ``partial_decode`` is not supported by a particular codec, it can