Commit 46ed6cd

Merge pull request #237 from jbms/sharding-checksum
Add CRC-32C checksum to sharded index format
2 parents a084325 + 7479e13 commit 46ed6cd

1 file changed, with 40 additions and 19 deletions

docs/v3/codecs/sharding-indexed/v1.0.rst

Lines changed: 40 additions & 19 deletions
@@ -105,15 +105,14 @@ Sharding can be configured per array in the :ref:`array-metadata` as follows::

 ``chunk_shape``

-An array of integers providing the shape of inner chunks in a shard for each
-dimension of the Zarr array. The length of the array must match the length
-of the array metadata ``shape`` entry. The each integer must by divisible by
-the ``chunk_shape`` of the array as defined in the ``chunk_grid``
-:ref:`array-metadata`.
-For example, an inner chunk shape of ``[32, 2]`` with an outer chunk shape
-``[64, 64]`` indicates that 64 chunks are combined in one shard, 2 along the
-first dimension, and for each of those 32 along the second dimension.
-Currently, only the ``regular`` chunk grid is supported.
+An array of integers specifying the size of the inner chunks in a shard
+along each dimension of the outer array. The length of the ``chunk_shape``
+array must match the number of dimensions of the outer chunk to which this
+sharding codec is applied, and the chunk size along each dimension must
+evenly divide the size of the outer chunk. For example, an inner chunk
+shape of ``[32, 2]`` with an outer chunk shape ``[64, 64]`` indicates that
+64 chunks are combined in one shard, 2 along the first dimension, and for
+each of those 32 along the second dimension.

 ``codecs``

@@ -130,16 +129,38 @@ This is an ``array -> bytes`` codec.

 In the ``sharding_indexed`` binary format, chunks are written successively in a
 shard, where unused space between them is allowed, followed by an index
-referencing them. The index is placed at the end of the file and has a size of
-16 bytes multiplied by the number of chunks in a shard, for example
-``16 bytes * 4 = 1024 bytes`` for shard shape of ``[64, 64]`` and inner chunk
-shape of ``[32, 32]``. The index holds an `offset, nbytes` pair of little-endian
-uint64 per chunk, the chunks-order in the index is row-major (C) order. Given
-the example of 2x2 inner chunks in a shard, the index would look like::
-
-| chunk (0, 0)    | chunk (0, 1)    | chunk (1, 0)    | chunk (1, 1)    |
-| offset | nbytes | offset | nbytes | offset | nbytes | offset | nbytes |
-| uint64 | uint64 | uint64 | uint64 | uint64 | uint64 | uint64 | uint64 |
+referencing them.
+
+The index is placed at the end of the file and has a size of ``16 * n + 4``
+bytes, where ``n`` is the number of chunks in the shard, i.e. the product of the
+sizes specified in ``chunk_shape``. For example, ``16 * 4 + 4 = 68 bytes`` for a
+shard shape of ``[64, 64]`` and inner chunk shape of ``[32, 32]``.
+
+The index format is:
+
+- ``offset[0] : uint64le``
+- ``nbytes[0] : uint64le``
+- ``offset[1] : uint64le``
+- ``nbytes[1] : uint64le``
+- ...
+- ``offset[n-1] : uint64le``
+- ``nbytes[n-1] : uint64le``
+- ``checksum : uint32le``
+
+The final 4 bytes of the index is the CRC-32C checksum of the first ``16 * n``
+bytes of the index (everything except the final checksum).
+
+The chunks are listed in the index in row-major (C) order.
+
+The ``offset[i]`` specifies the byte offset within the shard at which the
+encoded representation of chunk ``i`` begins, and ``nbytes[i]`` specifies the
+encoded length in bytes.
+
+Given the example of 2x2 inner chunks in a shard, the index would look like::
+
+| chunk (0, 0)    | chunk (0, 1)    | chunk (1, 0)    | chunk (1, 1)    |          |
+| offset | nbytes | offset | nbytes | offset | nbytes | offset | nbytes | checksum |
+| uint64 | uint64 | uint64 | uint64 | uint64 | uint64 | uint64 | uint64 | uint32   |

 Empty chunks are denoted by setting both offset and nbytes to ``2^64 - 1``.
 Empty chunks are interpreted as being filled with the fill value. The index
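
The index layout added in this commit can be sketched in Python as a rough illustration; this is not code from the repository. All function names here (``crc32c``, ``chunks_per_shard``, ``encode_index``, ``decode_index``) are hypothetical. The sketch assumes ``n`` is the product of the per-dimension quotients of shard shape by inner chunk shape, which is what both worked examples in the text imply (4 chunks for ``[32, 32]`` and 64 chunks for ``[32, 2]`` in a ``[64, 64]`` shard). CRC-32C is implemented inline with the reflected Castagnoli polynomial because the Python standard library only provides plain CRC-32.

```python
import struct
from math import prod

EMPTY = 2**64 - 1  # offset and nbytes value marking an empty chunk


def _make_crc32c_table():
    # Byte-wise lookup table for the reflected Castagnoli polynomial.
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_CRC32C_TABLE = _make_crc32c_table()


def crc32c(data: bytes) -> int:
    """Return the CRC-32C (Castagnoli) checksum of data."""
    crc = 0xFFFFFFFF
    for b in data:
        crc = _CRC32C_TABLE[(crc ^ b) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def chunks_per_shard(shard_shape, chunk_shape):
    """Number of inner chunks n, assuming n = prod(shard // chunk)."""
    if len(shard_shape) != len(chunk_shape):
        raise ValueError("chunk_shape must have one entry per dimension")
    if any(s % c for s, c in zip(shard_shape, chunk_shape)):
        raise ValueError("chunk_shape must evenly divide the shard shape")
    return prod(s // c for s, c in zip(shard_shape, chunk_shape))


def encode_index(entries):
    """Pack (offset, nbytes) pairs in row-major order, then the checksum."""
    body = b"".join(struct.pack("<QQ", off, nbytes) for off, nbytes in entries)
    return body + struct.pack("<I", crc32c(body))


def decode_index(shard: bytes, n: int):
    """Read and verify the 16 * n + 4 byte index at the end of a shard."""
    size = 16 * n + 4
    if len(shard) < size:
        raise ValueError("shard is too short to contain an index")
    index = shard[-size:]
    body = index[:-4]
    (checksum,) = struct.unpack("<I", index[-4:])
    if crc32c(body) != checksum:
        raise ValueError("shard index checksum mismatch")
    return [struct.unpack_from("<QQ", body, 16 * i) for i in range(n)]
```

For the 2x2 example, ``encode_index`` with four entries produces 68 bytes, and an entry equal to ``(EMPTY, EMPTY)`` denotes a chunk that a reader would fill with the fill value.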
