Commit 4d96684

mkitti and Gemini committed
Replace "external" with "configuration" as an index_location
Co-authored-by: Gemini <[email protected]>
1 parent a70c350 commit 4d96684

1 file changed

+79
-37
lines changed

docs/v3/codecs/sharding-indexed/index.rst

Lines changed: 79 additions & 37 deletions
@@ -147,27 +147,48 @@ Sharding can be configured per array in the :ref:`array-metadata` as follows::
147147

148148
``index_codecs``
149149

150-
Specifies a list of codecs to be used for encoding and decoding a shard index.
150+
Specifies a list of codecs to be used for processing a shard index.
151151
The shard index is an array with a ``shape`` of ``[N,2]`` and a ``data_type`` of
152152
``uint64`` where ``N`` is the number of chunks to be indexed in the shard.
153153
The ``index_codecs`` value must be an array of objects, as specified in the
154-
:ref:`array-metadata`. The ``index_codecs`` member is required and needs to
155-
contain exactly one ``array -> bytes`` codec. That codec MAY be preceded by
156-
``array -> array`` codecs that modify either the ``shape`` or ``data_type``
157-
of the array. Codecs that produce variable-sized encoded representation,
158-
such as compression codecs, MUST NOT be used for index codecs. It is
159-
RECOMMENDED to use a little-endian codec followed by a crc32c checksum as
160-
index codecs.
154+
:ref:`array-metadata`.
155+
156+
Unless ``index_location`` is ``configuration`` and ``predefined_index`` is a
157+
JSON array, the ``index_codecs`` member is required and needs to contain
158+
exactly one ``array -> bytes`` codec. This ``array -> bytes`` codec MAY be
159+
preceded by ``array -> array`` codecs that modify either the ``shape`` or
160+
``data_type`` of the array.
161+
162+
If ``index_location`` is ``configuration`` and ``predefined_index`` is a JSON array,
163+
the ``index_codecs`` member MAY be empty or contain only ``array -> array``
164+
codecs. An ``array -> bytes`` codec MUST NOT be present. The ``array -> array``
165+
codecs define the transformation from the conceptual ``[N, 2]`` index to the
166+
structure represented in the JSON array.
167+
168+
Codecs that produce variable-sized encoded representation, such as
169+
compression codecs, MUST NOT be used for index codecs. It is RECOMMENDED
170+
to use a little-endian codec followed by a crc32c checksum as index codecs
171+
when an ``array -> bytes`` codec is used.
161172

162173
``index_location``
163174

164175
Specifies whether the shard index is located at the beginning of the file,
165-
the end of the file, or external to the file in its own key. The parameter
166-
value must be either the string ``start``, ``end``, or ``external``. A value
167-
of external indicates the shard index has been written to an external file
168-
referred to by a key constructed by appending ".shard_index" to the key of
169-
the sharded chunk. If the parameter is not present, the value defaults to
176+
the end of the file, or is defined directly in the codec's configuration. The
177+
parameter value must be either the string ``start``, ``end``, or
178+
``configuration``. If the parameter is not present, the value defaults to
170179
``end``.
180+
181+
``predefined_index``
182+
183+
REQUIRED if ``index_location`` is ``configuration``. This parameter
184+
contains the shard index data itself. The value can either be:
185+
186+
* A JSON array: A JSON array representing the ``(offset, nbytes)`` pairs
187+
for each inner chunk. This array must have a shape of ``[N, 2]`` where ``N``
188+
is the number of inner chunks in the shard. Each element must be an integer.
189+
* A BASE64 encoded string: A string containing the BASE64 encoding of
190+
the binary representation of the shard index. The binary representation
191+
is as described in the "Binary shard format" section.
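
The two accepted forms of ``predefined_index`` can be illustrated with a minimal Python sketch. It assumes that ``index_codecs`` is a plain little-endian ``bytes`` codec with no checksum and no ``array -> array`` codecs. The helper names ``index_as_json_array`` and ``index_as_base64`` are hypothetical and not part of the specification.

```python
import base64
import struct


def index_as_json_array(entries):
    """Return the [N, 2] index as nested lists of ints (the JSON array form)."""
    return [[int(offset), int(nbytes)] for offset, nbytes in entries]


def index_as_base64(entries):
    """Return the BASE64 string form of the index.

    Assumption: each (offset, nbytes) pair is packed as two little-endian
    uint64 values and no checksum codec is appended.
    """
    raw = b"".join(struct.pack("<QQ", offset, nbytes) for offset, nbytes in entries)
    return base64.b64encode(raw).decode("ascii")


# Two inner chunks stored back to back in the shard.
entries = [(0, 100), (100, 50)]
json_form = index_as_json_array(entries)
b64_form = index_as_base64(entries)
```

Either ``json_form`` or ``b64_form`` could then be placed under ``predefined_index`` in the codec configuration; with a checksum codec in ``index_codecs``, the BASE64 form would also have to include the checksum bytes.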
171192

172193
Definitions
173194
===========
@@ -190,7 +211,7 @@ This is an ``array -> bytes`` codec.
190211

191212
In the ``sharding_indexed`` binary format, inner chunks are written successively in a
192213
shard, where unused space between them is allowed. An index referencing them may
193-
precede, follow, or exist external to the shard.
214+
precede, follow, or be defined directly in the codec configuration.
194215

195216
The index is an array with 64-bit unsigned integers with a shape that matches the
196217
chunks per shard tuple with an appended dimension of size 2.
@@ -208,11 +229,24 @@ Empty inner chunks are interpreted as being filled with the fill value. The inde
208229
always has the full shape of all possible inner chunks per shard, even if they extend
209230
beyond the array shape.
210231

211-
The index is either placed at the end of the file or, at the beginning of the file,
212-
or under its own key, as configured by the ``index_location`` parameter. The index
213-
is encoded into binary representations using the specified index codecs. The byte
214-
size of the index is determined by the number of inner chunks in the shard ``n``,
215-
i.e. the product of chunks per shard, and the choice of index codecs.
232+
The index is either placed at the end of the file, at the beginning of the file,
233+
or defined directly within the codec configuration, as configured by the
234+
``index_location`` parameter.
235+
236+
When ``index_location`` is ``start`` or ``end``, the index is encoded into a
237+
binary representation using the specified index codecs, which must include one
238+
``array -> bytes`` codec.
239+
240+
When ``index_location`` is ``configuration``, the index is provided via the
241+
``predefined_index`` parameter.
242+
If ``predefined_index`` is a BASE64 encoded string, its content is the binary
243+
representation produced by applying the full ``index_codecs`` chain (including
244+
an ``array -> bytes`` codec) to the index array.
245+
If ``predefined_index`` is a JSON array, it represents the index *after* any
246+
``array -> array`` codecs in the ``index_codecs`` chain have been applied. In
247+
this case, the ``index_codecs`` chain MUST NOT contain an ``array -> bytes``
248+
codec. The byte size of the index is determined by the number of inner chunks
249+
in the shard ``n``, i.e. the product of chunks per shard.
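
As a hedged illustration of the two ``predefined_index`` forms described above, the following Python sketch turns either form into a list of ``(offset, nbytes)`` pairs and slices one inner chunk out of a shard buffer. It assumes a little-endian ``bytes`` index codec without a checksum, no ``array -> array`` index codecs, and the convention from the binary shard format that an empty inner chunk has both values set to ``2^64 - 1``. The function names are hypothetical.

```python
import base64
import struct

EMPTY = 2**64 - 1  # offset/nbytes value assumed to mark an empty inner chunk


def load_predefined_index(predefined_index):
    """Return a list of (offset, nbytes) pairs from either accepted form."""
    if isinstance(predefined_index, str):
        # BASE64 form: assumed to be little-endian uint64 pairs, no checksum.
        raw = base64.b64decode(predefined_index)
        if len(raw) % 16 != 0:
            raise ValueError("index byte length must be a multiple of 16")
        values = struct.unpack(f"<{len(raw) // 8}Q", raw)
        return [(values[i], values[i + 1]) for i in range(0, len(values), 2)]
    # JSON array form: assumed to already have shape [N, 2].
    return [(int(offset), int(nbytes)) for offset, nbytes in predefined_index]


def read_inner_chunk(shard, index, i):
    """Return the encoded bytes of inner chunk i, or None if it is empty."""
    offset, nbytes = index[i]
    if offset == EMPTY and nbytes == EMPTY:
        return None
    return shard[offset:offset + nbytes]


shard = b"aaaabbbbbb"
json_index = [[0, 4], [4, 6], [EMPTY, EMPTY]]
b64_index = base64.b64encode(
    struct.pack("<6Q", 0, 4, 4, 6, EMPTY, EMPTY)
).decode("ascii")
index = load_predefined_index(b64_index)
```

The bytes returned by ``read_inner_chunk`` would still need to be decoded with the shard's inner codecs.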
216250

217251
For an example, consider a shard shape of ``[64, 64]``, an inner chunk shape of
218252
``[32, 32]`` and an index codec combination of a little-endian codec followed by
@@ -259,8 +293,9 @@ common optimizations.
259293
* **Decoding**: A simple implementation to decode inner chunks in a shard would (a)
260294
read the entire value from the store into a byte buffer, (b) parse the shard
261295
index as specified above from the beginning or end (according to the
262-
``index_location``) of the buffer or from an external index and (c) cut out
263-
the relevant bytes that belong to the requested chunk. The relevant bytes are
296+
``index_location``) of the buffer, or retrieve it directly from the
297+
``predefined_index`` parameter when ``index_location`` is ``configuration``,
298+
and (c) cut out the relevant bytes that belong to the requested chunk. The relevant bytes are
264299
determined by the ``offset,nbytes`` pair in the shard index. This bytestream
265300
then needs to be decoded with the inner codecs as specified in the sharding
266301
configuration applying the :ref:`decoding_procedure`. This is similar to how
@@ -274,24 +309,33 @@ common optimizations.
274309
read the entire shard once into a byte buffer and then cut out and decode all
275310
inner chunks from that buffer in one pass.
276311

277-
If the underlying store supports partial reads, the decoding of single inner
278-
chunks can be optimized. In that case, the shard index can be read from the
279-
store by requesting the ``n`` first or last bytes (according to the
280-
``index_location``), where ``n`` is the size of the index as determined by
281-
the number of inner chunks in the shard and choice of index codecs. After
282-
parsing the shard index, single inner chunks can be requested from the store
283-
by specifying the byte range. The bytestream, then, needs to be decoded as above.
312+
If the underlying store supports partial reads and ``index_location`` is
313+
``start`` or ``end``, the decoding of single inner chunks can be optimized.
314+
In that case, the shard index can be read from the store by requesting the
315+
``n`` first or last bytes (according to the ``index_location``), where ``n``
316+
is the size of the index as determined by the number of inner chunks in the
317+
shard and choice of index codecs. After parsing the shard index, single
318+
inner chunks can be requested from the store by specifying the byte range.
319+
If ``index_location`` is ``configuration``, the index is directly available
320+
from the codec configuration and no partial read is needed for the index itself.
321+
The bytestream, then, needs to be decoded as above.
284322

285323
* **Encoding**: A simple implementation to encode a chunk in a shard would (a)
286324
encode the new chunk per :ref:`encoding_procedure` in a byte buffer using the
287325
shard's inner codecs, (b) read an existing shard from the store, (c) create a
288326
new bytestream with all encoded inner chunks of that shard including the overwritten
289-
chunk, (d) generate a new shard index that is prepended, appended, or
290-
externally written (according to the ``index_location``) to the chunk
327+
chunk, (d) generate a new shard index that is prepended or appended
328+
(according to the ``index_location``) to the chunk
291329
bytestream and (e) write the shard to the store. If there was no existing
292330
shard, an empty shard is assumed. When writing entire inner chunks, reading
293331
the existing shard first may be skipped.
294332

333+
Due to the difficulty of updating an index stored in the array metadata,
334+
implementations MAY consider any array using ``"index_location": "configuration"``
335+
(at any level of nesting) to be read-only. Writing to such an array may
336+
produce an error or lead to a corrupted state if the written data would
337+
require a change to the predefined index.
338+
295339
When working with inner chunks that have a fixed byte size (e.g., uncompressed) and
296340
a store that supports partial writes, an optimization would be to replace the
297341
new chunk by writing to the store at the specified byte range.
@@ -305,16 +349,14 @@ common optimizations.
305349
Other use case-specific optimizations may be available, e.g., for append-only
306350
workloads.
307351

308-
* **Nesting**: The ``sharding_indexed`` codec can be used as part of a codec
352+
* **Nesting**: The ``sharding_indexed`` codec MAY be used as part of a codec
309353
chain of another ``sharding_indexed`` codec. This means that an inner chunk
310354
MAY itself be a shard nested within an outer chunk, creating a hierarchical
311355
index and multiple levels of partitioning. While the number of nested levels
312356
of shards is not restricted, some implementations MAY support a limited
313-
number of nested shards or MAY NOT support nesting. Primary shards that
314-
are not contained within other shards MAY have an ``index_location`` value of
315-
``start``, ``end``, or ``external``. Nested shards MAY have an
316-
``index_location`` value of ``start`` or ``end``. Nested shards MUST NOT have
317-
an ``index_location`` value of ``external``.
357+
number of nested shards or MAY NOT support nesting. Both primary and nested
358+
shards MAY have an ``index_location`` value of ``start``, ``end``, or
359+
``configuration``.
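
An implementation that chooses to treat arrays with ``"index_location": "configuration"`` as read-only needs to detect that value at any level of nesting. A minimal Python sketch of such a check follows. It assumes the codec metadata is already parsed into dictionaries with ``name`` and ``configuration`` members, and that nested shards appear in the ``codecs`` member of a ``sharding_indexed`` configuration. The function name ``uses_configuration_index`` is hypothetical.

```python
def uses_configuration_index(codecs):
    """Return True if any sharding_indexed codec, at any nesting level,
    has index_location set to "configuration"."""
    for codec in codecs:
        if not isinstance(codec, dict) or codec.get("name") != "sharding_indexed":
            continue
        config = codec.get("configuration", {})
        # index_location defaults to "end" when it is not present.
        if config.get("index_location", "end") == "configuration":
            return True
        # Recurse into the inner codec chain to find nested shards.
        if uses_configuration_index(config.get("codecs", [])):
            return True
    return False


# Outer shard with the index at the end, nested shard with a predefined index.
nested = [{
    "name": "sharding_indexed",
    "configuration": {
        "index_location": "end",
        "codecs": [{
            "name": "sharding_indexed",
            "configuration": {
                "index_location": "configuration",
                "predefined_index": [[0, 16]],
                "codecs": [{"name": "bytes"}],
            },
        }],
    },
}]
plain = [{"name": "sharding_indexed", "configuration": {"codecs": [{"name": "bytes"}]}}]
```

A writer could call this check once when opening an array and refuse writes when it returns ``True``.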
318360

319361
References
320362
==========
@@ -326,7 +368,7 @@ References
326368
Change log
327369
==========
328370

329-
* Add ``external`` as a parameter value for ``index_location`` to Version 1.1 and clarified nesting. `PR ABC <https://github.com/zarr-developers/zarr-specs/pull/ABC>`_
371+
* Add ``configuration`` as a parameter value for ``index_location`` to Version 1.1 and clarified nesting. `PR 368 <https://github.com/zarr-developers/zarr-specs/pull/368>`_
330372

331373
* Adds ``index_location`` parameter. `PR 280 <https://github.com/zarr-developers/zarr-specs/pull/280>`_
332374
