docs/v3/codecs/sharding-indexed/index.rst
Lines changed: 79 additions & 37 deletions
@@ -147,27 +147,48 @@ Sharding can be configured per array in the :ref:`array-metadata` as follows::
147
147
148
148
``index_codecs``
149
149
150
-
Specifies a list of codecs to be used for encoding and decoding a shard index.
150
+
Specifies a list of codecs to be used for processing a shard index.
151
151
The shard index is an array with a ``shape`` of ``[N,2]`` and a ``data_type`` of
152
152
``uint64`` where ``N`` is the number of chunks to be indexed in the shard.
153
153
The ``index_codecs`` value must be an array of objects, as specified in the
154
-
:ref:`array-metadata`. The ``index_codecs`` member is required and needs to
155
-
contain exactly one ``array -> bytes`` codec. That codec MAY be preceded by
156
-
``array -> array`` codecs that modify either the ``shape`` or ``data_type``
157
-
of the array. Codecs that produce variable-sized encoded representation,
158
-
such as compression codecs, MUST NOT be used for index codecs. It is
159
-
RECOMMENDED to use a little-endian codec followed by a crc32c checksum as
160
-
index codecs.
154
+
:ref:`array-metadata`.
155
+
156
+
Unless ``index_location`` is ``configuration`` and ``predefined_index`` is a
157
+
JSON array, the ``index_codecs`` member is required and needs to contain
158
+
exactly one ``array -> bytes`` codec. This ``array -> bytes`` codec MAY be
159
+
preceded by ``array -> array`` codecs that modify either the ``shape`` or
160
+
``data_type`` of the array.
161
+
162
+
If ``index_location`` is ``configuration`` and ``predefined_index`` is a JSON array,
163
+
the ``index_codecs`` member MAY be empty or contain only ``array -> array``
164
+
codecs. An ``array -> bytes`` codec MUST NOT be present. The ``array -> array``
165
+
codecs define the transformation from the conceptual ``[N, 2]`` index to the
166
+
structure represented in the JSON array.
167
+
168
+
Codecs that produce a variable-sized encoded representation, such as
169
+
compression codecs, MUST NOT be used for index codecs. It is RECOMMENDED
170
+
to use a little-endian codec followed by a crc32c checksum as index codecs
171
+
when an ``array -> bytes`` codec is used.
161
172
162
173
``index_location``
163
174
164
175
Specifies whether the shard index is located at the beginning of the file,
165
-
the end of the file, or external to the file in its own key. The parameter
166
-
value must be either the string ``start``, ``end``, or ``external``. A value
167
-
of external indicates the shard index has been written to an external file
168
-
referred to by a key constructed by appending ".shard_index" to the key of
169
-
the sharded chunk. If the parameter is not present, the value defaults to
176
+
the end of the file, or is defined directly in the codec's configuration. The
177
+
parameter value must be either the string ``start``, ``end``, or
178
+
``configuration``. If the parameter is not present, the value defaults to
170
179
``end``.
180
+
181
+
``predefined_index``
182
+
183
+
REQUIRED if ``index_location`` is ``configuration``. This parameter
184
+
contains the shard index data itself. The value can be either:
185
+
186
+
* A JSON array: A JSON array representing the ``(offset, nbytes)`` pairs
187
+
for each inner chunk. This array must have a shape of ``[N, 2]`` where ``N``
188
+
is the number of inner chunks in the shard. Each element must be an integer.
189
+
* A BASE64 encoded string: A string containing the BASE64 encoding of
190
+
the binary representation of the shard index. The binary representation
191
+
is as described in the "Binary shard format" section.
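
As an illustration of the two accepted forms, the following Python sketch builds a ``predefined_index`` value for a shard with two inner chunks, once as a JSON array and once as a BASE64 string. It assumes the index codecs are a little-endian ``bytes`` codec followed by ``crc32c``, and that the checksum is appended as 4 little-endian bytes. The helper names and the offsets are hypothetical, and the configuration dictionaries are fragments that omit the other codec parameters.

```python
import base64
import json
import struct


def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C (Castagnoli); slow but dependency-free."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def index_as_json_array(pairs):
    """Return the [N, 2] index as nested lists of integers."""
    return [[int(offset), int(nbytes)] for offset, nbytes in pairs]


def index_as_base64(pairs):
    """Encode the index as little-endian uint64 values plus a CRC-32C."""
    body = b"".join(struct.pack("<QQ", offset, nbytes) for offset, nbytes in pairs)
    # Assumption: the checksum is appended as 4 little-endian bytes.
    checksum = struct.pack("<I", crc32c(body))
    return base64.b64encode(body + checksum).decode("ascii")


# Hypothetical index: two inner chunks of 100 and 50 bytes.
pairs = [(0, 100), (100, 50)]

config_json_form = {
    "index_location": "configuration",
    "index_codecs": [],  # no array -> bytes codec alongside a JSON array
    "predefined_index": index_as_json_array(pairs),
}
config_base64_form = {
    "index_location": "configuration",
    "index_codecs": [
        {"name": "bytes", "configuration": {"endian": "little"}},
        {"name": "crc32c"},
    ],
    "predefined_index": index_as_base64(pairs),
}

print(json.dumps(config_json_form))
print(json.dumps(config_base64_form))
```

The JSON array form stays human-readable and needs no ``array -> bytes`` codec, while the BASE64 form reuses the same bytes that would otherwise sit at the start or end of the shard.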
171
192
172
193
Definitions
173
194
===========
@@ -190,7 +211,7 @@ This is an ``array -> bytes`` codec.
190
211
191
212
In the ``sharding_indexed`` binary format, inner chunks are written successively in a
192
213
shard, where unused space between them is allowed. An index referencing them may
193
-
precede, follow, or exist external to the shard.
214
+
precede, follow, or be defined directly in the codec configuration.
194
215
195
216
The index is an array with 64-bit unsigned integers with a shape that matches the
196
217
chunks per shard tuple with an appended dimension of size 2.
@@ -208,11 +229,24 @@ Empty inner chunks are interpreted as being filled with the fill value. The inde
208
229
always has the full shape of all possible inner chunks per shard, even if they extend
209
230
beyond the array shape.
210
231
211
-
The index is either placed at the end of the file or, at the beginning of the file,
212
-
or under its own key, as configured by the ``index_location`` parameter. The index
213
-
is encoded into binary representations using the specified index codecs. The byte
214
-
size of the index is determined by the number of inner chunks in the shard ``n``,
215
-
i.e. the product of chunks per shard, and the choice of index codecs.
232
+
The index is either placed at the end of the file, at the beginning of the file,
233
+
or defined directly within the codec configuration, as configured by the
234
+
``index_location`` parameter.
235
+
236
+
When ``index_location`` is ``start`` or ``end``, the index is encoded into a
237
+
binary representation using the specified index codecs, which must include one
238
+
``array -> bytes`` codec.
239
+
240
+
When ``index_location`` is ``configuration``, the index is provided via the
241
+
``predefined_index`` parameter.
242
+
If ``predefined_index`` is a BASE64 encoded string, its content is the binary
243
+
representation produced by applying the full ``index_codecs`` chain (including
244
+
an ``array -> bytes`` codec) to the index array.
245
+
If ``predefined_index`` is a JSON array, it represents the index *after* any
246
+
``array -> array`` codecs in the ``index_codecs`` chain have been applied. In
247
+
this case, the ``index_codecs`` chain MUST NOT contain an ``array -> bytes``
248
+
codec. When the index is stored in binary form, its byte size is determined by
249
+
the number of inner chunks in the shard ``n``, i.e. the product of chunks per shard, and the choice of index codecs.
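
The size of a binary index can be computed ahead of time. The sketch below assumes that the shard shape is an exact multiple of the inner chunk shape, that each index entry is two ``uint64`` values (16 bytes), and that a trailing ``crc32c`` checksum adds 4 bytes. The function names are hypothetical.

```python
from math import prod


def chunks_per_shard(shard_shape, inner_chunk_shape):
    """Number of inner chunks along each dimension of a shard."""
    return tuple(s // c for s, c in zip(shard_shape, inner_chunk_shape))


def binary_index_nbytes(shard_shape, inner_chunk_shape, checksum_nbytes=4):
    """Byte size of the binary index: n entries of 2 x uint64, plus checksum."""
    n = prod(chunks_per_shard(shard_shape, inner_chunk_shape))
    return n * 2 * 8 + checksum_nbytes


print(binary_index_nbytes((64, 64), (32, 32)))
```

For a ``[64, 64]`` shard with ``[32, 32]`` inner chunks this gives 4 entries of 16 bytes plus a 4-byte checksum, i.e. 68 bytes. Passing ``checksum_nbytes=0`` models an index codec chain without a checksum.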
216
250
217
251
For an example, consider a shard shape of ``[64, 64]``, an inner chunk shape of
218
252
``[32, 32]`` and an index codec combination of a little-endian codec followed by
@@ -259,8 +293,9 @@ common optimizations.
259
293
* **Decoding**: A simple implementation to decode inner chunks in a shard would (a)
260
294
read the entire value from the store into a byte buffer, (b) parse the shard
261
295
index as specified above from the beginning or end (according to the
262
-
``index_location``) of the buffer or from an external index and (c) cut out
263
-
the relevant bytes that belong to the requested chunk. The relevant bytes are
296
+
``index_location``) of the buffer, or retrieve it directly from the
297
+
``predefined_index`` parameter when ``index_location`` is ``configuration``,
298
+
and (c) cut out the relevant bytes that belong to the requested chunk. The relevant bytes are
264
299
determined by the ``offset,nbytes`` pair in the shard index. This bytestream
265
300
then needs to be decoded with the inner codecs as specified in the sharding
266
301
configuration applying the :ref:`decoding_procedure`. This is similar to how
@@ -274,24 +309,33 @@ common optimizations.
274
309
read the entire shard once into a byte buffer and then cut out and decode all
275
310
inner chunks from that buffer in one pass.
276
311
277
-
If the underlying store supports partial reads, the decoding of single inner
278
-
chunks can be optimized. In that case, the shard index can be read from the
279
-
store by requesting the ``n`` first or last bytes (according to the
280
-
``index_location``), where ``n`` is the size of the index as determined by
281
-
the number of inner chunks in the shard and choice of index codecs. After
282
-
parsing the shard index, single inner chunks can be requested from the store
283
-
by specifying the byte range. The bytestream, then, needs to be decoded as above.
312
+
If the underlying store supports partial reads and ``index_location`` is
313
+
``start`` or ``end``, the decoding of single inner chunks can be optimized.
314
+
In that case, the shard index can be read from the store by requesting the
315
+
``n`` first or last bytes (according to the ``index_location``), where ``n``
316
+
is the size of the index as determined by the number of inner chunks in the
317
+
shard and choice of index codecs. After parsing the shard index, single
318
+
inner chunks can be requested from the store by specifying the byte range.
319
+
If ``index_location`` is ``configuration``, the index is directly available
320
+
from the codec configuration and no partial read is needed for the index itself.
321
+
The bytestream, then, needs to be decoded as above.
284
322
285
323
* **Encoding**: A simple implementation to encode a chunk in a shard would (a)
286
324
encode the new chunk per :ref:`encoding_procedure` in a byte buffer using the
287
325
shard's inner codecs, (b) read an existing shard from the store, (c) create a
288
326
new bytestream with all encoded inner chunks of that shard including the overwritten
289
-
chunk, (d) generate a new shard index that is prepended, appended, or
290
-
externally written (according to the ``index_location``) to the chunk
327
+
chunk, (d) generate a new shard index that is prepended or appended
328
+
(according to the ``index_location``) to the chunk
291
329
bytestream and (e) write the shard to the store. If there was no existing
292
330
shard, an empty shard is assumed. When writing entire inner chunks, reading
293
331
the existing shard first may be skipped.
294
332
333
+
Due to the difficulty of updating an index stored in the array metadata,
334
+
implementations MAY consider any array using ``"index_location": "configuration"``
335
+
(at any level of nesting) to be read-only. Writing to such an array may
336
+
produce an error or lead to a corrupted state if the written data would
337
+
require a change to the predefined index.
338
+
295
339
When working with inner chunks that have a fixed byte size (e.g., uncompressed) and
296
340
a store that supports partial writes, an optimization would be to replace the
297
341
new chunk by writing to the store at the specified byte range.
@@ -305,16 +349,14 @@ common optimizations.
305
349
Other use case-specific optimizations may be available, e.g., for append-only
306
350
workloads.
307
351
308
-
* **Nesting**: The ``sharding_indexed`` codec can be used as part of a codec
352
+
* **Nesting**: The ``sharding_indexed`` codec MAY be used as part of a codec
309
353
chain of another ``sharding_indexed`` codec. This means that an inner chunk
310
354
MAY itself be a shard nested within an outer chunk, creating a hierarchical
311
355
index and multiple levels of partitioning. While the number of nested levels
312
356
of shards is not restricted, some implementations MAY support a limited
313
-
number of nested shards or MAY NOT support nesting. Primary shards that
314
-
are not contained within other shards MAY have an ``index_location`` value of
315
-
``start``, ``end``, or ``external``. Nested shards MAY have an
316
-
``index_location`` value of ``start`` or ``end``. Nested shards MUST NOT have
317
-
an ``index_location`` value of ``external``.
357
+
number of nested shards or MAY NOT support nesting. Both primary and nested
358
+
shards MAY have an ``index_location`` value of ``start``, ``end``, or
359
+
``configuration``.
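
The simple decoding and encoding procedures described in the list above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation. It assumes that inner chunks are already encoded byte strings, that the index codec is a plain little-endian ``uint64`` encoding without a checksum, that offsets are measured from the start of the shard (so an index at the start shifts every chunk), and that empty inner chunks are marked with ``2^64 - 1`` for both values. All function names are hypothetical.

```python
import struct

EMPTY = 2**64 - 1  # offset/nbytes marker for an empty inner chunk
ENTRY = struct.Struct("<QQ")  # one (offset, nbytes) pair, little-endian


def encode_shard(chunks, index_location="end"):
    """Concatenate encoded inner chunks and build the index.

    Returns (shard_bytes, index_pairs). For "configuration" the index is
    not embedded; the caller stores index_pairs as ``predefined_index``.
    """
    index_nbytes = ENTRY.size * len(chunks)
    # An index placed at the start shifts every chunk by the index size.
    cursor = index_nbytes if index_location == "start" else 0
    pairs, body = [], b""
    for chunk in chunks:
        if chunk is None:
            pairs.append((EMPTY, EMPTY))
            continue
        pairs.append((cursor, len(chunk)))
        body += chunk
        cursor += len(chunk)
    index = b"".join(ENTRY.pack(*pair) for pair in pairs)
    if index_location == "start":
        return index + body, pairs
    if index_location == "end":
        return body + index, pairs
    return body, pairs


def read_index(shard, n_chunks, index_location, predefined_index=None):
    """Obtain the (offset, nbytes) pairs from the shard or the configuration."""
    if index_location == "configuration":
        return [tuple(pair) for pair in predefined_index]
    index_nbytes = ENTRY.size * n_chunks
    raw = shard[:index_nbytes] if index_location == "start" else shard[-index_nbytes:]
    return [ENTRY.unpack_from(raw, i * ENTRY.size) for i in range(n_chunks)]


def decode_inner_chunk(shard, chunk_number, n_chunks, index_location,
                       predefined_index=None):
    """Cut the bytes of one inner chunk out of the shard buffer."""
    index = read_index(shard, n_chunks, index_location, predefined_index)
    offset, nbytes = index[chunk_number]
    if (offset, nbytes) == (EMPTY, EMPTY):
        return None  # the caller substitutes the fill value
    return shard[offset:offset + nbytes]


chunks = [b"aaaa", None, b"bb"]
for location in ("start", "end", "configuration"):
    shard, pairs = encode_shard(chunks, location)
    predefined = pairs if location == "configuration" else None
    decoded = [decode_inner_chunk(shard, i, len(chunks), location, predefined)
               for i in range(len(chunks))]
    print(location, decoded == chunks)
```

With ``configuration``, the shard bytes contain only chunk data, which is why rewriting a chunk would also require rewriting the array metadata that holds the index.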
318
360
319
361
References
320
362
==========
@@ -326,7 +368,7 @@ References
326
368
Change log
327
369
==========
328
370
329
-
* Add ``external`` as a parameter value for ``index_location`` to Version 1.1 and clarified nesting. `PR ABC <https://github.com/zarr-developers/zarr-specs/pull/ABC>`_
371
+
* Add ``configuration`` as a parameter value for ``index_location`` to Version 1.1 and clarify nesting. `PR 368 <https://github.com/zarr-developers/zarr-specs/pull/368>`_