Commit 7957bb9

Merge pull request #280 from scalableminds/sharding-index-location
Adds index_location to sharding codec
2 parents f818fde + 1d78fb4

1 file changed: +38 −22 lines changed

docs/v3/codecs/sharding-indexed/v1.0.rst

@@ -117,7 +117,8 @@ Sharding can be configured per array in the :ref:`array-metadata` as follows::
           }
         },
         { "name": "crc32c" }
-      ]
+      ],
+      "index_location": "end"
     }
   }
 ]
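Pieced together from the context lines above, the relevant portion of the codec ``configuration`` reads roughly as follows. The ``endian`` codec name and the surrounding nesting are assumptions based on the v3 core specification, not part of this diff::

    "index_codecs": [
        {
            "name": "endian",
            "configuration": { "endian": "little" }
        },
        { "name": "crc32c" }
    ],
    "index_location": "end"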
@@ -151,6 +152,12 @@ Sharding can be configured per array in the :ref:`array-metadata` as follows::
     be used for index codecs. It is RECOMMENDED to use a little-endian codec
     followed by a crc32c checksum as index codecs.
 
+``index_location``
+
+    Specifies whether the shard index is located at the beginning or end of the
+    file. The parameter value must be either the string ``start`` or ``end``.
+    If the parameter is not present, the value defaults to ``end``.
+
 Definitions
 ===========

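A minimal sketch of how an implementation might validate this parameter when parsing the codec configuration; the function name and error handling are illustrative, not part of the specification::

    def parse_index_location(configuration: dict) -> str:
        # "index_location" is optional and defaults to "end".
        location = configuration.get("index_location", "end")
        if location not in ("start", "end"):
            raise ValueError(
                f"index_location must be 'start' or 'end', got {location!r}"
            )
        return location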
@@ -190,10 +197,11 @@ Empty inner chunks are interpreted as being filled with the fill value. The index
 always has the full shape of all possible inner chunks per shard, even if they extend
 beyond the array shape.
 
-The index is placed at the end of the file and encoded into binary representations
-using the specified index codecs. The byte size of the index is determined by the
-number of inner chunks in the shard ``n``, i.e. the product of chunks per shard, and
-the choice of index codecs.
+The index is either placed at the end of the file or at the beginning of the file,
+as configured by the ``index_location`` parameter. The index is encoded into binary
+representations using the specified index codecs. The byte size of the index is
+determined by the number of inner chunks in the shard ``n``, i.e. the product of
+chunks per shard, and the choice of index codecs.
 
 For an example, consider a shard shape of ``[64, 64]``, an inner chunk shape of
 ``[32, 32]`` and an index codec combination of a little-endian codec followed by
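With the recommended index codecs, each inner chunk contributes an ``offset``/``nbytes`` pair of little-endian ``uint64`` values, and the crc32c checksum adds four trailing bytes, so the index size for the example above can be computed with a sketch like this (function name is illustrative)::

    from math import prod

    def index_byte_size(shard_shape, inner_chunk_shape):
        # One (offset, nbytes) pair of uint64 values per inner chunk,
        # plus the 4-byte crc32c checksum over the encoded index.
        chunks_per_shard = [s // c for s, c in zip(shard_shape, inner_chunk_shape)]
        n = prod(chunks_per_shard)
        return n * 2 * 8 + 4

    print(index_byte_size([64, 64], [32, 32]))  # 4 inner chunks -> 68 bytes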
@@ -239,12 +247,13 @@ common optimizations.
 
 * **Decoding**: A simple implementation to decode inner chunks in a shard would (a)
   read the entire value from the store into a byte buffer, (b) parse the shard
-  index as specified above from the end of the buffer and (c) cut out the
-  relevant bytes that belong to the requested chunk. The relevant bytes are
-  determined by the ``offset,nbytes`` pair in the shard index. This bytestream
-  then needs to be decoded with the inner codecs as specified in the sharding
-  configuration applying the :ref:`decoding_procedure`. This is similar to how
-  an implementation would access a sub-slice of a chunk.
+  index as specified above from the beginning or end (according to the
+  ``index_location``) of the buffer and (c) cut out the relevant bytes that belong
+  to the requested chunk. The relevant bytes are determined by the
+  ``offset,nbytes`` pair in the shard index. This bytestream then needs to be
+  decoded with the inner codecs as specified in the sharding configuration applying
+  the :ref:`decoding_procedure`. This is similar to how an implementation would
+  access a sub-slice of a chunk.
 
   The size of the index can be determined by applying ``c.compute_encoded_size``
   for each index codec recursively. The initial size is the byte size of the index
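The simple decoding path above can be sketched in a few lines of Python. This assumes the recommended index codecs (so the encoded index occupies ``n * 16 + 4`` bytes) and the all-ones sentinel for empty inner chunks defined elsewhere in this document; names are illustrative::

    import numpy as np

    def read_inner_chunk(shard: bytes, chunk: int, n: int, index_location: str):
        index_size = n * 16 + 4  # n (offset, nbytes) uint64 pairs + crc32c
        if index_location == "start":
            index_bytes = shard[:index_size]
        else:  # "end", the default
            index_bytes = shard[-index_size:]
        # Drop the 4-byte checksum, then parse little-endian uint64 pairs.
        index = np.frombuffer(index_bytes[:-4], dtype="<u8").reshape(n, 2)
        offset, nbytes = (int(v) for v in index[chunk])
        if offset == 2**64 - 1 and nbytes == 2**64 - 1:
            return None  # empty inner chunk: interpreted as the fill value
        # The result still needs the inner codecs' decoding procedure.
        return shard[offset : offset + nbytes]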
@@ -256,25 +265,31 @@ common optimizations.
 
   If the underlying store supports partial reads, the decoding of single inner
   chunks can be optimized. In that case, the shard index can be read from the
-  store by requesting the ``n`` last bytes, where ``n`` is the size of the index
-  as determined by the number of inner chunks in the shard and choice of index
-  codecs. After parsing the shard index, single inner chunks can be requested from
-  the store by specifying the byte range. The bytestream, then, needs to be
-  decoded as above.
+  store by requesting the ``n`` first or last bytes (according to the
+  ``index_location``), where ``n`` is the size of the index as determined by
+  the number of inner chunks in the shard and choice of index codecs. After
+  parsing the shard index, single inner chunks can be requested from the store
+  by specifying the byte range. The bytestream, then, needs to be decoded as above.
 
 * **Encoding**: A simple implementation to encode a chunk in a shard would (a)
   encode the new chunk per :ref:`encoding_procedure` in a byte buffer using the
   shard's inner codecs, (b) read an existing shard from the store, (c) create a
   new bytestream with all encoded inner chunks of that shard including the overwritten
-  chunk, (d) generate a new shard index that is appended to the chunk bytestream
-  and (e) writes the shard to the store. If there was no existing shard, an
-  empty shard is assumed. When writing entire inner chunks, reading the existing shard
-  first may be skipped.
+  chunk, (d) generate a new shard index that is prepended or appended (according
+  to the ``index_location``) to the chunk bytestream and (e) write the shard to
+  the store. If there was no existing shard, an empty shard is assumed. When
+  writing entire inner chunks, reading the existing shard first may be skipped.
 
   When working with inner chunks that have a fixed byte size (e.g., uncompressed) and
   a store that supports partial writes, an optimization would be to replace the
   new chunk by writing to the store at the specified byte range.
 
+  On stores with random-write capabilities, it may be useful to (a) place the shard
+  index at the beginning of the file, (b) write out inner chunks in
+  application-specific order, and (c) update the shard index accordingly.
+  Synchronization of inner chunks written in parallel needs to be handled by the
+  application.
+
 Other use case-specific optimizations may be available, e.g., for append-only
 workloads.

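Where the store offers ranged reads, fetching the index as described in the decoding bullet above might look like this; ``store.get_range`` is a hypothetical store method, not an API this specification defines::

    def read_shard_index(store, key: str, index_size: int,
                         index_location: str) -> bytes:
        # Hypothetical ranged-read API: start=None requests the trailing
        # `length` bytes, i.e. a suffix read (e.g. an HTTP suffix range).
        if index_location == "start":
            return store.get_range(key, start=0, length=index_size)
        return store.get_range(key, start=None, length=index_size)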
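Similarly, step (d) of the encoding bullet places the encoded index according to ``index_location``. A sketch with illustrative names, assuming the ``offset`` values recorded in the index refer to byte positions within the final shard file (so a leading index shifts all chunk offsets by its own size)::

    def assemble_shard(encoded_chunks: list, encoded_index: bytes,
                       index_location: str) -> bytes:
        body = b"".join(encoded_chunks)
        if index_location == "start":
            return encoded_index + body  # index prepended to the chunks
        return body + encoded_index      # index appended (the default)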
@@ -289,5 +304,6 @@ References
 Change log
 ==========
 
-This section is a placeholder for keeping a log of the snapshots of this
-document that are tagged in GitHub and what changed between them.
+* Adds ``index_location`` parameter. `PR 280 <https://github.com/zarr-developers/zarr-specs/pull/280>`_
+
+* ZEP0002 was accepted. `Issue 254 <https://github.com/zarr-developers/zarr-specs/pull/254>`_
