Commit 9346bcc

Merge pull request #222 from jbms/codec-partial-decode
Describe partial decode support for codecs
2 parents 23a74f7 + d5bfc6e commit 9346bcc

6 files changed, +70 -16 lines changed

docs/v3/codecs/blosc/v1.0.rst

Lines changed: 3 additions & 2 deletions
@@ -24,8 +24,7 @@ is licensed under a `Creative Commons Attribution 3.0 Unported License
 Abstract
 ========
 
-This specification defines an implementation of the Zarr abstract
-store API using a file system.
+Defines a ``bytes -> bytes`` codec that uses the blosc container format.
 
 
 Status of this document
@@ -106,6 +105,8 @@ default block size::
 Format and algorithm
 ====================
 
+This is a ``bytes -> bytes`` codec.
+
 Blosc is a meta-compressor, which divides an input buffer into blocks,
 then applies an internal compression algorithm to each block, then
 packs the encoded blocks together into a single output buffer with a

docs/v3/codecs/endian/v1.0.rst

Lines changed: 4 additions & 2 deletions
@@ -26,8 +26,8 @@ is licensed under a `Creative Commons Attribution 3.0 Unported License
 Abstract
 ========
 
-This specification defines an implementation of the Zarr abstract
-store API using a file system.
+Defines an ``array -> bytes`` codec that encodes arrays of fixed-size numeric
+data types as little endian or big endian in lexicographical order.
 
 
 Status of this document
@@ -65,6 +65,8 @@ endian:
 Format and algorithm
 ====================
 
+This is an ``array -> bytes`` codec.
+
 Each element of the array is encoded using the specified endian variant of its
 default binary representation. Array elements are encoded in lexicographical
 order. For example, with ``endian`` specified as ``big``, the ``int32`` data

docs/v3/codecs/gzip/v1.0.rst

Lines changed: 3 additions & 2 deletions
@@ -24,8 +24,7 @@ is licensed under a `Creative Commons Attribution 3.0 Unported License
 Abstract
 ========
 
-This specification defines an implementation of the Zarr abstract
-store API using a file system.
+Defines a ``bytes -> bytes`` codec that applies gzip compression.
 
 
 Status of this document
@@ -79,6 +78,8 @@ the Gzip codec configured with a compression level of 1::
 Format and algorithm
 ====================
 
+This is a ``bytes -> bytes`` codec.
+
 Encoding and decoding is performed using the algorithm defined in
 [RFC1951]_.
 

docs/v3/codecs/sharding-indexed/v1.0.rst

Lines changed: 3 additions & 2 deletions
@@ -27,8 +27,7 @@ is licensed under a `Creative Commons Attribution 3.0 Unported License
 Abstract
 ========
 
-This specification defines an implementation of the Zarr codec specification
-for sharding.
+This specification defines a Zarr ``array -> bytes`` codec for sharding.
 
 Sharding logically splits chunks ("shards") into sub-chunks ("inner chunks")
 that can be individually compressed and accessed. This allows to colocate
@@ -121,6 +120,8 @@ Sharding can be configured per array in the :ref:`array-metadata` as follows::
 Binary shard format
 ===================
 
+This is an ``array -> bytes`` codec.
+
 In the ``sharding_indexed`` binary format, chunks are written successively in a
 shard, where unused space between them is allowed, followed by an index
 referencing them. The index is placed at the end of the file and has a size of

docs/v3/codecs/transpose/v1.0.rst

Lines changed: 3 additions & 4 deletions
@@ -26,7 +26,8 @@ is licensed under a `Creative Commons Attribution 3.0 Unported License
 Abstract
 ========
 
-Defines a codec that permutes the dimensions of the chunk array.
+Defines an ``array -> array`` codec that permutes the dimensions of the chunk
+array.
 
 
 Status of this document
@@ -71,9 +72,7 @@ order:
 Format and algorithm
 ====================
 
-The decoded chunk representation to which this codec is applied must be an
-array. Implementations must fail if this codec is specified immediately after
-another codec that produces a byte string as its encoded representation.
+This is an ``array -> array`` codec.
 
 Given a chunk array ``A`` with shape ``A_shape`` as the decoded representation,
 the encoded representation is an array ``B`` with the same data type as ``A``

docs/v3/core/v3.0.rst

Lines changed: 54 additions & 4 deletions
@@ -1089,6 +1089,36 @@ each of these two representations are defined to be either:
 - a multi-dimensional array of some shape and data type, or
 - a byte string.
 
+Based on the input and output representations for the encode transform,
+codecs can be classified as one of three kinds:
+
+- ``array -> array``
+- ``array -> bytes``
+- ``bytes -> bytes``
+
+.. note::
+
+   ``bytes -> array`` codecs, where after encoding an array as a byte
+   string, it is subsequently transformed back into an array, to then later
+   be transformed back into a byte string, are not currently allowed, due to
+   the lack of a clear use case.
+
+If multiple codecs are specified for an array, each codec is applied
+sequentially; when encoding, the encoded output of codec ``i`` serves as the
+decoded input of codec ``i+1``, and similarly when decoding, the decoded output
+of codec ``i+1`` serves as the encoded input to codec ``i``. Since ``bytes ->
+array`` codecs are not supported, it follows that the list of codecs must be of
+the following form:
+
+- zero or more ``array -> array`` codecs; followed by
+- at most one ``array -> bytes`` codec; followed by
+- zero or more ``bytes -> bytes`` codecs.
+
+If no ``array -> bytes`` codec is specified, then the default byte
+representation for the data type of the array is used. For all data types
+currently defined by the core spec, that is equivalent to the ``endian`` codec
+with an endianness of ``little``.
+
 Logically, a codec ``c`` must define three properties:
 
 - ``c.compute_encoded_representation_type(decoded_representation_type)``, a
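The ordering rule added in the hunk above (zero or more ``array -> array`` codecs, at most one ``array -> bytes`` codec, then zero or more ``bytes -> bytes`` codecs) could be checked roughly as follows. This is a minimal sketch, not part of the commit: the function names, the use of plain strings for codec kinds, and the choice of ``ValueError`` are all assumptions made for illustration.

```python
from typing import Sequence

# Codec kinds, spelled as in the spec text.
ARRAY_TO_ARRAY = "array -> array"
ARRAY_TO_BYTES = "array -> bytes"
BYTES_TO_BYTES = "bytes -> bytes"

# Hypothetical helper table: position of each kind within a valid chain.
_STAGE = {ARRAY_TO_ARRAY: 0, ARRAY_TO_BYTES: 1, BYTES_TO_BYTES: 2}


def validate_codec_chain(kinds: Sequence[str]) -> None:
    """Raise ValueError unless the chain has the form required by the spec."""
    stage = 0
    array_to_bytes_seen = False
    for index, kind in enumerate(kinds):
        if kind not in _STAGE:
            raise ValueError(f"codec {index}: unknown kind {kind!r}")
        # A codec may never appear after a codec of a later stage.
        if _STAGE[kind] < stage:
            raise ValueError(f"codec {index}: {kind!r} is out of order")
        # At most one array -> bytes codec is allowed.
        if kind == ARRAY_TO_BYTES:
            if array_to_bytes_seen:
                raise ValueError(f"codec {index}: second {kind!r} codec")
            array_to_bytes_seen = True
        stage = _STAGE[kind]


def is_valid_codec_chain(kinds: Sequence[str]) -> bool:
    """Boolean convenience wrapper around validate_codec_chain."""
    try:
        validate_codec_chain(kinds)
    except ValueError:
        return False
    return True
```

An empty chain and a chain with only ``bytes -> bytes`` codecs both pass, since the spec text says the default byte representation is used when no ``array -> bytes`` codec is given.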
@@ -1106,11 +1136,31 @@ Logically, a codec ``c`` must define three properties:
 - ``c.decode(encoded_value, decoded_representation_type)``, a procedure that
   computes the decoded representation, and is used when reading an array.
 
-If more than one codec is specified for an array, each codec is applied
-sequentially; when encoding, the encoded output of codec ``i`` serves as the
-decoded input of codec ``i+1``, and similarly when decoding, the decoded
-output of codec ``i+1`` serves as the encoded input to codec ``i``.
+Implementations MAY support partial decoding for certain codecs
+(e.g. sharding, blosc). Logically, partial decoding may be defined in terms
+of an additional operation:
+
+- ``c.partial_decode(input_handle, decoded_representation_type,
+  decoded_regions)``, where:
+
+  - ``input_handle`` provides an interface for requesting partial reads of
+    the encoded representation and itself supports the same
+    ``partial_decode`` interface;
+  - ``decoded_representation_type`` is the same as for ``c.decode``;
+  - ``decoded_regions`` specifies the regions of the decoded representation
+    that must be returned.
+
+  If the decoded representation is a multi-dimensional array, then
+  ``decoded_regions`` specifies a subset of the array's domain. If the
+  decoded representation is a byte string, then ``decoded_regions``
+  specifies a list of byte ranges.
+
+.. note::
 
+   If ``partial_decode`` is not supported by a particular codec, it can
+   always be implemented in terms of ``decode`` by simply decoding in full
+   and then satisfying any ``decoded_regions`` requests directly from the
+   cached decoded representation.
 
 
 Determination of encoded representations
 ----------------------------------------
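The fallback described in the note above (decode in full once, then serve each ``decoded_regions`` request from the cached result) might look like the sketch below for a ``bytes -> bytes`` codec. Everything here is assumed for illustration: the class names, the ``read_all`` method on the handle, and the use of half-open ``(start, stop)`` tuples for byte ranges are not defined by the spec. Python's standard ``gzip`` module stands in for the codec.

```python
import gzip
from typing import List, Optional, Sequence, Tuple

# Assumed convention: a byte range is a half-open (start, stop) pair.
ByteRange = Tuple[int, int]


class BytesHandle:
    """Hypothetical input handle backed by an in-memory byte string."""

    def __init__(self, data: bytes) -> None:
        self._data = data

    def read_all(self) -> bytes:
        # A real handle would also offer partial reads; the fallback
        # path only ever needs the full encoded value.
        return self._data


class GzipCodecSketch:
    """Stand-in bytes -> bytes codec that only supports full decode."""

    def decode(self, encoded_value: bytes, decoded_representation_type=None) -> bytes:
        return gzip.decompress(encoded_value)


class FullDecodeFallback:
    """Provides partial_decode on top of a codec that only has decode."""

    def __init__(self, codec, input_handle, decoded_representation_type=None) -> None:
        self._codec = codec
        self._handle = input_handle
        self._type = decoded_representation_type
        self._cache: Optional[bytes] = None

    def partial_decode(self, decoded_regions: Sequence[ByteRange]) -> List[bytes]:
        # Decode in full on first use, then reuse the cached result.
        if self._cache is None:
            self._cache = self._codec.decode(self._handle.read_all(), self._type)
        return [self._cache[start:stop] for start, stop in decoded_regions]
```

A caller would wrap the codec and handle once and then issue any number of region requests; only the first request pays the cost of the full decode.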
