Commit f7f7a2e

Merge pull request #249 from jbms/remove-default-binary-representation
Require array -> bytes codec be specified in metadata
2 parents f2db45d + 7e6defc commit f7f7a2e

2 files changed

+76
-85
lines changed

docs/v3/codecs/endian/v1.0.rst

Lines changed: 45 additions & 11 deletions
@@ -74,20 +74,54 @@ Format and algorithm
7474
This is an ``array -> bytes`` codec.
7575

7676
Each element of the array is encoded using the specified endian variant of its
77-
default binary representation. Array elements are encoded in lexicographical
78-
order. For example, with ``endian`` specified as ``big``, the ``int32`` data
79-
type is encoded as a 4-byte big endian two's complement integer, and the
80-
``complex128`` data type is encoded as two consecutive 8-byte big endian IEEE
81-
754 binary64 values.
77+
binary representation listed below. Array elements are encoded in
78+
lexicographical order. For example, with ``endian`` specified as ``big``, the
79+
``int32`` data type is encoded as a 4-byte big endian two's complement integer,
80+
and the ``complex128`` data type is encoded as two consecutive 8-byte big endian
81+
IEEE 754 binary64 values.
82+
83+
.. list-table:: Supported data types
84+
:header-rows: 1
85+
86+
* - Identifier
87+
- Binary representation
88+
* - ``bool``
89+
- Single byte, with false encoded as ``\\x00`` and true encoded as
90+
``\\x01``. Does not depend on ``endian`` parameter.
91+
* - ``int8``
92+
- 1 byte two's complement. Does not depend on ``endian`` parameter.
93+
* - ``int16``
94+
- 2-byte two's complement
95+
* - ``int32``
96+
- 4-byte two's complement
97+
* - ``int64``
98+
- 8-byte two's complement
99+
* - ``uint8``
100+
- 1 byte. Does not depend on ``endian`` parameter.
101+
* - ``uint16``
102+
- 2-byte
103+
* - ``uint32``
104+
- 4-byte
105+
* - ``uint64``
106+
- 8-byte
107+
* - ``float16`` (optionally supported)
108+
- 2-byte IEEE 754 binary16
109+
* - ``float32``
110+
- 4-byte IEEE 754 binary32
111+
* - ``float64``
112+
- 8-byte IEEE 754 binary64
113+
* - ``complex64``
114+
- 2 consecutive 4-byte IEEE 754 binary32 values (real component followed by imaginary component)
115+
* - ``complex128``
116+
- 2 consecutive 8-byte IEEE 754 binary64 values (real component followed by imaginary component)
117+
* - ``r*``
118+
- number of bits, which must be a multiple of 8, given by ``*``.
82119

83120
.. note::
84121

85-
Since the default binary representation of all data types is little endian,
86-
specifying this codec with ``endian`` equal to ``"little"`` is equivalent to
87-
omitting this codec, because if this codec is omitted, the default binary
88-
representation of the data type, which is always little endian, is used
89-
instead.
90-
122+
To encode elements in a different order than lexicographical order (C
123+
order/row major), the :ref:`transpose codec<transpose-codec-v1>` may be
124+
specified.
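
As an illustration of the table above, the following is a minimal sketch of how a few of the listed data types could be encoded for either ``endian`` value using Python's standard ``struct`` module. The helper name ``encode_elements`` and the ``_FORMATS`` mapping are hypothetical and are not part of the specification; the input is assumed to be a flat sequence of elements that is already in lexicographical (C) order.

```python
import struct

# Hypothetical mapping from data type identifiers in the table to struct
# format characters. Only multi-byte numeric types are covered here.
_FORMATS = {
    "int16": "h", "int32": "i", "int64": "q",
    "uint16": "H", "uint32": "I", "uint64": "Q",
    "float32": "f", "float64": "d",
}


def encode_elements(values, data_type, endian):
    """Encode a flat, C-ordered sequence of elements as a byte string."""
    # ">" and "<" select big and little endian with standard sizes.
    prefix = {"big": ">", "little": "<"}[endian]
    if data_type in ("complex64", "complex128"):
        # Real component followed by imaginary component.
        fmt = "f" if data_type == "complex64" else "d"
        return b"".join(
            struct.pack(prefix + fmt * 2, v.real, v.imag) for v in values
        )
    return struct.pack(prefix + _FORMATS[data_type] * len(values), *values)


print(encode_elements([1, -2], "int32", "big").hex())
# 00000001fffffffe
print(encode_elements([1 + 2j], "complex128", "little").hex())
# 000000000000f03f0000000000000040
```

Single-byte types (``bool``, ``int8``, ``uint8``) are omitted from the sketch because, as the table states, their representation does not depend on the ``endian`` parameter.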
91125

92126
References
93127
==========

docs/v3/core/v3.0.rst

Lines changed: 31 additions & 74 deletions
@@ -269,12 +269,10 @@ The following figure illustrates the first part of the terminology:
269269
*Data type*
270270

271271
A data type defines the set of possible values that an array_ may
272-
contain, and a default binary representation (i.e., sequence of bytes) for
273-
each possible value. For example, the 32-bit signed
274-
integer data type defines binary representations for all integers
275-
in the range −2,147,483,648 to 2,147,483,647. This specification
276-
only defines a limited set of data types, but extensions
277-
may define other data types.
272+
contain. For example, the 32-bit signed integer data type defines binary
273+
representations for all integers in the range −2,147,483,648 to
274+
2,147,483,647. This specification only defines a limited set of data types,
275+
but extensions may define other data types.
278276

279277
.. _chunk:
280278
.. _chunks:
@@ -655,6 +653,17 @@ mandatory names:
655653
the data type will be chosen. However, the default fill value that is
656654
chosen MUST be recorded in the metadata.
657655

656+
``codecs``
657+
^^^^^^^^^^
658+
659+
Specifies a list of codecs to be used for encoding and decoding chunks. The
660+
value must be an array of objects, each object containing a member with
661+
``name`` whose value is a string referring to a v3 codec specification. The
662+
codec object may also contain a ``configuration`` object which consists of
663+
the parameter names and values as defined by the corresponding codec
664+
specification. Since an ``array -> bytes`` codec must be specified, the
665+
list cannot be empty.
666+
658667
The following members are optional:
659668

660669
``attributes``
@@ -673,17 +682,6 @@ The following members are optional:
673682
A proposal to specify metadata conventions (ZEP 4) is being discussed in
674683
https://github.com/zarr-developers/zeps/pull/28.
675684

676-
``codecs``
677-
^^^^^^^^^^
678-
679-
Specifies a list of codecs to be used for encoding and decoding chunks. The
680-
value must be an array of objects, each object containing a member with
681-
``name`` whose value is a string referring to a v3 codec specification. The
682-
codec object may also contain a ``configuration`` object which consists of
683-
the parameter names and values as defined by the corresponding codec
684-
specification. An absent ``codecs`` member is equivalent to specifying an
685-
empty list of codecs.
686-
687685
``storage_transformers``
688686
^^^^^^^^^^^^^^^^^^^^^^^^
689687

@@ -936,52 +934,36 @@ Core data types
936934

937935
* - Identifier
938936
- Numerical type
939-
- Default binary representation
940937
* - ``bool``
941938
- Boolean
942-
- Single byte, with false encoded as ``\\x00`` and true encoded as ``\\x01``.
943939
* - ``int8``
944940
- Integer in ``[-2^7, 2^7-1]``
945-
- 1 byte two's complement
946941
* - ``int16``
947942
- Integer in ``[-2^15, 2^15-1]``
948-
- 2-byte little endian two's complement
949943
* - ``int32``
950944
- Integer in ``[-2^31, 2^31-1]``
951-
- 4-byte little endian two's complement
952945
* - ``int64``
953946
- Integer in ``[-2^63, 2^63-1]``
954-
- 8-byte little endian two's complement
955947
* - ``uint8``
956948
- Integer in ``[0, 2^8-1]``
957-
- 1 byte
958949
* - ``uint16``
959950
- Integer in ``[0, 2^16-1]``
960-
- 2-byte little endian
961951
* - ``uint32``
962952
- Integer in ``[0, 2^32-1]``
963-
- 4-byte little endian
964953
* - ``uint64``
965954
- Integer in ``[0, 2^64-1]``
966-
- 8-byte little endian
967955
* - ``float16`` (optionally supported)
968956
- IEEE 754 half-precision floating point: sign bit, 5 bits exponent, 10 bits mantissa
969-
- 2-byte little endian IEEE 754 binary16
970957
* - ``float32``
971958
- IEEE 754 single-precision floating point: sign bit, 8 bits exponent, 23 bits mantissa
972-
- 4-byte little endian IEEE 754 binary32
973959
* - ``float64``
974960
- IEEE 754 double-precision floating point: sign bit, 11 bits exponent, 52 bits mantissa
975-
- 8-byte little endian IEEE 754 binary64
976961
* - ``complex64``
977962
- real and complex components are each IEEE 754 single-precision floating point
978-
- 2 consecutive 4-byte little endian IEEE 754 binary32 values
979963
* - ``complex128``
980964
- real and complex components are each IEEE 754 double-precision floating point
981-
- 2 consecutive 8-byte little endian IEEE 754 binary64 values
982965
* - ``r*`` (Optional)
983966
- raw bits, use for extension type fallbacks
984-
- variable, given by ``*``, is limited to be a multiple of 8.
985967

986968
Additionally to these base types, an implementation should also handle the
987969
raw/opaque pass-through type designated by the lower-case letter ``r`` followed
@@ -991,11 +973,6 @@ should be understood as fall-back types of respectively 1, 2, and 3 byte length.
991973
Zarr v3 is limited to type sizes that are a multiple of 8 bits but may support
992974
other type sizes in later versions of this specification.
993975

994-
.. note::
995-
996-
While the default binary representation is little endian, the :ref:`endian
997-
codec<endian-codec-v1>` may be specified to use big endian encoding instead.
998-
999976
.. note::
1000977

1001978
We are explicitly looking for more feedback and prototypes of code using the ``r*``,
@@ -1111,7 +1088,7 @@ the chain of codecs_ specified by the ``codecs`` metadata field.
11111088
Codecs
11121089
------
11131090

1114-
An array_ may be associated with a list of *codecs*. Each codec specifies a
1091+
An array_ has an associated list of *codecs*. Each codec specifies a
11151092
bidirectional transform (an *encode* transform and a *decode* transform).
11161093

11171094
Each codec has an *encoded representation* and a *decoded representation*;
@@ -1142,14 +1119,9 @@ array`` codecs are not supported, it follows that the list of codecs must be of
11421119
the following form:
11431120

11441121
- zero or more ``array -> array`` codecs; followed by
1145-
- at most one ``array -> bytes`` codec; followed by
1122+
- exactly one ``array -> bytes`` codec; followed by
11461123
- zero or more ``bytes -> bytes`` codecs.
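
The required form of the codec list can be checked mechanically. The following is a minimal sketch, assuming a hypothetical ``CODEC_KINDS`` lookup that classifies codec names by their transform type; the function name ``validate_codecs`` and the lookup table are illustrative and are not defined by the specification.

```python
# Hypothetical classification of a few codec names. A real implementation
# would obtain this from each codec's own specification.
CODEC_KINDS = {
    "transpose": "array -> array",
    "endian": "array -> bytes",
    "gzip": "bytes -> bytes",
}


def validate_codecs(codecs):
    """Raise ValueError unless the list has the required form."""
    kinds = [CODEC_KINDS[codec["name"]] for codec in codecs]
    if kinds.count("array -> bytes") != 1:
        raise ValueError("exactly one array -> bytes codec is required")
    pivot = kinds.index("array -> bytes")
    # Everything before the pivot must be array -> array.
    if any(kind != "array -> array" for kind in kinds[:pivot]):
        raise ValueError("only array -> array codecs may come first")
    # Everything after the pivot must be bytes -> bytes.
    if any(kind != "bytes -> bytes" for kind in kinds[pivot + 1:]):
        raise ValueError("only bytes -> bytes codecs may come last")


validate_codecs([
    {"name": "transpose"},
    {"name": "endian", "configuration": {"endian": "little"}},
    {"name": "gzip"},
])
```

Under this check an empty list is rejected, which matches the requirement that an ``array -> bytes`` codec always be present.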
11471124

1148-
If no ``array -> bytes`` codec is specified, then the default byte
1149-
representation for the data type of the array is used. For all data types
1150-
currently defined by the core spec, that is equivalent to the ``endian`` codec
1151-
with an endianness of ``little``.
1152-
11531125
Logically, a codec ``c`` must define three properties:
11541126

11551127
- ``c.compute_encoded_representation_type(decoded_representation_type)``, a
@@ -1224,27 +1196,6 @@ codec in the chain must first be determined as follows:
12241196
If ``compute_encoded_representation_type`` fails because of an incompatible
12251197
decoded representation, an implementation should indicate an error.
12261198

1227-
.. _default-array-byte-string-conversion:
1228-
1229-
Conversion between multi-dimensional array and byte string representations
1230-
--------------------------------------------------------------------------
1231-
1232-
Some codecs operate directly on multi-dimensional arrays of elements,
1233-
e.g. encoding a 3-d array as a multi-channel jpeg image. Other codecs operate
1234-
at the byte level, e.g. gzip compression. If a codec that operates at the byte
1235-
level receives as input an array that is not a 1-dimensional uint8 array, it may
1236-
convert the input array to a byte string by concatenating the default binary
1237-
representations of each element in lexicographical order (C order). Similarly,
1238-
if a codec that expects a multi-dimensional array as input instead receives a
1239-
byte string, it may decode each element in lexicographical order according to
1240-
the default binary representation of each element.
1241-
1242-
.. note::
1243-
1244-
To encode elements in a different order than the default lexicographical
1245-
order (C order/row major), the :ref:`transpose codec<transpose-codec-v1>` may
1246-
be specified.
1247-
12481199
.. _encoding_procedure:
12491200

12501201
Encoding procedure
@@ -1260,11 +1211,9 @@ the following procedure:
12601211
2. For each codec ``codecs[i]`` in ``codecs``, ``EC[i+1] :=
12611212
codecs[i].encode(EC[i])``.
12621213

1263-
3. The final encoded chunk representation ``EC_final`` is always a byte string.
1264-
If ``EC[codecs.length]`` is a byte string, then ``EC_final :=
1265-
EC[codecs.length]``. Otherwise, ``EC_final`` is
1266-
:ref:`converted<default-array-byte-string-conversion>` from
1267-
``EC[codecs.length]``.
1214+
3. The final encoded chunk representation ``EC_final := EC[codecs.length]``.
1215+
This is always a byte string due to the requirement that the list of codecs
1216+
include an ``array -> bytes`` codec.
12681217

12691218
4. ``EC_final`` is written to the store_.
12701219

@@ -1278,9 +1227,7 @@ the following procedure:
12781227

12791228
1. The encoded chunk representation ``EC_final`` is read from the store_.
12801229

1281-
2. If ``codecs[codecs.length]`` is a byte string, ``EC[codecs.length] :=
1282-
EC_final``. Otherwise, ``EC[codecs.length]`` is
1283-
:ref:`converted<default-array-byte-string-conversion>` from ``EC_final``.
1230+
2. ``EC[codecs.length] := EC_final``.
12841231

12851232
3. For each codec ``codecs[i]`` in ``codecs``, iterating in reverse order,
12861233
``EC[i] := codecs[i].decode(EC[i+1], decoded_representation[i])``.
@@ -1808,6 +1755,16 @@ All notable and possibly implementation-affecting changes to this specification
18081755
are documented in this section, grouped by the specification status and ordered
18091756
by time.
18101757

1758+
Changes after Provisional Acceptance
1759+
------------------------------------
1760+
1761+
- It is now required to specify an ``array -> bytes`` codec in the ``codecs``
1762+
array metadata field. `PR #249
1763+
<https://github.com/zarr-developers/zarr-specs/pull/249>`_
1764+
- The representation of fill values for floating point numbers was changed to
1765+
avoid ambiguity. `PR #236
1766+
<https://github.com/zarr-developers/zarr-specs/pull/236>`_
1767+
18111768
Draft Changes
18121769
--------------------------
18131770
