Commit f729c93

Merge pull request #214 from jbms/chunk-key-encoding
Add chunk_key_encoding array metadata field
2 parents dd27077 + a4cb904 commit f729c93

1 file changed

+101
-42
lines changed

docs/v3/core/v3.0.rst

Lines changed: 101 additions & 42 deletions
@@ -9,7 +9,7 @@
99

1010
Specification URI:
1111
https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html
12-
12+
1313
Editors:
1414
* Alistair Miles (`@alimanfoo <https://github.com/alimanfoo>`_), Wellcome Sanger Institute
1515
* Jonathan Striebel (`@jstriebel <https://github.com/jstriebel>`_), Scalable Minds
@@ -462,7 +462,7 @@ This follows the
462462
representation is the ``UTF-8`` encoded Unicode string.
463463

464464
.. note::
465-
The prefix ``__zarr`` is reserved for core zarr data, and extensions
465+
The prefix ``__zarr`` is reserved for core zarr data, and extensions
466466
can use other files and folders starting with ``__``.
467467

468468

@@ -529,10 +529,10 @@ Core data types
529529
- 8-byte little endian
530530
* - ``float16`` (optionally supported)
531531
- IEEE 754 half-precision floating point: sign bit, 5 bits exponent, 10 bits mantissa
532-
- 2-byte little endian IEEE 754 binary16
532+
- 2-byte little endian IEEE 754 binary16
533533
* - ``float32``
534534
- IEEE 754 single-precision floating point: sign bit, 8 bits exponent, 23 bits mantissa
535-
- 4-byte little endian IEEE 754 binary32
535+
- 4-byte little endian IEEE 754 binary32
536536
* - ``float64``
537537
- IEEE 754 double-precision floating point: sign bit, 11 bits exponent, 52 bits mantissa
538538
- 8-byte little endian IEEE 754 binary64
@@ -647,17 +647,8 @@ that chunk, where "%" is the modulo operator. For example, if a
647647
is contained within the chunk at grid index (1, 7, 2) and has coordinates
648648
(2, 10, 100) within that chunk.
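The relationship between an element's coordinates, its chunk grid index, and its position inside the chunk can be sketched with floor division and the modulo operator. This is only an illustration, not part of the specification: the function name ``locate_element`` is hypothetical, and the element coordinates and chunk shape below are made-up values chosen so that the result matches the indices quoted above.

```python
def locate_element(coords, chunk_shape):
    """Return (grid_index, offset_in_chunk) for one array element.

    Assumes a regular chunk grid aligned with the array origin.
    """
    if len(coords) != len(chunk_shape):
        raise ValueError("coords and chunk_shape must have the same length")
    # Floor division selects the chunk; modulo gives the position inside it.
    grid_index = tuple(c // s for c, s in zip(coords, chunk_shape))
    offset = tuple(c % s for c, s in zip(coords, chunk_shape))
    return grid_index, offset


# Hypothetical element coordinates and chunk shape (not taken from the spec).
grid_index, offset = locate_element((12, 150, 500), (10, 20, 200))
print(grid_index)  # (1, 7, 2)
print(offset)      # (2, 10, 100)
```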
649649

650-
The identifier for chunk with grid index (``k``, ``j``, ``i``, ...) is
651-
formed by taking the initial prefix ``c``, and appending for each dimension:
652-
653-
- the ``separator`` character specified within the ``chunk_grid`` metadata object (see
654-
the section on `Array metadata`_ below), followed by,
655-
656-
- the ASCII decimal string representation of the chunk index within that dimension.
657-
658-
For example, in a 3 dimensional array, with a separator of ``/`` the identifier
659-
for the chunk at grid index (1, 23, 45) is the string "c/1/23/45". With a
660-
separator of ``.``, the identifier is the string "c.1.23.45".
650+
The store key corresponding to a given grid cell is determined based on the
651+
`chunk_key_encoding`_ member of the `Array metadata`_.
661652

662653
Note that this specification does not consider the case where the
663654
chunk grid and the array space are not aligned at the origin vertices
@@ -668,14 +659,6 @@ origin element of the array may occur at an arbitrary position within
668659
any chunk, which is required to allow arrays to be extended by an
669660
arbitrary length in a "negative" direction along any dimension.
670661

671-
.. note:: A main difference with spec v2 is that the default chunk separator
672-
changed from ``.`` to ``/``, as in N5. This decreases the maximum number of
673-
items in hierarchical stores like directory stores.
674-
675-
.. note:: Arrays may have 0 dimensions (when for example representing scalars),
676-
in which case the coordinate of a chunk is the empty tuple, and the chunk key
677-
will consist of the string ``c``.
678-
679662
.. note:: Chunks at the border of an array always have the full chunk size, even when
680663
the array only covers parts of it. For example, having an array with ``"shape": [30, 30]`` and
681664
``"chunk_shape": [16, 16]``, the chunk ``0,1`` would also contain unused values for the indices
@@ -863,7 +846,7 @@ mandatory names:
863846
if provided, its value must be one or a list of the data type identifiers
864847
defined in this specification or an extension. Fallback extension datatypes
865848
are specified as an object with ``name`` and (optionally) ``configuration``.
866-
849+
867850
If an implementation does not recognise the extension or specific data type,
868851
but a ``fallback`` is present, then the implementation may proceed using the
869852
first known ``fallback`` value as the data type. For fixed-sized data types,
@@ -883,10 +866,10 @@ mandatory names:
883866
as defined in this specification, then the value must be an object with the
884867
names ``name`` and ``configuration``. The value of ``name`` must be the
885868
string ``"regular"``, and the value of ``configuration`` an object with the
886-
names ``chunk_shape`` and ``separator``. ``chunk_shape`` must be an array of
869+
member ``chunk_shape``. ``chunk_shape`` must be an array of
887870
integers providing the lengths of the chunk along each dimension of the
888-
array. ``separator`` must be either ``"/"`` or ``"."``. For example,
889-
``{"type": "regular", "configuration": {"chunk_shape": [2, 5], "separator":"/"}}``
871+
array. For example,
872+
``{"name": "regular", "configuration": {"chunk_shape": [2, 5]}}``
890873
means a regular grid where the chunks have length 2 along the first
891874
dimension and length 5 along the second dimension.
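As a rough illustration of how a regular grid configuration partitions an array, the sketch below counts the chunks along each dimension. The helper name ``grid_shape`` and the array shape ``[7, 12]`` are hypothetical. Ceiling division is used on the assumption that a chunk only partly covered by the array still counts as a full chunk.

```python
def grid_shape(array_shape, chunk_shape):
    """Number of chunks along each dimension of a regular chunk grid."""
    if len(array_shape) != len(chunk_shape):
        raise ValueError("array_shape and chunk_shape must have the same length")
    if any(c <= 0 for c in chunk_shape):
        raise ValueError("chunk lengths must be positive")
    # -(-n // c) is ceiling division using integers only.
    return tuple(-(-n // c) for n, c in zip(array_shape, chunk_shape))


chunk_grid = {"name": "regular", "configuration": {"chunk_shape": [2, 5]}}
# Hypothetical array shape, chosen for illustration only.
print(grid_shape([7, 12], chunk_grid["configuration"]["chunk_shape"]))  # (4, 3)
```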
892875

@@ -895,6 +878,71 @@ mandatory names:
895878
must be a string referring to a v3 chunk grid specification. The
896879
``configuration`` is optional and defined by the extension.
897880

881+
``chunk_key_encoding``
882+
^^^^^^^^^^^^^^^^^^^^^^
883+
884+
The mapping from chunk grid cell coordinates to keys in the underlying
885+
store.
886+
887+
The value must be an object with required string member ``name``, specifying
888+
the encoding type, and optional member ``configuration`` specifying encoding
889+
type-dependent parameters; the ``configuration`` value must be an object if
890+
it is specified.
891+
892+
The following encodings are defined:
893+
894+
- ``default``
895+
896+
The ``configuration`` object may contain one optional member,
897+
``separator``, which must be either ``"/"`` or ``"."``. If not specified,
898+
``separator`` defaults to ``"/"``.
899+
900+
The key for a chunk with grid index (``k``, ``j``, ``i``, ...) is
901+
formed by taking the initial prefix ``c``, and appending for each dimension:
902+
903+
- the ``separator`` character, followed by,
904+
905+
- the ASCII decimal string representation of the chunk index within that dimension.
906+
907+
For example, in a 3 dimensional array, with a separator of ``/`` the identifier
908+
for the chunk at grid index (1, 23, 45) is the string ``"c/1/23/45"``. With a
909+
separator of ``.``, the identifier is the string ``"c.1.23.45"``.
910+
911+
.. note:: A main difference with spec v2 is that the default chunk separator
912+
changed from ``.`` to ``/``, as in N5. This decreases the maximum number of
913+
items per directory in hierarchical stores such as directory stores.
914+
915+
.. note:: Arrays may have 0 dimensions (for example, when representing scalars),
916+
in which case the coordinate of a chunk is the empty tuple, and the chunk key
917+
will consist of the string ``c``.
918+
919+
- ``v2``
920+
921+
The ``configuration`` object may contain one optional member,
922+
``separator``, which must be either ``"/"`` or ``"."``. If not specified,
923+
``separator`` defaults to ``"."``.
924+
925+
The identifier for a chunk with at least one dimension is formed by
926+
concatenating for each dimension:
927+
928+
- the ASCII decimal string representation of the chunk index within that
929+
dimension, followed by
930+
931+
- the ``separator`` character, except that it is omitted for the last
932+
dimension.
933+
934+
For example, in a 3 dimensional array, with a separator of ``.`` the identifier
935+
for the chunk at grid index (1, 23, 45) is the string ``"1.23.45"``. With a
936+
separator of ``/``, the identifier is the string ``"1/23/45"``.
937+
938+
For chunk grids with 0 dimensions, the single chunk has the key ``"0"``.
939+
940+
.. note::
941+
942+
This encoding is intended only to allow existing v2 arrays to be
943+
converted to v3 without having to rename chunks. It is not recommended
944+
to be used when writing new arrays.
945+
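The two encodings described above can be sketched as a single function. This is a minimal illustration under stated assumptions, not a reference implementation: the name ``encode_chunk_key`` is hypothetical, the grid index is assumed to be a tuple of non-negative integers, and any encoding name other than ``default`` or ``v2`` is simply rejected.

```python
def encode_chunk_key(grid_index, chunk_key_encoding):
    """Map a chunk grid index to a store key (hypothetical helper)."""
    name = chunk_key_encoding["name"]
    config = chunk_key_encoding.get("configuration", {})
    if name == "default":
        separator = config.get("separator", "/")
    elif name == "v2":
        separator = config.get("separator", ".")
    else:
        raise ValueError(f"unsupported chunk key encoding: {name!r}")
    if separator not in ("/", "."):
        raise ValueError(f"invalid separator: {separator!r}")

    parts = [str(i) for i in grid_index]
    if name == "default":
        # Prefix "c", then separator + index for every dimension.
        return "c" + "".join(separator + p for p in parts)
    # v2: indices joined by the separator; "0" for 0-dimensional grids.
    return separator.join(parts) if parts else "0"


print(encode_chunk_key((1, 23, 45), {"name": "default"}))  # c/1/23/45
print(encode_chunk_key((1, 23, 45), {"name": "v2"}))       # 1.23.45
print(encode_chunk_key((), {"name": "default"}))           # c
```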
898946
``fill_value``
899947
^^^^^^^^^^^^^^
900948

@@ -1006,8 +1054,13 @@ compressed using gzip compression prior to storage::
10061054
"chunk_grid": {
10071055
"name": "regular",
10081056
"configuration": {
1009-
"chunk_shape": [1000, 100],
1010-
"separator" : "/"
1057+
"chunk_shape": [1000, 100]
1058+
}
1059+
},
1060+
"chunk_key_encoding": {
1061+
"name": "default",
1062+
"configuration": {
1063+
"separator": "/"
10111064
}
10121065
},
10131066
"codecs": [{
@@ -1035,15 +1088,20 @@ above, but using a (currently made up) extension data type::
10351088
"data_type": {
10361089
"name": "datetime",
10371090
"configuration": {
1038-
"unit": "ns"
1091+
"unit": "ns"
10391092
},
10401093
"fallback": "int64"
10411094
},
10421095
"chunk_grid": {
10431096
"name": "regular",
10441097
"configuration": {
1045-
"chunk_shape": [1000, 100],
1046-
"separator" : "/"
1098+
"chunk_shape": [1000, 100]
1099+
}
1100+
},
1101+
"chunk_key_encoding": {
1102+
"name": "default",
1103+
"configuration": {
1104+
"separator": "/"
10471105
}
10481106
},
10491107
"codecs": [{
@@ -1056,14 +1114,14 @@ above, but using a (currently made up) extension data type::
10561114
}
10571115

10581116
.. note::
1059-
1117+
10601118
Comparison with zarr spec v2:
1061-
1119+
10621120
- ``dtype`` has been renamed to ``data_type``,
1063-
- ``chunks`` has been renamed to ``chunk_grid``,
1121+
- ``chunks`` has been replaced with ``chunk_grid``,
1122+
- ``dimension_separator`` has been replaced with ``chunk_key_encoding``,
10641123
- ``order`` has been replaced by the :ref:`transpose <transpose-codec-v1>` codec,
1065-
- the separate ``filters`` and ``compressor`` fields been combined into the single ``codecs`` field,
1066-
- ``zarr_format`` is now a string URL rather than a number.
1124+
- the separate ``filters`` and ``compressor`` fields have been combined into the single ``codecs`` field.
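To make the ``chunks`` and ``dimension_separator`` changes concrete, the sketch below translates just those two v2 fields into their v3 counterparts. The helper name ``convert_chunk_fields`` is hypothetical, and the conversion of the remaining fields is not attempted. It assumes that a v2 array without ``dimension_separator`` uses ``.``, and it selects the ``v2`` key encoding so that existing chunks would not need to be renamed.

```python
def convert_chunk_fields(v2_meta):
    """Translate the chunk-related fields of v2 array metadata (sketch)."""
    return {
        "chunk_grid": {
            "name": "regular",
            "configuration": {"chunk_shape": list(v2_meta["chunks"])},
        },
        "chunk_key_encoding": {
            "name": "v2",
            "configuration": {
                # Assumption: v2 arrays default to "." when the field is absent.
                "separator": v2_meta.get("dimension_separator", "."),
            },
        },
    }


# Hypothetical v2 metadata fragment, for illustration only.
print(convert_chunk_fields({"chunks": [1000, 100], "dimension_separator": "/"}))
```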
10671125

10681126

10691127
Group metadata
@@ -1551,12 +1609,13 @@ Extension points
15511609
Different types of extensions can exist and they can be grouped as follows:
15521610

15531611
=========== ======================= ================================================
1554-
level extension metadata
1612+
level extension metadata
15551613
=========== ======================= ================================================
1556-
array data type `data_type`_
1557-
array chunk grid `chunk_grid`_
1558-
array codecs `codecs`_
1559-
array storage transformer `storage_transformers (array)`_
1614+
array data type `data_type`_
1615+
array chunk grid `chunk_grid`_
1616+
array chunk key encoding `chunk_key_encoding`_
1617+
array codecs `codecs`_
1618+
array storage transformer `storage_transformers (array)`_
15601619
=========== ======================= ================================================
15611620

15621621
If such extension points are used by groups or arrays, they are required, except
