Commit aa21bfa

Merge branch 'main' of https://github.com/zarr-developers/zarr-extensions into zarr-python-dtypes
2 parents 89ea29d + a1c0ae5 commit aa21bfa

80 files changed: +2637 -6 lines changed

README.md

Lines changed: 4 additions & 6 deletions
@@ -1,8 +1,4 @@
1-
# zarr-extensions (PREVIEW)
2-
3-
Currently, this repository is in preview mode. It will become the registry for Zarr v3 extensions as soon as [the changes to the spec](https://github.com/zarr-developers/zarr-specs/pull/330) have been adopted.
4-
5-
---
1+
# zarr-extensions
62

73
This repository contains the specifications for Zarr extensions for [Zarr version 3](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html).
84

@@ -20,15 +16,17 @@ To register an extension, open a new PR with a new extension directory under the
2016

2117
Each extension MUST have a `README.md` file that describes the extension and its metadata specification.
2218
Extensions SHOULD have a `schema.json` file that contains the JSON schema for the metadata, if the README.md does not provide a link to an external schema.
19+
The JSON schema should be formatted with `npx prettier -w **/schema.json`.
2320
Please note that all extension documents will be licensed under the [Creative Commons Attribution 3.0 Unported License](https://creativecommons.org/licenses/by/3.0/).
2421
Only open a PR if you are willing to license your extension under this license.
2522

2623
The PR will be reviewed by the [Zarr steering council](https://github.com/orgs/zarr-developers/teams/steering-council).
2724
We aim to be very open about registering extensions.
2825
The review focuses largely on avoiding confusing extension names, preventing malicious activity, and maintaining the formal requirements of the extensions.
26+
We recommend opening a "draft PR" first if you still want to solicit feedback from others in the community. As soon as you convert it into a regular PR, the review will begin.
2927
Extension maintainers are responsible for their extensions.
3028
Updates to the extensions will also be reviewed by the steering council.
31-
29+
The steering council reserves the right to reassign extensions to other maintainers in case of prolonged inactivity or other reasons at its own discretion.
3230

3331
## Document conventions
3432

codecs/packbits/README.md

Lines changed: 119 additions & 0 deletions
@@ -0,0 +1,119 @@
1+
# packbits codec
2+
3+
Defines an `array -> bytes` codec that packs together values that are
4+
represented by a fixed number of bits that is not necessarily a multiple of 8.
5+
6+
## Codec name
7+
8+
The value of the `name` member in the codec object MUST be `packbits`.
9+
10+
## Configuration parameters
11+
12+
### `padding_encoding` (Optional)
13+
14+
Specifies how the number of padding bits is encoded, such that the number of
15+
decoded elements may be determined from the encoded representation alone.
16+
17+
Must be one of:
18+
- `"first_byte"`, indicating that the first byte specifies the number of padding
19+
bits that were added;
20+
- `"last_byte"`, indicating that the final byte specifies the number of padding
21+
bits that were added;
22+
- `"none"` (default), indicating that the number of padding bits is not encoded.
23+
In this case, the number of decoded elements cannot be determined from the
24+
encoded representation alone.
25+
26+
While zarr itself does not need to be able to recover the number of decoded
27+
elements from the encoded representation alone, because this information can be
28+
propagated from the metadata through any prior codecs in the chain, it may still
29+
be useful as an additional sanity check or for non-zarr uses of the codec.
30+
31+
A value of `"first_byte"` provides compatibility with the [numcodecs packbits
32+
codec](https://github.com/zarr-developers/numcodecs/blob/3c933cf19d4d84f2efc5f3a36926d8c569514a90/numcodecs/packbits.py#L7)
33+
defined for zarr v2 (which only supports `bool`).
34+
35+
### `first_bit` (Optional)
36+
37+
Specifies the index (starting from the least-significant bit) of the first bit
38+
to be encoded. If omitted, or specified as `null`, defaults to `0`.
39+
40+
### `last_bit` (Optional)
41+
42+
Specifies the index (starting from the least-significant bit) of the (inclusive)
43+
last bit to be encoded. If omitted, or specified as `null`, defaults to `N - 1`,
44+
where `N` is the total number of bits per component of the data type (specified
45+
below).
46+
47+
It is invalid for `last_bit` to be less than `first_bit`.
48+
49+
Note: for complex number data types, `first_bit` and `last_bit` apply to the
50+
real and imaginary coefficients separately.
51+
52+
## Format and algorithm
53+
54+
This is an `array -> bytes` codec.
55+
56+
### Encoding/decoding of individual array elements
57+
58+
Each element of the array is encoded as a fixed number of bits, `k`, where `k`
59+
is determined from the data type, `first_bit`, and `last_bit`. Specifically
60+
61+
```
62+
b := last_bit - first_bit + 1,
63+
k := num_components * b,
64+
```
65+
66+
where `num_components` is determined by the data type (2 for complex number data
67+
types, 1 for all other data types).
68+
69+
Note: If `first_bit` and `last_bit` are both unspecified, `b == N`.
70+
71+
Logically, to encode an element of the array, each component is first encoded as
72+
an `N`-bit value (retaining all bits). From this `N`-bit value, the `b` bits
73+
from `first_bit` to `last_bit` are then extracted.
74+
75+
To decode an element of the array, the `b` encoded bits are first shifted to
76+
`first_bit`. Depending on the data type, the value is then:
77+
78+
- sign-extended (for signed integer data types), or
79+
- zero-extended (all other data types)
80+
81+
up to `N` bits.
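
As an illustration of the per-element rules above, the following Python sketch extracts and restores the `b` bits of a single integer component. The function names `extract_bits` and `restore_bits` are hypothetical and not part of any Zarr library. Components are modeled as Python integers, so a floating-point component would first have to be reinterpreted as its `N`-bit pattern.

```python
from typing import Optional


def extract_bits(value: int, n: int, first_bit: int = 0,
                 last_bit: Optional[int] = None) -> int:
    """Return bits first_bit..last_bit (inclusive) of an N-bit component."""
    if last_bit is None:
        last_bit = n - 1
    if not 0 <= first_bit <= last_bit < n:
        raise ValueError("require 0 <= first_bit <= last_bit < N")
    b = last_bit - first_bit + 1
    raw = value & ((1 << n) - 1)  # N-bit two's complement bit pattern
    return (raw >> first_bit) & ((1 << b) - 1)


def restore_bits(encoded: int, n: int, first_bit: int = 0,
                 last_bit: Optional[int] = None, signed: bool = False) -> int:
    """Shift the b encoded bits back to first_bit and extend to N bits."""
    if last_bit is None:
        last_bit = n - 1
    if not 0 <= first_bit <= last_bit < n:
        raise ValueError("require 0 <= first_bit <= last_bit < N")
    b = last_bit - first_bit + 1
    raw = (encoded & ((1 << b) - 1)) << first_bit
    if signed and raw & (1 << last_bit):
        # Sign-extend: set every bit above last_bit, up to bit N - 1.
        raw |= ((1 << n) - 1) & ~((1 << (last_bit + 1)) - 1)
        # Reinterpret the N-bit pattern as a negative Python integer.
        raw -= 1 << n
    return raw  # unsigned types are zero-extended implicitly
```

Under these assumptions, keeping the low 4 bits of the `int8` value `-3` gives `extract_bits(-3, 8, 0, 3) == 13`, and `restore_bits(13, 8, 0, 3, signed=True)` recovers `-3`.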
82+
83+
### Encoding and decoding multiple elements
84+
85+
- Array elements are encoded in lexicographical order, to produce a bit
86+
sequence; element `i` corresponds to bits `[i * k, (i+1) * k)` within the
87+
sequence.
88+
- The bit sequence is padded with 0 bits to ensure its length is a multiple of
89+
8 bits.
90+
- Encoded byte `i` corresponds to bits `[i * 8, (i+1) * 8)` within the sequence.
91+
- If `padding_encoding` is `"first_byte"`, a single byte specifying
92+
the number of padding bits that were added is prepended to the encoded byte
93+
sequence.
94+
- If `padding_encoding` is `"last_byte"`, a single byte specifying the number of
95+
padding bits that were added is appended to the encoded byte sequence.
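
The steps above can be sketched in Python as follows. The names `pack_elements` and `unpack_elements` are hypothetical. The text does not state how the bit sequence maps onto the bits of a byte, so this sketch assumes least-significant-bit-first ordering both within each `k`-bit element and within each byte; an implementation using a different convention would produce different bytes.

```python
from typing import List, Optional, Sequence


def pack_elements(values: Sequence[int], k: int,
                  padding_encoding: str = "none") -> bytes:
    """Pack k-bit values (already in lexicographical order) into bytes."""
    bits = 0  # ASSUMPTION: bit j of the sequence is bit j of this integer
    for i, v in enumerate(values):
        bits |= (v & ((1 << k) - 1)) << (i * k)
    total_bits = len(values) * k
    padding = -total_bits % 8  # zero bits needed to reach a byte boundary
    # Little-endian conversion places sequence bits [8i, 8i + 8) in byte i.
    body = bits.to_bytes((total_bits + padding) // 8, "little")
    if padding_encoding == "first_byte":
        return bytes([padding]) + body
    if padding_encoding == "last_byte":
        return body + bytes([padding])
    if padding_encoding == "none":
        return body
    raise ValueError(f"unknown padding_encoding: {padding_encoding!r}")


def unpack_elements(data: bytes, k: int, count: Optional[int] = None,
                    padding_encoding: str = "none") -> List[int]:
    """Inverse of pack_elements; count is required when padding is 'none'."""
    if padding_encoding == "first_byte":
        padding, body = data[0], data[1:]
    elif padding_encoding == "last_byte":
        padding, body = data[-1], data[:-1]
    elif padding_encoding == "none":
        if count is None:
            raise ValueError("count must be supplied when padding is not encoded")
        padding, body = 0, data
    else:
        raise ValueError(f"unknown padding_encoding: {padding_encoding!r}")
    if count is None:
        count = (len(body) * 8 - padding) // k
    bits = int.from_bytes(body, "little")
    mask = (1 << k) - 1
    return [(bits >> (i * k)) & mask for i in range(count)]
```

With `padding_encoding` set to `"none"`, the caller has to supply `count`, which mirrors the remark that the number of decoded elements then comes from the array metadata rather than from the encoded bytes.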
96+
97+
## Supported data types
98+
99+
- bool (encoded as 1 bit)
100+
- int2, uint2 (encoded as 2 bits)
101+
- int4, uint4, float4_e2m1fn (encoded as 4 bits)
102+
- float6_e2m3fn, float6_e3m2fn (encoded as 6 bits)
103+
- complex_float4_e2m1fn (encoded as 2 4-bit components)
104+
- complex_float6_e2m3fn, complex_float6_e3m2fn (encoded as 2 6-bit components)
105+
- int8, uint8, int16, uint16, int32, uint32, int64, uint64, float32, float64,
106+
bfloat16, complex_float32, complex_float64, complex_bfloat16 (encoded using
107+
their little-endian representation as with the "bytes" codec)
108+
109+
Note: For the types in the last item above, if `first_bit` and `last_bit` are not used to limit the
110+
number of bits that are encoded, this codec does not provide any benefit over
111+
the `"bytes"` codec but is supported nonetheless for uniformity.
112+
113+
## Change log
114+
115+
No changes yet.
116+
117+
## Current maintainers
118+
119+
* Jeremy Maitin-Shepard ([@jbms](https://github.com/jbms)), Google

codecs/packbits/schema.json

Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
1+
{
2+
"$schema": "https://json-schema.org/draft/2020-12/schema",
3+
"type": "object",
4+
"properties": {
5+
"name": {
6+
"const": "packbits"
7+
},
8+
"configuration": {
9+
"type": "object",
10+
"properties": {
11+
"padding_encoding": {
12+
"enum": ["first_byte", "last_byte", "none"],
13+
"default": "none"
14+
},
15+
"first_bit": {
16+
"oneOf": [
17+
{
18+
"type": "integer",
19+
"minimum": 0
20+
},
21+
{ "const": null }
22+
]
23+
},
24+
"last_bit": {
25+
"oneOf": [
26+
{
27+
"type": "integer",
28+
"minimum": 0
29+
},
30+
{ "const": null }
31+
]
32+
}
33+
},
34+
"additionalProperties": false
35+
}
36+
},
37+
"required": ["name"],
38+
"additionalProperties": false
39+
}

codecs/vlen-bytes/README.md

Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
1+
# Vlen-bytes codec
2+
3+
Defines an `array -> bytes` codec that serializes variable-length byte string arrays.
4+
5+
## Codec name
6+
7+
The value of the `name` member in the codec object MUST be `vlen-bytes`.
8+
9+
## Configuration parameters
10+
11+
None.
12+
13+
## Example
14+
15+
For example, the array metadata below specifies that the array contains variable-length byte strings:
16+
17+
```json
18+
{
19+
"data_type": "bytes",
20+
"codecs": [{
21+
"name": "vlen-bytes"
22+
}]
23+
}
24+
```
25+
26+
## Format and algorithm
27+
28+
This is an `array -> bytes` codec.
29+
30+
This codec is only compatible with the [`"bytes"`](../../data-types/bytes/README.md) data type.
31+
32+
In the encoded format, each chunk is prefixed with a 32-bit little-endian unsigned integer (u32le) that specifies the number of elements in the chunk.
33+
This prefix is followed by a sequence of encoded elements in lexicographical order.
34+
Each element in the sequence is encoded by a u32le representing the number of bytes followed by the bytes themselves.
35+
36+
See https://numcodecs.readthedocs.io/en/stable/other/vlen.html#vlenbytes for details about the encoding.
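
The layout described above can be sketched in Python with the standard `struct` module. The names `encode_vlen_bytes` and `decode_vlen_bytes` are hypothetical, and the sketch assumes the chunk's elements have already been flattened into a list in lexicographical order.

```python
import struct
from typing import List, Sequence


def encode_vlen_bytes(items: Sequence[bytes]) -> bytes:
    """Encode items as: u32le count, then (u32le length, raw bytes) per item."""
    out = bytearray(struct.pack("<I", len(items)))
    for item in items:
        out += struct.pack("<I", len(item))
        out += item
    return bytes(out)


def decode_vlen_bytes(data: bytes) -> List[bytes]:
    """Decode a buffer produced by encode_vlen_bytes."""
    (count,) = struct.unpack_from("<I", data, 0)
    offset = 4
    items = []
    for _ in range(count):
        (length,) = struct.unpack_from("<I", data, offset)
        offset += 4
        items.append(bytes(data[offset:offset + length]))
        offset += length
    return items
```

An empty element is still represented by its 4-byte length prefix, so a chunk of `n` elements always occupies at least `4 + 4 * n` bytes.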
37+
38+
## Change log
39+
40+
No changes yet.
41+
42+
## Current maintainers
43+
44+
* [zarr-python core development team](https://github.com/orgs/zarr-developers/teams/python-core-devs)

codecs/vlen-bytes/schema.json

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
1+
{
2+
"$schema": "https://json-schema.org/draft/2020-12/schema",
3+
"oneOf": [
4+
{
5+
"type": "object",
6+
"properties": {
7+
"name": {
8+
"const": "vlen-bytes"
9+
},
10+
"configuration": {
11+
"type": "object",
12+
"additionalProperties": false
13+
}
14+
},
15+
"required": ["name"],
16+
"additionalProperties": false
17+
},
18+
{ "const": "vlen-bytes" }
19+
]
20+
}

codecs/vlen-utf8/README.md

Lines changed: 45 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,45 @@
1+
# Vlen-utf8 codec
2+
3+
Defines an `array -> bytes` codec that serializes variable-length UTF-8 string arrays.
4+
5+
## Codec name
6+
7+
The value of the `name` member in the codec object MUST be `vlen-utf8`.
8+
9+
## Configuration parameters
10+
11+
None.
12+
13+
## Example
14+
15+
For example, the array metadata below specifies that the array contains variable-length UTF-8 strings:
16+
17+
```json
18+
{
19+
"data_type": "string",
20+
"codecs": [{
21+
"name": "vlen-utf8"
22+
}]
23+
}
24+
```
25+
26+
## Format and algorithm
27+
28+
This is an `array -> bytes` codec.
29+
30+
This codec is only compatible with the [`"string"`](../../data-types/string/README.md) data type.
31+
32+
In the encoded format, each chunk is prefixed with a 32-bit little-endian unsigned integer (u32le) that specifies the number of elements in the chunk.
33+
This prefix is followed by a sequence of encoded elements in lexicographical order.
34+
Each element in the sequence is encoded by a u32le representing the number of bytes followed by the bytes themselves.
35+
The bytes for each element are obtained by encoding the element as UTF-8.
36+
37+
See https://numcodecs.readthedocs.io/en/stable/other/vlen.html#vlenutf8 for details about the encoding.
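
A minimal Python sketch of this layout follows. The names `encode_vlen_utf8` and `decode_vlen_utf8` are hypothetical, and the elements are assumed to be already flattened in lexicographical order. Each length prefix counts UTF-8 bytes, not characters.

```python
import struct
from typing import List, Sequence


def encode_vlen_utf8(items: Sequence[str]) -> bytes:
    """Encode strings as: u32le count, then (u32le byte length, UTF-8 bytes)."""
    out = bytearray(struct.pack("<I", len(items)))
    for item in items:
        raw = item.encode("utf-8")
        out += struct.pack("<I", len(raw))  # length in bytes, not characters
        out += raw
    return bytes(out)


def decode_vlen_utf8(data: bytes) -> List[str]:
    """Decode a buffer produced by encode_vlen_utf8."""
    (count,) = struct.unpack_from("<I", data, 0)
    offset = 4
    items = []
    for _ in range(count):
        (length,) = struct.unpack_from("<I", data, offset)
        offset += 4
        items.append(data[offset:offset + length].decode("utf-8"))
        offset += length
    return items
```

For instance, the one-character string `"é"` is stored with a length prefix of 2, because its UTF-8 encoding is the two bytes `0xC3 0xA9`.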
38+
39+
## Change log
40+
41+
No changes yet.
42+
43+
## Current maintainers
44+
45+
* [zarr-python core development team](https://github.com/orgs/zarr-developers/teams/python-core-devs)

codecs/vlen-utf8/schema.json

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
1+
{
2+
"$schema": "https://json-schema.org/draft/2020-12/schema",
3+
"oneOf": [
4+
{
5+
"type": "object",
6+
"properties": {
7+
"name": {
8+
"const": "vlen-utf8"
9+
},
10+
"configuration": {
11+
"type": "object",
12+
"additionalProperties": false
13+
}
14+
},
15+
"required": ["name"],
16+
"additionalProperties": false
17+
},
18+
{ "const": "vlen-utf8" }
19+
]
20+
}
