Commit 8870ab3

Add lower-precision integer and floating point data types, and packbits codec (#3)
* Add lower-precision int, float, complex data types, and packbits codec
* Add first_bit and last_bit options to packbits codec
1 parent 69f9b14 commit 8870ab3

60 files changed: +1845 -0 lines changed

codecs/packbits/README.md

Lines changed: 119 additions & 0 deletions
@@ -0,0 +1,119 @@
# packbits codec

Defines an `array -> bytes` codec that packs together values that are
represented by a fixed number of bits that is not necessarily a multiple of 8.

## Codec name

The value of the `name` member in the codec object MUST be `packbits`.

## Configuration parameters

### `padding_encoding` (Optional)

Specifies how the number of padding bits is encoded, such that the number of
decoded elements may be determined from the encoded representation alone.

Must be one of:
- `"first_byte"`, indicating that the first byte specifies the number of padding
  bits that were added;
- `"last_byte"`, indicating that the final byte specifies the number of padding
  bits that were added;
- `"none"` (default), indicating that the number of padding bits is not encoded.
  In this case, the number of decoded elements cannot be determined from the
  encoded representation alone.

While zarr itself does not need to be able to recover the number of decoded
elements from the encoded representation alone, because this information can be
propagated from the metadata through any prior codecs in the chain, it may still
be useful as an additional sanity check or for non-zarr uses of the codec.

A value of `"first_byte"` provides compatibility with the [numcodecs packbits
codec](https://github.com/zarr-developers/numcodecs/blob/3c933cf19d4d84f2efc5f3a36926d8c569514a90/numcodecs/packbits.py#L7)
defined for zarr v2 (which only supports `bool`).

### `first_bit` (Optional)

Specifies the index (starting from the least-significant bit) of the first bit
to be encoded. If omitted, or specified as `null`, defaults to `0`.

### `last_bit` (Optional)

Specifies the index (starting from the least-significant bit) of the (inclusive)
last bit to be encoded. If omitted, or specified as `null`, defaults to `N - 1`,
where `N` is the total number of bits per component of the data type (specified
below).

It is invalid for `last_bit` to be less than `first_bit`.

Note: for complex number data types, `first_bit` and `last_bit` apply to the
real and imaginary coefficients separately.

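As a rough illustration of how the three configuration parameters interact, the following Python sketch fills in the documented defaults and applies the validity rule for `last_bit`. The function name `normalize_packbits_config` and the argument `n` (the number of bits per component, `N`) are hypothetical, and the upper-bound check on `last_bit` is an assumption that the text does not state explicitly.

```python
def normalize_packbits_config(config, n):
    """Return a packbits configuration with defaults filled in.

    `config` is the codec's "configuration" object as a dict and `n` is the
    number of bits per component of the data type (hypothetical helper).
    """
    padding_encoding = config.get("padding_encoding", "none")
    if padding_encoding not in ("first_byte", "last_byte", "none"):
        raise ValueError(f"invalid padding_encoding: {padding_encoding!r}")

    # Omitted and null (None) values both fall back to the defaults.
    first_bit = config.get("first_bit")
    last_bit = config.get("last_bit")
    if first_bit is None:
        first_bit = 0
    if last_bit is None:
        last_bit = n - 1

    if last_bit < first_bit:
        raise ValueError("last_bit must not be less than first_bit")
    # Assumption: bit indices beyond the component width are rejected.
    if last_bit >= n:
        raise ValueError("last_bit must be less than the component bit width")

    return {
        "padding_encoding": padding_encoding,
        "first_bit": first_bit,
        "last_bit": last_bit,
    }
```

For an 8-bit component, an empty configuration normalizes to `padding_encoding` `"none"`, `first_bit` `0`, and `last_bit` `7`.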
## Format and algorithm

This is an `array -> bytes` codec.

### Encoding/decoding of individual array elements

Each element of the array is encoded as a fixed number of bits, `k`, where `k`
is determined from the data type, `first_bit`, and `last_bit`. Specifically

```
b := last_bit - first_bit + 1,
k := num_components * b,
```

where `num_components` is determined by the data type (2 for complex number data
types, 1 for all other data types).

Note: If `first_bit` and `last_bit` are both unspecified, `b == N`.

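The two formulas above can be checked with a few lines of Python. This is only a sketch: the helper name `bits_per_element` and its arguments are hypothetical, and `n` stands for the per-component bit width `N`.

```python
def bits_per_element(n, num_components=1, first_bit=None, last_bit=None):
    """Compute k, the number of encoded bits per array element."""
    # Defaults as documented: first_bit = 0, last_bit = N - 1.
    first_bit = 0 if first_bit is None else first_bit
    last_bit = n - 1 if last_bit is None else last_bit
    b = last_bit - first_bit + 1
    return num_components * b


# bool (N = 1, one component): k = 1
assert bits_per_element(1) == 1
# complex_float6_e2m3fn (N = 6, two components): k = 12
assert bits_per_element(6, num_components=2) == 12
# int16 keeping only bits 4..11: b = 8, k = 8
assert bits_per_element(16, first_bit=4, last_bit=11) == 8
```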
Logically, to encode an element of the array, each component is first encoded as
an `N`-bit value (retaining all bits). From this `N`-bit value, the `b` bits
from `first_bit` to `last_bit` are then extracted.

To decode an element of the array, the `b` encoded bits are first shifted to
`first_bit`. Depending on the data type, the value is then:

- sign-extended (for signed integer data types), or
- zero-extended (all other data types)

up to `N` bits.

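The extraction and extension steps above can be sketched in Python for a single component. The helper names are hypothetical. Each component is assumed to be passed as a non-negative integer holding its `N`-bit pattern (two's complement for signed integers), and the decoder returns the reconstructed `N`-bit pattern in the same form.

```python
def encode_component(value_bits, first_bit, last_bit):
    """Extract bits first_bit..last_bit (inclusive) from an N-bit pattern."""
    b = last_bit - first_bit + 1
    return (value_bits >> first_bit) & ((1 << b) - 1)


def decode_component(encoded, first_bit, last_bit, n, signed):
    """Rebuild an N-bit pattern from the b encoded bits."""
    # Shift the encoded bits back to their original position.
    pattern = encoded << first_bit
    if signed and (pattern >> last_bit) & 1:
        # Sign-extend: set every bit above last_bit, up to bit N - 1.
        all_ones = (1 << n) - 1
        low_mask = (1 << (last_bit + 1)) - 1
        pattern |= all_ones & ~low_mask
    # Otherwise the upper bits are already zero (zero extension).
    return pattern
```

For example, the `int8` value -3 has the bit pattern `0xFD`. Keeping bits 0 to 3 yields `0b1101`, and decoding that as a signed value restores `0xFD`. Bits below `first_bit` are not stored, so they decode as zero.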
### Encoding and decoding multiple elements

- Array elements are encoded in lexicographical order, to produce a bit
  sequence; element `i` corresponds to bits `[i * k, (i+1) * k)` within the
  sequence.
- The bit sequence is padded with 0 bits to ensure its length is a multiple of
  8 bits.
- Encoded byte `i` corresponds to bits `[i * 8, (i+1) * 8)` within the sequence.
- If `padding_encoding` is `"first_byte"`, a single byte specifying
  the number of padding bits that were added is prepended to the encoded byte
  sequence.
- If `padding_encoding` is `"last_byte"`, a single byte specifying the number of
  padding bits that were added is appended to the encoded byte sequence.

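The steps listed above can be sketched as a pair of Python helpers. The names `pack_elements` and `unpack_elements` are hypothetical. The text does not spell out the bit order within an element or within a byte, so this sketch assumes least-significant-bit-first ordering in both cases; a real implementation should confirm the ordering before relying on it.

```python
def pack_elements(values, k, padding_encoding="none"):
    """Pack k-bit element patterns (given in lexicographical order) into bytes."""
    mask = (1 << k) - 1
    acc = 0
    for i, value in enumerate(values):
        # Assumption: element i occupies bits [i*k, (i+1)*k), LSB first.
        acc |= (value & mask) << (i * k)
    total_bits = len(values) * k
    padding = (-total_bits) % 8  # number of zero bits appended
    body = acc.to_bytes((total_bits + padding) // 8, "little")
    if padding_encoding == "first_byte":
        return bytes([padding]) + body
    if padding_encoding == "last_byte":
        return body + bytes([padding])
    if padding_encoding == "none":
        return body
    raise ValueError(f"invalid padding_encoding: {padding_encoding!r}")


def unpack_elements(data, k, padding_encoding="none", count=None):
    """Inverse of pack_elements; `count` is required when no padding byte exists."""
    if padding_encoding == "first_byte":
        padding, body = data[0], data[1:]
    elif padding_encoding == "last_byte":
        padding, body = data[-1], data[:-1]
    else:
        padding, body = None, data
    if padding is not None:
        # The padding byte lets the element count be recovered from the data.
        count = (8 * len(body) - padding) // k
    elif count is None:
        raise ValueError("count is required when padding_encoding is 'none'")
    acc = int.from_bytes(body, "little")
    mask = (1 << k) - 1
    return [(acc >> (i * k)) & mask for i in range(count)]
```

Under these assumptions, five `bool` values `[1, 0, 1, 1, 0]` pack into the single byte `0x0d` with 3 padding bits, and `"first_byte"` produces `b"\x03\x0d"`.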
## Supported data types

- bool (encoded as 1 bit)
- int2, uint2 (encoded as 2 bits)
- int4, uint4, float4_e2m1fn (encoded as 4 bits)
- float6_e2m3fn, float6_e3m2fn (encoded as 6 bits)
- complex_float4_e2m1fn (encoded as 2 4-bit components)
- complex_float6_e2m3fn, complex_float6_e3m2fn (encoded as 2 6-bit components)
- int8, uint8, int16, uint16, int32, uint32, int64, uint64, float32, float64,
  bfloat16, complex_float32, complex_float64, complex_bfloat16 (encoded using
  their little-endian representation as with the "bytes" codec)

Note: For these types, if `first_bit` and `last_bit` are not used to limit the
number of bits that are encoded, this codec does not provide any benefit over
the `"bytes"` codec but is supported nonetheless for uniformity.

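To make the size trade-off in the note above concrete, the encoded size can be estimated from the element count and `k`. The helper name `encoded_size_bytes` is hypothetical; it follows the padding rules described in the format section of this document.

```python
def encoded_size_bytes(num_elements, k, padding_encoding="none"):
    """Number of bytes produced for num_elements elements of k bits each."""
    # Round the bit count up to a whole number of bytes (zero padding).
    size = (num_elements * k + 7) // 8
    # "first_byte" and "last_byte" each add one byte for the padding count.
    if padding_encoding in ("first_byte", "last_byte"):
        size += 1
    return size
```

For 1000 `int16` elements, keeping all 16 bits gives 2000 bytes, the same as the `"bytes"` codec, while limiting the range to 12 bits with `first_bit` and `last_bit` gives 1500 bytes.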
## Change log

No changes yet.

## Current maintainers

* Jeremy Maitin-Shepard ([@jbms](https://github.com/jbms)), Google

codecs/packbits/schema.json

Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "properties": {
    "name": {
      "const": "packbits"
    },
    "configuration": {
      "type": "object",
      "properties": {
        "padding_encoding": {
          "enum": ["first_byte", "last_byte", "none"],
          "default": "none"
        },
        "first_bit": {
          "oneOf": [
            {
              "type": "integer",
              "minimum": 0
            },
            { "const": null }
          ]
        },
        "last_bit": {
          "oneOf": [
            {
              "type": "integer",
              "minimum": 0
            },
            { "const": null }
          ]
        }
      },
      "additionalProperties": false
    }
  },
  "required": ["name"],
  "additionalProperties": false
}

data-types/bfloat16/README.md

Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
# bfloat16 data type

Defines the `bfloat16` floating-point data type
(https://en.wikipedia.org/wiki/Bfloat16_floating-point_format).

A `bfloat16` number is an IEEE 754 binary32 floating-point number truncated to
16 bits.

- 1 sign bit
- 8 exponent bits, with bias of 127
- 7 mantissa bits
- IEEE 754-compliant, with NaN and +/-inf.
- Subnormal numbers when biased exponent is 0.

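Because the format is the upper half of a binary32 value, a conversion can be sketched with Python's standard `struct` module. The function names below are hypothetical, and this sketch truncates the low 16 bits exactly as the description above suggests; practical converters often round to nearest instead, which can give a different result for some inputs.

```python
import struct


def float32_to_bfloat16_bits(x):
    """Return the 16-bit bfloat16 pattern obtained by truncating binary32."""
    (bits32,) = struct.unpack("<I", struct.pack("<f", x))
    return bits32 >> 16  # keep sign, 8 exponent bits, top 7 mantissa bits


def bfloat16_bits_to_float(bits16):
    """Expand a 16-bit bfloat16 pattern back to a Python float."""
    (x,) = struct.unpack("<f", struct.pack("<I", bits16 << 16))
    return x
```

With this sketch, `1.0` maps to the pattern `0x3F80` and `-2.0` maps to `0xC000`; both convert back exactly because their low mantissa bits are zero.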
## Data type name

The data type is specified as `"bfloat16"`.

## Configuration

None.

## Fill value representation

The fill value is specified in the same way as the core IEEE 754 floating point
numbers:
https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#fill-value

The constant `"NaN"` corresponds to a representation of `"0x7fc0"`.

## Codec compatibility

### bytes

Encoded as a 2-byte little-endian or big-endian value `0bSEEEEEEEEMMMMMMM`.

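The `0bSEEEEEEEEMMMMMMM` layout and the two byte orders can be illustrated with a short Python sketch. The helper names are hypothetical; the field widths come from the bit list at the top of this README.

```python
def split_bfloat16(bits16):
    """Split a 16-bit bfloat16 pattern into (sign, exponent, mantissa)."""
    sign = bits16 >> 15             # 1 bit
    exponent = (bits16 >> 7) & 0xFF  # 8 bits, biased by 127
    mantissa = bits16 & 0x7F         # 7 bits
    return sign, exponent, mantissa


def bfloat16_to_bytes(bits16, endian="little"):
    """Serialize the pattern as 2 bytes; endian is "little" or "big"."""
    return bits16.to_bytes(2, endian)
```

The `"NaN"` fill value pattern `0x7fc0` splits into sign `0`, exponent `255`, and mantissa `0x40`, which is a NaN because the exponent is all ones and the mantissa is non-zero. The pattern `0x3f80` (the value 1.0) serializes as `b"\x80\x3f"` in little-endian order and `b"\x3f\x80"` in big-endian order.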
## See also

A Python implementation is available at https://pypi.org/project/ml-dtypes/

## Current maintainers

* Jeremy Maitin-Shepard ([@jbms](https://github.com/jbms)), Google

data-types/bfloat16/schema.json

Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "oneOf": [
    {
      "type": "object",
      "properties": {
        "name": {
          "const": "bfloat16"
        },
        "configuration": {
          "type": "object",
          "additionalProperties": false
        }
      },
      "required": ["name"],
      "additionalProperties": false
    },
    {
      "const": "bfloat16"
    }
  ]
}
Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
# complex_bfloat16 data type

Defines a complex number data type where the real and imaginary components are
represented by the `bfloat16` data type.

## Data type name

The data type is specified as `"complex_bfloat16"`.

## Configuration

None.

## Fill value representation

The fill value is specified in the same way as the core complex number data types:
https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#fill-value

## Codec compatibility

### bytes

Encoded as 2 consecutive (real component followed by imaginary component) 2-byte
little-endian or big-endian values.

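A minimal Python sketch of this 4-byte layout follows. The function names are hypothetical, and the `bfloat16` conversion simply truncates a binary32 value to its upper 16 bits, which is an assumption made for brevity (rounding converters may produce different patterns for some inputs).

```python
import struct


def bfloat16_bits(x):
    """Truncate a binary32 value to its upper 16 bits (assumed conversion)."""
    (bits32,) = struct.unpack("<I", struct.pack("<f", x))
    return bits32 >> 16


def encode_complex_bfloat16(z, endian="little"):
    """Encode a complex value as real then imaginary, 2 bytes each."""
    real = bfloat16_bits(z.real).to_bytes(2, endian)
    imag = bfloat16_bits(z.imag).to_bytes(2, endian)
    return real + imag
```

For `complex(1.0, -2.0)` this yields `b"\x80\x3f\x00\xc0"` in little-endian order and `b"\x3f\x80\xc0\x00"` in big-endian order.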
## Current maintainers

* Jeremy Maitin-Shepard ([@jbms](https://github.com/jbms)), Google
Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "oneOf": [
    {
      "type": "object",
      "properties": {
        "name": {
          "const": "complex_bfloat16"
        },
        "configuration": {
          "type": "object",
          "additionalProperties": false
        }
      },
      "required": ["name"],
      "additionalProperties": false
    },
    {
      "const": "complex_bfloat16"
    }
  ]
}
Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
# complex_float16 data type

Defines a complex number data type where the real and imaginary components are
represented by the `float16` data type.

## Data type name

The data type is specified as `"complex_float16"`.

## Configuration

None.

## Fill value representation

The fill value is specified in the same way as the core complex number data types:
https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#fill-value

## Codec compatibility

### bytes

Encoded as 2 consecutive (real component followed by imaginary component) 2-byte
little-endian or big-endian values.

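This layout can be sketched with Python's standard `struct` module, whose `e` format character packs an IEEE 754 binary16 value. The function names are hypothetical, and the sketch assumes that `float16` here means IEEE 754 binary16.

```python
import struct


def encode_complex_float16(z, endian="little"):
    """Encode a complex value as real then imaginary binary16 values."""
    prefix = "<" if endian == "little" else ">"
    return struct.pack(prefix + "ee", z.real, z.imag)


def decode_complex_float16(data, endian="little"):
    """Decode 4 bytes produced by encode_complex_float16."""
    prefix = "<" if endian == "little" else ">"
    real, imag = struct.unpack(prefix + "ee", data)
    return complex(real, imag)
```

For `complex(1.0, -2.0)` the little-endian encoding is `b"\x00\x3c\x00\xc0"`, since 1.0 is `0x3C00` and -2.0 is `0xC000` in binary16.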
## Current maintainers

* Jeremy Maitin-Shepard ([@jbms](https://github.com/jbms)), Google
Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "oneOf": [
    {
      "type": "object",
      "properties": {
        "name": {
          "const": "complex_float16"
        },
        "configuration": {
          "type": "object",
          "additionalProperties": false
        }
      },
      "required": ["name"],
      "additionalProperties": false
    },
    {
      "const": "complex_float16"
    }
  ]
}
Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
# complex_float32 data type

Defines an alias of the core `complex64` data type, for consistency with the new
complex number data types for which `complexNNN` naming is not sufficient.

## Data type name

The data type is specified as `"complex_float32"`.

## Configuration

None.

## Current maintainers

* Jeremy Maitin-Shepard ([@jbms](https://github.com/jbms)), Google
Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "oneOf": [
    {
      "type": "object",
      "properties": {
        "name": {
          "const": "complex_float32"
        },
        "configuration": {
          "type": "object",
          "additionalProperties": false
        }
      },
      "required": ["name"],
      "additionalProperties": false
    },
    {
      "const": "complex_float32"
    }
  ]
}
