|
| 1 | +# packbits codec |
| 2 | + |
| 3 | +Defines an `array -> bytes` codec that packs together values that are |
| 4 | +represented by a fixed number of bits that is not necessarily a multiple of 8. |
| 5 | + |
| 6 | +## Codec name |
| 7 | + |
| 8 | +The value of the `name` member in the codec object MUST be `packbits`. |
| 9 | + |
| 10 | +## Configuration parameters |
| 11 | + |
| 12 | +### `padding_encoding` (Optional) |
| 13 | + |
| 14 | +Specifies how the number of padding bits is encoded, such that the number of |
| 15 | +decoded elements may be determined from the encoded representation alone. |
| 16 | + |
| 17 | +Must be one of: |
| 18 | +- `"first_byte"`, indicating that the first byte specifies the number of padding |
| 19 | + bits that were added; |
| 20 | +- `"last_byte"`, indicating that the final byte specifies the number of padding |
| 21 | + bits that were added; |
| 22 | +- `"none"` (default), indicating that the number of padding bits is not encoded. |
| 23 | + In this case, the number of decoded elements cannot be determined from the |
| 24 | + encoded representation alone. |
| 25 | + |
| 26 | +While zarr itself does not need to be able to recover the number of decoded |
| 27 | +elements from the encoded representation alone, because this information can be |
| 28 | +propagated from the metadata through any prior codecs in the chain, it may still |
| 29 | +be useful as an additional sanity check or for non-zarr uses of the codec. |
| 30 | + |
| 31 | +A value of `"first_byte"` provides compatibility with the [numcodecs packbits |
| 32 | +codec](https://github.com/zarr-developers/numcodecs/blob/3c933cf19d4d84f2efc5f3a36926d8c569514a90/numcodecs/packbits.py#L7) |
| 33 | +defined for zarr v2 (which only supports `bool`). |
| 34 | + |
| 35 | +### `first_bit` (Optional) |
| 36 | + |
| 37 | +Specifies the index (starting from the least-significant bit) of the first bit |
| 38 | +to be encoded. If omitted, or specified as `null`, defaults to `0`. |
| 39 | + |
| 40 | +### `last_bit` (Optional) |
| 41 | + |
| 42 | +Specifies the index (starting from the least-significant bit) of the (inclusive) |
| 43 | +last bit to be encoded. If omitted, or specified as `null`, defaults to `N - 1`, |
| 44 | +where `N` is the total number of bits per component of the data type (specified |
| 45 | +below). |
| 46 | + |
| 47 | +It is invalid for `last_bit` to be less than `first_bit`. |
| 48 | + |
| 49 | +Note: for complex number data types, `first_bit` and `last_bit` apply to the |
| 50 | +real and imaginary coefficients separately. |
| 51 | + |
| 52 | +## Format and algorithm |
| 53 | + |
| 54 | +This is an `array -> bytes` codec. |
| 55 | + |
| 56 | +### Encoding/decoding of individual array elements |
| 57 | + |
| 58 | +Each element of the array is encoded as a fixed number of bits, `k`, where `k` |
| 59 | +is determined from the data type, `first_bit`, and `last_bit`. Specifically: |
| 60 | + |
| 61 | +``` |
| 62 | +b := last_bit - first_bit + 1, |
| 63 | +k := num_components * b, |
| 64 | +``` |
| 65 | + |
| 66 | +where `num_components` is determined by the data type (2 for complex number data |
| 67 | +types, 1 for all other data types). |
| 68 | + |
| 69 | +Note: If `first_bit` and `last_bit` are both unspecified, `b == N`. |
| 70 | + |
| 71 | +Logically, to encode an element of the array, each component is first encoded as |
| 72 | +an `N`-bit value (retaining all bits). From this `N`-bit value, the `b` bits |
| 73 | +from `first_bit` to `last_bit` are then extracted. |
| 74 | + |
| 75 | +To decode an element of the array, the `b` encoded bits are first shifted left |
| 76 | +by `first_bit` positions. Depending on the data type, the value is then: |
| 77 | + |
| 78 | +- sign-extended (for signed integer data types), or |
| 79 | +- zero-extended (all other data types) |
| 80 | + |
| 81 | +up to `N` bits. |
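
The per-component rules above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the helper names `extract_bits` and `restore_bits` and their parameters are hypothetical, components are modelled as plain Python integers holding the `N`-bit pattern, and sign extension is read as extending from bit `last_bit` upward.

```python
from typing import Optional


def extract_bits(value: int, n_bits: int, first_bit: int = 0,
                 last_bit: Optional[int] = None) -> int:
    """Return the b bits from first_bit to last_bit of an N-bit component."""
    if last_bit is None:
        last_bit = n_bits - 1  # default stated in the spec
    if last_bit < first_bit:
        raise ValueError("last_bit must not be less than first_bit")
    b = last_bit - first_bit + 1
    raw = value & ((1 << n_bits) - 1)  # N-bit (two's-complement) pattern
    return (raw >> first_bit) & ((1 << b) - 1)


def restore_bits(encoded: int, n_bits: int, first_bit: int = 0,
                 last_bit: Optional[int] = None, signed: bool = False) -> int:
    """Shift b encoded bits back to first_bit, then sign- or zero-extend."""
    if last_bit is None:
        last_bit = n_bits - 1
    if last_bit < first_bit:
        raise ValueError("last_bit must not be less than first_bit")
    b = last_bit - first_bit + 1
    raw = (encoded & ((1 << b) - 1)) << first_bit
    if signed and (raw >> last_bit) & 1:
        # Assumption: the sign bit is bit last_bit; set every bit above it.
        raw |= ((1 << n_bits) - 1) & ~((1 << (last_bit + 1)) - 1)
        return raw - (1 << n_bits)  # reinterpret as a negative integer
    return raw
```

For example, the `int8` value `-3` with `first_bit = 0` and `last_bit = 3` keeps `b = 4` bits, and `restore_bits(extract_bits(-3, 8, 0, 3), 8, 0, 3, signed=True)` recovers `-3`. Bits below `first_bit` are not stored, so they come back as zero.
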
| 82 | + |
| 83 | +### Encoding and decoding multiple array elements |
| 84 | + |
| 85 | +- Array elements are encoded in lexicographical order, to produce a bit |
| 86 | + sequence; element `i` corresponds to bits `[i * k, (i+1) * k)` within the |
| 87 | + sequence. |
| 88 | +- The bit sequence is padded with 0 bits to ensure its length is a multiple of |
| 89 | + 8 bits. |
| 90 | +- Encoded byte `i` corresponds to bits `[i * 8, (i+1) * 8)` within the sequence. |
| 91 | +- If `padding_encoding` is `"first_byte"`, a single byte specifying |
| 92 | + the number of padding bits that were added is prepended to the encoded byte |
| 93 | + sequence. |
| 94 | +- If `padding_encoding` is `"last_byte"`, a single byte specifying the number of |
| 95 | + padding bits that were added is appended to the encoded byte sequence. |
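
The packing steps listed above can be sketched as follows. This is an illustration under stated assumptions rather than the codec's definitive behaviour: the names `pack_values`, `unpack_values`, `count`, and `msb_first` are hypothetical, each element is passed as an unsigned `k`-bit integer already in lexicographical order, the bits of an element are emitted least-significant first, and the mapping of sequence bits onto bit positions within a byte (which the text above does not fix) is selectable through `msb_first`.

```python
def pack_values(values, k, padding_encoding="none", msb_first=False):
    """Pack k-bit unsigned integers into bytes, padding with 0 bits."""
    if padding_encoding not in ("none", "first_byte", "last_byte"):
        raise ValueError("unknown padding_encoding")
    bits = []
    for v in values:
        # Assumption: bits of each element are emitted LSB first.
        bits.extend((v >> j) & 1 for j in range(k))
    padding = -len(bits) % 8  # number of 0 bits needed to reach a byte boundary
    bits.extend([0] * padding)
    out = bytearray()
    for i in range(0, len(bits), 8):
        byte = 0
        for j, bit in enumerate(bits[i:i + 8]):
            # Assumption: within-byte bit order is configurable here.
            byte |= bit << ((7 - j) if msb_first else j)
        out.append(byte)
    if padding_encoding == "first_byte":
        out.insert(0, padding)
    elif padding_encoding == "last_byte":
        out.append(padding)
    return bytes(out)


def unpack_values(data, k, padding_encoding="none", count=None, msb_first=False):
    """Inverse of pack_values; count is required when padding is not encoded."""
    data = bytes(data)
    padding = None
    if padding_encoding == "first_byte":
        padding, data = data[0], data[1:]
    elif padding_encoding == "last_byte":
        padding, data = data[-1], data[:-1]
    elif padding_encoding != "none":
        raise ValueError("unknown padding_encoding")
    bits = []
    for byte in data:
        order = range(7, -1, -1) if msb_first else range(8)
        bits.extend((byte >> j) & 1 for j in order)
    if padding is not None:
        n = (len(bits) - padding) // k
    elif count is not None:
        n = count
    else:
        raise ValueError("count is required when padding_encoding is 'none'")
    return [sum(bits[i * k + j] << j for j in range(k)) for i in range(n)]
```

Packing five `bool` values (`k = 1`) produces one data byte with three padding bits, so `"first_byte"` and `"last_byte"` each yield two bytes in total. With `"none"`, the decoder has to be given the element count, which matches the remark above that the count cannot be recovered from the encoded bytes alone.
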
| 96 | + |
| 97 | +## Supported data types |
| 98 | + |
| 99 | +- bool (encoded as 1 bit) |
| 100 | +- int2, uint2 (encoded as 2 bits) |
| 101 | +- int4, uint4, float4_e2m1fn (encoded as 4 bits) |
| 102 | +- float6_e2m3fn, float6_e3m2fn (encoded as 6 bits) |
| 103 | +- complex_float4_e2m1fn (encoded as two 4-bit components) |
| 104 | +- complex_float6_e2m3fn, complex_float6_e3m2fn (encoded as two 6-bit components) |
| 105 | +- int8, uint8, int16, uint16, int32, uint32, int64, uint64, float32, float64, |
| 106 | + bfloat16, complex_float32, complex_float64, complex_bfloat16 (encoded using |
| 107 | + their little-endian representation as with the "bytes" codec) |
| 108 | + |
| 109 | + Note: For these types, if `first_bit` and `last_bit` are not used to limit the |
| 110 | + number of bits that are encoded, this codec does not provide any benefit over |
| 111 | + the `"bytes"` codec but is supported nonetheless for uniformity. |
| 112 | + |
| 113 | +## Change log |
| 114 | + |
| 115 | +No changes yet. |
| 116 | + |
| 117 | +## Current maintainers |
| 118 | + |
| 119 | +* Jeremy Maitin-Shepard ([@jbms](https://github.com/jbms)), Google |