Status: Proposed Date: 2026-02-08 Parent: ADR-017 Temporal Tensor Compression, ADR-018 Block-Based Storage Engine Author: System Architecture Team
Note: Tiered quantization formats are now implemented in the rvf-quant crate as part of ADR-029 (RVF). See the RVF temperature-tiering specification (docs/research/rvf/spec/03-temperature-tiering.md).
| Version | Date | Author | Changes |
|---|---|---|---|
| 0.1 | 2026-02-08 | Architecture Team | Initial proposal |
This ADR defines the concrete quantization formats, bit-packing layouts, and codec interfaces for the five tiers of tensor storage established in ADR-017. Where ADR-017 introduced the concept of access-frequency-driven quantization and temporal scale reuse, this document specifies the exact byte-level formats for 8-bit (Tier 1 / Hot), 7-bit and 5-bit (Tier 2 / Warm), 3-bit (Tier 3 / Cold), and Compression-to-Zero (Tier 0 / Absent). It also resolves two open design questions from ADR-017: whether 5-bit quantization is permitted within the warm tier, and how Tier 0 reads behave when no reconstruction policy exists.
The codec_bits module provides a single allocation-free bit packer/unpacker that
all sub-byte formats share. The quant module provides per-format quantize and
dequantize functions, with SIMD-accelerated max_abs on native targets and a
portable fallback for WASM. Rust trait interfaces are defined so that new bit widths
can be added without modifying the core codec.
ADR-017 established the tiered compression architecture and segment binary format but left the per-tier quantization details at the algorithmic level. Implementers need exact byte layouts to write interoperable encoders and decoders, particularly for the sub-byte formats (7-bit, 5-bit, 3-bit) where values do not align on byte boundaries.
Standard 8-bit quantization maps trivially to [u8] storage. Sub-byte formats
require a bit-packing codec that can write and read arbitrary-width codes into a
byte stream without wasting bits. The codec must:
- Handle bit widths 3, 5, and 7 (with 8 as a degenerate identity case).
- Operate without heap allocations (caller provides output slice).
- Be deterministic and platform-independent (little-endian byte order).
- Support WASM targets where SIMD is optional.
At 3 bits per value, the quantization range is [-3, +3] (qmax = 3). Large
outliers in the tensor distribution can cause severe clamping. ADR-017 noted this
risk but did not specify a mitigation. This ADR introduces a two-level scale
option for Tier 3 that uses a 1-bit flag per value to select between a primary
scale (covering the majority of values) and a secondary scale (covering outliers),
while keeping the packed format compact.
ADR-017 listed Compression-to-Zero as a future possibility. This ADR formalizes
it: Tier 0 stores no quantized data at all. Only metadata and an optional
reconstruct_policy survive. This enables aggressive memory reclamation for
tensors that are no longer accessed but may be reconstructable from other sources
(deltas, factors, or recomputation).
| Question | Resolution |
|---|---|
| Allow 5-bit within warm tier? | Yes. Dynamic downgrade from 7-bit to 5-bit when warm set exceeds a configurable byte cap (warm_byte_cap). |
| Tier 0 read semantics? | Return zeros by default. If a reconstruct_policy (Delta or Factor) exists, reconstruct from stored representation. |
We adopt the following five-tier quantization format hierarchy, each with a well-defined byte layout, packing strategy, and error budget:
| Tier | Name | Bits | Compression vs f32 | Use Case |
|---|---|---|---|---|
| 1 | Hot | 8 | 4.00x | Active tensors, full fidelity |
| 2a | Warm | 7 | 4.57x | Default warm, near-lossless |
| 2b | Warm-aggressive | 5 | 6.40x | Warm set exceeds warm_byte_cap |
| 3 | Cold | 3 | 10.67x | Archived tensors, bounded error |
| 0 | Absent | 0 | Infinite | No data stored; metadata only |
All sub-byte formats share the codec_bits packer. All quantization formats use
symmetric per-block quantization with scale = max_abs / qmax stored as f32 per
block. The choice of f32 (rather than f16 as in ADR-017 segment headers) is
deliberate at this layer: the segment encoder may convert to f16 for storage, but
the quantizer operates in f32 for precision during the quantize/dequantize path.
Algorithm: Symmetric per-block quantization.
Given: block of N f32 values, block_size typically 64 or 128
scale = max_abs(values) / 127
q[i] = round(values[i] / scale)
q[i] = clamp(q[i], -127, +127) // i8 range
store: q as [i8; N] + scale as f32
Storage layout (one block, block_size = 8 for illustration):
Byte offset: 0 1 2 3 4 5 6 7 8 9 10 11
[ scale (f32, LE) ] [q0] [q1] [q2] [q3] [q4] [q5] [q6] [q7]
~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
4 bytes 8 bytes (1 byte per i8 value)
Total per block: 4 + block_size bytes
Effective compression (block_size = 64):
raw = 64 * 4 = 256 bytes
quant = 4 + 64 * 1 = 68 bytes
ratio = 256 / 68 = 3.76x (single block)
With temporal amortization (100 frames sharing scales): 256*100 / (4 + 64*100) = 4.00x.
Dequantize:
values[i] = q[i] as f32 * scale
Error bound: max_error = scale / (2 * 127). See Section 3.7 for full analysis.
Algorithm: Symmetric per-block, 7-bit codes packed into a bitstream.
Given: block of N f32 values
scale = max_abs(values) / 63 // qmax = 2^(7-1) - 1 = 63
q[i] = round(values[i] / scale)
q[i] = clamp(q[i], -63, +63)
u[i] = q[i] + 63 // bias to unsigned [0, 126], fits 7 bits
pack u[i] values using codec_bits at width=7
Bit-packing layout (8 values packed into 7 bytes):
Values: u0 u1 u2 u3 u4 u5 u6 u7
Bits: [6..0] [6..0] [6..0] [6..0] [6..0] [6..0] [6..0] [6..0]
7 bits 7 bits 7 bits 7 bits 7 bits 7 bits 7 bits 7 bits
Packed into 7 bytes (56 bits = 8 * 7 bits):
Byte 0: [u0[6:0] | u1[0] ] = u0(7) + u1(1) = 8 bits
|<--- 7 bits --->|<1>|
Byte 1: [u1[6:1] | u2[1:0]] = u1(6) + u2(2) = 8 bits
|<--- 6 bits --->|<-2->|
Byte 2: [u2[6:2] | u3[2:0] ] = u2(5) + u3(3) = 8 bits
|<-- 5 bits -->|<--3-->|
Byte 3: [u3[6:3] | u4[3:0] ] = u3(4) + u4(4) = 8 bits
|<- 4 bits ->|<--4--->|
Byte 4: [u4[6:4] | u5[4:0] ] = u4(3) + u5(5) = 8 bits
|<-3->|<---- 5 bits ---->|
Byte 5: [u5[6:5] | u6[5:0] ] = u5(2) + u6(6) = 8 bits
|<2>|<----- 6 bits ------>|
Byte 6: [u6[6] | u7[6:0] ] = u6(1) + u7(7) = 8 bits
|1|<------- 7 bits ------->|
Total: 7 bytes for 8 values = 0.875 bytes/value
Storage per block (block_size = 64):
scale: 4 bytes (f32)
data: ceil(64 * 7 / 8) = 56 bytes
total: 60 bytes
ratio: 256 / 60 = 4.27x
Algorithm: Symmetric per-block, 5-bit codes.
Given: block of N f32 values
scale = max_abs(values) / 15 // qmax = 2^(5-1) - 1 = 15
q[i] = round(values[i] / scale)
q[i] = clamp(q[i], -15, +15)
u[i] = q[i] + 15 // bias to unsigned [0, 30], fits 5 bits
pack u[i] values using codec_bits at width=5
Activation policy: 5-bit is used instead of 7-bit when the total warm set
size exceeds warm_byte_cap (default: 64 MiB). The tier policy monitors
aggregate warm storage and downgrades from 7-bit to 5-bit for the least recently
accessed warm tensors until the cap is satisfied.
Bit-packing layout (8 values packed into 5 bytes):
Values: u0 u1 u2 u3 u4 u5 u6 u7
Bits: [4..0] [4..0] [4..0] [4..0] [4..0] [4..0] [4..0] [4..0]
5 bits 5 bits 5 bits 5 bits 5 bits 5 bits 5 bits 5 bits
Packed into 5 bytes (40 bits = 8 * 5 bits):
Byte 0: [u0[4:0] | u1[2:0] ] = u0(5) + u1(3) = 8 bits
|<-- 5 bits -->|<--3-->|
Byte 1: [u1[4:3] | u2[4:0] | u3[0]] = u1(2) + u2(5) + u3(1) = 8 bits
|<2>|<-- 5 bits -->|<1>|
Byte 2: [u3[4:1] | u4[3:0] ] = u3(4) + u4(4) = 8 bits
|<-- 4 bits -->|<--4-->|
Byte 3: [u4[4] | u5[4:0] | u6[1:0]] = u4(1) + u5(5) + u6(2) = 8 bits
|1|<-- 5 bits -->|<-2->|
Byte 4: [u6[4:2] | u7[4:0] ] = u6(3) + u7(5) = 8 bits
|<-3->|<--- 5 bits --->|
Total: 5 bytes for 8 values = 0.625 bytes/value
Storage per block (block_size = 64):
scale: 4 bytes (f32)
data: ceil(64 * 5 / 8) = 40 bytes
total: 44 bytes
ratio: 256 / 44 = 5.82x
Algorithm: Symmetric per-block, 3-bit codes with optional two-level scale.
Given: block of N f32 values
scale = max_abs(values) / 3 // qmax = 2^(3-1) - 1 = 3
q[i] = round(values[i] / scale)
q[i] = clamp(q[i], -3, +3)
u[i] = q[i] + 3 // bias to unsigned [0, 6], fits 3 bits
pack u[i] values using codec_bits at width=3
When the value distribution has outliers (values significantly larger than the bulk of the distribution), a single scale wastes most of the 3-bit range on the long tail. The two-level scale splits the range:
Given: block of N f32 values, outlier_fraction (default: 0.05)
sorted_abs = sort(|values|, descending)
outlier_count = ceil(N * outlier_fraction)
primary_max = sorted_abs[outlier_count] // excludes top 5%
secondary_max = sorted_abs[0] // full range
primary_scale = primary_max / 3 // covers bulk values
secondary_scale = secondary_max / 3 // covers outliers
For each value[i]:
if |value[i]| > primary_max:
flag[i] = 1 // use secondary scale
q[i] = round(value[i] / secondary_scale)
else:
flag[i] = 0 // use primary scale
q[i] = round(value[i] / primary_scale)
q[i] = clamp(q[i], -3, +3)
u[i] = q[i] + 3
store: primary_scale (f32) + secondary_scale (f32) + flag bits + packed codes
Bit-packing layout (8 values packed into 3 bytes):
Values: u0 u1 u2 u3 u4 u5 u6 u7
Bits: [2..0] [2..0] [2..0] [2..0] [2..0] [2..0] [2..0] [2..0]
3 bits 3 bits 3 bits 3 bits 3 bits 3 bits 3 bits 3 bits
Packed into 3 bytes (24 bits = 8 * 3 bits):
Byte 0: [u0[2:0] | u1[2:0] | u2[1:0] ] = u0(3) + u1(3) + u2(2) = 8 bits
|<-3->|<-3->|<2>|
Byte 1: [u2[2] | u3[2:0] | u4[2:0] | u5[0]] = u2(1) + u3(3) + u4(3) + u5(1) = 8 bits
|1|<-3->|<-3->|1|
Byte 2: [u5[2:1] | u6[2:0] | u7[2:0] ] = u5(2) + u6(3) + u7(3) = 8 bits
|<2>|<-3->|<-3->|
Total: 3 bytes for 8 values = 0.375 bytes/value
Two-level scale storage layout (one block, block_size = 64):
Byte offset: 0 3 7 8 9 ... 15 16 ...
[primary_scale f32] [secondary_scale f32] [flag bytes ] [packed codes]
~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~ ~~~~~~~~~~~~~
4 bytes 4 bytes ceil(64/8)=8 ceil(64*3/8)=24
Total per block (two-level): 4 + 4 + 8 + 24 = 40 bytes
Total per block (standard): 4 + 24 = 28 bytes
ratio (standard): 256 / 28 = 9.14x
ratio (two-level): 256 / 40 = 6.40x
The two-level mode trades compression ratio for outlier fidelity. It is selected
automatically when the ratio max_abs / median_abs exceeds a configurable
threshold (default: 5.0), indicating a heavy-tailed distribution.
Algorithm: No quantized data is stored.
Tier 0 representation:
metadata: TensorMeta (id, shape, dtype, timestamps)
reconstruct_policy: Option<ReconstructPolicy>
quantized_data: None
enum ReconstructPolicy {
None, // reads return zeros
Delta { base_id: TensorId, delta: ... }, // reconstruct as base + delta
Factor { source_id: TensorId, ... }, // reconstruct via transformation
}
Read semantics:
reconstruct_policy |
Behavior |
|---|---|
None |
Return a zero-filled tensor of the recorded shape. Fast-fail mode returns Err(TierZeroNoPolicy) instead. |
Delta |
Load base tensor, apply stored delta. May trigger recursive decompression if base is also tiered. |
Factor |
Load source tensor, apply stored transformation (scale, permutation, projection). |
Transition to Tier 0: A tensor is eligible for Tier 0 when its tier score
drops below absent_min_score (default: 1) and it has not been accessed for
longer than absent_age_threshold (default: 24 hours). The transition is
irreversible without external data: once quantized data is discarded, only the
reconstruction policy (if any) can recover approximate values.
The core packing and unpacking functions shared by all sub-byte formats.
/// Errors from bit codec operations.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum CodecErr {
/// Output buffer too small. Contains the required size in bytes.
OutputTooSmall { required: usize },
/// Input buffer too small for the declared number of values.
InputTooSmall { required: usize },
/// Bit width must be in [1, 8].
InvalidBitWidth { bits: u8 },
}
/// Pack `values.len()` signed codes into `out`, using `bits` bits per code.
///
/// Each value in `values` is treated as a signed integer in `[-(2^(bits-1)-1), 2^(bits-1)-1]`.
/// It is biased to unsigned before packing: `u = v + (2^(bits-1) - 1)`.
///
/// Returns the number of bytes written to `out`.
///
/// # Errors
/// - `CodecErr::OutputTooSmall` if `out` cannot hold the packed data.
/// - `CodecErr::InvalidBitWidth` if `bits` is 0 or greater than 8.
pub fn pack_bits(values: &[i8], bits: u8, out: &mut [u8]) -> Result<usize, CodecErr> {
if bits == 0 || bits > 8 {
return Err(CodecErr::InvalidBitWidth { bits });
}
let total_bits = values.len() as u64 * bits as u64;
let required = ((total_bits + 7) / 8) as usize;
if out.len() < required {
return Err(CodecErr::OutputTooSmall { required });
}
let qmax = (1i8 << (bits - 1)) - 1; // bias offset
let mask: u64 = (1u64 << bits) - 1;
let mut acc: u64 = 0;
let mut acc_bits: u32 = 0;
let mut pos: usize = 0;
for &v in values {
let u = (v as i16 + qmax as i16) as u64 & mask;
acc |= u << acc_bits;
acc_bits += bits as u32;
while acc_bits >= 8 {
out[pos] = (acc & 0xFF) as u8;
pos += 1;
acc >>= 8;
acc_bits -= 8;
}
}
// Flush remaining bits
if acc_bits > 0 {
out[pos] = (acc & 0xFF) as u8;
pos += 1;
}
Ok(pos)
}
/// Unpack codes from `inp` into `out`, reading `bits` bits per code.
///
/// Reads exactly `out.len()` values. Each unsigned code is unbiased back to signed:
/// `v = u - (2^(bits-1) - 1)`.
///
/// Returns the number of bytes consumed from `inp`.
///
/// # Errors
/// - `CodecErr::InputTooSmall` if `inp` does not contain enough data.
/// - `CodecErr::InvalidBitWidth` if `bits` is 0 or greater than 8.
pub fn unpack_bits(inp: &[u8], bits: u8, out: &mut [i8]) -> Result<usize, CodecErr> {
if bits == 0 || bits > 8 {
return Err(CodecErr::InvalidBitWidth { bits });
}
let total_bits = out.len() as u64 * bits as u64;
let required = ((total_bits + 7) / 8) as usize;
if inp.len() < required {
return Err(CodecErr::InputTooSmall { required });
}
let qmax = (1i8 << (bits - 1)) - 1;
let mask: u64 = (1u64 << bits) - 1;
let mut acc: u64 = 0;
let mut acc_bits: u32 = 0;
let mut byte_pos: usize = 0;
let mut val_pos: usize = 0;
while val_pos < out.len() {
while acc_bits < bits as u32 {
acc |= (inp[byte_pos] as u64) << acc_bits;
acc_bits += 8;
byte_pos += 1;
}
let u = (acc & mask) as i16;
out[val_pos] = (u - qmax as i16) as i8;
acc >>= bits;
acc_bits -= bits as u32;
val_pos += 1;
}
Ok(required)
}Properties:
- No heap allocations. Callers provide both input and output slices.
- Single bit writer / bit reader using a 64-bit accumulator.
- Deterministic little-endian byte order.
- The
pack_bits/unpack_bitspair is its own inverse:unpack(pack(v)) == vfor all valid inputs.
/// Block-level quantization configuration.
pub struct QuantConfig {
pub block_size: usize, // elements per quantization block (default: 64)
pub two_level_threshold: f32, // max/median ratio to trigger two-level (default: 5.0)
}
/// Quantized block result.
pub struct QuantizedBlock {
pub scale: f32,
pub secondary_scale: Option<f32>, // only for two-level 3-bit
pub flags: Option<Vec<u8>>, // 1-bit-per-value flags for two-level
pub codes: Vec<i8>, // signed quantized codes
pub bits: u8,
}
/// Symmetric 8-bit quantization (Tier 1 - Hot).
///
/// Quantizes each block of `block_size` values independently.
/// scale = max_abs(block) / 127
/// q[i] = clamp(round(x[i] / scale), -127, 127)
pub fn quantize_s8(
values: &[f32],
config: &QuantConfig,
) -> Vec<QuantizedBlock>;
/// Symmetric N-bit quantization (Tier 2/3 - Warm/Cold).
///
/// `bits` must be one of: 7, 5, 3.
/// qmax = 2^(bits-1) - 1
/// scale = max_abs(block) / qmax
/// q[i] = clamp(round(x[i] / scale), -qmax, qmax)
///
/// For bits=3 and config.two_level_threshold exceeded: uses two-level scale.
pub fn quantize_bits(
values: &[f32],
bits: u8,
config: &QuantConfig,
) -> Vec<QuantizedBlock>;
/// Dequantize a block back to f32 values.
///
/// For standard mode: x'[i] = codes[i] as f32 * scale
/// For two-level mode: x'[i] = codes[i] as f32 * (if flags[i] then secondary_scale else scale)
pub fn dequantize(block: &QuantizedBlock) -> Vec<f32>;
/// Compute the maximum absolute value across a slice.
///
/// On native targets with `target_feature = "avx2"` or `target_feature = "neon"`:
/// uses SIMD intrinsics for 4-8x throughput.
/// On WASM with `target_feature = "simd128"` (optional):
/// uses wasm_simd128 intrinsics.
/// Fallback: portable scalar loop.
#[inline]
pub fn max_abs(values: &[f32]) -> f32;SIMD implementation sketch for max_abs (AVX2):
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn max_abs_avx2(values: &[f32]) -> f32 {
use std::arch::x86_64::*;
let sign_mask = _mm256_set1_ps(f32::from_bits(0x7FFF_FFFF)); // abs mask
let mut vmax = _mm256_setzero_ps();
let chunks = values.len() / 8;
for i in 0..chunks {
let v = _mm256_loadu_ps(values.as_ptr().add(i * 8));
let abs_v = _mm256_and_ps(v, sign_mask);
vmax = _mm256_max_ps(vmax, abs_v);
}
// Horizontal max reduction
let hi128 = _mm256_extractf128_ps(vmax, 1);
let lo128 = _mm256_castps256_ps128(vmax);
let max128 = _mm_max_ps(hi128, lo128);
let shuf = _mm_movehdup_ps(max128);
let max64 = _mm_max_ps(max128, shuf);
let shuf2 = _mm_movehl_ps(max64, max64);
let max32 = _mm_max_ss(max64, shuf2);
let mut result = _mm_cvtss_f32(max32);
// Handle remainder
for i in (chunks * 8)..values.len() {
result = result.max(values[i].abs());
}
result
}WASM portable fallback:
#[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64")))]
pub fn max_abs(values: &[f32]) -> f32 {
let mut m: f32 = 0.0;
for &v in values {
let a = v.abs();
if a > m {
m = a;
}
}
m
}When WASM SIMD is enabled via target_feature = "simd128", a vectorized path
processes 4 f32 values per iteration using v128 types. This is optional and
gated behind a cargo feature flag wasm-simd.
For symmetric quantization with bit width B, block scale s, and qmax = 2^(B-1) - 1:
quantization_step = s / qmax
max_element_error = quantization_step / 2 (from rounding)
max_relative_error = 1 / (2 * qmax) (per element, worst case)
rms_error = quantization_step / sqrt(12) (uniform quantization noise)
Per-tier error bounds:
| Tier | Bits | qmax | Max Rel. Error | RMS Rel. Error | Max Abs. Error (scale=1.0) |
|---|---|---|---|---|---|
| Hot (8-bit) | 8 | 127 | 0.394% | 0.228% | 0.00394 |
| Warm (7-bit) | 7 | 63 | 0.794% | 0.458% | 0.00794 |
| Warm-agg (5-bit) | 5 | 15 | 3.333% | 1.925% | 0.03333 |
| Cold (3-bit, std) | 3 | 3 | 16.667% | 9.623% | 0.16667 |
| Cold (3-bit, 2-level) | 3 | 3 | 16.667% per scale | 9.623% | Reduced for bulk values |
Two-level scale improvement for 3-bit: When 95% of values fall within
primary_max and outliers use secondary_scale:
| Component | Fraction | Scale | Effective Max Error |
|---|---|---|---|
| Bulk values (95%) | 0.95 | primary_scale (smaller) | 16.7% of primary_max |
| Outlier values (5%) | 0.05 | secondary_scale (larger) | 16.7% of secondary_max |
The bulk values achieve much lower absolute error because primary_scale is
typically 3-10x smaller than the single-scale scale. The outliers retain the
same relative error but are fewer in number.
Drift compounding: When drift tolerance is d (e.g., 10%), and a frame is
quantized with scales from an earlier frame, the effective max relative error
becomes (1 + d) / (2 * qmax). For 8-bit with 10% drift: 1.1 / 254 = 0.433%.
Cumulative error table with drift:
| Tier | Bits | No Drift | 10% Drift | 20% Drift |
|---|---|---|---|---|
| Hot | 8 | 0.394% | 0.433% | 0.472% |
| Warm | 7 | 0.794% | 0.873% | 0.952% |
| Warm-agg | 5 | 3.333% | 3.667% | 4.000% |
| Cold | 3 | 16.667% | 18.333% | 20.000% |
/// Trait for quantization formats that can encode and decode tensor blocks.
pub trait TensorQuantizer {
/// The bit width of this quantizer.
fn bit_width(&self) -> u8;
/// Quantize a block of f32 values into signed codes and scale(s).
fn quantize_block(
&self,
values: &[f32],
config: &QuantConfig,
) -> QuantizedBlock;
/// Dequantize a block back to f32 values.
fn dequantize_block(
&self,
block: &QuantizedBlock,
out: &mut [f32],
) -> Result<(), CodecErr>;
/// Returns the packed byte size for `num_values` at this bit width,
/// excluding scale storage.
fn packed_data_size(&self, num_values: usize) -> usize {
(num_values * self.bit_width() as usize + 7) / 8
}
/// Returns total block storage size including scale(s) and flags.
fn block_storage_size(&self, block_size: usize) -> usize;
}
/// Trait for bit-level packing codecs.
pub trait BitCodec {
/// Pack signed codes into a byte buffer.
fn pack(
&self,
codes: &[i8],
bits: u8,
out: &mut [u8],
) -> Result<usize, CodecErr>;
/// Unpack codes from a byte buffer.
fn unpack(
&self,
data: &[u8],
bits: u8,
out: &mut [i8],
) -> Result<usize, CodecErr>;
}
/// Standard implementation using the accumulator-based codec_bits functions.
pub struct StandardBitCodec;
impl BitCodec for StandardBitCodec {
fn pack(
&self,
codes: &[i8],
bits: u8,
out: &mut [u8],
) -> Result<usize, CodecErr> {
pack_bits(codes, bits, out)
}
fn unpack(
&self,
data: &[u8],
bits: u8,
out: &mut [i8],
) -> Result<usize, CodecErr> {
unpack_bits(data, bits, out)
}
}TIER 1 (8-bit):
+--------+-------+-------+-------+-----+-------+
| scale | q[0] | q[1] | q[2] | ... | q[63] |
| f32 LE | i8 | i8 | i8 | | i8 |
+--------+-------+-------+-------+-----+-------+
4 bytes 1 1 1 1 = 68 bytes / block
TIER 2a (7-bit):
+--------+--------------------------------------------+
| scale | packed 7-bit codes (56 bytes for 64 vals) |
| f32 LE | bitstream, little-endian accumulator |
+--------+--------------------------------------------+
4 bytes ceil(64*7/8) = 56 bytes = 60 bytes / block
TIER 2b (5-bit):
+--------+--------------------------------------------+
| scale | packed 5-bit codes (40 bytes for 64 vals) |
| f32 LE | bitstream, little-endian accumulator |
+--------+--------------------------------------------+
4 bytes ceil(64*5/8) = 40 bytes = 44 bytes / block
TIER 3 standard (3-bit):
+--------+--------------------------------------------+
| scale | packed 3-bit codes (24 bytes for 64 vals) |
| f32 LE | bitstream, little-endian accumulator |
+--------+--------------------------------------------+
4 bytes ceil(64*3/8) = 24 bytes = 28 bytes / block
TIER 3 two-level (3-bit):
+--------+--------+----------+-------------------------------+
| pscale | sscale | flags | packed 3-bit codes |
| f32 LE | f32 LE | ceil(N/8)| bitstream |
+--------+--------+----------+-------------------------------+
4 4 8 bytes 24 bytes = 40 bytes / block
TIER 0 (absent):
+--------------------------------------+
| TensorMeta + ReconstructPolicy only |
| NO quantized data |
+--------------------------------------+
variable (typically 32-128 bytes metadata)
4-bit quantization (qmax = 7, 8.00x compression) is the most widely studied format (GPTQ, AWQ). We considered using 4-bit instead of 7-bit for the warm tier. Rejected because: (a) the jump from 8-bit to 4-bit is too large for tensors that were recently hot, causing unnecessary quality loss; (b) 7-bit provides a gentler step-down; (c) 5-bit is available as an intermediate when memory pressure increases.
A simpler design with only two quantization levels (8-bit hot, 4-bit everything else). Rejected because: (a) cold tensors waste 1 extra bit per value when 3-bit suffices; (b) no path to aggressive compression under memory pressure; (c) loses the granularity that enables smooth quality degradation.
Using asymmetric quantization (with zero-point) for 3-bit to better utilize the
[0, 7] unsigned range when distributions are not centered. Rejected
because: (a) adds 4 bytes of zero-point storage per block; (b) requires an
additional subtraction in the dequantize path; (c) the two-level scale approach
handles asymmetric distributions more effectively by splitting the scale rather
than shifting the range.
Using a small codebook (e.g., 8 centroids) instead of uniform 3-bit levels. Rejected because: (a) requires a per-block or per-tensor codebook training step that is expensive for streaming data; (b) codebook storage overhead is comparable to scale storage but with higher decode complexity; (c) uniform quantization is simpler to implement and reason about.
Omitting the two-level scale option entirely. Considered but rejected because agent embedding tensors frequently exhibit heavy-tailed distributions where a few dimensions carry disproportionate magnitude. Without two-level scale, these outliers cause the single scale to be too large, wasting most of the 3-bit range on the bulk of near-zero values.
-
pack_bitsfollowed byunpack_bitsis a lossless round-trip for all bit widths (3, 5, 7, 8) and all valid signed input ranges. -
quantize_s8followed bydequantizeproduces values within the theoretical error bound (scale / 254) of the originals. -
quantize_bits(7, ...)followed bydequantizeproduces values withinscale / 126of the originals. -
quantize_bits(5, ...)followed bydequantizeproduces values withinscale / 30of the originals. -
quantize_bits(3, ...)followed bydequantizeproduces values withinscale / 6of the originals (standard mode). - Two-level 3-bit mode activates when
max/median > two_level_threshold. - Tier 0 reads return zeros when
reconstruct_policyisNone. - Tier 0 reads invoke reconstruction when a policy exists.
-
pack_bitsthroughput >= 2 GB/s on native (AVX2-capable hardware). -
unpack_bitsthroughput >= 2 GB/s on native. -
max_abswith SIMD is >= 3x faster than the scalar fallback on 512+ element blocks. - WASM
pack_bits/unpack_bitsthroughput >= 500 MB/s (without SIMD). - No heap allocations in
pack_bits,unpack_bits, ormax_abs.
- 8-bit block storage: exactly
4 + block_sizebytes. - 7-bit block storage: exactly
4 + ceil(block_size * 7 / 8)bytes. - 5-bit block storage: exactly
4 + ceil(block_size * 5 / 8)bytes. - 3-bit block storage (standard): exactly
4 + ceil(block_size * 3 / 8)bytes. - 3-bit block storage (two-level): exactly
8 + ceil(block_size / 8) + ceil(block_size * 3 / 8)bytes. - No padding bits between consecutive blocks in a segment.
- When aggregate warm storage exceeds
warm_byte_cap, the least recently accessed warm tensors are re-encoded from 7-bit to 5-bit. - The downgrade is reversible: if warm storage drops below
warm_byte_cap * 0.8(hysteresis), tensors can be re-promoted to 7-bit on next access.
| Risk | Severity | Likelihood | Mitigation |
|---|---|---|---|
| 3-bit two-level scale adds format complexity without sufficient accuracy gain for most distributions | Medium | Medium | Gate behind a cargo feature two-level-cold; default to standard 3-bit. Benchmark on real agent embeddings before enabling by default. |
| Dynamic 7-bit to 5-bit downgrade causes thrashing when warm set oscillates near the byte cap | Medium | Medium | Implement hysteresis (20% band). Only downgrade when above cap; only upgrade when below 80% of cap. Rate-limit downgrades to at most once per minute. |
pack_bits accumulator overflow for large inputs |
Low | Low | The 64-bit accumulator can hold up to 56 bits of pending data (7 bytes). Since we flush at 8 bits, the maximum pending bits is bits - 1 = 7, well within the 64-bit range. No overflow possible. |
| Tier 0 reconstruction from Delta/Factor introduces unbounded latency | Medium | Low | Set a maximum reconstruction depth (default: 3). If the base tensor is also Tier 0, fail with ReconstructionDepthExceeded rather than recursing indefinitely. |
WASM scalar max_abs is a bottleneck for large tensors |
Low | High | Expected. The WASM SIMD feature flag provides 3-4x improvement. For non-SIMD targets, max_abs cost is small relative to the full quantize pipeline. |
| Block size mismatch between encoder and decoder | High | Low | Block size is stored in the segment header (ADR-017 format). Decoder reads it from the header rather than assuming a default. |
- ADR-017: Temporal Tensor Compression with Tiered Quantization. RuVector Architecture Team, 2026.
- ADR-018: Block-Based Storage Engine for Temporal Tensor Segments (forthcoming).
- Frantar, E., et al. "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers." ICLR 2023.
- Lin, J., et al. "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration." MLSys 2024.
- Kim, S., et al. "SqueezeLLM: Dense-and-Sparse Quantization." ICML 2024.
- Liu, Z., et al. "KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache." ICML 2024.
- Pelkonen, T., et al. "Gorilla: A Fast, Scalable, In-Memory Time Series Database." VLDB 2015.
- IEEE 754-2019. "IEEE Standard for Floating-Point Arithmetic."
- Lemire, D. and Boytsov, L. "Decoding billions of integers in milliseconds through vectorized bit packing." Software: Practice and Experience, 2015.
- WebAssembly SIMD Proposal. https://github.com/WebAssembly/simd. Finalized 2023.