ADR-019: Tiered Quantization Formats for Temporal Tensor Store

Status: Proposed
Date: 2026-02-08
Parent: ADR-017 Temporal Tensor Compression, ADR-018 Block-Based Storage Engine
Author: System Architecture Team

Note: Tiered quantization formats are now implemented in the rvf-quant crate as part of ADR-029 (RVF). See the RVF temperature-tiering specification (docs/research/rvf/spec/03-temperature-tiering.md).

Version History

| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 0.1 | 2026-02-08 | Architecture Team | Initial proposal |

Abstract

This ADR defines the concrete quantization formats, bit-packing layouts, and codec interfaces for the five tiers of tensor storage established in ADR-017. Where ADR-017 introduced the concept of access-frequency-driven quantization and temporal scale reuse, this document specifies the exact byte-level formats for 8-bit (Tier 1 / Hot), 7-bit and 5-bit (Tier 2 / Warm), 3-bit (Tier 3 / Cold), and Compression-to-Zero (Tier 0 / Absent). It also resolves two open design questions from ADR-017: whether 5-bit quantization is permitted within the warm tier, and how Tier 0 reads behave when no reconstruction policy exists.

The codec_bits module provides a single allocation-free bit packer/unpacker that all sub-byte formats share. The quant module provides per-format quantize and dequantize functions, with SIMD-accelerated max_abs on native targets and a portable fallback for WASM. Rust trait interfaces are defined so that new bit widths can be added without modifying the core codec.


1. Context and Motivation

1.1 Gap in ADR-017

ADR-017 established the tiered compression architecture and segment binary format but left the per-tier quantization details at the algorithmic level. Implementers need exact byte layouts to write interoperable encoders and decoders, particularly for the sub-byte formats (7-bit, 5-bit, 3-bit) where values do not align on byte boundaries.

1.2 Sub-Byte Packing Complexity

Standard 8-bit quantization maps trivially to [u8] storage. Sub-byte formats require a bit-packing codec that can write and read arbitrary-width codes into a byte stream without wasting bits. The codec must:

  • Handle bit widths 3, 5, and 7 (with 8 as a degenerate identity case).
  • Operate without heap allocations (caller provides output slice).
  • Be deterministic and platform-independent (little-endian byte order).
  • Support WASM targets where SIMD is optional.

1.3 Outlier Handling in 3-Bit

At 3 bits per value, the quantization range is [-3, +3] (qmax = 3). Large outliers in the tensor distribution can cause severe clamping. ADR-017 noted this risk but did not specify a mitigation. This ADR introduces a two-level scale option for Tier 3 that uses a 1-bit flag per value to select between a primary scale (covering the majority of values) and a secondary scale (covering outliers), while keeping the packed format compact.

1.4 Tier 0 Semantics

ADR-017 listed Compression-to-Zero as a future possibility. This ADR formalizes it: Tier 0 stores no quantized data at all. Only metadata and an optional reconstruct_policy survive. This enables aggressive memory reclamation for tensors that are no longer accessed but may be reconstructable from other sources (deltas, factors, or recomputation).

1.5 Design Questions Resolved

| Question | Resolution |
|----------|------------|
| Allow 5-bit within warm tier? | Yes. Dynamic downgrade from 7-bit to 5-bit when the warm set exceeds a configurable byte cap (warm_byte_cap). |
| Tier 0 read semantics? | Return zeros by default. If a reconstruct_policy (Delta or Factor) exists, reconstruct from the stored representation. |

2. Decision

We adopt the following five-tier quantization format hierarchy, each with a well-defined byte layout, packing strategy, and error budget:

| Tier | Name | Bits | Compression vs f32 | Use Case |
|------|------|------|--------------------|----------|
| 1 | Hot | 8 | 4.00x | Active tensors, full fidelity |
| 2a | Warm | 7 | 4.57x | Default warm, near-lossless |
| 2b | Warm-aggressive | 5 | 6.40x | Warm set exceeds warm_byte_cap |
| 3 | Cold | 3 | 10.67x | Archived tensors, bounded error |
| 0 | Absent | 0 | Infinite | No data stored; metadata only |

All sub-byte formats share the codec_bits packer. All quantization formats use symmetric per-block quantization with scale = max_abs / qmax stored as f32 per block. The choice of f32 (rather than f16 as in ADR-017 segment headers) is deliberate at this layer: the segment encoder may convert to f16 for storage, but the quantizer operates in f32 for precision during the quantize/dequantize path.


3. Detailed Design

3.1 Tier 1: 8-Bit Quantization (Hot)

Algorithm: Symmetric per-block quantization.

Given: block of N f32 values, block_size typically 64 or 128
  scale    = max_abs(values) / 127
  q[i]     = round(values[i] / scale)
  q[i]     = clamp(q[i], -127, +127)      // i8 range
  store:     q as [i8; N] + scale as f32

Storage layout (one block, block_size = 8 for illustration):

Byte offset:  0    1    2    3    4    5    6    7    8    9   10   11
             [  scale (f32, LE)  ] [q0] [q1] [q2] [q3] [q4] [q5] [q6] [q7]
              ~~~~~~~~~~~~~~~~~~~  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
              4 bytes              8 bytes (1 byte per i8 value)

Total per block: 4 + block_size bytes

Effective compression (block_size = 64):

raw     = 64 * 4           = 256 bytes
quant   = 4 + 64 * 1       = 68 bytes
ratio   = 256 / 68         = 3.76x (single block)

With temporal amortization (100 frames sharing scales): 256*100 / (4 + 64*100) = 4.00x.

Dequantize:

values[i] = q[i] as f32 * scale

Error bound: max_error = scale / 2 = max_abs / (2 * 127). See Section 3.8 for the full analysis.
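As a concreteness check, the Tier 1 path can be sketched as a standalone function pair. `quantize_block_s8` and `dequantize_block_s8` are illustrative names for this sketch, not the quant module API of Section 3.7:

```rust
// Illustrative sketch of the Tier 1 (8-bit) algorithm above.
// Function names are hypothetical, not part of the quant module API.
pub fn quantize_block_s8(values: &[f32]) -> (f32, Vec<i8>) {
    let max_abs = values.iter().fold(0.0f32, |m, &v| m.max(v.abs()));
    if max_abs == 0.0 {
        return (0.0, vec![0; values.len()]); // all-zero block
    }
    let scale = max_abs / 127.0;
    let codes = values
        .iter()
        .map(|&v| (v / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (scale, codes)
}

pub fn dequantize_block_s8(scale: f32, codes: &[i8]) -> Vec<f32> {
    codes.iter().map(|&q| q as f32 * scale).collect()
}

fn main() {
    let block = [0.9f32, -0.3, 0.0, 1.0, -1.0, 0.6, 0.05, -0.45];
    let (scale, codes) = quantize_block_s8(&block);
    let restored = dequantize_block_s8(scale, &codes);
    // Each element lands within half a quantization step of the original.
    for (x, y) in block.iter().zip(&restored) {
        assert!((x - y).abs() <= scale / 2.0 + 1e-6);
    }
}
```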

3.2 Tier 2a: 7-Bit Quantization (Warm)

Algorithm: Symmetric per-block, 7-bit codes packed into a bitstream.

Given: block of N f32 values
  scale    = max_abs(values) / 63          // qmax = 2^(7-1) - 1 = 63
  q[i]     = round(values[i] / scale)
  q[i]     = clamp(q[i], -63, +63)
  u[i]     = q[i] + 63                    // bias to unsigned [0, 126], fits 7 bits
  pack u[i] values using codec_bits at width=7

Bit-packing layout (8 values packed into 7 bytes):

Values:     u0       u1       u2       u3       u4       u5       u6       u7
Bits:    [6..0]   [6..0]   [6..0]   [6..0]   [6..0]   [6..0]   [6..0]   [6..0]
         7 bits   7 bits   7 bits   7 bits   7 bits   7 bits   7 bits   7 bits

Packed into 7 bytes (56 bits = 8 * 7 bits):

Byte 0:  [u0[6:0]              | u1[0]  ]   = u0(7) + u1(1) = 8 bits
         |<--- 7 bits --->|<1>|

Byte 1:  [u1[6:1]                | u2[1:0]]  = u1(6) + u2(2) = 8 bits
         |<--- 6 bits --->|<-2->|

Byte 2:  [u2[6:2]              | u3[2:0] ]   = u2(5) + u3(3) = 8 bits
         |<-- 5 bits -->|<--3-->|

Byte 3:  [u3[6:3]            | u4[3:0]   ]   = u3(4) + u4(4) = 8 bits
         |<- 4 bits ->|<--4--->|

Byte 4:  [u4[6:4]          | u5[4:0]     ]   = u4(3) + u5(5) = 8 bits
         |<-3->|<---- 5 bits ---->|

Byte 5:  [u5[6:5]        | u6[5:0]       ]   = u5(2) + u6(6) = 8 bits
         |<2>|<----- 6 bits ------>|

Byte 6:  [u6[6]        | u7[6:0]         ]   = u6(1) + u7(7) = 8 bits
         |1|<------- 7 bits ------->|

Total:   7 bytes for 8 values = 0.875 bytes/value

Storage per block (block_size = 64):

scale:   4 bytes (f32)
data:    ceil(64 * 7 / 8) = 56 bytes
total:   60 bytes
ratio:   256 / 60 = 4.27x

3.3 Tier 2b: 5-Bit Quantization (Warm Aggressive)

Algorithm: Symmetric per-block, 5-bit codes.

Given: block of N f32 values
  scale    = max_abs(values) / 15          // qmax = 2^(5-1) - 1 = 15
  q[i]     = round(values[i] / scale)
  q[i]     = clamp(q[i], -15, +15)
  u[i]     = q[i] + 15                    // bias to unsigned [0, 30], fits 5 bits
  pack u[i] values using codec_bits at width=5

Activation policy: 5-bit is used instead of 7-bit when the total warm set size exceeds warm_byte_cap (default: 64 MiB). The tier policy monitors aggregate warm storage and downgrades from 7-bit to 5-bit for the least recently accessed warm tensors until the cap is satisfied.

Bit-packing layout (8 values packed into 5 bytes):

Values:     u0       u1       u2       u3       u4       u5       u6       u7
Bits:    [4..0]   [4..0]   [4..0]   [4..0]   [4..0]   [4..0]   [4..0]   [4..0]
         5 bits   5 bits   5 bits   5 bits   5 bits   5 bits   5 bits   5 bits

Packed into 5 bytes (40 bits = 8 * 5 bits):

Byte 0:  [u0[4:0]        | u1[2:0]   ]   = u0(5) + u1(3) = 8 bits
         |<-- 5 bits -->|<--3-->|

Byte 1:  [u1[4:3]    | u2[4:0]   | u3[0]]  = u1(2) + u2(5) + u3(1) = 8 bits
         |<2>|<-- 5 bits -->|<1>|

Byte 2:  [u3[4:1]          | u4[3:0]  ]   = u3(4) + u4(4) = 8 bits
         |<-- 4 bits -->|<--4-->|

Byte 3:  [u4[4]    | u5[4:0]   | u6[1:0]] = u4(1) + u5(5) + u6(2) = 8 bits
         |1|<-- 5 bits -->|<-2->|

Byte 4:  [u6[4:2]        | u7[4:0]    ]   = u6(3) + u7(5) = 8 bits
         |<-3->|<--- 5 bits --->|

Total:   5 bytes for 8 values = 0.625 bytes/value

Storage per block (block_size = 64):

scale:   4 bytes (f32)
data:    ceil(64 * 5 / 8) = 40 bytes
total:   44 bytes
ratio:   256 / 44 = 5.82x

3.4 Tier 3: 3-Bit Quantization (Cold)

Algorithm: Symmetric per-block, 3-bit codes with optional two-level scale.

Standard Mode

Given: block of N f32 values
  scale    = max_abs(values) / 3           // qmax = 2^(3-1) - 1 = 3
  q[i]     = round(values[i] / scale)
  q[i]     = clamp(q[i], -3, +3)
  u[i]     = q[i] + 3                     // bias to unsigned [0, 6], fits 3 bits
  pack u[i] values using codec_bits at width=3

Two-Level Scale Mode (Outlier Handling)

When the value distribution has outliers (values significantly larger than the bulk of the distribution), a single scale wastes most of the 3-bit range on the long tail. The two-level scale splits the range:

Given: block of N f32 values, outlier_fraction (default: 0.05)
  sorted_abs     = sort(|values|, descending)
  outlier_count  = ceil(N * outlier_fraction)
  primary_max    = sorted_abs[outlier_count]      // excludes top 5%
  secondary_max  = sorted_abs[0]                  // full range

  primary_scale   = primary_max / 3               // covers bulk values
  secondary_scale = secondary_max / 3             // covers outliers

  For each value[i]:
    if |value[i]| > primary_max:
      flag[i] = 1                                  // use secondary scale
      q[i]    = round(value[i] / secondary_scale)
    else:
      flag[i] = 0                                  // use primary scale
      q[i]    = round(value[i] / primary_scale)
    q[i] = clamp(q[i], -3, +3)
    u[i] = q[i] + 3

  store: primary_scale (f32) + secondary_scale (f32) + flag bits + packed codes

Bit-packing layout (8 values packed into 3 bytes):

Values:     u0       u1       u2       u3       u4       u5       u6       u7
Bits:    [2..0]   [2..0]   [2..0]   [2..0]   [2..0]   [2..0]   [2..0]   [2..0]
         3 bits   3 bits   3 bits   3 bits   3 bits   3 bits   3 bits   3 bits

Packed into 3 bytes (24 bits = 8 * 3 bits):

Byte 0:  [u0[2:0]  | u1[2:0]  | u2[1:0] ]   = u0(3) + u1(3) + u2(2) = 8 bits
         |<-3->|<-3->|<2>|

Byte 1:  [u2[2]  | u3[2:0]  | u4[2:0]  | u5[0]]  = u2(1) + u3(3) + u4(3) + u5(1) = 8 bits
         |1|<-3->|<-3->|1|

Byte 2:  [u5[2:1]  | u6[2:0]  | u7[2:0]  ]   = u5(2) + u6(3) + u7(3) = 8 bits
         |<2>|<-3->|<-3->|

Total:   3 bytes for 8 values = 0.375 bytes/value

Two-level scale storage layout (one block, block_size = 64):

Byte offset:  0                   4                     8              16       ... 39
             [primary_scale f32] [secondary_scale f32] [flag bytes  ] [packed codes]
              ~~~~~~~~~~~~~~~~~~  ~~~~~~~~~~~~~~~~~~~~  ~~~~~~~~~~~~  ~~~~~~~~~~~~~
              4 bytes             4 bytes               ceil(64/8)=8   ceil(64*3/8)=24

Total per block (two-level): 4 + 4 + 8 + 24 = 40 bytes
Total per block (standard):  4 + 24          = 28 bytes
ratio (standard):  256 / 28 = 9.14x
ratio (two-level): 256 / 40 = 6.40x

The two-level mode trades compression ratio for outlier fidelity. It is selected automatically when the ratio max_abs / median_abs exceeds a configurable threshold (default: 5.0), indicating a heavy-tailed distribution.
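A minimal sketch of the two-level selection and quantization step, assuming the defaults above. `quantize_two_level_3bit` is a hypothetical name for illustration; the real entry point is `quantize_bits(3, ...)`:

```rust
// Illustrative sketch of the two-level 3-bit scheme described above.
// Returns (primary_scale, secondary_scale, flags, codes).
pub fn quantize_two_level_3bit(
    values: &[f32],
    outlier_fraction: f32,
) -> (f32, f32, Vec<u8>, Vec<i8>) {
    let mut sorted_abs: Vec<f32> = values.iter().map(|v| v.abs()).collect();
    // Descending sort (finite values assumed).
    sorted_abs.sort_by(|a, b| b.partial_cmp(a).unwrap());
    let outlier_count = ((values.len() as f32) * outlier_fraction).ceil() as usize;
    let primary_max = sorted_abs[outlier_count.min(sorted_abs.len() - 1)];
    let secondary_max = sorted_abs[0];
    let primary_scale = primary_max / 3.0;
    let secondary_scale = secondary_max / 3.0;

    let mut flags = Vec::with_capacity(values.len());
    let mut codes = Vec::with_capacity(values.len());
    for &v in values {
        let (flag, scale) = if v.abs() > primary_max {
            (1u8, secondary_scale) // outlier: use the wider scale
        } else {
            (0u8, primary_scale) // bulk value: use the tighter scale
        };
        let q = if scale > 0.0 {
            (v / scale).round().clamp(-3.0, 3.0) as i8
        } else {
            0
        };
        flags.push(flag);
        codes.push(q);
    }
    (primary_scale, secondary_scale, flags, codes)
}

fn main() {
    // 15 bulk values near 1.0 and one 10x outlier.
    let mut block = vec![1.0f32; 15];
    block.push(10.0);
    let (ps, ss, flags, codes) = quantize_two_level_3bit(&block, 0.05);
    assert_eq!(flags.iter().filter(|&&f| f == 1).count(), 1); // only the outlier
    assert!(ps < ss);
    assert_eq!(codes[15], 3); // outlier saturates at qmax under the secondary scale
}
```

Without the second scale, the single scale would be 10.0 / 3, and every bulk value near 1.0 would collapse to the nearest multiple of 3.33.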

3.5 Tier 0: Compression to Zero (Absent)

Algorithm: No quantized data is stored.

Tier 0 representation:
  metadata:            TensorMeta (id, shape, dtype, timestamps)
  reconstruct_policy:  Option<ReconstructPolicy>
  quantized_data:      None

enum ReconstructPolicy {
    None,                                    // reads return zeros
    Delta { base_id: TensorId, delta: ... }, // reconstruct as base + delta
    Factor { source_id: TensorId, ... },     // reconstruct via transformation
}

Read semantics:

| reconstruct_policy | Behavior |
|--------------------|----------|
| None | Return a zero-filled tensor of the recorded shape. Fast-fail mode returns Err(TierZeroNoPolicy) instead. |
| Delta | Load the base tensor and apply the stored delta. May trigger recursive decompression if the base is also tiered. |
| Factor | Load the source tensor and apply the stored transformation (scale, permutation, projection). |

Transition to Tier 0: A tensor is eligible for Tier 0 when its tier score drops below absent_min_score (default: 1) and it has not been accessed for longer than absent_age_threshold (default: 24 hours). The transition is irreversible without external data: once quantized data is discarded, only the reconstruction policy (if any) can recover approximate values.
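The read semantics can be sketched as follows. The `ReconstructPolicy` payloads are simplified here (the ADR elides the delta and factor encodings), and `read_tier0` is an illustrative name:

```rust
// Sketch of the Tier 0 read path described above; payloads simplified.
pub enum ReconstructPolicy {
    None,
    Delta { base: Vec<f32>, delta: Vec<f32> },
}

#[derive(Debug)]
pub enum TierErr {
    TierZeroNoPolicy,
}

/// `fast_fail` selects Err(TierZeroNoPolicy) instead of the zero-fill default.
pub fn read_tier0(
    num_elements: usize,
    policy: &ReconstructPolicy,
    fast_fail: bool,
) -> Result<Vec<f32>, TierErr> {
    match policy {
        ReconstructPolicy::None if fast_fail => Err(TierErr::TierZeroNoPolicy),
        ReconstructPolicy::None => Ok(vec![0.0; num_elements]),
        ReconstructPolicy::Delta { base, delta } => {
            // Reconstruct as base + delta, element-wise.
            Ok(base.iter().zip(delta).map(|(b, d)| b + d).collect())
        }
    }
}

fn main() {
    assert_eq!(
        read_tier0(4, &ReconstructPolicy::None, false).unwrap(),
        vec![0.0; 4]
    );
    assert!(read_tier0(4, &ReconstructPolicy::None, true).is_err());
    let p = ReconstructPolicy::Delta {
        base: vec![1.0, 2.0],
        delta: vec![0.5, -0.5],
    };
    assert_eq!(read_tier0(2, &p, false).unwrap(), vec![1.5, 1.5]);
}
```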

3.6 Bit Packing Module: codec_bits

The core packing and unpacking functions shared by all sub-byte formats.

/// Errors from bit codec operations.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum CodecErr {
    /// Output buffer too small. Contains the required size in bytes.
    OutputTooSmall { required: usize },
    /// Input buffer too small for the declared number of values.
    InputTooSmall { required: usize },
    /// Bit width must be in [1, 8].
    InvalidBitWidth { bits: u8 },
}

/// Pack `values.len()` signed codes into `out`, using `bits` bits per code.
///
/// Each value in `values` is treated as a signed integer in `[-(2^(bits-1)-1), 2^(bits-1)-1]`.
/// It is biased to unsigned before packing: `u = v + (2^(bits-1) - 1)`.
///
/// Returns the number of bytes written to `out`.
///
/// # Errors
/// - `CodecErr::OutputTooSmall` if `out` cannot hold the packed data.
/// - `CodecErr::InvalidBitWidth` if `bits` is 0 or greater than 8.
pub fn pack_bits(values: &[i8], bits: u8, out: &mut [u8]) -> Result<usize, CodecErr> {
    if bits == 0 || bits > 8 {
        return Err(CodecErr::InvalidBitWidth { bits });
    }
    let total_bits = values.len() as u64 * bits as u64;
    let required = ((total_bits + 7) / 8) as usize;
    if out.len() < required {
        return Err(CodecErr::OutputTooSmall { required });
    }

    let qmax: i16 = (1i16 << (bits - 1)) - 1;  // bias offset; i16 avoids i8 overflow when bits = 8
    let mask: u64 = (1u64 << bits) - 1;
    let mut acc: u64 = 0;
    let mut acc_bits: u32 = 0;
    let mut pos: usize = 0;

    for &v in values {
        let u = (v as i16 + qmax as i16) as u64 & mask;
        acc |= u << acc_bits;
        acc_bits += bits as u32;
        while acc_bits >= 8 {
            out[pos] = (acc & 0xFF) as u8;
            pos += 1;
            acc >>= 8;
            acc_bits -= 8;
        }
    }
    // Flush remaining bits
    if acc_bits > 0 {
        out[pos] = (acc & 0xFF) as u8;
        pos += 1;
    }
    Ok(pos)
}

/// Unpack codes from `inp` into `out`, reading `bits` bits per code.
///
/// Reads exactly `out.len()` values. Each unsigned code is unbiased back to signed:
/// `v = u - (2^(bits-1) - 1)`.
///
/// Returns the number of bytes consumed from `inp`.
///
/// # Errors
/// - `CodecErr::InputTooSmall` if `inp` does not contain enough data.
/// - `CodecErr::InvalidBitWidth` if `bits` is 0 or greater than 8.
pub fn unpack_bits(inp: &[u8], bits: u8, out: &mut [i8]) -> Result<usize, CodecErr> {
    if bits == 0 || bits > 8 {
        return Err(CodecErr::InvalidBitWidth { bits });
    }
    let total_bits = out.len() as u64 * bits as u64;
    let required = ((total_bits + 7) / 8) as usize;
    if inp.len() < required {
        return Err(CodecErr::InputTooSmall { required });
    }

    let qmax: i16 = (1i16 << (bits - 1)) - 1;  // i16 avoids i8 overflow when bits = 8
    let mask: u64 = (1u64 << bits) - 1;
    let mut acc: u64 = 0;
    let mut acc_bits: u32 = 0;
    let mut byte_pos: usize = 0;
    let mut val_pos: usize = 0;

    while val_pos < out.len() {
        while acc_bits < bits as u32 {
            acc |= (inp[byte_pos] as u64) << acc_bits;
            acc_bits += 8;
            byte_pos += 1;
        }
        let u = (acc & mask) as i16;
        out[val_pos] = (u - qmax as i16) as i8;
        acc >>= bits;
        acc_bits -= bits as u32;
        val_pos += 1;
    }
    Ok(required)
}

Properties:

  • No heap allocations. Callers provide both input and output slices.
  • Single bit writer / bit reader using a 64-bit accumulator.
  • Deterministic little-endian byte order.
  • The pack_bits / unpack_bits pair is lossless: unpack(pack(v)) == v for all valid inputs.

3.7 Quant Module Functions

/// Block-level quantization configuration.
pub struct QuantConfig {
    pub block_size: usize,          // elements per quantization block (default: 64)
    pub two_level_threshold: f32,   // max/median ratio to trigger two-level (default: 5.0)
}

/// Quantized block result.
pub struct QuantizedBlock {
    pub scale: f32,
    pub secondary_scale: Option<f32>,    // only for two-level 3-bit
    pub flags: Option<Vec<u8>>,          // 1-bit-per-value flags for two-level
    pub codes: Vec<i8>,                  // signed quantized codes
    pub bits: u8,
}

/// Symmetric 8-bit quantization (Tier 1 - Hot).
///
/// Quantizes each block of `block_size` values independently.
/// scale = max_abs(block) / 127
/// q[i]  = clamp(round(x[i] / scale), -127, 127)
pub fn quantize_s8(
    values: &[f32],
    config: &QuantConfig,
) -> Vec<QuantizedBlock>;

/// Symmetric N-bit quantization (Tier 2/3 - Warm/Cold).
///
/// `bits` must be one of: 7, 5, 3.
/// qmax  = 2^(bits-1) - 1
/// scale = max_abs(block) / qmax
/// q[i]  = clamp(round(x[i] / scale), -qmax, qmax)
///
/// For bits=3 and config.two_level_threshold exceeded: uses two-level scale.
pub fn quantize_bits(
    values: &[f32],
    bits: u8,
    config: &QuantConfig,
) -> Vec<QuantizedBlock>;

/// Dequantize a block back to f32 values.
///
/// For standard mode:   x'[i] = codes[i] as f32 * scale
/// For two-level mode:  x'[i] = codes[i] as f32 * (if flags[i] then secondary_scale else scale)
pub fn dequantize(block: &QuantizedBlock) -> Vec<f32>;

/// Compute the maximum absolute value across a slice.
///
/// On native targets with `target_feature = "avx2"` or `target_feature = "neon"`:
///   uses SIMD intrinsics for 4-8x throughput.
/// On WASM with `target_feature = "simd128"` (optional):
///   uses wasm_simd128 intrinsics.
/// Fallback: portable scalar loop.
#[inline]
pub fn max_abs(values: &[f32]) -> f32;
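A portable scalar sketch of `dequantize` under the structures above, covering both standard and two-level blocks. `QuantizedBlock` is restated so the example compiles standalone:

```rust
// Restated from this section so the sketch is self-contained.
pub struct QuantizedBlock {
    pub scale: f32,
    pub secondary_scale: Option<f32>,
    pub flags: Option<Vec<u8>>,
    pub codes: Vec<i8>,
    pub bits: u8,
}

/// Portable scalar sketch of the dequantize path.
pub fn dequantize(block: &QuantizedBlock) -> Vec<f32> {
    match (&block.flags, block.secondary_scale) {
        // Two-level mode: each flag bit selects primary vs secondary scale.
        (Some(flags), Some(sec)) => block
            .codes
            .iter()
            .zip(flags)
            .map(|(&q, &f)| q as f32 * if f != 0 { sec } else { block.scale })
            .collect(),
        // Standard mode: a single per-block scale.
        _ => block.codes.iter().map(|&q| q as f32 * block.scale).collect(),
    }
}

fn main() {
    let b = QuantizedBlock {
        scale: 0.5,
        secondary_scale: Some(2.0),
        flags: Some(vec![0, 1]),
        codes: vec![3, 3],
        bits: 3,
    };
    // Same code (3) decodes differently depending on the flag bit.
    assert_eq!(dequantize(&b), vec![1.5, 6.0]);
}
```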

SIMD implementation sketch for max_abs (AVX2):

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn max_abs_avx2(values: &[f32]) -> f32 {
    use std::arch::x86_64::*;
    let sign_mask = _mm256_set1_ps(f32::from_bits(0x7FFF_FFFF)); // abs mask
    let mut vmax = _mm256_setzero_ps();
    let chunks = values.len() / 8;

    for i in 0..chunks {
        let v = _mm256_loadu_ps(values.as_ptr().add(i * 8));
        let abs_v = _mm256_and_ps(v, sign_mask);
        vmax = _mm256_max_ps(vmax, abs_v);
    }

    // Horizontal max reduction
    let hi128 = _mm256_extractf128_ps(vmax, 1);
    let lo128 = _mm256_castps256_ps128(vmax);
    let max128 = _mm_max_ps(hi128, lo128);
    let shuf = _mm_movehdup_ps(max128);
    let max64 = _mm_max_ps(max128, shuf);
    let shuf2 = _mm_movehl_ps(max64, max64);
    let max32 = _mm_max_ss(max64, shuf2);
    let mut result = _mm_cvtss_f32(max32);

    // Handle remainder
    for i in (chunks * 8)..values.len() {
        result = result.max(values[i].abs());
    }
    result
}

WASM portable fallback:

#[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64")))]
pub fn max_abs(values: &[f32]) -> f32 {
    let mut m: f32 = 0.0;
    for &v in values {
        let a = v.abs();
        if a > m {
            m = a;
        }
    }
    m
}

When WASM SIMD is enabled via target_feature = "simd128", a vectorized path processes 4 f32 values per iteration using v128 types. This is optional and gated behind a cargo feature flag wasm-simd.

3.8 Error Bound Analysis

For symmetric quantization with bit width B, block maximum magnitude m = max_abs, and qmax = 2^(B-1) - 1, the per-block scale is s = m / qmax and:

quantization_step   = s = m / qmax
max_element_error   = quantization_step / 2          (from rounding)
max_relative_error  = 1 / (2 * qmax)                 (relative to m, worst case)
rms_error           = quantization_step / sqrt(12)   (uniform quantization noise)
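The formulas can be checked numerically against the table below; `bounds` is an illustrative helper returning (max relative error, RMS relative error) for a given bit width:

```rust
/// Max and RMS relative error for symmetric quantization at `bits` width:
/// qmax = 2^(bits-1) - 1, max = 1 / (2 * qmax), rms = (1 / qmax) / sqrt(12).
pub fn bounds(bits: u32) -> (f32, f32) {
    let qmax = ((1u32 << (bits - 1)) - 1) as f32;
    (1.0 / (2.0 * qmax), (1.0 / qmax) / 12f32.sqrt())
}

fn main() {
    let (max8, rms8) = bounds(8);
    assert!((max8 - 0.00394).abs() < 1e-4); // 0.394%
    assert!((rms8 - 0.00228).abs() < 1e-4); // 0.228%
    let (max3, rms3) = bounds(3);
    assert!((max3 - 0.16667).abs() < 1e-4); // 16.667%
    assert!((rms3 - 0.09623).abs() < 1e-4); // 9.623%
}
```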

Per-tier error bounds:

| Tier | Bits | qmax | Max Rel. Error | RMS Rel. Error | Max Abs. Error (max_abs = 1.0) |
|------|------|------|----------------|----------------|--------------------------------|
| Hot (8-bit) | 8 | 127 | 0.394% | 0.228% | 0.00394 |
| Warm (7-bit) | 7 | 63 | 0.794% | 0.458% | 0.00794 |
| Warm-agg (5-bit) | 5 | 15 | 3.333% | 1.925% | 0.03333 |
| Cold (3-bit, std) | 3 | 3 | 16.667% | 9.623% | 0.16667 |
| Cold (3-bit, 2-level) | 3 | 3 | 16.667% per scale | 9.623% | Reduced for bulk values |

Two-level scale improvement for 3-bit: When 95% of values fall within primary_max and outliers use secondary_scale:

| Component | Fraction | Scale | Effective Max Error |
|-----------|----------|-------|---------------------|
| Bulk values (95%) | 0.95 | primary_scale (smaller) | 16.7% of primary_max |
| Outlier values (5%) | 0.05 | secondary_scale (larger) | 16.7% of secondary_max |

The bulk values achieve much lower absolute error because primary_scale is typically 3-10x smaller than a single shared scale would be. The outliers retain the same relative error but are few in number.

Drift compounding: When drift tolerance is d (e.g., 10%), and a frame is quantized with scales from an earlier frame, the effective max relative error becomes (1 + d) / (2 * qmax). For 8-bit with 10% drift: 1.1 / 254 = 0.433%.

Cumulative error table with drift:

| Tier | Bits | No Drift | 10% Drift | 20% Drift |
|------|------|----------|-----------|-----------|
| Hot | 8 | 0.394% | 0.433% | 0.472% |
| Warm | 7 | 0.794% | 0.873% | 0.952% |
| Warm-agg | 5 | 3.333% | 3.667% | 4.000% |
| Cold | 3 | 16.667% | 18.333% | 20.000% |

3.9 Complete Quantizer and Packer Traits

/// Trait for quantization formats that can encode and decode tensor blocks.
pub trait TensorQuantizer {
    /// The bit width of this quantizer.
    fn bit_width(&self) -> u8;

    /// Quantize a block of f32 values into signed codes and scale(s).
    fn quantize_block(
        &self,
        values: &[f32],
        config: &QuantConfig,
    ) -> QuantizedBlock;

    /// Dequantize a block back to f32 values.
    fn dequantize_block(
        &self,
        block: &QuantizedBlock,
        out: &mut [f32],
    ) -> Result<(), CodecErr>;

    /// Returns the packed byte size for `num_values` at this bit width,
    /// excluding scale storage.
    fn packed_data_size(&self, num_values: usize) -> usize {
        (num_values * self.bit_width() as usize + 7) / 8
    }

    /// Returns total block storage size including scale(s) and flags.
    fn block_storage_size(&self, block_size: usize) -> usize;
}
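The default `packed_data_size` is what makes new bit widths cheap to add. A trimmed stand-in illustrates this (`PackedSize` here is a renamed subset of `TensorQuantizer` so the example is self-contained): only `bit_width` needs to be supplied.

```rust
// Trimmed stand-in for TensorQuantizer, kept self-contained for illustration.
pub trait PackedSize {
    fn bit_width(&self) -> u8;

    /// Same default as TensorQuantizer::packed_data_size.
    fn packed_data_size(&self, num_values: usize) -> usize {
        (num_values * self.bit_width() as usize + 7) / 8
    }
}

struct Q7;
impl PackedSize for Q7 {
    fn bit_width(&self) -> u8 { 7 }
}

struct Q3;
impl PackedSize for Q3 {
    fn bit_width(&self) -> u8 { 3 }
}

fn main() {
    assert_eq!(Q7.packed_data_size(64), 56); // matches Section 3.2
    assert_eq!(Q3.packed_data_size(64), 24); // matches Section 3.4
}
```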

/// Trait for bit-level packing codecs.
pub trait BitCodec {
    /// Pack signed codes into a byte buffer.
    fn pack(
        &self,
        codes: &[i8],
        bits: u8,
        out: &mut [u8],
    ) -> Result<usize, CodecErr>;

    /// Unpack codes from a byte buffer.
    fn unpack(
        &self,
        data: &[u8],
        bits: u8,
        out: &mut [i8],
    ) -> Result<usize, CodecErr>;
}

/// Standard implementation using the accumulator-based codec_bits functions.
pub struct StandardBitCodec;

impl BitCodec for StandardBitCodec {
    fn pack(
        &self,
        codes: &[i8],
        bits: u8,
        out: &mut [u8],
    ) -> Result<usize, CodecErr> {
        pack_bits(codes, bits, out)
    }

    fn unpack(
        &self,
        data: &[u8],
        bits: u8,
        out: &mut [i8],
    ) -> Result<usize, CodecErr> {
        unpack_bits(data, bits, out)
    }
}

3.10 Block Storage Summary Diagram

TIER 1 (8-bit):
+--------+-------+-------+-------+-----+-------+
| scale  | q[0]  | q[1]  | q[2]  | ... | q[63] |
| f32 LE | i8    | i8    | i8    |     | i8    |
+--------+-------+-------+-------+-----+-------+
  4 bytes   1       1       1             1        = 68 bytes / block

TIER 2a (7-bit):
+--------+--------------------------------------------+
| scale  | packed 7-bit codes (56 bytes for 64 vals)   |
| f32 LE | bitstream, little-endian accumulator        |
+--------+--------------------------------------------+
  4 bytes   ceil(64*7/8) = 56 bytes                    = 60 bytes / block

TIER 2b (5-bit):
+--------+--------------------------------------------+
| scale  | packed 5-bit codes (40 bytes for 64 vals)   |
| f32 LE | bitstream, little-endian accumulator        |
+--------+--------------------------------------------+
  4 bytes   ceil(64*5/8) = 40 bytes                    = 44 bytes / block

TIER 3 standard (3-bit):
+--------+--------------------------------------------+
| scale  | packed 3-bit codes (24 bytes for 64 vals)   |
| f32 LE | bitstream, little-endian accumulator        |
+--------+--------------------------------------------+
  4 bytes   ceil(64*3/8) = 24 bytes                    = 28 bytes / block

TIER 3 two-level (3-bit):
+--------+--------+----------+-------------------------------+
| pscale | sscale | flags    | packed 3-bit codes            |
| f32 LE | f32 LE | ceil(N/8)| bitstream                     |
+--------+--------+----------+-------------------------------+
  4         4        8 bytes    24 bytes                       = 40 bytes / block

TIER 0 (absent):
+--------------------------------------+
| TensorMeta + ReconstructPolicy only  |
| NO quantized data                    |
+--------------------------------------+
  variable (typically 32-128 bytes metadata)

4. Alternatives Considered

4.1 4-Bit as the Warm Tier

4-bit quantization (qmax = 7, 8.00x compression) is the most widely studied format (GPTQ, AWQ). We considered using 4-bit instead of 7-bit for the warm tier. Rejected because: (a) the jump from 8-bit to 4-bit is too large for tensors that were recently hot, causing unnecessary quality loss; (b) 7-bit provides a gentler step-down; (c) 5-bit is available as an intermediate when memory pressure increases.

4.2 Uniform 4-Bit Across All Non-Hot Tiers

A simpler design with only two quantization levels (8-bit hot, 4-bit everything else). Rejected because: (a) cold tensors waste 1 extra bit per value when 3-bit suffices; (b) no path to aggressive compression under memory pressure; (c) loses the granularity that enables smooth quality degradation.

4.3 Asymmetric Quantization for 3-Bit

Using asymmetric quantization (with zero-point) for 3-bit to better utilize the [0, 7] unsigned range when distributions are not centered. Rejected because: (a) adds 4 bytes of zero-point storage per block; (b) requires an additional subtraction in the dequantize path; (c) the two-level scale approach handles asymmetric distributions more effectively by splitting the scale rather than shifting the range.

4.4 Lookup Table (Codebook) Quantization for Cold

Using a small codebook (e.g., 8 centroids) instead of uniform 3-bit levels. Rejected because: (a) requires a per-block or per-tensor codebook training step that is expensive for streaming data; (b) codebook storage overhead is comparable to scale storage but with higher decode complexity; (c) uniform quantization is simpler to implement and reason about.

4.5 No Two-Level Scale (Simpler 3-Bit)

Omitting the two-level scale option entirely. Considered but rejected because agent embedding tensors frequently exhibit heavy-tailed distributions where a few dimensions carry disproportionate magnitude. Without two-level scale, these outliers cause the single scale to be too large, wasting most of the 3-bit range on the bulk of near-zero values.


5. Acceptance Criteria

5.1 Format Correctness

  • pack_bits followed by unpack_bits is a lossless round-trip for all bit widths (3, 5, 7, 8) and all valid signed input ranges.
  • quantize_s8 followed by dequantize produces values within the theoretical error bound (scale / 2, i.e., max_abs / 254) of the originals.
  • quantize_bits(7, ...) followed by dequantize produces values within scale / 2 (max_abs / 126) of the originals.
  • quantize_bits(5, ...) followed by dequantize produces values within scale / 2 (max_abs / 30) of the originals.
  • quantize_bits(3, ...) followed by dequantize produces values within scale / 2 (max_abs / 6) of the originals (standard mode).
  • Two-level 3-bit mode activates when max/median > two_level_threshold.
  • Tier 0 reads return zeros when reconstruct_policy is None.
  • Tier 0 reads invoke reconstruction when a policy exists.

5.2 Performance

  • pack_bits throughput >= 2 GB/s on native (AVX2-capable hardware).
  • unpack_bits throughput >= 2 GB/s on native.
  • max_abs with SIMD is >= 3x faster than the scalar fallback on 512+ element blocks.
  • WASM pack_bits / unpack_bits throughput >= 500 MB/s (without SIMD).
  • No heap allocations in pack_bits, unpack_bits, or max_abs.

5.3 Storage Efficiency

  • 8-bit block storage: exactly 4 + block_size bytes.
  • 7-bit block storage: exactly 4 + ceil(block_size * 7 / 8) bytes.
  • 5-bit block storage: exactly 4 + ceil(block_size * 5 / 8) bytes.
  • 3-bit block storage (standard): exactly 4 + ceil(block_size * 3 / 8) bytes.
  • 3-bit block storage (two-level): exactly 8 + ceil(block_size / 8) + ceil(block_size * 3 / 8) bytes.
  • No padding bits between consecutive blocks in a segment.
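These size criteria can be checked with an illustrative helper (`block_storage_size` here is a standalone calculation, not the trait method of Section 3.9):

```rust
/// Storage size of one quantized block, per the layouts in Section 3.
/// Illustrative helper; not part of the specified API.
pub fn block_storage_size(bits: usize, block_size: usize, two_level: bool) -> usize {
    let data = (block_size * bits + 7) / 8; // ceil(block_size * bits / 8)
    if two_level {
        8 + (block_size + 7) / 8 + data // two f32 scales + flag bits + codes
    } else {
        4 + data // one f32 scale + codes
    }
}

fn main() {
    assert_eq!(block_storage_size(8, 64, false), 68);
    assert_eq!(block_storage_size(7, 64, false), 60);
    assert_eq!(block_storage_size(5, 64, false), 44);
    assert_eq!(block_storage_size(3, 64, false), 28);
    assert_eq!(block_storage_size(3, 64, true), 40);
}
```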

5.4 Dynamic Tier 2 Downgrade

  • When aggregate warm storage exceeds warm_byte_cap, the least recently accessed warm tensors are re-encoded from 7-bit to 5-bit.
  • The downgrade is reversible: if warm storage drops below warm_byte_cap * 0.8 (hysteresis), tensors can be re-promoted to 7-bit on next access.
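A sketch of that decision logic, with illustrative names (`WarmAction` and `warm_policy` are not part of any specified API):

```rust
// Sketch of the warm-tier downgrade/upgrade decision with the
// 20% hysteresis band described above. Names are illustrative.
#[derive(Debug, PartialEq)]
pub enum WarmAction {
    Downgrade, // re-encode LRU warm tensors from 7-bit to 5-bit
    Upgrade,   // eligible for re-promotion to 7-bit on next access
    Hold,      // inside the hysteresis band: do nothing
}

pub fn warm_policy(warm_bytes: u64, warm_byte_cap: u64) -> WarmAction {
    if warm_bytes > warm_byte_cap {
        WarmAction::Downgrade
    } else if warm_bytes < warm_byte_cap * 8 / 10 {
        WarmAction::Upgrade
    } else {
        WarmAction::Hold // between 80% and 100% of the cap: avoid thrashing
    }
}

fn main() {
    let cap: u64 = 64 * 1024 * 1024; // default warm_byte_cap
    assert_eq!(warm_policy(cap + 1, cap), WarmAction::Downgrade);
    assert_eq!(warm_policy(cap * 9 / 10, cap), WarmAction::Hold);
    assert_eq!(warm_policy(cap / 2, cap), WarmAction::Upgrade);
}
```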

6. Risks and Mitigations

| Risk | Severity | Likelihood | Mitigation |
|------|----------|------------|------------|
| 3-bit two-level scale adds format complexity without sufficient accuracy gain for most distributions | Medium | Medium | Gate behind a cargo feature two-level-cold; default to standard 3-bit. Benchmark on real agent embeddings before enabling by default. |
| Dynamic 7-bit to 5-bit downgrade causes thrashing when the warm set oscillates near the byte cap | Medium | Medium | Implement hysteresis (20% band). Only downgrade when above the cap; only upgrade when below 80% of the cap. Rate-limit downgrades to at most once per minute. |
| pack_bits accumulator overflow for large inputs | Low | Low | The flush loop drains the 64-bit accumulator below 8 bits after every value, so at most bits - 1 = 7 bits remain pending before the next code (at most 8 more bits) is added, far below 64 bits. No overflow is possible. |
| Tier 0 reconstruction from Delta/Factor introduces unbounded latency | Medium | Low | Set a maximum reconstruction depth (default: 3). If the base tensor is also Tier 0, fail with ReconstructionDepthExceeded rather than recursing indefinitely. |
| WASM scalar max_abs is a bottleneck for large tensors | Low | High | Expected. The WASM SIMD feature flag provides a 3-4x improvement. For non-SIMD targets, max_abs cost is small relative to the full quantize pipeline. |
| Block size mismatch between encoder and decoder | High | Low | Block size is stored in the segment header (ADR-017 format). The decoder reads it from the header rather than assuming a default. |

7. References

  1. ADR-017: Temporal Tensor Compression with Tiered Quantization. RuVector Architecture Team, 2026.
  2. ADR-018: Block-Based Storage Engine for Temporal Tensor Segments (forthcoming).
  3. Frantar, E., et al. "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers." ICLR 2023.
  4. Lin, J., et al. "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration." MLSys 2024.
  5. Kim, S., et al. "SqueezeLLM: Dense-and-Sparse Quantization." ICML 2024.
  6. Liu, Z., et al. "KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache." ICML 2024.
  7. Pelkonen, T., et al. "Gorilla: A Fast, Scalable, In-Memory Time Series Database." VLDB 2015.
  8. IEEE 754-2019. "IEEE Standard for Floating-Point Arithmetic."
  9. Lemire, D. and Boytsov, L. "Decoding billions of integers in milliseconds through vectorized bit packing." Software: Practice and Experience, 2015.
  10. WebAssembly SIMD Proposal. https://github.com/WebAssembly/simd. Finalized 2023.