Status: Proposed Date: 2026-02-06 Parent: ADR-001 RuVector Core Architecture, ADR-004 KV Cache Management, ADR-005 WASM Runtime Integration Author: System Architecture Team SDK: Claude-Flow
| Version | Date | Author | Changes |
|---|---|---|---|
| 0.1 | 2026-02-06 | Architecture Team | Initial SOTA research and design proposal |
This ADR introduces a temporal tensor compression system with tiered quantization for RuVector. The system exploits two key observations: (1) tensors accessed at different frequencies can tolerate different precision levels, and (2) quantization parameters (scales) can be amortized across consecutive time frames when the underlying distribution is stable. Together these yield 4x-10.67x compression over f32 while keeping reconstruction error within configurable bounds.
The implementation targets Rust with a zero-dependency WASM-compatible core, matching the sandboxed execution model established in ADR-005.
Memory size and memory bandwidth dominate deployment cost for tensor-heavy workloads. ADR-004 established a three-tier KV cache (FP16 / 4-bit / 2-bit) but addresses only static snapshots of key-value pairs. Modern agent systems (RuVector's primary workload) produce streams of tensor frames - embeddings, activations, gradient sketches, coherence vectors - that evolve over time. Storing each frame independently wastes metadata and misses temporal redundancy.
Memory scaling for agent tensor streams:
| Tensor Dim | Frames/sec | Duration | Raw f32 | 8-bit | 5-bit | 3-bit |
|---|---|---|---|---|---|---|
| 512 | 10 | 1 hour | 73.7 MB | 18.4 MB | 11.5 MB | 6.9 MB |
| 2048 | 10 | 1 hour | 294.9 MB | 73.7 MB | 46.1 MB | 27.6 MB |
| 4096 | 50 | 1 hour | 2.95 GB | 737 MB | 461 MB | 276 MB |
The existing quantization.rs in ruvector-core provides:
| Method | Compression | Limitation |
|---|---|---|
| Scalar (u8) | 4x | Per-vector min/max scales; no temporal reuse |
| Int4 | 8x | Fixed 4-bit; no adaptive tier selection |
| Product | 8-16x | Requires codebook training; high latency |
| Binary | 32x | Too lossy for reconstruction-sensitive paths |
Missing capabilities:
- No temporal scale reuse across frames
- No access-pattern-driven tier selection
- No sub-byte bit packing (5-bit, 7-bit)
- No drift-aware segment boundaries
- No WASM-native compression path
The core insight: when a tensor's value distribution is stable over consecutive frames, the quantization scales computed for frame t remain valid for frames t+1, t+2, ..., t+k. Reusing scales across k frames amortizes the per-group scale overhead by kx and avoids redundant calibration passes.
This is the same principle behind:
- Video codecs (H.264/H.265): I-frames carry full parameters; P-frames reuse them until a scene change
- Time-series databases (Gorilla, InfluxDB): Delta-of-delta encoding reuses a base until drift exceeds a threshold
- Streaming quantization (QuaRot, KIVI): Per-channel parameters reused across tokens until attention pattern shifts
Modern quantization systems converge on per-group symmetric quantization as the optimal accuracy-metadata tradeoff:
| System | Year | Approach | Key Innovation |
|---|---|---|---|
| GPTQ | 2023 | Per-column Hessian-weighted quantization | OBQ with lazy batch updates; group_size=128 standard |
| AWQ | 2024 | Activation-aware weight quantization | Protects salient channels via per-channel scaling |
| SqueezeLLM | 2024 | Non-uniform with sensitivity grouping | Dense-and-sparse decomposition for outliers |
| QuIP# | 2024 | Incoherence via random Hadamard | Enables high-quality 2-bit with lattice codebooks |
| AQLM | 2024 | Additive multi-codebook quantization | 2-bit with learned codebooks; beam search optimization |
| SpinQuant | 2024 | Rotation-based Cayley optimization | Learnable rotation matrices; Llama-2-7B 4-bit = FP16 parity |
| KIVI | 2024 | Per-channel key, per-token value | 2-bit KV cache with <0.1 ppl increase on Llama-2 |
| Atom | 2024 | Mixed-precision with reordering | Handles activation outliers via channel reordering |
Consensus finding: Group sizes of 32-128 provide the best accuracy-metadata tradeoff. Symmetric quantization (no zero-point) is sufficient when distribution is roughly centered, which holds for most intermediate tensors. The scale storage cost is ceil(tensor_len / group_len) * sizeof(scale).
| Bits | Compression vs f32 | Typical Quality Impact | Viable For |
|---|---|---|---|
| 8 | 4.00x | Negligible (<0.01 ppl) | Hot path, full fidelity |
| 7 | 4.57x | Negligible (<0.02 ppl) | Warm path, near-lossless |
| 5 | 6.40x | Minor (0.05-0.1 ppl) | Warm path, acceptable loss |
| 4 | 8.00x | Moderate (0.1-0.3 ppl) | Well-studied; GPTQ/AWQ standard |
| 3 | 10.67x | Significant (0.3-1.0 ppl) | Cold path with bounded error |
| 2 | 16.00x | Large (1.0-3.0 ppl) | Archive only; KIVI/QuIP# needed |
Key finding: 3-bit symmetric quantization is the practical floor for reconstruction-required tensors. Below 3-bit, non-uniform or lattice codebook methods (QuIP#, AQLM) are needed to maintain quality, at much higher complexity.
No widely published system directly addresses temporal reuse of quantization scales for streaming tensor data. The closest analogs are:
- Gorilla (Facebook, 2015): XOR-based delta encoding for time-series floats; reuses a base encoding until delta exceeds threshold
- KIVI token reuse: Per-channel scales for keys are computed once and applied to all tokens in the channel
- QuaRot (2024): Rotation matrices computed once per layer, reused for all tokens
- Streaming quantization in video: DCT coefficients reused across P-frames until I-frame refresh
Our temporal segment approach generalizes these: compute group scales once per segment, emit packed codes for each frame, start a new segment on tier change or drift exceedance.
Standard bitstream packing (accumulator + shift) is the established approach for arbitrary-width codes:
For each code of width B bits:
accumulator |= code << acc_bits
acc_bits += B
while acc_bits >= 8:
emit(accumulator & 0xFF)
accumulator >>= 8
acc_bits -= 8
SIMD acceleration: For fixed widths (3, 5, 7, 8), vectorized pack/unpack can process 16-32 codes per SIMD iteration using shuffles and masks. The bitpacking crate achieves 4-8 GB/s on AVX2 for fixed-width packing. For WASM, the 128-bit SIMD proposal (widely supported since 2023) enables similar throughput.
| Aspect | Status (2026) |
|---|---|
| wasm32-unknown-unknown | Stable, widely deployed |
| WASM SIMD (128-bit) | Supported in all major browsers and runtimes |
| wasm32-wasi | Stable, server-side WASM standard |
| Linear memory model | Single contiguous address space; 32-bit pointers |
#[no_mangle] extern "C" |
Standard FFI pattern for WASM exports |
| Static mut in single-threaded WASM | Sound (no data races possible) but future-fragile |
Relevant Rust WASM tensor libraries: candle (Hugging Face), burn, tract. All demonstrate that high-performance tensor operations are viable in Rust/WASM with careful memory management.
We introduce ruvector-temporal-tensor (with WASM variant ruvector-temporal-tensor-wasm) implementing:
- Groupwise symmetric quantization with f16 scales
- Temporal segments that amortize scales across frames
- Three-tier access-driven bit-width selection (8/7-or-5/3)
- Bitstream packing with no byte-alignment waste
- WASM-compatible FFI with handle-based resource management
+===========================================================================+
| TEMPORAL TENSOR COMPRESSION ARCHITECTURE |
+===========================================================================+
| |
| Input Frame (f32[N]) |
| | |
| v |
| +----------------+ +-----------------+ +--------------------+ |
| | Tier Policy |---->| Segment Manager |---->| Segment Store | |
| | | | | | (Vec<u8> blobs) | |
| | score = count | | - drift check | | | |
| | * 1024 / age | | - scale reuse | | Magic: TQTC | |
| | | | - bit-width sel | | Version: 1 | |
| | Hot: 8-bit | | | | Bits, GroupLen, | |
| | Warm: 7/5-bit | +---------+-------+ | TensorLen, Frames, | |
| | Cold: 3-bit | | | Scales[], Data[] | |
| +----------------+ | +--------------------+ |
| v |
| +----------------------------------------------------------------+ |
| | QUANTIZATION PIPELINE | |
| | | |
| | f32 values | |
| | | | |
| | v | |
| | [Group 0: max_abs -> scale_f16] [Group 1: ...] [Group K: ...] | |
| | | | |
| | v | |
| | q_i = round(v_i / scale) // symmetric, no zero-point | |
| | q_i = clamp(q_i, -qmax, +qmax) | |
| | | | |
| | v | |
| | u_i = q_i + bias // signed -> unsigned mapping | |
| | | | |
| | v | |
| | [Bitstream Packer: B-bit codes, no alignment padding] | |
| +----------------------------------------------------------------+ |
| |
| Decode: bitstream unpack -> unsigned -> signed -> scale multiply |
+===========================================================================+
Offset Size Field Description
------ ------ --------------- -----------------------------------------
0 4 magic 0x43545154 ("TQTC" in LE ASCII)
4 1 version Format version (currently 1)
5 1 bits Bit width for this segment (3, 5, 7, or 8)
6 4 group_len Elements per quantization group
10 4 tensor_len Number of f32 elements per frame
14 4 frame_count Number of frames in this segment
18 4 scale_count Number of f16 group scales
22 2*S scales f16 scale values (S = scale_count)
22+2S 4 data_len Length of packed bitstream in bytes
26+2S D data Packed quantized codes (D = data_len)
Segment size formula:
segment_bytes = 26 + 2*ceil(tensor_len/group_len) + ceil(tensor_len * frame_count * bits / 8)
Score = access_count * 1024 / (now_ts - last_access_ts + 1)
Tier 1 (Hot): score >= hot_min_score -> 8-bit (~4.0x compression)
Tier 2 (Warm): score >= warm_min_score -> 7-bit (~4.57x) or 5-bit (~6.4x)
Tier 3 (Cold): score < warm_min_score -> 3-bit (~10.67x compression)
Default thresholds:
| Parameter | Default | Rationale |
|---|---|---|
hot_min_score |
512 | ~2 accesses/sec for recent data |
warm_min_score |
64 | ~1 access every 16 seconds |
warm_bits |
7 | Conservative warm tier; set to 5 for aggressive |
drift_pct_q8 |
26 | ~10.2% drift tolerance (26/256) |
group_len |
64 | 64 elements per group; 128 bytes of f16 scales per 256 values |
Drift detection: Before appending a frame to the current segment, compute max_abs per group and compare against scale * qmax * drift_factor. If any group exceeds this, flush the current segment and start a new one with recomputed scales. This bounds reconstruction error to drift_factor * original_error.
Effective compression ratios including scale overhead (group_len=64, f16 scales):
| Bits | Raw Ratio | Scale Overhead | Effective Ratio | Effective Ratio (100 frames) |
|---|---|---|---|---|
| 8 | 4.00x | 1 f16 per 64 vals | 3.76x | 3.99x |
| 7 | 4.57x | same | 4.27x | 4.56x |
| 5 | 6.40x | same | 5.82x | 6.38x |
| 3 | 10.67x | same | 9.14x | 10.63x |
Temporal amortization: with 100 frames per segment, scale overhead becomes negligible (~0.03% of segment size).
crates/ruvector-temporal-tensor/
├── Cargo.toml
└── src/
├── lib.rs # Public API, re-exports
├── tier_policy.rs # TierPolicy: score calculation, tier selection
├── f16.rs # Software f32<->f16 conversion (no external deps)
├── bitpack.rs # Bitstream packer/unpacker for arbitrary widths
├── quantizer.rs # Groupwise symmetric quantization + dequantization
├── segment.rs # Segment encode/decode, binary format
├── compressor.rs # TemporalTensorCompressor: drift, segmentation
└── ffi.rs # WASM/C FFI: handle store, extern "C" exports
crates/ruvector-temporal-tensor-wasm/
├── Cargo.toml # wasm32-unknown-unknown target
└── src/
└── lib.rs # Re-exports FFI functions, WASM-specific config
For a group of G values from frame f:
scale = max(|v_i| for i in group) / qmax
qmax = 2^(bits-1) - 1 // e.g., bits=8 -> qmax=127, bits=3 -> qmax=3
q_i = round(v_i / scale)
q_i = clamp(q_i, -qmax, +qmax)
u_i = q_i + qmax // bias to unsigned for packing
Reconstruction:
q_i = u_i - qmax // unbias
v_i' = q_i * scale // dequantize
Why symmetric: No zero-point storage needed. For centered distributions (which agent tensors typically are), symmetric quantization loses minimal accuracy vs asymmetric while halving metadata and simplifying the dequantize multiply.
Why f16 scales: 2 bytes per group vs 4 bytes for f32. For typical tensor magnitudes (1e-3 to 1e3), f16 provides sufficient precision for scales. The f16 dynamic range (6.1e-5 to 65504) covers the relevant scale values. Software f16 conversion is fast (~5ns per conversion) and avoids external crate dependencies.
Frame 0 Frame 1 Frame 2 ... Frame K Frame K+1
┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐
│ f32 │ │ f32 │ │ f32 │ │ f32 │ │ f32 │
└──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘
│ │ │ │ │
v v v v v
┌────────────────────────────────────────────────────────┐ ┌──────────
│ SEGMENT 1 (same scales) │ │ SEGMENT 2
│ │ │ (new
│ scales: [s0, s1, ..., sG] (computed from frame 0) │ │ scales)
│ data: [packed frame 0][packed frame 1]...[frame K] │ │
└────────────────────────────────────────────────────────┘ └──────────
^
|
Drift exceeded OR
tier changed at K+1
Segment boundary triggers:
- First frame (no active segment)
- Tier bit-width changed (e.g., tensor went from hot to warm)
- Any group's
max_abs > scale * qmax * drift_factor
fn frame_fits_current_scales(frame, scales, qmax, drift_factor) -> bool {
for each group (idx, scale) in scales:
max_abs = max(|v| for v in group_slice(frame, idx))
allowed = f16_to_f32(scale) * qmax * drift_factor
if max_abs > allowed:
return false // Distribution has drifted
return true
}The drift_factor is 1 + drift_pct_q8/256. With drift_pct_q8=26, this is 1.1015625 (~10% tolerance). This means a group's maximum absolute value can grow by up to ~10% beyond the original calibration before triggering a new segment.
Tradeoff: Lower drift tolerance = more segment boundaries = more accurate but more metadata. Higher drift tolerance = fewer segments = better compression but more quantization error. The 10% default is conservative; for cold tensors, 20-30% may be acceptable.
The packer uses a 64-bit accumulator for sub-byte codes:
For each quantized unsigned code u of width B bits:
acc |= (u as u64) << acc_bits
acc_bits += B
while acc_bits >= 8:
emit byte: acc & 0xFF
acc >>= 8
acc_bits -= 8
// After all codes: flush remaining bits
if acc_bits > 0:
emit byte: acc & 0xFF
Packing density (no wasted bits):
| Bits | Codes per 8 bytes | Utilization |
|---|---|---|
| 8 | 8 | 100% |
| 7 | 9.14 | 100% (no padding) |
| 5 | 12.8 | 100% (no padding) |
| 3 | 21.33 | 100% (no padding) |
The implementation provides bit-exact IEEE 754 half-precision conversion without external crates:
- f32 -> f16: Extract sign/exponent/mantissa, remap exponent bias (127 -> 15), handle denormals with rounding, infinity, NaN propagation
- f16 -> f32: Reverse the bias remapping, reconstruct denormals, handle special values
Accuracy: Round-to-nearest-even for normals. Denormal handling preserves gradual underflow. The conversion pair is not bit-exact round-trip for all f32 values (f16 has 10 mantissa bits vs f32's 23), but for scale values in the range [1e-4, 1e4], relative error is bounded by 2^-10 (~0.1%).
┌─────────────────────────────────────────────────────┐
│ WASM Linear Memory │
│ │
│ Host allocates via ttc_alloc() │
│ Host writes f32 frames into allocated buffers │
│ Host calls ttc_push_frame(handle, ts, ptr, len, │
│ out_ptr, out_cap, &out_written) │
│ Host reads segment bytes from out_ptr │
│ Host frees via ttc_dealloc() │
│ │
│ ┌──────────────────────────────┐ │
│ │ STORE: Vec<Option<Compressor>>│ │
│ │ [0] = Some(comp_a) │ │
│ │ [1] = None (freed) │ │
│ │ [2] = Some(comp_b) │ │
│ └──────────────────────────────┘ │
└─────────────────────────────────────────────────────┘
FFI function table:
| Function | Purpose | Parameters |
|---|---|---|
ttc_create |
Create compressor | (len, now_ts, &out_handle) |
ttc_free |
Destroy compressor | (handle) |
ttc_touch |
Record access | (handle, now_ts) |
ttc_set_access |
Set access stats | (handle, count, last_ts) |
ttc_push_frame |
Compress a frame | (handle, ts, in_ptr, len, out_ptr, out_cap, &out_written) |
ttc_flush |
Flush current segment | (handle, out_ptr, out_cap, &out_written) |
ttc_decode_segment |
Decompress segment | (seg_ptr, seg_len, out_ptr, out_cap, &out_written) |
ttc_alloc |
Allocate WASM memory | (size, &out_ptr) |
ttc_dealloc |
Free WASM memory | (ptr, cap) |
Handle-based store: Compressors are stored in a global Vec<Option<TemporalTensorCompressor>>. Handles are indices. Freed slots are reused. This pattern is standard for WASM FFI where the host cannot hold Rust references.
ruvector-temporal-tensor
(no external deps - pure Rust, WASM-safe)
ruvector-temporal-tensor-wasm
└── ruvector-temporal-tensor
ruvector-core (future integration)
└── ruvector-temporal-tensor (optional feature)
extends QuantizedVector trait
Compressed segments are stored as byte blobs in AgenticDB, keyed by:
Key: {tensor_id}:{segment_start_ts}:{segment_end_ts}
Value: segment bytes (TQTC format)
Tags: tier={hot|warm|cold}, bits={3|5|7|8}, frames={N}
AgenticDB's HNSW index is not used for segment lookup (segments are accessed by time range, not similarity). Instead, a B-tree or time-range index over segment keys provides O(log N) lookup.
The coherence engine (ADR-014, ADR-015) can trigger segment boundaries via a coherence-gated refresh:
if coherence_score(tensor_id) < coherence_threshold:
compressor.flush() // Force segment boundary
// New segment will recompute scales from fresh data
This ensures that when the coherence engine detects structural disagreement (e.g., between an agent's embedding and the graph's expected embedding), the compression system refreshes its calibration even if drift is still within the numerical threshold.
Each segment can be represented as a node in RuVector's DAG (ADR-016 delta system):
- Edges:
tensor_id -> segment_1 -> segment_2 -> ...(temporal lineage) - Metadata: Which agent/workflow produced the tensor, tier at time of compression
- Provenance: Full reconstruction path from segments back to original f32 data
| Component | Status | Notes |
|---|---|---|
| Groupwise symmetric quant | Correct | qmax = 2^(bits-1) - 1; symmetric range [-qmax, +qmax] |
| f16 conversion | Correct with caveats | Rounding mode is round-half-up (not round-half-even); acceptable for scales |
| Bit-packing | Correct | 64-bit accumulator handles all widths 1-8 without overflow |
| Drift detection | Correct | Per-group max-abs comparison against scaled threshold |
| Segment encode/decode | Correct | Round-trip verified for all tier widths |
| Bias mapping | Correct | bias = qmax; unsigned range is [0, 2*qmax] which fits in bits bits |
| Pattern | Risk | Mitigation |
|---|---|---|
static mut STORE |
UB in multi-threaded context | WASM is single-threaded; safe in practice. Migrate to thread_local! or OnceCell for native targets. |
from_raw_parts in FFI |
UB if host passes invalid pointers | Host is responsible for valid pointers; standard WASM FFI contract. Add debug assertions. |
std::mem::forget in ttc_alloc |
Memory managed by host | Correct pattern; host calls ttc_dealloc to reconstruct and drop the Vec. |
| Null pointer checks | Partial | FFI functions check out_written.is_null() but not all out_ptr. Add null checks. |
Recommended safety improvements for production:
- Replace
static mutwiththread_local!for native target compatibility - Add
#[cfg(debug_assertions)]bounds checks in decode loops - Validate segment magic/version before parsing
- Add
ttc_last_errorfunction for error reporting to host
| Operation | Complexity | Estimated Latency (512-dim tensor) |
|---|---|---|
| Tier selection | O(1) | <10ns |
| Drift check | O(N/G) where G=group_len | ~50ns |
| Scale computation | O(N) | ~100ns |
| Quantize + pack | O(N) | ~200ns |
| Decode + unpack | O(N) | ~200ns |
| f16 conversion | O(1) per scale | ~5ns |
SIMD opportunity: The inner quantize loop (v * inv_scale, round, clamp, pack) is highly vectorizable. With WASM SIMD (128-bit), processing 4 f32s per iteration yields ~4x speedup on the hot loop.
Rejected: The existing QuantizedVector trait assumes single-frame quantization with per-vector scales. Temporal segments require fundamentally different state management (multi-frame, drift-aware). Adding this to ruvector-core would violate single-responsibility and complicate the existing, well-tested code.
Rejected: GPTQ and AWQ are designed for static weight quantization with Hessian-based sensitivity. Our use case is streaming activations/embeddings that change every frame. The calibration cost of GPTQ (~minutes per layer) is prohibitive for real-time streams.
Considered but deferred: XOR-based or arithmetic delta encoding (frame[t] - frame[t-1]) could further compress within a segment. However, this adds complexity and makes random access within a segment O(N) instead of O(1). We may add this as an optional mode in a future version.
Rejected for default: Asymmetric quantization (with zero-point) adds 2 bytes of metadata per group and requires an additional subtraction in the dequantize path. For centered distributions (typical of embeddings and activations), the accuracy improvement is marginal (<0.5% relative error reduction) while the metadata cost is significant at small group sizes.
Rejected: Adding an external dependency for f16 conversion would complicate WASM builds and increase binary size. The software f16 conversion is ~50 lines and has no performance-critical path (scales are converted once per segment, not per frame).
| Tier | Bits | Target Compression (vs f32) | Measurement |
|---|---|---|---|
| Hot | 8 | >= 3.7x (single frame), >= 3.99x (100 frames) | Segment size / raw f32 size |
| Warm (7-bit) | 7 | >= 4.2x (single frame), >= 4.56x (100 frames) | Same |
| Warm (5-bit) | 5 | >= 5.8x (single frame), >= 6.38x (100 frames) | Same |
| Cold | 3 | >= 9.0x (single frame), >= 10.6x (100 frames) | Same |
Primary target: On a representative 1-hour trace, achieve >= 6x reduction for warm tensors and >= 10x for cold tensors in resident bytes.
| Tier | Max Relative Error | Measurement |
|---|---|---|
| Hot (8-bit) | < 0.8% | max( |
| Warm (7-bit) | < 1.6% | Same |
| Warm (5-bit) | < 6.5% | Same |
| Cold (3-bit) | < 30% | Same; bounded error, not bit-exact |
| Metric | Target |
|---|---|
| Quantize latency (512-dim, native) | < 500ns per frame |
| Quantize latency (512-dim, WASM) | < 2us per frame |
| Decode latency (512-dim, native) | < 500ns per frame |
| WASM binary size | < 100KB (release, wasm-opt) |
| Memory overhead per compressor | < 1KB + segment data |
- Round-trip encode/decode produces correct results for all tier widths (3, 5, 7, 8)
- Drift detection correctly triggers segment boundaries
- Tier transitions produce valid segment boundaries
- Multiple compressors can coexist via handle system
- Segment binary format is platform-independent (little-endian)
- WASM FFI functions handle null pointers and size mismatches gracefully
- No external crate dependencies in core library
| Risk | Severity | Likelihood | Mitigation |
|---|---|---|---|
| 3-bit quantization too lossy for some tensor types | High | Medium | Make tier policy configurable; allow per-tensor overrides; add quality monitoring |
| Drift detection false positives cause excessive segments | Medium | Medium | Tune drift_pct_q8; add hysteresis (require N consecutive drifts) |
| f16 scale precision insufficient for very small tensors | Medium | Low | Detect near-zero scales; fall back to f32 scales when f16 underflows |
| WASM performance 3-5x slower than native | Medium | High | Expected; optimize hot loops with WASM SIMD; acceptable for non-realtime paths |
static mut unsound if WASM threading arrives |
Low | Low | Replace with thread_local! or atomic cell before enabling shared memory |
| Segment format not forward-compatible | Medium | Low | Version field enables format evolution; decode rejects unknown versions |
- Typical tensor dimensions: What are the representative dimensions for RuVector agent tensors? (Impacts group_len tuning and SIMD strategy)
- Update frequency: How many frames per second for hot vs warm vs cold tensors? (Impacts segment size expectations)
- Cold tier error tolerance: Is bounded relative error (up to 30% at 3-bit) acceptable, or do some cold tensors need bit-exact reversibility?
- Integration priority: Should AgenticDB integration (segment storage) or coherence engine integration (drift gating) come first?
- SIMD tier: Should the initial implementation include WASM SIMD, or start scalar-only and add SIMD in a follow-up?
- Create
ruvector-temporal-tensorcrate with zero dependencies - Implement
tier_policy.rs,f16.rs,bitpack.rs,quantizer.rs - Implement
segment.rs(encode/decode) andcompressor.rs - Unit tests: round-trip correctness for all bit widths
- Unit tests: drift detection boundary conditions
- Unit tests: segment binary format parsing
- Implement
ffi.rswith handle-based store - Create
ruvector-temporal-tensor-wasmcrate - WASM integration tests via wasm-pack
- Binary size validation (< 100KB target)
- Performance benchmarks (native vs WASM)
- AgenticDB segment storage adapter
- Coherence engine refresh hook
- DAG lineage edges for segments
- End-to-end benchmark on representative trace
- Acceptance test: 6x warm, 10x cold compression
- WASM SIMD for quantize/dequantize hot loops
- Native AVX2/NEON specialization
- Optional delta encoding within segments
- Streaming decode (partial segment access)
- Add to workspace
Cargo.toml
- Frantar, E., et al. "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers." ICLR 2023.
- Lin, J., et al. "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration." MLSys 2024.
- Kim, S., et al. "SqueezeLLM: Dense-and-Sparse Quantization." ICML 2024.
- Chee, J., et al. "QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks." ICML 2024.
- Egiazarian, V., et al. "AQLM: Extreme Compression of Large Language Models via Additive Quantization." ICML 2024.
- Liu, Z., et al. "KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache." ICML 2024.
- Zhao, Y., et al. "Atom: Low-bit Quantization for Efficient and Accurate LLM Serving." MLSys 2024.
- Liu, R., et al. "SpinQuant: LLM Quantization with Learned Rotations." NeurIPS 2024.
- Ma, S., et al. "The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits." arXiv:2402.17764, 2024.
- Pelkonen, T., et al. "Gorilla: A Fast, Scalable, In-Memory Time Series Database." VLDB 2015.
- Ashkboos, S., et al. "QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs." NeurIPS 2024.
For a tensor of dimension N with group size G, bit width B, and F frames per segment:
raw_size = N * 4 * F // f32 bytes per segment
scale_size = ceil(N/G) * 2 // f16 scales (shared across frames)
header_size = 26 // fixed segment header
data_size = ceil(N * F * B / 8) // packed bitstream
segment_size = header_size + scale_size + data_size
compression_ratio = raw_size / segment_size
Example: N=512, G=64, B=3, F=100:
raw_size = 512 * 4 * 100 = 204,800 bytes
scale_size = ceil(512/64) * 2 = 16 bytes
header_size = 26 = 26 bytes
data_size = ceil(512 * 100 * 3/8) = 19,200 bytes
segment_size = 26 + 16 + 19,200 = 19,242 bytes
ratio = 204,800 / 19,242 = 10.64x
| Scenario | access_count | age (ticks) | Score | Tier |
|---|---|---|---|---|
| Actively used | 100 | 10 | 10,240 | Hot (8-bit) |
| Recently used | 50 | 100 | 512 | Hot (8-bit) |
| Moderate use | 10 | 100 | 102 | Warm (7-bit) |
| Infrequent | 5 | 200 | 25 | Cold (3-bit) |
| Stale | 1 | 1000 | 1 | Cold (3-bit) |
For symmetric quantization with bit width B and group scale s:
quantization_step = s / qmax = s / (2^(B-1) - 1)
max_error = quantization_step / 2 // from rounding
relative_error = max_error / s = 1 / (2 * qmax)
| Bits | qmax | Max Relative Error |
|---|---|---|
| 8 | 127 | 0.39% |
| 7 | 63 | 0.79% |
| 5 | 15 | 3.33% |
| 3 | 3 | 16.7% |
Note: These are worst-case per-element errors. RMS error across a group is typically sqrt(1/12) * quantization_step, which is ~0.29x the max error.