
PT-BitNet Quantizer Module Architecture Design

Version: 1.0 Date: 2026-02-03 Status: Design Specification Relates to: ADR-017 (AD-1, AD-5, AD-18, AD-19), DDD Section 3.4/4.2/4.3


Executive Summary

This document specifies the architecture for the PT-BitNet post-training quantizer module that converts FP16/BF16 GLM-4.7-Flash weights to BitNet b1.58 ternary {-1, 0, +1} format via absmean quantization. This is a design-only specification — implementation follows in Phase 0.

Design Scope:

  • Module layout and file organization
  • Complete struct definitions with field types
  • Full function signatures (no implementations)
  • GGUF integration points and format extensions
  • Error handling strategy
  • Testing approach

Out of Scope:

  • Actual implementation code
  • Performance benchmarks
  • Calibration dataset selection

A. Module Layout

Directory Structure

crates/ruvllm/src/
├── bitnet/                          # NEW module
│   ├── mod.rs                       # Module exports and public API
│   ├── quantizer.rs                 # PtBitnetQuantizer + absmean algorithm
│   ├── ternary_tensor.rs            # TernaryTensor value object
│   ├── dequantize.rs                # BITNET_T158 dequantization kernel
│   └── config.rs                    # PtBitnetConfig configuration
│
├── gguf/
│   ├── mod.rs                       # Add pub mod bitnet export
│   ├── quantization.rs              # MODIFIED: Add BITNET_T158 enum variant
│   ├── parser.rs                    # Unchanged (reused as-is)
│   └── ...
│
└── kernels/
    └── matmul.rs                    # Reference for dispatch patterns

Modified Files

src/gguf/quantization.rs

Changes:

  1. Add BITNET_T158 = 30 variant to GgufQuantType enum (after Bf16 = 29)
  2. Update try_from() impl to handle type 30
  3. Update block_size() to return 256 for BITNET_T158
  4. Update type_size() to return 66 for BITNET_T158 (64 bytes packed + 2 bytes FP16 scale)
  5. Update is_quantized() to include BITNET_T158
  6. Update bits_per_weight() to return 2.06 for BITNET_T158
  7. Add new match arm in dequantize_tensor(): BITNET_T158 => dequantize_bitnet_t158(data, output)

Exact enum addition:

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
#[repr(u32)]
pub enum GgufQuantType {
    // ... existing variants 0-29 ...
    /// BitNet b1.58 ternary quantization (2-bit packed + FP16 scale per 256-element block)
    BITNET_T158 = 30,
}
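The storage figures above (66 bytes per block, 2.06 bpw) follow from simple arithmetic. A quick sanity check, not part of the proposed API:

```rust
fn main() {
    // One BITNET_T158 block: 256 weights at 2 bits each, plus one FP16 scale.
    let packed_bytes = 256 * 2 / 8;                         // 64 bytes packed
    let block_bytes = packed_bytes + 2;                     // 66 (matches type_size())
    let bits_per_weight = block_bytes as f64 * 8.0 / 256.0; // 2.0625, rounded to 2.06

    assert_eq!(packed_bytes, 64);
    assert_eq!(block_bytes, 66);
    assert!((bits_per_weight - 2.0625).abs() < 1e-9);
}
```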

B. Struct Definitions

1. PtBitnetConfig (in bitnet/config.rs)

Purpose: Configuration for PT-BitNet quantization process

/// Configuration for PT-BitNet post-training quantization
#[derive(Debug, Clone)]
pub struct PtBitnetConfig {
    /// Block size for absmean scale computation (default: 256)
    pub block_size: usize,

    /// Epsilon for numerical stability in scale computation (default: 1e-8)
    pub epsilon: f32,

    /// Whether to run calibration pass to optimize scale factors
    pub use_calibration: bool,

    /// Number of calibration samples (if use_calibration = true)
    pub calibration_samples: usize,

    /// Maximum sequence length for calibration (default: 2048)
    pub calibration_max_seq_len: usize,

    /// Device for calibration pass ("cpu", "metal", "cuda:0")
    pub calibration_device: String,

    /// Clipping threshold for normalized weights before rounding
    /// (default: 1.0, range typically 0.95-1.05)
    pub clip_threshold: f32,

    /// Sparsity target: if Some, bias rounding toward zero to achieve the target sparsity
    pub target_sparsity: Option<f32>,
}

impl Default for PtBitnetConfig {
    fn default() -> Self {
        Self {
            block_size: 256,
            epsilon: 1e-8,
            use_calibration: false,
            calibration_samples: 1000,
            calibration_max_seq_len: 2048,
            calibration_device: "metal".to_string(),
            clip_threshold: 1.0,
            target_sparsity: None,
        }
    }
}

2. TernaryTensor (in bitnet/ternary_tensor.rs)

Purpose: Immutable value object for packed ternary weights

/// Packed ternary tensor with per-block FP16 scales
#[derive(Debug, Clone)]
pub struct TernaryTensor {
    /// Packed 2-bit ternary values (4 weights per byte)
    /// Encoding: 00 = -1, 01 = 0, 10 = +1, 11 = reserved
    pub packed_data: Vec<u8>,

    /// Per-block FP16 scale factors (absmean values)
    pub scales: Vec<f16>,

    /// Tensor shape [out_features, in_features] or [rows, cols]
    pub shape: [usize; 2],

    /// Block size (always 256 for BitNet b1.58)
    pub block_size: usize,

    /// Total number of weights
    pub num_elements: usize,

    /// Number of blocks
    pub num_blocks: usize,

    /// Measured sparsity (fraction of zero weights)
    pub sparsity: f32,
}

impl TernaryTensor {
    /// Calculate total storage size in bytes
    pub fn storage_size(&self) -> usize;

    /// Get expected packed_data size for validation
    pub fn expected_packed_size(&self) -> usize;

    /// Validate internal consistency
    pub fn validate(&self) -> Result<()>;
}
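The size helpers above might be implemented along these lines (a sketch against the fields defined in this section; free functions stand in for the eventual impl methods):

```rust
/// Sketch of TernaryTensor::expected_packed_size: 4 ternary values per byte.
fn expected_packed_size(num_elements: usize) -> usize {
    (num_elements + 3) / 4
}

/// Sketch of TernaryTensor::storage_size: packed data plus 2 scale bytes per block.
fn storage_size(num_elements: usize, block_size: usize) -> usize {
    let num_blocks = (num_elements + block_size - 1) / block_size;
    expected_packed_size(num_elements) + num_blocks * 2
}

fn main() {
    // One 256-element block: 64 packed bytes + 2 scale bytes = 66.
    assert_eq!(expected_packed_size(256), 64);
    assert_eq!(storage_size(256, 256), 66);
    assert_eq!(storage_size(512, 256), 132); // two blocks
}
```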

3. TernaryBlock (in bitnet/ternary_tensor.rs)

Purpose: Single block of 256 ternary weights with scale

/// A single 256-element block with ternary weights and FP16 scale
#[derive(Debug, Clone)]
pub struct TernaryBlock {
    /// 64 bytes of packed 2-bit values (256 weights × 2 bits ÷ 8 bits/byte)
    pub packed: [u8; 64],

    /// FP16 absmean scale factor
    pub scale: f16,
}

impl TernaryBlock {
    /// Size in bytes when stored in GGUF (64 + 2 = 66)
    pub const STORAGE_SIZE: usize = 66;

    /// Number of elements in a block
    pub const BLOCK_SIZE: usize = 256;
}

4. AbsmeanResult (in bitnet/quantizer.rs)

Purpose: Result of absmean quantization on a single block

/// Result of absmean ternary quantization on a block
#[derive(Debug, Clone)]
pub struct AbsmeanResult {
    /// Ternary values {-1, 0, +1} for each weight in the block
    pub ternary_weights: Vec<i8>,

    /// Computed absmean scale factor (gamma = mean(|W|))
    pub scale: f32,

    /// Measured sparsity (fraction of zeros)
    pub sparsity: f32,

    /// Mean squared error vs original FP16 values (for calibration)
    pub mse: f32,
}

5. QuantizationStats (in bitnet/quantizer.rs)

Purpose: Statistics collected during quantization

/// Statistics from quantizing a single tensor
#[derive(Debug, Clone)]
pub struct QuantizationStats {
    /// Tensor name
    pub name: String,

    /// Mean of all block scales
    pub mean_scale: f32,

    /// Std dev of block scales
    pub std_scale: f32,

    /// Overall sparsity across all blocks
    pub sparsity: f32,

    /// Mean MSE across all blocks
    pub mean_mse: f32,

    /// Number of blocks
    pub num_blocks: usize,
}

C. Function Signatures

Core Quantization Functions (in bitnet/quantizer.rs)

1. Primary Quantization Entry Point

/// Quantize an FP16/F32 tensor to ternary format using absmean quantization
///
/// # Arguments
/// * `tensor` - Input FP16 or F32 tensor data (flat vector)
/// * `shape` - Tensor shape [out_features, in_features]
/// * `config` - Quantization configuration
///
/// # Returns
/// * `TernaryTensor` - Packed ternary representation
/// * `QuantizationStats` - Statistics about the quantization process
///
/// # Errors
/// * `RuvLLMError::Quantization` if tensor size is not divisible by block_size
/// * `RuvLLMError::Quantization` if shape product doesn't match tensor length
pub fn quantize_tensor(
    tensor: &[f32],
    shape: [usize; 2],
    config: &PtBitnetConfig,
) -> Result<(TernaryTensor, QuantizationStats)>;

2. Per-Block Quantization

/// Apply absmean quantization to a single block of weights
///
/// Algorithm:
/// 1. gamma = mean(|block|) + epsilon
/// 2. normalized = block / gamma
/// 3. ternary = round(clamp(normalized, -clip_threshold, +clip_threshold))
/// 4. Map to {-1, 0, +1}
///
/// # Arguments
/// * `block` - Block of FP16/F32 values (length = config.block_size)
/// * `config` - Configuration with epsilon and clip_threshold
///
/// # Returns
/// * `AbsmeanResult` with ternary values, scale, sparsity, MSE
///
/// # Panics
/// * If block.len() != config.block_size
pub fn absmean_ternary(
    block: &[f32],
    config: &PtBitnetConfig,
) -> AbsmeanResult;
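The four algorithm steps in the doc comment can be sketched in a few lines. This is a simplified sketch: `epsilon` and `clip` are passed directly rather than via `PtBitnetConfig`, and the sparsity/MSE bookkeeping of `AbsmeanResult` is omitted:

```rust
/// Sketch of the absmean step for one block.
fn absmean_ternary_sketch(block: &[f32], epsilon: f32, clip: f32) -> (Vec<i8>, f32) {
    // 1. gamma = mean(|block|) + epsilon
    let gamma = block.iter().map(|w| w.abs()).sum::<f32>() / block.len() as f32 + epsilon;
    // 2-4. normalize, clamp to [-clip, +clip], round to {-1, 0, +1}
    let ternary = block
        .iter()
        .map(|w| (w / gamma).clamp(-clip, clip).round() as i8)
        .collect();
    (ternary, gamma)
}

fn main() {
    // 8-element block with mean(|x|) = 1.0
    let block = [2.0f32, -2.0, 1.0, -1.0, 1.0, -1.0, 0.0, 0.0];
    let (ternary, gamma) = absmean_ternary_sketch(&block, 1e-8, 1.0);
    assert_eq!(ternary, vec![1, -1, 1, -1, 1, -1, 0, 0]);
    assert!((gamma - 1.0).abs() < 1e-6);
}
```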

3. Packing Functions

/// Pack ternary {-1, 0, +1} values into 2-bit representation
///
/// Encoding: 00 = -1, 01 = 0, 10 = +1, 11 = reserved (unused)
/// 4 values packed per byte, value i in bits 2i..2i+2: byte = [v3 v2 v1 v0]
/// (v0 occupies the low two bits)
///
/// # Arguments
/// * `values` - Ternary values (must be {-1, 0, +1} only)
///
/// # Returns
/// * Packed bytes (length = ceil(values.len() / 4))
///
/// # Errors
/// * If any value is not in {-1, 0, +1}
pub fn pack_ternary(values: &[i8]) -> Result<Vec<u8>>;

/// Unpack 2-bit representation to ternary {-1, 0, +1} values
///
/// # Arguments
/// * `packed` - Packed 2-bit data
/// * `n` - Number of values to extract
///
/// # Returns
/// * Vector of ternary values (length = n)
pub fn unpack_ternary(packed: &[u8], n: usize) -> Vec<i8>;
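Under the stated encoding (00 = -1, 01 = 0, 10 = +1; value i of each 4-value group in bits 2i..2i+2), the pair of functions might look like this sketch (`_sketch` names and the `String` error are illustrative, not the proposed API):

```rust
/// Sketch of pack_ternary: 4 values per byte, v0 in the low two bits.
fn pack_ternary_sketch(values: &[i8]) -> Result<Vec<u8>, String> {
    let mut out = vec![0u8; (values.len() + 3) / 4];
    for (i, &v) in values.iter().enumerate() {
        if !(-1..=1).contains(&v) {
            return Err(format!("non-ternary value {v} at index {i}"));
        }
        let code = (v + 1) as u8; // -1 → 00, 0 → 01, +1 → 10
        out[i / 4] |= code << ((i % 4) * 2);
    }
    Ok(out)
}

/// Sketch of unpack_ternary: inverse of the mapping above.
fn unpack_ternary_sketch(packed: &[u8], n: usize) -> Vec<i8> {
    (0..n)
        .map(|i| ((packed[i / 4] >> ((i % 4) * 2)) & 0b11) as i8 - 1)
        .collect()
}

fn main() {
    let original = vec![1i8, -1, 0, 1, 0, -1, 1, 0];
    let packed = pack_ternary_sketch(&original).unwrap();
    assert_eq!(packed.len(), 2); // 8 values → 2 bytes
    assert_eq!(unpack_ternary_sketch(&packed, 8), original);
    assert!(pack_ternary_sketch(&[2]).is_err()); // reject non-ternary input
}
```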

4. Calibration (Optional)

/// Run calibration pass to optimize scale factors
///
/// # Arguments
/// * `tensor` - Input FP16 tensor
/// * `shape` - Tensor shape
/// * `config` - Config with calibration settings
/// * `calibration_data` - Sample activations for this layer
///
/// # Returns
/// * Optimized `TernaryTensor` with calibrated scales
///
/// # Note
/// This is optional - if not used, falls back to plain absmean
pub fn quantize_with_calibration(
    tensor: &[f32],
    shape: [usize; 2],
    config: &PtBitnetConfig,
    calibration_data: &[Vec<f32>],
) -> Result<(TernaryTensor, QuantizationStats)>;

Dequantization Functions (in bitnet/dequantize.rs)

/// Dequantize BITNET_T158 tensor to FP32
///
/// # Arguments
/// * `data` - Raw GGUF tensor bytes (packed ternary + scales)
/// * `scales` - Per-block FP16 scales (extracted from data)
/// * `n` - Total number of elements to dequantize
///
/// # Returns
/// * Vec<f32> of dequantized values
///
/// # Format
/// Each block: [64 bytes packed ternary][2 bytes FP16 scale]
pub fn dequantize_bitnet_t158(
    data: &[u8],
    scales: &[f16],
    n: usize,
) -> Vec<f32>;

/// Dequantize a single BITNET_T158 block
///
/// # Arguments
/// * `block_data` - 64 bytes of packed ternary data
/// * `scale` - FP16 scale factor
/// * `output` - Output buffer (must have capacity for 256 elements)
pub fn dequantize_bitnet_t158_block(
    block_data: &[u8; 64],
    scale: f16,
    output: &mut [f32],
);
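The per-block kernel reduces to unpacking each 2-bit code and multiplying by the scale. A sketch, with the FP16 scale assumed already converted to f32 (the real module would use the `half` crate for that step):

```rust
/// Sketch of dequantize_bitnet_t158_block with an f32 scale.
fn dequantize_block_sketch(block_data: &[u8; 64], scale: f32, output: &mut [f32]) {
    for i in 0..256 {
        // value i sits in bits 2i..2i+2 of byte i/4
        let code = (block_data[i / 4] >> ((i % 4) * 2)) & 0b11;
        output[i] = (code as i8 - 1) as f32 * scale; // 00 → -1, 01 → 0, 10 → +1
    }
}

fn main() {
    // A block whose first four weights are [+1, -1, 0, +1]; the rest are 0.
    let mut data = [0b01_01_01_01u8; 64]; // code 01 everywhere = ternary 0
    data[0] = 0b10_01_00_10;              // [+1, -1, 0, +1], v0 in low bits
    let mut out = [0.0f32; 256];
    dequantize_block_sketch(&data, 0.5, &mut out);
    assert_eq!(&out[..4], &[0.5, -0.5, 0.0, 0.5]);
    assert!(out[4..].iter().all(|&x| x == 0.0));
}
```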

Tensor Conversion (in bitnet/ternary_tensor.rs)

impl TernaryTensor {
    /// Convert from packed storage to FP32 (for validation/testing)
    pub fn to_fp32(&self) -> Vec<f32>;

    /// Create from existing GGUF tensor data
    pub fn from_gguf_data(
        data: &[u8],
        shape: [usize; 2],
        block_size: usize,
    ) -> Result<Self>;

    /// Serialize to GGUF tensor bytes
    pub fn to_gguf_data(&self) -> Vec<u8>;
}
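The serialization layout behind to_gguf_data() is simply each block's 64 packed bytes followed by its scale as 2 little-endian bytes. A per-block sketch, where `scale_le` stands in for the f16 bit pattern the `half` crate would produce:

```rust
/// Sketch of the per-block GGUF layout: [64 bytes packed][2 bytes FP16 scale].
fn block_to_gguf_bytes(packed: &[u8; 64], scale_le: [u8; 2]) -> Vec<u8> {
    let mut out = Vec::with_capacity(66);
    out.extend_from_slice(packed);
    out.extend_from_slice(&scale_le);
    out
}

fn main() {
    // 0x3C00 is the f16 encoding of 1.0, written little-endian.
    let bytes = block_to_gguf_bytes(&[0u8; 64], [0x00, 0x3C]);
    assert_eq!(bytes.len(), 66);
    assert_eq!(&bytes[64..], &[0x00, 0x3C]); // scale trails the packed data
}
```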

D. GGUF Integration Points

1. New Quantization Type Variant

File: crates/ruvllm/src/gguf/quantization.rs

Changes to GgufQuantType enum:

#[repr(u32)]
pub enum GgufQuantType {
    // ... existing 0-29 ...

    /// BitNet b1.58 ternary quantization
    /// Block size: 256 elements
    /// Storage: 64 bytes packed (2-bit) + 2 bytes FP16 scale = 66 bytes/block
    /// Bits per weight: 2.06 bpw
    BITNET_T158 = 30,
}

impl GgufQuantType {
    pub fn block_size(&self) -> usize {
        match self {
            // ... existing cases ...
            Self::BITNET_T158 => 256,
        }
    }

    pub fn type_size(&self) -> usize {
        match self {
            // ... existing cases ...
            Self::BITNET_T158 => 66,  // 64 + 2
        }
    }

    pub fn name(&self) -> &'static str {
        match self {
            // ... existing cases ...
            Self::BITNET_T158 => "BITNET_T158",
        }
    }
}

impl TryFrom<u32> for GgufQuantType {
    fn try_from(value: u32) -> Result<Self> {
        match value {
            // ... existing 0-29 ...
            30 => Ok(Self::BITNET_T158),
            _ => Err(/* ... */),
        }
    }
}

2. Dequantization Dispatch

File: crates/ruvllm/src/gguf/quantization.rs

Modification to dequantize_tensor() function:

pub fn dequantize_tensor(
    data: &[u8],
    dtype: GgufQuantType,
    num_elements: usize,
) -> Result<Vec<f32>> {
    let mut output = vec![0.0f32; num_elements];

    match dtype {
        // ... existing cases ...
        GgufQuantType::BITNET_T158 => {
            // Extract scales and packed data
            let num_blocks = (num_elements + 255) / 256;
            let mut scales = Vec::with_capacity(num_blocks);

            for i in 0..num_blocks {
                let block_offset = i * 66;
                let scale_offset = block_offset + 64;
                let scale_bytes = [data[scale_offset], data[scale_offset + 1]];
                scales.push(f16::from_le_bytes(scale_bytes));
            }

            output = crate::bitnet::dequantize::dequantize_bitnet_t158(
                data,
                &scales,
                num_elements,
            );
        }
        _ => {
            return Err(RuvLLMError::Model(format!(
                "Dequantization not implemented for {:?}",
                dtype
            )));
        }
    }

    Ok(output)
}

3. GGUF Metadata Keys

New metadata keys for BitNet models (written during quantization, read during load):

// In quantizer when exporting GGUF
pub const BITNET_METADATA_KEYS: &[(&str, &str)] = &[
    ("craftsman.bitnet.version", "1"),
    ("craftsman.bitnet.weight_encoding", "absmean_ternary"),
    ("craftsman.bitnet.activation_bits", "8"),
    ("craftsman.bitnet.block_size", "256"),
    ("craftsman.bitnet.kernel_hint", "tl1"),  // or "tl2", "i2s"
];

Metadata reading in model loader:

// In backend when loading model
fn detect_bitnet_model(metadata: &HashMap<String, GgufValue>) -> bool {
    metadata.get("craftsman.bitnet.version")
        .and_then(|v| v.as_str())
        .map(|v| v == "1")
        .unwrap_or(false)
}

4. Tensor Info Extension

No changes needed - existing TensorInfo struct in parser.rs already supports:

  • name: String
  • shape: Vec<usize>
  • dtype: GgufQuantType ← Will now include BITNET_T158
  • offset: u64

E. Error Handling Strategy

Error Types

All errors use existing RuvLLMError enum from crates/ruvllm/src/error.rs:

pub enum RuvLLMError {
    // Existing variants...

    // Quantization-specific errors
    Quantization(String),  // Use this variant for all quantization errors
    Model(String),         // For GGUF format issues
    Config(String),        // For invalid configuration
}

Error Scenarios and Handling

| Scenario | Error Type | Recovery Strategy |
|----------|------------|-------------------|
| Tensor size not divisible by block_size | Quantization | Pad last block with zeros |
| Invalid ternary value during packing | Quantization | Fail fast (indicates a bug) |
| GGUF file has wrong BITNET_T158 block size | Model | Fail fast (corrupted file) |
| Calibration device unavailable | Config | Fall back to non-calibrated quantization |
| Out of memory during quantization | (system panic) | Let the Rust OOM handler catch it |
| Shape mismatch in tensor | Quantization | Fail fast (validate before processing) |
| FP16 scale is NaN/Inf | Quantization | Clamp to epsilon |
| Empty tensor / zero elements | Quantization | Skip with a warning |
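Two of the recoverable scenarios above (non-finite scale, ragged last block) could be handled along these lines. A sketch with illustrative names, not the proposed API:

```rust
/// Clamp a non-finite or degenerate scale to epsilon (NaN/Inf scenario).
fn sanitize_scale(scale: f32, epsilon: f32) -> f32 {
    if scale.is_finite() && scale > 0.0 { scale } else { epsilon }
}

/// Zero-pad a tensor so its length is a multiple of block_size.
fn pad_to_block(mut tensor: Vec<f32>, block_size: usize) -> Vec<f32> {
    let rem = tensor.len() % block_size;
    if rem != 0 {
        tensor.resize(tensor.len() + block_size - rem, 0.0);
    }
    tensor
}

fn main() {
    assert_eq!(sanitize_scale(f32::NAN, 1e-8), 1e-8);
    assert_eq!(sanitize_scale(0.5, 1e-8), 0.5);
    assert_eq!(pad_to_block(vec![1.0; 257], 256).len(), 512);
    assert_eq!(pad_to_block(vec![1.0; 256], 256).len(), 256); // already aligned
}
```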

Validation Functions

/// Validate quantization config
pub fn validate_config(config: &PtBitnetConfig) -> Result<()> {
    if config.block_size == 0 || config.block_size % 4 != 0 {
        return Err(RuvLLMError::Config(
            "block_size must be non-zero and divisible by 4".into()
        ));
    }

    if config.epsilon <= 0.0 {
        return Err(RuvLLMError::Config(
            "epsilon must be positive".into()
        ));
    }

    if config.clip_threshold <= 0.0 || config.clip_threshold > 2.0 {
        return Err(RuvLLMError::Config(
            "clip_threshold must be in range (0.0, 2.0]".into()
        ));
    }

    Ok(())
}

/// Validate tensor shape and size
pub fn validate_tensor(
    tensor: &[f32],
    shape: [usize; 2],
    block_size: usize,
) -> Result<()> {
    let expected_size = shape[0] * shape[1];

    if tensor.len() != expected_size {
        return Err(RuvLLMError::Quantization(format!(
            "Tensor length {} doesn't match shape {:?} (expected {})",
            tensor.len(), shape, expected_size
        )));
    }

    if expected_size % block_size != 0 {
        // Could pad, but for simplicity require exact multiple
        return Err(RuvLLMError::Quantization(format!(
            "Tensor size {} is not divisible by block_size {}",
            expected_size, block_size
        )));
    }

    Ok(())
}

F. Testing Strategy

Unit Tests

1. Absmean Quantization Correctness

File: crates/ruvllm/src/bitnet/tests/quantizer_tests.rs

#[test]
fn test_absmean_ternary_basic() {
    // Test that absmean correctly quantizes known values
    let config = PtBitnetConfig::default();

    // Block built from an 8-element pattern with mean(|x|) exactly 1.0,
    // repeated to fill all 256 elements
    let block: Vec<f32> = [2.0, -2.0, 1.0, -1.0, 1.0, -1.0, 0.0, 0.0].repeat(32);

    let result = absmean_ternary(&block, &config);

    // After normalization: 2.0/1.0 = 2.0 → clamp to 1.0 → round to +1
    assert_eq!(result.ternary_weights[0], 1);  // 2.0 → +1
    assert_eq!(result.ternary_weights[1], -1); // -2.0 → -1
    assert_eq!(result.ternary_weights[2], 1);  // 1.0 → +1
    assert_eq!(result.ternary_weights[6], 0);  // 0.0 → 0

    assert!(result.scale > 0.9 && result.scale < 1.1); // gamma ≈ 1.0
}

#[test]
fn test_absmean_all_zeros() {
    let config = PtBitnetConfig::default();
    let block = vec![0.0; 256];

    let result = absmean_ternary(&block, &config);

    // All zeros → scale = epsilon, all ternary = 0
    assert_eq!(result.scale, config.epsilon);
    assert!(result.ternary_weights.iter().all(|&x| x == 0));
    assert_eq!(result.sparsity, 1.0);
}

2. Pack/Unpack Round-Trip

#[test]
fn test_pack_unpack_roundtrip() {
    let original = vec![1i8, -1, 0, 1, 0, -1, 1, 0];

    let packed = pack_ternary(&original).unwrap();
    assert_eq!(packed.len(), 2); // 8 values → 2 bytes

    let unpacked = unpack_ternary(&packed, 8);
    assert_eq!(unpacked, original);
}

#[test]
fn test_pack_invalid_value() {
    let invalid = vec![1i8, 2, 0]; // 2 is not ternary

    let result = pack_ternary(&invalid);
    assert!(result.is_err());
}

3. Tensor Validation

#[test]
fn test_validate_tensor_shape_mismatch() {
    let tensor = vec![1.0; 100];
    let shape = [10, 11]; // 10*11 = 110 ≠ 100

    let result = validate_tensor(&tensor, shape, 256);
    assert!(result.is_err());
}

#[test]
fn test_validate_tensor_block_alignment() {
    let tensor = vec![1.0; 257]; // Not divisible by 256
    let shape = [1, 257];

    let result = validate_tensor(&tensor, shape, 256);
    assert!(result.is_err());
}

Integration Tests

4. Full Quantization Pipeline

#[test]
fn test_quantize_tensor_full_pipeline() {
    let config = PtBitnetConfig::default();

    // Create a 512-element tensor (2 blocks)
    let tensor: Vec<f32> = (0..512).map(|i| (i as f32) / 512.0).collect();
    let shape = [2, 256];

    let (ternary, stats) = quantize_tensor(&tensor, shape, &config).unwrap();

    assert_eq!(ternary.num_blocks, 2);
    assert_eq!(ternary.packed_data.len(), 2 * 64); // 2 blocks × 64 bytes
    assert_eq!(ternary.scales.len(), 2);
    assert_eq!(stats.num_blocks, 2);

    // Verify reconstruction quality
    let reconstructed = ternary.to_fp32();
    assert_eq!(reconstructed.len(), 512);
}

5. GGUF Round-Trip

#[test]
fn test_gguf_serialization_roundtrip() {
    let config = PtBitnetConfig::default();
    let tensor = vec![1.0; 256];
    let shape = [1, 256];

    let (ternary, _) = quantize_tensor(&tensor, shape, &config).unwrap();

    // Serialize to GGUF format
    let gguf_data = ternary.to_gguf_data();
    assert_eq!(gguf_data.len(), 66); // 1 block = 66 bytes

    // Deserialize
    let recovered = TernaryTensor::from_gguf_data(&gguf_data, shape, 256).unwrap();

    assert_eq!(recovered.packed_data, ternary.packed_data);
    assert_eq!(recovered.scales, ternary.scales);
}

Benchmark Tests

6. Performance Regression

#[bench]
fn bench_absmean_ternary_256(b: &mut Bencher) {
    let config = PtBitnetConfig::default();
    let block: Vec<f32> = (0..256).map(|i| (i as f32) / 256.0).collect();

    b.iter(|| {
        let _ = absmean_ternary(&block, &config);
    });
}

#[bench]
fn bench_pack_ternary_1024(b: &mut Bencher) {
    let values = vec![1i8; 1024];

    b.iter(|| {
        let _ = pack_ternary(&values);
    });
}

Correctness Validation Tests

7. Bit-Exact Validation Against Reference

#[test]
fn test_dequantize_matches_reference() {
    // Reference implementation (naive)
    fn reference_dequant(ternary: &[i8], scale: f32) -> Vec<f32> {
        ternary.iter().map(|&t| (t as f32) * scale).collect()
    }

    let config = PtBitnetConfig::default();
    let mut tensor_256 = vec![1.5f32, -2.3, 0.1, -0.4];
    tensor_256.resize(256, 0.0); // zero-pad to a full 256-element block
    let shape = [1, 256];

    let (ternary, _) = quantize_tensor(&tensor_256, shape, &config).unwrap();

    // Unpack and dequantize
    let unpacked = unpack_ternary(&ternary.packed_data, 256);
    let reference = reference_dequant(&unpacked, ternary.scales[0].to_f32());
    let optimized = ternary.to_fp32();

    // Allow small floating-point error
    for (r, o) in reference.iter().zip(optimized.iter()) {
        assert!((r - o).abs() < 1e-5);
    }
}

Test Organization

crates/ruvllm/src/bitnet/tests/
├── quantizer_tests.rs     # absmean, pack/unpack
├── tensor_tests.rs        # TernaryTensor validation
├── dequantize_tests.rs    # BITNET_T158 dequant
├── integration_tests.rs   # Full pipeline, GGUF round-trip
└── benches.rs             # Performance benchmarks

G. Implementation Phases

Phase 0.1: Core Data Structures (~2-3 days)

  1. bitnet/mod.rs - module structure
  2. bitnet/config.rs - PtBitnetConfig
  3. bitnet/ternary_tensor.rs - TernaryTensor, TernaryBlock
  4. Unit tests for validation

Phase 0.2: Quantization Algorithm (~3-4 days)

  1. bitnet/quantizer.rs - absmean_ternary()
  2. Pack/unpack functions
  3. quantize_tensor() main entry point
  4. Unit tests for correctness

Phase 0.3: Dequantization (~2 days)

  1. bitnet/dequantize.rs - block and tensor dequant
  2. Integration with existing quantization.rs
  3. Round-trip tests

Phase 0.4: GGUF Integration (~2-3 days)

  1. Modify gguf/quantization.rs - add BITNET_T158 enum variant
  2. Add metadata keys
  3. GGUF serialization/deserialization
  4. Integration tests

Phase 0.5: Validation & Benchmarks (~2 days)

  1. Full pipeline integration tests
  2. Performance benchmarks
  3. Bit-exact validation
  4. Documentation

Total Estimated Effort: ~11-14 days (sum of the phase estimates above) for a clean, well-tested implementation


H. Open Design Questions

| # | Question | Impact | Recommendation |
|---|----------|--------|----------------|
| 1 | Use IQ1_S (type 19) or new BITNET_T158 (type 30)? | Compatibility | New type 30: cleaner separation, avoids confusion with IQ1_S's codebook format |
| 2 | Padding strategy for last block if not aligned? | Correctness | Zero-pad: simplest, matches BitNet spec |
| 3 | Should calibration be mandatory or optional? | Quality vs speed | Optional: Phase 0 can work without it; add later if needed |
| 4 | F16 or F32 for internal scale computation? | Precision | F32 internally, store as F16: extra precision during compute |
| 5 | Handle NaN/Inf in input tensors? | Robustness | Fail fast: corrupted weights should not be silently ignored |
| 6 | Support block sizes other than 256? | Flexibility | No: BitNet spec is 256, simplifies code |
| 7 | Multi-threading for per-block quantization? | Performance | Not in Phase 0; can add via rayon later |
| 8 | Store sparsity per-block in GGUF? | Kernel optimization | No: compute on the fly during dequant, saves space |

I. Dependencies and Prerequisites

Existing RuvLLM Components (Reused)

  • crates/ruvllm/src/error.rs - RuvLLMError enum
  • crates/ruvllm/src/gguf/parser.rs - GGUF parsing (unchanged)
  • crates/ruvllm/src/gguf/quantization.rs - Enum + dispatch (modified)
  • half crate - FP16 support (already in Cargo.toml)

New External Dependencies

None - uses only existing dependencies

Minimum Rust Version

Same as RuvLLM (likely 1.70+)


J. Non-Goals (Out of Scope)

  1. Calibration implementation - Deferred to future phase
  2. TL1/TL2 kernel implementation - Separate ADR/DDD
  3. Model loader integration - Separate backend implementation
  4. Performance optimization - Phase 0 is correctness-first
  5. WASM support - Desktop/server only for Phase 0
  6. Dynamic quantization - Only post-training static
  7. Mixed-precision strategies - All-or-nothing ternary for Phase 0

K. Success Criteria

This design is complete when:

  1. All struct definitions have complete field specifications
  2. All function signatures are documented with arguments, returns, errors
  3. Module organization is clear and follows Rust conventions
  4. GGUF integration points are precisely specified
  5. Error handling covers all failure modes
  6. Test plan covers correctness, integration, and performance
  7. Implementation phases are realistic and sequenced
  8. Open questions are documented with recommendations

Implementation is successful when:

  1. All unit tests pass
  2. Round-trip GGUF serialization is bit-exact
  3. Dequantization produces correct FP32 output
  4. Integration with existing GGUF pipeline works
  5. Quantization of GLM-4.7-Flash completes without errors
  6. Exported GGUF file is loadable by model loader

Appendix A: Code Size Estimates

| File | Estimated Lines | Complexity |
|------|-----------------|------------|
| bitnet/mod.rs | ~50 | Low |
| bitnet/config.rs | ~80 | Low |
| bitnet/ternary_tensor.rs | ~200 | Medium |
| bitnet/quantizer.rs | ~350 | High |
| bitnet/dequantize.rs | ~150 | Medium |
| gguf/quantization.rs (changes) | ~100 | Low |
| Tests | ~800 | Medium |
| **Total** | **~1,730** | |

Comparison to ADR-018 estimate: ~200-300 lines core quantizer → Actual ~350 lines (reasonable given struct overhead)


Appendix B: Memory Layout Examples

TernaryBlock Storage (66 bytes)

Byte Offset | Content
------------|--------
0-63        | Packed 2-bit ternary (256 values)
64-65       | FP16 scale (little-endian)

2-Bit Packing Example

Values (v0..v3): [+1, -1, 0, +1]
Encodings:       [10, 00, 01, 10]
Packed byte (v3 v2 v1 v0, v0 in low bits): 10_01_00_10 = 0x92

GGUF Tensor Data Layout

[TensorInfo] (in header)
  name: "model.layers.0.mlp.gate_proj.weight"
  shape: [4096, 11008]
  dtype: BITNET_T158 (30)
  offset: 0x1000

[Tensor Data] (at offset 0x1000)
  Block 0: [64 bytes packed][2 bytes scale]
  Block 1: [64 bytes packed][2 bytes scale]
  ...
  Block N: [64 bytes packed][2 bytes scale]
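For the example tensor above, the BITNET_T158 data size works out as follows (a sketch of the arithmetic; any GGUF alignment padding is ignored):

```rust
fn main() {
    let (rows, cols) = (4096usize, 11008);
    let num_elements = rows * cols;       // 45,088,768 weights
    let num_blocks = num_elements / 256;  // 176,128 blocks (shape divides evenly)
    let bitnet_bytes = num_blocks * 66;   // 66 bytes per block
    let fp16_bytes = num_elements * 2;    // same tensor stored as FP16

    assert_eq!(num_blocks, 176_128);
    assert_eq!(bitnet_bytes, 11_624_448); // roughly 11.1 MiB
    // About 7.76x smaller than FP16 (16 bits / 2.0625 bits per weight).
    assert!((fp16_bytes as f64 / bitnet_bytes as f64 - 7.76).abs() < 0.01);
}
```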

End of Design Document