
RuvLLM Optimization Guide (v2.0.0)

This guide covers performance optimization strategies for RuvLLM, including SONA learning loops, batch sizing, KV cache management, and hardware-specific tuning.

v2.0.0 Performance Highlights

| Feature | Improvement | Notes |
| --- | --- | --- |
| Multi-threaded GEMM | 12.7x speedup | Rayon on M4 Pro 10-core |
| Flash Attention 2 | +10% throughput | Auto block sizing |
| Quantized Inference | 4-8x memory reduction | INT8/INT4/Q4_K |
| Metal GPU | 3x speedup | simdgroup_matrix |
| Memory Pool | Zero-alloc | Arena allocator |

Performance Overview

Key Metrics

| Metric | Target (M4 Pro) | Achieved (v2.0.0) | Description |
| --- | --- | --- | --- |
| Prefill | >2000 tok/s | 3500 tok/s | Processing input tokens |
| Decode | >80 tok/s | 120 tok/s | Generating output tokens |
| TTFT | <50ms | 35ms | Time to first token |
| Memory | <8GB for 7B | 3.4GB (Q4K) | Peak memory usage |
| MicroLoRA | <1ms | 8.56us | Per-request adaptation |

Architecture Impact

┌─────────────────────────────────────────────────────────┐
│                    Optimization Layers                   │
├─────────────────────────────────────────────────────────┤
│  SONA Learning      │ Real-time adaptation, routing     │
├─────────────────────────────────────────────────────────┤
│  Attention          │ Flash, Paged, GQA - 2-4x speedup │
├─────────────────────────────────────────────────────────┤
│  KV Cache           │ Two-tier, quantized - 4x memory  │
├─────────────────────────────────────────────────────────┤
│  Quantization       │ Q4K, Q8 - 4-8x smaller           │
├─────────────────────────────────────────────────────────┤
│  SIMD/GPU           │ NEON, Metal - hardware accel     │
└─────────────────────────────────────────────────────────┘

SONA Learning Optimization

Instant Loop Tuning

The instant loop runs per-request with <1ms target latency.

let config = SonaLlmConfig {
    // Learning rate for instant updates
    // Higher = faster adaptation, more variance
    // Lower = slower adaptation, more stable
    instant_lr: 0.01,

    // Quality threshold - skip low-quality samples
    training: TrainingConfig {
        quality_threshold: 0.5,  // 0.0-1.0
        ..Default::default()
    },
    ..Default::default()
};

Tuning Guidelines:

| Use Case | instant_lr | quality_threshold |
| --- | --- | --- |
| High variance tasks | 0.005 | 0.7 |
| Stable domains | 0.02 | 0.3 |
| User personalization | 0.01 | 0.5 |
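The table above can be expressed as a small selection helper. This is an illustrative sketch only: `UseCase` and `InstantTuning` are hypothetical names introduced here, not part of the RuvLLM API.

```rust
// Hypothetical helper mapping the tuning table to config values.
#[derive(Debug, Clone, Copy)]
enum UseCase {
    HighVariance,
    StableDomain,
    Personalization,
}

struct InstantTuning {
    instant_lr: f64,
    quality_threshold: f64,
}

fn tuning_for(use_case: UseCase) -> InstantTuning {
    match use_case {
        // High variance: learn slowly, keep only high-quality samples
        UseCase::HighVariance => InstantTuning { instant_lr: 0.005, quality_threshold: 0.7 },
        // Stable domains: adapt faster, accept more samples
        UseCase::StableDomain => InstantTuning { instant_lr: 0.02, quality_threshold: 0.3 },
        // Personalization: the defaults shown earlier
        UseCase::Personalization => InstantTuning { instant_lr: 0.01, quality_threshold: 0.5 },
    }
}
```

The returned values would be copied into `SonaLlmConfig` as shown in the snippet earlier in this section.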

Background Loop Tuning

Consolidates patterns without blocking inference.

let config = SonaLlmConfig {
    // How often to run (milliseconds)
    background_interval_ms: 100,

    // Minimum samples before consolidation
    background_min_samples: 10,

    // Maximum pending (triggers forced consolidation)
    max_pending_samples: 1000,

    // Consolidation strategy
    consolidation_strategy: ConsolidationStrategy::EwcMerge,
    ..Default::default()
};

Tuning Guidelines:

| Priority | interval_ms | min_samples | Strategy |
| --- | --- | --- | --- |
| Latency | 200 | 20 | Average |
| Quality | 50 | 5 | EwcMerge |
| Memory | 100 | 50 | BestOnly |

Deep Loop Optimization

Triggered periodically for full optimization.

let config = SonaLlmConfig {
    // Accumulated quality threshold to trigger
    deep_trigger_threshold: 100.0,
    ..Default::default()
};

// Manual trigger for scheduled optimization
if sona.should_trigger_deep() || is_scheduled_time() {
    let samples = collect_high_quality_samples();
    let result = sona.deep_optimize(&samples);

    // Log improvement
    println!("Deep optimization: quality delta = {:.3}", result.quality_delta);
}

Batch Size Optimization

Dynamic Batching

// Optimal batch sizes vary by operation
struct BatchConfig {
    prefill_batch: usize,   // Process multiple prompts together
    decode_batch: usize,    // Parallel token generation
    lora_batch: usize,      // LoRA adaptation batch
}

impl BatchConfig {
    fn for_memory(available_gb: f32) -> Self {
        match available_gb {
            x if x < 8.0 => Self {
                prefill_batch: 1,
                decode_batch: 4,
                lora_batch: 16,
            },
            x if x < 16.0 => Self {
                prefill_batch: 2,
                decode_batch: 8,
                lora_batch: 32,
            },
            _ => Self {
                prefill_batch: 4,
                decode_batch: 16,
                lora_batch: 64,
            },
        }
    }
}

Batch Size Impact

| Batch Size | Throughput | Latency | Memory |
| --- | --- | --- | --- |
| 1 | Low | Lowest | Lowest |
| 4 | Medium | Low | Medium |
| 8 | High | Medium | High |
| 16+ | Highest | Higher | Highest |

Rule of thumb: Increase batch size until memory pressure or latency constraints are hit.
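The rule of thumb can be sketched as a search for the largest batch that fits a memory budget. This is a sketch under stated assumptions: `mb_per_sequence` (estimated memory cost per batched sequence) and `latency_cap` (largest batch acceptable for latency) are illustrative parameters, not values measured from RuvLLM.

```rust
// Grow the batch size while the estimated memory cost stays under
// budget and the latency cap is not exceeded.
fn max_batch_size(budget_mb: f64, mb_per_sequence: f64, latency_cap: usize) -> usize {
    let mut batch = 1;
    while (batch + 1) as f64 * mb_per_sequence <= budget_mb && batch + 1 <= latency_cap {
        batch += 1;
    }
    batch
}
```

With a 1 GB budget and ~100 MB per sequence, this yields a batch of 10; tightening the latency cap to 4 caps the batch at 4.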

KV Cache Optimization

Two-Tier Configuration

let config = KvCacheConfig {
    // Tokens in high-precision tail
    // More = better attention quality for recent context
    // Less = less memory usage
    tail_length: 256,

    // Tail precision (FP16 recommended)
    tail_precision: Precision::FP16,

    // Store precision (Q4 for 4x compression)
    store_precision: Precision::Q4,

    // Maximum context length
    max_tokens: 4096,

    // KV heads (depends on model architecture)
    num_kv_heads: 8,
    head_dim: 128,

    // Batch size for migration (affects latency spikes)
    migration_batch: 64,
};

Memory Calculation

KV Cache Memory = num_layers * 2 * max_tokens * num_kv_heads * head_dim * bytes_per_element

Example (Qwen2.5-7B with 4096 context):
- Layers: 32
- KV heads: 8
- Head dim: 128
- FP16 tail (256 tokens): 32 * 2 * 256 * 8 * 128 * 2 = 33.5 MB
- Q4 store (3840 tokens): 32 * 2 * 3840 * 8 * 128 * 0.5 = 125.8 MB
- Total: ~160 MB (vs ~537 MB for full FP16 at 4096 tokens)
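The formula above restated as a runnable helper (bytes per element: 2.0 for FP16, 1.0 for Q8, 0.5 for Q4):

```rust
// KV cache footprint in bytes: keys and values (hence the factor of 2)
// for every layer, token, KV head, and head dimension.
fn kv_cache_bytes(
    num_layers: u64,
    tokens: u64,
    num_kv_heads: u64,
    head_dim: u64,
    bytes_per_element: f64,
) -> f64 {
    (num_layers * 2 * tokens * num_kv_heads * head_dim) as f64 * bytes_per_element
}
```

Plugging in the Qwen2.5-7B numbers reproduces the figures above: the 256-token FP16 tail comes to 33,554,432 bytes (~33.5 MB) and the 3840-token Q4 store to 125,829,120 bytes (~125.8 MB).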

Cache Strategies by Use Case

| Use Case | tail_length | store_precision | max_tokens |
| --- | --- | --- | --- |
| Chat (short) | 128 | Q8 | 2048 |
| Chat (long) | 256 | Q4 | 8192 |
| Document QA | 512 | Q4 | 16384 |
| Code completion | 128 | Q8 | 4096 |

Attention Optimization

Grouped-Query Attention (GQA)

let config = AttentionConfig {
    num_heads: 32,      // Query heads
    num_kv_heads: 8,    // KV heads (4:1 ratio)
    head_dim: 128,
    causal: true,
    ..Default::default()
};

// GQA ratio determines memory savings
// 4:1 = ~4x KV cache reduction
// 8:1 = ~8x KV cache reduction
assert_eq!(config.gqa_ratio(), 4);
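The ratio arithmetic can be checked with a tiny free function (a sketch, independent of the `AttentionConfig::gqa_ratio` method shown above):

```rust
// Query heads per KV head; this ratio is also the KV cache reduction
// factor relative to full multi-head attention.
fn gqa_ratio(num_heads: usize, num_kv_heads: usize) -> usize {
    assert!(
        num_heads % num_kv_heads == 0,
        "query heads must be a multiple of KV heads"
    );
    num_heads / num_kv_heads
}
```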

Flash Attention Optimization

// Flash Attention is memory-efficient but has setup overhead
// Best for: longer sequences (>256 tokens)

// For short sequences, standard attention may be faster
let use_flash = sequence_length > 256;

if use_flash {
    let output = flash_attention_neon(&query, &key, &value, scale, causal);
} else {
    let output = standard_attention(&query, &key, &value, scale, causal);
}

Paged Attention for Inference

// Paged attention enables non-contiguous KV cache
// Best for: long-running inference with variable context

let mut cache = PagedKvCache::new(
    16,     // block_size: tokens per block
    8,      // num_kv_heads
    128,    // head_dim
);

// Append incrementally
for token in tokens {
    let (k, v) = compute_kv(token)?;
    cache.append(&k, &v);
}

// Efficient attention over paged cache
let output = paged_attention_neon(&query, &cache, &block_tables, scale);

Quantization Optimization

Model Quantization

| Precision | Bytes per weight | Quality | Speed |
| --- | --- | --- | --- |
| FP32 | 4 | Best | Slowest |
| FP16 | 2 | Excellent | Fast |
| Q8 | 1 | Very Good | Faster |
| Q4K | 0.5 | Good | Fastest |
| Q4 | 0.5 | Acceptable | Fastest |
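As a rough sanity check of the table, weight memory scales linearly with bytes per weight. This sketch ignores the small block-scale overhead that K-quants add, so treat the result as an approximation:

```rust
// Approximate weight footprint in GB: billions of parameters times
// bytes per weight (1e9 params * bytes / 1e9 bytes-per-GB cancels out).
fn approx_weights_gb(params_billions: f64, bytes_per_weight: f64) -> f64 {
    params_billions * bytes_per_weight
}
```

For a 7B model this gives ~3.5 GB at 4-bit and ~14 GB at FP16, consistent with the 3.4 GB (Q4K) peak-memory figure in the metrics table.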

Recommendations:

// High quality (16GB+ RAM)
let config = ModelConfig {
    quantization: Precision::Q8,
    ..Default::default()
};

// Balanced (8-16GB RAM)
let config = ModelConfig {
    quantization: Precision::Q4K,  // K-quant preserves quality
    ..Default::default()
};

// Memory constrained (<8GB RAM)
let config = ModelConfig {
    quantization: Precision::Q4,
    ..Default::default()
};

KV Cache Quantization

// Hybrid quantization: recent tokens in high precision
let config = KvCacheConfig {
    tail_length: 256,           // Recent: FP16
    tail_precision: Precision::FP16,
    store_precision: Precision::Q4,  // Older: Q4
    ..Default::default()
};

// Quality impact by position
// Position 0-256 (tail): Full quality
// Position 256+: ~95% quality with Q4

Hardware-Specific Optimization

Apple Silicon (M1/M2/M3/M4)

// Metal backend for GPU acceleration
let backend = CandleBackend::with_device(DeviceType::Metal)?;

// Optimize for unified memory
let config = ModelConfig {
    // Unified memory = larger KV cache possible
    kv_cache_config: KvCacheConfig {
        max_tokens: 8192,  // Can be larger on M-series
        ..Default::default()
    },
    ..Default::default()
};

M4 Pro Specific:

  • Use metal feature for GPU acceleration
  • NEON SIMD enabled by default
  • Leverage unified memory for larger context

NVIDIA GPUs

// CUDA backend
let backend = CandleBackend::with_device(DeviceType::Cuda(0))?;

// Optimize for separate VRAM
let config = ModelConfig {
    kv_cache_config: KvCacheConfig {
        // Conservative: VRAM is limited
        max_tokens: 4096,
        ..Default::default()
    },
    ..Default::default()
};

CPU Fallback

// CPU with SIMD optimization
let backend = CandleBackend::with_device(DeviceType::Cpu)?;

// Reduce memory pressure
let config = ModelConfig {
    quantization: Precision::Q4,
    kv_cache_config: KvCacheConfig {
        tail_length: 128,
        max_tokens: 2048,
        ..Default::default()
    },
    ..Default::default()
};

Real-Time Optimization

Adaptive Optimization

use ruvllm::optimization::{RealTimeOptimizer, OptimizerConfig};

let optimizer = RealTimeOptimizer::new(OptimizerConfig {
    target_latency_ms: 100.0,
    min_throughput: 50.0,  // tokens/sec
    memory_threshold: 0.9,  // 90% of available
});

// Optimizer adjusts parameters in real time
loop {
    let metrics = backend.get_metrics();
    let adjustments = optimizer.recommend(&metrics);

    if adjustments.reduce_batch_size {
        // Never shrink below a batch of 1
        config.batch_size = config.batch_size.saturating_sub(1).max(1);
    }
    if adjustments.increase_quantization {
        config.kv_cache_config.store_precision = Precision::Q4;
    }
}

Latency Monitoring

// Track latency components
struct LatencyBreakdown {
    tokenization_us: u64,
    prefill_us: u64,
    decode_us: u64,
    sampling_us: u64,
    lora_us: u64,
}

impl LatencyBreakdown {
    fn total_ms(&self) -> f64 {
        (self.tokenization_us + self.prefill_us +
         self.decode_us + self.sampling_us + self.lora_us) as f64 / 1000.0
    }

    fn bottleneck(&self) -> &str {
        let max = [
            (self.tokenization_us, "tokenization"),
            (self.prefill_us, "prefill"),
            (self.decode_us, "decode"),
            (self.sampling_us, "sampling"),
            (self.lora_us, "lora"),
        ].into_iter().max_by_key(|(v, _)| *v).unwrap();
        max.1
    }
}

Benchmarking

Running Benchmarks

# All benchmarks
cargo bench

# Specific benchmarks
cargo bench --bench attention_bench
cargo bench --bench lora_bench
cargo bench --bench e2e_bench

# With specific features
cargo bench --features metal
cargo bench --features cuda

Custom Benchmarks

use criterion::{black_box, criterion_group, criterion_main, Criterion};
use ruvllm::kernels::attention::flash_attention_neon;

fn bench_attention(c: &mut Criterion) {
    let query = vec![0.1f32; 128];
    let key = vec![0.1f32; 512 * 128];
    let value = vec![0.1f32; 512 * 128];
    let scale = 1.0 / 128.0_f32.sqrt();

    c.bench_function("flash_attention_512", |b| {
        b.iter(|| {
            flash_attention_neon(
                black_box(&query),
                black_box(&key),
                black_box(&value),
                scale,
                true,
            )
        })
    });
}

criterion_group!(benches, bench_attention);
criterion_main!(benches);

Optimization Checklist

Before Deployment

  • Choose appropriate quantization (Q4K for most cases)
  • Configure KV cache for expected context length
  • Enable GQA if model supports it
  • Set appropriate batch sizes for memory
  • Configure SONA learning rates
  • Test with representative workloads

Monitoring

  • Track prefill and decode throughput
  • Monitor memory usage over time
  • Log KV cache hit rates
  • Track SONA learning metrics
  • Alert on latency spikes

Troubleshooting

| Symptom | Likely Cause | Solution |
| --- | --- | --- |
| High latency | Batch too large | Reduce batch size |
| OOM errors | KV cache too large | Reduce max_tokens or use Q4 |
| Quality degradation | Over-quantization | Use Q8 instead of Q4 |
| Slow adaptation | Learning rate too low | Increase instant_lr |
| Forgetting | EWC lambda too low | Increase ewc_lambda |