
RuvLLM Optimization Guide (v2.0.0)

This guide covers performance optimization strategies for RuvLLM, including SONA learning loops, batch sizing, KV cache management, and hardware-specific tuning.

v2.0.0 Performance Highlights

| Feature | Improvement | Notes |
| --- | --- | --- |
| Multi-threaded GEMM | 12.7x speedup | Rayon on M4 Pro 10-core |
| Flash Attention 2 | +10% throughput | Auto block sizing |
| Quantized Inference | 4-8x memory reduction | INT8/INT4/Q4_K |
| Metal GPU | 3x speedup | simdgroup_matrix |
| Memory Pool | Zero-alloc | Arena allocator |

Performance Overview

Key Metrics

| Metric | Target (M4 Pro) | Achieved (v2.0.0) | Description |
| --- | --- | --- | --- |
| Prefill | >2000 tok/s | 3500 tok/s | Processing input tokens |
| Decode | >80 tok/s | 120 tok/s | Generating output tokens |
| TTFT | <50ms | 35ms | Time to first token |
| Memory | <8GB for 7B | 3.4GB (Q4K) | Peak memory usage |
| MicroLoRA | <1ms | 8.56us | Per-request adaptation |

Architecture Impact

┌─────────────────────────────────────────────────────────┐
│                    Optimization Layers                   │
├─────────────────────────────────────────────────────────┤
│  SONA Learning      │ Real-time adaptation, routing     │
├─────────────────────────────────────────────────────────┤
│  Attention          │ Flash, Paged, GQA - 2-4x speedup │
├─────────────────────────────────────────────────────────┤
│  KV Cache           │ Two-tier, quantized - 4x memory  │
├─────────────────────────────────────────────────────────┤
│  Quantization       │ Q4K, Q8 - 4-8x smaller           │
├─────────────────────────────────────────────────────────┤
│  SIMD/GPU           │ NEON, Metal - hardware accel     │
└─────────────────────────────────────────────────────────┘

SONA Learning Optimization

Instant Loop Tuning

The instant loop runs per-request with <1ms target latency.

let config = SonaLlmConfig {
    // Learning rate for instant updates
    // Higher = faster adaptation, more variance
    // Lower = slower adaptation, more stable
    instant_lr: 0.01,

    // Quality threshold - skip low-quality samples
    training: TrainingConfig {
        quality_threshold: 0.5,  // 0.0-1.0
        ..Default::default()
    },
    ..Default::default()
};

Tuning Guidelines:

| Use Case | instant_lr | quality_threshold |
| --- | --- | --- |
| High variance tasks | 0.005 | 0.7 |
| Stable domains | 0.02 | 0.3 |
| User personalization | 0.01 | 0.5 |
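The table above can be expressed as a small selection helper. This is an illustrative sketch only: `UseCase` and `InstantTuning` are hypothetical names introduced here, not part of the RuvLLM API.

```rust
// Hypothetical helper mapping the tuning table to config values.
#[derive(Debug, Clone, Copy)]
enum UseCase {
    HighVariance,
    StableDomain,
    Personalization,
}

struct InstantTuning {
    instant_lr: f64,
    quality_threshold: f64,
}

fn tuning_for(use_case: UseCase) -> InstantTuning {
    match use_case {
        // High variance: learn slowly, keep only high-quality samples
        UseCase::HighVariance => InstantTuning { instant_lr: 0.005, quality_threshold: 0.7 },
        // Stable domains: adapt faster, accept more samples
        UseCase::StableDomain => InstantTuning { instant_lr: 0.02, quality_threshold: 0.3 },
        // Personalization: the defaults shown earlier
        UseCase::Personalization => InstantTuning { instant_lr: 0.01, quality_threshold: 0.5 },
    }
}
```

The returned values would be copied into `SonaLlmConfig` as shown in the snippet earlier in this section.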

Background Loop Tuning

Consolidates patterns without blocking inference.

let config = SonaLlmConfig {
    // How often to run (milliseconds)
    background_interval_ms: 100,

    // Minimum samples before consolidation
    background_min_samples: 10,

    // Maximum pending (triggers forced consolidation)
    max_pending_samples: 1000,

    // Consolidation strategy
    consolidation_strategy: ConsolidationStrategy::EwcMerge,
    ..Default::default()
};

Tuning Guidelines:

| Priority | interval_ms | min_samples | Strategy |
| --- | --- | --- | --- |
| Latency | 200 | 20 | Average |
| Quality | 50 | 5 | EwcMerge |
| Memory | 100 | 50 | BestOnly |

Deep Loop Optimization

Triggered periodically for full optimization.

let config = SonaLlmConfig {
    // Accumulated quality threshold to trigger
    deep_trigger_threshold: 100.0,
    ..Default::default()
};

// Manual trigger for scheduled optimization
if sona.should_trigger_deep() || is_scheduled_time() {
    let samples = collect_high_quality_samples();
    let result = sona.deep_optimize(&samples);

    // Log improvement
    println!("Deep optimization: quality delta = {:.3}", result.quality_delta);
}

Batch Size Optimization

Dynamic Batching

// Optimal batch sizes vary by operation
struct BatchConfig {
    prefill_batch: usize,   // Process multiple prompts together
    decode_batch: usize,    // Parallel token generation
    lora_batch: usize,      // LoRA adaptation batch
}

impl BatchConfig {
    fn for_memory(available_gb: f32) -> Self {
        match available_gb {
            x if x < 8.0 => Self {
                prefill_batch: 1,
                decode_batch: 4,
                lora_batch: 16,
            },
            x if x < 16.0 => Self {
                prefill_batch: 2,
                decode_batch: 8,
                lora_batch: 32,
            },
            _ => Self {
                prefill_batch: 4,
                decode_batch: 16,
                lora_batch: 64,
            },
        }
    }
}

Batch Size Impact

| Batch Size | Throughput | Latency | Memory |
| --- | --- | --- | --- |
| 1 | Low | Lowest | Lowest |
| 4 | Medium | Low | Medium |
| 8 | High | Medium | High |
| 16+ | Highest | Higher | Highest |

Rule of thumb: Increase batch size until memory pressure or latency constraints are hit.
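The rule of thumb can be sketched as a search for the largest batch that fits a memory budget. This is a sketch under stated assumptions: `mb_per_sequence` (estimated memory cost per batched sequence) and `latency_cap` (largest batch acceptable for latency) are illustrative parameters, not values measured from RuvLLM.

```rust
// Grow the batch size while the estimated memory cost stays under
// budget and the latency cap is not exceeded.
fn max_batch_size(budget_mb: f64, mb_per_sequence: f64, latency_cap: usize) -> usize {
    let mut batch = 1;
    while (batch + 1) as f64 * mb_per_sequence <= budget_mb && batch + 1 <= latency_cap {
        batch += 1;
    }
    batch
}
```

With a 1 GB budget and ~100 MB per sequence, this yields a batch of 10; tightening the latency cap to 4 caps the batch at 4.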

KV Cache Optimization

Two-Tier Configuration

let config = KvCacheConfig {
    // Tokens in high-precision tail
    // More = better attention quality for recent context
    // Less = less memory usage
    tail_length: 256,

    // Tail precision (FP16 recommended)
    tail_precision: Precision::FP16,

    // Store precision (Q4 for 4x compression)
    store_precision: Precision::Q4,

    // Maximum context length
    max_tokens: 4096,

    // KV heads (depends on model architecture)
    num_kv_heads: 8,
    head_dim: 128,

    // Batch size for migration (affects latency spikes)
    migration_batch: 64,
};

Memory Calculation

KV Cache Memory = num_layers * 2 * max_tokens * num_kv_heads * head_dim * bytes_per_element

Example (Qwen2.5-7B with 4096 context):
- Layers: 32
- KV heads: 8
- Head dim: 128
- FP16 tail (256 tokens): 32 * 2 * 256 * 8 * 128 * 2 = 33.5 MB
- Q4 store (3840 tokens): 32 * 2 * 3840 * 8 * 128 * 0.5 = 125.8 MB
- Total: ~160 MB (vs ~537 MB for full FP16 at 4096 tokens)
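The formula above restated as a runnable helper (bytes per element: 2.0 for FP16, 1.0 for Q8, 0.5 for Q4):

```rust
// KV cache footprint in bytes: keys and values (hence the factor of 2)
// for every layer, token, KV head, and head dimension.
fn kv_cache_bytes(
    num_layers: u64,
    tokens: u64,
    num_kv_heads: u64,
    head_dim: u64,
    bytes_per_element: f64,
) -> f64 {
    (num_layers * 2 * tokens * num_kv_heads * head_dim) as f64 * bytes_per_element
}
```

Plugging in the Qwen2.5-7B numbers reproduces the figures above: the 256-token FP16 tail comes to 33,554,432 bytes (~33.5 MB) and the 3840-token Q4 store to 125,829,120 bytes (~125.8 MB).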

Cache Strategies by Use Case

| Use Case | tail_length | store_precision | max_tokens |
| --- | --- | --- | --- |
| Chat (short) | 128 | Q8 | 2048 |
| Chat (long) | 256 | Q4 | 8192 |
| Document QA | 512 | Q4 | 16384 |
| Code completion | 128 | Q8 | 4096 |

Attention Optimization

Grouped-Query Attention (GQA)

let config = AttentionConfig {
    num_heads: 32,      // Query heads
    num_kv_heads: 8,    // KV heads (4:1 ratio)
    head_dim: 128,
    causal: true,
    ..Default::default()
};

// GQA ratio determines memory savings
// 4:1 = ~4x KV cache reduction
// 8:1 = ~8x KV cache reduction
assert_eq!(config.gqa_ratio(), 4);
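The ratio arithmetic can be checked with a tiny free function (a sketch, independent of the `AttentionConfig::gqa_ratio` method shown above):

```rust
// Query heads per KV head; this ratio is also the KV cache reduction
// factor relative to full multi-head attention.
fn gqa_ratio(num_heads: usize, num_kv_heads: usize) -> usize {
    assert!(
        num_heads % num_kv_heads == 0,
        "query heads must be a multiple of KV heads"
    );
    num_heads / num_kv_heads
}
```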

Flash Attention Optimization

// Flash Attention is memory-efficient but has setup overhead
// Best for: longer sequences (>256 tokens)

// For short sequences, standard attention may be faster
let use_flash = sequence_length > 256;

if use_flash {
    let output = flash_attention_neon(&query, &key, &value, scale, causal);
} else {
    let output = standard_attention(&query, &key, &value, scale, causal);
}

Paged Attention for Inference

// Paged attention enables non-contiguous KV cache
// Best for: long-running inference with variable context

let mut cache = PagedKvCache::new(
    16,     // block_size: tokens per block
    8,      // num_kv_heads
    128,    // head_dim
);

// Append incrementally
for token in tokens {
    let (k, v) = compute_kv(token)?;
    cache.append(&k, &v);
}

// Efficient attention over paged cache
let output = paged_attention_neon(&query, &cache, &block_tables, scale);

Quantization Optimization

Model Quantization

| Precision | Bytes per weight | Quality | Speed |
| --- | --- | --- | --- |
| FP32 | 4 | Best | Slowest |
| FP16 | 2 | Excellent | Fast |
| Q8 | 1 | Very Good | Faster |
| Q4K | 0.5 | Good | Fastest |
| Q4 | 0.5 | Acceptable | Fastest |
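As a rough sanity check of the table, weight memory scales linearly with bytes per weight. This sketch ignores the small block-scale overhead that K-quants add, so treat the result as an approximation:

```rust
// Approximate weight footprint in GB: billions of parameters times
// bytes per weight (1e9 params * bytes / 1e9 bytes-per-GB cancels out).
fn approx_weights_gb(params_billions: f64, bytes_per_weight: f64) -> f64 {
    params_billions * bytes_per_weight
}
```

For a 7B model this gives ~3.5 GB at 4-bit and ~14 GB at FP16, consistent with the 3.4 GB (Q4K) peak-memory figure in the metrics table.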

Recommendations:

// High quality (16GB+ RAM)
let config = ModelConfig {
    quantization: Precision::Q8,
    ..Default::default()
};

// Balanced (8-16GB RAM)
let config = ModelConfig {
    quantization: Precision::Q4K,  // K-quant preserves quality
    ..Default::default()
};

// Memory constrained (<8GB RAM)
let config = ModelConfig {
    quantization: Precision::Q4,
    ..Default::default()
};

KV Cache Quantization

// Hybrid quantization: recent tokens in high precision
let config = KvCacheConfig {
    tail_length: 256,           // Recent: FP16
    tail_precision: Precision::FP16,
    store_precision: Precision::Q4,  // Older: Q4
    ..Default::default()
};

// Quality impact by position
// Position 0-256 (tail): Full quality
// Position 256+: ~95% quality with Q4

Hardware-Specific Optimization

Apple Silicon (M1/M2/M3/M4)

// Metal backend for GPU acceleration
let backend = CandleBackend::with_device(DeviceType::Metal)?;

// Optimize for unified memory
let config = ModelConfig {
    // Unified memory = larger KV cache possible
    kv_cache_config: KvCacheConfig {
        max_tokens: 8192,  // Can be larger on M-series
        ..Default::default()
    },
    ..Default::default()
};

M4 Pro Specific:

  • Use metal feature for GPU acceleration
  • NEON SIMD enabled by default
  • Leverage unified memory for larger context

NVIDIA GPUs

// CUDA backend
let backend = CandleBackend::with_device(DeviceType::Cuda(0))?;

// Optimize for separate VRAM
let config = ModelConfig {
    kv_cache_config: KvCacheConfig {
        // Conservative: VRAM is limited
        max_tokens: 4096,
        ..Default::default()
    },
    ..Default::default()
};

CPU Fallback

// CPU with SIMD optimization
let backend = CandleBackend::with_device(DeviceType::Cpu)?;

// Reduce memory pressure
let config = ModelConfig {
    quantization: Precision::Q4,
    kv_cache_config: KvCacheConfig {
        tail_length: 128,
        max_tokens: 2048,
        ..Default::default()
    },
    ..Default::default()
};

Real-Time Optimization

Adaptive Optimization

use ruvllm::optimization::{RealTimeOptimizer, OptimizerConfig};

let optimizer = RealTimeOptimizer::new(OptimizerConfig {
    target_latency_ms: 100.0,
    min_throughput: 50.0,  // tokens/sec
    memory_threshold: 0.9,  // 90% of available
});

// Optimizer adjusts parameters in real time
loop {
    let metrics = backend.get_metrics();
    let adjustments = optimizer.recommend(&metrics);

    if adjustments.reduce_batch_size {
        // Never shrink below a batch of 1
        config.batch_size = config.batch_size.saturating_sub(1).max(1);
    }
    if adjustments.increase_quantization {
        config.kv_cache_config.store_precision = Precision::Q4;
    }
}

Latency Monitoring

// Track latency components
struct LatencyBreakdown {
    tokenization_us: u64,
    prefill_us: u64,
    decode_us: u64,
    sampling_us: u64,
    lora_us: u64,
}

impl LatencyBreakdown {
    fn total_ms(&self) -> f64 {
        (self.tokenization_us + self.prefill_us +
         self.decode_us + self.sampling_us + self.lora_us) as f64 / 1000.0
    }

    fn bottleneck(&self) -> &str {
        let max = [
            (self.tokenization_us, "tokenization"),
            (self.prefill_us, "prefill"),
            (self.decode_us, "decode"),
            (self.sampling_us, "sampling"),
            (self.lora_us, "lora"),
        ].into_iter().max_by_key(|(v, _)| *v).unwrap();
        max.1
    }
}

Benchmarking

Running Benchmarks

# All benchmarks
cargo bench

# Specific benchmarks
cargo bench --bench attention_bench
cargo bench --bench lora_bench
cargo bench --bench e2e_bench

# With specific features
cargo bench --features metal
cargo bench --features cuda

Custom Benchmarks

use criterion::{black_box, criterion_group, criterion_main, Criterion};
use ruvllm::kernels::attention::flash_attention_neon;

fn bench_attention(c: &mut Criterion) {
    let query = vec![0.1f32; 128];
    let key = vec![0.1f32; 512 * 128];
    let value = vec![0.1f32; 512 * 128];
    let scale = 1.0 / 128.0_f32.sqrt();

    c.bench_function("flash_attention_512", |b| {
        b.iter(|| {
            flash_attention_neon(
                black_box(&query),
                black_box(&key),
                black_box(&value),
                scale,
                true,
            )
        })
    });
}

criterion_group!(benches, bench_attention);
criterion_main!(benches);

Optimization Checklist

Before Deployment

  • Choose appropriate quantization (Q4K for most cases)
  • Configure KV cache for expected context length
  • Enable GQA if model supports it
  • Set appropriate batch sizes for memory
  • Configure SONA learning rates
  • Test with representative workloads

Monitoring

  • Track prefill and decode throughput
  • Monitor memory usage over time
  • Log KV cache hit rates
  • Track SONA learning metrics
  • Alert on latency spikes

Troubleshooting

| Symptom | Likely Cause | Solution |
| --- | --- | --- |
| High latency | Batch too large | Reduce batch size |
| OOM errors | KV cache too large | Reduce max_tokens or use Q4 |
| Quality degradation | Over-quantization | Use Q8 instead of Q4 |
| Slow adaptation | Learning rate too low | Increase instant_lr |
| Forgetting | EWC lambda too low | Increase ewc_lambda |