This guide covers performance optimization strategies for RuvLLM, including SONA learning loops, batch sizing, KV cache management, and hardware-specific tuning.
| Feature | Improvement | Notes |
|---|---|---|
| Multi-threaded GEMM | 12.7x speedup | Rayon on M4 Pro 10-core |
| Flash Attention 2 | +10% throughput | Auto block sizing |
| Quantized Inference | 4-8x memory reduction | INT8/INT4/Q4_K |
| Metal GPU | 3x speedup | simdgroup_matrix |
| Memory Pool | Zero-alloc | Arena allocator |
| Metric | Target (M4 Pro) | Achieved (v2.0.0) | Description |
|---|---|---|---|
| Prefill | >2000 tok/s | 3500 tok/s | Processing input tokens |
| Decode | >80 tok/s | 120 tok/s | Generating output tokens |
| TTFT | <50ms | 35ms | Time to first token |
| Memory | <8GB for 7B | 3.4GB (Q4K) | Peak memory usage |
| MicroLoRA | <1ms | 8.56us | Per-request adaptation |
┌─────────────────────────────────────────────────────────┐
│ Optimization Layers │
├─────────────────────────────────────────────────────────┤
│ SONA Learning │ Real-time adaptation, routing │
├─────────────────────────────────────────────────────────┤
│ Attention │ Flash, Paged, GQA - 2-4x speedup │
├─────────────────────────────────────────────────────────┤
│ KV Cache │ Two-tier, quantized - 4x memory │
├─────────────────────────────────────────────────────────┤
│ Quantization │ Q4K, Q8 - 4-8x smaller │
├─────────────────────────────────────────────────────────┤
│ SIMD/GPU │ NEON, Metal - hardware accel │
└─────────────────────────────────────────────────────────┘
The instant loop runs per request with a target latency of <1ms.
let config = SonaLlmConfig {
// Learning rate for instant updates
// Higher = faster adaptation, more variance
// Lower = slower adaptation, more stable
instant_lr: 0.01,
// Quality threshold - skip low-quality samples
training: TrainingConfig {
quality_threshold: 0.5, // 0.0-1.0
..Default::default()
},
..Default::default()
};
Tuning Guidelines:
| Use Case | instant_lr | quality_threshold |
|---|---|---|
| High variance tasks | 0.005 | 0.7 |
| Stable domains | 0.02 | 0.3 |
| User personalization | 0.01 | 0.5 |
The background loop consolidates patterns without blocking inference.
let config = SonaLlmConfig {
// How often to run (milliseconds)
background_interval_ms: 100,
// Minimum samples before consolidation
background_min_samples: 10,
// Maximum pending (triggers forced consolidation)
max_pending_samples: 1000,
// Consolidation strategy
consolidation_strategy: ConsolidationStrategy::EwcMerge,
..Default::default()
};
Tuning Guidelines:
| Priority | interval_ms | min_samples | Strategy |
|---|---|---|---|
| Latency | 200 | 20 | Average |
| Quality | 50 | 5 | EwcMerge |
| Memory | 100 | 50 | BestOnly |
The deep loop is triggered periodically for full optimization.
let config = SonaLlmConfig {
// Accumulated quality threshold to trigger
deep_trigger_threshold: 100.0,
..Default::default()
};
// Manual trigger for scheduled optimization
if sona.should_trigger_deep() || is_scheduled_time() {
let samples = collect_high_quality_samples();
let result = sona.deep_optimize(&samples);
// Log improvement
println!("Deep optimization: quality delta = {:.3}", result.quality_delta);
}
// Optimal batch sizes vary by operation
struct BatchConfig {
prefill_batch: usize, // Process multiple prompts together
decode_batch: usize, // Parallel token generation
lora_batch: usize, // LoRA adaptation batch
}
impl BatchConfig {
fn for_memory(available_gb: f32) -> Self {
match available_gb {
x if x < 8.0 => Self {
prefill_batch: 1,
decode_batch: 4,
lora_batch: 16,
},
x if x < 16.0 => Self {
prefill_batch: 2,
decode_batch: 8,
lora_batch: 32,
},
_ => Self {
prefill_batch: 4,
decode_batch: 16,
lora_batch: 64,
},
}
}
}
| Batch Size | Throughput | Latency | Memory |
|---|---|---|---|
| 1 | Low | Lowest | Lowest |
| 4 | Medium | Low | Medium |
| 8 | High | Medium | High |
| 16+ | Highest | Higher | Highest |
Rule of thumb: Increase batch size until memory pressure or latency constraints are hit.
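As a concrete illustration, the `BatchConfig::for_memory` helper above can be driven by whatever memory figure your deployment detects (the 12 GB value below is purely illustrative):
```rust
// Sketch: derive batch sizes from available memory using the helper above.
// 12 GB falls in the 8-16 GB tier, so moderate batch sizes are selected.
let available_gb = 12.0_f32;
let batch = BatchConfig::for_memory(available_gb);
assert_eq!(batch.prefill_batch, 2);
assert_eq!(batch.decode_batch, 8);
assert_eq!(batch.lora_batch, 32);
```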
let config = KvCacheConfig {
// Tokens in high-precision tail
// More = better attention quality for recent context
// Less = less memory usage
tail_length: 256,
// Tail precision (FP16 recommended)
tail_precision: Precision::FP16,
// Store precision (Q4 for 4x compression)
store_precision: Precision::Q4,
// Maximum context length
max_tokens: 4096,
// KV heads (depends on model architecture)
num_kv_heads: 8,
head_dim: 128,
// Batch size for migration (affects latency spikes)
migration_batch: 64,
};
KV Cache Memory = num_layers * 2 * max_tokens * num_kv_heads * head_dim * bytes_per_element
Example (Qwen2.5-7B with 4096 context):
- Layers: 32
- KV heads: 8
- Head dim: 128
- FP16 tail (256 tokens): 32 * 2 * 256 * 8 * 128 * 2 = 33.5 MB
- Q4 store (3840 tokens): 32 * 2 * 3840 * 8 * 128 * 0.5 = 125.8 MB
- Total: ~160 MB (vs ~537 MB for a full FP16 cache)
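The same arithmetic can be expressed as a small helper; this is a standalone sketch that just reproduces the worked example above, not a RuvLLM API:
```rust
/// Sketch: estimate KV cache size in bytes from the formula above.
fn kv_cache_bytes(num_layers: usize, tokens: usize, num_kv_heads: usize,
                  head_dim: usize, bytes_per_element: f64) -> f64 {
    (num_layers * 2 * tokens * num_kv_heads * head_dim) as f64 * bytes_per_element
}

// Qwen2.5-7B, 4096-token context, 256-token FP16 tail + Q4 store:
let tail  = kv_cache_bytes(32, 256, 8, 128, 2.0);   // ~33.5 MB
let store = kv_cache_bytes(32, 3840, 8, 128, 0.5);  // ~125.8 MB
println!("total: {:.1} MB", (tail + store) / 1e6);  // ~159.4 MB
```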
| Use Case | tail_length | store_precision | max_tokens |
|---|---|---|---|
| Chat (short) | 128 | Q8 | 2048 |
| Chat (long) | 256 | Q4 | 8192 |
| Document QA | 512 | Q4 | 16384 |
| Code completion | 128 | Q8 | 4096 |
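For example, the Document QA row of this table maps onto the cache config like so (a sketch reusing only the fields shown earlier; adjust num_kv_heads and head_dim to your model):
```rust
// Sketch: "Document QA" profile from the table above.
let config = KvCacheConfig {
    tail_length: 512,                // larger high-precision tail for long prompts
    tail_precision: Precision::FP16,
    store_precision: Precision::Q4,  // 4x compression for the older context
    max_tokens: 16384,
    num_kv_heads: 8,
    head_dim: 128,
    migration_batch: 64,
};
```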
let config = AttentionConfig {
num_heads: 32, // Query heads
num_kv_heads: 8, // KV heads (4:1 ratio)
head_dim: 128,
causal: true,
..Default::default()
};
// GQA ratio determines memory savings
// 4:1 = ~4x KV cache reduction
// 8:1 = ~8x KV cache reduction
assert_eq!(config.gqa_ratio(), 4);
// Flash Attention is memory-efficient but has setup overhead
// Best for: longer sequences (>256 tokens)
// For short sequences, standard attention may be faster
let use_flash = sequence_length > 256;
if use_flash {
let output = flash_attention_neon(&query, &key, &value, scale, causal);
} else {
let output = standard_attention(&query, &key, &value, scale, causal);
}
// Paged attention enables non-contiguous KV cache
// Best for: long-running inference with variable context
let mut cache = PagedKvCache::new(
16, // block_size: tokens per block
8, // num_kv_heads
128, // head_dim
);
// Append incrementally
for token in tokens {
let (k, v) = compute_kv(token)?;
cache.append(&k, &v);
}
// Efficient attention over paged cache
let output = paged_attention_neon(&query, &cache, &block_tables, scale);
| Precision | Memory (relative to Q8) | Quality | Speed |
|---|---|---|---|
| FP32 | 4x | Best | Slowest |
| FP16 | 2x | Excellent | Fast |
| Q8 | 1x | Very Good | Faster |
| Q4K | 0.5x | Good | Fastest |
| Q4 | 0.5x | Acceptable | Fastest |
Recommendations:
// High quality (16GB+ RAM)
let config = ModelConfig {
quantization: Precision::Q8,
..Default::default()
};
// Balanced (8-16GB RAM)
let config = ModelConfig {
quantization: Precision::Q4K, // K-quant preserves quality
..Default::default()
};
// Memory constrained (<8GB RAM)
let config = ModelConfig {
quantization: Precision::Q4,
..Default::default()
};
// Hybrid quantization: recent tokens in high precision
let config = KvCacheConfig {
tail_length: 256, // Recent: FP16
tail_precision: Precision::FP16,
store_precision: Precision::Q4, // Older: Q4
..Default::default()
};
// Quality impact by position
// Position 0-256 (tail): Full quality
// Position 256+: ~95% quality with Q4
// Metal backend for GPU acceleration
let backend = CandleBackend::with_device(DeviceType::Metal)?;
// Optimize for unified memory
let config = ModelConfig {
// Unified memory = larger KV cache possible
kv_cache_config: KvCacheConfig {
max_tokens: 8192, // Can be larger on M-series
..Default::default()
},
..Default::default()
};
M4 Pro Specific:
- Use the `metal` feature for GPU acceleration
- NEON SIMD enabled by default
- Leverage unified memory for larger context
// CUDA backend
let backend = CandleBackend::with_device(DeviceType::Cuda(0))?;
// Optimize for separate VRAM
let config = ModelConfig {
kv_cache_config: KvCacheConfig {
// Conservative: VRAM is limited
max_tokens: 4096,
..Default::default()
},
..Default::default()
};
// CPU with SIMD optimization
let backend = CandleBackend::with_device(DeviceType::Cpu)?;
// Reduce memory pressure
let config = ModelConfig {
quantization: Precision::Q4,
kv_cache_config: KvCacheConfig {
tail_length: 128,
max_tokens: 2048,
..Default::default()
},
..Default::default()
};
use ruvllm::optimization::{RealTimeOptimizer, OptimizerConfig};
let optimizer = RealTimeOptimizer::new(OptimizerConfig {
target_latency_ms: 100.0,
min_throughput: 50.0, // tokens/sec
memory_threshold: 0.9, // 90% of available
});
// Optimizer adjusts parameters in real-time
loop {
let metrics = backend.get_metrics();
let adjustments = optimizer.recommend(&metrics);
if adjustments.reduce_batch_size {
config.batch_size -= 1;
}
if adjustments.increase_quantization {
config.kv_cache_config.store_precision = Precision::Q4;
}
}
// Track latency components
struct LatencyBreakdown {
tokenization_us: u64,
prefill_us: u64,
decode_us: u64,
sampling_us: u64,
lora_us: u64,
}
impl LatencyBreakdown {
fn total_ms(&self) -> f64 {
(self.tokenization_us + self.prefill_us +
self.decode_us + self.sampling_us + self.lora_us) as f64 / 1000.0
}
fn bottleneck(&self) -> &str {
let max = [
(self.tokenization_us, "tokenization"),
(self.prefill_us, "prefill"),
(self.decode_us, "decode"),
(self.sampling_us, "sampling"),
(self.lora_us, "lora"),
].into_iter().max_by_key(|(v, _)| *v).unwrap();
max.1
}
}
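A quick usage sketch (the timings are made-up numbers for illustration):
```rust
// Sketch: populate the breakdown from your own timers and report the bottleneck.
let breakdown = LatencyBreakdown {
    tokenization_us: 120,
    prefill_us: 18_400,
    decode_us: 61_000,
    sampling_us: 900,
    lora_us: 9,
};
println!("total: {:.1} ms, bottleneck: {}",
         breakdown.total_ms(), breakdown.bottleneck()); // total: 80.4 ms, bottleneck: decode
```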
# All benchmarks
cargo bench
# Specific benchmarks
cargo bench --bench attention_bench
cargo bench --bench lora_bench
cargo bench --bench e2e_bench
# With specific features
cargo bench --features metal
cargo bench --features cuda
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use ruvllm::kernels::attention::flash_attention_neon;
fn bench_attention(c: &mut Criterion) {
let query = vec![0.1f32; 128];
let key = vec![0.1f32; 512 * 128];
let value = vec![0.1f32; 512 * 128];
let scale = 1.0 / 128.0_f32.sqrt();
c.bench_function("flash_attention_512", |b| {
b.iter(|| {
flash_attention_neon(
black_box(&query),
black_box(&key),
black_box(&value),
scale,
true,
)
})
});
}
criterion_group!(benches, bench_attention);
criterion_main!(benches);
- Choose appropriate quantization (Q4K for most cases)
- Configure KV cache for expected context length
- Enable GQA if model supports it
- Set appropriate batch sizes for memory
- Configure SONA learning rates
- Test with representative workloads
- Track prefill and decode throughput
- Monitor memory usage over time
- Log KV cache hit rates
- Track SONA learning metrics
- Alert on latency spikes
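A minimal throughput tracker along these lines is easy to add; the type below is illustrative, not part of the RuvLLM API:
```rust
use std::time::Instant;

/// Sketch: running decode throughput in tokens/sec for dashboards and alerts.
struct ThroughputTracker {
    started: Instant,
    tokens: u64,
}

impl ThroughputTracker {
    fn new() -> Self {
        Self { started: Instant::now(), tokens: 0 }
    }
    fn record(&mut self, generated: u64) {
        self.tokens += generated;
    }
    fn tokens_per_sec(&self) -> f64 {
        self.tokens as f64 / self.started.elapsed().as_secs_f64().max(1e-9)
    }
}
```
Compare the reported rate against the decode target at the top of this guide (>80 tok/s on M4 Pro) and alert when it drops below your own threshold.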
| Symptom | Likely Cause | Solution |
|---|---|---|
| High latency | Batch too large | Reduce batch size |
| OOM errors | KV cache too large | Reduce max_tokens or use Q4 |
| Quality degradation | Over-quantization | Use Q8 instead of Q4 |
| Slow adaptation | Learning rate too low | Increase instant_lr |
| Forgetting | EWC lambda too low | Increase ewc_lambda |