# Changelog

All notable changes to the ruvllm crate will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [2.0.0] - 2025-01-19

### Added

- Multi-threaded GEMM/GEMV with Rayon (12.7x speedup on M4 Pro; see the sketch after this list)
- Flash Attention 2 with auto block sizing (+10% throughput)
- INT8/INT4/Q4_K quantized inference kernels (4-8x memory reduction)
- Optimized Metal GPU shaders (simdgroup_matrix)
- Memory pool with arena allocator (zero-alloc inference)
- WASM support via the ruvllm-wasm crate
- npm package integration (@ruvector/ruvllm v2)
- Paged attention for non-contiguous KV cache storage
- Grouped-Query Attention (GQA) and Multi-Query Attention (MQA) support
- Two-tier KV cache with an FP16 tail and quantized cold storage
- MicroLoRA for real-time per-request adaptation (<1 ms latency)
- EWC++ (Elastic Weight Consolidation) to prevent catastrophic forgetting
- SONA learning integration with three-tier loops (instant/background/deep)
- Native Metal compute shaders tuned for the M4 Pro
- Candle backend integration for HuggingFace model loading
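
The multi-threaded GEMV above exploits the fact that each output element is an independent dot product. A minimal sketch of the row-parallel approach, assuming a row-major weight matrix; the function and parameter names are illustrative, not the crate's API:

```rust
use rayon::prelude::*;

/// Row-parallel GEMV: y = W * x for a row-major W of shape (m, n).
/// Each output element is an independent dot product, so Rayon can
/// split rows across cores with no synchronization on the hot path.
fn gemv_parallel(w: &[f32], x: &[f32], y: &mut [f32], m: usize, n: usize) {
    assert_eq!(w.len(), m * n);
    assert_eq!(x.len(), n);
    assert_eq!(y.len(), m);
    y.par_iter_mut().enumerate().for_each(|(i, yi)| {
        let row = &w[i * n..(i + 1) * n];
        *yi = row.iter().zip(x).map(|(a, b)| a * b).sum();
    });
}
```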

### Changed

- GEMV performance: 6 GFLOPS → 35.9 GFLOPS (6x improvement)
- GEMM performance: 6 GFLOPS → 19.2 GFLOPS (3.2x improvement)
- Cache blocking tuned for the M4 Pro (96x64x256 tiles)
- 12x4 micro-kernel for better register utilization
- RMSNorm optimized with NEON SIMD (620 ns for dim 4096, 16x better than target; a scalar reference sketch follows this list)
- Flash Attention achieves 840 µs for 256-token sequences
- MicroLoRA forward pass: 8.56 µs scalar, 2.61 µs SIMD (117x/383x better than target)
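
For reference, the computation the NEON RMSNorm kernel vectorizes is simple; a scalar sketch is below. The real kernel uses NEON intrinsics for the sum-of-squares and scaling loops, and these names are illustrative:

```rust
/// Scalar RMSNorm: x_i <- x_i / rms(x) * g_i, where
/// rms(x) = sqrt(mean(x^2) + eps). The NEON kernel computes the same
/// thing with vectorized reduction and scaling passes.
fn rmsnorm(x: &mut [f32], gain: &[f32], eps: f32) {
    let n = x.len() as f32;
    let mean_sq: f32 = x.iter().map(|v| v * v).sum::<f32>() / n;
    let scale = 1.0 / (mean_sq + eps).sqrt();
    for (xi, gi) in x.iter_mut().zip(gain) {
        *xi *= scale * gi;
    }
}
```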

### Fixed

- Parameter estimation accuracy for 7B models
- Doctest crate name compatibility
- KV cache migration batch sizing that caused latency spikes
- Memory bandwidth bottlenecks in large matrix operations

### Performance Highlights (M4 Pro, 48 GB RAM)

| Operation | Latency | Target | Status |
| --- | --- | --- | --- |
| Flash Attention (256-token sequence) | 840 µs | <2 ms | 2.4x better |
| RMSNorm (dim 4096) | 620 ns | <10 µs | 16x better |
| GEMV (4096x4096) | 1.36 ms | <5 ms | 3.7x better |
| MicroLoRA forward (rank=2, dim=4096) | 8.56 µs | <1 ms | 117x better |
| RoPE with tables (128 dim, 32 tokens) | 1.33 µs | <50 µs | 37x better |
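
The RoPE row above refers to rotary position embeddings applied from precomputed cos/sin tables, so the hot loop is pure multiply-adds with no per-token trig calls. A minimal sketch of the table-based form, using the half-split pairing convention; the layout and names are assumptions, not the crate's API:

```rust
/// Precompute one (cos, sin) pair per (position, frequency), using the
/// standard RoPE frequencies theta^(-2i/dim).
fn rope_tables(max_pos: usize, dim: usize, theta: f32) -> Vec<(f32, f32)> {
    let half = dim / 2;
    let mut t = Vec::with_capacity(max_pos * half);
    for pos in 0..max_pos {
        for i in 0..half {
            let freq = theta.powf(-2.0 * i as f32 / dim as f32);
            let angle = pos as f32 * freq;
            t.push((angle.cos(), angle.sin()));
        }
    }
    t
}

/// Rotate one head vector in place using the precomputed tables.
fn apply_rope(x: &mut [f32], tables: &[(f32, f32)], pos: usize) {
    let half = x.len() / 2;
    for i in 0..half {
        let (c, s) = tables[pos * half + i];
        let (a, b) = (x[i], x[half + i]);
        x[i] = a * c - b * s;
        x[half + i] = a * s + b * c;
    }
}
```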

## [0.1.32] - 2025-01-18

### Added

- Initial ruvllm-integration crate with a basic LLM serving runtime
- Paged attention implementation (see the sketch after this list)
- KV cache management
- SONA learning integration scaffolding
- Basic NEON SIMD kernels for ARM64
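
Paged attention stores the KV cache in fixed-size blocks and indirects through a per-sequence block table, so cache memory need not be contiguous and sequences grow without reserving worst-case memory up front. A simplified sketch of the bookkeeping, not the crate's actual data structures:

```rust
/// Fixed-size KV blocks plus a per-sequence block table. Appending a
/// token allocates a new physical block only when the current one fills.
struct PagedKvCache {
    block_size: usize,             // tokens per block
    free_blocks: Vec<usize>,       // indices into physical block storage
    block_tables: Vec<Vec<usize>>, // logical -> physical blocks, per sequence
    seq_lens: Vec<usize>,          // tokens stored per sequence
}

impl PagedKvCache {
    /// Returns (physical block, offset within block) for the next token,
    /// or None if the block pool is exhausted.
    fn append(&mut self, seq: usize) -> Option<(usize, usize)> {
        let len = self.seq_lens[seq];
        let offset = len % self.block_size;
        if offset == 0 {
            let block = self.free_blocks.pop()?;
            self.block_tables[seq].push(block);
        }
        self.seq_lens[seq] = len + 1;
        Some((*self.block_tables[seq].last().unwrap(), offset))
    }
}
```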

### Dependencies

- ruvector-core for the storage backend
- ruvector-sona for learning integration
- candle-core, candle-nn, candle-transformers for the ML backend
- tokenizers for text processing
- hf-hub for model downloads
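
As an illustration of how two of these dependencies fit together downstream, a minimal sketch using hf-hub to fetch files and tokenizers to encode a prompt; the model id is an arbitrary example, and anyhow is pulled in only for brevity:

```rust
use hf_hub::api::sync::Api;
use tokenizers::Tokenizer;

/// Download (or reuse the local cache of) a repo's tokenizer.json and
/// load it with the tokenizers crate.
fn load_tokenizer(model_id: &str) -> anyhow::Result<Tokenizer> {
    let repo = Api::new()?.model(model_id.to_string());
    let path = repo.get("tokenizer.json")?; // cached after the first fetch
    Tokenizer::from_file(path).map_err(|e| anyhow::anyhow!(e))
}

fn main() -> anyhow::Result<()> {
    let tok = load_tokenizer("TinyLlama/TinyLlama-1.1B-Chat-v1.0")?;
    let enc = tok
        .encode("Hello, world!", true)
        .map_err(|e| anyhow::anyhow!(e))?;
    println!("{:?}", enc.get_ids()); // token ids fed to the serving runtime
    Ok(())
}
```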