Skip to content

Latest commit

 

History

History

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

README.md

RuvLLM

Crates.io docs.rs npm License: MIT

The local LLM inference engine that learns from every request -- Metal, CUDA, WebGPU, no cloud APIs.

cargo add ruvllm

RuvLLM loads GGUF models and runs them on your hardware with full acceleration -- Apple Silicon, NVIDIA GPUs, WebAssembly, whatever you have. Unlike other local inference tools, it gets smarter over time: SONA (Self-Optimizing Neural Architecture) watches how you use it and adapts automatically, so responses improve without manual tuning. It's part of RuVector, the self-learning vector database with graph intelligence.

RuvLLM OpenAI API llama.cpp Ollama vLLM
Cost Free after hardware Per-token billing Free Free Free
Privacy Data stays on your machine Sent to third party Local Local Local
Self-learning SONA adapts automatically Static Static Static Static
Per-request tuning MicroLoRA in <1 ms Not available Not available Not available Not available
Hardware support Metal, CUDA, ANE, WebGPU, CPU N/A Metal, CUDA, CPU Metal, CUDA, CPU CUDA only
WASM / Browser Yes (5.5 KB runtime) Via network call Not available Not available Not available
Vector DB integration Built-in (RuVector) Separate service Not available Not available Not available
Speculative decoding Yes N/A Yes No Yes
Continuous batching Yes N/A No No Yes
Production serving mistral-rs backend N/A Server mode Server mode Native

Key Features

Feature What It Does Why It Matters
SONA three-tier learning Adapts to your queries at three speeds: instant (<1 ms), background (~100 ms), deep (minutes) Responses improve automatically without manual retraining
Metal + CUDA + ANE Hardware-accelerated inference across Apple Silicon, NVIDIA GPUs, and Apple Neural Engine Get the most out of whatever hardware you have
Flash Attention 2 Memory-efficient attention with O(N) complexity and online softmax Longer contexts with less memory
GGUF memory mapping Memory-mapped model loading with quantization (Q4K, Q8, FP16) Load large models fast, use 4-8x less RAM
Speculative decoding Draft model generates candidates, target model verifies in parallel 2-3x faster text generation
Continuous batching Dynamic batch scheduling for concurrent requests 2-3x throughput improvement for serving
MicroLoRA Per-request fine-tuning with rank 1-2 adapters Personalize responses in <1 ms without full retraining
HuggingFace Hub Download and upload models directly One-line model access, easy sharing
mistral-rs backend PagedAttention, X-LoRA, ISQ for production serving Scale to 50+ concurrent users
Task-specific adapters 5 pre-trained LoRA adapters (coder, researcher, security, architect, reviewer) Instant specialization with hot-swap

Part of the RuVector ecosystem -- the self-learning vector database with graph intelligence, local AI, and PostgreSQL built in.

Quick Start

use ruvllm::prelude::*;

let mut backend = CandleBackend::with_device(DeviceType::Metal)?;
backend.load_gguf("models/qwen2.5-7b-q4_k.gguf", ModelConfig::default())?;

let response = backend.generate("Explain quantum computing in simple terms.",
    GenerateParams {
        max_tokens: 256,
        temperature: 0.7,
        top_p: 0.9,
        ..Default::default()
    }
)?;

println!("{}", response);

Installation

Add to your Cargo.toml:

[dependencies]
# Recommended for Apple Silicon Mac
ruvllm = { version = "2.0", features = ["inference-metal", "coreml", "parallel"] }

# For NVIDIA GPUs
ruvllm = { version = "2.0", features = ["inference-cuda", "parallel"] }

# Minimal (CPU only)
ruvllm = { version = "2.0" }

Or install the npm package:

npm install @ruvector/ruvllm

What's New in v2.3

Feature Description Benefit
RuvLTRA-Medium 3B Purpose-built 3B model for Claude Flow 42 layers, 256K context, speculative decode
HuggingFace Hub Full Hub integration (download/upload) Easy model sharing and distribution
Task-Specific LoRA 5 pre-trained adapters for agent types Optimized for coder/researcher/security/architect/reviewer
Adapter Merging TIES, DARE, SLERP, Task Arithmetic Combine adapters for multi-task models
Hot-Swap Adapters Zero-downtime adapter switching Runtime task specialization
Claude Dataset 2,700+ Claude-style training examples Optimized for Claude Flow integration
HNSW Routing 150x faster semantic pattern matching <25µs pattern retrieval
Evaluation Harness Real model evaluation with SWE-Bench 5 ablation modes, quality metrics
HNSW Auto-Dimension Automatic embedding dimension detection No manual config needed
mistral-rs Backend Production-scale serving with PagedAttention 5-10x concurrent users, X-LoRA, ISQ

Previous v2.0-2.2 Features

Feature Description Benefit
Apple Neural Engine Core ML backend with ANE routing 38 TOPS, 3-4x power efficiency
Hybrid GPU+ANE Pipeline Intelligent operation routing Best of both accelerators
Multi-threaded GEMM Rayon parallelization 4-12x speedup on M4 Pro
Flash Attention 2 Auto block sizing, online softmax O(N) memory, +10% throughput
Quantized Inference INT8/INT4/Q4_K/Q8_K kernels 4-8x memory reduction
Metal GPU Shaders simdgroup_matrix operations 3x speedup on Apple Silicon
GGUF Support Memory-mapped model loading Fast loading, reduced RAM
Continuous Batching Dynamic batch scheduling 2-3x throughput improvement
Speculative Decoding Draft model acceleration 2-3x faster generation
Gemma-2 & Phi-3 New model architectures Extended model support

Backends

Backend Best For Acceleration
Candle Single user, edge, WASM Metal, CUDA, CPU
Core ML Apple Silicon efficiency Apple Neural Engine (38 TOPS)
Hybrid Pipeline Maximum throughput on Mac GPU for attention, ANE for MLP
mistral-rs Production serving (10-100 users) PagedAttention, X-LoRA, ISQ

Feature Flags

Feature Description
candle Enable Candle backend (HuggingFace)
metal Apple Silicon GPU acceleration via Candle
metal-compute Native Metal compute shaders (M4 Pro optimized)
cuda NVIDIA GPU acceleration
coreml Apple Neural Engine via Core ML
hybrid-ane GPU+ANE hybrid pipeline (recommended for Mac)
inference-metal Full Metal inference stack
inference-metal-native Metal + native shaders (best M4 Pro perf)
inference-cuda Full CUDA inference stack
parallel Multi-threaded GEMM/GEMV with Rayon
accelerate Apple Accelerate BLAS (~2x GEMV speedup)
gguf-mmap Memory-mapped GGUF loading
async-runtime Tokio async support
wasm WebAssembly support
mistral-rs mistral-rs backend (PagedAttention, X-LoRA, ISQ)
mistral-rs-metal mistral-rs with Apple Silicon acceleration
mistral-rs-cuda mistral-rs with NVIDIA CUDA acceleration

Architecture

+----------------------------------+
|         Application              |
+----------------------------------+
               |
+----------------------------------+
|        RuvLLM Backend            |
|  +----------------------------+  |
|  |   Hybrid Pipeline Router   |  |
|  |  ┌─────────┐ ┌──────────┐  |  |
|  |  │  Metal  │ │   ANE    │  |  |
|  |  │   GPU   │ │ Core ML  │  |  |
|  |  └────┬────┘ └────┬─────┘  |  |
|  |       │    ↕      │        |  |
|  |  Attention    MLP/FFN      |  |
|  |  RoPE         Activations  |  |
|  |  Softmax      LayerNorm    |  |
|  +----------------------------+  |
|               |                  |
|  +----------------------------+  |
|  |     SONA Learning          |  |
|  |  - Instant (<1ms)          |  |
|  |  - Background (~100ms)     |  |
|  |  - Deep (minutes)          |  |
|  +----------------------------+  |
|               |                  |
|  +----------------------------+  |
|  |     NEON/SIMD Kernels      |  |
|  |  - Flash Attention 2       |  |
|  |  - Paged KV Cache          |  |
|  |  - Quantized MatMul        |  |
|  +----------------------------+  |
+----------------------------------+

Supported Models

Model Family Sizes Quantization Backend
RuvLTRA-Small 0.5B Q4K, Q5K, Q8, FP16 Candle/Metal/ANE
RuvLTRA-Medium 3B Q4K, Q5K, Q8, FP16 Candle/Metal
Qwen 2.5 0.5B-72B Q4K, Q8, FP16 Candle/Metal
Llama 3.x 8B-70B Q4K, Q8, FP16 Candle/Metal
Mistral 7B-22B Q4K, Q8, FP16 Candle/Metal
Phi-3 3.8B-14B Q4K, Q8, FP16 Candle/Metal
Gemma-2 2B-27B Q4K, Q8, FP16 Candle/Metal

RuvLTRA Models (Claude Flow Optimized)

Model Parameters Hidden Layers Context Features
RuvLTRA-Small 494M 896 24 32K GQA 7:1, SONA hooks
RuvLTRA-Medium 3.0B 2560 42 256K Flash Attention 2, Speculative Decode

Performance (M4 Pro 14-core)

Inference Benchmarks

Model Quant Prefill (tok/s) Decode (tok/s) Memory
Qwen2.5-7B Q4K 2,800 95 4.2 GB
Qwen2.5-7B Q8 2,100 72 7.8 GB
Llama3-8B Q4K 2,600 88 4.8 GB
Mistral-7B Q4K 2,500 85 4.1 GB
Phi-3-3.8B Q4K 3,500 135 2.3 GB
Gemma2-9B Q4K 2,200 75 5.2 GB

ANE vs GPU Performance (M4 Pro)

Dimension ANE GPU Winner
< 512 +30-50% - ANE
512-1024 +10-30% - ANE
1024-1536 ~Similar ~Similar Either
1536-2048 - +10-20% GPU
> 2048 - +30-50% GPU

Kernel Benchmarks

Kernel Single-thread Multi-thread (10-core)
GEMM 4096x4096 1.2 GFLOPS 12.7 GFLOPS
GEMV 4096x4096 0.8 GFLOPS 6.4 GFLOPS
Flash Attention (seq=2048) 850μs 320μs
RMS Norm (4096) 2.1μs 0.8μs
RoPE (4096, 128) 4.3μs 1.6μs

Apple Neural Engine (ANE) Integration

RuvLLM v2.0 includes full ANE support via Core ML:

use ruvllm::backends::coreml::{CoreMLBackend, AneStrategy};

// Create ANE-optimized backend
let backend = CoreMLBackend::new(AneStrategy::PreferAneForMlp)?;

// Or use hybrid pipeline for best performance
use ruvllm::backends::HybridPipeline;

let pipeline = HybridPipeline::new(HybridConfig {
    ane_strategy: AneStrategy::Adaptive,
    gpu_for_attention: true,  // Attention on GPU
    ane_for_mlp: true,        // MLP/FFN on ANE
    ..Default::default()
})?;

ANE Routing Recommendations

Operation Recommended Reason
Attention GPU Better for variable sequence lengths
Flash Attention GPU GPU memory bandwidth advantage
MLP/FFN ANE Optimal for fixed-size matmuls
GELU/SiLU ANE Dedicated activation units
LayerNorm/RMSNorm ANE Good for small dimensions
Embedding GPU Sparse operations

MicroLoRA Real-Time Adaptation

RuvLLM supports per-request fine-tuning using MicroLoRA:

use ruvllm::lora::{MicroLoRA, MicroLoraConfig, AdaptFeedback};

// Create MicroLoRA adapter
let config = MicroLoraConfig::for_hidden_dim(4096);
let lora = MicroLoRA::new(config);

// Adapt on user feedback
let feedback = AdaptFeedback::from_quality(0.9);
lora.adapt(&input_embedding, feedback)?;

// Apply learned updates
lora.apply_updates(0.01); // learning rate

// Get adaptation stats
let stats = lora.stats();
println!("Samples: {}, Avg quality: {:.2}", stats.samples, stats.avg_quality);

SONA Three-Tier Learning

Continuous improvement with three learning loops:

use ruvllm::optimization::{SonaLlm, SonaLlmConfig, ConsolidationStrategy};

let config = SonaLlmConfig {
    instant_lr: 0.01,
    background_interval_ms: 100,
    deep_trigger_threshold: 100.0,
    consolidation_strategy: ConsolidationStrategy::EwcMerge,
    ..Default::default()
};

let sona = SonaLlm::new(config);

// 1. Instant Loop (<1ms): Per-request MicroLoRA
let result = sona.instant_adapt("user query", "model response", 0.85);
println!("Instant adapt: {}μs", result.latency_us);

// 2. Background Loop (~100ms): Pattern consolidation
if let result = sona.maybe_background() {
    if result.applied {
        println!("Consolidated {} samples", result.samples_used);
    }
}

// 3. Deep Loop (minutes): Full optimization
if sona.should_trigger_deep() {
    let result = sona.deep_optimize(OptimizationTrigger::QualityThreshold(100.0));
    println!("Deep optimization: {:.1}s", result.latency_us as f64 / 1_000_000.0);
}

// Check learning stats
let stats = sona.stats();
println!("Total samples: {}", stats.total_samples);
println!("Accumulated quality: {:.2}", stats.accumulated_quality);

Two-Tier KV Cache

Memory-efficient caching with automatic tiering:

use ruvllm::kv_cache::{TwoTierKvCache, KvCacheConfig};

let config = KvCacheConfig {
    tail_length: 256,              // Recent tokens in FP16
    tail_precision: Precision::FP16,
    store_precision: Precision::Q4,  // Older tokens in Q4
    max_tokens: 8192,
    num_layers: 32,
    num_kv_heads: 8,
    head_dim: 128,
};

let cache = TwoTierKvCache::new(config);
cache.append(&keys, &values)?;

// Automatic migration from tail to quantized store
let stats = cache.stats();
println!("Tail: {} tokens, Store: {} tokens", stats.tail_tokens, stats.store_tokens);
println!("Compression ratio: {:.2}x", stats.compression_ratio);
println!("Memory saved: {:.1} MB", stats.memory_saved_mb);

Continuous Batching

High-throughput serving with dynamic batching:

use ruvllm::serving::{ContinuousBatchScheduler, SchedulerConfig, InferenceRequest};

let scheduler = ContinuousBatchScheduler::new(SchedulerConfig {
    max_batch_size: 32,
    max_batch_tokens: 4096,
    max_waiting_time_ms: 50,
    preemption_mode: PreemptionMode::Recompute,
    ..Default::default()
});

// Add requests
scheduler.add_request(InferenceRequest::new(tokens, params))?;

// Process batch
while let Some(batch) = scheduler.get_next_batch() {
    let outputs = backend.forward_batch(&batch)?;
    scheduler.process_outputs(outputs)?;
}

// Get throughput stats
let stats = scheduler.stats();
println!("Throughput: {:.1} tok/s", stats.tokens_per_second);
println!("Batch utilization: {:.1}%", stats.avg_batch_utilization * 100.0);

Speculative Decoding

Accelerate generation with draft models:

use ruvllm::speculative::{SpeculativeDecoder, SpeculativeConfig};

let config = SpeculativeConfig {
    draft_tokens: 4,           // Tokens to draft per step
    acceptance_threshold: 0.8, // Min probability for acceptance
    ..Default::default()
};

let decoder = SpeculativeDecoder::new(
    target_model,
    draft_model,
    config,
)?;

// Generate with speculation
let output = decoder.generate(prompt, GenerateParams {
    max_tokens: 256,
    ..Default::default()
})?;

println!("Acceptance rate: {:.1}%", output.stats.acceptance_rate * 100.0);
println!("Speedup: {:.2}x", output.stats.speedup);

GGUF Model Loading

Efficient loading with memory mapping:

use ruvllm::gguf::{GgufLoader, GgufConfig};

let loader = GgufLoader::new(GgufConfig {
    mmap_enabled: true,       // Memory-map for fast loading
    validate_checksum: true,  // Verify file integrity
    ..Default::default()
});

// Load model metadata
let metadata = loader.read_metadata("model.gguf")?;
println!("Model: {}", metadata.name);
println!("Parameters: {}B", metadata.parameters / 1_000_000_000);
println!("Quantization: {:?}", metadata.quantization);

// Load into backend
let tensors = loader.load_tensors("model.gguf")?;
backend.load_tensors(tensors)?;

mistral-rs Backend (Production Serving)

RuvLLM v2.3 includes integration with mistral-rs for production-scale LLM serving with advanced memory management.

Note: The mistral-rs crate is not yet published to crates.io. The integration is designed and ready—enable it when mistral-rs becomes available.

Key Features

Feature Description Benefit
PagedAttention vLLM-style KV cache management 5-10x concurrent users, 85-95% memory utilization
X-LoRA Per-token adapter routing <1ms routing overhead, multi-task inference
ISQ In-Situ Quantization (AWQ, GPTQ, RTN) Runtime quantization without re-export

Usage Example

use ruvllm::backends::mistral::{
    MistralBackend, MistralBackendConfig,
    PagedAttentionConfig, XLoraConfig, IsqConfig
};

// Configure mistral-rs backend for production serving
let config = MistralBackendConfig::builder()
    // PagedAttention: Enable 50+ concurrent users
    .paged_attention(PagedAttentionConfig {
        block_size: 16,
        max_blocks: 4096,
        gpu_memory_fraction: 0.9,
        enable_prefix_caching: true,
    })
    // X-LoRA: Per-token adapter routing
    .xlora(XLoraConfig {
        adapters: vec![
            "adapters/coder".into(),
            "adapters/researcher".into(),
        ],
        top_k: 2,
        temperature: 0.3,
    })
    // ISQ: Runtime quantization
    .isq(IsqConfig {
        bits: 4,
        method: IsqMethod::AWQ,
        calibration_samples: 128,
    })
    .build();

let mut backend = MistralBackend::new(config)?;
backend.load_model("mistralai/Mistral-7B-Instruct-v0.2", ModelConfig::default())?;

// Generate with PagedAttention + X-LoRA
let response = backend.generate("Write secure authentication code", GenerateParams {
    max_tokens: 512,
    temperature: 0.7,
    ..Default::default()
})?;

When to Use mistral-rs vs Candle

Scenario Recommended Backend Reason
Single user / Edge Candle Simpler, smaller binary
10-100 concurrent users mistral-rs PagedAttention memory efficiency
Multi-task models mistral-rs X-LoRA per-token routing
Runtime quantization mistral-rs ISQ without model re-export
WASM / Browser Candle mistral-rs doesn't support WASM

Feature Flags

# Enable mistral-rs (when available on crates.io)
ruvllm = { version = "2.3", features = ["mistral-rs"] }

# With Metal acceleration (Apple Silicon)
ruvllm = { version = "2.3", features = ["mistral-rs-metal"] }

# With CUDA acceleration (NVIDIA)
ruvllm = { version = "2.3", features = ["mistral-rs-cuda"] }

See ADR-008: mistral-rs Integration for detailed architecture decisions.

Configuration

Environment Variables

Variable Description Default
RUVLLM_CACHE_DIR Model cache directory ~/.cache/ruvllm
RUVLLM_LOG_LEVEL Logging level info
RUVLLM_METAL_DEVICE Metal device index 0
RUVLLM_ANE_ENABLED Enable ANE routing true
RUVLLM_SONA_ENABLED Enable SONA learning true

Model Configuration

let config = ModelConfig {
    max_context: 8192,
    use_flash_attention: true,
    quantization: Quantization::Q4K,
    kv_cache_config: KvCacheConfig::default(),
    rope_scaling: Some(RopeScaling::Linear { factor: 2.0 }),
    sliding_window: Some(4096),
    ..Default::default()
};

Benchmarks

Run benchmarks with:

# Attention benchmarks
cargo bench --bench attention_bench --features inference-metal

# ANE benchmarks (Mac only)
cargo bench --bench ane_bench --features coreml

# LoRA benchmarks
cargo bench --bench lora_bench

# End-to-end inference
cargo bench --bench e2e_bench --features inference-metal

# Metal shader benchmarks
cargo bench --bench metal_bench --features metal-compute

# Serving benchmarks
cargo bench --bench serving_bench --features inference-metal

HuggingFace Hub Integration (v2.3)

Download and upload models to HuggingFace Hub:

use ruvllm::hub::{ModelDownloader, ModelUploader, RuvLtraRegistry, DownloadConfig};

// Download from Hub
let downloader = ModelDownloader::new(DownloadConfig::default());
let model_path = downloader.download(
    "ruvector/ruvltra-small-q4km",
    Some("./models"),
)?;

// Or use the registry for RuvLTRA models
let registry = RuvLtraRegistry::new();
let model = registry.get("ruvltra-medium", "Q4_K_M")?;

// Upload to Hub (requires HF_TOKEN)
let uploader = ModelUploader::new("hf_your_token");
let url = uploader.upload(
    "./my-model.gguf",
    "username/my-ruvltra-model",
    Some(metadata),
)?;
println!("Uploaded to: {}", url);

Task-Specific LoRA Adapters (v2.3)

Pre-trained adapters optimized for Claude Flow agent types:

use ruvllm::lora::{RuvLtraAdapters, AdapterTrainer, AdapterMerger, HotSwapManager};

// Create adapter for specific task
let adapters = RuvLtraAdapters::new();
let coder = adapters.create_lora("coder", 768)?;       // Rank 16, code generation
let security = adapters.create_lora("security", 768)?; // Rank 16, vulnerability detection

// Available adapters:
// - coder:     Rank 16, Alpha 32.0, targets attention (Q,K,V,O)
// - researcher: Rank 8, Alpha 16.0, targets Q,K,V
// - security:  Rank 16, Alpha 32.0, targets attention + MLP
// - architect: Rank 12, Alpha 24.0, targets Q,V + Gate,Up
// - reviewer:  Rank 8, Alpha 16.0, targets Q,V

// Merge adapters for multi-task models
let merger = AdapterMerger::new(MergeConfig::weighted(weights));
let multi_task = merger.merge(&[coder, security], &output_config, 768)?;

// Hot-swap adapters at runtime
let mut manager = HotSwapManager::new();
manager.set_active(coder);
manager.prepare_standby(security);
manager.swap()?; // Zero-downtime switch

Adapter Merging Strategies

Strategy Description Use Case
Average Equal-weight averaging Simple multi-task
WeightedSum User-defined weights Task importance weighting
SLERP Spherical interpolation Smooth transitions
TIES Trim, Elect, Merge Robust multi-adapter
DARE Drop And REscale Sparse merging
TaskArithmetic Add/subtract vectors Task composition

Evaluation Harness (v2.3)

RuvLLM includes a comprehensive evaluation harness for benchmarking model quality:

use ruvllm::evaluation::{RealEvaluationHarness, EvalConfig, AblationMode};

// Create harness with GGUF model
let harness = RealEvaluationHarness::with_gguf(
    "./models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
    EvalConfig::default(),
)?;

// Run single evaluation
let result = harness.evaluate(
    "Fix the null pointer exception in this code",
    "def process(data):\n    return data.split()",
    AblationMode::Full,
)?;

println!("Success: {}, Quality: {:.2}", result.success, result.quality_score);

// Run full ablation study (5 modes)
let report = harness.run_ablation_study(&tasks)?;
for (mode, metrics) in &report.mode_metrics {
    println!("{:?}: {:.1}% success, {:.2} quality",
        mode, metrics.success_rate * 100.0, metrics.avg_quality);
}

Ablation Modes

Mode Description Use Case
Baseline No enhancements Control baseline
RetrievalOnly HNSW pattern retrieval Measure retrieval impact
AdaptersOnly LoRA adapters Measure adaptation impact
RetrievalPlusAdapters HNSW + LoRA Combined without SONA
Full All systems (SONA + HNSW + LoRA) Production mode

SWE-Bench Task Loader

use ruvllm::evaluation::swe_bench::SweBenchLoader;

// Load SWE-Bench tasks
let loader = SweBenchLoader::new();
let tasks = loader.load_subset("lite", 50)?; // 50 tasks from lite subset

for task in &tasks {
    println!("Instance: {}", task.instance_id);
    println!("Problem: {}", task.problem_statement);
}

CLI Evaluation

# Run evaluation with default settings
cargo run --example run_eval --features async-runtime -- \
    --model ./models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf

# Run SWE-Bench subset
cargo run --example run_eval --features async-runtime -- \
    --model ./models/model.gguf \
    --swe-bench-path ./data/swe-bench \
    --subset lite \
    --max-tasks 100

# Output report
cargo run --example run_eval --features async-runtime -- \
    --model ./models/model.gguf \
    --output ./reports/eval-report.json

HNSW Auto-Dimension Detection

The evaluation harness automatically detects model embedding dimensions:

// HNSW router automatically uses model's hidden_size
// TinyLlama 1.1B → 2048 dimensions
// Qwen2 0.5B → 896 dimensions
// RuvLTRA-Small → 896 dimensions
// RuvLTRA-Medium → 2560 dimensions

let harness = RealEvaluationHarness::with_config(
    EvalConfig::default(),
    RealInferenceConfig {
        enable_hnsw: true,
        hnsw_config: None, // Auto-detect from model
        ..Default::default()
    },
)?;

Examples

See the /examples directory for:

  • download_test_model.rs - Download and validate models
  • benchmark_model.rs - Full inference benchmarking
  • run_eval.rs - Run evaluation harness with SWE-Bench
  • Basic inference
  • Streaming generation
  • MicroLoRA adaptation
  • Multi-turn chat
  • Speculative decoding
  • Continuous batching
  • ANE hybrid inference

Error Handling

use ruvllm::error::{Result, RuvLLMError};

match backend.generate(prompt, params) {
    Ok(response) => println!("{}", response),
    Err(RuvLLMError::Model(e)) => eprintln!("Model error: {}", e),
    Err(RuvLLMError::OutOfMemory(e)) => eprintln!("OOM: {}", e),
    Err(RuvLLMError::Generation(e)) => eprintln!("Generation failed: {}", e),
    Err(RuvLLMError::Ane(e)) => eprintln!("ANE error: {}", e),
    Err(RuvLLMError::Gguf(e)) => eprintln!("GGUF loading error: {}", e),
    Err(e) => eprintln!("Error: {}", e),
}

npm Package

RuvLLM is also available as an npm package with native bindings:

npm install @ruvector/ruvllm
import { RuvLLM } from '@ruvector/ruvllm';

const llm = new RuvLLM();
const response = llm.query('Explain quantum computing');
console.log(response.text);

See @ruvector/ruvllm on npm for full documentation.

License

Apache-2.0 / MIT dual license.

Contributing

Contributions welcome! Please see CONTRIBUTING.md for guidelines.

Links


Part of RuVector -- the self-learning vector database.