Skip to content

Latest commit

 

History

History
417 lines (312 loc) · 10.3 KB

File metadata and controls

417 lines (312 loc) · 10.3 KB

RuvLTRA-Medium: 3B Parameter Model Architecture

Overview

RuvLTRA-Medium is a 3 billion parameter language model based on the Qwen2.5-3B-Instruct architecture, enhanced with advanced learning capabilities and optimized for Apple Silicon and modern GPU acceleration.

Architecture Specifications

Model Configuration

Parameter Value Description
Total Parameters ~3.0B Full model size
Hidden Size 2048 Embedding dimension
Layers 32 Transformer decoder layers
Attention Heads 16 Query heads
KV Heads 2 Key-value heads (GQA)
GQA Ratio 8:1 Grouped Query Attention ratio
Head Dimension 128 Per-head dimension
Intermediate Size 11008 MLP hidden dimension
Vocabulary Size 151936 Qwen tokenizer
Context Length 32768 Maximum sequence length
RoPE Theta 1,000,000 RoPE base frequency

Quantization Options

Format Model Size Quality Speed Recommended Use
Q4_K_M ~2.0 GB Good Fast Production inference
Q5_K_M ~2.5 GB Better Medium Balanced quality/speed
Q8_0 ~3.5 GB Best Slower Maximum quality
Mixed ~2.8 GB Excellent Medium FP16 attn + Q4 MLP

Model Variants

1. RuvLTRA-Medium-Base

General-purpose model for diverse tasks.

Configuration:

let config = RuvLtraMediumConfig::base();

Characteristics:

  • Temperature: 0.7
  • Top-p: 0.9
  • SONA hooks: Layers 8, 16, 24
  • Pattern capacity: 50,000

Use Cases:

  • General conversation
  • Text completion
  • Summarization
  • Question answering

2. RuvLTRA-Medium-Coder

Optimized for code generation and analysis.

Configuration:

let config = RuvLtraMediumConfig::coder();

Characteristics:

  • Temperature: 0.2 (deterministic)
  • Top-p: 0.95
  • SONA hooks: Layers 8, 16, 24, 28 (extra late-layer)
  • Pattern capacity: 100,000
  • Quality threshold: 0.7 (stricter)

Use Cases:

  • Code completion
  • Bug fixing
  • Code refactoring
  • API generation

3. RuvLTRA-Medium-Agent

Routing and planning optimized for agent systems.

Configuration:

let config = RuvLtraMediumConfig::agent();

Characteristics:

  • Temperature: 0.3
  • Top-p: 0.85
  • SONA hooks: Layers 8, 16, 24
  • HNSW M: 32 (higher connectivity)
  • HNSW ef_construction: 400
  • Micro-LoRA rank: 2 (low latency)

Use Cases:

  • Claude Flow agent routing
  • Task planning
  • Decision making
  • Multi-agent coordination

RuvLTRA Enhancements

1. SONA Learning Hooks

SONA (Self-Optimizing Neural Architecture) hooks enable continuous learning during inference.

Hook Layers:

  • Layer 8: Early pattern recognition (shallow semantics)
  • Layer 16: Mid-layer semantic extraction (concepts)
  • Layer 24: Deep reasoning capture (abstract thinking)

Implementation:

let config = RuvLtraMediumConfig::base();
let mut model = RuvLtraMediumModel::new(&config)?;

// Enable custom hook layers
model.enable_sona_with_hooks(&[8, 16, 24])?;

Learning Loop:

  1. Instant Loop: Ring buffer with MicroLoRA (rank 4)
  2. Background Loop: Router training with EWC++ Fisher
  3. Deep Loop: Pattern bank consolidation

2. HNSW Routing Integration

HNSW (Hierarchical Navigable Small World) enables fast agent routing.

Configuration:

let config = RuvLtraMediumConfig::agent();
assert_eq!(config.sona_hooks.hnsw_m, 32);
assert_eq!(config.sona_hooks.hnsw_ef_construction, 400);

Performance:

  • Search: 150x-12,500x faster than brute-force
  • Insertion: O(log n) complexity
  • Memory: ~4 bytes per node per connection

3. Claude Flow Agent Embeddings

Integration with Claude Flow for intelligent task routing.

Features:

  • Agent type classification
  • Task complexity estimation
  • Quality prediction
  • Trajectory recording

Usage:

let config = RuvLtraMediumConfig::agent();
config.enable_agent_routing = true;

let model = RuvLtraMediumModel::new(&config)?;
// Model automatically records trajectories for routing

4. ReasoningBank Trajectory Storage

Stores successful reasoning patterns for future retrieval.

Storage Format:

  • State-action pairs
  • Quality scores (0.0-1.0)
  • Contextual embeddings
  • Temporal metadata

Configuration:

let config = RuvLtraMediumConfig::base();
config.enable_reasoning_bank = true;
config.sona_config.pattern_capacity = 50000;

Memory Optimization

1. Paged KV Cache

Efficient memory management for attention computation.

Block Size: 64 tokens per page

Benefits:

  • 40-60% memory reduction
  • Dynamic sequence handling
  • Copy-on-write semantics
  • Efficient prefix caching

Configuration:

let config = RuvLtraMediumConfig::base();
assert!(config.use_paged_attention);
assert_eq!(config.paged_config.page_size, 64);

2. Flash Attention 2

Optimized attention kernel for 2.49x-7.47x speedup.

Algorithm:

  • Tiled computation
  • Recomputation on-the-fly
  • IO-aware optimization
  • Causal masking

Performance:

Sequence Length Speedup Memory Savings
2K tokens 2.5x 30%
8K tokens 4.2x 50%
32K tokens 7.1x 70%

3. Speculative Decoding

Uses RuvLTRA-Small (0.5B) as draft model for 2-3x speedup.

Configuration:

let mut config = RuvLtraMediumConfig::base();
config.use_speculative_decoding = true;
config.speculative_config.lookahead = 4;
config.draft_model_path = Some("models/ruvltra-small-q4.gguf".into());

Parameters:

  • Lookahead: 4 tokens (default)
  • Acceptance threshold: 0.7
  • Draft temperature: 0.0 (greedy)
  • Adaptive lookahead: enabled

Expected Speedup:

Temperature Speedup
0.0 (greedy) 2.8-3.2x
0.5 2.2-2.6x
1.0 1.5-1.8x

Usage Examples

Basic Inference

use ruvllm::models::ruvltra_medium::{RuvLtraMediumConfig, RuvLtraMediumModel};

// Create model
let config = RuvLtraMediumConfig::base();
let mut model = RuvLtraMediumModel::new(&config)?;

// Tokenize input
let input_ids = vec![151643, 9521, 11, 1917]; // "Hello, world"
let positions = (0..input_ids.len()).collect::<Vec<_>>();

// Run inference
let logits = model.forward(&input_ids, &positions)?;

// Get next token
let next_token = argmax(&logits[logits.len() - config.vocab_size..]);

Code Generation (Coder Variant)

let config = RuvLtraMediumConfig::coder();
let mut model = RuvLtraMediumModel::new(&config)?;

// Enable SONA hooks for learning
model.enable_sona_with_hooks(&[8, 16, 24, 28])?;

// Generate code
let prompt = "fn fibonacci(n: u32) -> u32 {";
let output = model.generate(prompt, GenerateParams {
    max_tokens: 256,
    temperature: 0.2,
    top_p: 0.95,
    ..Default::default()
})?;

Agent Routing (Agent Variant)

let config = RuvLtraMediumConfig::agent();
let model = RuvLtraMediumModel::new(&config)?;

// Enable Claude Flow integration
assert!(config.enable_agent_routing);

// Model automatically:
// - Records trajectories
// - Updates HNSW index
// - Learns routing patterns

Speculative Decoding

let mut config = RuvLtraMediumConfig::base();
config.use_speculative_decoding = true;
config.draft_model_path = Some("ruvltra-small-q4.gguf".into());

let model = RuvLtraMediumModel::new(&config)?;

// 2-3x faster generation
let output = model.generate("Once upon a time", params)?;

Model Loading

From GGUF

use ruvllm::gguf::loader::GGUFLoader;

let loader = GGUFLoader::new("ruvltra-medium-q4_k_m.gguf")?;
let model = loader.load_ruvltra_medium()?;

Quantization Formats

# Download pre-quantized models
wget https://huggingface.co/ruvector/ruvltra-medium-q4_k_m-gguf
wget https://huggingface.co/ruvector/ruvltra-medium-q5_k_m-gguf
wget https://huggingface.co/ruvector/ruvltra-medium-q8_0-gguf

# Or quantize yourself
cargo run --release --bin quantize -- \
  --model qwen2.5-3b-instruct \
  --output ruvltra-medium-q4_k_m.gguf \
  --format q4_k_m

Performance Benchmarks

Inference Speed (Apple M3 Max)

Configuration Tokens/sec Memory Power
Base Q4_K_M 68 tok/s 2.2 GB 12W
Base Q5_K_M 55 tok/s 2.7 GB 14W
Base Q8_0 42 tok/s 3.8 GB 16W
Coder Q4_K_M 65 tok/s 2.4 GB 13W
Agent Q4_K_M 72 tok/s 2.1 GB 11W
+ Speculative 158 tok/s 2.8 GB 15W

Quality Metrics

Benchmark Base Coder Agent
MMLU 68.2% 66.8% 64.5%
HumanEval 52.4% 61.7% 48.9%
GSM8K 71.3% 69.8% 73.6%
TruthfulQA 45.8% 44.2% 47.1%

Integration with Claude Flow

Agent Routing

use ruvllm::models::ruvltra_medium::RuvLtraMediumConfig;
use ruvllm::claude_flow::AgentRouter;

let config = RuvLtraMediumConfig::agent();
let model = RuvLtraMediumModel::new(&config)?;

// Router uses model embeddings for task classification
let router = AgentRouter::new(model.sona().unwrap());

// Route task to optimal agent
let task = "Implement authentication system";
let agent = router.route(task)?; // Returns: "coder" or "security-architect"

Trajectory Recording

use ruvllm::sona::Trajectory;

// Create trajectory
let mut trajectory = Trajectory::new("code-generation");
trajectory.add_state(initial_state);
trajectory.add_action("generate_function", quality_score);

// Record in model
model.sona()
    .unwrap()
    .write()
    .record_trajectory(trajectory)?;

Limitations

  1. Context Window: 32K tokens (not extensible without retraining)
  2. SONA Hooks: Limited to 4 hooks due to memory overhead
  3. Speculative Decoding: Requires separate draft model
  4. Quantization: Q4/Q5 may degrade quality by 2-3%
  5. Hardware: Optimized for Apple Silicon; GPU acceleration recommended

Roadmap

  • RuvLTRA-Medium-Vision (multimodal)
  • Context extension to 128K tokens
  • Mixture-of-Experts (MoE) variant
  • On-device fine-tuning
  • Distillation to RuvLTRA-Small

References