RuvLTRA-Medium is a 3-billion-parameter language model based on the Qwen2.5-3B-Instruct architecture, extended with SONA continuous-learning hooks and optimized for Apple Silicon and modern GPU acceleration.
| Parameter | Value | Description |
|---|---|---|
| Total Parameters | ~3.0B | Full model size |
| Hidden Size | 2048 | Embedding dimension |
| Layers | 32 | Transformer decoder layers |
| Attention Heads | 16 | Query heads |
| KV Heads | 2 | Key-value heads (GQA) |
| GQA Ratio | 8:1 | Grouped Query Attention ratio |
| Head Dimension | 128 | Per-head dimension |
| Intermediate Size | 11008 | MLP hidden dimension |
| Vocabulary Size | 151936 | Qwen tokenizer |
| Context Length | 32768 | Maximum sequence length |
| RoPE Theta | 1,000,000 | RoPE base frequency |
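
The derived entries are consistent with the primary ones; a quick check using only the numbers from the table above:

```rust
// Derived architecture quantities from the table above.
let head_dim = 2048 / 16;  // hidden size / query heads = 128
let gqa_ratio = 16 / 2;    // query heads / KV heads = 8 (8:1 GQA)
assert_eq!(head_dim, 128);
assert_eq!(gqa_ratio, 8);
```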

Quantization formats:

| Format | Model Size | Quality | Speed | Recommended Use |
|---|---|---|---|---|
| Q4_K_M | ~2.0 GB | Good | Fast | Production inference |
| Q5_K_M | ~2.5 GB | Better | Medium | Balanced quality/speed |
| Q8_0 | ~3.5 GB | Best | Slower | Maximum quality |
| Mixed | ~2.8 GB | Excellent | Medium | FP16 attn + Q4 MLP |
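
The right file is mostly a function of the memory budget. A minimal selection sketch, assuming the file names above; the helper below is illustrative, not part of the ruvllm API:

```rust
// Illustrative helper: pick a quantization by available memory budget.
// File names follow the table above; not part of the ruvllm API.
fn pick_gguf(available_gb: f32) -> &'static str {
    if available_gb >= 3.5 {
        "ruvltra-medium-q8_0.gguf"   // maximum quality
    } else if available_gb >= 2.5 {
        "ruvltra-medium-q5_k_m.gguf" // balanced quality/speed
    } else {
        "ruvltra-medium-q4_k_m.gguf" // fast production default
    }
}
```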

Base variant: general-purpose model for diverse tasks.
Configuration:
```rust
let config = RuvLtraMediumConfig::base();
```

Characteristics:
- Temperature: 0.7
- Top-p: 0.9
- SONA hooks: Layers 8, 16, 24
- Pattern capacity: 50,000
Use Cases:
- General conversation
- Text completion
- Summarization
- Question answering

Coder variant: optimized for code generation and analysis.
Configuration:
```rust
let config = RuvLtraMediumConfig::coder();
```

Characteristics:
- Temperature: 0.2 (near-deterministic)
- Top-p: 0.95
- SONA hooks: Layers 8, 16, 24, 28 (extra late-layer)
- Pattern capacity: 100,000
- Quality threshold: 0.7 (stricter)
Use Cases:
- Code completion
- Bug fixing
- Code refactoring
- API generation

Agent variant: routing and planning optimized for agent systems.
Configuration:
```rust
let config = RuvLtraMediumConfig::agent();
```

Characteristics:
- Temperature: 0.3
- Top-p: 0.85
- SONA hooks: Layers 8, 16, 24
- HNSW M: 32 (higher connectivity)
- HNSW ef_construction: 400
- MicroLoRA rank: 2 (low latency)
Use Cases:
- Claude Flow agent routing
- Task planning
- Decision making
- Multi-agent coordination
SONA (Self-Optimizing Neural Architecture) hooks enable continuous learning during inference.
Hook Layers:
- Layer 8: Early pattern recognition (shallow semantics)
- Layer 16: Mid-layer semantic extraction (concepts)
- Layer 24: Deep reasoning capture (abstract thinking)
Implementation:
```rust
let config = RuvLtraMediumConfig::base();
let mut model = RuvLtraMediumModel::new(&config)?;

// Enable custom hook layers
model.enable_sona_with_hooks(&[8, 16, 24])?;
```

Learning loops:
- Instant Loop: Ring buffer with MicroLoRA (rank 4)
- Background Loop: Router training with EWC++ Fisher
- Deep Loop: Pattern bank consolidation
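
To make the instant loop concrete, the sketch below spells out the MicroLoRA arithmetic: a rank-4 update `B * A`, scaled and merged into a base weight matrix. The shapes and scaling factor are illustrative assumptions, not the ruvllm implementation.

```rust
// Illustrative MicroLoRA (rank-4) update: W' = W + alpha * B * A.
// W is d_out x d_in; A (r x d_in) and B (d_out x r) are the small
// trainable factors kept in the instant-loop ring buffer.
fn apply_micro_lora(
    w: &mut Vec<Vec<f32>>, // base weight, updated in place (d_out x d_in)
    a: &[Vec<f32>],        // r x d_in
    b: &[Vec<f32>],        // d_out x r
    alpha: f32,            // scaling factor
) {
    let r = a.len();
    for (i, row) in w.iter_mut().enumerate() {
        for (j, wij) in row.iter_mut().enumerate() {
            let mut delta = 0.0;
            for k in 0..r {
                delta += b[i][k] * a[k][j]; // (B * A)[i][j]
            }
            *wij += alpha * delta;
        }
    }
}
```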
HNSW (Hierarchical Navigable Small World) enables fast agent routing.
Configuration:
```rust
let config = RuvLtraMediumConfig::agent();
assert_eq!(config.sona_hooks.hnsw_m, 32);
assert_eq!(config.sona_hooks.hnsw_ef_construction, 400);
```

Performance:
- Search: 150x-12,500x faster than brute-force
- Insertion: O(log n) complexity
- Memory: ~4 bytes per node per connection
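
A back-of-the-envelope memory estimate follows directly from the figures above (graph links only; embeddings are stored separately):

```rust
// HNSW graph-memory estimate: ~4 bytes per node per connection,
// with M = 32 connections per node at the base layer.
fn hnsw_graph_bytes(num_nodes: usize, m: usize) -> usize {
    num_nodes * m * 4
}

// e.g. 100,000 routing entries at M = 32 -> ~12.8 MB of link storage
assert_eq!(hnsw_graph_bytes(100_000, 32), 12_800_000);
```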
Integration with Claude Flow for intelligent task routing.
Features:
- Agent type classification
- Task complexity estimation
- Quality prediction
- Trajectory recording
Usage:
```rust
let mut config = RuvLtraMediumConfig::agent();
config.enable_agent_routing = true;
let model = RuvLtraMediumModel::new(&config)?;

// Model automatically records trajectories for routing
```

The reasoning bank stores successful reasoning patterns for future retrieval.
Storage Format:
- State-action pairs
- Quality scores (0.0-1.0)
- Contextual embeddings
- Temporal metadata
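
The listed fields map naturally onto a record like the following; the names and types are illustrative assumptions, not the actual ruvllm definition:

```rust
// Hypothetical shape of a stored reasoning pattern, mirroring the
// storage format above (not the actual ruvllm type).
struct ReasoningPattern {
    state: String,       // serialized state of the state-action pair
    action: String,      // action taken from that state
    quality: f32,        // quality score in 0.0..=1.0
    embedding: Vec<f32>, // contextual embedding used for retrieval
    timestamp_ms: u64,   // temporal metadata
}
```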
Configuration:
```rust
let mut config = RuvLtraMediumConfig::base();
config.enable_reasoning_bank = true;
config.sona_config.pattern_capacity = 50000;
```

PagedAttention provides efficient memory management for attention computation.
Block Size: 64 tokens per page
Benefits:
- 40-60% memory reduction
- Dynamic sequence handling
- Copy-on-write semantics
- Efficient prefix caching
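
The savings come from allocating KV-cache pages on demand rather than reserving the full context up front. A minimal sketch of the page arithmetic, assuming only the 64-token page size above:

```rust
// With 64-token pages, a sequence only pays for the pages it touches.
const PAGE_SIZE: usize = 64;

fn pages_needed(seq_len: usize) -> usize {
    (seq_len + PAGE_SIZE - 1) / PAGE_SIZE // ceiling division
}

// A 1,000-token prompt needs 16 pages (1,024 token slots) instead of a
// worst-case reservation for the full 32K context.
assert_eq!(pages_needed(1_000), 16);
```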
Configuration:
```rust
let config = RuvLtraMediumConfig::base();
assert!(config.use_paged_attention);
assert_eq!(config.paged_config.page_size, 64);
```

An optimized attention kernel provides a 2.49x-7.47x speedup.
Algorithm:
- Tiled computation
- Recomputation on-the-fly
- IO-aware optimization
- Causal masking
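
The tiling works because softmax can be computed in a streaming fashion, rescaling partial results as each tile arrives, so the full attention matrix is never materialized. A minimal single-query sketch of that online-softmax core (scalar values per position; independent of the actual kernel):

```rust
// Streaming softmax-weighted sum over attention scores arriving in tiles.
// The running max `m` and normalizer `l` are rescaled as new tiles arrive,
// which is the numerical core of flash-style attention.
fn online_softmax_weighted_sum(score_tiles: &[Vec<f32>], value_tiles: &[Vec<f32>]) -> f32 {
    let (mut m, mut l, mut acc) = (f32::NEG_INFINITY, 0.0f32, 0.0f32);
    for (scores, values) in score_tiles.iter().zip(value_tiles) {
        for (&s, &v) in scores.iter().zip(values) {
            let m_new = m.max(s);
            let scale = (m - m_new).exp(); // rescale previous partial sums
            l = l * scale + (s - m_new).exp();
            acc = acc * scale + (s - m_new).exp() * v;
            m = m_new;
        }
    }
    acc / l
}
```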
Performance:
| Sequence Length | Speedup | Memory Savings |
|---|---|---|
| 2K tokens | 2.5x | 30% |
| 8K tokens | 4.2x | 50% |
| 32K tokens | 7.1x | 70% |
Speculative decoding uses RuvLTRA-Small (0.5B) as a draft model for a 2-3x speedup.
Configuration:
```rust
let mut config = RuvLtraMediumConfig::base();
config.use_speculative_decoding = true;
config.speculative_config.lookahead = 4;
config.draft_model_path = Some("models/ruvltra-small-q4.gguf".into());
```

Parameters:
- Lookahead: 4 tokens (default)
- Acceptance threshold: 0.7
- Draft temperature: 0.0 (greedy)
- Adaptive lookahead: enabled
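
Conceptually, the draft model proposes `lookahead` tokens greedily, the target model scores them in one batched forward pass, and the first disagreement truncates the accepted prefix. The sketch below shows a simplified greedy-verification rule; it is not the full rejection-sampling acceptance test, and the helper is not part of the ruvllm API.

```rust
// Simplified greedy-verification variant of speculative decoding:
// accept draft tokens until the target model disagrees, then take the
// target model's own token at the first mismatch.
fn verify_draft(draft_tokens: &[u32], target_argmax: &[u32]) -> Vec<u32> {
    let mut accepted = Vec::new();
    for (i, &d) in draft_tokens.iter().enumerate() {
        if target_argmax[i] == d {
            accepted.push(d); // target agrees: keep the draft token
        } else {
            accepted.push(target_argmax[i]); // first mismatch: use target's token
            break;
        }
    }
    accepted
}
```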
Expected Speedup:
| Temperature | Speedup |
|---|---|
| 0.0 (greedy) | 2.8-3.2x |
| 0.5 | 2.2-2.6x |
| 1.0 | 1.5-1.8x |
Basic inference:

```rust
use ruvllm::models::ruvltra_medium::{RuvLtraMediumConfig, RuvLtraMediumModel};

// Create model
let config = RuvLtraMediumConfig::base();
let mut model = RuvLtraMediumModel::new(&config)?;

// Tokenize input
let input_ids = vec![151643, 9521, 11, 1917]; // "Hello, world"
let positions = (0..input_ids.len()).collect::<Vec<_>>();

// Run inference
let logits = model.forward(&input_ids, &positions)?;

// Get next token (argmax over the final position's vocab-sized logit slice; helper not shown)
let next_token = argmax(&logits[logits.len() - config.vocab_size..]);
```

Code generation (Coder variant):

```rust
let config = RuvLtraMediumConfig::coder();
let mut model = RuvLtraMediumModel::new(&config)?;
// Enable SONA hooks for learning
model.enable_sona_with_hooks(&[8, 16, 24, 28])?;
// Generate code
let prompt = "fn fibonacci(n: u32) -> u32 {";
let output = model.generate(prompt, GenerateParams {
    max_tokens: 256,
    temperature: 0.2,
    top_p: 0.95,
    ..Default::default()
})?;
```

Agent routing:

```rust
let config = RuvLtraMediumConfig::agent();
let model = RuvLtraMediumModel::new(&config)?;
// Enable Claude Flow integration
assert!(config.enable_agent_routing);
// Model automatically:
// - Records trajectories
// - Updates HNSW index
// - Learns routing patterns
```

Speculative decoding:

```rust
let mut config = RuvLtraMediumConfig::base();
config.use_speculative_decoding = true;
config.draft_model_path = Some("ruvltra-small-q4.gguf".into());
let model = RuvLtraMediumModel::new(&config)?;
// 2-3x faster generation
let output = model.generate("Once upon a time", params)?;use ruvllm::gguf::loader::GGUFLoader;
let loader = GGUFLoader::new("ruvltra-medium-q4_k_m.gguf")?;
let model = loader.load_ruvltra_medium()?;
```

```bash
# Download pre-quantized models
wget https://huggingface.co/ruvector/ruvltra-medium-q4_k_m-gguf
wget https://huggingface.co/ruvector/ruvltra-medium-q5_k_m-gguf
wget https://huggingface.co/ruvector/ruvltra-medium-q8_0-gguf
# Or quantize yourself
cargo run --release --bin quantize -- \
    --model qwen2.5-3b-instruct \
    --output ruvltra-medium-q4_k_m.gguf \
    --format q4_k_m
```

Performance:

| Configuration | Tokens/sec | Memory | Power |
|---|---|---|---|
| Base Q4_K_M | 68 tok/s | 2.2 GB | 12W |
| Base Q5_K_M | 55 tok/s | 2.7 GB | 14W |
| Base Q8_0 | 42 tok/s | 3.8 GB | 16W |
| Coder Q4_K_M | 65 tok/s | 2.4 GB | 13W |
| Agent Q4_K_M | 72 tok/s | 2.1 GB | 11W |
| + Speculative | 158 tok/s | 2.8 GB | 15W |

Benchmarks:

| Benchmark | Base | Coder | Agent |
|---|---|---|---|
| MMLU | 68.2% | 66.8% | 64.5% |
| HumanEval | 52.4% | 61.7% | 48.9% |
| GSM8K | 71.3% | 69.8% | 73.6% |
| TruthfulQA | 45.8% | 44.2% | 47.1% |
Claude Flow agent routing:

```rust
use ruvllm::models::ruvltra_medium::{RuvLtraMediumConfig, RuvLtraMediumModel};
use ruvllm::claude_flow::AgentRouter;
let config = RuvLtraMediumConfig::agent();
let model = RuvLtraMediumModel::new(&config)?;
// Router uses model embeddings for task classification
let router = AgentRouter::new(model.sona().unwrap());
// Route task to optimal agent
let task = "Implement authentication system";
let agent = router.route(task)?; // Returns: "coder" or "security-architect"
```

Recording a trajectory:

```rust
use ruvllm::sona::Trajectory;
// Create trajectory
let mut trajectory = Trajectory::new("code-generation");
trajectory.add_state(initial_state);
trajectory.add_action("generate_function", quality_score);
// Record in model
model.sona()
    .unwrap()
    .write()
    .record_trajectory(trajectory)?;
```

Limitations:
- Context Window: 32K tokens (not extensible without retraining)
- SONA Hooks: Limited to 4 hooks due to memory overhead
- Speculative Decoding: Requires separate draft model
- Quantization: Q4/Q5 may degrade quality by 2-3%
- Hardware: Optimized for Apple Silicon; GPU acceleration recommended

Roadmap:
- RuvLTRA-Medium-Vision (multimodal)
- Context extension to 128K tokens
- Mixture-of-Experts (MoE) variant
- On-device fine-tuning
- Distillation to RuvLTRA-Small