Status: Proposed
Date: 2026-01-20
Decision Makers: Ruvector Architecture Team
Technical Area: LLM Inference Engine / Production Serving
RuvLLM v2.3 includes a stub MistralBackend implementation at crates/ruvllm/src/backends/mistral_backend.rs that defines the interface for high-performance LLM inference but lacks actual integration with the mistral-rs crate. The current Candle backend is optimized for single-user and edge deployment scenarios, but production-scale serving requires advanced memory management and multi-tenant capabilities.
The existing MistralBackend stub provides:
- Configuration structures for PagedAttention, X-LoRA, and ISQ
- `XLoraManager` with adapter loading/routing logic (placeholder)
- `MistralBackendConfig` with builder pattern for Metal/CUDA targets
- Integration hooks for the `LlmBackend` trait
However, the implementation is non-functional:
- No actual mistral-rs crate dependency
- Token generation returns placeholder values
- Model loading is not wired to the inference pipeline
- PagedAttention uses RuvLLM's internal implementation, not mistral-rs's optimized version
- Concurrent User Scaling: Candle backend is optimized for single-user inference; production servers need 10-100+ concurrent requests
- KV Cache Memory Pressure: Without vLLM-style paging, long-context sessions exhaust GPU memory
- Multi-Task Models: Switching LoRA adapters incurs per-request overhead; X-LoRA instead routes between adapters per token
- Deployment Flexibility: Models should be quantized at runtime based on available hardware
- Concurrent sessions: 50-100 simultaneous inference requests
- Memory efficiency: 5-10x improvement in KV cache utilization
- Adapter latency: <1ms overhead for X-LoRA routing decisions
- Quantization: Runtime ISQ without model re-export
- Existing interface: Must implement `LlmBackend` trait seamlessly
- Feature isolation: Optional dependency with feature flags
- Backend selection: Runtime choice between Candle and mistral-rs
- Apple Silicon: Metal acceleration via `mistral-rs-metal`
- NVIDIA GPUs: CUDA acceleration via `mistral-rs-cuda`
- CPU fallback: Pure Rust path for edge/WASM targets
Vendor mistral-rs source code directly into RuvLLM.
Pros:
- Full control over API surface
- No external dependency versioning
- Can customize for RuvLLM's needs
Cons:
- Maintenance burden of tracking upstream
- Miss upstream optimizations and fixes
- Duplicated effort
Add mistral-rs as an optional dependency behind feature flags, wiring the existing MistralBackend interface to the actual mistral-rs crate.
Pros:
- Leverage upstream development
- Clean separation via features
- Users choose their backend at compile time
- Smaller binary for edge deployments (Candle-only)
Cons:
- API surface depends on upstream stability
- Two codepaths to maintain
- Feature matrix complexity
Use dynamic dispatch to select backend at runtime via configuration.
Pros:
- Single binary for all deployments
- Runtime flexibility
Cons:
- Binary size includes all backends
- Dynamic dispatch overhead
- Complex testing matrix
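For comparison with the chosen option, runtime selection via dynamic dispatch could look roughly like the sketch below. The `LlmBackend` trait here is a simplified, hypothetical stand-in (the real RuvLLM trait has more methods and different signatures), and `select_backend` and the configuration strings are assumptions made for illustration.

```rust
/// Simplified, hypothetical stand-in for RuvLLM's `LlmBackend` trait.
pub trait LlmBackend {
    fn name(&self) -> &'static str;
    fn generate(&self, prompt: &str) -> String;
}

pub struct CandleBackend;
pub struct MistralBackend;

impl LlmBackend for CandleBackend {
    fn name(&self) -> &'static str {
        "candle"
    }
    fn generate(&self, prompt: &str) -> String {
        // Placeholder output; a real backend would run inference here.
        format!("[candle] {}", prompt)
    }
}

impl LlmBackend for MistralBackend {
    fn name(&self) -> &'static str {
        "mistral-rs"
    }
    fn generate(&self, prompt: &str) -> String {
        // Placeholder output; a real backend would run inference here.
        format!("[mistral-rs] {}", prompt)
    }
}

/// Pick a backend from a runtime configuration string.
/// Both backends are compiled in, which is why Option C grows the binary.
pub fn select_backend(kind: &str) -> Result<Box<dyn LlmBackend>, String> {
    match kind {
        "candle" => Ok(Box::new(CandleBackend)),
        "mistral-rs" => Ok(Box::new(MistralBackend)),
        other => Err(format!("unknown backend: {}", other)),
    }
}
```

Every call through `Box<dyn LlmBackend>` goes through a vtable, which is the dispatch overhead listed among the cons; for LLM inference that cost is likely small next to the model's compute, so binary size and the testing matrix are probably the weightier concerns.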
Chosen Option: Option B - Optional Dependency with Feature Flags
Add mistral-rs as an optional dependency with three feature flags, wiring the existing MistralBackend stub to the actual mistral-rs implementation.
- Separation of concerns: Edge deployments use Candle (no mistral-rs dependency); server deployments enable mistral-rs features
- Upstream leverage: mistral-rs team maintains PagedAttention, X-LoRA, ISQ implementations
- Existing interface: The `MistralBackend` stub already defines the API; we wire it to the real implementation
- Incremental adoption: Users can migrate from Candle to mistral-rs backend per-deployment
# Cargo.toml additions
[features]
default = ["candle-backend"]
# Base mistral-rs integration
mistral-rs = ["dep:mistralrs", "dep:mistralrs-core"]
# Apple Silicon Metal acceleration
mistral-rs-metal = ["mistral-rs", "mistralrs/metal"]
# NVIDIA CUDA acceleration
mistral-rs-cuda = ["mistral-rs", "mistralrs/cuda"]
[dependencies]
# Optional mistral-rs integration
mistralrs = { version = "0.3", optional = true }
mistralrs-core = { version = "0.3", optional = true }

| Feature | Candle | mistral-rs | mistral-rs-metal | mistral-rs-cuda |
|---|---|---|---|---|
| Single-user inference | Yes | Yes | Yes | Yes |
| PagedAttention | No | Yes | Yes | Yes |
| X-LoRA | No | Yes | Yes | Yes |
| ISQ | No | Yes | Yes | Yes |
| Metal acceleration | Yes | No | Yes | No |
| CUDA acceleration | Partial | No | No | Yes |
| WASM support | Yes | No | No | No |
| Binary size | ~15MB | ~45MB | ~50MB | ~60MB |
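As a rough illustration of how the feature matrix above might surface in code, the sketch below resolves the compiled-in backend with `cfg!`. The `BackendKind` enum and both helper functions are hypothetical names introduced for this example; only the feature names come from the Cargo.toml additions.

```rust
/// Backends a given build could expose (hypothetical enum for illustration).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BackendKind {
    Candle,
    MistralRs,
    MistralRsMetal,
    MistralRsCuda,
}

/// Report the most capable backend that was compiled in.
/// `cfg!` expands to a compile-time boolean, so the result is fixed per build.
pub fn compiled_backend() -> BackendKind {
    if cfg!(feature = "mistral-rs-cuda") {
        BackendKind::MistralRsCuda
    } else if cfg!(feature = "mistral-rs-metal") {
        BackendKind::MistralRsMetal
    } else if cfg!(feature = "mistral-rs") {
        BackendKind::MistralRs
    } else {
        BackendKind::Candle
    }
}

/// PagedAttention row of the matrix: available on every mistral-rs variant.
pub fn supports_paged_attention(kind: BackendKind) -> bool {
    !matches!(kind, BackendKind::Candle)
}

/// WASM row of the matrix: only the Candle backend targets WASM.
pub fn supports_wasm(kind: BackendKind) -> bool {
    matches!(kind, BackendKind::Candle)
}
```

A build with no extra features resolves to `BackendKind::Candle`, which matches the edge/WASM default; enabling `mistral-rs-cuda` also enables `mistral-rs` because the feature lists it as a dependency.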
+-----------------------------------------------------------------------+
|                  MISTRAL-RS INTEGRATION ARCHITECTURE                  |
+-----------------------------------------------------------------------+
|                                                                       |
| +-------------------+     +-------------------+     +--------------+  |
| | MistralBackend    |     | mistralrs::Model  |     | Hardware     |  |
| | (RuvLLM adapter)  |     | (inference core)  |     | Accelerator  |  |
| |                   |     |                   |     |              |  |
| | - Config mapping  |---->| - PagedAttention  |---->| - Metal      |  |
| | - Trait impl      |     | - X-LoRA routing  |     | - CUDA       |  |
| | - Error handling  |     | - ISQ runtime     |     | - CPU        |  |
| +--------+----------+     +---------+---------+     +------+-------+  |
|          |                          |                      |          |
|          v                          v                      v          |
| +--------+----------+     +---------+---------+     +------+-------+  |
| | LlmBackend trait  |     | KV Cache Pool     |     | Tensor Ops   |  |
| | (RuvLLM unified)  |     | (PagedAttention)  |     | (kernels)    |  |
| +-------------------+     +-------------------+     +--------------+  |
|                                                                       |
+-----------------------------------------------------------------------+
PagedAttention partitions the KV cache into fixed-size blocks (pages) that can be allocated non-contiguously, enabling:
- 5-10x concurrent users: Memory shared across requests via copy-on-write pages
- Dynamic allocation: Pages allocated as sequences grow, freed when complete
- Prefix caching: Common prefixes (system prompts) share pages across requests
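The block bookkeeping behind these three properties can be sketched as a small reference-counted allocator. This is an illustrative model only, not the mistral-rs or RuvLLM implementation: `BlockAllocator` and its methods are hypothetical, prefixes are assumed to be shared at whole-block boundaries, and the copy step of copy-on-write (duplicating a shared block before writing to it) is omitted.

```rust
use std::collections::HashMap;

/// Illustrative block allocator with reference counting.
pub struct BlockAllocator {
    block_size: usize,
    free: Vec<usize>,                 // indices of unused physical blocks
    ref_counts: Vec<u32>,             // number of sequences using each block
    tables: HashMap<u64, Vec<usize>>, // sequence id -> block table
}

impl BlockAllocator {
    pub fn new(block_size: usize, max_blocks: usize) -> Self {
        Self {
            block_size,
            free: (0..max_blocks).rev().collect(),
            ref_counts: vec![0; max_blocks],
            tables: HashMap::new(),
        }
    }

    /// Blocks needed to hold `tokens` tokens (ceiling division).
    pub fn blocks_for(&self, tokens: usize) -> usize {
        (tokens + self.block_size - 1) / self.block_size
    }

    /// Grow a sequence's block table until it can hold `total_tokens`.
    /// Returns false if the pool runs out; blocks taken so far stay assigned.
    pub fn reserve(&mut self, seq: u64, total_tokens: usize) -> bool {
        let needed = self.blocks_for(total_tokens);
        let table = self.tables.entry(seq).or_default();
        while table.len() < needed {
            match self.free.pop() {
                Some(block) => {
                    self.ref_counts[block] = 1;
                    table.push(block);
                }
                None => return false,
            }
        }
        true
    }

    /// Start `child` with the first `prefix_blocks` blocks of `parent`,
    /// sharing them instead of allocating new ones.
    pub fn fork_prefix(&mut self, parent: u64, child: u64, prefix_blocks: usize) {
        let shared: Vec<usize> = self
            .tables
            .get(&parent)
            .map(|t| t.iter().take(prefix_blocks).copied().collect())
            .unwrap_or_default();
        for &block in &shared {
            self.ref_counts[block] += 1;
        }
        self.tables.insert(child, shared);
    }

    /// Drop a sequence; a block returns to the pool once nobody shares it.
    pub fn release(&mut self, seq: u64) {
        if let Some(table) = self.tables.remove(&seq) {
            for block in table {
                self.ref_counts[block] -= 1;
                if self.ref_counts[block] == 0 {
                    self.free.push(block);
                }
            }
        }
    }

    pub fn free_blocks(&self) -> usize {
        self.free.len()
    }
}
```

Because block tables hold arbitrary indices, a sequence never needs a contiguous region, and a shared system prompt costs its blocks once no matter how many requests reference it.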
/// PagedAttention configuration for mistral-rs
#[cfg(feature = "mistral-rs")]
pub struct PagedAttentionConfig {
    /// Block size in tokens (typical: 16)
    pub block_size: usize,
    /// Maximum blocks in page table
    pub max_blocks: usize,
    /// GPU memory fraction for KV cache (0.0-1.0)
    pub gpu_memory_fraction: f32,
    /// Enable prefix caching for repeated prompts
    pub enable_prefix_caching: bool,
}

// Gated like the struct so builds without the feature still compile.
#[cfg(feature = "mistral-rs")]
impl Default for PagedAttentionConfig {
    fn default() -> Self {
        Self {
            block_size: 16,
            max_blocks: 4096,
            gpu_memory_fraction: 0.9,
            enable_prefix_caching: true,
        }
    }
}

Performance Impact:
| Metric | Without PagedAttention | With PagedAttention |
|---|---|---|
| Concurrent users | 1-2 | 10-50 |
| Memory utilization | 40-60% | 85-95% |
| Memory fragmentation | High | Near-zero |
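To see where concurrency gains of this order can come from, a back-of-the-envelope sizing helps. The helpers below are hypothetical, and the model shape used in the note that follows (32 layers, 8 KV heads, head dimension 128, fp16) is an assumed example, not a RuvLLM default.

```rust
/// Bytes of KV cache stored per token: one key and one value vector
/// (the factor 2) for every layer and KV head.
pub fn kv_bytes_per_token(layers: u64, kv_heads: u64, head_dim: u64, dtype_bytes: u64) -> u64 {
    2 * layers * kv_heads * head_dim * dtype_bytes
}

/// Number of whole blocks that fit in a KV cache budget.
pub fn blocks_in_budget(budget_bytes: u64, block_size: u64, bytes_per_token: u64) -> u64 {
    budget_bytes / (block_size * bytes_per_token)
}

/// Sequences that fit when each one holds `tokens_per_seq` tokens
/// (must be non-zero), rounding every sequence up to whole blocks.
pub fn sequences_supported(blocks: u64, block_size: u64, tokens_per_seq: u64) -> u64 {
    let blocks_per_seq = (tokens_per_seq + block_size - 1) / block_size;
    blocks / blocks_per_seq
}
```

With the assumed shape, each token needs 128 KiB of KV cache, so an 8 GiB budget holds 4096 blocks of 16 tokens. Reserving a full 4096-token context per request fits 16 sequences, while paging sequences that actually average 512 tokens fits 128. These figures are illustrative arithmetic, not measurements.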
X-LoRA enables per-token adapter routing for multi-task models:
- Dynamic mixing: Router network selects adapters per token
- Learned routing: MLP router trained on adapter selection
- Top-k activation: Only k adapters compute per token (efficiency)
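The routing step described above can be approximated as a temperature-scaled softmax over router logits followed by top-k selection. The sketch below is a simplified illustration, not the mistral-rs router: the function names are hypothetical, the logits would come from the learned MLP router in practice, and only the additive mixing mode is shown.

```rust
/// Softmax over router logits, scaled by `temperature` (must be > 0).
pub fn softmax(logits: &[f32], temperature: f32) -> Vec<f32> {
    // Subtract the maximum before exponentiating for numerical stability.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits
        .iter()
        .map(|&l| ((l - max) / temperature).exp())
        .collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Keep the `top_k` highest-weighted adapters and renormalize their weights.
/// Returns (adapter index, weight) pairs in descending weight order.
pub fn route_top_k(logits: &[f32], top_k: usize, temperature: f32) -> Vec<(usize, f32)> {
    let probs = softmax(logits, temperature);
    let mut indexed: Vec<(usize, f32)> = probs.into_iter().enumerate().collect();
    indexed.sort_by(|a, b| b.1.total_cmp(&a.1));
    indexed.truncate(top_k);
    let total: f32 = indexed.iter().map(|(_, w)| w).sum();
    indexed.into_iter().map(|(i, w)| (i, w / total)).collect()
}

/// Additive mixing: weighted sum of the selected adapters' outputs.
/// `outputs[i]` is adapter i's output vector for the current token.
pub fn mix_additive(outputs: &[Vec<f32>], routes: &[(usize, f32)]) -> Vec<f32> {
    let dim = outputs.first().map_or(0, |o| o.len());
    let mut mixed = vec![0.0; dim];
    for &(idx, weight) in routes {
        for (m, o) in mixed.iter_mut().zip(&outputs[idx]) {
            *m += weight * o;
        }
    }
    mixed
}
```

A lower temperature sharpens the distribution toward a single adapter, while a higher one spreads weight more evenly across the selected k.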
/// X-LoRA configuration for multi-adapter inference
#[cfg(feature = "mistral-rs")]
pub struct XLoraConfig {
    /// Adapter names/paths to load
    pub adapters: Vec<String>,
    /// Top-k adapters to activate per token
    pub top_k: usize,
    /// Router temperature for softmax
    pub temperature: f32,
    /// Mixing mode
    pub mixing_mode: XLoraMixingMode,
}

#[derive(Debug, Clone, Copy)]
pub enum XLoraMixingMode {
    /// Sum weighted adapter outputs
    Additive,
    /// Concatenate and project
    Concatenate,
    /// Gated mixture with learned gates
    Gated,
}

Use Cases:
- Code + chat model: Route code tokens to code adapter, natural language to chat adapter
- Multi-language: Route based on detected language
- Domain-specific: Finance, medical, legal adapters activated by context
ISQ enables runtime quantization without pre-exported quantized models:
- Runtime flexibility: Same model weights, different quantization per deployment
- Memory adaptation: Quantize to fit available hardware
- Quality preservation: Calibration-based methods (AWQ, GPTQ) keep accuracy loss small at 4 bits
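As a point of reference for the simplest method listed in `IsqMethod`, round-to-nearest (RTN) quantization of one weight group can be sketched as follows. This is a minimal symmetric variant written for illustration, with hypothetical names; it is not the quantization format mistral-rs uses, and AWQ and GPTQ additionally rely on calibration data that is not modeled here.

```rust
/// One quantized weight group: integer codes plus the scale to recover values.
pub struct QuantGroup {
    pub scale: f32,
    pub codes: Vec<i8>,
}

/// Symmetric round-to-nearest quantization of one weight group.
/// `bits` must be in 2..=8; codes lie in [-(2^(bits-1) - 1), 2^(bits-1) - 1].
pub fn quantize_rtn(weights: &[f32], bits: u8) -> QuantGroup {
    assert!((2..=8).contains(&bits), "bits must be in 2..=8");
    let qmax = ((1i32 << (bits - 1)) - 1) as f32;
    let max_abs = weights.iter().fold(0.0f32, |m, w| m.max(w.abs()));
    // Map the largest magnitude onto the largest code; avoid a zero scale.
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / qmax };
    let codes = weights
        .iter()
        .map(|w| (w / scale).round().clamp(-qmax, qmax) as i8)
        .collect();
    QuantGroup { scale, codes }
}

/// Recover approximate weights from codes and scale.
pub fn dequantize(group: &QuantGroup) -> Vec<f32> {
    group.codes.iter().map(|&c| c as f32 * group.scale).collect()
}
```

Each recovered weight differs from the original by at most half a quantization step (`scale / 2`), which is why one large outlier in a group degrades precision for every other weight in it; calibration-based methods aim to reduce that effect.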
/// ISQ configuration for runtime quantization
#[cfg(feature = "mistral-rs")]
pub struct IsqConfig {
    /// Quantization bits (2, 4, 8)
    pub bits: u8,
    /// Quantization method
    pub method: IsqMethod,
    /// Calibration dataset size
    pub calibration_samples: usize,
}

#[derive(Debug, Clone, Copy)]
pub enum IsqMethod {
    /// Activation-aware Weight Quantization
    AWQ,
    /// GPTQ (builds on Optimal Brain Quantization)
    GPTQ,
    /// Round-to-nearest (fastest, lower quality)
    RTN,
    /// SmoothQuant (activation smoothing)
    SmoothQuant,
}

Performance Impact:
| Method | Bits | Memory Reduction | Quality Loss |
|---|---|---|---|
| AWQ | 4 | 4x | <1% |
| GPTQ | 4 | 4x | <1% |
| RTN | 4 | 4x | 2-3% |
| AWQ | 2 | 8x | 3-5% |
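The memory reduction column follows from the bit width alone when fp16 (16 bits per weight) is taken as the baseline, and the same arithmetic can drive a "quantize to fit" decision. The helpers below are hypothetical and deliberately coarse: they ignore per-group scales, KV cache, and activation memory.

```rust
/// Approximate weight memory in bytes for `params` parameters stored at
/// `bits` bits each. Ignores quantization metadata such as per-group scales.
pub fn weight_bytes(params: u64, bits: u8) -> u64 {
    params * bits as u64 / 8
}

/// Pick the widest bit width whose weights fit in `budget_bytes`.
/// 16 stands for unquantized fp16; returns None if even 2-bit does not fit.
pub fn select_bits(params: u64, budget_bytes: u64) -> Option<u8> {
    [16u8, 8, 4, 2]
        .iter()
        .copied()
        .find(|&bits| weight_bytes(params, bits) <= budget_bytes)
}
```

For an assumed 7-billion-parameter model, fp16 weights take about 14 GB, so a device with 8 GB free for weights would select 8 bits and one with 4 GB would select 4 bits.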
- Add mistral-rs dependencies with feature flags
- Implement config mapping: `MistralBackendConfig` -> `mistralrs::Config`
- Wire `load_model` to mistral-rs model loading
- Wire `generate` and `generate_stream` to mistral-rs inference
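The config mapping step could be written as a validating conversion. In the sketch below every type is a local stand-in: the field layout of `MistralBackendConfig` is assumed, and `EngineConfig` is a hypothetical placeholder for whatever configuration type the pinned mistralrs release expects, not a real mistralrs type.

```rust
use std::convert::TryFrom;

/// Assumed subset of the RuvLLM-side PagedAttention settings.
#[derive(Debug, Clone, PartialEq)]
pub struct PagedAttentionConfig {
    pub block_size: usize,
    pub gpu_memory_fraction: f32,
}

/// Assumed subset of `MistralBackendConfig`; the real struct has more fields.
#[derive(Debug, Clone, PartialEq)]
pub struct MistralBackendConfig {
    pub paged_attention: Option<PagedAttentionConfig>,
    pub isq_bits: Option<u8>,
}

/// Hypothetical engine-side settings; NOT a mistralrs type.
#[derive(Debug, Clone, PartialEq)]
pub struct EngineConfig {
    pub kv_block_size: Option<usize>,
    pub kv_memory_fraction: Option<f32>,
    pub quant_bits: Option<u8>,
}

impl TryFrom<&MistralBackendConfig> for EngineConfig {
    type Error = String;

    /// Validate RuvLLM-side values before handing them to the engine.
    fn try_from(cfg: &MistralBackendConfig) -> Result<Self, Self::Error> {
        if let Some(pa) = &cfg.paged_attention {
            if pa.block_size == 0 {
                return Err("block_size must be non-zero".to_string());
            }
            if !(pa.gpu_memory_fraction > 0.0 && pa.gpu_memory_fraction <= 1.0) {
                return Err("gpu_memory_fraction must be in (0.0, 1.0]".to_string());
            }
        }
        if let Some(bits) = cfg.isq_bits {
            if ![2u8, 4, 8].contains(&bits) {
                return Err(format!("unsupported ISQ bit width: {}", bits));
            }
        }
        Ok(EngineConfig {
            kv_block_size: cfg.paged_attention.as_ref().map(|pa| pa.block_size),
            kv_memory_fraction: cfg
                .paged_attention
                .as_ref()
                .map(|pa| pa.gpu_memory_fraction),
            quant_bits: cfg.isq_bits,
        })
    }
}
```

Keeping the conversion in one `TryFrom` impl gives the planned config-mapping unit tests a single target, and confines upstream API changes to one place.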
// Illustrative wiring: the builder and request names below sketch the intended
// shape and should be verified against the pinned mistralrs release.
#[cfg(feature = "mistral-rs")]
impl LlmBackend for MistralBackend {
    fn load_model(&mut self, model_id: &str, config: ModelConfig) -> Result<()> {
        use mistralrs::MistralRsBuilder;

        // Forward the PagedAttention settings only when they are configured.
        let builder = MistralRsBuilder::new(model_id)
            .with_paged_attention(self.config.paged_attention.as_ref().map(|pa| {
                mistralrs::PagedAttentionConfig {
                    block_size: pa.block_size,
                    ..Default::default()
                }
            }));
        self.inner = Some(builder.build()?);
        Ok(())
    }

    fn generate(&self, prompt: &str, params: GenerateParams) -> Result<String> {
        // Fail early if load_model has not been called.
        let inner = self.inner.as_ref()
            .ok_or_else(|| Error::msg("Model not loaded"))?;
        let request = mistralrs::Request::new(prompt)
            .with_max_tokens(params.max_tokens)
            .with_temperature(params.temperature);
        let response = inner.send_request(request)?;
        Ok(response.text)
    }
}

- Enable PagedAttention with configurable parameters
- Add X-LoRA adapter loading and routing
- Implement ISQ with calibration pipeline
- Test and validate Metal acceleration
- Test and validate CUDA acceleration
- Benchmark against Candle backend
- Production-scale serving: PagedAttention enables 5-10x more concurrent users
- Multi-task efficiency: X-LoRA replaces per-request adapter switching with per-token routing
- Deployment flexibility: ISQ allows runtime quantization decisions
- Upstream maintenance: mistral-rs team maintains core inference optimizations
- Feature parity: Access to latest mistral-rs features (Flash Attention 2, speculative decoding)
- Dependency complexity: Additional crate dependencies increase build complexity
- API surface coupling: Changes in mistral-rs may require RuvLLM updates
- Feature matrix: Both backend codepaths must be tested
- WASM incompatibility: mistral-rs does not support WASM targets
- Two backend options: Candle remains optimal for edge/WASM; mistral-rs for server
- Compile-time selection: Users choose backend via feature flags
- Binary size tradeoff: Server builds are larger; edge builds unchanged
| Risk | Mitigation |
|---|---|
| mistral-rs API instability | Pin to specific version; abstract via MistralBackend interface |
| Feature flag complexity | Comprehensive CI matrix testing all feature combinations |
| Performance regression | Benchmark suite comparing Candle vs mistral-rs |
| Metal/CUDA compatibility | Platform-specific CI runners for hardware validation |
- Rejected: Different model format (GGUF), weaker Rust integration
- Consideration: Could add as third backend for GGUF model support
- Rejected: Candle's PagedAttention is experimental and less mature
- Consideration: Monitor upstream development
- Rejected: Python FFI adds latency; deployment complexity
- Consideration: vLLM's algorithm informs our understanding
- ADR-001: Ruvector Core Architecture (HNSW, Graph Store)
- ADR-002: RuvLLM Integration with Ruvector
- ADR-003: SIMD Optimization Strategy
- ADR-004: KV Cache Management
- ADR-006: Memory Management
- ADR-007: Security Review & Technical Debt
- `MistralBackend` implements `LlmBackend` trait
- All existing RuvLLM consumers work unchanged
- Feature flags are additive (no breaking changes)
- Unit tests for config mapping
- Integration tests with sample models
- Benchmark suite comparing backends
- CI matrix for feature flag combinations
- Feature flag documentation in README
- Backend selection guide
- Performance comparison benchmarks
- mistral-rs Repository: https://github.com/EricLBuehler/mistral.rs
- vLLM PagedAttention Paper: "Efficient Memory Management for Large Language Model Serving with PagedAttention"
- X-LoRA Paper: "X-LoRA: Mixture of Low-Rank Adapter Experts"
- ISQ/AWQ Paper: "AWQ: Activation-aware Weight Quantization for LLM Compression"
- Existing MistralBackend stub: `crates/ruvllm/src/backends/mistral_backend.rs`
| Component | Status | Notes |
|---|---|---|
| Feature flags | Pending | Add to Cargo.toml |
| Config mapping | Pending | MistralBackendConfig -> mistralrs::Config |
| Model loading | Pending | Wire to mistral-rs loader |
| Generation | Pending | Wire to mistral-rs inference |
| PagedAttention | Pending | Enable via config |
| X-LoRA | Pending | Wire existing XLoraManager |
| ISQ | Pending | Implement calibration pipeline |
| Metal acceleration | Pending | Test on Apple Silicon |
| CUDA acceleration | Pending | Test on NVIDIA GPUs |
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2026-01-20 | Ruvector Architecture Team | Initial proposal |