
ADR-008: mistral-rs Integration for Production-Scale LLM Serving

Status: Proposed
Date: 2026-01-20
Decision Makers: Ruvector Architecture Team
Technical Area: LLM Inference Engine / Production Serving


Context and Problem Statement

RuvLLM v2.3 includes a stub MistralBackend implementation at crates/ruvllm/src/backends/mistral_backend.rs that defines the interface for high-performance LLM inference but lacks actual integration with the mistral-rs crate. The current Candle backend is optimized for single-user and edge deployment scenarios, but production-scale serving requires advanced memory management and multi-tenant capabilities.

Current State

The existing MistralBackend stub provides:

  • Configuration structures for PagedAttention, X-LoRA, and ISQ
  • XLoraManager with adapter loading/routing logic (placeholder)
  • MistralBackendConfig with builder pattern for Metal/CUDA targets
  • Integration hooks for the LlmBackend trait

However, the implementation is non-functional:

  • No actual mistral-rs crate dependency
  • Token generation returns placeholder values
  • Model loading does not wire to inference pipeline
  • PagedAttention uses RuvLLM's internal implementation, not mistral-rs's optimized version

Key Challenges

  1. Concurrent User Scaling: Candle backend is optimized for single-user inference; production servers need 10-100+ concurrent requests
  2. KV Cache Memory Pressure: Without vLLM-style paging, long-context sessions exhaust GPU memory
  3. Multi-Task Models: LoRA adapter switching incurs per-request overhead; X-LoRA enables per-token routing with no swap cost
  4. Deployment Flexibility: Models should be quantized at runtime based on available hardware

Decision Drivers

Performance Requirements

  • Concurrent sessions: 50-100 simultaneous inference requests
  • Memory efficiency: 5-10x improvement in KV cache utilization
  • Adapter latency: <1ms overhead for X-LoRA routing decisions
  • Quantization: Runtime ISQ without model re-export

Compatibility Requirements

  • Existing interface: Must implement LlmBackend trait seamlessly
  • Feature isolation: Optional dependency with feature flags
  • Backend selection: Per-deployment choice between the Candle and mistral-rs backends

Hardware Requirements

  • Apple Silicon: Metal acceleration via mistral-rs-metal
  • NVIDIA GPUs: CUDA acceleration via mistral-rs-cuda
  • CPU fallback: Pure Rust path for edge/WASM targets

Considered Options

Option A: Fork and Embed mistral-rs

Vendor mistral-rs source code directly into RuvLLM.

Pros:

  • Full control over API surface
  • No external dependency versioning
  • Can customize for RuvLLM's needs

Cons:

  • Maintenance burden of tracking upstream
  • Miss upstream optimizations and fixes
  • Duplicated effort

Option B: Optional Dependency with Feature Flags

Add mistral-rs as an optional dependency behind feature flags, wiring the existing MistralBackend interface to the actual mistral-rs crate.

Pros:

  • Leverage upstream development
  • Clean separation via features
  • Users choose their backend at compile time
  • Smaller binary for edge deployments (Candle-only)

Cons:

  • API surface depends on upstream stability
  • Two codepaths to maintain
  • Feature matrix complexity

Option C: Runtime Backend Selection

Use dynamic dispatch to select backend at runtime via configuration.

Pros:

  • Single binary for all deployments
  • Runtime flexibility

Cons:

  • Binary size includes all backends
  • Dynamic dispatch overhead
  • Complex testing matrix

Decision Outcome

Chosen Option: Option B - Optional Dependency with Feature Flags

Add mistral-rs as an optional dependency with three feature flags, wiring the existing MistralBackend stub to the actual mistral-rs implementation.

Rationale

  1. Separation of concerns: Edge deployments use Candle (no mistral-rs dependency); server deployments enable mistral-rs features
  2. Upstream leverage: mistral-rs team maintains PagedAttention, X-LoRA, ISQ implementations
  3. Existing interface: The MistralBackend stub already defines the API; we wire it to the real implementation
  4. Incremental adoption: Users can migrate from Candle to mistral-rs backend per-deployment

Technical Specifications

Feature Flags

# Cargo.toml additions
[features]
default = ["candle-backend"]

# Base mistral-rs integration
mistral-rs = ["dep:mistralrs", "dep:mistralrs-core"]

# Apple Silicon Metal acceleration
mistral-rs-metal = ["mistral-rs", "mistralrs/metal"]

# NVIDIA CUDA acceleration
mistral-rs-cuda = ["mistral-rs", "mistralrs/cuda"]

[dependencies]
# Optional mistral-rs integration
mistralrs = { version = "0.3", optional = true }
mistralrs-core = { version = "0.3", optional = true }
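
With these flags in place, backend construction is gated at compile time. A minimal sketch of the selection point (the CandleBackend constructor and the exact function signature here are illustrative assumptions, not the final API):

// Compile-time backend selection: builds with `--features mistral-rs`
// take the mistral-rs path; default builds fall back to Candle.
#[cfg(feature = "mistral-rs")]
pub fn default_backend(config: MistralBackendConfig) -> Box<dyn LlmBackend> {
    Box::new(MistralBackend::new(config))
}

#[cfg(not(feature = "mistral-rs"))]
pub fn default_backend(_config: MistralBackendConfig) -> Box<dyn LlmBackend> {
    // Candle remains the default for edge/WASM deployments.
    Box::new(CandleBackend::default())
}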

Feature Matrix

| Feature | Candle | mistral-rs | mistral-rs-metal | mistral-rs-cuda |
|---|---|---|---|---|
| Single-user inference | Yes | Yes | Yes | Yes |
| PagedAttention | No | Yes | Yes | Yes |
| X-LoRA | No | Yes | Yes | Yes |
| ISQ | No | Yes | Yes | Yes |
| Metal acceleration | Yes | No | Yes | No |
| CUDA acceleration | Partial | No | No | Yes |
| WASM support | Yes | No | No | No |
| Binary size | ~15MB | ~45MB | ~50MB | ~60MB |

Architecture

+-----------------------------------------------------------------------+
|                    MISTRAL-RS INTEGRATION ARCHITECTURE                 |
+-----------------------------------------------------------------------+
|                                                                        |
|   +-------------------+     +-------------------+     +--------------+ |
|   | MistralBackend    |     | mistralrs::Model  |     | Hardware     | |
|   | (RuvLLM adapter)  |     | (inference core)  |     | Accelerator  | |
|   |                   |     |                   |     |              | |
|   | - Config mapping  |---->| - PagedAttention  |---->| - Metal      | |
|   | - Trait impl      |     | - X-LoRA routing  |     | - CUDA       | |
|   | - Error handling  |     | - ISQ runtime     |     | - CPU        | |
|   +--------+----------+     +---------+---------+     +------+-------+ |
|            |                          |                      |         |
|            v                          v                      v         |
|   +--------+----------+     +---------+---------+     +------+-------+ |
|   | LlmBackend trait  |     | KV Cache Pool     |     | Tensor Ops   | |
|   | (RuvLLM unified)  |     | (PagedAttention)  |     | (kernels)    | |
|   +-------------------+     +-------------------+     +--------------+ |
|                                                                        |
+-----------------------------------------------------------------------+

Key Features to Enable

1. PagedAttention (vLLM-style KV Cache Management)

PagedAttention partitions the KV cache into fixed-size blocks (pages) that can be allocated non-contiguously, enabling:

  • 5-10x concurrent users: Memory shared across requests via copy-on-write pages
  • Dynamic allocation: Pages allocated as sequences grow, freed when complete
  • Prefix caching: Common prefixes (system prompts) share pages across requests
/// PagedAttention configuration for mistral-rs
#[cfg(feature = "mistral-rs")]
pub struct PagedAttentionConfig {
    /// Block size in tokens (typical: 16)
    pub block_size: usize,
    /// Maximum blocks in page table
    pub max_blocks: usize,
    /// GPU memory fraction for KV cache (0.0-1.0)
    pub gpu_memory_fraction: f32,
    /// Enable prefix caching for repeated prompts
    pub enable_prefix_caching: bool,
}

impl Default for PagedAttentionConfig {
    fn default() -> Self {
        Self {
            block_size: 16,
            max_blocks: 4096,
            gpu_memory_fraction: 0.9,
            enable_prefix_caching: true,
        }
    }
}

Performance Impact:

| Metric | Without PagedAttention | With PagedAttention |
|---|---|---|
| Concurrent users | 1-2 | 10-50 |
| Memory utilization | 40-60% | 85-95% |
| Memory fragmentation | High | Near-zero |
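
To make the paging model concrete, below is a minimal sketch of the block-table bookkeeping PagedAttention performs. This is a simplification: mistral-rs's actual allocator also handles copy-on-write pages and prefix sharing.

/// Simplified paged KV-cache allocator: each sequence maps to a list of
/// fixed-size blocks drawn from a shared free pool.
struct BlockAllocator {
    block_size: usize,       // tokens per block (e.g. 16)
    free_blocks: Vec<usize>, // indices of unused physical blocks
}

impl BlockAllocator {
    fn new(block_size: usize, max_blocks: usize) -> Self {
        Self { block_size, free_blocks: (0..max_blocks).collect() }
    }

    /// Number of blocks needed to hold `num_tokens` of KV state.
    fn blocks_needed(&self, num_tokens: usize) -> usize {
        num_tokens.div_ceil(self.block_size)
    }

    /// Allocate blocks for a sequence; None if the pool is exhausted,
    /// in which case the scheduler can preempt or queue the request.
    fn allocate(&mut self, num_tokens: usize) -> Option<Vec<usize>> {
        let needed = self.blocks_needed(num_tokens);
        if self.free_blocks.len() < needed {
            return None;
        }
        Some(self.free_blocks.split_off(self.free_blocks.len() - needed))
    }

    /// Return a finished sequence's blocks to the pool.
    fn free(&mut self, blocks: Vec<usize>) {
        self.free_blocks.extend(blocks);
    }
}

Because blocks are fixed-size and allocated on demand, waste is bounded to at most one partially filled block per sequence, which is where the near-zero fragmentation figure in the table comes from.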

2. X-LoRA (eXpert-mixed LoRA)

X-LoRA enables per-token adapter routing for multi-task models:

  • Dynamic mixing: Router network selects adapters per token
  • Learned routing: MLP router trained on adapter selection
  • Top-k activation: Only k adapters compute per token (efficiency)
/// X-LoRA configuration for multi-adapter inference
#[cfg(feature = "mistral-rs")]
pub struct XLoraConfig {
    /// Adapter names/paths to load
    pub adapters: Vec<String>,
    /// Top-k adapters to activate per token
    pub top_k: usize,
    /// Router temperature for softmax
    pub temperature: f32,
    /// Mixing mode
    pub mixing_mode: XLoraMixingMode,
}

#[derive(Debug, Clone, Copy)]
pub enum XLoraMixingMode {
    /// Sum weighted adapter outputs
    Additive,
    /// Concatenate and project
    Concatenate,
    /// Gated mixture with learned gates
    Gated,
}
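
For intuition, here is a minimal sketch of the per-token routing step in Additive mode. It is illustrative only: the router logits are assumed to come from the learned MLP router, and this is not mistral-rs's internal code.

/// Select the top-k adapters for one token from router logits, returning
/// (adapter index, weight) pairs for additive mixing.
fn route_token(router_logits: &[f32], top_k: usize, temperature: f32) -> Vec<(usize, f32)> {
    // Temperature-scaled softmax over all adapters (max-subtracted for stability).
    let max = router_logits.iter().fold(f32::NEG_INFINITY, |m, &l| m.max(l / temperature));
    let exps: Vec<f32> = router_logits.iter().map(|&l| (l / temperature - max).exp()).collect();
    let sum: f32 = exps.iter().sum();

    // Keep only the k highest-weight adapters; the rest never compute,
    // which is where the efficiency win comes from.
    let mut weighted: Vec<(usize, f32)> = exps.iter().map(|e| e / sum).enumerate().collect();
    weighted.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    weighted.truncate(top_k);

    // Renormalize the surviving weights so they sum to 1.
    let kept: f32 = weighted.iter().map(|(_, w)| w).sum();
    weighted.iter_mut().for_each(|(_, w)| *w /= kept);
    weighted
}

The mixed layer output is then the weighted sum of the selected adapters' outputs added to the base model's output.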

Use Cases:

  • Code + chat model: Route code tokens to code adapter, natural language to chat adapter
  • Multi-language: Route based on detected language
  • Domain-specific: Finance, medical, legal adapters activated by context

3. ISQ (In-Situ Quantization)

ISQ enables runtime quantization without pre-exported quantized models:

  • Runtime flexibility: Same model weights, different quantization per deployment
  • Memory adaptation: Quantize to fit available hardware
  • Quality preservation: Activation-aware methods (AWQ, GPTQ) maintain accuracy
/// ISQ configuration for runtime quantization
#[cfg(feature = "mistral-rs")]
pub struct IsqConfig {
    /// Quantization bits (2, 4, 8)
    pub bits: u8,
    /// Quantization method
    pub method: IsqMethod,
    /// Calibration dataset size
    pub calibration_samples: usize,
}

#[derive(Debug, Clone, Copy)]
pub enum IsqMethod {
    /// Activation-aware Weight Quantization
    AWQ,
    /// GPTQ with optimal brain quantization
    GPTQ,
    /// Round-to-nearest (fastest, lower quality)
    RTN,
    /// SmoothQuant (activation smoothing)
    SmoothQuant,
}

Performance Impact:

| Method | Bits | Memory Reduction | Quality Loss |
|---|---|---|---|
| AWQ | 4 | 4x | <1% |
| GPTQ | 4 | 4x | <1% |
| RTN | 4 | 4x | 2-3% |
| AWQ | 2 | 8x | 3-5% |
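
To ground the mechanics, a minimal sketch of 4-bit round-to-nearest (RTN) quantization for a single weight group; AWQ and GPTQ add activation-aware scaling and error compensation on top of this basic scheme:

/// Quantize a group of f32 weights to 4-bit codes with a shared scale,
/// so that w ≈ (q - zero_point) * scale. Assumes max > min.
fn quantize_rtn_4bit(weights: &[f32]) -> (Vec<u8>, f32, u8) {
    let min = weights.iter().cloned().fold(f32::INFINITY, f32::min);
    let max = weights.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let scale = (max - min) / 15.0; // 4 bits -> 16 levels (0..=15)
    let zero_point = (-min / scale).round().clamp(0.0, 15.0) as u8;
    let q = weights
        .iter()
        .map(|w| (w / scale + zero_point as f32).round().clamp(0.0, 15.0) as u8)
        .collect();
    (q, scale, zero_point)
}

/// Dequantize back to f32 at inference time.
fn dequantize(q: &[u8], scale: f32, zero_point: u8) -> Vec<f32> {
    q.iter().map(|&v| (v as f32 - zero_point as f32) * scale).collect()
}

Packing two 4-bit codes per byte yields the 4x memory reduction shown above; quantizing in small groups (e.g. 128 weights per scale) keeps the error local.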

Implementation Roadmap

Phase 1: Core Integration (Weeks 1-2)

  1. Add mistral-rs dependencies with feature flags
  2. Implement config mapping: MistralBackendConfig -> mistralrs::Config
  3. Wire load_model to mistral-rs model loading
  4. Wire generate and generate_stream to mistral-rs inference
#[cfg(feature = "mistral-rs")]
impl LlmBackend for MistralBackend {
    fn load_model(&mut self, model_id: &str, config: ModelConfig) -> Result<()> {
        use mistralrs::MistralRsBuilder;

        // Map RuvLLM's PagedAttention settings onto the mistral-rs builder.
        // The builder API is pinned by version; see Risk Mitigation below.
        let builder = MistralRsBuilder::new(model_id)
            .with_paged_attention(self.config.paged_attention.as_ref().map(|pa| {
                mistralrs::PagedAttentionConfig {
                    block_size: pa.block_size,
                    ..Default::default()
                }
            }));

        self.inner = Some(builder.build()?);
        Ok(())
    }

    fn generate(&self, prompt: &str, params: GenerateParams) -> Result<String> {
        let inner = self.inner.as_ref()
            .ok_or_else(|| Error::msg("Model not loaded"))?;

        let request = mistralrs::Request::new(prompt)
            .with_max_tokens(params.max_tokens)
            .with_temperature(params.temperature);

        let response = inner.send_request(request)?;
        Ok(response.text)
    }
}
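
Streaming can be wired the same way. The sketch below assumes a callback-style sink; the send_request_stream call on the mistral-rs side is a placeholder to be confirmed against the pinned crate version, whose actual streaming interface is channel-based.

#[cfg(feature = "mistral-rs")]
impl MistralBackend {
    /// Hypothetical streaming path: forward each chunk from the
    /// mistral-rs stream to the caller's sink as it arrives.
    fn generate_stream_impl(
        &self,
        prompt: &str,
        params: &GenerateParams,
        mut on_token: impl FnMut(&str),
    ) -> Result<()> {
        let inner = self.inner.as_ref()
            .ok_or_else(|| Error::msg("Model not loaded"))?;

        // Placeholder loop: stands in for draining mistral-rs's response channel.
        for chunk in inner.send_request_stream(prompt, params)? {
            on_token(&chunk?.text);
        }
        Ok(())
    }
}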

Phase 2: Advanced Features (Weeks 3-4)

  1. Enable PagedAttention with configurable parameters
  2. Add X-LoRA adapter loading and routing
  3. Implement ISQ with calibration pipeline
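
Taken together, a Phase 2 server deployment might assemble these as follows. The builder-style methods on MistralBackendConfig (with_paged_attention, with_xlora, with_isq) are assumptions about the stub's existing builder, and the adapter names are hypothetical:

/// Hypothetical server configuration combining all three features.
fn server_config() -> MistralBackendConfig {
    MistralBackendConfig::default()
        .with_paged_attention(PagedAttentionConfig::default())
        .with_xlora(XLoraConfig {
            adapters: vec!["code-adapter".into(), "chat-adapter".into()],
            top_k: 2,
            temperature: 1.0,
            mixing_mode: XLoraMixingMode::Additive,
        })
        .with_isq(IsqConfig {
            bits: 4,
            method: IsqMethod::AWQ,
            calibration_samples: 128,
        })
}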

Phase 3: Hardware Acceleration (Weeks 5-6)

  1. Test and validate Metal acceleration
  2. Test and validate CUDA acceleration
  3. Benchmark against Candle backend

Consequences

Positive Consequences

  1. Production-scale serving: PagedAttention enables 5-10x more concurrent users
  2. Multi-task efficiency: X-LoRA eliminates adapter switching overhead
  3. Deployment flexibility: ISQ allows runtime quantization decisions
  4. Upstream maintenance: mistral-rs team maintains core inference optimizations
  5. Feature parity: Access to latest mistral-rs features (Flash Attention 2, speculative decoding)

Negative Consequences

  1. Dependency complexity: Additional crate dependencies increase build complexity
  2. API surface coupling: Changes in mistral-rs may require RuvLLM updates
  3. Feature matrix: Two backend codepaths must both be tested across the feature-flag matrix
  4. WASM incompatibility: mistral-rs does not support WASM targets

Neutral Consequences

  1. Two backend options: Candle remains optimal for edge/WASM; mistral-rs for server
  2. Compile-time selection: Users choose backend via feature flags
  3. Binary size tradeoff: Server builds are larger; edge builds unchanged

Risk Mitigation

| Risk | Mitigation |
|---|---|
| mistral-rs API instability | Pin to a specific version; abstract via the MistralBackend interface |
| Feature flag complexity | Comprehensive CI matrix testing all feature combinations |
| Performance regression | Benchmark suite comparing Candle vs mistral-rs |
| Metal/CUDA compatibility | Platform-specific CI runners for hardware validation |

Alternatives Considered

llama.cpp via rust-llama

  • Rejected: Different model format (GGUF), weaker Rust integration
  • Consideration: Could add as third backend for GGUF model support

candle-transformers PagedAttention

  • Rejected: Candle's PagedAttention is experimental and less mature
  • Consideration: Monitor upstream development

vLLM Python Backend

  • Rejected: Python FFI adds latency; deployment complexity
  • Consideration: vLLM's PagedAttention design informs this proposal

Related Decisions

  • ADR-001: Ruvector Core Architecture (HNSW, Graph Store)
  • ADR-002: RuvLLM Integration with Ruvector
  • ADR-003: SIMD Optimization Strategy
  • ADR-004: KV Cache Management
  • ADR-006: Memory Management
  • ADR-007: Security Review & Technical Debt

Compliance and Standards

API Compatibility

  • MistralBackend implements LlmBackend trait
  • All existing RuvLLM consumers work unchanged
  • Feature flags are additive (no breaking changes)

Testing Requirements

  • Unit tests for config mapping (example after this list)
  • Integration tests with sample models
  • Benchmark suite comparing backends
  • CI matrix for feature flag combinations
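
As an example at the unit-test level, a config-mapping sanity check against our own adapter types (no mistral-rs model required):

#[cfg(all(test, feature = "mistral-rs"))]
mod tests {
    use super::*;

    #[test]
    fn default_paged_attention_config_is_sane() {
        let cfg = PagedAttentionConfig::default();
        assert_eq!(cfg.block_size, 16);
        assert!(cfg.gpu_memory_fraction > 0.0 && cfg.gpu_memory_fraction <= 1.0);
        assert!(cfg.enable_prefix_caching);
    }
}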

Documentation Requirements

  • Feature flag documentation in README
  • Backend selection guide
  • Performance comparison benchmarks

References

  1. mistral-rs Repository: https://github.com/EricLBuehler/mistral.rs
  2. vLLM PagedAttention Paper: "Efficient Memory Management for Large Language Model Serving with PagedAttention"
  3. X-LoRA Paper: "X-LoRA: Mixture of Low-Rank Adapter Experts"
  4. ISQ/AWQ Paper: "AWQ: Activation-aware Weight Quantization for LLM Compression"
  5. Existing MistralBackend stub: crates/ruvllm/src/backends/mistral_backend.rs

Implementation Status

| Component | Status | Notes |
|---|---|---|
| Feature flags | Pending | Add to Cargo.toml |
| Config mapping | Pending | MistralBackendConfig -> mistralrs::Config |
| Model loading | Pending | Wire to mistral-rs loader |
| Generation | Pending | Wire to mistral-rs inference |
| PagedAttention | Pending | Enable via config |
| X-LoRA | Pending | Wire existing XLoraManager |
| ISQ | Pending | Implement calibration pipeline |
| Metal acceleration | Pending | Test on Apple Silicon |
| CUDA acceleration | Pending | Test on NVIDIA GPUs |

Revision History

| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2026-01-20 | Ruvector Architecture Team | Initial proposal |