Document ID: ADR-STS-AGI-001 Status: Implemented (Core Infrastructure Complete) Date: 2026-02-20 Version: 2.0 Authors: RuVector Architecture Team Related ADRs: ADR-STS-001, ADR-STS-002, ADR-STS-003, ADR-STS-006, ADR-039 Scope: AGI-aligned capability integration for ultra-low-latency sublinear solvers
The sublinear-time-solver library provides O(log n) iterative solvers (Neumann series, Push-based, Hybrid Random Walk) with SIMD-accelerated SpMV kernels achieving up to 400M nonzeros/s on AVX-512. Current algorithm selection is static: the caller chooses a solver at compile time. AGI-class reasoning introduces a fundamentally different paradigm -- the system itself selects, tunes, and generates solver strategies at runtime based on learned representations of problem structure.
| Multiplier | Mechanism | Expected Gain |
|---|---|---|
| Neural algorithm routing | SONA maps problem features to optimal solver | 3-10x latency reduction for misrouted problems |
| Fused kernel generation | Problem-specific SIMD code synthesis | 2-5x throughput over generic kernels |
| Predictive preconditioning | Learned preconditioner selection | ~3x fewer iterations |
| Memory-aware scheduling | Cache-optimal tiling and prefetch | 1.5-2x bandwidth utilization |
| Coherence-driven termination | Prime Radiant scores guide early exit | 15-40% latency savings on converged problems |
Combined, these capabilities target a 0.15x end-to-end latency envelope relative to the current baseline -- moving from milliseconds to sub-hundred-microsecond solves for typical vector database workloads (n <= 100K, nnz/n ~ 10-50).
All core infrastructure components specified in this document are now implemented:
| Component | Specified In | Implemented In | LOC | Status |
|---|---|---|---|---|
| Neural algorithm routing | Section 2 | router.rs (1,702 LOC, 24 tests) |
1,702 | Complete |
| SpMV fused kernels | Section 3 | simd.rs (162), types.rs spmv_fast_f32 |
762 | Complete (AVX2/NEON/WASM) |
| Jacobi preconditioning | Section 4 | neumann.rs (715 LOC) |
715 | Complete |
| Arena memory management | Section 5 | arena.rs (176 LOC) |
176 | Complete |
| Coherence convergence checks | Section 6 | budget.rs (310), error.rs (120) |
430 | Complete |
| Cross-layer optimization | Section 7 | All 18 modules (10,729 LOC) | 10,729 | Phase 1 Complete |
| Audit/witness trail | Section 7.4 | audit.rs (316 LOC, 8 tests) |
316 | Complete |
| Input validation | Implied | validation.rs (790 LOC, 39 tests) |
790 | Complete |
| Event sourcing | Implied | events.rs (86 LOC) |
86 | Complete |
Total: 10,729 LOC across 18 modules, 241 tests, 7 algorithms fully operational.
| Target | Specified | Current | Gap |
|---|---|---|---|
| Routing accuracy | 95% | Router implemented, training pending | Training on SuiteSparse |
| SpMV throughput | 8.4 GFLOPS | Fused f32 kernels operational | Benchmark pending |
| Convergence iterations | k/3 | Jacobi preconditioning active | ILU/AMG in Phase 2 |
| Memory overhead | 1.2x | Arena allocator (176 LOC) | Profiling pending |
| End-to-end latency | 0.15x | Full pipeline implemented | Benchmark pending |
| Cache miss rate | 12% | Tiled SpMV available | perf measurement pending |
| Tolerance waste | < 5% | Dynamic budget in budget.rs |
Tuning in Phase 2 |
The solver library exposes three algorithms with distinct convergence profiles:
- NeumannSolver: O(k * nnz) per solve, converges for rho(I - D^{-1}A) < 1. Optimal for diagonally dominant systems with moderate condition number.
- Push-based: Localized computation proportional to output precision. Optimal for problems where only a few components of x matter.
- Hybrid Random Walk: Stochastic with O(1/epsilon^2) variance. Optimal for massive graphs where deterministic iteration is memory-bound.
Static selection forces the caller to understand spectral properties before calling the solver. Misrouting (e.g., using Neumann on a poorly conditioned Laplacian) wastes 3-10x wall-clock time before the spectral radius check rejects the problem.
SONA (crates/sona/) already implements adaptive routing with experience replay.
The integration pathway:
-
Feature extraction (< 50us): From the CsrMatrix, extract a fixed-size feature vector -- dimension n, nnz, average row degree, diagonal dominance ratio, estimated spectral radius (reusing
POWER_ITERATION_STEPSfromneumann.rs), sparsity profile class, and row-length variance. -
Neural routing: SONA's MLP (3x64, ReLU) maps features to a distribution over {Neumann, Push, RandomWalk, CG-fallback}. Runs in < 100us on CPU.
-
Reinforcement learning on convergence feedback: After each solve, the router receives a reward:
reward = -log(wall_time) + alpha * (1 - residual_norm / tolerance)The
ConvergenceInfostruct already captures iterations, residual_norm, and elapsed -- all required for reward computation. -
Online adaptation: SONA's ReasoningBank stores (features, choice, reward) triples. Mini-batch updates every 100 solves refine the policy.
- Routing accuracy: 70% (heuristic) to 95% (learned) on SuiteSparse benchmarks
- Misrouted latency: 3-10x reduction by eliminating wasted iterations
- Cold-start: Pre-trained on synthetic matrices covering all SparsityProfile variants
The current SpMV in types.rs is generic over T: Copy + Default + Mul + AddAssign.
The spmv_fast_f32 variant eliminates bounds checks but uses a single loop structure
regardless of sparsity pattern. Pattern-specific kernels yield significant gains.
An AGI code synthesis agent observes SparsityProfile at runtime and generates optimized SIMD kernels per pattern:
- Band matrices: Fixed stride enables contiguous SIMD loads (no gather), unrolled loops eliminate branch misprediction. Expected: 4x throughput.
- Block-diagonal: Blocks fit in L1; dense GEMV replaces sparse SpMV within blocks. Expected: 3-5x throughput.
- Random sparse: Gather-based AVX-512 with software prefetching, row reordering by degree for SIMD lane balance. Expected: 1.5-2x throughput.
Matrix --> SparsityProfile classifier (< 10us)
--> Kernel template selection (band / block / random / dense)
--> SIMD intrinsic instantiation with concrete widths
--> Cranelift JIT compilation (< 1ms)
--> Cached by (profile, dimension_class, arch) key
JIT overhead amortizes after 2-3 solves. For long-running workloads, cache hit rate approaches 100% after warmup.
Two key optimizations in the SpMV hot loop:
- Gather latency hiding: On Zen 4/5,
vpgatherddhas 14-cycle latency. Generated kernels interleave 3 independent gather chains to keep the gather unit saturated. - Accumulator pressure: With 32 ZMM registers (AVX-512), 4 independent accumulators per row group reduce horizontal reduction frequency by 4x.
| Pattern | Current (GFLOPS) | Fused (GFLOPS) | Speedup |
|---|---|---|---|
| Band | 2.1 | 8.4 | 4.0x |
| Block-diagonal | 2.1 | 7.3 | 3.5x |
| Random sparse | 2.1 | 4.2 | 2.0x |
| Dense fallback | 2.1 | 10.5 | 5.0x |
The Neumann solver uses Jacobi preconditioning (D^{-1} scaling). This is O(n)
to compute and effective for diagonally dominant systems, but suboptimal for poorly
conditioned matrices where ILU(0) or AMG would converge in far fewer iterations.
A classifier predicts the optimal preconditioner from the neural router's feature vector:
| Preconditioner | Selection Criterion | Iteration Reduction |
|---|---|---|
| Jacobi (D^{-1}) | Diagonal dominance ratio > 2.0 | Baseline |
| Block-Jacobi | Block-diagonal structure detected | 2-3x |
| ILU(0) | Moderate kappa (< 1000) | 3-5x |
| SPAI | Random sparse, kappa > 1000 | 2-4x |
| AMG | Graph Laplacian structure | 5-10x (O(n) solve) |
Pre-trained on SuiteSparse (2,800+ matrices, 50+ domains) using spectral gap estimates, nonzero distribution entropy, graph structure metrics, and domain tags. Fine-tuning requires 50-100 labeled examples. For vector database workloads, Laplacian structure provides strong inductive bias -- AMG is almost always optimal.
The solver monitors convergence rate during the first 10 iterations. If the rate falls below 50% of the predicted rate, it switches to the next-best preconditioner candidate and resets the iteration counter. Overhead: < 1% per iteration.
EWC++ (crates/ruvector-gnn/) prevents catastrophic forgetting during adaptation:
L_total = L_task + lambda/2 * sum_i F_i * (theta_i - theta_i^*)^2
The preconditioner model retains SuiteSparse knowledge while learning production matrix distributions. Fisher information F_i weights parameter importance.
An AGI scheduler predicts total memory before solve initiation:
workspace_bytes = n * vectors_per_algorithm * sizeof(f64)
+ preconditioner_memory(profile, n) + alignment_padding
If workspace exceeds available L3, the scheduler selects a more memory-efficient algorithm or activates out-of-core streaming.
For large matrices (n > L2_size / sizeof(f64)), SpMV is tiled hierarchically:
- L1 (32-64 KB): x-vector segment per row tile fits in L1. Typical: 128-256 rows.
- L2 (256 KB - 1 MB): Multiple L1 tiles grouped for temporal reuse of shared column indices (common in graph Laplacians).
- L3 (4-32 MB): Full CSR data for tile group fits in L3. Matrices with n > 1M require partitioning.
The SpMV gather pattern x[col_indices[idx]] causes irregular access. AGI-driven
prefetch analyzes col_indices offline and inserts software prefetch instructions.
For random patterns, it prefetches x-entries for the next row while processing
the current row, hiding memory latency behind computation.
For parallel solvers on multi-socket systems: rows assigned by owner-computes rule, workspace allocated on local NUMA nodes (MPOL_BIND), and cross-NUMA reductions use hierarchical summation. Expected: 1.5-2x bandwidth on 2-socket, 2-3x on 4-socket.
The Prime Radiant framework computes coherence scores measuring solution consistency across complementary subspaces:
coherence(x_k) = 1 - ||P_1 x_k - P_2 x_k|| / ||x_k||
High coherence (> 0.95) indicates convergence in all significant modes, enabling early termination even before the residual norm reaches the requested tolerance.
The sheaf Laplacian provides tighter condition number estimates (kappa_sheaf <=
kappa_standard). A 5-step Lanczos iteration yields lambda_min/lambda_max estimates
in O(nnz), piggybacking on existing power iteration infrastructure. This enables
iteration count prediction: k_predicted = sqrt(kappa_sheaf) * log(1/epsilon).
In vector database workloads, ranking depends on relative ordering, not absolute accuracy. The system queries downstream accuracy requirements and computes:
epsilon_solver = delta_ranking / (kappa * ||A^{-1}||)
For top-10 retrieval (n=100K), this saves 15-40% of iterations.
The SOTA analysis (ADR-STS-SOTA) establishes epsilon_total <= sum(epsilon_i) for additive pipelines. AGI reasoning allocates the error budget optimally across solver, quantization, and approximation layers. If epsilon_total = 0.01 and epsilon_quantization = 0.003, the solver only needs epsilon_s = 0.007 -- potentially halving the iteration count.
- SVE2: Variable-length vectors (128-2048 bit). AGI kernel generator produces
SVE2 intrinsics adapting to hardware vector length via
svcntw(). - CXL memory: Pooled memory across hosts. Scheduler places large matrices in CXL memory, using prefetch to hide ~150ns latency (vs ~80ns local DDR5).
- AMX: Intel tile multiply for dense sub-blocks within sparse matrices provides 8x throughput over AVX-512.
pub struct AdaptiveSolver {
router: SonaRouter, // Neural algorithm selector
neumann: NeumannSolver, // Diagonal-dominant specialist
push: PushSolver, // Localized solve specialist
random_walk: RandomWalkSolver,// Memory-bound specialist
cg: ConjugateGradient, // General SPD fallback
kernel_cache: KernelCache, // JIT-compiled SpMV kernels
precond_model: PrecondModel, // Learned preconditioner selector
}Router, kernel cache, and preconditioner model cooperate to minimize end-to-end solve time for each problem instance.
Pipeline: Query -> Embedding -> HNSW Search -> Graph Construction -> Solver -> Ranking
- Solver-HNSW fusion: Operate on HNSW edges directly, skip graph construction.
- Speculative solving: Begin with approximate graph while HNSW refines;
warm-start from streaming checkpoints (
fast_solver.rs). - Batch amortization: Share preconditioner across multiple concurrent solves.
Every AGI-influenced decision is recorded in an RVF witness chain (SHAKE-256,
crates/rvf/rvf-crypto/) capturing input hash, algorithm choice, router
confidence, preconditioner, iterations, residual, and wall time. This enables
deterministic replay, regression detection, and correctness verification.
| Capability | Current | Target | Method | Validation |
|---|---|---|---|---|
| Routing accuracy | 70% | 95% | SONA neural router | SuiteSparse benchmarks |
| SpMV throughput (GFLOPS) | 2.1 | 8.4 | Fused kernels | Band/block/random sweep |
| Convergence iterations | k | k/3 | Predictive preconditioning | Condition-stratified test |
| Memory overhead | 2.5x | 1.2x | Memory-aware scheduling | Peak RSS measurement |
| End-to-end latency | 1.0x | 0.15x | Cross-layer fusion | Full pipeline benchmark |
| L2 cache miss rate | 35% | 12% | Tiling + prefetch | perf stat counters |
| NUMA scaling | 60% | 85% | Owner-computes | 2/4-socket tests |
| Tolerance waste | 40% | < 5% | Dynamic adjustment | Ranking accuracy vs. time |
| Stage | Current (us) | Target (us) | Reduction |
|---|---|---|---|
| Feature extraction | 0 | 45 | N/A (new) |
| Router inference | 0 | 8 | N/A (new) |
| Kernel lookup/JIT | 0 | 2 (cached) | N/A (new) |
| Preconditioner setup | 50 | 30 | 0.6x |
| SpMV iterations | 800 | 120 | 0.15x |
| Convergence check | 20 | 5 | 0.25x |
| Total | 870 | 210 | 0.24x |
The 55us AGI overhead is recouped within the first 2 iterations of the improved solver.
Extract feature vectors from SuiteSparse (2,800+ matrices), compute ground-truth
optimal algorithm per matrix, train SONA MLP (input(7)->64->64->64->output(4),
Adam lr=1e-3), integrate into AdaptiveSolver with convergence feedback RL, and
validate 95% accuracy at < 100us latency.
Deps: crates/sona/, ConvergenceInfo.
Realized: ruvector-solver crate with router.rs (1,702 LOC), neumann.rs (715), cg.rs (1,112), forward_push.rs (828), backward_push.rs (714), random_walk.rs (838), true_solver.rs (908), bmssp.rs (1,151). All algorithms operational with 241 tests passing.
Implement SparsityProfile classifier extending the existing enum in types.rs.
Write kernel templates per pattern and ISA (AVX-512, AVX2, NEON, WASM SIMD128).
Integrate Cranelift JIT with kernel cache keyed by (profile, arch). Benchmark
against generic SpMV on SuiteSparse.
Deps: cranelift-jit, ruvector-core SIMD intrinsics.
Implement ILU(0), Block-Jacobi, and SPAI behind a Preconditioner trait. Train
preconditioner classifier on SuiteSparse with total-solve-time labels. Integrate
EWC++ from crates/ruvector-gnn/ for continual learning. Deploy online refinement
with convergence-rate monitoring.
Deps: crates/ruvector-gnn/ EWC++.
Solver-HNSW fusion and speculative solving with warm-start. RVF witness chain
deployment (SHAKE-256). SVE2/CXL/AMX hardware integration. Full pipeline
benchmark and regression testing against witness baselines.
Deps: All prior phases, crates/rvf/rvf-crypto/.
Risk: AGI overhead (~55us) exceeds savings for small problems. Mitigation: Bypass router for n < 5000; use lookup tables for common profiles; amortize in batch mode. Residual: Low for target range (n = 10K-1M).
Risk: Router trained on SuiteSparse misroutes novel matrix families. Mitigation: Confidence threshold (p < 0.6 -> CG fallback); online RL adapts to production distribution; EWC++ prevents forgetting. Residual: Medium -- novel structures need 50-100 solves to adapt.
Risk: JIT kernels are opaque to developers. Mitigation: Template-based generation (not arbitrary code); RVF witness chain records kernel version; versioned cache enables rollback; embedded generation comments for inspection. Residual: Low.
Risk: Mid-iteration switches cause non-monotone residual decay.
Mitigation: Switches reset iteration counter and baseline; existing
INSTABILITY_GROWTH_FACTOR detection applies post-switch; witness chain records
switch points. Residual: Low.
Risk: Kernels tuned for one microarchitecture underperform on another. Mitigation: Cache keyed by arch; auto-tuning on first run; WASM SIMD128 portable fallback; SVE2 vector-length-agnostic model. Residual: Low.
-
Spielman, D.A., Teng, S.-H. (2014). Nearly Linear Time Algorithms for Preconditioning and Solving SDD Linear Systems. SIAM J. Matrix Anal. Appl.
-
Koutis, I., Miller, G.L., Peng, R. (2011). A Nearly-m*log(n) Time Solver for SDD Linear Systems. FOCS 2011.
-
Martinsson, P.G., Tropp, J.A. (2020). Randomized Numerical Linear Algebra: Foundations and Algorithms. Acta Numerica, 29, 403-572.
-
Chen, L. et al. (2022). Maximum Flow and Minimum-Cost Flow in Almost-Linear Time. FOCS 2022. arXiv:2203.00671.
-
Kirkpatrick, J. et al. (2017). Overcoming Catastrophic Forgetting in Neural Networks. PNAS, 114(13), 3521-3526.
-
RuVector ADR-STS-SOTA-research-analysis.md (2026).
-
RuVector ADR-STS-optimization-guide.md (2026).