# Sprux

Sprux is a high-performance sparse direct solver with GPU acceleration. Formerly BaSpaCho (Batched Sparse Cholesky).

## Features
- Cholesky (SPD), LU with partial pivoting (general), LDL^T (symmetric indefinite)
- GPU backends: CUDA (NVIDIA), Metal (Apple Silicon), OpenCL (experimental)
- CPU backends: OpenBLAS, Intel MKL, Apple Accelerate
- Supernodal sparse elimination with level-set parallelism
- Preprocessing pipeline: BTF max transversal, equilibration, static pivoting
- External encoder API for GPU pipeline embedding (IREE, XLA custom-calls)
- Mixed-precision iterative refinement (float GPU factor + double CPU accumulation)
- Block-structured matrices with partial factor/solve for marginal computation
- Python bindings via pybind11
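Mixed-precision refinement factors in single precision (cheap on the GPU) and recovers double-precision accuracy by accumulating residual corrections in double. A minimal dense NumPy sketch of the idea (illustrative only, not the Sprux API — a dense `np.linalg.solve` stands in for the sparse factor):

```python
import numpy as np

def refine(A, b, n_iters=3):
    """Mixed-precision iterative refinement sketch:
    'factor' in float32, correct the solution with float64 residuals."""
    A32 = A.astype(np.float32)               # single-precision factor proxy
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(n_iters):
        r = b - A @ x                        # residual in double precision
        dx = np.linalg.solve(A32, r.astype(np.float32))
        x += dx.astype(np.float64)           # correction applied in double
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = refine(A, b)
```

Each iteration shrinks the error by roughly the single-precision unit roundoff, so a few iterations reach double-precision accuracy for well-conditioned systems.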
## Building

```bash
# Configure (CPU with OpenBLAS, no GPU)
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DSPRUX_USE_CUBLAS=0
# Build
cmake --build build -j16
# Test
ctest --test-dir build
```

For Metal (Apple Silicon):

```bash
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release \
    -DSPRUX_USE_CUBLAS=0 -DSPRUX_USE_METAL=1 -DBLA_VENDOR=Apple
cmake --build build -j16
```

For CUDA (NVIDIA):

```bash
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc
cmake --build build -j16
```

## Backends

| Backend | Flag | Precision | GPU | Best For |
|---|---|---|---|---|
| CPU BLAS | `-DSPRUX_USE_BLAS=1` (default) | float/double | No | General use, double precision |
| CUDA | `-DSPRUX_USE_CUBLAS=1` (default) | float/double | NVIDIA | Large problems, double precision on GPU |
| Metal | `-DSPRUX_USE_METAL=1` | float only | Apple Silicon | macOS, mixed-precision with refinement |
| OpenCL | `-DSPRUX_USE_OPENCL=1` | float/double | Any | Experimental portable GPU |
Runtime selection:

```cpp
Settings settings;
settings.backend = BackendAuto; // Auto-detect: CUDA > Metal > OpenCL > CPU
auto solver = createSolver(settings, paramSize, structure);
```

## CMake Options

| CMake Option | Default | Description |
|---|---|---|
| `SPRUX_USE_CUBLAS` | ON | Enable CUDA support |
| `SPRUX_USE_METAL` | OFF | Enable Metal support (macOS only) |
| `SPRUX_USE_OPENCL` | OFF | Enable OpenCL + CLBlast |
| `SPRUX_USE_BLAS` | ON | Enable CPU BLAS |
| `SPRUX_CUDA_ARCHS` | "detect" | CUDA architectures ("detect", "torch", or "60;70;75") |
| `SPRUX_USE_SUITESPARSE_AMD` | OFF | Use SuiteSparse AMD instead of Eigen |
| `SPRUX_BUILD_TESTS` | ON | Build unit tests |
| `SPRUX_BUILD_EXAMPLES` | ON | Build examples and benchmarks |
| `BLA_VENDOR` | (auto) | BLAS vendor: ATLAS, OpenBLAS, Intel10_64lp_seq, Apple |
## Quick Start

Cholesky (SPD):

```cpp
#include "sprux/sprux/Solver.h"

using namespace Sprux;

Settings settings;
settings.backend = BackendFast;
auto solver = createSolver(settings, paramSize, sparseStructure);
solver->factor(data.data());
solver->solve(data.data(), rhs.data(), n, 1);
```

LU with partial pivoting (general matrices):

```cpp
Settings settings;
settings.backend = BackendMetal;
settings.matrixType = MTYPE_GENERAL;
settings.staticPivotThreshold = 0.0; // auto
auto solver = createSolver(settings, paramSize, sparseStructure);

std::vector<int64_t> pivots(solver->numSpans());
solver->factorLU(data.data(), pivots.data());
solver->solveLU(data.data(), pivots.data(), rhs.data(), n, 1);
```

LDL^T (symmetric indefinite):

```cpp
solver->factorLDLT(data.data());
solver->solveLDLT(data.data(), rhs.data(), n, 1);
```

See docs/api-guide.md for detailed usage including Metal embedding, persistent contexts, the preprocessing pipeline, and Python bindings.
## Benchmarks

```bash
# Cholesky — compare with CHOLMOD baseline
build/sprux/benchmarking/bench -B 1_CHOLMOD

# Bundle Adjustment in the Large
build/sprux/benchmarking/BAL_bench -i ~/BAL/problem-871-527480-pre.txt

# LU — circuit Jacobians with Metal/CUDA/CPU backends
build/sprux/benchmarking/lu_bench -d test_data/c6288_sequence -b Metal_Sparse
```

See docs/benchmarks.md for benchmark tools, test data, and CI setup.
## Architecture

The solver pipeline: symbolic analysis (AMD ordering, supernode merging, level-set scheduling) → numeric factorization (GPU sparse elimination + CPU/GPU dense BLAS) → solve (forward/backward substitution).
Key design decisions:
- Hybrid GPU/CPU execution: sparse elimination runs on GPU; dense operations use CPU BLAS for small blocks (exploiting Apple Silicon unified memory or cheap D↔H copies on CUDA)
- Level-set parallelism: independent eliminations are batched into single GPU dispatches
- External encoder API: factor and solve operations encode into a caller-provided Metal command encoder for zero-overhead pipeline fusion
See docs/architecture.md for full details on data structures, backend design, and memory management.
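The level-set batching above can be sketched as a topological leveling of the elimination dependency graph: every node in one level depends only on earlier levels, so a whole level can go into a single GPU dispatch. A small illustrative Python sketch (not Sprux internals; `deps` is a hypothetical node → prerequisites map):

```python
from collections import defaultdict

def level_sets(deps):
    """Group nodes into levels: nodes within one level have no mutual
    dependencies, so each level can be eliminated in one batched dispatch.
    deps: dict mapping node -> list of nodes eliminated before it."""
    level = {}
    def node_level(n):
        if n not in level:
            # A node's level is one past the deepest of its prerequisites.
            level[n] = 1 + max((node_level(d) for d in deps.get(n, [])),
                               default=-1)
        return level[n]
    buckets = defaultdict(list)
    for n in deps:
        buckets[node_level(n)].append(n)
    return [sorted(buckets[l]) for l in sorted(buckets)]

# Elimination-tree-like dependencies: 0 and 1 are independent leaves,
# 2 needs both, 3 needs 2.
deps = {0: [], 1: [], 2: [0, 1], 3: [2]}
levels = level_sets(deps)
print(levels)  # → [[0, 1], [2], [3]]
```

Here nodes 0 and 1 form one batch; a deep, narrow tree yields many tiny levels, which is why supernode merging matters for GPU utilization.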
## Python Bindings

```python
import sprux

solver = sprux.create_solver(param_sizes, row_ptrs, col_inds,
                             matrix_type="general", backend="metal")
solver.factor_lu(data, pivots)
solver.solve_lu(data, pivots, rhs)
```

Build with `cmake -DSPRUX_BUILD_PYTHON=ON` (requires pybind11).
## Examples

- `Optimizer.h` — Levenberg-Marquardt optimizer with direct and mixed direct/iterative solvers
- `OptimizeSimple.cpp` — spring-connected points
- `OptimizeBaAtLarge.cpp` — bundle adjustment from BAL
- `OptimizeCompModel.cpp` — fit a BLAS computation model to hardware timings
- `PCG_Sample.cpp` — partial elimination + preconditioned conjugate gradient
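For context on the PCG example, the core iteration is a standard conjugate-gradient loop; `PCG_Sample.cpp` additionally preconditions it with a partial elimination, which is omitted in this textbook NumPy sketch:

```python
import numpy as np

def cg(A, b, tol=1e-10, max_iter=100):
    """Unpreconditioned conjugate gradient for SPD A (textbook sketch)."""
    x = np.zeros_like(b)
    r = b - A @ x            # initial residual
    p = r.copy()             # initial search direction
    rs = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs / (p @ Ap)        # step length along p
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p    # conjugate direction update
        rs = rs_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = cg(A, b)
```

A good preconditioner (such as the partially eliminated system) cuts the iteration count by improving the effective condition number.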
## Dependencies

Fetched automatically by CMake:
- Eigen 3.4.0, GoogleTest, dispenso (multithreading), SuiteSparse BTF
- Sophus (BA examples only)
Optional:
- CUDA Toolkit 10.2+ (arch ≥ 60 for double atomics)
- CHOLMOD (SuiteSparse) — benchmarking baseline
- OpenCL 1.2+ & CLBlast — OpenCL backend
- pybind11 — Python bindings
## Limitations

- Block structure: the library works with block-structured matrices. Purely scalar matrices (all 1×1 blocks) work but won't benefit from supernodal BLAS. Best performance comes from parameter blocks of size 1–12.
- CUDA determinism: sparse elimination uses `atomicAdd` on the GPU, making CUDA results non-deterministic by default. Use the two-phase deterministic elimination option if needed. CUDA architecture ≥ 6.0 is required for double-precision `atomicAdd`.
- Metal precision: float only. Use `BackendFast` or `BackendCuda` for double precision. Mixed-precision iterative refinement recovers double-precision accuracy on Metal.
- Ordering: only AMD (Approximate Minimum Degree) reordering is supported.
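The `atomicAdd` non-determinism comes from floating-point addition not being associative: racing atomic updates can land in any order, and different orders round differently. A quick illustration of two accumulation orders:

```python
# The same four values summed in two orders, mimicking two possible
# interleavings of concurrent atomicAdd updates.
vals = [1e16, 1.0, -1e16, 1.0]

left_to_right = ((vals[0] + vals[1]) + vals[2]) + vals[3]  # the 1.0 is absorbed by 1e16
reordered = ((vals[0] + vals[2]) + vals[1]) + vals[3]      # large terms cancel first

print(left_to_right, reordered)  # → 1.0 2.0
```

This is why bitwise-reproducible results require a fixed accumulation order, which the two-phase deterministic elimination option provides at some performance cost.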
## License

MIT — see LICENSE.
Original BaSpaCho: Copyright (c) Meta Platforms, Inc. and affiliates. Sprux extensions: Copyright (c) Robert Taylor, 2026.