# Sprux

Sprux is a high-performance sparse direct solver with GPU acceleration. Formerly BaSpaCho (Batched Sparse Cholesky).

## Features
- Cholesky (SPD), LU with partial pivoting (general), LDL^T (symmetric indefinite)
- GPU backends: CUDA (NVIDIA), Metal (Apple Silicon), OpenCL (experimental)
- CPU backends: OpenBLAS, Intel MKL, Apple Accelerate
- Supernodal sparse elimination with level-set parallelism
- Preprocessing pipeline: BTF max transversal, equilibration, static pivoting
- External encoder API for GPU pipeline embedding (IREE, XLA custom-calls)
- Mixed-precision iterative refinement (float GPU factor + double CPU accumulation)
- Block-structured matrices with partial factor/solve for marginal computation
- Python bindings via pybind11
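Mixed-precision refinement factors in single precision (cheap on the GPU) and recovers double-precision accuracy by accumulating residual corrections in double. A minimal dense NumPy sketch of the idea (illustrative only, not the Sprux API — a dense `np.linalg.solve` stands in for the sparse factor):

```python
import numpy as np

def refine(A, b, n_iters=3):
    """Mixed-precision iterative refinement sketch:
    'factor' in float32, correct the solution with float64 residuals."""
    A32 = A.astype(np.float32)               # single-precision factor proxy
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(n_iters):
        r = b - A @ x                        # residual in double precision
        dx = np.linalg.solve(A32, r.astype(np.float32))
        x += dx.astype(np.float64)           # correction applied in double
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = refine(A, b)
```

Each iteration shrinks the error by roughly the single-precision unit roundoff, so a few iterations reach double-precision accuracy for well-conditioned systems.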
## Building

```bash
# Configure (CPU with OpenBLAS, no GPU)
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DSPRUX_USE_CUBLAS=0
# Build
cmake --build build -j16
# Test
ctest --test-dir build
```

For Metal (Apple Silicon):

```bash
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release \
    -DSPRUX_USE_CUBLAS=0 -DSPRUX_USE_METAL=1 -DBLA_VENDOR=Apple
cmake --build build -j16
```

For CUDA (NVIDIA):

```bash
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc
cmake --build build -j16
```

## Backends

| Backend | Flag | Precision | GPU | Best For |
|---|---|---|---|---|
| CPU BLAS | `-DSPRUX_USE_BLAS=1` (default) | float/double | No | General use, double precision |
| CUDA | `-DSPRUX_USE_CUBLAS=1` (default) | float/double | NVIDIA | Large problems, double precision on GPU |
| Metal | `-DSPRUX_USE_METAL=1` | float only | Apple Silicon | macOS, mixed-precision with refinement |
| OpenCL | `-DSPRUX_USE_OPENCL=1` | float/double | Any | Experimental portable GPU |
Runtime selection:

```cpp
Settings settings;
settings.backend = BackendAuto; // Auto-detect: CUDA > Metal > OpenCL > CPU
auto solver = createSolver(settings, paramSize, structure);
```

## CMake Options

| CMake Option | Default | Description |
|---|---|---|
| `SPRUX_USE_CUBLAS` | ON | Enable CUDA support |
| `SPRUX_USE_METAL` | OFF | Enable Metal support (macOS only) |
| `SPRUX_USE_OPENCL` | OFF | Enable OpenCL + CLBlast |
| `SPRUX_USE_BLAS` | ON | Enable CPU BLAS |
| `SPRUX_CUDA_ARCHS` | "detect" | CUDA architectures ("detect", "torch", or "60;70;75") |
| `SPRUX_USE_SUITESPARSE_AMD` | OFF | Use SuiteSparse AMD instead of Eigen |
| `SPRUX_BUILD_TESTS` | ON | Build unit tests |
| `SPRUX_BUILD_EXAMPLES` | ON | Build examples and benchmarks |
| `BLA_VENDOR` | (auto) | BLAS vendor: ATLAS, OpenBLAS, Intel10_64lp_seq, Apple |
## Quick Start

Cholesky (SPD):

```cpp
#include "sprux/sprux/Solver.h"

using namespace Sprux;

Settings settings;
settings.backend = BackendFast;
auto solver = createSolver(settings, paramSize, sparseStructure);
solver->factor(data.data());
solver->solve(data.data(), rhs.data(), n, 1);
```

LU with partial pivoting (general matrices):

```cpp
Settings settings;
settings.backend = BackendMetal;
settings.matrixType = MTYPE_GENERAL;
settings.staticPivotThreshold = 0.0; // auto
auto solver = createSolver(settings, paramSize, sparseStructure);

std::vector<int64_t> pivots(solver->numSpans());
solver->factorLU(data.data(), pivots.data());
solver->solveLU(data.data(), pivots.data(), rhs.data(), n, 1);
```

LDL^T (symmetric indefinite):

```cpp
solver->factorLDLT(data.data());
solver->solveLDLT(data.data(), rhs.data(), n, 1);
```

See docs/api-guide.md for detailed usage including Metal embedding, persistent contexts, the preprocessing pipeline, and Python bindings.
## Benchmarks

```bash
# Cholesky — compare with CHOLMOD baseline
build/sprux/benchmarking/bench -B 1_CHOLMOD

# Bundle Adjustment in the Large
build/sprux/benchmarking/BAL_bench -i ~/BAL/problem-871-527480-pre.txt

# LU — circuit Jacobians with Metal/CUDA/CPU backends
build/sprux/benchmarking/lu_bench -d test_data/c6288_sequence -b Metal_Sparse
```

See docs/benchmarks.md for benchmark tools, test data, and CI setup.
## Architecture

The solver pipeline: symbolic analysis (AMD ordering, supernode merging, level-set scheduling) → numeric factorization (GPU sparse elimination + CPU/GPU dense BLAS) → solve (forward/backward substitution).
Key design decisions:
- Hybrid GPU/CPU execution: sparse elimination runs on GPU; dense operations use CPU BLAS for small blocks (exploiting Apple Silicon unified memory or cheap D↔H copies on CUDA)
- Level-set parallelism: independent eliminations are batched into single GPU dispatches
- External encoder API: factor and solve operations encode into a caller-provided Metal command encoder for zero-overhead pipeline fusion
See docs/architecture.md for full details on data structures, backend design, and memory management.
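The level-set batching above can be sketched as a topological leveling of the elimination dependency graph: every node in one level depends only on earlier levels, so a whole level can go into a single GPU dispatch. A small illustrative Python sketch (not Sprux internals; `deps` is a hypothetical node → prerequisites map):

```python
from collections import defaultdict

def level_sets(deps):
    """Group nodes into levels: nodes within one level have no mutual
    dependencies, so each level can be eliminated in one batched dispatch.
    deps: dict mapping node -> list of nodes eliminated before it."""
    level = {}
    def node_level(n):
        if n not in level:
            # A node's level is one past the deepest of its prerequisites.
            level[n] = 1 + max((node_level(d) for d in deps.get(n, [])),
                               default=-1)
        return level[n]
    buckets = defaultdict(list)
    for n in deps:
        buckets[node_level(n)].append(n)
    return [sorted(buckets[l]) for l in sorted(buckets)]

# Elimination-tree-like dependencies: 0 and 1 are independent leaves,
# 2 needs both, 3 needs 2.
deps = {0: [], 1: [], 2: [0, 1], 3: [2]}
levels = level_sets(deps)
print(levels)  # → [[0, 1], [2], [3]]
```

Here nodes 0 and 1 form one batch; a deep, narrow tree yields many tiny levels, which is why supernode merging matters for GPU utilization.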
## Python Bindings

```python
import sprux

solver = sprux.create_solver(param_sizes, row_ptrs, col_inds,
                             matrix_type="general", backend="metal")
solver.factor_lu(data, pivots)
solver.solve_lu(data, pivots, rhs)
```

Build with `cmake -DSPRUX_BUILD_PYTHON=ON` (requires pybind11).
## Examples

- `Optimizer.h` — Levenberg-Marquardt optimizer with direct and mixed direct/iterative solvers
- `OptimizeSimple.cpp` — spring-connected points
- `OptimizeBaAtLarge.cpp` — bundle adjustment from BAL
- `OptimizeCompModel.cpp` — fit a BLAS computation model to hardware timings
- `PCG_Sample.cpp` — partial elimination + preconditioned conjugate gradient
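For context on the PCG example, the core iteration is a standard conjugate-gradient loop; `PCG_Sample.cpp` additionally preconditions it with a partial elimination, which is omitted in this textbook NumPy sketch:

```python
import numpy as np

def cg(A, b, tol=1e-10, max_iter=100):
    """Unpreconditioned conjugate gradient for SPD A (textbook sketch)."""
    x = np.zeros_like(b)
    r = b - A @ x            # initial residual
    p = r.copy()             # initial search direction
    rs = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs / (p @ Ap)        # step length along p
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p    # conjugate direction update
        rs = rs_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = cg(A, b)
```

A good preconditioner (such as the partially eliminated system) cuts the iteration count by improving the effective condition number.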
## Dependencies

Fetched automatically by CMake:
- Eigen 3.4.0, GoogleTest, dispenso (multithreading), SuiteSparse BTF
- Sophus (BA examples only)
Optional:
- CUDA Toolkit 10.2+ (arch ≥ 60 for double atomics)
- CHOLMOD (SuiteSparse) — benchmarking baseline
- OpenCL 1.2+ & CLBlast — OpenCL backend
- pybind11 — Python bindings
## Limitations

- Block structure: the library works with block-structured matrices. Purely scalar matrices (all 1×1 blocks) work but won't benefit from supernodal BLAS. Best performance comes from parameter blocks of size 1–12.
- CUDA determinism: sparse elimination uses `atomicAdd` on the GPU, making CUDA results non-deterministic by default. Use the two-phase deterministic elimination option if needed. CUDA architecture ≥ 6.0 is required for double-precision `atomicAdd`.
- Metal precision: float only. Use `BackendFast` or `BackendCuda` for double precision. Mixed-precision iterative refinement recovers double-precision accuracy on Metal.
- Ordering: only AMD (Approximate Minimum Degree) reordering is supported.
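The `atomicAdd` non-determinism comes from floating-point addition not being associative: racing atomic updates can land in any order, and different orders round differently. A quick illustration of two accumulation orders:

```python
# The same four values summed in two orders, mimicking two possible
# interleavings of concurrent atomicAdd updates.
vals = [1e16, 1.0, -1e16, 1.0]

left_to_right = ((vals[0] + vals[1]) + vals[2]) + vals[3]  # the 1.0 is absorbed by 1e16
reordered = ((vals[0] + vals[2]) + vals[1]) + vals[3]      # large terms cancel first

print(left_to_right, reordered)  # → 1.0 2.0
```

This is why bitwise-reproducible results require a fixed accumulation order, which the two-phase deterministic elimination option provides at some performance cost.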
## License

MIT — see LICENSE.
Original BaSpaCho: Copyright (c) Meta Platforms, Inc. and affiliates. Sprux extensions: Copyright (c) Robert Taylor, 2026.