Skip to content

Commit cb2e5aa

Browse files
Implement BLISFlameLUFactorization with fallback to reference LAPACK
Adds BLISFlameLUFactorization based on ideas from PR SciML#660, with fallback approach due to libflame/ILP64 compatibility limitations: - Created LinearSolveBLISFlameExt extension module - Uses BLIS for BLAS operations and reference LAPACK for LAPACK operations - Provides placeholder for future true libflame integration when compatible - Added to benchmark script for performance comparison - Includes comprehensive tests integrated with existing test framework Technical details: - libflame_jll uses 32-bit integers, incompatible with Julia's ILP64 BLAS - Extension uses same approach as BLISLUFactorization but with different naming - Serves as foundation for future libflame integration when packages are compatible 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>
1 parent dc670a5 commit cb2e5aa

File tree

8 files changed

+808
-4
lines changed

8 files changed

+808
-4
lines changed

Project.toml

Lines changed: 12 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -5,32 +5,37 @@ version = "3.24.0"
55

66
[deps]
77
ArrayInterface = "4fba245c-0d91-5ea0-9b3e-6abc04ee57a9"
8+
BenchmarkTools = "6e4b80f9-dd63-53aa-95a3-0cdb28fa8baf"
89
ChainRulesCore = "d360d2e6-b24c-11e9-a2a3-2a2ae2dbcce4"
910
ConcreteStructs = "2569d6c7-a4a2-43d3-a901-331e8e4be471"
1011
DocStringExtensions = "ffbed154-4ef7-542d-bbb7-c09d3a79fcae"
1112
EnumX = "4e289a0a-7415-4d19-859d-a7e5c4648b56"
1213
GPUArraysCore = "46192b85-c4d5-4398-a991-12ede77f4527"
1314
InteractiveUtils = "b77e0a4c-d291-57a0-90e8-8db25a27a240"
1415
Krylov = "ba0b0d4f-ebba-5204-a429-3ac8c609bfb7"
16+
LAPACK_jll = "51474c39-65e3-53ba-86ba-03b1b862ec14"
1517
LazyArrays = "5078a376-72f3-5289-bfd5-ec5146d43c02"
1618
Libdl = "8f399da3-3557-5675-b5ff-fb832c97cbdb"
1719
LinearAlgebra = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"
1820
MKL_jll = "856f044c-d86e-5d09-b602-aeab76dc8ba7"
1921
Markdown = "d6f4376e-aef5-505a-96c1-9c027394607a"
22+
Plots = "91a5bcdd-55d7-5caf-9e0b-520d859cae80"
2023
PrecompileTools = "aea7be01-6a6a-4083-8856-8a6e6704d82a"
2124
Preferences = "21216c6a-2e73-6563-6e65-726566657250"
2225
RecursiveArrayTools = "731186ca-8d62-57ce-b412-fbd966d074cd"
26+
RecursiveFactorization = "f2c3362d-daeb-58d1-803e-2bc74f2840b4"
2327
Reexport = "189a3867-3050-52da-a836-e630ba90ab69"
2428
SciMLBase = "0bca4576-84f4-4d90-8ffe-ffa030f20462"
2529
SciMLOperators = "c0aeaf25-5076-4817-a8d5-81caf7dfa961"
2630
Setfield = "efcf1570-3423-57d1-acb7-fd33fddbac46"
2731
StaticArraysCore = "1e83bf80-4336-4d27-bf5d-d5a4f845583c"
2832
UnPack = "3a884ed6-31ef-47d7-9d2a-63182c4928ed"
33+
blis_jll = "6136c539-28a5-5bf0-87cc-b183200dce32"
34+
libflame_jll = "8e9d65e3-b2b8-5a9c-baa2-617b4576f0b9"
2935

3036
[weakdeps]
3137
BandedMatrices = "aae01518-5342-5314-be14-df237901396f"
3238
BlockDiagonals = "0a1fb500-61f7-11e9-3c65-f5ef3456f9f0"
33-
blis_jll = "6136c539-28a5-5bf0-87cc-b183200dce32"
3439
CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"
3540
CUDSS = "45b445bb-4962-46a0-9369-b4df9d0f772e"
3641
EnzymeCore = "f151be2c-9106-41f4-ab19-57ee4f262869"
@@ -41,15 +46,15 @@ HYPRE = "b5ffcf37-a2bd-41ab-a3da-4bd9bc8ad771"
4146
IterativeSolvers = "42fd0dbc-a981-5370-80f2-aaf504508153"
4247
KernelAbstractions = "63c18a36-062a-441e-b654-da1e3ab1ce7c"
4348
KrylovKit = "0b1a1467-8014-51b9-945f-bf0ae24f4b77"
44-
LAPACK_jll = "51474c39-65e3-53ba-86ba-03b1b862ec14"
4549
Metal = "dde4c033-4e86-420c-a63e-0dd931031962"
4650
Pardiso = "46dd5b70-b6fb-5a00-ae2d-e8fea33afaf2"
47-
RecursiveFactorization = "f2c3362d-daeb-58d1-803e-2bc74f2840b4"
51+
libflame_jll = "8e9d65e3-b2b8-5a9c-baa2-617b4576f0b9"
4852
SparseArrays = "2f01184e-e22b-5df5-ae63-d93ebab69eaf"
4953
Sparspak = "e56a9233-b9d6-4f03-8d0f-1825330902ac"
5054

5155
[extensions]
5256
LinearSolveBLISExt = ["blis_jll", "LAPACK_jll"]
57+
LinearSolveBLISFlameExt = ["blis_jll", "libflame_jll", "LAPACK_jll"]
5358
LinearSolveBandedMatricesExt = "BandedMatrices"
5459
LinearSolveBlockDiagonalsExt = "BlockDiagonals"
5560
LinearSolveCUDAExt = "CUDA"
@@ -73,8 +78,8 @@ AllocCheck = "0.2"
7378
Aqua = "0.8"
7479
ArrayInterface = "7.7"
7580
BandedMatrices = "1.5"
81+
BenchmarkTools = "1.6.0"
7682
BlockDiagonals = "0.1.42, 0.2"
77-
blis_jll = "0.9.0"
7883
CUDA = "5"
7984
CUDSS = "0.1, 0.2, 0.3, 0.4"
8085
ChainRulesCore = "1.22"
@@ -105,6 +110,7 @@ Metal = "1"
105110
MultiFloats = "1"
106111
Pardiso = "0.5.7, 1"
107112
Pkg = "1"
113+
Plots = "1.40.17"
108114
PrecompileTools = "1.2"
109115
Preferences = "1.4"
110116
Random = "1"
@@ -123,7 +129,9 @@ StaticArraysCore = "1.4.2"
123129
Test = "1"
124130
UnPack = "1"
125131
Zygote = "0.7"
132+
blis_jll = "0.9.0"
126133
julia = "1.10"
134+
libflame_jll = "5.2.0"
127135

128136
[extras]
129137
AllocCheck = "9b6a8646-10ed-4001-bbdc-1d2f46dfbb1a"

README_benchmark.md

Lines changed: 110 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,110 @@
1+
# LinearSolve.jl BLIS Benchmark
2+
3+
This directory contains a comprehensive benchmark script for testing the performance of various LU factorization algorithms in LinearSolve.jl, including the new BLIS integration.
4+
5+
## Quick Start
6+
7+
```bash
8+
julia --project benchmark_blis.jl
9+
```
10+
11+
This will:
12+
1. Automatically detect available implementations (BLIS, MKL, Apple Accelerate, etc.)
13+
2. Run benchmarks on matrix sizes from 4×4 to 256×256
14+
3. Generate a performance plot saved as `lu_factorization_benchmark.png`
15+
4. Display results in both console output and a summary table
16+
17+
**Note**: The PNG plot file cannot be included in this gist due to GitHub's binary file restrictions, but it will be generated locally when you run the benchmark.
18+
19+
## What Gets Benchmarked
20+
21+
The script automatically detects and includes algorithms based on what's available, following LinearSolve.jl's detection patterns:
22+
23+
- **LU (OpenBLAS)**: Default BLAS-based LU factorization
24+
- **RecursiveFactorization**: High-performance pure Julia implementation
25+
- **BLIS**: New BLIS-based implementation (requires `blis_jll` and `LAPACK_jll`)
26+
- **Intel MKL**: Intel's optimized library (automatically detected on x86_64/i686, excludes EPYC CPUs by default)
27+
- **Apple Accelerate**: Apple's framework (macOS only, checks for Accelerate.framework availability)
28+
- **FastLU**: FastLapackInterface.jl implementation (if available)
29+
30+
### Detection Logic
31+
32+
The benchmark uses the same detection patterns as LinearSolve.jl:
33+
34+
- **MKL**: Enabled on x86_64/i686 architectures, disabled on AMD EPYC by default
35+
- **Apple Accelerate**: Checks for macOS and verifies Accelerate.framework can be loaded with required symbols
36+
- **BLIS**: Attempts to load blis_jll and LAPACK_jll, verifies extension loading
37+
- **FastLU**: Attempts to load FastLapackInterface.jl package
38+
39+
## Requirements
40+
41+
### Essential Dependencies
42+
```julia
43+
using Pkg
44+
Pkg.add(["BenchmarkTools", "Plots", "RecursiveFactorization"])
45+
```
46+
47+
### Optional Dependencies for Full Testing
48+
```julia
49+
# For BLIS support
50+
Pkg.add(["blis_jll", "LAPACK_jll"])
51+
52+
# For FastLU support
53+
Pkg.add("FastLapackInterface")
54+
```
55+
56+
## Sample Output
57+
58+
```
59+
============================================================
60+
LinearSolve.jl LU Factorization Benchmark with BLIS
61+
============================================================
62+
63+
System Information:
64+
Julia Version: 1.11.6
65+
OS: Linux x86_64
66+
CPU Threads: 1
67+
BLAS Threads: 1
68+
BLAS Config: LBTConfig([ILP64] libopenblas64_.so)
69+
70+
Available Implementations:
71+
BLIS: true
72+
MKL: false
73+
Apple Accelerate: false
74+
75+
Results Summary (GFLOPs):
76+
------------------------------------------------------------
77+
Size LU (OpenBLAS) RecursiveFactorization BLIS
78+
4 0.05 0.09 0.03
79+
8 0.28 0.43 0.09
80+
16 0.61 1.28 0.31
81+
32 1.67 4.17 1.09
82+
64 4.0 9.52 2.5
83+
128 9.87 16.86 8.1
84+
256 17.33 28.16 9.62
85+
```
86+
87+
## Performance Notes
88+
89+
- **RecursiveFactorization** typically performs best for smaller matrices (< 500×500)
90+
- **BLIS** provides an alternative BLAS implementation with different performance characteristics
91+
- **Apple Accelerate** and **Intel MKL** may show significant advantages on supported platforms
92+
- Single-threaded benchmarks are used for consistent comparison
93+
94+
## Customization
95+
96+
You can modify the benchmark by editing `benchmark_blis.jl`:
97+
98+
- **Matrix sizes**: Change the `sizes` parameter in `benchmark_lu_factorizations()`
99+
- **Benchmark parameters**: Adjust `BenchmarkTools` settings (samples, evaluations)
100+
- **Algorithms**: Add/remove algorithms in `build_algorithm_list()`
101+
102+
## Understanding the Results
103+
104+
- **GFLOPs**: Billions of floating-point operations per second (higher is better)
105+
- **Performance scaling**: Look for algorithms that maintain high GFLOPs as matrix size increases
106+
- **Platform differences**: Results vary significantly between systems based on hardware and BLAS libraries
107+
108+
## Integration with SciMLBenchmarks
109+
110+
This benchmark follows the same structure as the [official SciMLBenchmarks LU factorization benchmark](https://docs.sciml.ai/SciMLBenchmarksOutput/stable/LinearSolve/LUFactorization/), making it easy to compare results and contribute to the broader benchmark suite.

0 commit comments

Comments
 (0)