
Commit 34b06a2

committed
Add tests and documentation for mixed precision methods
- Added mixed precision tests to the Core test group in runtests.jl
- Added documentation for all four mixed precision methods in docs
- Added section explaining when to use mixed precision methods
- Documentation includes performance characteristics and use cases

The tests now run as part of the standard test suite, and the documentation provides clear guidance on when these methods are beneficial (large well-conditioned problems with memory bandwidth bottlenecks).

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
1 parent 0e70f68 commit 34b06a2

File tree

- docs/src/solvers/solvers.md
- test/runtests.jl

2 files changed: 23 additions, 0 deletions


docs/src/solvers/solvers.md

Lines changed: 22 additions & 0 deletions
@@ -32,6 +32,24 @@ this is only recommended for Float32 matrices. Choose `CudaOffloadLUFactorizatio
 performance on well-conditioned problems, or `CudaOffloadQRFactorization` for better numerical
 stability on ill-conditioned problems.
 
+#### Mixed Precision Methods
+
+For large well-conditioned problems where memory bandwidth is the bottleneck, mixed precision
+methods can provide significant speedups (up to 2x) by performing the factorization in Float32
+while maintaining Float64 interfaces. These methods are particularly effective for:
+- Large dense matrices (> 1000x1000)
+- Well-conditioned problems (condition number < 10^4)
+- Hardware with good Float32 performance
+
+Available mixed precision solvers:
+- `MKL32MixedLUFactorization` - Intel CPUs with MKL
+- `AppleAccelerate32MixedLUFactorization` - Apple CPUs with Accelerate
+- `CUDAOffload32MixedLUFactorization` - NVIDIA GPUs with CUDA
+- `MetalOffload32MixedLUFactorization` - Apple GPUs with Metal
+
+These methods automatically handle the precision conversion, making them easy drop-in replacements
+when reduced precision is acceptable for the factorization step.
+
 !!! note
 
     Performance details for dense LU-factorizations can be highly dependent on the hardware configuration.
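
The added documentation calls these solvers drop-in replacements, so a usage sketch may help. This follows the standard LinearSolve.jl `LinearProblem`/`solve` API; the matrix construction, problem size, and residual check are illustrative assumptions, and `MKL32MixedLUFactorization` would need an MKL-capable CPU:

```julia
using LinearSolve, LinearAlgebra

# Illustrative large, well-conditioned dense Float64 system;
# the diagonal shift keeps the condition number small.
n = 2000
A = rand(n, n) + n * I
b = rand(n)

prob = LinearProblem(A, b)

# Drop-in replacement: inputs and outputs stay Float64, while the
# LU factorization itself runs internally in Float32.
sol = solve(prob, MKL32MixedLUFactorization())

# Expect roughly single-precision accuracy in the relative residual.
@show norm(A * sol.u - b) / norm(b)
```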
@@ -205,6 +223,7 @@ KrylovJL
 
 ```@docs
 MKLLUFactorization
+MKL32MixedLUFactorization
 ```
 
 ### AppleAccelerate.jl
@@ -215,6 +234,7 @@ MKLLUFactorization
 
 ```@docs
 AppleAccelerateLUFactorization
+AppleAccelerate32MixedLUFactorization
 ```
 
 ### Metal.jl
@@ -225,6 +245,7 @@ AppleAccelerateLUFactorization
 
 ```@docs
 MetalLUFactorization
+MetalOffload32MixedLUFactorization
 ```
 
 ### Pardiso.jl
@@ -251,6 +272,7 @@ The following are non-standard GPU factorization routines.
 ```@docs
 CudaOffloadLUFactorization
 CudaOffloadQRFactorization
+CUDAOffload32MixedLUFactorization
 ```
 
 ### AMDGPU.jl
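
For the GPU offload variant, a hedged sketch: like the other CUDA offload routines listed here, the solver is assumed to be provided through LinearSolve's CUDA extension, so the CUDA package must be loaded first; sizes are again illustrative:

```julia
using LinearSolve, LinearAlgebra
using CUDA  # assumed to activate LinearSolve's CUDA extension

n = 4000
A = rand(n, n) + n * I  # illustrative well-conditioned system
b = rand(n)

prob = LinearProblem(A, b)

# The factorization is offloaded to the GPU and performed in Float32;
# the solution comes back in Float64 on the host.
sol = solve(prob, CUDAOffload32MixedLUFactorization())
```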

test/runtests.jl

Lines changed: 1 addition & 0 deletions
@@ -19,6 +19,7 @@ if GROUP == "All" || GROUP == "Core"
     @time @safetestset "Traits" include("traits.jl")
     @time @safetestset "Verbosity" include("verbosity.jl")
     @time @safetestset "BandedMatrices" include("banded.jl")
+    @time @safetestset "Mixed Precision" include("test_mixed_precision.jl")
 end
 
 # Don't run Enzyme tests on prerelease
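
The contents of `test_mixed_precision.jl` are not part of this commit view. A hypothetical sketch of what such a test could check; the solver choice, tolerance, and matrix construction are all assumptions, and the MKL method would only run on supported CPUs:

```julia
# Hypothetical sketch only: the actual test/test_mixed_precision.jl is not
# shown in this diff.
using Test, LinearSolve, LinearAlgebra

@testset "Mixed precision LU" begin
    n = 200
    A = rand(n, n) + n * I  # well-conditioned by construction
    b = rand(n)
    prob = LinearProblem(A, b)

    sol64 = solve(prob, LUFactorization())
    sol32 = solve(prob, MKL32MixedLUFactorization())

    # The mixed precision solve should agree with the Float64 solve to
    # roughly single-precision accuracy, not full Float64 accuracy.
    @test sol32.u ≈ sol64.u rtol = 1e-4
end
```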
