
Commit 3bff586

Add a page on choosing the right solver and architectural details
This also documents the performance changes users will see with the new breaking release.
1 parent 88dc1f6 commit 3bff586

File tree

4 files changed: +88 -2 lines changed


docs/pages.jl

Lines changed: 3 additions & 2 deletions
```diff
@@ -1,8 +1,9 @@
 # Put in a separate page so it can be used by SciMLDocs.jl

 pages = ["index.md",
-    "Tutorials" => Any["tutorials/linear.md"
-        "tutorials/caching_interface.md"],
+    "Tutorials" => Any["tutorials/linear.md",
+        "tutorials/caching_interface.md",
+        "tutorials/accelerating_choices.md"],
     "Basics" => Any["basics/LinearProblem.md",
         "basics/common_solver_opts.md",
         "basics/OperatorAssumptions.md",
```
Two benchmark plot images added as binary assets (821 KB and 617 KB).
docs/src/tutorials/accelerating_choices.md

Lines changed: 85 additions & 0 deletions

@@ -0,0 +1,85 @@
# Accelerating your Linear Solves

!!! note
    This section is essential if you wish to achieve maximum performance with
    LinearSolve.jl, especially on v7 and above. Please ensure the tips in this
    section are adhered to when optimizing code and benchmarking.

Great: you've learned how to use LinearSolve.jl and you're using it daily,
either directly or through other SciML libraries, and now you want to improve
your performance. How can this be done? While it might seem at first like a
hopeless endeavor ("A\b uses a BLAS library, so it's already highly optimized
code"), it turns out there are many factors to consider when squeezing out
the last 10x of performance. And yes, it can be about a factor of 10 in some
scenarios, so let's dive in.

## Understanding Performance of Dense Linear Solves

The performance of dense linear solvers is highly dependent on the size of the matrix
and the architecture it is run on, i.e. the CPU.
[This issue](https://github.com/SciML/LinearSolve.jl/issues/357) gathered benchmark data
from many different users and is summarized in the following graphs:

![Dense Linear Solve Benchmarks](../assets/dense_linear_solves.png)

Now one thing that is immediately apparent is, for example, that AppleAccelerate generally does well
on Apple M-series chips, MKL generally does well on Intel, etc. And we account for this in
LinearSolve.jl: in fact, we already automatically default to different BLASes based on the CPU
architecture as part of the design! So that covers most of the variation, but
there are a few major tips to note when fine-tuning the results for your system:

1. One of the best methods for matrices of size 150x150 and below is RecursiveFactorization.jl.
   This is a pure-Julia BLAS system, but it has a high load-time overhead, and thus as of
   v7 it's no longer loaded by default! If your matrices are in this range and you
   value better run times at the cost of compile and load times, it is recommended you add
   `using RecursiveFactorization`. The default algorithm will then consider it in its list
   and will automatically (in an architecture-specific way) insert it where it deems necessary.
2. One of the major factors that can inhibit BLAS performance on LU factorizations is multithreading.
   In many of these plots you can see a giant dip in GFLOPs (higher is better) when a certain size
   threshold is hit. This is because, for the chosen number of threads, there was not enough work,
   so when the threading threshold is hit, performance suffers from the added
   overhead. Threading performance can be system-specific, and it is greatly influenced
   by the number of cores on your system and the number of threads you allow. For example,
   OpenBLAS' LU factorization seems to be generally bad at guessing the thread switch point
   for CPUs with very high core/thread counts. If this is the case, you may want to investigate
   decreasing your number of BLAS threads, i.e. via `BLAS.set_num_threads(i)`. Note that
   RecursiveFactorization.jl uses your Julia thread pool instead of the BLAS threads.
3. The switch points between algorithms can be fairly inexact. LinearSolve.jl tries to keep tabs
   on where they are per platform and stay updated, but it can be a moving battle. You may be
   able to eke out some performance by testing the various options on your platform, i.e.
   RFLUFactorization vs LUFactorization vs AppleAccelerateLUFactorization (M-series) vs
   MKLLUFactorization (x86), and hardcoding the choice for your problem if the default did not
   make the right guess. A sketch of this workflow is shown after this list.
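
Below is a minimal sketch of how one might apply these tips on a small dense problem. The
matrix size, the thread count, and the final hardcoded solver are illustrative assumptions;
benchmark on your own hardware to pick the winner.

```julia
using LinearSolve, LinearAlgebra, BenchmarkTools
using RecursiveFactorization  # tip 1: lets the default algorithm (and you) use RFLUFactorization

BLAS.set_num_threads(4)  # tip 2: try lowering this if you see the multithreading dip

A = rand(100, 100)
b = rand(100)
prob = LinearProblem(A, b)

# Tip 3: time the candidates on your machine...
@btime solve($prob)                      # the automatic default choice
@btime solve($prob, RFLUFactorization()) # pure-Julia recursive LU
@btime solve($prob, LUFactorization())   # the active BLAS vendor's LU

# ...and hardcode whichever wins, e.g.
sol = solve(prob, RFLUFactorization())
```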

!!! warning
    As noted, RecursiveFactorization.jl is one of the fastest linear solvers for smaller dense
    matrices, but it requires `using RecursiveFactorization` in order to be used in the default
    solver setup! Thus it's recommended that any optimized code or benchmark sets this up.

## Understanding Performance of Sparse Linear Solves

Sparse linear solvers are not as dependent on the CPU, but are highly dependent on the problem
being solved. For example, the following benchmarks are for a 1D Laplacian vs a 3D Laplacian,
changing N to generate smaller and bigger versions of each problem:

![Sparse Linear Solve Benchmarks](../assets/sparse_linear_solves.png)

Notice that the optimal linear solver changes based on the problem (i.e. the sparsity pattern)
and its size. LinearSolve.jl just uses a very simple heuristic, "if small then use KLU, and if
large then use UMFPACK", which this plot validates, but it leaves a lot to be desired. In
particular, the following rules of thumb should be kept in mind:

1. Pardiso is a great solver: try `using Pardiso` and solving with `MKLPardisoFactorize()` in
   many scenarios.
2. The more structured a sparsity pattern is, the worse KLU does in comparison to the other
   algorithms.
3. A Krylov subspace method with proper preconditioning will beat the direct solvers once the
   matrices get large enough. Preconditioning a sparse matrix with an incomplete LU
   factorization (ILU) is an easy first choice, though the drop tolerance needs to be tuned in
   a problem-specific way. A sketch of this setup is shown after this list.
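
As an illustration of the third point, here is a hedged sketch of GMRES with an ILU
preconditioner via IncompleteLU.jl. The 1D Laplacian test matrix and the drop tolerance
`τ = 0.1` are illustrative assumptions; `τ` must be tuned per problem.

```julia
using LinearSolve, SparseArrays
using IncompleteLU  # provides ilu for sparse matrices

n = 1000
# a 1D Laplacian as a simple sparse test problem
A = spdiagm(-1 => ones(n - 1), 0 => -2.0 * ones(n), 1 => ones(n - 1))
b = rand(n)
prob = LinearProblem(A, b)

precond = ilu(A, τ = 0.1)  # drop tolerance is problem-specific; tune it
sol = solve(prob, KrylovJL_GMRES(), Pl = precond)
```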

!!! note
    UMFPACK does better when the BLAS is not OpenBLAS. Try `using MKL` on Intel and AMD Ryzen
    platforms and UMFPACK will be faster! LinearSolve.jl cannot default to this, since
    `using MKL` changes the BLAS globally for the whole session; LinearSolve.jl therefore only
    defaults to MKL locally for its own dense solves, which cannot change the BLAS used within
    UMFPACK.
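
For example, a minimal sketch of this swap (the random sparse matrix is an illustrative
stand-in for your problem):

```julia
using MKL  # swap the global BLAS from OpenBLAS to MKL; do this first
using LinearSolve, LinearAlgebra, SparseArrays

A = sprand(2000, 2000, 0.01) + I  # illustrative sparse matrix
b = rand(2000)
sol = solve(LinearProblem(A, b), UMFPACKFactorization())  # UMFPACK's BLAS calls now hit MKL
```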
