# Accelerating your Linear Solves

!!! note
    This section is essential if you wish to achieve maximum performance with
    LinearSolve.jl, especially on v7 and above. Please ensure the tips of this
    section are adhered to when optimizing code and benchmarking.

Great, you've learned how to use LinearSolve.jl and you're using it daily,
either directly or through other SciML libraries, and now you want to improve
your performance. How can this be done? While it might seem at first like a
hopeless endeavor ("`A\b` uses a BLAS library and so it's already highly optimized
C code"), it turns out there are many factors you need to consider to squeeze out
the last 10x of performance. And yes, it can be about a factor of 10 in some
scenarios, so let's dive in.

## Understanding Performance of Dense Linear Solves

The performance of dense linear solvers is highly dependent on the size of the matrix
and the chosen architecture to run on, i.e. the CPU.
[This issue](https://github.com/SciML/LinearSolve.jl/issues/357) gathered benchmark data
from many different users and is summarized in the following graphs:

One thing that is immediately apparent is that AppleAccelerate generally does well
on Apple M-series chips, MKL generally does well on Intel, and so on. LinearSolve.jl
knows this, and in fact it already automatically defaults to different BLASes based on the
CPU architecture as part of its design! That covers most of the variation, but
there are a few major tips to note when fine-tuning the results for your system:

1. One of the best methods for matrices of size 150x150 and below is RecursiveFactorization.jl.
   This is a pure Julia BLAS system, but it has a high load time overhead, and thus as of
   v7 it's no longer loaded by default! If your matrices are in this range and you would
   value better run times at the cost of compile and load times, it is recommended that you add
   `using RecursiveFactorization`. The default algorithm will then consider it in its list
   and will automatically (in an architecture-specific way) insert it where it deems necessary.
2. One of the major factors that can inhibit BLAS performance on LU factorization is multithreading.
   In many of these plots you can see a giant dip in GFLOPs (higher is better) when a certain size
   threshold is hit. This is because, for the chosen number of threads, there was not enough work,
   so once the threading threshold is hit performance suffers from the added overhead. Threading
   performance is system-specific and is greatly influenced by the number of cores on your system
   and the number of threads you allow. For example, OpenBLAS' LU factorization seems to generally
   be bad at guessing the thread switch point for CPUs with very high core/thread counts. If this
   is the case, you may want to investigate decreasing your number of BLAS threads, i.e. via
   `BLAS.set_num_threads(i)`. Note that RecursiveFactorization.jl uses your Julia thread pool
   instead of the BLAS threads.
3. The switch points between algorithms can be fairly inexact. LinearSolve.jl tries to keep tabs
   on where they are per platform and to keep them updated, but it can be a moving target. You may
   be able to eke out some performance by testing the various options on your platform, i.e.
   `RFLUFactorization` vs `LUFactorization` vs `AppleAccelerateLUFactorization` (M-series) vs
   `MKLLUFactorization` (x86), and hardcoding the choice for your problem if the default did not
   make the right guess (see the sketch at the end of this section).

!!! warning
    As noted, RecursiveFactorization.jl is one of the fastest linear solvers for smaller dense
    matrices but requires `using RecursiveFactorization` in order to be used in the default
    solver setups! Thus it's recommended that any optimized code or benchmark sets this up.
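
To tie these tips together, here is a minimal sketch of one possible setup. The matrix size,
the thread count of 4, and the diagonally dominant random test matrix are illustrative
assumptions; benchmark on your own problem and hardware to pick the actual values and algorithm.

```julia
using LinearSolve, LinearAlgebra
using RecursiveFactorization  # lets the default algorithm consider RFLUFactorization

# Tip 2: if the BLAS picks its thread switch point badly on your machine,
# try lowering the BLAS thread count (the value 4 is just an illustrative guess).
BLAS.set_num_threads(4)

# A small dense test problem (placeholder: diagonally dominant so it is well-conditioned).
n = 100
A = rand(n, n) + n * I
b = rand(n)
prob = LinearProblem(A, b)

# Let the default algorithm choose (it can now pick RecursiveFactorization.jl) ...
sol = solve(prob)

# ... or benchmark the candidates yourself and hardcode the winner (Tip 3):
sol_rf = solve(prob, RFLUFactorization())
sol_lu = solve(prob, LUFactorization())
# On Apple M-series: solve(prob, AppleAccelerateLUFactorization())
# On x86 with MKL:   solve(prob, MKLLUFactorization())
```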

## Understanding Performance of Sparse Linear Solves

Sparse linear solvers are not as dependent on the CPU but are highly dependent on the problem
being solved. For example, the following results are for a 1D Laplacian vs a 3D Laplacian,
varying N to produce smaller and larger versions:

Notice that the optimal linear solver changes based on the problem (i.e. the sparsity pattern)
and its size. LinearSolve.jl currently uses a very simple heuristic, "if small then use KLU,
if large use UMFPACK", which is validated by these plots but leaves a lot to be desired. In
particular, the following rules should be kept in mind:

1. Pardiso is a great solver; in many scenarios you should try `using Pardiso` and passing
   `MKLPardisoFactorize()`.
2. The more structured a sparsity pattern is, the worse KLU does in comparison to the other
   algorithms.
3. A Krylov subspace method with proper preconditioning will beat the direct solvers once
   the matrices get large enough. An incomplete LU (ILU) preconditioner is an easy first choice
   for a sparse matrix, though its drop tolerance needs to be tuned in a problem-specific way
   (see the sketch below).
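
To make these rules concrete, here is a minimal sketch comparing the direct solvers with an
ILU-preconditioned Krylov method. The 1D Laplacian test matrix, the choice of IncompleteLU.jl,
and the drop tolerance `τ = 0.1` are illustrative assumptions; the right preconditioner and
tolerance are problem-specific.

```julia
using LinearSolve, SparseArrays
using IncompleteLU  # one possible ILU preconditioner package (an assumed choice)

# A 1D Laplacian as a stand-in sparse problem (placeholder size).
N = 10_000
A = spdiagm(-1 => fill(-1.0, N - 1), 0 => fill(2.0, N), 1 => fill(-1.0, N - 1))
b = rand(N)
prob = LinearProblem(A, b)

# Direct solvers: KLU tends to win on smaller/less structured problems, UMFPACK on larger ones.
sol_klu     = solve(prob, KLUFactorization())
sol_umfpack = solve(prob, UMFPACKFactorization())
# With Pardiso.jl loaded: solve(prob, MKLPardisoFactorize())

# Krylov method with an ILU preconditioner; tune the drop tolerance τ per problem.
precond    = ilu(A, τ = 0.1)
sol_krylov = solve(prob, KrylovJL_GMRES(), Pl = precond)
```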

!!! note
    UMFPACK does better when the BLAS is not OpenBLAS. Try `using MKL` on Intel and AMD Ryzen
    platforms and UMFPACK will be faster! LinearSolve.jl cannot default to this because swapping
    the BLAS changes global settings; it therefore only uses MKL locally for its own dense
    solvers and cannot change the BLAS that UMFPACK uses internally.
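
As a minimal sketch of this tip (with a random sparse matrix as a placeholder problem):

```julia
using MKL  # swap the global BLAS to MKL before doing the numerical work
using LinearSolve, SparseArrays, LinearAlgebra

# Placeholder sparse system; UMFPACK's internal BLAS calls now go through MKL.
N = 2_000
A = sprand(N, N, 5 / N) + 10.0I
b = rand(N)
sol = solve(LinearProblem(A, b), UMFPACKFactorization())
```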