Our `AmulB!` for `MMatrix`es was the fastest at all sizes except `2`x`2`, where it lost out to `AmulB` for `SMatrix`, which in turn was faster than the hundreds of lines of `StaticArrays` code at all sizes except `3`x`3`, `5`x`5`, and `6`x`6`.
docs/src/examples/dot_product.md
However, there is another bottleneck: we can only perform 2 aligned loads per clock cycle.
Thus, in 4 clock cycles, we can do up to 8 loads. But each `fma` requires 2 loads, meaning we are limited to 4 of them per 4 clock cycles, and any unrolling beyond 4 gives us no benefit.
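For reference, a dot-product kernel of the kind being discussed might be written with LoopVectorization as follows (a minimal sketch; the benchmarked code may differ):

```julia
using LoopVectorization

# Minimal dot-product sketch; each iteration needs two loads (a[i] and b[i])
# feeding one fma, which is the load bottleneck described above.
function mydot(a::AbstractVector, b::AbstractVector)
    s = zero(eltype(a))
    @turbo for i in eachindex(a, b)   # `@avx` in older LoopVectorization versions
        s += a[i] * b[i]
    end
    s
end
```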
Double precision benchmarks pitting Julia's builtin dot product against code compiled with a variety of compilers:
What we just described is the core of the approach used by all these compilers. The variation in results is explained mostly by how they handle vectors with lengths that are not an integer multiple of `W`. I ran these on a computer with AVX512 so that `W = 8`. LLVM, the backend compiler of both Julia and Clang, shows rapid performance degradation as `N % 4W` increases, where `N` is the length of the vectors.
This is because, to handle the remainder, it uses a scalar loop that runs as written: multiply and add single elements, one after the other.
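Purely as an illustration of that structure (hand-written Julia, not actual compiler output), the vectorized body plus scalar tail looks something like:

```julia
# Illustrative only: a SIMD body covering a multiple of 4W elements, followed by
# a scalar loop that handles the N % 4W remainder one element at a time.
function dot_with_scalar_tail(a::Vector{Float64}, b::Vector{Float64}; W::Int = 8)
    N = length(a)
    Nbody = N - N % (4W)          # elements covered by the unrolled SIMD body
    s = 0.0
    @simd for i in 1:Nbody        # compiler vectorizes and unrolls this loop
        @inbounds s += a[i] * b[i]
    end
    for i in Nbody+1:N            # scalar remainder loop
        @inbounds s += a[i] * b[i]
    end
    s
end
```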
Because we only need a single load per `fma`-instruction, we can now benefit from having 8 separate accumulators.
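A kernel needing only one load per `fma` is, for instance, a vector's dot product with itself; assuming that is the kernel in question, a minimal sketch is:

```julia
using LoopVectorization

# Self dot product: one load (a[i]) per fma, so throughput is limited by the
# fma units rather than by loads, and deeper unrolling with more accumulators pays off.
function mydotself(a::AbstractVector)
    s = zero(eltype(a))
    @turbo for i in eachindex(a)
        s += a[i] * a[i]
    end
    s
end
```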
For this reason, LoopVectorization now unrolls by 8 -- it decides how much to unroll by comparing the bottlenecks on throughput with latency. The other compilers do not change their behavior, so now LoopVectorization has the advantage:
This algorithm may need refinement, because Julia (without LoopVectorization) only unrolls by 4, yet achieves roughly the same performance as LoopVectorization at multiples of `4W = 32`, although its performance declines rapidly from there due to the slow scalar loop. Performance for most compilers is much higher -- more GFLOPS -- than for the normal dot product, but still under half of the CPU's potential 131.2 GFLOPS, suggesting that other bottlenecks are preventing the core from attaining 2 fmas per clock cycle.
Note also that `8W = 64`, so we don't really have enough iterations of the loop to amortize the overhead of performing the reductions of all these vectors into a single scalar.
By the time the vectors are long enough to do this, we'll start running into memory bandwidth bottlenecks.
docs/src/examples/filtering.md
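The `filter2davx!` kernel being benchmarked might look roughly like this sketch (it assumes `kern` carries offset indices, e.g. `-1:1` in each dimension, so that `A[I + J]` stays in bounds for every `J` in `CartesianIndices(out)`; the actual docs code may differ):

```julia
using LoopVectorization, OffsetArrays

# 2D filter: for each output index J, accumulate the kernel-weighted neighborhood of A.
function filter2davx!(out::AbstractMatrix, A::AbstractMatrix, kern)
    @turbo for J in CartesianIndices(out)
        tmp = zero(eltype(out))
        for I in CartesianIndices(kern)
            tmp += A[I + J] * kern[I]
        end
        out[J] = tmp
    end
    out
end
```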
These are effectively four nested loops. For all the benchmarks, `kern` was 3 by 3, making it too small for vectorizing these loops to be particularly profitable. By vectorizing an outer loop instead, it can benefit from SIMD and also avoid having to do a reduction (horizontal addition) of a vector before storing in `out`, as the vectors can then be stored directly.
LoopVectorization achieved much better performance than all the alternatives, which tried vectorizing the inner loops.
By making the compilers aware that the inner loops are too short to be worth vectorizing, we can get them to vectorize an outer loop instead. By defining the size of `kern` as constant in `C` and `Fortran`, and using size parameters in Julia, we can inform the compilers:
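In Julia, one way to convey that information is to make the kernel's dimensions part of its type, for example with a `StaticArrays.SMatrix` (an illustrative assumption; the benchmark code may use a different static-size mechanism):

```julia
using StaticArrays

# Plain-Julia kernel (no LoopVectorization); with `kern::SMatrix{3,3}`, the 3×3
# trip counts of the inner loops are known at compile time. Assumes `A` has one
# extra row and column of padding on each side of `out` ("valid" filtering).
function filter2d_static!(out::AbstractMatrix, A::AbstractMatrix, kern::SMatrix{3,3})
    @inbounds for j in axes(out, 2), i in axes(out, 1)
        tmp = zero(eltype(out))
        for l in 1:3, k in 1:3
            tmp += A[i + k - 1, j + l - 1] * kern[k, l]
        end
        out[i, j] = tmp
    end
    out
end
```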
Now all are doing much better than they were before, although still well shy of the 131.2 GFLOPS theoretical limit for the host CPU cores. While they all improved, two are lagging behind the main group:
- `ifort` lags behind all the others except base Julia. I'll need to investigate further to find out why.
- Base Julia. While providing static size information was enough for it to realize vectorizing the inner loops was not worth it, base Julia was seemingly the only one that didn't decide to vectorize an outer loop instead.
Manually unrolling the inner loops allows base Julia to vectorize, while the performance of all the non-Julia variants is unchanged:
LoopVectorization is currently limited to only unrolling two loops (but a third may be vectorized, effectively unrolling it by the length of the vectors). Manually unrolling two of the loops lets up to four loops be unrolled.
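A manually unrolled version of the two 3-iteration kernel loops might look like this sketch (again assuming `A` has one extra row and column of padding on each side of `out` and a 3x3 `kern`; the benchmarked code may differ):

```julia
# Manual unroll of the two inner loops: the nine kernel taps are written out explicitly,
# leaving only the two output loops for the compiler to vectorize and unroll.
function filter2d_unrolled!(out::AbstractMatrix, A::AbstractMatrix, kern)
    @inbounds for j in axes(out, 2), i in axes(out, 1)
        tmp  = A[i, j  ] * kern[1, 1] + A[i+1, j  ] * kern[2, 1] + A[i+2, j  ] * kern[3, 1]
        tmp += A[i, j+1] * kern[1, 2] + A[i+1, j+1] * kern[2, 2] + A[i+2, j+1] * kern[3, 2]
        tmp += A[i, j+2] * kern[1, 3] + A[i+1, j+2] * kern[2, 3] + A[i+2, j+2] * kern[3, 3]
        out[i, j] = tmp
    end
    out
end
```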
docs/src/examples/matrix_multiplication.md
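The multiplication kernel in question might look roughly like this sketch (the loop body is my reconstruction, not necessarily the exact docs code):

```julia
using LoopVectorization

# One generic triple loop: when `A` or `B` is a `Transpose`/`Adjoint` wrapper,
# the memory layout is encoded in the type, so the same function covers every
# transposed/not-transposed combination.
function AmulB!(C, A, B)
    @turbo for n in axes(C, 2), m in axes(C, 1)
        Cmn = zero(eltype(C))
        for k in axes(B, 1)
            Cmn += A[m, k] * B[k, n]
        end
        C[m, n] = Cmn
    end
    C
end
```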
This kernel can handle all transposed/not-transposed permutations. LoopVectorization will change loop orders and strategy as appropriate based on the types of the input matrices. For each of the others, I wrote separate functions to handle each case.
Letting all three matrices be square and `Size` x `Size`, we attain the following benchmark results:
This is classic GEMM, `𝐂 = 𝐀 * 𝐁`. GFortran's intrinsic `matmul` function does fairly well. But all the compilers are well behind LoopVectorization here, which falls behind MKL's `gemm` beyond 70x70 or so. The problem imposed by alignment is also striking: performance is much higher when the sizes are integer multiples of 8. Padding arrays so that each column is aligned regardless of the number of rows can thus be very profitable. [PaddedMatrices.jl](https://github.com/JuliaSIMD/PaddedMatrices.jl) offers just such arrays in Julia. I believe that is also what the [-pad](https://software.intel.com/en-us/fortran-compiler-developer-guide-and-reference-pad-qpad) compiler flag does when using Intel's compilers.
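As a rough illustration of the padding idea (not the PaddedMatrices.jl implementation), one could round the row count up to a multiple of the vector width so consecutive columns are separated by a whole number of SIMD vectors:

```julia
# Round the leading dimension up to a multiple of W, then expose only the
# logical M×N matrix; the hidden column stride keeps columns "vector aligned".
function padded_zeros(M, N; W = 8)
    Mpad = cld(M, W) * W                 # round M up to a multiple of W
    buffer = zeros(Float64, Mpad, N)     # column stride Mpad is a multiple of W
    view(buffer, 1:M, 1:N)               # logical M×N view over the padded buffer
end
```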
The optimal pattern for `𝐂 = 𝐀 * 𝐁ᵀ` is almost identical to that for `𝐂 = 𝐀 * 𝐁`. Yet, gfortran's `matmul` intrinsic stumbles, surprisingly doing much worse than gfortran + loops, and almost certainly worse than allocating memory for `𝐁ᵀ` and creating the explicit copy.
ifort did equally well whether or not `𝐁` was transposed, while LoopVectorization's performance degraded slightly faster as a function of size in the transposed case, because strides between memory accesses are larger when `𝐁` is transposed. But it still performed best of all the compiled loops over this size range, losing out to MKL and eventually OpenBLAS.
icc, interestingly, does better when `𝐁` is transposed.
GEMM is easiest when the matrix `𝐀` is not transposed (assuming column-major memory layouts), because then you can sum up columns of `𝐀` to store into `𝐂`. If `𝐀` were transposed, we could not efficiently load contiguous elements from `𝐀` that can be stored directly into `𝐂`. So for `𝐂 = 𝐀ᵀ * 𝐁`, contiguous vectors along the `k`-loop have to be reduced, adding some overhead.
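A sketch of that `𝐂 = 𝐀ᵀ * 𝐁` pattern (here `A` holds the untransposed matrix, so `A[k, m]` is `𝐀ᵀ[m, k]`; not necessarily the benchmarked code):

```julia
using LoopVectorization

# Both loads are contiguous along k, so vectorizing the k loop means each
# accumulator must be horizontally reduced to a scalar before storing C[m, n]
# -- the extra overhead mentioned above.
function AtmulB!(C, A, B)
    @turbo for n in axes(C, 2), m in axes(C, 1)
        Cmn = zero(eltype(C))
        for k in axes(A, 1)
            Cmn += A[k, m] * B[k, n]
        end
        C[m, n] = Cmn
    end
    C
end
```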
Packing is critical for performance here. LoopVectorization does not pack, therefore it is well behind MKL and OpenBLAS, which do. Eigen packs, but is poorly optimized for this CPU architecture.
When both `𝐀` and `𝐁` are transposed, we have `𝐂 = 𝐀ᵀ * 𝐁ᵀ = (𝐁 * 𝐀)ᵀ`.
Julia, Clang, and gfortran all struggled to vectorize this, because no two of the matrices share a contiguous axis: it is `M` for `𝐂`, `K` for `𝐀ᵀ`, and `N` for `𝐁ᵀ`. However, LoopVectorization and all the specialized matrix multiplication functions managed to do about as well as normal; transposing while storing the results takes a negligible amount of time relative to the matrix multiplication itself.
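Written directly in code, the identity is simply (an allocating illustration, not the benchmarked kernel):

```julia
using LinearAlgebra

# 𝐀ᵀ𝐁ᵀ = (𝐁𝐀)ᵀ: multiply in the easy orientation, then return a lazy transpose.
AtmulBt(A, B) = transpose(B * A)
```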