## Benchmarks
Please see the documentation for benchmarks versus base Julia, Clang, icc, ifort, gfortran, and Eigen. If you believe any code or compiler flags can be improved, would like to submit your own benchmarks, or have Julia code using LoopVectorization that you would like tested for performance regressions on a semi-regular basis, please feel free to file an issue or PR with the code sample.
## Examples
### Dot Product
LLVM/Julia by default generate essentially optimal code for the primary vectorized part of this loop. In many cases -- such as the dot product -- this vectorized part of the loop computes 4 * SIMD-vector-width iterations at a time.

On the CPU I'm running these benchmarks on, the SIMD vector width for `Float64` data is 8 (eight doubles fill a 512-bit register), meaning the main loop will compute 32 iterations at a time.

However, LLVM is very slow at handling the tails, `length(iterations) % 32`. For this reason, [in benchmark plots](https://chriselrod.github.io/LoopVectorization.jl/latest/examples/dot_product/) you can see performance drop as the size of the remainder increases.
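To make the tail sizes concrete, here is the arithmetic for the case described above (a worked illustration; the variable names are mine, and the 8-wide/4x figures are the ones quoted above):

```julia
julia> simd_width, unroll = 8, 4; # per the CPU and unrolling described above

julia> iters_per_pass = simd_width * unroll # iterations retired per pass of the main loop
32

julia> 256 % iters_per_pass, 255 % iters_per_pass # tail lengths for the benchmark sizes used below
(0, 31)
```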
For simple loops like a dot product, LoopVectorization.jl's most important optimization is to handle these tails more efficiently:
<details>
<summary>Click me!</summary>
<p>
```julia
julia> using LoopVectorization, BenchmarkTools

julia> function mydot(a, b)
           s = 0.0
           @inbounds @simd for i ∈ eachindex(a,b)
               s += a[i]*b[i]
           end
           s
       end
mydot (generic function with 1 method)

julia> function mydotavx(a, b)
           s = 0.0
           @avx for i ∈ eachindex(a,b)
               s += a[i]*b[i]
           end
           s
       end
mydotavx (generic function with 1 method)
julia> a = rand(256); b = rand(256);

julia> @btime mydot($a, $b)
  12.220 ns (0 allocations: 0 bytes)
62.67140864639772

julia> @btime mydotavx($a, $b) # performance is similar
  12.104 ns (0 allocations: 0 bytes)
62.67140864639772

julia> a = rand(255); b = rand(255);

julia> @btime mydot($a, $b) # with loops shorter by 1, the remainder is now 31, and it is slow
  36.539 ns (0 allocations: 0 bytes)
62.29537331565549

julia> @btime mydotavx($a, $b)
  11.739 ns (0 allocations: 0 bytes)
62.29537331565549
```
On most recent CPUs, the performance of the dot product is bounded by the speed at which it can load data; most recent x86_64 CPUs can perform two aligned loads and two fused multiply-adds (`fma`) per clock cycle. However, the dot product requires two loads per `fma`.

A self-dot function, on the other hand, requires only one load per `fma`:
```julia
julia> function myselfdot(a)
           s = 0.0
           @inbounds @simd for i ∈ eachindex(a)
               s += a[i]*a[i]
           end
           s
       end
myselfdot (generic function with 1 method)

julia> function myselfdotavx(a)
           s = 0.0
           @avx for i ∈ eachindex(a)
               s += a[i]*a[i]
           end
           s
       end
myselfdotavx (generic function with 1 method)

julia> a = rand(256);

julia> @btime myselfdot($a)
  8.578 ns (0 allocations: 0 bytes)
90.16636687132868

julia> @btime myselfdotavx($a)
  9.560 ns (0 allocations: 0 bytes)
90.16636687132868

julia> @btime myselfdot($b)
  28.923 ns (0 allocations: 0 bytes)
83.20114563267853

julia> @btime myselfdotavx($b)
  9.174 ns (0 allocations: 0 bytes)
83.20114563267856
```
For this reason, the `@avx` version is roughly twice as fast. The `@inbounds @simd` version, however, is not, because it runs into the problem of loop-carried dependencies: to add `a[i]*b[i]` to `s_new = s_old + a[i-j]*b[i-j]`, we must first have finished calculating `s_new`, but -- while two `fma` instructions can be initiated per cycle -- they each take several clock cycles to complete.
For this reason, we need to unroll the operation to run several independent instances concurrently. The `@avx` macro models this cost to try to pick an optimal unroll factor.
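To illustrate the idea, here is a hand-written sketch of four-way unrolling with independent accumulators (illustrative only -- the function name and unroll factor are mine, and `@avx` generates something more sophisticated):

```julia
# Four independent accumulators let the fma latency chains overlap,
# instead of each update waiting on the previous partial sum.
function mydot_unroll4(a, b)
    s1 = s2 = s3 = s4 = 0.0
    i = firstindex(a)
    @inbounds while i + 3 <= lastindex(a)
        s1 = muladd(a[i],     b[i],     s1)
        s2 = muladd(a[i + 1], b[i + 1], s2)
        s3 = muladd(a[i + 2], b[i + 2], s3)
        s4 = muladd(a[i + 3], b[i + 3], s4)
        i += 4
    end
    s = (s1 + s2) + (s3 + s4)
    @inbounds while i <= lastindex(a) # scalar tail
        s = muladd(a[i], b[i], s)
        i += 1
    end
    s
end
```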
</p>
</details>

### Matrix Multiplication
Applied to a naive set of nested loops for matrix multiplication, `@avx` can produce a good macro kernel. An implementation of matrix multiplication able to handle large matrices would need to perform blocking and packing of arrays to prevent the operations from being memory bottlenecked.
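As a hedged sketch of such a kernel (in the style of the package's examples; treat the exact form and name as illustrative rather than canonical):

```julia
using LoopVectorization

# A naive triple loop: @avx reorders, vectorizes, and unrolls it into a
# macro kernel, but blocking/packing for large matrices is left to the caller.
function mygemmavx!(C, A, B)
    @avx for m ∈ 1:size(A, 1), n ∈ 1:size(B, 2)
        Cmn = zero(eltype(C))
        for k ∈ 1:size(A, 2)
            Cmn += A[m, k] * B[k, n]
        end
        C[m, n] = Cmn
    end
    return C
end
```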
Some day, LoopVectorization may itself try to model the costs of memory movement in the L1 and L2 cache, and use these to generate loops around the macro kernel following the work of [Low, et al. (2016)](http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf). But for now, you should view it as a tool for generating efficient computational kernels, leaving the tasks of parallelization and cache efficiency to you.

Until then, performance will degrade rapidly compared to BLAS as the size of the matrices increases. The advantage of the `@avx` macro, however, is that it is general: not every operation is supported by BLAS.
For example, what if `A` were the outer product of two vectors?
<!-- ```julia -->