Commit 182a3a7: Update README.
1 parent b090052

1 file changed: README.md (+60 −100 lines)

## Benchmarks

Please see the documentation for benchmarks versus base Julia, Clang, icc, ifort, gfortran, and Eigen. If you believe any code or compiler flags can be improved, would like to submit your own benchmarks, or have Julia code using LoopVectorization that you would like to be tested for performance regressions on a semi-regular basis, please feel free to file an issue or PR with the code sample.

## Examples

### Dot Product

LLVM/Julia by default generate essentially optimal code for the primary vectorized part of this loop. In many cases -- such as the dot product -- this vectorized part of the loop computes 4*SIMD-vector-width iterations at a time.
On the CPU I am running these benchmarks on, with `Float64` data, the SIMD-vector-width is 8, meaning it will compute 32 iterations at a time.
However, LLVM is very slow at handling the tail, the `length(iterations) % 32` leftover iterations. For this reason, [in benchmark plots](https://chriselrod.github.io/LoopVectorization.jl/latest/examples/dot_product/) you can see performance drop as the size of the remainder increases.
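For instance, a quick check of the tail sizes at play in the benchmark below, which uses vectors of lengths 256 and 255:

```julia
julia> 256 % 32  # 256 divides evenly: no scalar tail to clean up
0

julia> 255 % 32  # one element shorter: a nearly full 31-iteration tail is left over
31
```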

For simple loops like a dot product, LoopVectorization.jl's most important optimization is to handle these tails more efficiently:
<details>
<summary>Click me!</summary>
<p>

```julia
julia> using LoopVectorization, BenchmarkTools

julia> function mydot(a, b)
           s = 0.0
           @inbounds @simd for i ∈ eachindex(a, b)
               s += a[i] * b[i]
           end
           s
       end
mydot (generic function with 1 method)

julia> function mydotavx(a, b)
           s = 0.0
           @avx for i ∈ eachindex(a, b)
               s += a[i] * b[i]
           end
           s
       end
mydotavx (generic function with 1 method)

julia> a = rand(256); b = rand(256);

julia> @btime mydot($a, $b)
  12.220 ns (0 allocations: 0 bytes)
62.67140864639772

julia> @btime mydotavx($a, $b) # performance is similar
  12.104 ns (0 allocations: 0 bytes)
62.67140864639772

julia> a = rand(255); b = rand(255);

julia> @btime mydot($a, $b) # with the loops shorter by 1, the remainder is now 31, and it is slow
  36.530 ns (0 allocations: 0 bytes)
61.25056244423578

julia> @btime mydotavx($a, $b) # performance remains mostly unchanged
  12.226 ns (0 allocations: 0 bytes)
61.250562444235776
```
</p>
</details>

### Matrix Multiply
<details>
<summary>Click me!</summary>
<p>

We can also vectorize fancier loops. A likely familiar example to dive into:
```julia
julia> function mygemm!(C, A, B)
           @inbounds @fastmath for m ∈ axes(A,1), n ∈ axes(B,2)
               Cmn = zero(eltype(C))
               for k ∈ axes(A,2)
                   Cmn += A[m,k] * B[k,n]
               end
               C[m,n] = Cmn
           end
       end
mygemm! (generic function with 1 method)

julia> function mygemmavx!(C, A, B)
           @avx for m ∈ axes(A,1), n ∈ axes(B,2)
               Cmn = zero(eltype(C))
               for k ∈ axes(A,2)
                   Cmn += A[m,k] * B[k,n]
               end
               C[m,n] = Cmn
           end
       end
mygemmavx! (generic function with 1 method)

julia> M, K, N = 191, 189, 171;

julia> C1 = Matrix{Float64}(undef, M, N); A = randn(M, K); B = randn(K, N);

julia> C2 = similar(C1); C3 = similar(C1);

julia> @benchmark mygemmavx!($C1, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     111.722 μs (0.00% GC)
  median time:      112.528 μs (0.00% GC)
  mean time:        112.673 μs (0.00% GC)
  maximum time:     189.400 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark mygemm!($C2, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.891 ms (0.00% GC)
  median time:      4.899 ms (0.00% GC)
  mean time:        4.899 ms (0.00% GC)
  maximum time:     5.049 ms (0.00% GC)
  --------------
  samples:          1021
  evals/sample:     1

julia> using LinearAlgebra, Test

julia> @test all(C1 .≈ C2)
Test Passed

julia> BLAS.set_num_threads(1); BLAS.vendor()
:mkl

julia> @benchmark mul!($C3, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     117.221 μs (0.00% GC)
  median time:      118.745 μs (0.00% GC)
  mean time:        118.892 μs (0.00% GC)
  maximum time:     193.826 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @test all(C1 .≈ C3)
Test Passed

julia> 2e-9M*K*N ./ (111.722e-6, 4.891e-3, 117.221e-6)
(110.50516460500171, 2.524199141279902, 105.32121377568868)
```
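That last line converts each minimum time into GFLOPS. A quick restatement of the arithmetic, with nothing new added (`M`, `K`, `N` and the times are taken from the benchmarks above):

```julia
# Dense matmul performs M*K*N multiply-add pairs,
# i.e. 2*M*K*N floating point operations per call.
M, K, N = 191, 189, 171
flops = 2 * M * K * N      # 12_345_858 flops per call
flops / 111.722e-6 / 1e9   # ≈ 110.5 GFLOPS for mygemmavx!
flops / 4.891e-3 / 1e9     # ≈ 2.52  GFLOPS for mygemm!
flops / 117.221e-6 / 1e9   # ≈ 105.3 GFLOPS for MKL's mul!
```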
It can produce a good macro kernel. An implementation of matrix multiplication able to handle large matrices would also need to perform blocking and packing of arrays to prevent the operations from becoming memory bottlenecked; a sketch of the blocking idea follows below.
Some day, LoopVectorization may itself try to model the costs of memory movement in the L1 and L2 cache, and use these to generate loops around the macro kernel following the work of [Low, et al. (2016)](http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf).

But for now, you should view it as a tool for generating efficient computational kernels, leaving the tasks of parallelization and cache efficiency to you.
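As a minimal sketch of what that blocking might look like (the tile sizes `mc`, `kc`, `nc` are hypothetical placeholders and packing is omitted; this is not LoopVectorization's API, just the `@avx` kernel wrapped in tile loops):

```julia
using LoopVectorization

# Tile the iteration space so each @avx macro-kernel call works on
# cache-sized blocks of A, B, and C.
function mygemm_blocked!(C, A, B; mc = 96, kc = 96, nc = 96)
    fill!(C, zero(eltype(C)))
    for nb in 1:nc:size(B, 2), kb in 1:kc:size(A, 2), mb in 1:mc:size(A, 1)
        ms = mb:min(mb + mc - 1, size(A, 1))
        ks = kb:min(kb + kc - 1, size(A, 2))
        ns = nb:min(nb + nc - 1, size(B, 2))
        @avx for m ∈ ms, n ∈ ns
            Cmn = zero(eltype(C))
            for k ∈ ks
                Cmn += A[m, k] * B[k, n]
            end
            C[m, n] += Cmn  # accumulate across k-tiles
        end
    end
    return C
end
```

Packing would additionally copy each tile into a contiguous buffer before the kernel runs; the Low, et al. paper linked above shows how such tile sizes can be derived analytically.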

</p>
</details>

### Broadcasting

Another example, a straightforward operation expressed well via broadcasting and `*ˡ`, the lazy matrix multiplication operator:
```julia
julia> using LoopVectorization, LinearAlgebra, BenchmarkTools, Test; BLAS.set_num_threads(1)

julia> A = rand(5,77); B = rand(77, 51); C = rand(51,49); D = rand(49,51);

julia> X1 = view(A,1,:) .+ B * (C .+ D');

julia> X2 = @avx view(A,1,:) .+ B .*ˡ (C .+ D');

julia> @test X1 ≈ X2
Test Passed

julia> buf1 = Matrix{Float64}(undef, size(C,1), size(C,2));

julia> buf2 = similar(X1);

julia> @btime $X1 .= view($A,1,:) .+ mul!($buf2, $B, ($buf1 .= $C .+ $D'));
  7.896 μs (0 allocations: 0 bytes)

julia> @btime @avx $X2 .= view($A,1,:) .+ $B .*ˡ ($C .+ $D');
  7.647 μs (0 allocations: 0 bytes)

julia> @test X1 ≈ X2
Test Passed
```
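The `mul_avx!` benchmarked below is defined in an unchanged part of the README, so its definition does not appear in this diff. As a minimal sketch, assuming `StructArray`s of `ComplexF64` (whose `re` and `im` fields `@avx` can load as plain `Float64` arrays), such a kernel might look like:

```julia
using LoopVectorization, StructArrays

# Hypothetical reconstruction of a complex matmul kernel over StructArrays;
# the actual mul_avx! in the README may differ.
function mul_avx!(C::StructArray, A::StructArray, B::StructArray)
    @avx for m ∈ axes(A, 1), n ∈ axes(B, 2)
        Cre = zero(eltype(C.re))
        Cim = zero(eltype(C.im))
        for k ∈ axes(A, 2)
            # (a + b*im) * (c + d*im) = (a*c - b*d) + (a*d + b*c)*im
            Cre += A.re[m, k] * B.re[k, n] - A.im[m, k] * B.im[k, n]
            Cim += A.re[m, k] * B.im[k, n] + A.im[m, k] * B.re[k, n]
        end
        C.re[m, n] = Cre
        C.im[m, n] = Cim
    end
    return C
end
```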

```julia
julia> C1 = StructArray(Matrix{ComplexF64}(undef, M, N));

julia> C2 = collect(similar(C1));

julia> @btime mul_avx!($C1, $A, $B)
  13.525 μs (0 allocations: 0 bytes)

julia> @btime mul!($C2, $(collect(A)), $(collect(B))); # collect turns the StructArray into a regular Array
  14.003 μs (0 allocations: 0 bytes)

julia> @test C1 ≈ C2
Test Passed

```