This library provides the `@avx` macro, which may be used to prefix a `for` loop or broadcast statement. It then tries to vectorize the loop to improve runtime performance.

The macro assumes that loop iterations can be reordered. It also currently supports only simple nested loops, where the loop bounds of inner loops are constant across iterations of the outer loop, and there is only a single loop at each level of the loop nest. These limitations should be removed in a future version.
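For example, a triangular loop nest like the following hypothetical sum could not currently be handled, because the inner loop's bounds vary with the outer iteration:
```julia
# Hypothetical illustration of an unsupported pattern: the inner loop's
# upper bound depends on `i`, so it is not constant across outer iterations.
function lowertrianglesum(A)
    s = 0.0
    for i ∈ 1:size(A,1)
        for j ∈ 1:i  # bounds change with `i`; not currently supported by @avx
            s += A[i,j]
        end
    end
    s
end
```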
A simple example with a single loop is the dot product:
```julia
using LoopVectorization, BenchmarkTools
function mydot(a, b)
    s = 0.0
    @inbounds @simd for i ∈ eachindex(a, b)
        s += a[i] * b[i]
    end
    s
end
function mydotavx(a, b)
    s = 0.0
    @avx for i ∈ eachindex(a, b)
        s += a[i] * b[i]
    end
    s
end
a = rand(256); b = rand(256);
@btime mydot($a, $b)
@btime mydotavx($a, $b)
a = rand(43); b = rand(43);
@btime mydot($a, $b)
@btime mydotavx($a, $b)
```
On most recent CPUs, the performance of the dot product is bounded by the speed at which it can load data; most recent x86_64 CPUs can perform two aligned loads and two fused multiply-adds (`fma`) per clock cycle. However, the dot product requires two loads per `fma`.
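As a back-of-envelope check, using only the per-cycle figures quoted above (and anticipating the self-dot comparison below):
```julia
# Throughput bound from the figures above: 2 loads and 2 fmas per cycle.
loads_per_cycle = 2
fmas_per_cycle  = 2
# The dot product needs 2 loads per fma, capping it at 1 fma per cycle.
dot_rate     = min(fmas_per_cycle, loads_per_cycle / 2)  # 1.0
# A self-dot needs only 1 load per fma, allowing the full 2 fmas per cycle.
selfdot_rate = min(fmas_per_cycle, loads_per_cycle / 1)  # 2.0
selfdot_rate / dot_rate  # 2.0, the expected gap between the two kernels
```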
A self-dot function, on the other hand, requires only one load per `fma`:
```julia
function myselfdot(a)
    s = 0.0
    @inbounds @simd for i ∈ eachindex(a)
        s += a[i] * a[i]
    end
    s
end
function myselfdotavx(a)
    s = 0.0
    @avx for i ∈ eachindex(a)
        s += a[i] * a[i]
    end
    s
end
a = rand(256);
@btime myselfdotavx($a)
@btime myselfdot($a)
@btime myselfdotavx($b)
@btime myselfdot($b)
```
For this reason, the `@avx` version is roughly twice as fast. The `@inbounds @simd` version, however, is not, because it runs into the problem of loop-carried dependencies: to compute `s_new = s_old + a[i]*b[i]`, we must first have finished calculating `s_old`, but -- while two `fma` instructions can be initiated per cycle -- each one takes several clock cycles to complete.

For this reason, we need to unroll the operation to run several independent instances concurrently. The `@avx` macro models this cost to try to pick an optimal unroll factor.
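To make the idea concrete, here is a hand-unrolled sketch (an illustration only, not what `@avx` actually emits) that keeps four independent accumulators in flight, so the `fma`s do not wait on one another:
```julia
# Hypothetical hand-unrolled dot product with four independent accumulators;
# each partial sum forms its own dependency chain, so several fmas can be
# in flight at once instead of serializing on a single `s`.
function mydot_unrolled4(a, b)
    s1 = s2 = s3 = s4 = 0.0
    i = firstindex(a)
    @inbounds while i + 3 <= lastindex(a)
        s1 += a[i]   * b[i]
        s2 += a[i+1] * b[i+1]
        s3 += a[i+2] * b[i+2]
        s4 += a[i+3] * b[i+3]
        i += 4
    end
    @inbounds while i <= lastindex(a)  # handle the remainder
        s1 += a[i] * b[i]
        i += 1
    end
    (s1 + s2) + (s3 + s4)
end
```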
Note that 14 and 12 nm Ryzen chips can only perform one full-width `fma` per clock cycle (along with two loads), so they should see similar performance on the dot and self-dot benchmarks. I haven't verified this, but would like to hear from anyone who can.
We can also vectorize fancier loops. A likely familiar example to dive into:
```julia
function mygemm!(C, A, B)
    @inbounds for i ∈ 1:size(A,1), j ∈ 1:size(B,2)
        Cᵢⱼ = 0.0
        @fastmath for k ∈ 1:size(A,2)
            Cᵢⱼ += A[i,k] * B[k,j]
        end
        C[i,j] = Cᵢⱼ
    end
end
function mygemmavx!(C, A, B)
    @avx for i ∈ 1:size(A,1), j ∈ 1:size(B,2)
        Cᵢⱼ = 0.0
        for k ∈ 1:size(A,2)
            Cᵢⱼ += A[i,k] * B[k,j]
        end
        C[i,j] = Cᵢⱼ
    end
end
M, K, N = 72, 75, 71;
C1 = Matrix{Float64}(undef, M, N); A = randn(M, K); B = randn(K, N);
C2 = similar(C1); C3 = similar(C1);
@btime mygemmavx!($C1, $A, $B)
@btime mygemm!($C2, $A, $B)
using LinearAlgebra, Test
@test all(C1 .≈ C2)
BLAS.set_num_threads(1); BLAS.vendor()
@btime mul!($C3, $A, $B)
@test all(C1 .≈ C3)
```
It can produce a decent macro kernel. In the future, I would like it to also model the cost of memory movement in the L1 and L2 caches, and use that to generate blocking loops around the macro kernel, following the work of [Low, et al. (2016)](http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf).

Until then, performance will degrade rapidly compared to BLAS as the size of the matrices increases. The advantage of the `@avx` macro, however, is that it is general: not every operation is supported by BLAS.
For example, what if `A` were the outer product of two vectors?
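Below is a hypothetical sketch of one way to express this: computing `C = (u * v') * B` directly from the factors `u` and `v`, without ever materializing `A` (the function name and loop structure here are illustrative assumptions, not an API from the library):
```julia
# Hypothetical: A = u * v' is never formed; its entries u[i] * v[k] are
# computed on the fly inside the same loop nest as mygemmavx! above.
function mygemmouter_avx!(C, u, v, B)
    @avx for i ∈ eachindex(u), j ∈ 1:size(B,2)
        Cᵢⱼ = 0.0
        for k ∈ eachindex(v)
            Cᵢⱼ += (u[i] * v[k]) * B[k,j]
        end
        C[i,j] = Cᵢⱼ
    end
end
```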
Another example, a straightforward operation expressed well via broadcasting:
```julia
a = rand(37); B = rand(37, 47); c = rand(47); c′ = c';

d1 = @. a + B * c′;
d2 = @avx @. a + B * c′;

@test all(d1 .≈ d2)

@btime @. $d1 = $a + $B * $c′;
@btime @avx @. $d2 = $a + $B * $c′;
@test all(d1 .≈ d2)
```
This can be optimized in a similar manner to the BLAS example, albeit to a much smaller degree because the naive broadcast already benefits from vectorization (unlike the naive matrix-multiplication loops).
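For reference, the broadcast fuses into a single loop nest; a plain-loop equivalent (written out here only to make the indexing explicit, under Julia's standard broadcasting rules) is:
```julia
# Loop form of d = @. a + B * c′: `a` is a length-37 column and `c′` a
# 1×47 row, so the result is 37×47 with d[i,j] = a[i] + B[i,j] * c[j].
function broadcastloops!(d, a, B, c)
    for j ∈ axes(B, 2), i ∈ axes(B, 1)
        d[i,j] = a[i] + B[i,j] * c[j]
    end
    d
end
```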
Originally, LoopVectorization only provided a simple, dumb transform on a single loop via the `@vectorize` macro. This transformation took element-type and unroll-factor arguments and performed no analysis of the loop, simply applying the specified arguments. For backwards compatibility, this macro is still supported, but it may eventually be deprecated.
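Based purely on the description above, legacy usage looked roughly like the following (the exact argument syntax here is an assumption, not verified against the old API):
```julia
# Assumed syntax: the element type and unroll factor are passed explicitly;
# no analysis is performed, the loop is simply transformed as told.
function mydot_legacy(a, b)
    s = 0.0
    @vectorize Float64 4 for i ∈ eachindex(a, b)
        s += a[i] * b[i]
    end
    s
end
```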