
Commit 04d577c

@turbo, ambiguity fixes.

1 parent 6d3c75c


52 files changed: +407 −400 lines

Project.toml

Lines changed: 2 additions & 2 deletions

@@ -1,7 +1,7 @@
 name = "LoopVectorization"
 uuid = "bdcacae8-1622-11e9-2a5c-532679323890"
 authors = ["Chris Elrod <[email protected]>"]
-version = "0.12.21"
+version = "0.12.22"
 
 [deps]
 ArrayInterface = "4fba245c-0d91-5ea0-9b3e-6abc04ee57a9"
@@ -27,7 +27,7 @@ OffsetArrays = "1.4.1"
 Requires = "1"
 SLEEFPirates = "0.6.18"
 Static = "0.2"
-StrideArraysCore = "0.1.5"
+StrideArraysCore = "0.1.11"
 ThreadingUtilities = "0.4.2"
 UnPack = "1"
 VectorizationBase = "0.20.4"

README.md

Lines changed: 12 additions & 12 deletions

@@ -17,15 +17,15 @@ LoopVectorization is supported on Julia 1.1 and later. It is tested on Julia 1.5
 ## Warning
 
 Misusing LoopVectorization can have [serious consequences](http://catb.org/jargon/html/N/nasal-demons.html). Like `@inbounds`, misusing it can lead to segfaults and memory corruption.
-We expect that any time you use the `@avx` macro with a given block of code, you:
-1. Are not indexing an array out of bounds. `@avx` does not perform any bounds checking.
+We expect that any time you use the `@turbo` macro with a given block of code, you:
+1. Are not indexing an array out of bounds. `@turbo` does not perform any bounds checking.
 2. Are not iterating over an empty collection. Iterating over an empty loop such as `for i ∈ eachindex(Float64[])` is undefined behavior, and will likely result in out-of-bounds memory accesses. Ensure that loops behave correctly.
-3. Are not relying on a specific execution order. `@avx` can and will re-order operations and loops inside its scope, so correctness cannot depend on a particular order. You cannot implement `cumsum` with `@avx`.
+3. Are not relying on a specific execution order. `@turbo` can and will re-order operations and loops inside its scope, so correctness cannot depend on a particular order. You cannot implement `cumsum` with `@turbo`.
 4. Are not using multiple loops at the same level in nested loops.
 
 ## Usage
 
-This library provides the `@avx` macro, which may be used to prefix a `for` loop or broadcast statement.
+This library provides the `@turbo` macro, which may be used to prefix a `for` loop or broadcast statement.
 It then tries to vectorize the loop to improve runtime performance.
 
 The macro assumes that loop iterations can be reordered. It also currently supports simple nested loops, where loop bounds of inner loops are constant across iterations of the outer loop, and only a single loop at each level of the loop nest. These limitations should be removed in a future version.
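To make rule 3 concrete: a prefix sum is the canonical order-dependent loop. A minimal sketch (illustrative, not from this repository) of a loop that must *not* be wrapped in `@turbo`:

```julia
function prefix_sum!(out, x)
    acc = zero(eltype(x))
    for i ∈ eachindex(x, out)   # correct only in this exact order
        acc += x[i]
        out[i] = acc            # depends on every earlier iteration
    end
    out
end
# Wrapping this loop in `@turbo` would be undefined behavior:
# the macro may reorder iterations, breaking the accumulation chain.
```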
@@ -60,7 +60,7 @@ mydot (generic function with 1 method)
 
 julia> function mydotavx(a, b)
            s = 0.0
-           @avx for i ∈ eachindex(a,b)
+           @turbo for i ∈ eachindex(a,b)
                s += a[i]*b[i]
            end
            s
@@ -111,7 +111,7 @@ julia> function mygemm!(C, A, B)
 mygemm! (generic function with 1 method)
 
 julia> function mygemmavx!(C, A, B)
-           @avx for m ∈ axes(A,1), n ∈ axes(B,2)
+           @turbo for m ∈ axes(A,1), n ∈ axes(B,2)
               Cmn = zero(eltype(C))
               for k ∈ axes(A,2)
                   Cmn += A[m,k] * B[k,n]
@@ -207,7 +207,7 @@ julia> A = rand(5,77); B = rand(77, 51); C = rand(51,49); D = rand(49,51);
 
 julia> X1 = view(A,1,:) .+ B * (C .+ D');
 
-julia> X2 = @avx view(A,1,:) .+ B .*ˡ (C .+ D');
+julia> X2 = @turbo view(A,1,:) .+ B .*ˡ (C .+ D');
 
 julia> @test X1 ≈ X2
 Test Passed
@@ -219,7 +219,7 @@ julia> buf2 = similar(X1);
 julia> @btime $X1 .= view($A,1,:) .+ mul!($buf2, $B, ($buf1 .= $C .+ $D'));
   9.188 μs (0 allocations: 0 bytes)
 
-julia> @btime @avx $X2 .= view($A,1,:) .+ $B .*ˡ ($C .+ $D');
+julia> @btime @turbo $X2 .= view($A,1,:) .+ $B .*ˡ ($C .+ $D');
   6.751 μs (0 allocations: 0 bytes)
 
 julia> @test X1 ≈ X2
@@ -238,7 +238,7 @@ This may improve as the optimizations within LoopVectorization improve.
 Note that loops will generally be faster than broadcasting. This is because the behavior of broadcasts is determined by runtime information (i.e., dimensions other than the leading dimension of size `1` will be broadcast; it is not known at compile time which these will be).
 ```julia
 julia> function AmulBtest!(C,A,Bk,Bn,d)
-           @avx for m ∈ axes(A,1), n ∈ axes(Bk,2)
+           @turbo for m ∈ axes(A,1), n ∈ axes(Bk,2)
               ΔCmn = zero(eltype(C))
               for k ∈ axes(A,2)
                   ΔCmn += A[m,k] * (Bk[k,n] + Bn[n,k])
@@ -276,7 +276,7 @@ BenchmarkTools.Trial:
 <summary>Click me!</summary>
 <p>
 
-The key to the `@avx` macro's performance gains is leveraging knowledge of exactly how data like `Float64`s and `Int`s are handled by a CPU. As such, it is not straightforward to generalize the `@avx` macro to work on arrays containing structs such as `Matrix{Complex{Float64}}`. Instead, it is currently recommended that users wishing to apply `@avx` to arrays of structs use packages such as [StructArrays.jl](https://github.com/JuliaArrays/StructArrays.jl), which transform an array where each element is a struct into a struct where each element is an array. Using StructArrays.jl, we can write a matrix-multiply (gemm) kernel that works on matrices of `Complex{Float64}`s and `Complex{Int}`s:
+The key to the `@turbo` macro's performance gains is leveraging knowledge of exactly how data like `Float64`s and `Int`s are handled by a CPU. As such, it is not straightforward to generalize the `@turbo` macro to work on arrays containing structs such as `Matrix{Complex{Float64}}`. Instead, it is currently recommended that users wishing to apply `@turbo` to arrays of structs use packages such as [StructArrays.jl](https://github.com/JuliaArrays/StructArrays.jl), which transform an array where each element is a struct into a struct where each element is an array. Using StructArrays.jl, we can write a matrix-multiply (gemm) kernel that works on matrices of `Complex{Float64}`s and `Complex{Int}`s:
 ```julia
 using LoopVectorization, LinearAlgebra, StructArrays, BenchmarkTools, Test
 
@@ -285,7 +285,7 @@ BLAS.set_num_threads(1); @show BLAS.vendor()
 const MatrixFInt64 = Union{Matrix{Float64}, Matrix{Int}}
 
 function mul_avx!(C::MatrixFInt64, A::MatrixFInt64, B::MatrixFInt64)
-    @avx for m ∈ 1:size(A,1), n ∈ 1:size(B,2)
+    @turbo for m ∈ 1:size(A,1), n ∈ 1:size(B,2)
         Cmn = zero(eltype(C))
         for k ∈ 1:size(A,2)
             Cmn += A[m,k] * B[k,n]
@@ -295,7 +295,7 @@ function mul_avx!(C::MatrixFInt64, A::MatrixFInt64, B::MatrixFInt64)
 end
 
 function mul_add_avx!(C::MatrixFInt64, A::MatrixFInt64, B::MatrixFInt64, factor=1)
-    @avx for m ∈ 1:size(A,1), n ∈ 1:size(B,2)
+    @turbo for m ∈ 1:size(A,1), n ∈ 1:size(B,2)
         ΔCmn = zero(eltype(C))
         for k ∈ 1:size(A,2)
             ΔCmn += A[m,k] * B[k,n]
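For readers unfamiliar with StructArrays.jl, here is a minimal sketch (variable names illustrative) of the array-of-structs to struct-of-arrays transformation the README describes, which yields the plain `Matrix{Float64}` fields these kernels can consume:

```julia
using StructArrays

Ac = rand(ComplexF64, 4, 4)  # array of structs
A  = StructArray(Ac)         # struct of arrays
A.re                         # 4×4 Matrix{Float64} of real parts
A.im                         # 4×4 Matrix{Float64} of imaginary parts
```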

benchmark/benchmarkflops.jl

Lines changed: 3 additions & 3 deletions

@@ -275,7 +275,7 @@ end
 function exp_bench!(br, s, i)
     a = rand(s); b = similar(a)
     n_gflop = 1e-9*s # not really gflops
-    br[1,i] = n_gflop / @belapsed @avx @. $b = exp($a)
+    br[1,i] = n_gflop / @belapsed @turbo @. $b = exp($a)
     baseb = copy(b)
     br[2,i] = n_gflop / @belapsed @. $b = exp($a)
     @assert b ≈ baseb "LoopVec wrong?"
@@ -296,7 +296,7 @@ function aplusBc_bench!(br, s, i)
     a = rand(M); B = rand(M,N); c = rand(N);
     c′ = c'; D = similar(B)
     n_gflop = 2e-9 * M*N
-    br[1,i] = n_gflop / @belapsed @avx @. $D = $a + $B * $c′
+    br[1,i] = n_gflop / @belapsed @turbo @. $D = $a + $B * $c′
     Dcopy = copy(D); fill!(D, NaN);
     br[2,i] = n_gflop / @belapsed @. $D = $a + $B * $c′
     @assert D ≈ Dcopy "LoopVec wrong?"
@@ -319,7 +319,7 @@ end
 function AplusAt_bench!(br, s, i)
     A = rand(s,s); B = similar(A)
     n_gflop = 1e-9*s^2
-    br[1,i] = n_gflop / @belapsed @avx @. $B = $A + $A'
+    br[1,i] = n_gflop / @belapsed @turbo @. $B = $A + $A'
     baseB = copy(B); fill!(B, NaN);
     br[2,i] = n_gflop / @belapsed @. $B = $A + $A'
     @assert B ≈ baseB "LoopVec wrong?"
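The pattern in each benchmark above is GFLOPS = (FLOP count × 1e-9) / seconds, with BenchmarkTools' `@belapsed` returning the minimum elapsed time in seconds. A standalone sketch, with sizes of my own choosing:

```julia
using LoopVectorization, BenchmarkTools

a = rand(512); b = similar(a)
n_gflop = 1e-9 * length(a)            # one exp per element; "not really gflops"
t = @belapsed @turbo @. $b = exp($a)  # minimum time in seconds
gflops = n_gflop / t
```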

benchmark/looptests.jl

Lines changed: 22 additions & 22 deletions

@@ -64,7 +64,7 @@ function jgemm!(𝐂, 𝐀ᵀ::Adjoint, 𝐁ᵀ::Adjoint)
     end
 end
 function gemmavx!(𝐂, 𝐀, 𝐁)
-    @avx for m ∈ indices((𝐀,𝐂),1), n ∈ indices((𝐁,𝐂),2)
+    @turbo for m ∈ indices((𝐀,𝐂),1), n ∈ indices((𝐁,𝐂),2)
         𝐂ₘₙ = zero(eltype(𝐂))
         for k ∈ indices((𝐀,𝐁),(2,1))
             𝐂ₘₙ += 𝐀[m,k] * 𝐁[k,n]
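A note on the `indices` helper these kernels use: `indices((𝐀,𝐂),1)` iterates a single range covering dimension 1 of both arrays, and `indices((𝐀,𝐁),(2,1))` pairs dimension 2 of the first array with dimension 1 of the second. A hedged sketch of that reading (treat the exact return values as an assumption):

```julia
using LoopVectorization  # exports `indices`

A = rand(8, 4); B = rand(4, 8); C = rand(8, 8)
indices((A, C), 1)       # shared range over dim 1 of A and C, i.e. 1:8
indices((A, B), (2, 1))  # dim 2 of A paired with dim 1 of B, i.e. 1:4
```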
@@ -76,7 +76,7 @@ function gemmavx!(Cc::AbstractMatrix{Complex{T}}, Ac::AbstractMatrix{Complex{T}}
     A = reinterpret(reshape, T, Ac)
     B = reinterpret(reshape, T, Bc)
     C = reinterpret(reshape, T, Cc)
-    @avx for m ∈ indices((A,C),2), n ∈ indices((B,C),3)
+    @turbo for m ∈ indices((A,C),2), n ∈ indices((B,C),3)
         Cre = zero(T)
         Cim = zero(T)
         for k ∈ indices((A,B),(3,2))
@@ -88,7 +88,7 @@ function gemmavx!(Cc::AbstractMatrix{Complex{T}}, Ac::AbstractMatrix{Complex{T}}
     end
 end
 function gemmavxt!(𝐂, 𝐀, 𝐁)
-    @avxt for m ∈ indices((𝐀,𝐂),1), n ∈ indices((𝐁,𝐂),2)
+    @tturbo for m ∈ indices((𝐀,𝐂),1), n ∈ indices((𝐁,𝐂),2)
         𝐂ₘₙ = zero(eltype(𝐂))
         for k ∈ indices((𝐀,𝐁),(2,1))
             𝐂ₘₙ += 𝐀[m,k] * 𝐁[k,n]
@@ -100,7 +100,7 @@ function gemmavxt!(Cc::AbstractMatrix{Complex{T}}, Ac::AbstractMatrix{Complex{T}
     A = reinterpret(reshape, T, Ac)
     B = reinterpret(reshape, T, Bc)
     C = reinterpret(reshape, T, Cc)
-    @avxt for m ∈ indices((A,C),2), n ∈ indices((B,C),3)
+    @tturbo for m ∈ indices((A,C),2), n ∈ indices((B,C),3)
         Cre = zero(T)
         Cim = zero(T)
         for k ∈ indices((A,B),(3,2))
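The `reinterpret(reshape, T, Ac)` calls above (available on Julia ≥ 1.6) view a complex matrix as a real array with a new leading dimension of length 2, so real and imaginary parts can be loaded as plain `T`s; a quick illustration:

```julia
Ac = [1.0 + 2.0im  3.0 + 4.0im]        # 1×2 Matrix{ComplexF64}
A  = reinterpret(reshape, Float64, Ac) # 2×1×2 array
A[1, 1, 1], A[2, 1, 1]                 # (1.0, 2.0): real and imaginary parts
```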
@@ -121,16 +121,16 @@ function jdot(a, b)
 end
 function jdotavx(a, b)
     s = zero(eltype(a))
-    # @avx for i ∈ eachindex(a,b)
-    @avx for i ∈ eachindex(a)
+    # @turbo for i ∈ eachindex(a,b)
+    @turbo for i ∈ eachindex(a)
         s += a[i] * b[i]
     end
     s
 end
 function jdotavxt(a, b)
     s = zero(eltype(a))
-    # @avx for i ∈ eachindex(a,b)
-    @avxt for i ∈ eachindex(a)
+    # @turbo for i ∈ eachindex(a,b)
+    @tturbo for i ∈ eachindex(a)
         s += a[i] * b[i]
     end
     s
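As the hunk above shows, `@avxt` becomes `@tturbo`, the threaded counterpart of `@turbo`. A minimal standalone sketch contrasting the two (function names are illustrative):

```julia
using LoopVectorization

function dot_turbo(a, b)
    s = zero(eltype(a))
    @turbo for i ∈ eachindex(a)   # single-threaded SIMD
        s += a[i] * b[i]
    end
    s
end

function dot_tturbo(a, b)
    s = zero(eltype(a))
    @tturbo for i ∈ eachindex(a)  # SIMD plus threading
        s += a[i] * b[i]
    end
    s
end
```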
@@ -144,7 +144,7 @@ function jselfdot(a)
 end
 function jselfdotavx(a)
     s = zero(eltype(a))
-    @avx for i ∈ eachindex(a)
+    @turbo for i ∈ eachindex(a)
         s += a[i] * a[i]
     end
     s
@@ -160,7 +160,7 @@ end
 function jdot3v2avx(x, A, y)
     M, N = size(A)
     s = zero(promote_type(eltype(x), eltype(A), eltype(y)))
-    @avx for n ∈ 1:N, m ∈ 1:M
+    @turbo for n ∈ 1:N, m ∈ 1:M
         s += x[m] * A[m,n] * y[n]
     end
     s
@@ -178,7 +178,7 @@ function jdot3(x, A, y)
 end
 function jdot3avx(x, A, y)
     s = zero(promote_type(eltype(x), eltype(A), eltype(y)))
-    @avx for n ∈ axes(A,2)
+    @turbo for n ∈ axes(A,2)
         t = zero(s)
         for m ∈ axes(A,1)
             t += x[m] * A[m,n]
@@ -193,7 +193,7 @@ function jvexp!(b, a)
     end
 end
 function jvexpavx!(b, a)
-    @avx for i ∈ eachindex(a)
+    @turbo for i ∈ eachindex(a)
         b[i] = exp(a[i])
     end
 end
@@ -206,7 +206,7 @@ function jsvexp(a)
 end
 function jsvexpavx(a)
     s = zero(eltype(a))
-    @avx for i ∈ eachindex(a)
+    @turbo for i ∈ eachindex(a)
         s += exp(a[i])
     end
     s
@@ -230,7 +230,7 @@ function jgemv!(𝐲, 𝐀ᵀ::Adjoint, 𝐱)
     end
 end
 function jgemvavx!(𝐲, 𝐀, 𝐱)
-    @avx for i ∈ eachindex(𝐲)
+    @turbo for i ∈ eachindex(𝐲)
         𝐲ᵢ = zero(eltype(𝐲))
         for j ∈ eachindex(𝐱)
             𝐲ᵢ += 𝐀[i,j] * 𝐱[j]
@@ -248,7 +248,7 @@ function jvar!(𝐬², 𝐀, x̄)
     end
 end
 function jvaravx!(𝐬², 𝐀, x̄)
-    @avx for j ∈ eachindex(𝐬²)
+    @turbo for j ∈ eachindex(𝐬²)
         𝐬²ⱼ = zero(eltype(𝐬²))
         x̄ⱼ = x̄[j]
         for i ∈ 1:size(𝐀,2)
@@ -259,7 +259,7 @@ function jvaravx!(𝐬², 𝐀, x̄)
     end
 end
 japlucBc!(D, a, B, c) = @. D = a + B * c';
-japlucBcavx!(D, a, B, c) = @avx @. D = a + B * c';
+japlucBcavx!(D, a, B, c) = @turbo @. D = a + B * c';
 
 function jOLSlp(y, X, β)
     lp = zero(eltype(y))
@@ -274,7 +274,7 @@ function jOLSlp(y, X, β)
 end
 function jOLSlp_avx(y, X, β)
     lp = zero(eltype(y))
-    @avx for i ∈ eachindex(y)
+    @turbo for i ∈ eachindex(y)
         δ = y[i]
         for j ∈ eachindex(β)
             δ -= X[i,j] * β[j]
@@ -300,7 +300,7 @@ function randomaccessavx(P, basis, coeffs::Vector{T}) where {T}
     C = length(coeffs)
     A = size(P, 1)
     p = zero(T)
-    @avx for c ∈ 1:C
+    @turbo for c ∈ 1:C
         pc = coeffs[c]
         for a = 1:A
             pc *= P[a, basis[a, c]]
@@ -319,7 +319,7 @@ end
 function jlogdettriangleavx(B::Union{LowerTriangular,UpperTriangular})
     A = parent(B) # No longer supported
     ld = zero(eltype(A))
-    @avx for n ∈ axes(A,1)
+    @turbo for n ∈ axes(A,1)
         ld += log(A[n,n])
     end
     ld
@@ -339,7 +339,7 @@ function filter2d!(out::AbstractMatrix, A::AbstractMatrix, kern)
     out
 end
 function filter2davx!(out::AbstractMatrix, A::AbstractMatrix, kern)
-    @avx for J in CartesianIndices(out)
+    @turbo for J in CartesianIndices(out)
         tmp = zero(eltype(out))
         for I ∈ CartesianIndices(kern)
             tmp += A[I + J] * kern[I]
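For context, `filter2davx!` requires that every `A[I + J]` access stay in bounds, since `@turbo` performs no bounds checking; a hedged usage sketch with OffsetArrays (the sizes are my own choice):

```julia
using LoopVectorization, OffsetArrays

A    = rand(100, 100)
kern = OffsetArray(rand(3, 3), -1:1, -1:1)  # centered 3×3 kernel
out  = OffsetArray(Matrix{Float64}(undef, 98, 98), 2:99, 2:99)
# Every J ∈ CartesianIndices(out) plus I ∈ CartesianIndices(kern)
# lands inside axes(A) == (1:100, 1:100).
filter2davx!(out, A, kern)
```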
@@ -364,7 +364,7 @@ end
 function filter2dunrolledavx!(out::AbstractMatrix, A::AbstractMatrix, kern::SizedOffsetMatrix{T,-1,1,-1,1}) where {T}
     rng1, rng2 = axes(out)
     Base.Cartesian.@nexprs 3 jk -> Base.Cartesian.@nexprs 3 ik -> kern_ik_jk = kern[ik-2,jk-2]
-    @avx for j in rng2, i in rng1
+    @turbo for j in rng2, i in rng1
         tmp_0 = zero(eltype(out))
         Base.Cartesian.@nexprs 3 jk -> Base.Cartesian.@nexprs 3 ik -> tmp_{ik+(jk-1)*3} = A[i+(ik-2),j+(jk-2)] * kern_ik_jk + tmp_{ik+(jk-1)*3-1}
         out[i,j] = tmp_9
@@ -379,7 +379,7 @@ end
 #     end
 # end
 # function smooth_line_avx!(sl,nrm1,j,i1,sl,rl,ih2,denom)
-#     @avx for i=i1:2:nrm1
+#     @turbo for i=i1:2:nrm1
 #         sl[i,j]=denom*(rl[i,j]+ih2*(sl[i,j-1]+sl[i-1,j]+sl[i+1,j]+sl[i,j+1]))
 #     end
 # end

docs/src/api.md

Lines changed: 1 addition & 1 deletion

@@ -3,7 +3,7 @@
 ## Macros
 
 ```@docs
-@avx
+@turbo
 @_avx
 ```
 
docs/src/devdocs/constructing_loopsets.md

Lines changed: 4 additions & 4 deletions

@@ -2,9 +2,9 @@
 
 ## Loop expressions
 
-When applying `@avx` to a loop expression, it creates a `LoopSet` without awareness of type information, and then [condenses the information](https://github.com/JuliaSIMD/LoopVectorization.jl/blob/master/src/condense_loopset.jl) into a summary which is passed as type information to a generated function.
+When applying `@turbo` to a loop expression, it creates a `LoopSet` without awareness of type information, and then [condenses the information](https://github.com/JuliaSIMD/LoopVectorization.jl/blob/master/src/condense_loopset.jl) into a summary which is passed as type information to a generated function.
 ```julia
-julia> @macroexpand @avx for m ∈ 1:M, n ∈ 1:N
+julia> @macroexpand @turbo for m ∈ 1:M, n ∈ 1:N
            C[m,n] = zero(eltype(B))
            for k ∈ 1:K
               C[m,n] += A[m,k] * B[k,n]
@@ -36,7 +36,7 @@ and the set of loop bounds:
 
 ## Broadcasting
 
-When applying the `@avx` macro to a broadcast expression, there are no explicit loops, and even the dimensionality of the operation is unknown. Consequently, the `LoopSet` object must be constructed at compile time. The function, the involved operations, and their relationships are straightforward to infer from the structure of nested broadcasts:
+When applying the `@turbo` macro to a broadcast expression, there are no explicit loops, and even the dimensionality of the operation is unknown. Consequently, the `LoopSet` object must be constructed at compile time. The function, the involved operations, and their relationships are straightforward to infer from the structure of nested broadcasts:
 ```julia
 julia> Meta.@lower @. f(g(a,b) + c) / d
 :($(Expr(:thunk, CodeInfo(
@@ -49,7 +49,7 @@ julia> Meta.@lower @. f(g(a,b) + c) / d
 └── return %5
 ))))
 
-julia> @macroexpand @avx @. f(g(a,b) + c) / d
+julia> @macroexpand @turbo @. f(g(a,b) + c) / d
 quote
     var"##262" = Base.broadcasted(g, a, b)
     var"##263" = Base.broadcasted(+, var"##262", c)

docs/src/devdocs/reference.md

Lines changed: 1 addition & 1 deletion

@@ -33,7 +33,7 @@ LoopVectorization.ArrayReferenceMeta
 
 ## Condensed types
 
-These are used when encoding the `@avx` block as a type parameter for passing through
+These are used when encoding the `@turbo` block as a type parameter for passing through
 to the `@generated` function.
 
 ```@docs

docs/src/examples/array_interface.md

Lines changed: 2 additions & 2 deletions

@@ -14,7 +14,7 @@ By supporting the interface, using `LoopVectorization` can simplify implementing
 using StaticArrays, LoopVectorization
 
 @inline function AmulB!(C, A, B)
-    @avx for n ∈ axes(C,2), m ∈ axes(C,1)
+    @turbo for n ∈ axes(C,2), m ∈ axes(C,1)
         Cmn = zero(eltype(C))
         for k ∈ axes(B,1)
             Cmn += A[m,k] * B[k,n]
@@ -93,7 +93,7 @@ C_hybrid = HybridArray{Tuple{StaticArrays.Dynamic(),StaticArrays.Dynamic(),3,3}}
 # A is M x K x I x L
 # B is K x N x L x J
 function bmul!(C, A, B)
-    @avx for n in axes(C,2), m in axes(C,1), j in axes(C,4), i in axes(C,3)
+    @turbo for n in axes(C,2), m in axes(C,1), j in axes(C,4), i in axes(C,3)
         Cmnji = zero(eltype(C))
         for k in axes(B,1), l in axes(B,3)
             Cmnji += A[m,k,i,l] * B[k,n,l,j]
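A hedged usage sketch of the `AmulB!` kernel above with mutable static matrices (assuming `MMatrix` works through the array interface, as this page describes):

```julia
using StaticArrays, LoopVectorization, LinearAlgebra

A = @MMatrix rand(7, 7)
B = @MMatrix rand(7, 7)
C = MMatrix{7,7,Float64}(undef)
AmulB!(C, A, B)
C ≈ A * B  # true
```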
