Commit 320cf3a

merged main
2 parents df1424c + befd727 commit 320cf3a

File tree

11 files changed: +153 −44 lines changed


Project.toml

Lines changed: 2 additions & 1 deletion

@@ -1,7 +1,8 @@
 name = "LoopVectorization"
 uuid = "bdcacae8-1622-11e9-2a5c-532679323890"
 authors = ["Chris Elrod <[email protected]>"]
-version = "0.12.160"
+version = "0.12.163"
+
 
 [deps]
 ArrayInterface = "4fba245c-0d91-5ea0-9b3e-6abc04ee57a9"

README.md

Lines changed: 24 additions & 0 deletions

@@ -273,6 +273,30 @@ BenchmarkTools.Trial:
   evals/sample: 6
 ```
 
+Note: `@turbo` does not support passing keyword arguments to the function calls to which it is applied, e.g.:
+```julia
+julia> @turbo round.(rand(10));
+
+julia> @turbo round.(rand(10); digits = 3)
+ERROR: TypeError: in typeassert, expected Expr, got a value of type GlobalRef
+```
+
+You can work around this by creating a callable wrapper before applying `@turbo`, as follows:
+```julia
+struct KwargCall{F,T}
+  f::F
+  x::T
+end
+@inline (f::KwargCall)(args...) = f.f(args...; f.x...)
+
+f = KwargCall(round, (digits = 3,));
+@turbo f.(rand(10))
+10-element Vector{Float64}:
+ 0.763
+ ⋮
+ 0.851
+```
+
 </p>
 </details>
 
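The `KwargCall` trick in the diff above freezes keyword arguments into a callable object so that the call site only passes positional arguments. The same pattern is language-agnostic; here is a minimal Python sketch of the idea (the `KwargCall` name and example values are illustrative, not part of any library):

```python
class KwargCall:
    """Wrap a function together with frozen keyword arguments so the
    resulting object can be called with positional arguments only."""

    def __init__(self, f, **kwargs):
        self.f = f
        self.kwargs = kwargs

    def __call__(self, *args):
        # Forward positional args, splice in the frozen keyword args.
        return self.f(*args, **self.kwargs)

# Freeze ndigits=3 into a positional-only callable, mirroring
# `KwargCall(round, (digits = 3,))` from the Julia example.
round3 = KwargCall(round, ndigits=3)
print([round3(x) for x in (0.7634, 0.8512)])  # [0.763, 0.851]
```

In Python one would normally reach for `functools.partial(round, ndigits=3)`; the explicit struct form mirrors the Julia workaround, where the wrapper type keeps the call site free of keyword syntax.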

docs/src/devdocs/evaluating_loops.md

Lines changed: 2 additions & 2 deletions

@@ -1,6 +1,6 @@
 # Determining the strategy for evaluating loops
 
-The heart of the optimizatizations performed by LoopVectorization are given in the [determinestrategy.jl](https://github.com/JuliaSIMD/LoopVectorization.jl/blob/master/src/determinestrategy.jl) file utilizing instruction costs specified in [costs.jl](https://github.com/JuliaSIMD/LoopVectorization.jl/blob/master/src/costs.jl).
+The heart of the optimizations performed by LoopVectorization are given in the [determinestrategy.jl](https://github.com/JuliaSIMD/LoopVectorization.jl/blob/master/src/modeling/determinestrategy.jl) file utilizing instruction costs specified in [costs.jl](https://github.com/JuliaSIMD/LoopVectorization.jl/blob/master/src/modeling/costs.jl).
 Essentially, it estimates the cost of different means of evaluating the loops. It iterates through the different possible loop orders, as well as considering which loops to unroll, and which to vectorize. It will consider unrolling 1 or 2 loops (but it could settle on unrolling by a factor of 1, i.e. not unrolling), and vectorizing 1.
 
 The cost estimate is based on the costs of individual instructions and the number of times each one needs to be executed for the given strategy. The instruction cost can be broken into several components:

@@ -14,7 +14,7 @@ Data on individual instructions for specific architectures can be found on [Agne
 Examples of how these come into play:
 - Vectorizing a loop will result in each instruction evaluating multiple iterations, but the costs of loads and stores will change based on the memory layouts of the accessed arrays.
 - Unrolling can help reduce the number of times an operation must be performed, for example if it can allow us to reuse memory multiple times rather than reloading it every time it is needed.
-- When there is a reduction, such as performing a sum, there is a dependency chain. Each `+` has to wait for the previous `+` to finish executing before it can begin, thus execution time is bounded by latency rather than minimum of the throughput of the `+` and load operations. By unrolling the loop, we can create multiple independent dependency chains.
+- When there is a reduction, such as performing a sum, there is a dependency chain. Each `+` has to wait for the previous `+` to finish executing before it can begin, thus execution time is bounded by latency rather than the minimum of the throughput of the `+` and load operations. By unrolling the loop, we can create multiple independent dependency chains.
 
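The dependency-chain point in the devdocs hunk above can be made concrete with a back-of-envelope cycle count: a serial reduction costs roughly `N × latency`, while unrolling into `U` independent accumulators shortens each chain by a factor of `U` until throughput becomes the bound. A toy model (this is not LoopVectorization's actual cost code, and the latency/throughput numbers are illustrative):

```python
def reduction_cycles(n, latency, recip_throughput, unroll):
    """Estimate cycles to sum n elements using `unroll` independent
    accumulator chains. Each chain serializes its own adds (latency-bound);
    total issue pressure gives a throughput lower bound; the accumulators
    must be folded together at the end."""
    chain = (n / unroll) * latency      # serialized adds in one chain
    issue = n * recip_throughput        # throughput bound across all adds
    combine = (unroll - 1) * latency    # fold the accumulators at the end
    return max(chain, issue) + combine

# Example: add latency 4 cycles, reciprocal throughput 0.5 cycles/op.
print(reduction_cycles(1024, 4, 0.5, 1))  # 4096.0 (latency-bound)
print(reduction_cycles(1024, 4, 0.5, 8))  # 540.0 (near throughput-bound)
```

The jump from 4096 to 540 estimated cycles is the payoff the devdocs describe: unrolling creates independent chains so the adder's latency stops dominating.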

docs/src/getting_started.md

Lines changed: 25 additions & 0 deletions

@@ -38,6 +38,31 @@ Aside from loops, `LoopVectorization.jl` also supports broadcasting.
 !!! danger
     Broadcasting an `Array` `A` when `size(A,1) == 1` is NOT SUPPORTED, unless this is known at compile time (e.g., broadcasting a transposed vector is fine). Otherwise, you will probably crash Julia.
 
+Note: `@turbo` does not support passing keyword arguments to the function calls to which it is applied, e.g.:
+```julia
+julia> @turbo round.(rand(10));
+
+julia> @turbo round.(rand(10); digits = 3);
+ERROR: TypeError: in typeassert, expected Expr, got a value of type GlobalRef
+```
+
+You can work around this by creating a callable wrapper before applying `@turbo`, as follows:
+```julia
+struct KwargCall{F,T}
+  f::F
+  x::T
+end
+@inline (f::KwargCall)(args...) = f.f(args...; f.x...)
+
+f = KwargCall(round, (digits = 3,));
+@turbo f.(rand(10))
+10-element Vector{Float64}:
+ 0.763
+ ⋮
+ 0.851
+```
+
+
 ```julia
 julia> using LoopVectorization, BenchmarkTools
 

docs/src/vectorized_convenience_functions.md

Lines changed: 18 additions & 0 deletions

@@ -132,4 +132,22 @@ julia> @btime mapreduce(hypot, +, $x, $y)
 96.75538300513509
 ```
 
+## vsum
+
+Vectorized version of `sum`. `vsum(f, a)` applies `f(a[i])` for `i in eachindex(a)`, then sums the results.
+
+```julia
+julia> using LoopVectorization, BenchmarkTools
+
+julia> x = rand(127);
+
+julia> @btime vsum(hypot, $x)
+  12.095 ns (0 allocations: 0 bytes)
+66.65246070098374
+
+julia> @btime sum(hypot, $x)
+  16.992 ns (0 allocations: 0 bytes)
+66.65246070098372
+```
+
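The `vsum(f, a)` contract documented above — apply `f` elementwise, then sum — can be pinned down with a plain, unvectorized reference model. A Python sketch of the semantics only (the real `vsum` is SIMD-vectorized Julia; `vsum_ref` is a made-up name for checking behavior):

```python
def vsum_ref(f, a=None):
    """Reference semantics of vsum: vsum_ref(a) sums a,
    vsum_ref(f, a) sums f applied to each element of a."""
    if a is None:
        # Single-argument form: the "function" slot actually holds the data.
        f, a = (lambda v: v), f
    return sum(f(x) for x in a)

x = [1.0, 2.0, 3.0]
print(vsum_ref(lambda v: v * v, x))  # 14.0  (like vsum(abs2, x))
print(vsum_ref(x))                   # 6.0   (like vsum(x))
```

This mirrors the Julia definitions in the diff, where `vsum(f, A)` lowers to `vmapreduce(f, +, A)` and `vsum(A)` is `vsum(identity, A)`.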

src/LoopVectorization.jl

Lines changed: 2 additions & 0 deletions

@@ -196,6 +196,7 @@ export LowDimArray,
   vfilter,
   vfilter!,
   vmapreduce,
+  vsum,
   vreduce,
   vcount
 

@@ -245,6 +246,7 @@ loop-reordering so as to improve performance:
 - [`@turbo`](@ref): transform `for`-loops and broadcasting
 - [`vmapreduce`](@ref): vectorized version of `mapreduce`
 - [`vreduce`](@ref): vectorized version of `reduce`
+- [`vsum`](@ref): vectorized version of `sum`
 - [`vmap`](@ref) and `vmap!`: vectorized version of `map` and `map!`
 - [`vmapnt`](@ref) and `vmapnt!`: non-temporal variants of `vmap` and `vmap!`
 - [`vmapntt`](@ref) and `vmapntt!`: threaded variants of `vmapnt` and `vmapnt!`

src/parse/add_compute.jl

Lines changed: 29 additions & 1 deletion

@@ -497,7 +497,7 @@ function add_compute!(
       return add_pow!(ls, var, args[1], arg2num, elementbytes, position)
     end
   elseif instr.instr === :oftype && length(args) == 2
-    return getop(ls, args[2], elementbytes)
+    return get_arg!(ls, args[2], elementbytes, position)
   end
   vparents = Operation[]
   deps = Symbol[]

@@ -760,6 +760,34 @@ function add_compute_ifelse!(
   )
   pushop!(ls, op, LHS)
 end
+function get_arg!(
+  ls::LoopSet,
+  @nospecialize(x),
+  elementbytes::Int,
+  position::Int
+)::Operation
+  if x isa Expr
+    add_operation!(
+      ls,
+      Symbol("###xpow###$(length(operations(ls)))###"),
+      x,
+      elementbytes,
+      position
+    )::Operation
+  elseif x isa Symbol
+    if x ∈ ls.loopsymbols
+      add_loopvalue!(ls, x, elementbytes)
+    else
+      xo = get(ls.opdict, x, nothing)
+      xo === nothing && return add_constant!(ls, x, elementbytes)::Operation
+      return xo
+    end
+  elseif x isa Number
+    return add_constant!(ls, x, elementbytes)::Operation
+  else
+    throw("objects of type $x not supported as arg")
+  end
+end
 
 # adds x ^ (p::Real)
 function add_pow!(
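The control flow of `get_arg!` in the hunk above is a three-way dispatch on the argument's syntactic class: nested expressions become new operations, loop variables become loop values, known symbols resolve to existing operations, and unknown symbols or literals become constants. A schematic Python rendering of that dispatch (the classes and return tags here are stand-ins, not the real `LoopSet` API):

```python
def classify_arg(x, loop_symbols, op_dict):
    """Mirror get_arg!'s branching. Tuples stand in for Julia Expr nodes,
    strings for Symbols, and the returned tag names the handler that
    get_arg! would invoke."""
    if isinstance(x, tuple):            # Expr: build a new operation
        return ("operation", x)
    if isinstance(x, str):              # Symbol
        if x in loop_symbols:           # a loop induction variable
            return ("loopvalue", x)
        if x in op_dict:                # already-known operation
            return ("existing", op_dict[x])
        return ("constant", x)          # outer-scope constant
    if isinstance(x, (int, float)):     # numeric literal
        return ("constant", x)
    raise TypeError(f"objects of type {type(x).__name__} not supported as arg")

print(classify_arg("i", {"i", "j"}, {}))  # ('loopvalue', 'i')
print(classify_arg(2.5, set(), {}))       # ('constant', 2.5)
```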

src/simdfunctionals/mapreduce.jl

Lines changed: 11 additions & 0 deletions

@@ -1,3 +1,4 @@
+import VectorizationBase: vsum
 
 @inline vreduce(::typeof(+), v::VectorizationBase.AbstractSIMDVector) = vsum(v)
 @inline vreduce(::typeof(*), v::VectorizationBase.AbstractSIMDVector) = vprod(v)

@@ -107,6 +108,16 @@ end
 end
 @inline vmapreduce(f, op, args...) = mapreduce(f, op, args...)
 
+"""
+    vsum(A::DenseArray)
+    vsum(f, A::DenseArray)
+
+Vectorized version of `sum`. Providing a function as the first argument
+will apply the function to each element of `A` before summing.
+"""
+@inline vsum(f::F, A::AbstractArray{T}) where {F,T<:NativeTypes} = vmapreduce(f, +, A)
+@inline vsum(A::AbstractArray{T}) where {T<:NativeTypes} = vsum(identity, A)
+
 length_one_axis(::Base.OneTo) = Base.OneTo(1)
 length_one_axis(::Any) = 1:1
 
src/vectorizationbase_compat/contract_pass.jl

Lines changed: 33 additions & 36 deletions

@@ -1,46 +1,42 @@
 
-mulexprcost(::Number) = 0
-mulexprcost(::Symbol) = 1
-function mulexprcost(ex::Expr)
-  base = ex.head === :call ? 10 : 1
-  base + length(ex.args)
+const ProdArg = Union{Symbol,Expr,Number}
+function mulexprcost(@nospecialize(x::ProdArg))::Int
+  if x isa Number
+    return 0
+  elseif x isa Symbol
+    return 1
+  else
+    ex = x::Expr
+    base = ex.head === :call ? 10 : 1
+    return base + length(ex.args)
+  end
 end
-function mul_fast_expr(args)
+function mul_fast_expr(
+  args::SubArray{Any,1,Vector{Any},Tuple{UnitRange{Int}},true}
+)::Expr
   b = Expr(:call, :mul_fast)
   for i ∈ 2:length(args)
     push!(b.args, args[i])
   end
   b
 end
-function mulexpr(mulexargs)
-  a = (mulexargs[1])::Union{Symbol,Expr,Number}
-  if length(mulexargs) == 2
-    return (a, mulexargs[2]::Union{Symbol,Expr,Number})
-  elseif length(mulexargs) == 3
-    # We'll calc the product between the guesstimated cheaper two args first, for better out of order execution
-    b = (mulexargs[2])::Union{Symbol,Expr,Number}
-    c = (mulexargs[3])::Union{Symbol,Expr,Number}
-    ac = mulexprcost(a)
-    bc = mulexprcost(b)
-    cc = mulexprcost(c)
-    maxc = max(ac, bc, cc)
-    if ac == maxc
-      return (a, Expr(:call, :mul_fast, b, c))
-    elseif bc == maxc
-      return (b, Expr(:call, :mul_fast, a, c))
-    else
-      return (c, Expr(:call, :mul_fast, a, b))
-    end
-  else
-    return (a, mul_fast_expr(mulexargs))
-  end
-  a = (mulexargs[1])::Union{Symbol,Expr,Number}
-  b = if length(mulexargs) == 2 # two arg mul
-    (mulexargs[2])::Union{Symbol,Expr,Number}
-  else
-    mul_fast_expr(mulexargs)
-  end
-  a, b
+function mulexpr(
+  mulexargs::SubArray{Any,1,Vector{Any},Tuple{UnitRange{Int}},true}
+)::Tuple{ProdArg,ProdArg}
+  a = (mulexargs[1])::ProdArg
+  Nexpr = length(mulexargs)
+  Nexpr == 2 && return (a, mulexargs[2]::ProdArg)
+  Nexpr != 3 && return (a, mul_fast_expr(mulexargs))
+  # We'll calc the product between the guesstimated cheaper two args first, for better out of order execution
+  b = (mulexargs[2])::ProdArg
+  c = (mulexargs[3])::ProdArg
+  ac = mulexprcost(a)
+  bc = mulexprcost(b)
+  cc = mulexprcost(c)
+  maxc = max(ac, bc, cc)
+  ac == maxc && return (a, Expr(:call, :mul_fast, b, c))
+  bc == maxc && return (b, Expr(:call, :mul_fast, c, a))
+  return (c, Expr(:call, :mul_fast, a, b))
 end
 function append_args_skip!(call, args, i, mod)
   for j ∈ eachindex(args)

@@ -222,7 +218,8 @@ function capture_a_muladd(ex::Expr, mod)
   end
   true, call
 end
-capture_muladd(ex::Expr, mod) = while true
+capture_muladd(ex::Expr, mod) =
+  while true
   ex.head === :ref && return ex
   if Meta.isexpr(ex, :call, 2)
     if (ex.args[1] === :(-))
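The refactored `mulexpr` keeps the same heuristic as before: estimate each factor's cost, multiply the two cheapest factors first, and defer the most expensive one to the final multiply so it can be evaluated in parallel with the cheap product. A small Python sketch of that selection, using the same 0 / 1 / `10 + nargs` weights as `mulexprcost` (tuples stand in for call expressions; names are illustrative):

```python
def mul_cost(x):
    """0 for numeric literals, 1 for symbols, 10 + number of call args
    for call expressions, written here as ('call', fname, *args)."""
    if isinstance(x, (int, float)):
        return 0
    if isinstance(x, str):
        return 1
    base = 10 if x[0] == "call" else 1
    return base + (len(x) - 1)  # len(x) - 1 mimics length(ex.args)

def split_mul(a, b, c):
    """Return (deferred, (first, second)): compute first*second eagerly,
    deferring the costliest factor, like mulexpr's three-argument case."""
    costs = [mul_cost(a), mul_cost(b), mul_cost(c)]
    maxc = max(costs)
    if costs[0] == maxc:
        return a, (b, c)
    if costs[1] == maxc:
        return b, (c, a)
    return c, (a, b)

# A call expression is costliest, so the two cheap factors pair up first:
print(split_mul(("call", "f", "x"), "y", 2))  # (('call', 'f', 'x'), ('y', 2))
```

The payoff is instruction-level parallelism: while the expensive factor is still being computed, the cheap product is already available, shortening the critical path of the fused multiply.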

test/mapreduce.jl

Lines changed: 3 additions & 0 deletions

@@ -60,6 +60,9 @@
 end
 @test vmapreduce(log, +, x) ≈ sum(log, x)
 @test vmapreduce(abs2, +, x) ≈ sum(abs2, x)
+@test vsum(log, x) ≈ sum(log, x)
+@test vsum(abs2, x) ≈ sum(abs2, x)
+@test vsum(x) ≈ sum(x)
 @test maximum(x) == vreduce(max, x) == maximum_avx(x)
 @test minimum(x) == vreduce(min, x) == minimum_avx(x)
 

0 commit comments
