Commit c070363
Update URLs to JuliaSIMD. Upgrade to ThreadingUtilities 0.4. Add threading tests. Fix bug that occurred when there were two unrolled loops and an outer reduction that was directly unrolled by one and depended on an operation unrolled by the other.
1 parent 49d5ae9 commit c070363
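The fix targets loop nests like the following (a hypothetical sketch, not code from this commit or its new tests): an outer reduction that `@avx` may unroll along one loop while the operation feeding it is unrolled along the other.

```julia
using LoopVectorization

# Hypothetical reproducer sketch (assumed shape; the commit's actual tests may differ).
# `s` is an outer reduction carried across both loops. When `@avx` unrolls both `m`
# and `n`, `s` can be unrolled by one factor while `A[m, n] * x[n]` is unrolled by
# the other, which previously produced mismatched `VecUnroll` lengths.
function sum_of_products(A, x)
    s = zero(eltype(A))
    @avx for m ∈ axes(A, 1), n ∈ axes(A, 2)
        s += A[m, n] * x[n]
    end
    s
end
```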

File tree

15 files changed: +280 −121 lines

Project.toml

Lines changed: 3 additions & 3 deletions

```diff
@@ -19,16 +19,16 @@ VectorizationBase = "3d5dd08c-fd9d-11e8-17fa-ed2836048c2f"
 
 [compat]
 ArrayInterface = "3"
-CheapThreads = "0.1.1"
+CheapThreads = "0.1.2"
 DocStringExtensions = "0.8"
 IfElse = "0.1"
 OffsetArrays = "1.4.1, 1.5"
 Requires = "1"
 SLEEFPirates = "0.6.12"
 Static = "0.2"
-ThreadingUtilities = "0.3"
+ThreadingUtilities = "0.4"
 UnPack = "1"
-VectorizationBase = "0.19.6"
+VectorizationBase = "0.19.8"
 julia = "1.5"
 
 [extras]
```

README.md

Lines changed: 7 additions & 7 deletions

```diff
@@ -1,10 +1,10 @@
 # LoopVectorization
 
-[![Stable](https://img.shields.io/badge/docs-stable-blue.svg)](https://chriselrod.github.io/LoopVectorization.jl/stable)
-[![Latest](https://img.shields.io/badge/docs-latest-blue.svg)](https://chriselrod.github.io/LoopVectorization.jl/latest)
-[![CI](https://github.com/chriselrod/LoopVectorization.jl/workflows/CI/badge.svg)](https://github.com/chriselrod/LoopVectorization.jl/actions?query=workflow%3ACI)
-[![CI (Julia nightly)](https://github.com/chriselrod/LoopVectorization.jl/workflows/CI%20(Julia%20nightly)/badge.svg)](https://github.com/chriselrod/LoopVectorization.jl/actions?query=workflow%3A%22CI+%28Julia+nightly%29%22)
-[![Codecov](https://codecov.io/gh/chriselrod/LoopVectorization.jl/branch/master/graph/badge.svg)](https://codecov.io/gh/chriselrod/LoopVectorization.jl)
+[![Stable](https://img.shields.io/badge/docs-stable-blue.svg)](https://JuliaSIMD.github.io/LoopVectorization.jl/stable)
+[![Latest](https://img.shields.io/badge/docs-latest-blue.svg)](https://JuliaSIMD.github.io/LoopVectorization.jl/latest)
+[![CI](https://github.com/JuliaSIMD/LoopVectorization.jl/workflows/CI/badge.svg)](https://github.com/JuliaSIMD/LoopVectorization.jl/actions?query=workflow%3ACI)
+[![CI (Julia nightly)](https://github.com/JuliaSIMD/LoopVectorization.jl/workflows/CI%20(Julia%20nightly)/badge.svg)](https://github.com/JuliaSIMD/LoopVectorization.jl/actions?query=workflow%3A%22CI+%28Julia+nightly%29%22)
+[![Codecov](https://codecov.io/gh/JuliaSIMD/LoopVectorization.jl/branch/master/graph/badge.svg)](https://codecov.io/gh/JuliaSIMD/LoopVectorization.jl)
 
 ## Installation
 
@@ -39,7 +39,7 @@ Please see the documentation for benchmarks versus base Julia, Clang, icc, ifort
 
 LLVM/Julia by default generate essentially optimal code for a primary vectorized part of this loop. In many cases -- such as the dot product -- this vectorized part of the loop computes 4*SIMD-vector-width iterations at a time.
 On the CPU I'm running these benchmarks on with `Float64` data, the SIMD-vector-width is 8, meaning it will compute 32 iterations at a time.
-However, LLVM is very slow at handling the tails, `length(iterations) % 32`. For this reason, [in benchmark plots](https://chriselrod.github.io/LoopVectorization.jl/latest/examples/dot_product/) you can see performance drop as the size of the remainder increases.
+However, LLVM is very slow at handling the tails, `length(iterations) % 32`. For this reason, [in benchmark plots](https://JuliaSIMD.github.io/LoopVectorization.jl/latest/examples/dot_product/) you can see performance drop as the size of the remainder increases.
 
 For simple loops like a dot product, LoopVectorization.jl's most important optimization is to handle these tails more efficiently:
 <details>
@@ -346,7 +346,7 @@ Similar approaches can be taken to make kernels working with a variety of numeri
 * [Gaius.jl](https://github.com/MasonProtter/Gaius.jl)
 * [MaBLAS.jl](https://github.com/YingboMa/MaBLAS.jl)
 * [Octavian.jl](https://github.com/JuliaLinearAlgebra/Octavian.jl)
-* [PaddedMatrices.jl](https://github.com/chriselrod/PaddedMatrices.jl)
+* [PaddedMatrices.jl](https://github.com/JuliaSIMD/PaddedMatrices.jl)
 * [RecursiveFactorization.jl](https://github.com/YingboMa/RecursiveFactorization.jl)
 * [SnpArrays.jl](https://github.com/OpenMendel/SnpArrays.jl)
 * [Tullio.jl](https://github.com/mcabbott/Tullio.jl)
```
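For reference, the dot-product kernel discussed in the README's tail-handling paragraph looks roughly like this (a sketch in the README's style; `mydot` is not a name taken from it):

```julia
using LoopVectorization

# Sketch of a dot-product kernel like those the README benchmarks. For the
# `length(a) % 32` tail, `@avx` emits masked SIMD operations rather than
# falling back to a slow scalar remainder loop, which is the optimization
# the paragraph above describes.
function mydot(a, b)
    s = zero(eltype(a))
    @avx for i ∈ eachindex(a, b)
        s += a[i] * b[i]
    end
    s
end
```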

docs/make.jl

Lines changed: 2 additions & 2 deletions

```diff
@@ -27,12 +27,12 @@ makedocs(;
             "devdocs/reference.md"
         ]
     ],
-    # repo="https://github.com/chriselrod/LoopVectorization.jl/blob/{commit}{path}#L{line}",
+    # repo="https://github.com/JuliaSIMD/LoopVectorization.jl/blob/{commit}{path}#L{line}",
     sitename="LoopVectorization.jl",
     authors="Chris Elrod"
     # assets=[],
 )
 
 deploydocs(;
-    repo="github.com/chriselrod/LoopVectorization.jl",
+    repo="github.com/JuliaSIMD/LoopVectorization.jl",
 )
```

docs/src/devdocs/constructing_loopsets.md

Lines changed: 3 additions & 3 deletions

````diff
@@ -2,7 +2,7 @@
 
 ## Loop expressions
 
-When applying `@avx` to a loop expression, it creates a `LoopSet` without awareness of type information, and then [condenses the information](https://github.com/chriselrod/LoopVectorization.jl/blob/master/src/condense_loopset.jl) into a summary which is passed as type information to a generated function.
+When applying `@avx` to a loop expression, it creates a `LoopSet` without awareness of type information, and then [condenses the information](https://github.com/JuliaSIMD/LoopVectorization.jl/blob/master/src/condense_loopset.jl) into a summary which is passed as type information to a generated function.
 ```julia
 julia> @macroexpand @avx for m ∈ 1:M, n ∈ 1:N
         C[m,n] = zero(eltype(B))
@@ -19,7 +19,7 @@ quote
 end
 end
 ```
-When the corresponding method gets compiled for specific types of `A`, `B`, and `C`, the call to the `@generated` function `_avx_!` gets compiled. This causes the summary to be [reconstructed](https://github.com/chriselrod/LoopVectorization.jl/blob/master/src/reconstruct_loopset.jl) using the available type information. This type information can be used, for example, to realize an array has been transposed, and thus correctly identify which axis contains contiguous elements that are efficient to load from. This kind of information cannot be extracted from the raw expression, which is why these decisions are made when the method gets compiled for specific types via the `@generated` function `_avx_!`.
+When the corresponding method gets compiled for specific types of `A`, `B`, and `C`, the call to the `@generated` function `_avx_!` gets compiled. This causes the summary to be [reconstructed](https://github.com/JuliaSIMD/LoopVectorization.jl/blob/master/src/reconstruct_loopset.jl) using the available type information. This type information can be used, for example, to realize an array has been transposed, and thus correctly identify which axis contains contiguous elements that are efficient to load from. This kind of information cannot be extracted from the raw expression, which is why these decisions are made when the method gets compiled for specific types via the `@generated` function `_avx_!`.
 
 The three chief components of the summaries are the definitions of operations, e.g.:
 ```julia
@@ -58,4 +58,4 @@ quote
 var"##266" = LoopVectorization.vmaterialize(var"##265", Val{:Main}())
 end
 ```
-These nested broadcasted objects already express information very similar to what the `LoopSet` objects hold. The dimensionality of the objects provides the information on the associated loop dependencies, but again this information is available only when the method is compiled for specific types. The `@generated` function `vmaterialize` constructs the LoopSet by recursively evaluating [add_broadcast!](https://github.com/chriselrod/LoopVectorization.jl/blob/master/src/broadcast.jl#L166) on all the fields.
+These nested broadcasted objects already express information very similar to what the `LoopSet` objects hold. The dimensionality of the objects provides the information on the associated loop dependencies, but again this information is available only when the method is compiled for specific types. The `@generated` function `vmaterialize` constructs the LoopSet by recursively evaluating [add_broadcast!](https://github.com/JuliaSIMD/LoopVectorization.jl/blob/master/src/broadcast.jl#L166) on all the fields.
````
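As a usage-level illustration of the broadcast path this file documents (a hedged sketch; the `var"##..."` names in the expansion differ per session):

```julia
using LoopVectorization

# Applying `@avx` to a broadcast expression routes materialization through the
# `@generated` function `vmaterialize`, which rebuilds the `LoopSet` from the
# nested `Broadcasted` object's types at compile time.
a = rand(100); b = rand(100);
c = @avx a .+ exp.(b)  # lowered to a vmaterialize call, per the docs above
```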

docs/src/devdocs/evaluating_loops.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -1,6 +1,6 @@
 # Determining the strategy for evaluating loops
 
-The heart of the optimizations performed by LoopVectorization is given in the [determinestrategy.jl](https://github.com/chriselrod/LoopVectorization.jl/blob/master/src/determinestrategy.jl) file, utilizing instruction costs specified in [costs.jl](https://github.com/chriselrod/LoopVectorization.jl/blob/master/src/costs.jl).
+The heart of the optimizations performed by LoopVectorization is given in the [determinestrategy.jl](https://github.com/JuliaSIMD/LoopVectorization.jl/blob/master/src/determinestrategy.jl) file, utilizing instruction costs specified in [costs.jl](https://github.com/JuliaSIMD/LoopVectorization.jl/blob/master/src/costs.jl).
 Essentially, it estimates the cost of different means of evaluating the loops. It iterates through the different possible loop orders, as well as considering which loops to unroll, and which to vectorize. It will consider unrolling 1 or 2 loops (but it could settle on unrolling by a factor of 1, i.e. not unrolling), and vectorizing 1.
 
 The cost estimate is based on the costs of individual instructions and the number of times each one needs to be executed for the given strategy. The instruction cost can be broken into several components:
```
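A toy illustration of that search, not LoopVectorization's actual API or cost model: enumerate loop orders and unroll factors, score each with a supplied cost estimate, and keep the cheapest.

```julia
using Combinatorics: permutations  # real package, used only for this toy sketch

# Hypothetical cost function (not LoopVectorization internals): favors orders
# with :n innermost, and amortizes loop overhead across larger unroll factors.
estimate_cost(order, unroll) = (last(order) === :n ? 1.0 : 2.0) * (4 + 1 / unroll)

# Exhaustive toy search over loop orders and a few unroll factors.
function choose_strategy(loops, estimate_cost)
    best = (order = loops, unroll = 1, cost = Inf)
    for order in permutations(loops), unroll in (1, 2, 4, 8)
        c = estimate_cost(order, unroll)
        c < best.cost && (best = (order = order, unroll = unroll, cost = c))
    end
    best
end

choose_strategy([:m, :n, :k], estimate_cost)  # picks :n innermost, unroll 8
```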

docs/src/devdocs/overview.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -2,7 +2,7 @@
 
 Here I will try to explain how the library works for the curious or any would-be contributors.
 
-The library uses a [LoopSet](https://github.com/chriselrod/LoopVectorization.jl/blob/master/src/graphs.jl#L146) object to model loops. The key components of the library can be divided into:
+The library uses a [LoopSet](https://github.com/JuliaSIMD/LoopVectorization.jl/blob/master/src/graphs.jl#L146) object to model loops. The key components of the library can be divided into:
 1. Defining the LoopSet objects.
 2. Constructing the LoopSet objects.
 3. Determining the strategy of how to evaluate loops.
```

docs/src/examples/matrix_multiplication.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -19,7 +19,7 @@ and this can handle all transposed/not-transposed permutations. LoopVectorization
 Letting all three matrices be square and `Size` x `Size`, we attain the following benchmark results:
 
 ![AmulB](../assets/bench_AmulB_v2.png)
-This is classic GEMM, `𝐂 = 𝐀 * 𝐁`. GFortran's intrinsic `matmul` function does fairly well. But all the compilers are well behind LoopVectorization here, which falls behind MKL's `gemm` beyond 70x70 or so. The problem imposed by alignment is also striking: performance is much higher when the sizes are integer multiples of 8. Padding arrays so that each column is aligned regardless of the number of rows can thus be very profitable. [PaddedMatrices.jl](https://github.com/chriselrod/PaddedMatrices.jl) offers just such arrays in Julia. I believe that is also what the [-pad](https://software.intel.com/en-us/fortran-compiler-developer-guide-and-reference-pad-qpad) compiler flag does when using Intel's compilers.
+This is classic GEMM, `𝐂 = 𝐀 * 𝐁`. GFortran's intrinsic `matmul` function does fairly well. But all the compilers are well behind LoopVectorization here, which falls behind MKL's `gemm` beyond 70x70 or so. The problem imposed by alignment is also striking: performance is much higher when the sizes are integer multiples of 8. Padding arrays so that each column is aligned regardless of the number of rows can thus be very profitable. [PaddedMatrices.jl](https://github.com/JuliaSIMD/PaddedMatrices.jl) offers just such arrays in Julia. I believe that is also what the [-pad](https://software.intel.com/en-us/fortran-compiler-developer-guide-and-reference-pad-qpad) compiler flag does when using Intel's compilers.
 
 ![AmulBt](../assets/bench_AmulBt_v2.png)
 The optimal pattern for `𝐂 = 𝐀 * 𝐁ᵀ` is almost identical to that for `𝐂 = 𝐀 * 𝐁`. Yet, gfortran's `matmul` intrinsic stumbles, surprisingly doing much worse than gfortran + loops, and almost certainly worse than allocating memory for `𝐁ᵀ` and creating the explicit copy.
```
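The kernel behind these GEMM benchmarks looks roughly like the following (a sketch in the docs' style; the benchmark suite's exact code may differ):

```julia
using LoopVectorization

# Triple-loop GEMM; `@avx` picks the loop order, unroll factors, and
# vectorized loop from its cost model instead of leaving that to LLVM.
function A_mul_B!(C, A, B)
    @avx for m ∈ axes(A, 1), n ∈ axes(B, 2)
        Cmn = zero(eltype(C))
        for k ∈ axes(A, 2)
            Cmn += A[m, k] * B[k, n]
        end
        C[m, n] = Cmn
    end
    C
end
```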

src/codegen/lower_compute.jl

Lines changed: 63 additions & 18 deletions

```diff
@@ -183,12 +183,27 @@ end
 1+1
   ifelselastexpr(false, M, (V1,V2), 2, S, true)
 end
+@generated function subset_vec_unroll(vu::VecUnroll{N}, ::StaticInt{S}) where {N,S}
+  (1 ≤ S ≤ N + 1) || throw(ArgumentError("`vu` isa `VecUnroll` of `$(N+1)` elements, but trying to subset $S of them."))
+  t = Expr(:tuple)
+  gf = GlobalRef(Core,:getfield)
+  S == 1 && return Expr(:block, Expr(:meta,:inline), :($gf($gf(vu,1),1,false)))
+  for s ∈ 1:S
+    push!(t.args, Expr(:call, gf, :vud, s, false))
+  end
+  quote
+    $(Expr(:meta,:inline))
+    vud = $gf(vu, 1)
+    VecUnroll($t)
+  end
+end
 # `S` is the ind to replace with the return value of previous invocation ("S" for "self") if reducing
 @generated function partialmap(f::F, default::D, ::StaticInt{M}, ::StaticInt{S}, vargs::Vararg{Any,K}) where {F,M,K,D,S}
   lengths = Vector{Int}(undef, K);
   q = Expr(:block, Expr(:meta,:inline))
   syms = Vector{Symbol}(undef, K)
   isnotpartial = true
+  gf = GlobalRef(Core, :getfield)
   for k ∈ 1:K
     l = vecunrolllen(vargs[k])
     # if l
@@ -197,27 +212,27 @@ end
     lengths[k] = l
     @assert (l == -1) || (l ≥ M)
     syms[k] = symk = Symbol(:vargs_,k)
-    extractq = :(getfield(vargs, $k, false))
+    extractq = :($gf(vargs, $k, false))
     if l != -1
       extractq = :(data($extractq))
     end
     push!(q.args, :($symk = $extractq))
   end
-  if isnotpartial
+  Dlen = vecunrolllen(D)
+  N = maximum(lengths)
+  Sreduced = (S > 0) && (lengths[S] == -1) && N != -1
+  if isnotpartial & (Sreduced | (Dlen == N))
     q = Expr(:call, :f)
     for k ∈ 1:K
-      push!(q.args, :(getfield(vargs, $k, false)))
+      push!(q.args, :($gf(vargs, $k, false)))
     end
     return Expr(:block, Expr(:meta, :inline), q)
   end
-  N = maximum(lengths)
-  Dlen = vecunrolllen(D)
-  Sreduced = (S > 0) && (lengths[S] == -1) && N != -1
   if Sreduced
     M = N
     t = q
   else
-    @assert (N == Dlen)
+    @assert (N ≤ Dlen)
     if Dlen == -1
       @assert (M == 1)
     else
@@ -231,10 +246,11 @@ end
     if lengths[k] == -1
       push!(call.args, syms[k])
     else
-      push!(call.args, Expr(:call, :getfield, syms[k], m, false))
+      push!(call.args, Expr(:call, gf, syms[k], m, false))
     end
   end
-  if N == -1
+  # minimal change in behavior to fix case when !Sreduced by N -> Dlen; TODO: what should Dlen be here?
+  if Sreduced ? (N == -1) : (Dlen == -1)
    push!(q.args, call)
    return q
   end
@@ -245,8 +261,8 @@
     end
   end
   Sreduced && return q
-  for m ∈ M+1:N
-    push!(t.args, :(getfield(dd, $m, false)))
+  for m ∈ M+1:max(N,Dlen)
+    push!(t.args, :($gf(dd, $m, false)))
   end
   push!(q.args, :(VecUnroll($t)))
   q
@@ -257,6 +273,7 @@ function parent_op_name(
 )
   opp = parents_op[n]
   parent = mangledvar(opp)
+  u = 0
   if n == tiledouterreduction
     parent = Symbol(parent, modsuffix)
   else
@@ -275,7 +292,15 @@
   if opisvectorized && isload(opp) && (!isvectorized(opp))
     parent = Symbol(parent, "##broadcasted##")
   end
-  parent
+  parent, u
+end
+function getuouterreduct(ls::LoopSet, op::Operation, suffix)
+  us = ls.unrollspecification[]
+  if us.vloopnum === us.u₁loopnum # unroll u₂
+    suffix
+  else # unroll u₁
+    us.u₁
+  end
 end
 
 function getu₁full(ls::LoopSet, u₁::Int)
@@ -324,6 +349,7 @@ function lower_compute!(
   opunrolled = u₁unrolledsym || isu₁unrolled(op)
   # parent_names, parents_u₁syms, parents_u₂syms = parent_unroll_status(op, u₁loop, u₂loop, suffix)
   parents_u₁syms, parents_u₂syms = parent_unroll_status(op, u₁loopsym, u₂loopsym, vloopsym, u₂max)
+  # tiledouterreduction = if num_loops(ls) == 1# (suffix == -1)# || (vloopsym === u₂loopsym)
   tiledouterreduction = if (suffix == -1)# || (vloopsym === u₂loopsym)
     suffix_ = Symbol("")
     -1
@@ -367,6 +393,9 @@
   # parentsyms = [opp.variable for opp ∈ parents(op)]
   Uiter = opunrolled ? u₁ - 1 : 0
   isreduct = isreduction(op)
+  # if isreduct
+  #   @show u₁unrolledsym, u₂unrolledsym, isu₁unrolled(op), isu₂unrolled(op) op
+  # end
   if Base.libllvm_version < v"11.0.0" && (suffix ≠ -1) && isreduct# && (iszero(suffix) || (ls.unrollspecification[].u₂ - 1 == suffix))
     # if (length(reduceddependencies(op)) > 0) | (length(reducedchildren(op)) > 0)# && (iszero(suffix) || (ls.unrollspecification[].u₂ - 1 == suffix))
     # instrfid = findfirst(isequal(instr.instr), (:vfmadd, :vfnmadd, :vfmsub, :vfnmsub))
@@ -408,8 +437,8 @@
     # modsuffix = ls.unrollspecification[].u₁#getu₁full(ls, u₁)#u₁
     # Symbol(mangledvar(op), '_', modsuffix)
     # else
-    modsuffix = suffix % tiled_outerreduct_unroll(ls)
-    Symbol(mangledvar(op), modsuffix)
+    modsuffix = 0#suffix % tiled_outerreduct_unroll(ls)
+    Symbol(mangledvar(op), modsuffix)
     # end
     # dopartialmap = u₁ > 1
 
@@ -421,7 +450,8 @@
     # isouterreduct = true
     isouterreduct = isanouterreduction(ls, op)
     u₁reduct = isouterreduct ? getu₁full(ls, u₁) : getu₁forreduct(ls, op, u₁)
-    dopartialmap = u₁reduct > u₁
+    # @show isouterreduct, u₁reduct, op
+    dopartialmap = u₁reduct ≠ u₁
     Symbol(mvar, '_', u₁reduct)
   else
     Symbol(mvar, '_', u₁)
@@ -441,7 +471,10 @@
     if ((isvectorized(first(parents_op)) && !isvectorized(op)) && !dependent_outer_reducts(ls, op)) ||
       (parents_u₁syms[n] != u₁unrolledsym) || (parents_u₂syms[n] != u₂unrolledsym)
 
-      selfopname = parent_op_name(ls, parents_op, n, modsuffix, suffix_, parents_u₁syms, parents_u₂syms, u₁, opisvectorized, tiledouterreduction)
+      selfopname, uₚ = parent_op_name(ls, parents_op, n, modsuffix, suffix_, parents_u₁syms, parents_u₂syms, u₁, opisvectorized, tiledouterreduction)
+      # if (uₚ ≠ 0) & (uₚ ≠ u₁)
+      #   dopartialmap = true
+      # end
       push!(instrcall.args, selfopname)
     else
      push!(instrcall.args, varsym)
@@ -450,8 +483,20 @@
       # this checks if the parent is u₂ unrolled but this operation is not, in which case we need to reduce it.
       push!(instrcall.args, reduce_expr_u₂(mangledvar(opp), instruction(opp), ureduct(ls)))
     else
-      parent = parent_op_name(ls, parents_op, n, modsuffix, suffix_, parents_u₁syms, parents_u₂syms, u₁, opisvectorized, tiledouterreduction)
-      push!(instrcall.args, parent)
+      parent, uₚ = parent_op_name(ls, parents_op, n, modsuffix, suffix_, parents_u₁syms, parents_u₂syms, u₁, opisvectorized, tiledouterreduction)
+      # if name(op) === Symbol("##op#9536")
+      #   @show parent
+      # end
+      if (selfdep == 0) && search_tree(parents(opp), name(op))
+        selfdep = n
+        push!(instrcall.args, parent)
+      elseif (uₚ ≠ 0) & (uₚ > u₁)
+        push!(instrcall.args, :(subset_vec_unroll($parent, StaticInt{$u₁}())))
+      else
+        push!(instrcall.args, parent)
+      end
+
+      # @show uₚ, u₁, op
     end
   end
   selfdepreduce = ifelse(((!u₁unrolledsym) & isu₁unrolled(op)) & (u₁ > 1), selfdep, 0)
```
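To see what the new `subset_vec_unroll` helper does at runtime, here is a hedged illustration (it is an internal function, not public API, so the qualified name and this usage are assumptions):

```julia
using LoopVectorization
using VectorizationBase: Vec, VecUnroll
using Static: StaticInt

# A VecUnroll carrying 4 SIMD vectors, as a parent op unrolled by 4 might produce:
vu = VecUnroll((Vec(1.0, 2.0), Vec(3.0, 4.0), Vec(5.0, 6.0), Vec(7.0, 8.0)))

# Keep only the first 2 vectors, matching a consumer unrolled by u₁ = 2; this is
# how a parent unrolled by a larger factor can feed an op unrolled by a smaller one.
LoopVectorization.subset_vec_unroll(vu, StaticInt(2))  # VecUnroll of the first 2 Vecs
```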
