
Commit adc6d66

Merge pull request #205 from JuliaSIMD/vecunroll
Use VecUnroll, add threading support
2 parents d3de8dd + e64841b commit adc6d66

81 files changed: +6256 additions, −2898 deletions


.github/workflows/ci-julia-nightly.yml

Lines changed: 10 additions & 0 deletions
@@ -3,9 +3,19 @@ on:
   pull_request:
     branches:
       - master
+    paths-ignore:
+      - 'LICENSE.md'
+      - 'README.md'
+      - 'utils/*'
+      - '.github/workflows/TagBot.yml'
   push:
     branches:
       - master
+    paths-ignore:
+      - 'LICENSE.md'
+      - 'README.md'
+      - 'utils/*'
+      - '.github/workflows/TagBot.yml'
     tags: '*'
 jobs:
   test-julia-nightly:

.github/workflows/ci.yml

Lines changed: 2 additions & 0 deletions
@@ -6,13 +6,15 @@ on:
     paths-ignore:
       - 'LICENSE.md'
       - 'README.md'
+      - 'utils/*'
       - '.github/workflows/TagBot.yml'
   push:
     branches:
       - master
     paths-ignore:
       - 'LICENSE.md'
       - 'README.md'
+      - 'utils/*'
       - '.github/workflows/TagBot.yml'
     tags: '*'
 jobs:

.gitignore

Lines changed: 6 additions & 0 deletions
@@ -12,3 +12,9 @@
 *.s
 *#
 *.jld2
+Manifest.toml
+test/Manifest.toml
+test/*#*
+*#*
+
+

Project.toml

Lines changed: 9 additions & 5 deletions
@@ -1,30 +1,34 @@
 name = "LoopVectorization"
 uuid = "bdcacae8-1622-11e9-2a5c-532679323890"
 authors = ["Chris Elrod <[email protected]>"]
-version = "0.11.2"
+version = "0.12.0"

 [deps]
 ArrayInterface = "4fba245c-0d91-5ea0-9b3e-6abc04ee57a9"
+CheapThreads = "b630d9fa-e28e-4980-896d-83ce5e2106b2"
 DocStringExtensions = "ffbed154-4ef7-542d-bbb7-c09d3a79fcae"
 IfElse = "615f187c-cbe4-4ef1-ba3b-2fcf58d6d173"
 LinearAlgebra = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"
 OffsetArrays = "6fe1bfb0-de20-5000-8ca7-80f57d26f881"
 Requires = "ae029012-a4dd-5104-9daa-d747884805df"
 SLEEFPirates = "476501e8-09a2-5ece-8869-fb82de89a1fa"
+Static = "aedffcd0-7271-4cad-89d0-dc628f76c6d3"
 ThreadingUtilities = "8290d209-cae3-49c0-8002-c8c24d57dab5"
 UnPack = "3a884ed6-31ef-47d7-9d2a-63182c4928ed"
 VectorizationBase = "3d5dd08c-fd9d-11e8-17fa-ed2836048c2f"

 [compat]
-ArrayInterface = "3"
+ArrayInterface = "3.1.4"
+CheapThreads = "0.1.2"
 DocStringExtensions = "0.8"
 IfElse = "0.1"
 OffsetArrays = "1.4.1, 1.5"
 Requires = "1"
-SLEEFPirates = "0.6.7"
-ThreadingUtilities = "0.2.3"
+SLEEFPirates = "0.6.12"
+Static = "0.2"
+ThreadingUtilities = "0.4"
 UnPack = "1"
-VectorizationBase = "0.18.1,0.19"
+VectorizationBase = "0.19.8"
 julia = "1.5"

 [extras]
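
Note on the dependency changes: the new CheapThreads dependency, together with the bumped ThreadingUtilities, backs the commit's headline feature, a threaded variant of the macro, which appears as `@avxt` in the benchmark changes below. A minimal sketch of how it is used (the function name `sumavxt` is illustrative, not part of the package):

using LoopVectorization  # v0.12; pulls in CheapThreads & ThreadingUtilities

# @avxt is the threaded counterpart of @avx: the same loop, with the
# reduction additionally parallelized across Julia threads.
function sumavxt(x)
    s = zero(eltype(x))
    @avxt for i ∈ eachindex(x)
        s += x[i]
    end
    s
end

sumavxt(rand(10^6))  # ≈ sum(x)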

README.md

Lines changed: 7 additions & 8 deletions
@@ -1,10 +1,10 @@
 # LoopVectorization

-[![Stable](https://img.shields.io/badge/docs-stable-blue.svg)](https://chriselrod.github.io/LoopVectorization.jl/stable)
-[![Latest](https://img.shields.io/badge/docs-latest-blue.svg)](https://chriselrod.github.io/LoopVectorization.jl/latest)
-[![CI](https://github.com/chriselrod/LoopVectorization.jl/workflows/CI/badge.svg)](https://github.com/chriselrod/LoopVectorization.jl/actions?query=workflow%3ACI)
-[![CI (Julia nightly)](https://github.com/chriselrod/LoopVectorization.jl/workflows/CI%20(Julia%20nightly)/badge.svg)](https://github.com/chriselrod/LoopVectorization.jl/actions?query=workflow%3A%22CI+%28Julia+nightly%29%22)
-[![Codecov](https://codecov.io/gh/chriselrod/LoopVectorization.jl/branch/master/graph/badge.svg)](https://codecov.io/gh/chriselrod/LoopVectorization.jl)
+[![Stable](https://img.shields.io/badge/docs-stable-blue.svg)](https://JuliaSIMD.github.io/LoopVectorization.jl/stable)
+[![Latest](https://img.shields.io/badge/docs-latest-blue.svg)](https://JuliaSIMD.github.io/LoopVectorization.jl/latest)
+[![CI](https://github.com/JuliaSIMD/LoopVectorization.jl/workflows/CI/badge.svg)](https://github.com/JuliaSIMD/LoopVectorization.jl/actions?query=workflow%3ACI)
+[![CI (Julia nightly)](https://github.com/JuliaSIMD/LoopVectorization.jl/workflows/CI%20(Julia%20nightly)/badge.svg)](https://github.com/JuliaSIMD/LoopVectorization.jl/actions?query=workflow%3A%22CI+%28Julia+nightly%29%22)
+[![Codecov](https://codecov.io/gh/JuliaSIMD/LoopVectorization.jl/branch/master/graph/badge.svg)](https://codecov.io/gh/JuliaSIMD/LoopVectorization.jl)

 ## Installation

@@ -21,7 +21,6 @@ We expect that any time you use the `@avx` macro with a given block of code that
 1. Are not indexing an array out of bounds. `@avx` does not perform any bounds checking.
 2. Are not iterating over an empty collection. Iterating over an empty loop such as `for i ∈ eachindex(Float64[])` is undefined behavior, and will likely result in out of bounds memory accesses. Ensure that loops behave correctly.
 3. Are not relying on a specific execution order. `@avx` can and will re-order operations and loops inside its scope, so correctness cannot depend on a particular order. You cannot implement `cumsum` with `@avx`.
-4. Loops increment by 1 on each iteration, e.g. `1:2:N` is not supported at the moment. (This requirement will eventually be lifted.)

 ## Usage

@@ -39,7 +38,7 @@ Please see the documentation for benchmarks versus base Julia, Clang, icc, ifort
 LLVM/Julia by default generate essentially optimal code for a primary vectorized part of this loop. In many cases -- such as the dot product -- this vectorized part of the loop computes 4*SIMD-vector-width iterations at a time.
 On the CPU I'm running these benchmarks on with `Float64` data, the SIMD-vector-width is 8, meaning it will compute 32 iterations at a time.
-However, LLVM is very slow at handling the tails, `length(iterations) % 32`. For this reason, [in benchmark plots](https://chriselrod.github.io/LoopVectorization.jl/latest/examples/dot_product/) you can see performance drop as the size of the remainder increases.
+However, LLVM is very slow at handling the tails, `length(iterations) % 32`. For this reason, [in benchmark plots](https://JuliaSIMD.github.io/LoopVectorization.jl/latest/examples/dot_product/) you can see performance drop as the size of the remainder increases.

 For simple loops like a dot product, LoopVectorization.jl's most important optimization is to handle these tails more efficiently:
 <details>

@@ -346,7 +345,7 @@ Similar approaches can be taken to make kernels working with a variety of numeric
 * [Gaius.jl](https://github.com/MasonProtter/Gaius.jl)
 * [MaBLAS.jl](https://github.com/YingboMa/MaBLAS.jl)
 * [Octavian.jl](https://github.com/JuliaLinearAlgebra/Octavian.jl)
-* [PaddedMatrices.jl](https://github.com/chriselrod/PaddedMatrices.jl)
+* [PaddedMatrices.jl](https://github.com/JuliaSIMD/PaddedMatrices.jl)
 * [RecursiveFactorization.jl](https://github.com/YingboMa/RecursiveFactorization.jl)
 * [SnpArrays.jl](https://github.com/OpenMendel/SnpArrays.jl)
 * [Tullio.jl](https://github.com/mcabbott/Tullio.jl)
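
To make the tail-handling claim above concrete: the dot-product benchmark kernel (defined in benchmark/looptests.jl later in this diff) is just the naive loop under `@avx`; the macro, rather than LLVM, generates the remainder handling. A minimal, self-contained sketch:

using LoopVectorization

# Dot product via @avx; the `length % (unroll * vector-width)` tail is
# handled with masked vector operations rather than a scalar epilogue.
function jdotavx(a, b)
    s = zero(eltype(a))
    @avx for i ∈ eachindex(a)
        s += a[i] * b[i]
    end
    s
end

jdotavx(rand(1000), rand(1000))  # ≈ LinearAlgebra.dot(a, b)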

benchmark/driver.jl

Lines changed: 19 additions & 18 deletions
@@ -196,30 +196,30 @@ end
 sizes = 256:-1:2
 longsizes = 1024:-1:2

-logdettriangle_bench = benchmark_logdettriangle(sizes); println("logdet(LowerTriangular(A)) benchmark results:"); println(logdettriangle_bench)
-dot3_bench = benchmark_dot3(sizes); println("x' * A * y benchmark results:"); println(dot3_bench)
+println("logdet(LowerTriangular(A)) benchmark results:"); logdettriangle_bench = benchmark_logdettriangle(sizes); println(logdettriangle_bench)
+println("x' * A * y benchmark results:"); dot3_bench = benchmark_dot3(sizes); println(dot3_bench)

-AmulB_bench = benchmark_AmulB(sizes); println("A * B benchmark results:"); println(AmulB_bench)
-AmulBt_bench = benchmark_AmulBt(sizes); println("A * B' benchmark results:"); println(AmulBt_bench)
-AtmulBt_bench = benchmark_AtmulBt(sizes); println("A' * B' benchmark results:"); println(AtmulBt_bench)
-AtmulB_bench = benchmark_AtmulB(sizes); println("A' * B benchmark results:"); println(AtmulB_bench)
+println("A * B benchmark results:"); AmulB_bench = benchmark_AmulB(sizes); println(AmulB_bench)
+println("A * B' benchmark results:"); AmulBt_bench = benchmark_AmulBt(sizes); println(AmulBt_bench)
+println("A' * B' benchmark results:"); AtmulBt_bench = benchmark_AtmulBt(sizes); println(AtmulBt_bench)
+println("A' * B benchmark results:"); AtmulB_bench = benchmark_AtmulB(sizes); println(AtmulB_bench)

-Amulvb_bench = benchmark_Amulvb(sizes); println("A * b benchmark results:"); println(Amulvb_bench)
-Atmulvb_bench = benchmark_Atmulvb(sizes); println("A' * b benchmark results:"); println(Atmulvb_bench)
+println("A * b benchmark results:"); Amulvb_bench = benchmark_Amulvb(sizes); println(Amulvb_bench)
+println("A' * b benchmark results:"); Atmulvb_bench = benchmark_Atmulvb(sizes); println(Atmulvb_bench)

-dot_bench = benchmark_dot(longsizes); println("a' * b benchmark results:"); println(dot_bench)
-selfdot_bench = benchmark_selfdot(longsizes); println("a' * a benchmark results:"); println(selfdot_bench)
+println("a' * b benchmark results:"); dot_bench = benchmark_dot(longsizes); println(dot_bench)
+println("a' * a benchmark results:"); selfdot_bench = benchmark_selfdot(longsizes); println(selfdot_bench)

-sse_bench = benchmark_sse(sizes); println("Benchmark results of summing squared error:"); println(sse_bench)
-aplusBc_bench = benchmark_aplusBc(sizes); println("Benchmark results of a .+ B .* c':"); println(aplusBc_bench)
-AplusAt_bench = benchmark_AplusAt(sizes); println("Benchmark results of A .+ A':"); println(AplusAt_bench)
+println("Benchmark results of summing squared error:"); sse_bench = benchmark_sse(sizes); println(sse_bench)
+println("Benchmark results of a .+ B .* c':"); aplusBc_bench = benchmark_aplusBc(sizes); println(aplusBc_bench)
+println("Benchmark results of A .+ A':"); AplusAt_bench = benchmark_AplusAt(sizes); println(AplusAt_bench)

-filter2d_dynamic_bench = benchmark_filter2ddynamic(sizes); println("Benchmark results for dynamically sized 3x3 convolution:"); println(filter2d_dynamic_bench)
-filter2d_3x3_bench = benchmark_filter2d3x3(sizes); println("Benchmark results for statically sized 3x3 convolution:"); println(filter2d_3x3_bench)
-filter2d_unrolled_bench = benchmark_filter2dunrolled(sizes); println("Benchmark results for unrolled 3x3 convolution:"); println(filter2d_unrolled_bench)
+println("Benchmark results for dynamically sized 3x3 convolution:"); filter2d_dynamic_bench = benchmark_filter2ddynamic(sizes); println(filter2d_dynamic_bench)
+println("Benchmark results for statically sized 3x3 convolution:"); filter2d_3x3_bench = benchmark_filter2d3x3(sizes); println(filter2d_3x3_bench)
+println("Benchmark results for unrolled 3x3 convolution:"); filter2d_unrolled_bench = benchmark_filter2dunrolled(sizes); println(filter2d_unrolled_bench)

-vexp_bench = benchmark_exp(sizes); println("Benchmark results of exponentiating a vector:"); println(vexp_bench)
-randomaccess_bench = benchmark_random_access(sizes); println("Benchmark results from using a vector of indices:"); println(randomaccess_bench)
+println("Benchmark results of exponentiating a vector:"); vexp_bench = benchmark_exp(sizes); println(vexp_bench)
+println("Benchmark results from using a vector of indices:"); randomaccess_bench = benchmark_random_access(sizes); println(randomaccess_bench)

 const v = 2
 using Cairo, Fontconfig

@@ -242,6 +242,7 @@ saveplot("bench_AtmulBt_v", AtmulBt_bench);
 saveplot("bench_Amulvb_v", Amulvb_bench);
 saveplot("bench_Atmulvb_v", Atmulvb_bench);

+
 saveplot("bench_logdettriangle_v", logdettriangle_bench);
 saveplot("bench_filter2d_dynamic_v", filter2d_dynamic_bench);
 saveplot("bench_filter2d_3x3_v", filter2d_3x3_bench);

benchmark/looptests.jl

Lines changed: 64 additions & 2 deletions
@@ -64,14 +64,53 @@ function jgemm!(𝐂, 𝐀ᵀ::Adjoint, 𝐁ᵀ::Adjoint)
     end
 end
 function gemmavx!(𝐂, 𝐀, 𝐁)
-    @avx for m ∈ axes(𝐀,1), n ∈ axes(𝐁,2)
+    @avx for m ∈ indices((𝐀,𝐂),1), n ∈ indices((𝐁,𝐂),2)
         𝐂ₘₙ = zero(eltype(𝐂))
-        for k ∈ axes(𝐀,2)
+        for k ∈ indices((𝐀,𝐁),(2,1))
             𝐂ₘₙ += 𝐀[m,k] * 𝐁[k,n]
         end
         𝐂[m,n] = 𝐂ₘₙ
     end
 end
+function gemmavx!(Cc::AbstractMatrix{Complex{T}}, Ac::AbstractMatrix{Complex{T}}, Bc::AbstractMatrix{Complex{T}}) where {T}
+    A = reinterpret(reshape, T, Ac)
+    B = reinterpret(reshape, T, Bc)
+    C = reinterpret(reshape, T, Cc)
+    @avx for m ∈ indices((A,C),2), n ∈ indices((B,C),3)
+        Cre = zero(T)
+        Cim = zero(T)
+        for k ∈ indices((A,B),(3,2))
+            Cre += A[1,m,k]*B[1,k,n] - A[2,m,k]*B[2,k,n]
+            Cim += A[1,m,k]*B[2,k,n] + A[2,m,k]*B[1,k,n]
+        end
+        C[1,m,n] = Cre
+        C[2,m,n] = Cim
+    end
+end
+function gemmavxt!(𝐂, 𝐀, 𝐁)
+    @avxt for m ∈ indices((𝐀,𝐂),1), n ∈ indices((𝐁,𝐂),2)
+        𝐂ₘₙ = zero(eltype(𝐂))
+        for k ∈ indices((𝐀,𝐁),(2,1))
+            𝐂ₘₙ += 𝐀[m,k] * 𝐁[k,n]
+        end
+        𝐂[m,n] = 𝐂ₘₙ
+    end
+end
+function gemmavxt!(Cc::AbstractMatrix{Complex{T}}, Ac::AbstractMatrix{Complex{T}}, Bc::AbstractMatrix{Complex{T}}) where {T}
+    A = reinterpret(reshape, T, Ac)
+    B = reinterpret(reshape, T, Bc)
+    C = reinterpret(reshape, T, Cc)
+    @avxt for m ∈ indices((A,C),2), n ∈ indices((B,C),3)
+        Cre = zero(T)
+        Cim = zero(T)
+        for k ∈ indices((A,B),(3,2))
+            Cre += A[1,m,k]*B[1,k,n] - A[2,m,k]*B[2,k,n]
+            Cim += A[1,m,k]*B[2,k,n] + A[2,m,k]*B[1,k,n]
+        end
+        C[1,m,n] = Cre
+        C[2,m,n] = Cim
+    end
+end
 function jdot(a, b)
     s = zero(eltype(a))
     # @inbounds @simd ivdep for i ∈ eachindex(a,b)

@@ -88,6 +127,14 @@ function jdotavx(a, b)
     end
     s
 end
+function jdotavxt(a, b)
+    s = zero(eltype(a))
+    # @avx for i ∈ eachindex(a,b)
+    @avxt for i ∈ eachindex(a)
+        s += a[i] * b[i]
+    end
+    s
+end
 function jselfdot(a)
     s = zero(eltype(a))
     @inbounds @simd ivdep for i ∈ eachindex(a)

@@ -324,3 +371,18 @@ function filter2dunrolledavx!(out::AbstractMatrix, A::AbstractMatrix, kern::Size
     end
     out
 end
+
+
+# function smooth_line!(sl,nrm1,j,i1,rl,ih2,denom)
+#     @fastmath @inbounds @simd ivdep for i=i1:2:nrm1
+#         sl[i,j]=denom*(rl[i,j]+ih2*(sl[i,j-1]+sl[i-1,j]+sl[i+1,j]+sl[i,j+1]))
+#     end
+# end
+# function smooth_line_avx!(sl,nrm1,j,i1,sl,rl,ih2,denom)
+#     @avx for i=i1:2:nrm1
+#         sl[i,j]=denom*(rl[i,j]+ih2*(sl[i,j-1]+sl[i-1,j]+sl[i+1,j]+sl[i,j+1]))
+#     end
+# end
+
+
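
A note on the new Complex methods above: `reinterpret(reshape, T, Ac)` views an M×N matrix of `Complex{T}` as a 2×M×N array of `T`, so the real and imaginary parts become plain strided data that `@avx`/`@avxt` can vectorize, and `indices` asserts which array axes share an iteration range. A quick sanity check, as a minimal sketch (sizes are illustrative; assumes the definitions above are in scope):

using LinearAlgebra

M, K, N = 48, 56, 40                 # illustrative sizes
A = rand(ComplexF64, M, K); B = rand(ComplexF64, K, N)
C = Matrix{ComplexF64}(undef, M, N)

gemmavx!(C, A, B)     # single-threaded @avx kernel on reinterpreted arrays
@assert C ≈ A * B     # agrees with BLAS up to roundoff
gemmavxt!(C, A, B)    # threaded @avxt variant
@assert C ≈ A * B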

benchmark/openmp.c

Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
+#include<omp.h>
+
+double dot(double* a, double* b, long N){
+    double s = 0.0;
+    #pragma omp parallel for reduction(+: s)
+    for(long n = 0; n < N; n++){
+        s += a[n]*b[n];
+    }
+    return s;
+}
+
+void cdot(double* c, double* a, double* b, long N){
+    double r = 0.0, i = 0.0;
+    #pragma omp parallel for reduction(+: r, i)
+    for(long n = 0; n < N; n++){
+        r += a[2*n] * b[2*n  ] + a[2*n+1] * b[2*n+1];
+        i += a[2*n] * b[2*n+1] - a[2*n+1] * b[2*n  ];
+    }
+    c[0] = r;
+    c[1] = i;
+    return;
+}
+
+void cdot3(double* c, double* x, double* A, double* y, long M, long N){
+    double sr = 0.0, si = 0.0;
+    #pragma omp parallel for reduction(+: sr, si)
+    for (long n = 0; n < N; n++){
+        double tr = 0.0, ti = 0.0;
+        for(long m = 0; m < M; m++){
+            tr += x[2*m] * A[2*m + 2*n*N] + x[2*m+1] * A[2*m+1 + 2*n*N];
+            ti += x[2*m] * A[2*m+1 + 2*n*N] - x[2*m+1] * A[2*m + 2*n*N];
+        }
+        sr += tr * y[2*n  ] - ti * y[2*n+1];
+        si += tr * y[2*n+1] + ti * y[2*n  ];
+    }
+    c[0] = sr;
+    c[1] = si;
+    return;
+}
+
+void conv(double* B, double* A, double* K, long M, long N){
+    const long offset = 2;
+    #pragma omp parallel for collapse(2)
+    for (long i = offset; i < N-offset; i++){
+        for (long j = offset; j < M-offset; j++){
+            double tmp = 0.0;
+            for (long k = -offset; k < offset + 1; k++){
+                for (long l = -offset; l < offset + 1; l++){
+                    tmp += A[(j+l) + (i+k)*M] * K[(l+offset) + (k+offset)*(2*offset+1)];
+                }
+            }
+            B[(j-offset) + (i-offset) * (M-2*offset)] = tmp;
+        }
+    }
+    return;
+}
+
+
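
These C kernels serve as OpenMP comparison points for the new threaded benchmarks. Their build and binding are not part of this commit; hypothetically, the file would be compiled to a shared library and wrapped from Julia via `ccall`, along these lines (library name, path, and compiler flags are all assumptions):

# Assumed build step, not in this commit:
#   gcc -O3 -march=native -fopenmp -shared -fPIC openmp.c -o libompbench.so
const OMPBENCH = "./libompbench.so"  # hypothetical library path

c_dot(a::Vector{Float64}, b::Vector{Float64}) =
    ccall((:dot, OMPBENCH), Float64,
          (Ptr{Float64}, Ptr{Float64}, Clong), a, b, length(a))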

docs/make.jl

Lines changed: 3 additions & 2 deletions
@@ -7,6 +7,7 @@ makedocs(;
         "Home" => "index.md",
         "Getting Started" => "getting_started.md",
         "Examples" => [
+            "examples/multithreading.md",
            "examples/matrix_multiplication.md",
            "examples/array_interface.md",
            "examples/matrix_vector_ops.md",

@@ -27,12 +28,12 @@
             "devdocs/reference.md"
         ]
     ],
-    # repo="https://github.com/chriselrod/LoopVectorization.jl/blob/{commit}{path}#L{line}",
+    # repo="https://github.com/JuliaSIMD/LoopVectorization.jl/blob/{commit}{path}#L{line}",
     sitename="LoopVectorization.jl",
     authors="Chris Elrod"
     # assets=[],
 )

 deploydocs(;
-    repo="github.com/chriselrod/LoopVectorization.jl",
+    repo="github.com/JuliaSIMD/LoopVectorization.jl",
 )

docs/src/assets/bench_AmulB_v2.png

Binary file changed (−17.7 KB).
