
Commit 27867b1

Merge commit: 2 parents 4ad1034 + f560133

19 files changed: +456 −225 lines changed

Manifest.toml

Lines changed: 6 additions & 6 deletions
@@ -61,15 +61,15 @@ uuid = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"

 [[SIMDPirates]]
 deps = ["MacroTools", "VectorizationBase"]
-git-tree-sha1 = "6c6a77a41c846c08c61a0e556183a9c33b53e3d1"
+git-tree-sha1 = "c0f42ddb2645c54b8620979df5dc979c4742db59"
 uuid = "21efa798-c60a-11e8-04d3-e1a92915a26a"
-version = "0.1.3"
+version = "0.1.4"

 [[SLEEFPirates]]
 deps = ["SIMDPirates", "VectorizationBase"]
-git-tree-sha1 = "1c5b6827da87a12bdb7a4c893f44c3adbce3389d"
+git-tree-sha1 = "547bcf7d30967d87d4c62b3fe5efdb0e57a6e436"
 uuid = "476501e8-09a2-5ece-8869-fb82de89a1fa"
-version = "0.1.1"
+version = "0.1.2"

 [[Serialization]]
 uuid = "9e88b42a-f829-5b0c-bbe9-9e923198166b"

@@ -83,6 +83,6 @@ uuid = "8dfed614-e22c-5e08-85e1-65c5234f0b40"

 [[VectorizationBase]]
 deps = ["CpuId", "LinearAlgebra"]
-git-tree-sha1 = "54f5ba672c7d684fb0312825721368e22354ecd5"
+git-tree-sha1 = "81c1b3171d93e64345d75a9f08d190a155e9f009"
 uuid = "3d5dd08c-fd9d-11e8-17fa-ed2836048c2f"
-version = "0.1.5"
+version = "0.1.7"

Project.toml

Lines changed: 4 additions & 4 deletions
@@ -1,7 +1,7 @@
 name = "LoopVectorization"
 uuid = "bdcacae8-1622-11e9-2a5c-532679323890"
 authors = ["Chris Elrod <[email protected]>"]
-version = "0.3.2"
+version = "0.3.5"

 [deps]
 LinearAlgebra = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"

@@ -14,9 +14,9 @@ VectorizationBase = "3d5dd08c-fd9d-11e8-17fa-ed2836048c2f"
 [compat]
 MacroTools = "0.5"
 Parameters = "0.12.0"
-SIMDPirates = "0.1.3"
-SLEEFPirates = "0.1.1"
-VectorizationBase = "0.1.5"
+SIMDPirates = "0.1.4"
+SLEEFPirates = "0.1.2"
+VectorizationBase = "0.1.7"
 julia = "1.3.0"

 [extras]

README.md

Lines changed: 85 additions & 2 deletions
@@ -20,6 +20,12 @@ It then tries to vectorize the loop to improve runtime performance.

 The macro assumes that loop iterations can be reordered. It also currently supports simple nested loops, where loop bounds of inner loops are constant across iterations of the outer loop, and only a single loop at each level of the loop nest. These limitations should be removed in a future version.

+## Examples
+### Dot Product
+<details>
+<summary>Click me!</summary>
+<p>
+
 A simple example with a single loop is the dot product:
 ```julia
 using LoopVectorization, BenchmarkTools
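
The diff view truncates this hunk at the opening code fence. As a hedged sketch only (the function name `mydotavx` is illustrative, not necessarily the committed README's), the dot-product kernel the text introduces would look like:

```julia
# Hedged sketch: an @avx dot product; `mydotavx` is an illustrative name.
function mydotavx(a, b)
    s = zero(eltype(a))
    @avx for i ∈ 1:length(a)
        s += a[i] * b[i]
    end
    s
end

x = rand(256); y = rand(256);
@btime mydotavx($x, $y)
```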
@@ -77,6 +83,13 @@ For this reason, we need to unroll the operation to run several independent instructions

 Note that 14 and 12 nm Ryzen chips can only do 1 full width `fma` per clock cycle (and 2 loads), so they should see similar performance with the dot and selfdot. I haven't verified this, but would like to hear from anyone who can.

+</p>
+</details>
+
+### Matrix Multiply
+<details>
+<summary>Click me!</summary>
+<p>

 We can also vectorize fancier loops. A likely familiar example to dive into:
 ```julia
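
This hunk is likewise cut off at the code fence. As a hedged sketch (the name `mygemmavx!` is illustrative; the committed kernel the README actually adds appears as `mul_avx!` in the structs hunk below), the triple-loop matrix multiply the text refers to:

```julia
# Hedged sketch: naive gemm vectorized with @avx; compare mul_avx! in the
# "Dealing with structs" hunk below for the version this commit adds.
function mygemmavx!(C, A, B)
    @avx for i ∈ 1:size(A,1), j ∈ 1:size(B,2)
        Cᵢⱼ = zero(eltype(C))
        for k ∈ 1:size(A,2)
            Cᵢⱼ += A[i,k] * B[k,j]
        end
        C[i,j] = Cᵢⱼ
    end
end
```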
@@ -114,12 +127,19 @@ In the future, I would like it to also model the cost of memory movement in the

 Until then, performance will degrade rapidly compared to BLAS as the size of the matrices increases. The advantage of the `@avx` macro, however, is that it is general. Not every operation is supported by BLAS.

-For example, what if `A` were the outter product of two vectors?
+For example, what if `A` were the outer product of two vectors?
 <!-- ```julia -->

 <!-- ``` -->

+</p>
+</details>
+
+### Broadcasting
+<details>
+<summary>Click me!</summary>
+<p>

 Another example, a straightforward operation expressed well via broadcasting:
 ```julia
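
The broadcasting example is also truncated at the fence; the `d2 = @avx @. a + B * c′;` context line in the next hunk header shows the operation being discussed. A hedged reconstruction:

```julia
# Hedged sketch reconstructed from the `d2 = @avx @. a + B * c′;` context
# line below; the array sizes are illustrative assumptions.
a = rand(100); B = rand(100, 100); c = rand(100); c′ = c';
d1 = @. a + B * c′;       # ordinary fused broadcast
d2 = @avx @. a + B * c′;  # the @avx-optimized version the README benchmarks
```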
@@ -137,13 +157,76 @@ d2 = @avx @. a + B * c′;

 can be optimized in a similar manner to BLAS, albeit to a much smaller degree because the naive version already benefits from vectorization (unlike the naive BLAS).

-You can also use `\ast` for lazy matrix multiplication that can fuse with broadcasts. `.\ast` behaves similarly, espcaping the broadcast (it is not applied elementwise). This allows you to use `@.` and fuse all the loops, even if the arguments to `\ast` are themselves broadcasted objects. However, it will often be the case that creating an intermediary is faster. I would recomend always checking if splitting the operation into pieces, or at least isolating the matrix multiplication, increases performance. That will often be the case, especially if the matrices are large, where a separate multiplication can leverage BLAS (and perhaps take advantage of threads).
+You can also use `*ˡ` (which is typed `*\^l`) for lazy matrix multiplication that can fuse with broadcasts. `.*ˡ` behaves similarly, escaping the broadcast (it is not applied elementwise). This allows you to use `@.` and fuse all the loops, even if the arguments to `*ˡ` are themselves broadcasted objects. However, it will often be the case that creating an intermediary is faster. I would recommend always checking whether splitting the operation into pieces, or at least isolating the matrix multiplication, increases performance. That will often be the case, especially if the matrices are large, where a separate multiplication can leverage BLAS (and perhaps take advantage of threads).

 At small sizes, this can be fast.
 ```julia

 ```

+</p>
+</details>
+
+
+### Dealing with structs
+<details>
+<summary>Click me!</summary>
+<p>
+
+The key to the `@avx` macro's performance gains is leveraging knowledge of exactly how data like `Float64`s and `Int`s are handled by a CPU. As such, it is not straightforward to generalize the `@avx` macro to work on arrays containing structs such as `Matrix{Complex{Float64}}`. Instead, it is currently recommended that users wishing to apply `@avx` to arrays of structs use packages such as [StructArrays.jl](https://github.com/JuliaArrays/StructArrays.jl), which transform an array where each element is a struct into a struct where each element is an array. Using StructArrays.jl, we can write a matrix multiply (gemm) kernel that works on matrices of `Complex{Float64}`s and `Complex{Int}`s:
+```julia
+using LoopVectorization, LinearAlgebra, StructArrays, BenchmarkTools, Test
+
+BLAS.set_num_threads(1); @show BLAS.vendor()
+
+const MatrixFInt64 = Union{Matrix{Float64}, Matrix{Int}}
+
+function mul_avx!(C::MatrixFInt64, A::MatrixFInt64, B::MatrixFInt64)
+    z = zero(eltype(C))
+    @avx for i ∈ 1:size(A,1), j ∈ 1:size(B,2)
+        Cᵢⱼ = z
+        for k ∈ 1:size(A,2)
+            Cᵢⱼ += A[i,k] * B[k,j]
+        end
+        C[i,j] = Cᵢⱼ
+    end
+end
+
+function mul_add_avx!(C::MatrixFInt64, A::MatrixFInt64, B::MatrixFInt64, factor=1)
+    z = zero(eltype(C))
+    @avx for i ∈ 1:size(A,1), j ∈ 1:size(B,2)
+        ΔCᵢⱼ = z
+        for k ∈ 1:size(A,2)
+            ΔCᵢⱼ += A[i,k] * B[k,j]
+        end
+        C[i,j] += factor * ΔCᵢⱼ
+    end
+end
+
+const StructMatrixComplexFInt64 = Union{StructArray{ComplexF64,2}, StructArray{Complex{Int},2}}
+
+function mul_avx!(C::StructMatrixComplexFInt64, A::StructMatrixComplexFInt64, B::StructMatrixComplexFInt64)
+    mul_avx!(    C.re, A.re, B.re)     # C.re = A.re * B.re
+    mul_add_avx!(C.re, A.im, B.im, -1) # C.re = C.re - A.im * B.im
+    mul_avx!(    C.im, A.re, B.im)     # C.im = A.re * B.im
+    mul_add_avx!(C.im, A.im, B.re)     # C.im = C.im + A.im * B.re
+end
+```
+This `mul_avx!` kernel can now accept `StructArray` matrices of complex numbers and multiply them efficiently:
+```julia
+M, K, N = 50, 51, 52
+
+A = StructArray(randn(ComplexF64, M, K));
+B = StructArray(randn(ComplexF64, K, N));
+C1 = StructArray(Matrix{ComplexF64}(undef, M, N));
+C2 = collect(similar(C1));
+
+@btime mul_avx!($C1, $A, $B)
+@btime mul!(    $C2, $(collect(A)), $(collect(B))) # collect turns the StructArray into a regular Array
+@test C1 ≈ C2
+```

+Similar approaches can be taken to make kernels that work with a variety of numeric struct types, such as [dual numbers](https://github.com/JuliaDiff/DualNumbers.jl), [DoubleFloats](https://github.com/JuliaMath/DoubleFloats.jl), etc.

+</p>
+</details>
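
An editor's note on the broadcasting section above: its final code block ("At small sizes, this can be fast.") is empty in the commit. As a hedged illustration of the lazy-multiply fusion the text describes — not the committed example, with sizes chosen here for illustration:

```julia
# Hedged sketch: `*ˡ` (typed `*\^l`) keeps the matrix product lazy so that,
# under `@.`, it fuses with the surrounding broadcast instead of being
# applied elementwise.
using LoopVectorization
A = rand(8, 8); B = rand(8, 8); c = rand(8);
D = @avx @. c + A *ˡ B  # the product A*B fuses with the elementwise `+`
```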
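
The closing sentence about dual numbers suggests the same decomposition trick. A hedged sketch (not in the commit) reusing the `mul_avx!`/`mul_add_avx!` kernels above, assuming a `StructArray` of DualNumbers.jl duals exposes `.value` and `.epsilon` field arrays, and using (a + bε)(c + dε) = ac + (ad + bc)ε:

```julia
# Hedged sketch: dual-number gemm via the real-valued kernels defined above.
# Since (a + bε)(c + dε) = ac + (ad + bc)ε:
function dual_mul_avx!(C, A, B)
    mul_avx!(    C.value,   A.value,   B.value)   # C.value = A.value * B.value
    mul_avx!(    C.epsilon, A.value,   B.epsilon) # C.epsilon = A.value * B.epsilon
    mul_add_avx!(C.epsilon, A.epsilon, B.value)   # C.epsilon += A.epsilon * B.value
end
```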
File renamed without changes.

benchmark/benchmarks.jl

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
+
+using BenchmarkTools
+
+const SUITE = BenchmarkGroup()
+SUITE["linalg"] = BenchmarkGroup(["matmul","dot"])
+
+include(joinpath(@__DIR__, "looptests.jl"))
+
+SUITE["linalg"]["matmul"] = BenchmarkGroup()
+SUITE["linalg"]["dot"] = BenchmarkGroup()
+for n ∈ 1:64
+    A = rand(n,n);
+    A′ = copy(A');
+    B = rand(n,n);
+    C = Matrix{Float64}(undef, n, n);
+    SUITE["linalg"]["matmul"]["AmulB", n] = @benchmarkable gemmavx!($C, $A, $B)
+    SUITE["linalg"]["matmul"]["A′mulB", n] = @benchmarkable jAtmulBavx!($C, $A′, $B)
+    x = rand(n); y = rand(n);
+    SUITE["linalg"]["dot"]["dot", n] = @benchmarkable jdotavx($x, $y)
+    SUITE["linalg"]["dot"]["selfdot", n] = @benchmarkable jselfdotavx($x)
+    SUITE["linalg"]["dot"]["dot3", n] = @benchmarkable jdot3avx($x, $A, $y)
+end
+
+
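
The new benchmark/benchmarks.jl exposes a top-level `SUITE`, the convention PkgBenchmark.jl consumes. The commit itself never mentions PkgBenchmark, so this driver is an editor's assumption, sketched for orientation:

```julia
# Hedged sketch: PkgBenchmark picks up `SUITE` from benchmark/benchmarks.jl.
using PkgBenchmark
results = benchmarkpkg("LoopVectorization")       # run the whole suite
export_markdown("benchmark_results.md", results)  # summarize the results
```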

benchmarks/driver.jl renamed to benchmark/driver.jl

Lines changed: 25 additions & 1 deletion
@@ -2,6 +2,12 @@
 # const LOOPVECBENCHDIR = joinpath(pkgdir("LoopVectorization"), "benchmarks")
 # includet(joinpath(LOOPVECBENCHDIR, "driver.jl"))

+pkgdir(pkg::String) = abspath(joinpath(dirname(Base.find_package(pkg)), ".."))
+const LOOPVECBENCHDIR = joinpath(pkgdir("LoopVectorization"), "benchmarks")
+include(joinpath(LOOPVECBENCHDIR, "benchmarkflops.jl"))
+include(joinpath(LOOPVECBENCHDIR, "plotbenchmarks.jl"))
+
+
 using Distributed

 addprocs(9);

@@ -33,5 +39,23 @@ exp_bench = fetch(exp_future)
 aplusBc_bench = fetch(aplusBc_future)

-include(joinpath(LOOPVECBENCHDIR, "plotbenchmarks.jl"))
+plot(gemm_bench)
+plot(AtmulB_bench)
+plot(dot_bench)
+plot(selfdot_bench)
+plot(gemv_bench)
+plot(dot3_bench)
+plot(sse_bench)
+plot(exp_bench)
+plot(aplusBc_bench)
+
+save(joinpath("~/Pictures", "bench_gemm_v3.png"), plot(gemm_bench));
+save(joinpath("~/Pictures", "bench_AtmulB_v3.png"), plot(AtmulB_bench));
+save(joinpath("~/Pictures", "bench_dot_v3.png"), plot(dot_bench));
+save(joinpath("~/Pictures", "bench_selfdot_v3.png"), plot(selfdot_bench));
+save(joinpath("~/Pictures", "bench_gemv_v3.png"), plot(gemv_bench));
+save(joinpath("~/Pictures", "bench_dot3_v3.png"), plot(dot3_bench));
+save(joinpath("~/Pictures", "bench_sse_v3.png"), plot(sse_bench));
+save(joinpath("~/Pictures", "bench_exp_v3.png"), plot(exp_bench));
+save(joinpath("~/Pictures", "bench_aplusBc_v3.png"), plot(aplusBc_bench));
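
One hedged editor's note on the new `save` calls: Julia does not expand `~` in paths, so `joinpath("~/Pictures", …)` targets a literal `~` directory. Wrapping the path in `Base.expanduser` resolves it to the home directory:

```julia
# expanduser("~/Pictures") resolves "~" before the file is written.
save(joinpath(expanduser("~/Pictures"), "bench_gemm_v3.png"), plot(gemm_bench));
```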

File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
