
Commit 7962715

Update README.md
1 parent f55123e commit 7962715

README.md

Lines changed: 82 additions & 2 deletions
@@ -20,6 +20,12 @@ It then tries to vectorize the loop to improve runtime performance.

The macro assumes that loop iterations can be reordered. It also currently supports only simple nested loops, where the loop bounds of inner loops are constant across iterations of the outer loop, and only a single loop at each level of the loop nest. These limitations should be removed in a future version.

## Examples

### Dot Product

<details>
<summary>Click me!</summary>
<p>

A simple example with a single loop is the dot product:
```julia
using LoopVectorization, BenchmarkTools
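# The rest of this example lies outside the hunk; below is an
# illustrative sketch in the same style (the name `mydot_avx` is
# not from the diff).
function mydot_avx(a, b)
    s = zero(eltype(a))
    @avx for i ∈ eachindex(a, b)
        s += a[i] * b[i]
    end
    s
end

a = rand(256); b = rand(256);
@btime mydot_avx($a, $b)
```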
@@ -77,6 +83,13 @@ For this reason, we need to unroll the operation to run several independent inst

Note that 14 and 12 nm Ryzen chips can only do 1 full-width `fma` per clock cycle (and 2 loads), so they should see similar performance with the dot and self-dot products. I haven't verified this, but would like to hear from anyone who can.
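
The self-dot mentioned here is a sum of squares over a single vector; its accumulator carries a dependency from one iteration to the next, which is why the unrolling discussed above is needed. A minimal sketch of such a kernel (the name `myselfdot_avx` is illustrative, not from the diff):
```julia
using LoopVectorization

# Each iteration accumulates into `s`, so @avx unrolls the loop into
# several independent accumulators to hide fma latency.
function myselfdot_avx(a)
    s = zero(eltype(a))
    @avx for i ∈ eachindex(a)
        s += a[i] * a[i]
    end
    s
end
```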

</p>
</details>

### Matrix Multiply

<details>
<summary>Click me!</summary>
<p>

We can also vectorize fancier loops. A likely familiar example to dive into:
```julia
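# The kernel itself lies outside this hunk; the following is an
# illustrative sketch in the same style (the name `mygemm_avx!` is
# not from the diff).
using LoopVectorization

function mygemm_avx!(C, A, B)
    @avx for m ∈ 1:size(A,1), n ∈ 1:size(B,2)
        Cₘₙ = zero(eltype(C))
        for k ∈ 1:size(A,2)
            Cₘₙ += A[m,k] * B[k,n]
        end
        C[m,n] = Cₘₙ
    end
end
```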
@@ -114,12 +127,19 @@ In the future, I would like it to also model the cost of memory movement in the

Until then, performance will degrade rapidly compared to BLAS as the size of the matrices increases. The advantage of the `@avx` macro, however, is that it is general. Not every operation is supported by BLAS.

For example, what if `A` were the outer product of two vectors?
<!-- ```julia -->

<!-- ``` -->
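
A fused kernel for that case might look like the following sketch, which computes `C = (u * v') * B` without ever materializing the outer product (the names `mulouter_avx!`, `u`, and `v` are illustrative, not from the diff):
```julia
using LoopVectorization

# C[m,n] = Σₖ u[m] * v[k] * B[k,n]; the outer product u * v' is never formed.
function mulouter_avx!(C, u, v, B)
    @avx for m ∈ 1:length(u), n ∈ 1:size(B,2)
        Cₘₙ = zero(eltype(C))
        for k ∈ 1:length(v)
            Cₘₙ += u[m] * v[k] * B[k,n]
        end
        C[m,n] = Cₘₙ
    end
    C
end
```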

</p>
</details>

### Broadcasting

<details>
<summary>Click me!</summary>
<p>

Another example, a straightforward operation expressed well via broadcasting:
```julia
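# The example continues beyond this hunk; the following is an
# illustrative sketch built around the `d2 = @avx @. a + B * c′`
# line visible in the next hunk (sizes and names are arbitrary).
using LoopVectorization

a = rand(100); B = rand(100, 100); c = rand(100); c′ = c';

d1 = @. a + B * c′;       # ordinary fused broadcast
d2 = @avx @. a + B * c′;  # the same broadcast, vectorized by @avx
d1 ≈ d2
```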
@@ -137,13 +157,73 @@ d2 = @avx @. a + B * c′;

can be optimized in a similar manner to BLAS, albeit to a much smaller degree because the naive version already benefits from vectorization (unlike the naive BLAS).

You can also use `∗` (which is typed `\ast` and not to be confused with `*`) for lazy matrix multiplication that can fuse with broadcasts. `.∗` behaves similarly, escaping the broadcast (it is not applied elementwise). This allows you to use `@.` and fuse all the loops, even if the arguments to `∗` are themselves broadcasted objects. However, it will often be the case that creating an intermediary is faster. I would recommend always checking whether splitting the operation into pieces, or at least isolating the matrix multiplication, increases performance. That will often be the case, especially if the matrices are large, where a separate multiplication can leverage BLAS (and perhaps take advantage of threads).

At small sizes, this can be fast.
```julia
```
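
A rough, unverified sketch of the fused usage just described (array names and sizes are arbitrary):
```julia
using LoopVectorization

a = rand(100); B = rand(100, 100); c = rand(100);

# Inside `@.`, `∗` becomes `.∗`, which still escapes the broadcast:
# `B ∗ c` remains a lazy matrix-vector product and fuses with the
# surrounding elementwise loop instead of allocating an intermediary.
d3 = @avx @. a + B ∗ c;
d4 = a .+ B * c;  # reference computed with an intermediary
d3 ≈ d4
```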
</p>
</details>

### Dealing with structs

<details>
<summary>Click me!</summary>
<p>

The key to the `@avx` macro's performance gains is leveraging knowledge of exactly how data like `Float64`s and `Int`s are handled by a CPU. As such, it is not straightforward to generalize the `@avx` macro to work on arrays containing structs such as `Matrix{Complex{Float64}}`. Instead, it is currently recommended that users wishing to apply `@avx` to arrays of structs use packages such as [StructArrays.jl](https://github.com/JuliaArrays/StructArrays.jl), which transform an array where each element is a struct into a struct where each element is an array. Using StructArrays.jl, we can write a matrix multiply (gemm) kernel that works on matrices of `Complex{Float64}`s and `Complex{Int}`s:
```julia
using LoopVectorization, LinearAlgebra, StructArrays, BenchmarkTools, Test

BLAS.set_num_threads(1); @show BLAS.vendor()

const MatrixFInt64 = Union{Matrix{Float64}, Matrix{Int}}

function mul_avx!(C::MatrixFInt64, A::MatrixFInt64, B::MatrixFInt64)
    z = zero(eltype(C))
    @avx for i ∈ 1:size(A,1), j ∈ 1:size(B,2)
        Cᵢⱼ = z
        for k ∈ 1:size(A,2)
            Cᵢⱼ += A[i,k] * B[k,j]
        end
        C[i,j] = Cᵢⱼ
    end
end

function mul_add_avx!(C::MatrixFInt64, A::MatrixFInt64, B::MatrixFInt64, factor=1)
    z = zero(eltype(C))
    @avx for i ∈ 1:size(A,1), j ∈ 1:size(B,2)
        ΔCᵢⱼ = z
        for k ∈ 1:size(A,2)
            ΔCᵢⱼ += A[i,k] * B[k,j]
        end
        C[i,j] += factor * ΔCᵢⱼ
    end
end

const StructMatrixComplexFInt64 = Union{StructArray{ComplexF64,2}, StructArray{Complex{Int},2}}

function mul_avx!(C::StructMatrixComplexFInt64, A::StructMatrixComplexFInt64, B::StructMatrixComplexFInt64)
    mul_avx!(    C.re, A.re, B.re)     # C.re = A.re * B.re
    mul_add_avx!(C.re, A.im, B.im, -1) # C.re = C.re - A.im * B.im
    mul_avx!(    C.im, A.re, B.im)     # C.im = A.re * B.im
    mul_add_avx!(C.im, A.im, B.re)     # C.im = C.im + A.im * B.re
end
```

This `mul_avx!` kernel can now accept `StructArray` matrices of complex numbers and multiply them efficiently:
```julia
M, K, N = 50, 51, 52

A  = StructArray(randn(ComplexF64, M, K));
B  = StructArray(randn(ComplexF64, K, N));
C1 = StructArray(Matrix{ComplexF64}(undef, M, N));
C2 = collect(similar(C1));

@btime mul_avx!($C1, $A, $B)
@btime mul!(    $C2, $(collect(A)), $(collect(B))) # collect turns the StructArray into a regular Array
@test C1 ≈ C2
```
</p>
</details>
