It then tries to vectorize the loop to improve runtime performance.
The macro assumes that loop iterations can be reordered. It also currently supports only simple nested loops, where the loop bounds of inner loops are constant across iterations of the outer loop, and where there is only a single loop at each level of the loop nest. These limitations should be removed in a future version.
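
As a hedged illustration of these constraints, consider the hypothetical kernel below (it is not taken from the package):

```julia
using LoopVectorization

# Supported shape: a single loop at each nesting level, with inner bounds
# that do not depend on the outer iteration variables.
function scale_columns!(B, A, s)
    @avx for j ∈ 1:size(A, 2), i ∈ 1:size(A, 1)
        B[i, j] = A[i, j] * s[j]
    end
    B
end
# A triangular nest such as `for i ∈ 1:n, j ∈ 1:i` falls outside these
# assumptions, because the inner bound varies across outer iterations.
```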
## Examples
### Dot Product
<details>
<summary>Click me!</summary>
<p>
A simple example with a single loop is the dot product:
```julia
using LoopVectorization, BenchmarkTools

# A minimal sketch of the benchmarked kernels (function names assumed):
function mydot(a, b)
    s = zero(eltype(a))
    @inbounds @simd for i ∈ eachindex(a, b)
        s += a[i] * b[i]
    end
    s
end

function mydotavx(a, b)
    s = zero(eltype(a))
    @avx for i ∈ eachindex(a, b)  # `@avx` takes the place of `@inbounds @simd`
        s += a[i] * b[i]
    end
    s
end
```
Because each iteration updates a single accumulator, successive `fma`s form a dependency chain whose speed is limited by instruction latency rather than throughput. For this reason, we need to unroll the operation to run several independent instructions concurrently.
Note that 14 and 12 nm Ryzen chips can only do 1 full width `fma` per clock cycle (and 2 loads), so they should see similar performance with the dot and selfdot. I haven't verified this, but would like to hear from anyone who can.
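
For reference, a hedged sketch of what the `selfdot` referred to above could look like (name and body assumed):

```julia
# Sum of squares: a single accumulator forms a loop-carried dependency chain,
# which is why unrolling into independent accumulators matters here.
function myselfdotavx(a)
    s = zero(eltype(a))
    @avx for i ∈ eachindex(a)
        s += a[i] * a[i]
    end
    s
end
```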
</p>
</details>
### Matrix Multiply
<details>
<summary>Click me!</summary>
<p>
We can also vectorize fancier loops. A likely familiar example to dive into:
```julia
# A sketch of the kind of kernel `@avx` is applied to here (name assumed):
function mygemmavx!(C, A, B)
    @avx for i ∈ 1:size(A, 1), j ∈ 1:size(B, 2)
        Cᵢⱼ = zero(eltype(C))
        for k ∈ 1:size(A, 2)
            Cᵢⱼ += A[i, k] * B[k, j]
        end
        C[i, j] = Cᵢⱼ
    end
    C
end
```
In the future, I would like it to also model the cost of memory movement in the L1 and L2 caches.
Until then, performance will degrade rapidly compared to BLAS as the size of the matrices increases. The advantage of the `@avx` macro, however, is that it is general. Not every operation is supported by BLAS.
For example, what if `A` were the outer product of two vectors?
<!-- ```julia -->
<!-- ``` -->
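
A hedged sketch of what such a fused kernel could look like, computing `(u * v') * B` without ever materializing `A` (the name and signature are assumptions):

```julia
function mygemm_outer_avx!(C, u, v, B)
    @avx for i ∈ eachindex(u), j ∈ 1:size(B, 2)
        Cᵢⱼ = zero(eltype(C))
        for k ∈ eachindex(v)
            # A[i, k] == u[i] * v[k], so `A` never needs to be formed.
            Cᵢⱼ += u[i] * v[k] * B[k, j]
        end
        C[i, j] = Cᵢⱼ
    end
    C
end
```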
</p>
</details>
### Broadcasting
<details>
<summary>Click me!</summary>
<p>
Another example, a straightforward operation expressed well via broadcasting:
```julia
# Definitions here are illustrative; only the final line appears in the original.
a = rand(48); B = rand(48, 51); c = rand(51); c′ = c';

d1 = @. a + B * c′;        # ordinary fused broadcast
d2 = @avx @. a + B * c′;   # the same broadcast, vectorized with `@avx`
```
can be optimized in a similar manner to BLAS, albeit to a much smaller degree because the naive version already benefits from vectorization (unlike the naive matrix multiply).
You can also use `∗` (which is typed `\ast` and not to be confused with `*`) for lazy matrix multiplication that can fuse with broadcasts. `.∗` behaves similarly, escaping the broadcast (it is not applied elementwise). This allows you to use `@.` and fuse all the loops, even if the arguments to `∗` are themselves broadcasted objects. However, it will often be the case that creating an intermediary is faster. I would recommend always checking whether splitting the operation into pieces, or at least isolating the matrix multiplication, increases performance. That will often be the case, especially if the matrices are large, where a separate multiplication can leverage BLAS (and perhaps take advantage of threads).
At small sizes, this can be fast.
```julia
# A hypothetical usage sketch, assuming `∗` is exported by LoopVectorization:
A = rand(4, 5); B = rand(5, 6); c = rand(6);
# Inside `@.`, `∗` escapes the broadcast: `A ∗ B` is a lazy matrix multiply
# that fuses with the elementwise `+`.
D = @avx @. A ∗ B + c';
```
</p>
</details>
### Dealing with structs
<details>
<summary>Click me!</summary>
<p>
The key to the `@avx` macro's performance gains is leveraging knowledge of exactly how data like `Float64`s and `Int`s are handled by a CPU. As such, it is not straightforward to generalize the `@avx` macro to work on arrays containing structs such as `Matrix{Complex{Float64}}`. Instead, it is currently recommended that users wishing to apply `@avx` to arrays of structs use packages such as [StructArrays.jl](https://github.com/JuliaArrays/StructArrays.jl), which transform an array where each element is a struct into a struct where each element is an array. Using StructArrays.jl, we can write a matrix multiply (gemm) kernel that works on matrices of `Complex{Float64}`s and `Complex{Int}`s:
```julia
using LoopVectorization, LinearAlgebra, StructArrays, BenchmarkTools, Test

# A hedged sketch of such a kernel (name and signature assumed), operating on
# the separate real and imaginary field-arrays that StructArrays exposes:
function mul_avx!(C::StructArray, A::StructArray, B::StructArray)
    Cre, Cim = C.re, C.im
    Are, Aim = A.re, A.im
    Bre, Bim = B.re, B.im
    @avx for i ∈ 1:size(A, 1), j ∈ 1:size(B, 2)
        Creᵢⱼ = zero(eltype(Cre)); Cimᵢⱼ = zero(eltype(Cim))
        for k ∈ 1:size(A, 2)
            Creᵢⱼ += Are[i, k] * Bre[k, j] - Aim[i, k] * Bim[k, j]
            Cimᵢⱼ += Are[i, k] * Bim[k, j] + Aim[i, k] * Bre[k, j]
        end
        Cre[i, j] = Creᵢⱼ
        Cim[i, j] = Cimᵢⱼ
    end
    C
end
```
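
A possible correctness check under the same assumptions (sizes chosen arbitrarily):

```julia
M, K, N = 56, 57, 58
A = StructArray(rand(ComplexF64, M, K));
B = StructArray(rand(ComplexF64, K, N));
C = StructArray(zeros(ComplexF64, M, N));

mul_avx!(C, A, B)
@test collect(C) ≈ collect(A) * collect(B)  # compare against the generic matmul
```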