
Commit ac38b86

Updated multithreading image paths; in combination with previous commit, fixes #349.
1 parent 801f17c commit ac38b86

1 file changed, 4 insertions(+), 4 deletions(-)


docs/src/examples/multithreading.md

Lines changed: 4 additions & 4 deletions
@@ -91,7 +91,7 @@ julia> @btime cdot($x, $y)
 2480.2964467112092
 ```
 All these times are fairly fast; `wait(Threads.@spawn 1+1)` will typically take much longer than even `@cdot` did here.
-![realdot](../assets/threadeddotproduct.svg)
+![realdot](https://github.com/JuliaSIMD/LoopVectorization.jl/raw/docsassets/docs/src/assets/threadeddotproduct.svg)
 
 
 Now let's look at a more complex example:
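
For readers skimming this diff without the rendered page, here is a minimal sketch of the kind of `@tturbo`-threaded dot product the first hunk's benchmark refers to. The function name and loop body are assumptions for illustration, not the definitions from `multithreading.md` itself:

```julia
using LoopVectorization

# Hedged sketch: a threaded, SIMD-vectorized real dot product in the style
# the surrounding text describes. @tturbo is the threaded variant of @turbo.
function dot_sketch(x::AbstractVector{Float64}, y::AbstractVector{Float64})
    s = 0.0
    @tturbo for i in eachindex(x)
        s += x[i] * y[i]
    end
    s
end
```

A call like `@btime dot_sketch($x, $y)` would mirror the `@btime cdot($x, $y)` timing shown in the hunk header.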
@@ -134,7 +134,7 @@ end
 
 The complex dot product is more compute bound. Given the same number of elements, we require `2x` the memory for complex numbers, `4x` the floating point arithmetic,
 and as we have an array of structs rather than structs of arrays, we need additional instructions to shuffle the data.
-![complexdot](../assets/threadedcomplexdotproduct.svg)
+![complexdot](https://github.com/JuliaSIMD/LoopVectorization.jl/raw/docsassets/docs/src/assets/threadedcomplexdotproduct.svg)
 
 
 If we take this further to the three-argument dot product, which isn't implemented in BLAS, `@tturbo` now holds a substantial advantage over the competition:
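
As a rough illustration of why the complex case needs the extra data shuffling mentioned in this hunk, here is a hedged sketch of a two-argument complex dot product written against reinterpreted real data. The names, the `reinterpret(reshape, ...)` layout, and the loop body are assumptions, not the file's actual `cdot`; the three-argument version appears in the next hunk.

```julia
using LoopVectorization

# Hedged sketch: complex dot product over reinterpreted real data.
# Row 1 holds real parts, row 2 imaginary parts; separate accumulators
# keep the arithmetic real inside the @tturbo loop. Matches the
# conjugating convention of LinearAlgebra.dot.
function cdot_sketch(x::AbstractVector{ComplexF64}, y::AbstractVector{ComplexF64})
    a = reinterpret(reshape, Float64, x)  # 2 × length(x)
    b = reinterpret(reshape, Float64, y)
    s_re = 0.0
    s_im = 0.0
    @tturbo for i in axes(a, 2)
        s_re += a[1, i] * b[1, i] + a[2, i] * b[2, i]
        s_im += a[1, i] * b[2, i] - a[2, i] * b[1, i]
    end
    Complex(s_re, s_im)
end
```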
@@ -186,7 +186,7 @@ function cdot(x::AbstractVector{Complex{Float64}}, A::AbstractMatrix{Complex{Flo
     c[]
 end
 ```
-![complexdot3](../assets/threadedcomplexdot3product.svg)
+![complexdot3](https://github.com/JuliaSIMD/LoopVectorization.jl/raw/docsassets/docs/src/assets/threadedcomplexdot3product.svg)
 
 
 When testing on my laptop, the `C` implementation ultimately won, but I will need to investigate further to tell whether this benchmark benefits from hyperthreading,
 or if it's because LoopVectorization's memory access patterns are less friendly.
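
The three-argument dot product benchmarked in this hunk computes `x' * A * y`. A hedged real-valued sketch of that loop structure follows; the complex version in the docs additionally reinterprets its data, so this shows only the shape of the computation, not the file's `cdot`:

```julia
using LoopVectorization

# Hedged sketch: three-argument dot product x' * A * y over real data,
# threaded and vectorized with @tturbo. The reduction runs over both loops.
function dot3_sketch(x, A, y)
    s = zero(promote_type(eltype(x), eltype(A), eltype(y)))
    @tturbo for n in axes(A, 2), m in axes(A, 1)
        s += x[m] * A[m, n] * y[n]
    end
    s
end
```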
@@ -208,7 +208,7 @@ function A_mul_B!(C, A, B)
 end
 ```
 Benchmarks over the size range `10:5:300`:
-![matmul](../assets/gemm_Float64_10_300_cascadelake_AVX512__multithreaded_logscale.svg)
+![matmul](https://raw.githubusercontent.com/JuliaSIMD/LoopVectorization.jl/docsassets/docs/src/assets/gemm_Float64_10_500_cascadelake_AVX512__multithreaded.svg)
 
 Because LoopVectorization doesn't do cache optimizations yet, MKL, OpenBLAS, and Octavian will all pull ahead for larger matrices. This CPU has a 1 MiB L2 cache per core and 18 cores:
 ```julia
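
Finally, a hedged sketch of the kind of multithreaded matrix multiply the last hunk's `A_mul_B!` refers to; the file's own definition may differ in loop order or other details:

```julia
using LoopVectorization

# Hedged sketch: a naive triple-loop matmul, threaded and vectorized by
# @tturbo. There is no cache blocking, which is why MKL, OpenBLAS, and
# Octavian pull ahead once the matrices outgrow the L2 cache.
function A_mul_B_sketch!(C, A, B)
    @tturbo for n in axes(B, 2), m in axes(A, 1)
        Cmn = zero(eltype(C))
        for k in axes(A, 2)
            Cmn += A[m, k] * B[k, n]
        end
        C[m, n] = Cmn
    end
    return C
end
```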
