The complex dot product is more compute-bound than the real dot product. Given the same number of elements, we require `2x` the memory for complex numbers, `4x` the floating point arithmetic,
and because we have an array of structs rather than a struct of arrays, we need additional instructions to shuffle the data.
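One way to sidestep the shuffling is to reinterpret the complex vectors as 2×N real arrays, so the loop body is all-real arithmetic. The following is a minimal sketch of that approach, assuming the conjugating convention of BLAS's `zdotc`; it is illustrative, not necessarily the benchmarked implementation:

```julia
using LoopVectorization

# Sketch of a conjugating complex dot product, sum(conj.(x) .* y).
# reinterpret(reshape, ...) gives a 2×N view: row 1 holds the real
# parts and row 2 the imaginary parts.
function cdot(x::AbstractVector{Complex{Float64}}, y::AbstractVector{Complex{Float64}})
    xr = reinterpret(reshape, Float64, x)
    yr = reinterpret(reshape, Float64, y)
    sre = 0.0  # real part of the accumulator
    sim = 0.0  # imaginary part of the accumulator
    @tturbo for i in eachindex(y)
        sre += xr[1, i] * yr[1, i] + xr[2, i] * yr[2, i]
        sim += xr[1, i] * yr[2, i] - xr[2, i] * yr[1, i]
    end
    Complex(sre, sim)
end
```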
If we take this further to the three-argument dot product `x' * A * y`, which isn't implemented in BLAS, `@tturbo` holds a substantial advantage over the competition.
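A three-argument kernel can use the same reinterpret-to-real trick. Here is a minimal sketch under that assumption; the loop structure and local names are illustrative rather than the exact benchmarked code:

```julia
using LoopVectorization

# Sketch of the three-argument dot product x' * A * y, with the
# complex arrays reinterpreted to carry a leading real/imaginary axis.
function cdot(x::AbstractVector{Complex{Float64}},
              A::AbstractMatrix{Complex{Float64}},
              y::AbstractVector{Complex{Float64}})
    xr = reinterpret(reshape, Float64, x)  # 2×M
    Ar = reinterpret(reshape, Float64, A)  # 2×M×N
    yr = reinterpret(reshape, Float64, y)  # 2×N
    sre = 0.0
    sim = 0.0
    @tturbo for n in axes(Ar, 3), m in axes(Ar, 2)
        # t = A[m, n] * y[n], expanded into real arithmetic
        tre = Ar[1, m, n] * yr[1, n] - Ar[2, m, n] * yr[2, n]
        tim = Ar[1, m, n] * yr[2, n] + Ar[2, m, n] * yr[1, n]
        # accumulate conj(x[m]) * t
        sre += xr[1, m] * tre + xr[2, m] * tim
        sim += xr[1, m] * tim - xr[2, m] * tre
    end
    Complex(sre, sim)
end
```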
When testing on my laptop, the `C` implementation ultimately won, but I will need to investigate further to tell whether that is because this benchmark benefits from hyperthreading,
or because LoopVectorization's memory access patterns are less friendly.
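One way to probe the hyperthreading question is to compare physical and logical core counts and rerun the benchmark at both thread counts. This sketch assumes Hwloc.jl is installed and that the benchmark lives in a hypothetical `bench.jl`:

```julia
using Hwloc

physical = Hwloc.num_physical_cores()  # hardware cores
logical  = Sys.CPU_THREADS             # includes hyperthreads

# Run e.g. `JULIA_NUM_THREADS=$physical julia bench.jl` and again with
# `JULIA_NUM_THREADS=$logical julia bench.jl`; if the latter wins, the
# kernel is profiting from hyperthreading rather than raw core count.
println("physical cores: $physical, logical cores: $logical")
```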
Because LoopVectorization doesn't do cache optimizations yet, MKL, OpenBLAS, and Octavian will all pull ahead for larger matrices, once the working set no longer fits in cache. This CPU has a 1 MiB L2 cache per core and 18 cores.
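A rough estimate of where that crossover sits, assuming the working set is dominated by the matrix itself:

```julia
# Back-of-the-envelope: the largest square Complex{Float64} matrix
# that still fits in one core's 1 MiB L2 cache.
l2_bytes = 1 << 20                        # 1 MiB
elems    = l2_bytes ÷ sizeof(ComplexF64)  # 16 bytes each -> 65536
n        = isqrt(elems)                   # 256, so ~256×256 fills L2
println("a $(n)×$(n) ComplexF64 matrix fills the 1 MiB L2 cache")
```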