# Array Interface

LoopVectorization uses [ArrayInterface.jl](https://github.com/SciML/ArrayInterface.jl) to describe the memory layout of arrays. Any `AbstractArray` type that implements the interface can be supported by `LoopVectorization`.
[StaticArrays.jl](https://github.com/JuliaArrays/StaticArrays.jl) and [HybridArrays.jl](https://github.com/mateuszbaran/HybridArrays.jl) are two example libraries providing array types that support the interface.
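
As a rough sketch of what supporting the interface can look like, consider a hypothetical wrapper around a dense `Array` (the type `MyWrapper` and all of its methods below are assumptions made for illustration; the exact set of methods required depends on your `ArrayInterface` version, so consult its documentation for the authoritative list):
```julia
using ArrayInterface

# Hypothetical wrapper type, used only to sketch the idea.
struct MyWrapper{T,N} <: AbstractArray{T,N}
    data::Array{T,N}
end

Base.parent(A::MyWrapper) = A.data
Base.size(A::MyWrapper) = size(A.data)
Base.getindex(A::MyWrapper, i::Int...) = getindex(A.data, i...)
Base.setindex!(A::MyWrapper, v, i::Int...) = setindex!(A.data, v, i...)

# LoopVectorization needs strided memory it can take a pointer to.
Base.strides(A::MyWrapper) = strides(A.data)
Base.unsafe_convert(::Type{Ptr{T}}, A::MyWrapper{T}) where {T} = pointer(A.data)

# Declaring the parent type is typically enough for ArrayInterface's layout
# queries (contiguity, stride rank, dense dimensions) to fall back to the
# underlying Array.
ArrayInterface.parent_type(::Type{MyWrapper{T,N}}) where {T,N} = Array{T,N}
```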

`StaticArrays.SArray` itself is not compatible, because `LoopVectorization` needs access to a pointer; `StaticArrays.MArray`s, however, are compatible. A loop featuring an incompatible `StaticArrays.StaticArray` will fall back to a version of the loop
that is not optimized by `LoopVectorization`, but simply has `@inbounds @fastmath` applied to it. This can often still yield reasonable to good performance, saving you from having to write more than one version of the loop
to get good performance and correct behavior just because the array types happen to be different.
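
As a concrete illustration (the function `sumall` below is a hypothetical example, not part of either package), the same `@avx` kernel accepts both array types; with an `SMatrix` it simply runs the fallback:
```julia
using StaticArrays, LoopVectorization

function sumall(A)
    s = zero(eltype(A))
    @avx for j in axes(A, 2), i in axes(A, 1)
        s += A[i, j]
    end
    s
end

sumall(@SMatrix rand(4, 4))  # works, via the plain @inbounds @fastmath fallback
sumall(@MMatrix rand(4, 4))  # works, via the LoopVectorization-optimized loop
```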

By supporting the interface, `LoopVectorization` can simplify implementing many operations, such as matrix multiplication, while still getting good performance. For example, instead of [a few hundred lines of code](https://github.com/JuliaArrays/StaticArrays.jl/blob/0e431022954f0207eeb2c4f661b9f76936105c8a/src/matrix_multiply.jl#L4) to define matrix multiplication in `StaticArrays`, one could simply write:
```julia
using StaticArrays, LoopVectorization

@inline function AmulB!(C, A, B)
    @avx for n ∈ axes(C,2), m ∈ axes(C,1)
        Cₘₙ = zero(eltype(C))
        for k ∈ axes(B,1)
            Cₘₙ += A[m,k] * B[k,n]
        end
        C[m,n] = Cₘₙ
    end
    C
end
@inline AmulB(A::MMatrix{M,K,T}, B::MMatrix{K,N,T}) where {M,K,N,T} = AmulB!(MMatrix{M,N,T}(undef), A, B)
@inline AmulB(A::SMatrix, B::SMatrix) = SMatrix(AmulB(MMatrix(A), MMatrix(B)))
```
Through converting back and forth between `SMatrix` and `MMatrix`, we can still use `LoopVectorization` to implement `SMatrix` multiplication, and in most cases get better performance than the unrolled methods from the library. Unfortunately, it is still suboptimal because the compiler isn't able to elide the copying, but the temporaries are all stack-allocated, making the code
allocation-free. We can benchmark our simple implementation against the `StaticArrays.SMatrix` and `StaticArrays.MMatrix` methods:
```julia
using BenchmarkTools, LinearAlgebra, DataFrames, VegaLite
BLAS.set_num_threads(1);

matdims(x::Integer) = (x, x, x)
matdims(x::NTuple{3}) = x
matflop(x::Integer) = 2x^3
matflop(x::NTuple{3}) = 2prod(x)

function runbenches(sr, ::Type{T}, fa = identity, fb = identity) where {T}
    bench_results = Matrix{Float64}(undef, length(sr), 4);
    for (i,s) ∈ enumerate(sr)
        M, K, N = matdims(s)
        Aₘ = @MMatrix rand(T, M, K)
        Bₘ = @MMatrix rand(T, K, N)
        Aₛ = Ref(SMatrix(Aₘ));
        Bₛ = Ref(SMatrix(Bₘ));
        Cₛₛ = fa(Aₛ[]) * fb(Bₛ[]);
        Cₛₗ = AmulB(fa(Aₛ[]), fb(Bₛ[]))
        Cₘₛ = similar(Cₛₛ); mul!(Cₘₛ, fa(Aₘ), fb(Bₘ));
        Cₘₗ = similar(Cₛₛ); AmulB!(Cₘₗ, fa(Aₘ), fb(Bₘ));
        @assert Array(Cₛₛ) ≈ Array(Cₛₗ) ≈ Array(Cₘₛ) ≈ Array(Cₘₗ) # Once upon a time Julia crashed on ≈ for large static arrays
        bench_results[i,1] = @belapsed $fa($Aₛ[]) * $fb($Bₛ[])
        bench_results[i,2] = @belapsed AmulB($fa($Aₛ[]), $fb($Bₛ[]))
        bench_results[i,3] = @belapsed mul!($Cₘₛ, $fa($Aₘ), $fb($Bₘ))
        bench_results[i,4] = @belapsed AmulB!($Cₘₗ, $fa($Aₘ), $fb($Bₘ))
        @show s, bench_results[i,:]
    end
    gflops = @. 1e-9 * matflop(sr) / bench_results
    array_type = append!(fill("Static", 2length(sr)), fill("Mutable", 2length(sr)))
    sa = fill("StaticArrays", length(sr)); lv = fill("LoopVectorization", length(sr));
    matmul_lib = vcat(sa, lv, sa, lv);
    sizes = reduce(vcat, (sr for _ ∈ 1:4))
    DataFrame(
        Size = sizes, Time = vec(bench_results), GFLOPS = vec(gflops),
        ArrayType = array_type, MatmulLib = matmul_lib, MulType = array_type .* ' ' .* matmul_lib
    )
end

df = runbenches(1:24, Float64);
df |> @vlplot(:line, x = :Size, y = :GFLOPS, color = :MulType, height=640, width=960) |> save("sarraymatmul.svg")
```
This yields the plot saved to `sarraymatmul.svg`.

Our `AmulB!` for `MMatrix`es was the fastest at all sizes except 2×2, where it lost out to `AmulB` for `SMatrix`, which in turn was faster than the hundreds of lines of unrolled
`StaticArrays` code at all sizes except 3×3, 5×5, and 6×6.

Additionally, `HybridArrays.jl` can be used when we have a mix of dynamically and statically sized dimensions. Suppose, for example, that we want to multiply two matrices whose elements are themselves 3x3 matrices:
```julia
using HybridArrays, StaticArrays, LoopVectorization, BenchmarkTools

A_static = [@SMatrix(rand(3,3)) for i in 1:32, j in 1:32];
B_static = [@SMatrix(rand(3,3)) for i in 1:32, j in 1:32];
C_static = similar(A_static);

A_hybrid = HybridArray{Tuple{StaticArrays.Dynamic(),StaticArrays.Dynamic(),3,3}}(permutedims(reshape(reinterpret(Float64, A_static), (3,3,size(A_static)...)), (3,4,1,2)));
B_hybrid = HybridArray{Tuple{StaticArrays.Dynamic(),StaticArrays.Dynamic(),3,3}}(permutedims(reshape(reinterpret(Float64, B_static), (3,3,size(B_static)...)), (3,4,1,2)));
C_hybrid = HybridArray{Tuple{StaticArrays.Dynamic(),StaticArrays.Dynamic(),3,3}}(permutedims(reshape(reinterpret(Float64, C_static), (3,3,size(C_static)...)), (3,4,1,2)));

# C is M x N x I x J
# A is M x K x I x L
# B is K x N x L x J
function bmul!(C, A, B)
    @avx for n in axes(C,2), m in axes(C,1), j in axes(C,4), i in axes(C,3)
        Cₘₙᵢⱼ = zero(eltype(C))
        for k in axes(B,1), l in axes(B,3)
            Cₘₙᵢⱼ += A[m,k,i,l] * B[k,n,l,j]
        end
        C[m,n,i,j] = Cₘₙᵢⱼ
    end
    C
end
```
This yields:
```julia
julia> @benchmark bmul!($C_hybrid, $A_hybrid, $B_hybrid)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     15.550 μs (0.00% GC)
  median time:      15.663 μs (0.00% GC)
  mean time:        15.685 μs (0.00% GC)
  maximum time:     50.286 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark mul!($C_static, $A_static, $B_static)
BenchmarkTools.Trial:
  memory estimate:  336 bytes
  allocs estimate:  6
  --------------
  minimum time:     277.736 μs (0.00% GC)
  median time:      278.035 μs (0.00% GC)
  mean time:        278.310 μs (0.00% GC)
  maximum time:     299.259 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> all(I -> C_hybrid[Tuple(I)[1],Tuple(I)[2],:,:] ≈ C_static[I], CartesianIndices(C_static))
true

julia> length(C_hybrid) * size(B_hybrid,1) * size(B_hybrid,3) * 2e-9 / 15.55e-6 # GFLOPS loops + hybrid arrays
113.79241157556271

julia> length(C_hybrid) * size(B_hybrid,1) * size(B_hybrid,3) * 2e-9 / 277.736e-6 # GFLOPS LinearAlgebra.mul! + StaticArrays
6.371057407034018
```
When using `LoopVectorization` + `HybridArrays`, you will often find that you get the best performance when the leading dimensions are either a multiple of 8 or relatively large.
This often means *not* leading with a small static dimension, even though doing so is commonly best practice when not using `LoopVectorization`.
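
If you want to see the effect of the layout on your own machine, one possible experiment is to store the same data with the static dimensions leading and benchmark the analogous kernel against `bmul!` above. Everything below (the names `A_hsf`, `bmul_static_first!`, and so on) is a hypothetical sketch rather than part of any package, and per the advice above the static-leading layout will typically be the slower of the two:
```julia
# Same data as before, but with the static 3x3 dimensions leading:
# A is I x L x M x K, B is L x J x K x N, C is I x J x M x N.
A_hsf = HybridArray{Tuple{3,3,StaticArrays.Dynamic(),StaticArrays.Dynamic()}}(Array(reshape(reinterpret(Float64, A_static), (3,3,size(A_static)...))));
B_hsf = HybridArray{Tuple{3,3,StaticArrays.Dynamic(),StaticArrays.Dynamic()}}(Array(reshape(reinterpret(Float64, B_static), (3,3,size(B_static)...))));
C_hsf = similar(A_hsf);

function bmul_static_first!(C, A, B)
    @avx for n in axes(C,4), m in axes(C,3), j in axes(C,2), i in axes(C,1)
        Cᵢⱼₘₙ = zero(eltype(C))
        for k in axes(B,3), l in axes(B,1)
            Cᵢⱼₘₙ += A[i,l,m,k] * B[l,j,k,n]
        end
        C[i,j,m,n] = Cᵢⱼₘₙ
    end
    C
end

@benchmark bmul_static_first!($C_hsf, $A_hsf, $B_hsf)
```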

If you happen to like tensor operations such as this last example, you are also strongly encouraged to check out [Tullio.jl](https://github.com/mcabbott/Tullio.jl), which provides index notation that is both more convenient and less error-prone than writing out loops, and uses both `LoopVectorization` (if you `using LoopVectorization` before calling `@tullio`) and multiple threads to maximize performance.
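
For example, the batched multiplication from the `HybridArrays` example above could plausibly be written as a single line of index notation (an illustrative sketch, not a tuned implementation):
```julia
using Tullio, LoopVectorization

# k and l appear only on the right-hand side, so Tullio sums over them,
# writing the result into the existing C_hybrid in place.
@tullio C_hybrid[m, n, i, j] = A_hybrid[m, k, i, l] * B_hybrid[k, n, l, j]
```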