Commit fded771

Added ArrayInterface docs.

1 parent b32132c

File tree

4 files changed: +150 -0 lines changed

docs/make.jl

Lines changed: 1 addition & 0 deletions

@@ -8,6 +8,7 @@ makedocs(;
         "Getting Started" => "getting_started.md",
         "Examples" => [
             "examples/matrix_multiplication.md",
+            "examples/array_interface.md",
             "examples/matrix_vector_ops.md",
             "examples/dot_product.md",
             "examples/special_functions.md",

docs/src/assets/sarraymatmul.png

91.2 KB

docs/src/examples/array_interface.md

Lines changed: 148 additions & 0 deletions
@@ -0,0 +1,148 @@
# Array Interface

LoopVectorization uses [ArrayInterface.jl](https://github.com/SciML/ArrayInterface.jl) to describe the memory layout of arrays. Any `AbstractArray` type that implements the interface can be supported by `LoopVectorization`.
[StaticArrays.jl](https://github.com/JuliaArrays/StaticArrays.jl) and [HybridArrays.jl](https://github.com/mateuszbaran/HybridArrays.jl) are two example libraries providing array types that support the interface.
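As a rough illustration of what "describing the memory layout" involves, here is a sketch of a few trait queries of the kind `LoopVectorization` relies on. The trait names reflect one version of ArrayInterface and may have moved or been renamed since:
```julia
using ArrayInterface, StaticArrays

A = @MMatrix rand(4, 4);

ArrayInterface.contiguous_axis(typeof(A)) # which axis is contiguous in memory
ArrayInterface.stride_rank(typeof(A))     # relative ordering of the strides
ArrayInterface.dense_dims(typeof(A))      # which dimensions are densely packed
ArrayInterface.size(A)                    # sizes, statically known where possible
```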
`StaticArrays.SArray` itself is not compatible, because `LoopVectorization` needs access to a pointer. However, `StaticArrays.MArray`s are compatible. Loops featuring an incompatible `StaticArrays.StaticArray` will fall back to a loop that is not optimized by `LoopVectorization`, but simply has `@inbounds @fastmath` applied. This can often still yield reasonable to good performance, saving you from having to write more than one version of a loop to get good performance and correct behavior just because the array types happen to differ.
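For instance (`mydot` is a hypothetical example, not part of either library), the same `@avx` loop accepts both array types, but only the `MVector` call is actually optimized; the `SVector` call runs the `@inbounds @fastmath` fallback:
```julia
using StaticArrays, LoopVectorization

function mydot(a, b)
    s = zero(promote_type(eltype(a), eltype(b)))
    @avx for i ∈ eachindex(a)
        s += a[i] * b[i]
    end
    s
end

a = @SVector rand(16); b = @SVector rand(16);
mydot(a, b)                   # fallback: @inbounds @fastmath loop
mydot(MVector(a), MVector(b)) # optimized by LoopVectorization
```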
By supporting the interface, `LoopVectorization` can simplify implementing many operations, such as matrix multiplication, while still achieving good performance. For example, instead of [a few hundred lines of code](https://github.com/JuliaArrays/StaticArrays.jl/blob/0e431022954f0207eeb2c4f661b9f76936105c8a/src/matrix_multiply.jl#L4) to define matrix multiplication in `StaticArrays`, one could simply write:
```julia
using StaticArrays, LoopVectorization

@inline function AmulB!(C, A, B)
    @avx for n ∈ axes(C,2), m ∈ axes(C,1)
        Cₘₙ = zero(eltype(C))
        for k ∈ axes(B,1)
            Cₘₙ += A[m,k] * B[k,n]
        end
        C[m,n] = Cₘₙ
    end
    C
end
@inline AmulB(A::MMatrix{M,K,T}, B::MMatrix{K,N,T}) where {M,K,N,T} = AmulB!(MMatrix{M,N,T}(undef), A, B)
@inline AmulB(A::SMatrix, B::SMatrix) = SMatrix(AmulB(MMatrix(A), MMatrix(B)))
```
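As a quick sanity check (the size here is arbitrary), the result matches `StaticArrays`' own multiplication:
```julia
A = @SMatrix rand(5, 5);
B = @SMatrix rand(5, 5);
AmulB(A, B) ≈ A * B # true
```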
By converting back and forth between `SMatrix` and `MMatrix`, we can still use `LoopVectorization` to implement `SMatrix` multiplication, and in most cases get better performance than the unrolled methods from the library. This is still suboptimal, because the compiler is unable to elide the copying, but the temporaries are all stack-allocated, keeping the code allocation-free. We can benchmark our simple implementation against the `StaticArrays.SMatrix` and `StaticArrays.MMatrix` methods:
```julia
using BenchmarkTools, LinearAlgebra, DataFrames, VegaLite
BLAS.set_num_threads(1);

matdims(x::Integer) = (x, x, x)
matdims(x::NTuple{3}) = x
matflop(x::Integer) = 2x^3
matflop(x::NTuple{3}) = 2prod(x)

function runbenches(sr, ::Type{T}, fa = identity, fb = identity) where {T}
    bench_results = Matrix{Float64}(undef, length(sr), 4);
    for (i,s) ∈ enumerate(sr)
        M, K, N = matdims(s)
        Aₘ = @MMatrix rand(T, M, K)
        Bₘ = @MMatrix rand(T, K, N)
        Aₛ = Ref(SMatrix(Aₘ));
        Bₛ = Ref(SMatrix(Bₘ));
        Cₛₛ = fa(Aₛ[]) * fb(Bₛ[]);
        Cₛₗ = AmulB(fa(Aₛ[]), fb(Bₛ[]))
        Cₘₛ = similar(Cₛₛ); mul!(Cₘₛ, fa(Aₘ), fb(Bₘ));
        Cₘₗ = similar(Cₛₛ); AmulB!(Cₘₗ, fa(Aₘ), fb(Bₘ));
        @assert Array(Cₛₛ) ≈ Array(Cₛₗ) ≈ Array(Cₘₛ) ≈ Array(Cₘₗ) # Once upon a time Julia crashed on ≈ for large static arrays
        bench_results[i,1] = @belapsed $fa($Aₛ[]) * $fb($Bₛ[])
        bench_results[i,2] = @belapsed AmulB($fa($Aₛ[]), $fb($Bₛ[]))
        bench_results[i,3] = @belapsed mul!($Cₘₛ, $fa($Aₘ), $fb($Bₘ))
        bench_results[i,4] = @belapsed AmulB!($Cₘₗ, $fa($Aₘ), $fb($Bₘ))
        @show s, bench_results[i,:]
    end
    gflops = @. 1e-9 * matflop(sr) / bench_results
    array_type = append!(fill("Static", 2length(sr)), fill("Mutable", 2length(sr)))
    sa = fill("StaticArrays", length(sr)); lv = fill("LoopVectorization", length(sr));
    matmul_lib = vcat(sa, lv, sa, lv);
    sizes = reduce(vcat, (sr for _ ∈ 1:4))
    DataFrame(
        Size = sizes, Time = vec(bench_results), GFLOPS = vec(gflops),
        ArrayType = array_type, MatmulLib = matmul_lib, MulType = array_type .* ' ' .* matmul_lib
    )
end

df = runbenches(1:24, Float64);
df |> @vlplot(:line, x = :Size, y = :GFLOPS, color = :MulType, height=640,width=960) |> save("sarraymatmul.svg")
```
This yields:
![sarray_benchmarks](../assets/sarraymatmul.png)
Our `AmulB!` for `MMatrix`es was the fastest at all sizes except 2×2, where it lost out to `AmulB` for `SMatrix`, which was in turn faster than the hundreds of lines of `StaticArrays` code at all sizes except 3×3, 5×5, and 6×6.
Additionally, `HybridArrays.jl` can be used when we have a mix of dynamically and statically sized dimensions. Suppose we want to multiply two matrices whose elements are themselves 3×3 matrices:
```julia
using HybridArrays, StaticArrays, LoopVectorization, BenchmarkTools

A_static = [@SMatrix(rand(3,3)) for i in 1:32, j in 1:32];
B_static = [@SMatrix(rand(3,3)) for i in 1:32, j in 1:32];
C_static = similar(A_static);

A_hybrid = HybridArray{Tuple{StaticArrays.Dynamic(),StaticArrays.Dynamic(),3,3}}(permutedims(reshape(reinterpret(Float64, A_static), (3,3,size(A_static)...)), (3,4,1,2)));
B_hybrid = HybridArray{Tuple{StaticArrays.Dynamic(),StaticArrays.Dynamic(),3,3}}(permutedims(reshape(reinterpret(Float64, B_static), (3,3,size(B_static)...)), (3,4,1,2)));
C_hybrid = HybridArray{Tuple{StaticArrays.Dynamic(),StaticArrays.Dynamic(),3,3}}(permutedims(reshape(reinterpret(Float64, C_static), (3,3,size(C_static)...)), (3,4,1,2)));

# C is M × N × I × J
# A is M × K × I × L
# B is K × N × L × J
function bmul!(C, A, B)
    @avx for n in axes(C,2), m in axes(C,1), j in axes(C,4), i in axes(C,3)
        Cₘₙᵢⱼ = zero(eltype(C))
        for k in axes(B,1), l in axes(B,3)
            Cₘₙᵢⱼ += A[m,k,i,l] * B[k,n,l,j]
        end
        C[m,n,i,j] = Cₘₙᵢⱼ
    end
    C
end
```
This yields:
```julia
julia> @benchmark bmul!($C_hybrid, $A_hybrid, $B_hybrid)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     15.550 μs (0.00% GC)
  median time:      15.663 μs (0.00% GC)
  mean time:        15.685 μs (0.00% GC)
  maximum time:     50.286 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark mul!($C_static, $A_static, $B_static)
BenchmarkTools.Trial:
  memory estimate:  336 bytes
  allocs estimate:  6
  --------------
  minimum time:     277.736 μs (0.00% GC)
  median time:      278.035 μs (0.00% GC)
  mean time:        278.310 μs (0.00% GC)
  maximum time:     299.259 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> all(I -> C_hybrid[Tuple(I)[1], Tuple(I)[2], :, :] ≈ C_static[I], CartesianIndices(C_static))
true

julia> length(C_hybrid) * size(B_hybrid,1) * size(B_hybrid,3) * 2e-9 / 15.55e-6 # GFLOPS loops + hybrid arrays
113.79241157556271

julia> length(C_hybrid) * size(B_hybrid,1) * size(B_hybrid,3) * 2e-9 / 277.736e-6 # GFLOPS LinearAlgebra.mul! + StaticArrays
6.371057407034018
```
When using `LoopVectorization` + `HybridArrays`, you may find that you get the best performance when the leading dimensions are either an even multiple of 8 or relatively large.
This will often mean not leading with a small static dimension, even though that is commonly best practice when not using `LoopVectorization`.
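As a sketch of what that means here (`A_static_first` is a hypothetical name, reusing `A_static` from above), compare the dynamic-leading layout used earlier with one that leads with the small static dimensions:
```julia
# Static dims leading: common best practice *without* LoopVectorization.
A_static_first = HybridArray{Tuple{3,3,StaticArrays.Dynamic(),StaticArrays.Dynamic()}}(
    reshape(reinterpret(Float64, A_static), (3, 3, size(A_static)...))
);

size(A_static_first) # (3, 3, 32, 32): leading (contiguous) dimension is only 3
size(A_hybrid)       # (32, 32, 3, 3): leading dimension of 32, a multiple of 8
```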
If you like tensor operations such as the one in this last example, you are also strongly encouraged to check out [Tullio.jl](https://github.com/mcabbott/Tullio.jl), which provides index notation that is both more convenient and less error-prone than writing out loops, and which uses both `LoopVectorization` (if you `using LoopVectorization` before `@tullio`) and multiple threads to maximize performance.
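For instance, here is a sketch of the batched multiplication above in Tullio's index notation, assuming the arrays defined earlier; the repeated right-hand-side indices `k` and `l` are summed over:
```julia
using LoopVectorization, Tullio # load LoopVectorization first so @tullio can use it

# Equivalent to bmul!(C_hybrid, A_hybrid, B_hybrid); k and l are contracted.
@tullio C_hybrid[m, n, i, j] = A_hybrid[m, k, i, l] * B_hybrid[k, n, l, j]
```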

docs/src/index.md

Lines changed: 1 addition & 0 deletions

@@ -9,6 +9,7 @@ Please file an issue if you run into any problems.
 Pages = [
     "getting_started.md",
     "examples/matrix_multiplication.md",
+    "examples/array_interface.md",
     "examples/matrix_vector_ops.md",
     "examples/dot_product.md",
     "examples/filtering.md",
