This library provides the `@avx` macro, which may be used to prefix a `for` loop or broadcast statement. It then tries to vectorize the loop to improve runtime performance.

The macro assumes that loop iterations can be reordered. It also currently supports only simple nested loops, where the loop bounds of inner loops are constant across iterations of the outer loop, and there is only a single loop at each level of the loop nest. These limitations should be removed in a future version.
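For example, a triangular loop nest like the following hypothetical sum could not currently be handled, because the inner loop's bounds vary with the outer iteration:
```julia
# Hypothetical illustration of an unsupported pattern: the inner loop's
# upper bound depends on `i`, so it is not constant across outer iterations.
function lowertrianglesum(A)
    s = 0.0
    for i ∈ 1:size(A,1)
        for j ∈ 1:i  # bounds change with `i`; not currently supported by @avx
            s += A[i,j]
        end
    end
    s
end
```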
A simple example with a single loop is the dot product:
```julia
using LoopVectorization, BenchmarkTools
function mydot(a, b)
    s = 0.0
    @inbounds @simd for i ∈ eachindex(a, b)
        s += a[i] * b[i]
    end
    s
end
function mydotavx(a, b)
    s = 0.0
    @avx for i ∈ eachindex(a, b)
        s += a[i] * b[i]
    end
    s
end
a = rand(256); b = rand(256);
@btime mydot($a, $b)
@btime mydotavx($a, $b)
a = rand(43); b = rand(43);
@btime mydot($a, $b)
@btime mydotavx($a, $b)
```
On most recent CPUs, the performance of the dot product is bounded by the speed at which it can load data; most recent x86_64 CPUs can perform two aligned loads and two fused multiply-adds (`fma`) per clock cycle. However, the dot product requires two loads per `fma`.
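As a back-of-envelope check, using only the per-cycle figures quoted above (and anticipating the self-dot comparison below):
```julia
# Throughput bound from the figures above: 2 loads and 2 fmas per cycle.
loads_per_cycle = 2
fmas_per_cycle  = 2
# The dot product needs 2 loads per fma, capping it at 1 fma per cycle.
dot_rate     = min(fmas_per_cycle, loads_per_cycle / 2)  # 1.0
# A self-dot needs only 1 load per fma, allowing the full 2 fmas per cycle.
selfdot_rate = min(fmas_per_cycle, loads_per_cycle / 1)  # 2.0
selfdot_rate / dot_rate  # 2.0, the expected gap between the two kernels
```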
A self-dot function, on the other hand, requires only one load per `fma`:
```julia
function myselfdot(a)
    s = 0.0
    @inbounds @simd for i ∈ eachindex(a)
        s += a[i] * a[i]
    end
    s
end
function myselfdotavx(a)
    s = 0.0
    @avx for i ∈ eachindex(a)
        s += a[i] * a[i]
    end
    s
end
a = rand(256);
@btime myselfdotavx($a)
@btime myselfdot($a)
@btime myselfdotavx($b)
@btime myselfdot($b)
```
For this reason, the `@avx` version is roughly twice as fast. The `@inbounds @simd` version, however, is not, because it runs into the problem of loop-carried dependencies: to compute `s_new = s_old + a[i]*b[i]`, we must first have finished calculating `s_old`, but -- while two `fma` instructions can be initiated per cycle -- each one takes several clock cycles to complete.

For this reason, we need to unroll the operation to run several independent instances concurrently. The `@avx` macro models this cost to try to pick an optimal unroll factor.
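To make the idea concrete, here is a hand-unrolled sketch (an illustration only, not what `@avx` actually emits) that keeps four independent accumulators in flight, so the `fma`s do not wait on one another:
```julia
# Hypothetical hand-unrolled dot product with four independent accumulators;
# each partial sum forms its own dependency chain, so several fmas can be
# in flight at once instead of serializing on a single `s`.
function mydot_unrolled4(a, b)
    s1 = s2 = s3 = s4 = 0.0
    i = firstindex(a)
    @inbounds while i + 3 <= lastindex(a)
        s1 += a[i]   * b[i]
        s2 += a[i+1] * b[i+1]
        s3 += a[i+2] * b[i+2]
        s4 += a[i+3] * b[i+3]
        i += 4
    end
    @inbounds while i <= lastindex(a)  # handle the remainder
        s1 += a[i] * b[i]
        i += 1
    end
    (s1 + s2) + (s3 + s4)
end
```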
Note that 14 and 12 nm Ryzen chips can only perform one full-width `fma` per clock cycle (along with two loads), so they should see similar performance on the dot and self-dot benchmarks. I haven't verified this, but would like to hear from anyone who can.
We can also vectorize fancier loops. A likely familiar example to dive into:
```julia
function mygemm!(C, A, B)
    @inbounds for i ∈ 1:size(A,1), j ∈ 1:size(B,2)
        Cᵢⱼ = 0.0
        @fastmath for k ∈ 1:size(A,2)
            Cᵢⱼ += A[i,k] * B[k,j]
        end
        C[i,j] = Cᵢⱼ
    end
end
function mygemmavx!(C, A, B)
    @avx for i ∈ 1:size(A,1), j ∈ 1:size(B,2)
        Cᵢⱼ = 0.0
        for k ∈ 1:size(A,2)
            Cᵢⱼ += A[i,k] * B[k,j]
        end
        C[i,j] = Cᵢⱼ
    end
end
M, K, N = 72, 75, 71;
C1 = Matrix{Float64}(undef, M, N); A = randn(M, K); B = randn(K, N);
C2 = similar(C1); C3 = similar(C1);
@btime mygemmavx!($C1, $A, $B)
@btime mygemm!($C2, $A, $B)
using LinearAlgebra, Test
@test all(C1 .≈ C2)
BLAS.set_num_threads(1); BLAS.vendor()
@btime mul!($C3, $A, $B)
@test all(C1 .≈ C3)
```
It can produce a decent macro kernel. In the future, I would like it to also model the cost of memory movement in the L1 and L2 caches, and use that to generate blocking loops around the macro kernel, following the work of [Low, et al. (2016)](http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf).

Until then, performance will degrade rapidly compared to BLAS as the size of the matrices increases. The advantage of the `@avx` macro, however, is that it is general: not every operation is supported by BLAS.
For example, what if `A` were the outer product of two vectors?
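Below is a hypothetical sketch of one way to express this: computing `C = (u * v') * B` directly from the factors `u` and `v`, without ever materializing `A` (the function name and loop structure here are illustrative assumptions, not an API from the library):
```julia
# Hypothetical: A = u * v' is never formed; its entries u[i] * v[k] are
# computed on the fly inside the same loop nest as mygemmavx! above.
function mygemmouter_avx!(C, u, v, B)
    @avx for i ∈ eachindex(u), j ∈ 1:size(B,2)
        Cᵢⱼ = 0.0
        for k ∈ eachindex(v)
            Cᵢⱼ += (u[i] * v[k]) * B[k,j]
        end
        C[i,j] = Cᵢⱼ
    end
end
```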
Another example, a straightforward operation expressed well via broadcasting:
```julia
a = rand(37); B = rand(37, 47); c = rand(47); c′ = c';

d1 = @. a + B * c′;
d2 = @avx @. a + B * c′;

@test all(d1 .≈ d2)

@btime @. $d1 = $a + $B * $c′;
@btime @avx @. $d2 = $a + $B * $c′;
@test all(d1 .≈ d2)
```
This can be optimized in a similar manner to the BLAS example, albeit to a much smaller degree because the naive broadcast already benefits from vectorization (unlike the naive matrix-multiplication loops).
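For reference, the broadcast fuses into a single loop nest; a plain-loop equivalent (written out here only to make the indexing explicit, under Julia's standard broadcasting rules) is:
```julia
# Loop form of d = @. a + B * c′: `a` is a length-37 column and `c′` a
# 1×47 row, so the result is 37×47 with d[i,j] = a[i] + B[i,j] * c[j].
function broadcastloops!(d, a, B, c)
    for j ∈ axes(B, 2), i ∈ axes(B, 1)
        d[i,j] = a[i] + B[i,j] * c[j]
    end
    d
end
```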
Originally, LoopVectorization only provided a simple, dumb transform on a single loop via the `@vectorize` macro. This transformation took element-type and unroll-factor arguments and performed no analysis of the loop, simply applying the specified arguments. For backwards compatibility, this macro is still supported, but it may eventually be deprecated.
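Based purely on the description above, legacy usage looked roughly like the following (the exact argument syntax here is an assumption, not verified against the old API):
```julia
# Assumed syntax: the element type and unroll factor are passed explicitly;
# no analysis is performed, the loop is simply transformed as told.
function mydot_legacy(a, b)
    s = 0.0
    @vectorize Float64 4 for i ∈ eachindex(a, b)
        s += a[i] * b[i]
    end
    s
end
```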