Conversation

@amilsted (Contributor) commented Oct 3, 2022

As documented in #46865, MulAddMul() slows down BLAS mul!() unnecessarily when constant propagation fails, or when alpha and beta are simply not constants.

This is an attempt to work around MulAddMul() in the BLAS.gemm! case, where it is apparently not used for anything anyway.
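For concreteness, the overhead can be observed with something like the following (a rough sketch using BenchmarkTools; sizes are illustrative and exact timings depend on the Julia version):

```julia
using LinearAlgebra, BenchmarkTools

n = 16
A, B, C = randn(n, n), randn(n, n), randn(n, n)
α, β = 2.0, 0.5

# Runtime alpha/beta: const-prop cannot fire, so the MulAddMul construction
# inside mul! may allocate and dispatch at runtime.
@btime mul!($C, $A, $B, $α, $β)

# Literal constants: const-prop usually succeeds and the overhead vanishes.
@btime mul!($C, $A, $B, 2.0, 0.5)
```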

@kshyatt added the "linear algebra" label Oct 3, 2022
@amilsted (Contributor, Author) commented Oct 3, 2022

It would be great to see this added to 1.8.x if approved.

@N5N3 added the "performance (Must go faster)" label Oct 4, 2022
@N5N3 (Member) commented Oct 4, 2022

IIUC this overhead should be negligible for large matrices. Out of curiosity, does StaticArrays perform better than BLAS at the sizes you're concerned with?

BTW, other wrapper functions (syrk, herk) should have a similar problem. If we really want this change, then we'd better make their APIs consistent.

@amilsted (Contributor, Author) commented Oct 4, 2022

Yes, I suppose this would become negligible with large matrices. It is not negligible at, say, 16x16.

It does indeed look like StaticArrays is faster here for my artificial tests, which is cool! We can try to use them in such cases in real code, but it puts some extra burden on the users of the tools we are providing right now. I think it doesn't hurt to also fix these superfluous allocations for the Array case :)

Edit: It seems using StaticArrays in the problems we actually care about (basically just ODE integration where the steps are GEMM or GEMV ops) currently produces more allocations and is slower, even at D=16. I'm not sure why just yet.

Btw, the overhead of these small allocations can be particularly non-negligible when doing many such tasks in parallel using threading, due to GC contention.

@amilsted (Contributor, Author) commented Oct 4, 2022

Of course it would be nice to remove the reliance on constant-propagation from the whole of gemm_wrapper!() and friends, including the 2x2 and 3x3 specializations, since

  1. it seems to be very fragile and
  2. it is not wrong for alpha and beta to not be constant and these cases should be performant too!

Unfortunately, this seems like a bit of a minefield! Seems like there ought to be branching on alpha and beta values at a fairly high level, then maybe generated functions can be used to avoid duplicating the native matmul cases?

@Micket (Contributor) commented Oct 4, 2022

When I looked around in the code, this change is actually what I expected to see. The MulAddMul mechanism, which dispatches out to several different methods based on runtime values, seemed very confusing.

Since it comes with significant downsides (performance hits like this one when constant prop fails, additional compilation due to dispatch, and making the code quite a bit more confusing on a first read), I imagine there were compelling reasons to write the code like this in the first place?

Assuming constant prop worked as it should, the only benefit I could see is that the dispatched code would possibly avoid one or two conditional statements (when optimizing for alpha or beta = 0).
But surely this would only matter for very tiny systems, and I suspect even then it probably wouldn't be detectable.
And I suspect that in most cases where we actually have constants, they would rarely be 0.
And for large systems, it surely matters even less.
So I have a hard time seeing the benefit of MulAddMul anywhere.

And, unless gemm_wrapper! is inlined, wouldn't this

return matmul2x2!(C, tA, tB, A, B, MulAddMul(α, β))

surely always result in a dispatch? Because that sounds like a big ouch

@amilsted (Contributor, Author) commented Oct 4, 2022

Is it possible to detect propagated constants at compile time? If so, a hacky (but I think type-stable?) workaround might be to restore MulAddMul to mul!() but construct MulAddMul{false, false, typeof(alpha), typeof(beta)}(alpha, beta) whenever alpha and beta are not compile-time constant.

Of course, this is not optimal when alpha and beta are not constant but alpha == 1 and/or beta == 0.
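In code, that fallback would be roughly the following (a sketch only; MulAddMul is an internal of LinearAlgebra and its layout may change, and the helper name here is hypothetical):

```julia
using LinearAlgebra: MulAddMul

# Fully generic fallback: {ais1 = false, bis0 = false} assumes nothing about
# the values, so it is correct for any alpha and beta, and the resulting type
# is concrete regardless of whether const-prop succeeded.
@inline fallback_muladdmul(alpha::TA, beta::TB) where {TA<:Number, TB<:Number} =
    MulAddMul{false, false, TA, TB}(alpha, beta)
```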

@amilsted (Contributor, Author) commented Oct 4, 2022

And, unless gemm_wrapper! is inlined, wouldn't this

return matmul2x2!(C, tA, tB, A, B, MulAddMul(α, β))

surely always result in a dispatch? Because that sounds like a big ouch

Apologies, but I don't see why this would always cause a dispatch. If constants get propagated down to this level, then it should just work as before, no?

@Micket (Contributor) commented Oct 5, 2022

Apologies, but I don't see why this would always cause a dispatch. If constants get propagated down to this level, then it should just work as before, no?

Maybe there is some black magic I don't understand here, but this is what I mean by the function needing to be inlined for the compiler to even be able to perform this optimization. Otherwise, there is just one compiled gemm_wrapper!(...) method, which is called for all values; no optimization can penetrate a non-inlined function boundary.
Perhaps this function is still small enough that the compiler will inline it regardless, but there are no @inline hints here to suggest it, and it doesn't look all that small.

So I just get worried when I see a source of type instability creeping deeper into the code. It's one thing if the instability is right here:

@inline mul!(C::AbstractMatrix, A::AbstractVecOrMat, B::AbstractVecOrMat,
             alpha::Number, beta::Number) =
    generic_matmatmul!(C, 'N', 'N', A, B, MulAddMul(alpha, beta))

because the dispatch is just one (inlined) function call deep.
I tried to build an MWE, but I can't get my tiny examples to not inline, despite liberal use of @noinline.
Please correct me if I'm wrong. I'm still learning and I'm certainly no authority on this!


More broadly on this PR: whatever applies to gemm here surely applies to most of the rest of the code that relies on MulAddMul, so I don't think anything should be done for just this one case; that leads to needlessly inconsistent behavior.
Based on the source for MulAddMul

A callable for operating short-circuiting version of `x * alpha + y * beta`.

my takeaway is that it's specifically made to help optimize this scenario, and partly for some added convenience.
I see what this is going for, but my gut feeling is that the compiler would optimize away an iszero(x) check at runtime just as well, or that the iszero cost isn't appreciable anyway.
Still, I can only imagine there were good reasons to add this to start with, by someone who knows way more about this code than I do.

Regardless, any change here is bound to open a big can of worms.

@amilsted (Contributor, Author) commented Oct 5, 2022

I think the following might be a nicer solution than my first attempt in this PR. We keep MulAddMul in mul!() but branch on alpha and beta values when calling gemm_wrapper!() and friends, constructing a concrete MulAddMul in each branch. For example:

@inline function mul!(C::StridedMatrix{Complex{T}}, A::StridedVecOrMat{Complex{T}}, B::StridedVecOrMat{T},
                    alpha::TA, beta::TB) where {T<:BlasReal,TA<:Number,TB<:Number}
    ais1 = isone(alpha)
    bis0 = iszero(beta)
    ais1 && bis0 && return gemm_wrapper!(C, 'N', 'N', A, B, MulAddMul{true,true,TA,TB}(alpha, beta))
    ais1 && return gemm_wrapper!(C, 'N', 'N', A, B, MulAddMul{true,false,TA,TB}(alpha, beta))
    bis0 && return gemm_wrapper!(C, 'N', 'N', A, B, MulAddMul{false,true,TA,TB}(alpha, beta))
    return gemm_wrapper!(C, 'N', 'N', A, B, MulAddMul{false,false,TA,TB}(alpha, beta))
end

Rather than write this every time we call gemm_wrapper!() (there are a lot of mul!() variants), we could have a macro to create the branches.

In case const-prop succeeds, this should behave identically to the previous code (?). If const-prop fails, there is still no runtime dispatch and we get the efficiency gains in the native-julia matmul cases.

It's kind of a shame to branch here on alpha and beta values when this probably happens again inside BLAS in the gemm cases, but I'm guessing it's better than dispatch!
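A function-based version of the same branching, roughly equivalent to what such a macro would expand to (a sketch; the helper name is illustrative and MulAddMul is a LinearAlgebra internal):

```julia
using LinearAlgebra: MulAddMul

# Branch once on the runtime values of alpha and beta, and hand a concretely
# typed MulAddMul to the continuation f in each branch, so that the callee
# sees a concrete type and no runtime dispatch is needed.
@inline function with_stable_muladdmul(f, alpha::TA, beta::TB) where {TA<:Number, TB<:Number}
    if isone(alpha)
        iszero(beta) ? f(MulAddMul{true, true, TA, TB}(alpha, beta)) :
                       f(MulAddMul{true, false, TA, TB}(alpha, beta))
    else
        iszero(beta) ? f(MulAddMul{false, true, TA, TB}(alpha, beta)) :
                       f(MulAddMul{false, false, TA, TB}(alpha, beta))
    end
end

# e.g. with_stable_muladdmul(mam -> gemm_wrapper!(C, 'N', 'N', A, B, mam), α, β)
```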

@dkarrasch (Member) commented

As can be seen from the documentation in the code, the goal of MulAddMul was to handle a few edge cases. For instance, think of two BigFloat matrices A and B; allocating a similar output array C will yield undefined references. So, in 5-arg mul! you don't even want to touch elements of C when beta == 0. Similarly, if C contains NaNs, then these should not propagate unless really requested.

With your local PR built, you should perhaps run the addmul.jl test file, changing the value of n_samples in its line 122 to Inf. That may take quite a while, but it will make sure the behavior described above is preserved. In any case, the tests in that file indicate the motivation behind MulAddMul.
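A small illustration of the beta == 0 edge case: 0.0 * NaN is NaN, so scaling by a zero beta is not the same as skipping the read of C entirely.

```julia
C = [NaN 0.0; 0.0 NaN]
β = 0.0

β .* C      # still contains NaNs, because 0.0 * NaN == NaN
zero.(C)    # what 5-arg mul! must effectively do to C when beta == 0

# For a freshly allocated BigFloat array, reading C[i, j] at all would hit an
# undefined reference, so mul!(C, A, B, alpha, 0) must overwrite C without
# reading it.
```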

@amilsted (Contributor, Author) commented Oct 5, 2022

See #46865 (comment) for a minimal macro-based alt solution to this problem that does not change the depth of MulAddMul() and, hence, should preserve any const-prop gains.

@amilsted (Contributor, Author) commented Oct 5, 2022

Maybe there is some black magic i don't understand here, but this is what i mean by the function needing to be inlined for the compiler to even be able to perform this optimization.

I could easily be wrong, but I thought this was an inference-stage optimization, so it does not rely on inlining. I think constants are internally replaced by Const types that can be passed between functions during inference, so that constant values are already available at compile time.
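One way to see whether the constants survive into the callee is to inspect the inferred code (illustrative; what you see will depend on the Julia version):

```julia
using InteractiveUtils, LinearAlgebra

f(C, A, B) = mul!(C, A, B, 2.0, 0.0)       # literal constants at the call site
g(C, A, B, a, b) = mul!(C, A, B, a, b)     # runtime values

A = randn(4, 4); B = randn(4, 4); C = similar(A)

# If const-prop fires, the MulAddMul type parameters appear as concrete in
# the inferred body of f; in g they may be unknown at this call site.
@code_warntype f(C, A, B)
@code_warntype g(C, A, B, 2.0, 0.0)
```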

@amilsted (Contributor, Author) commented Oct 5, 2022

@dkarrasch I had guessed that dispatching on the value of alpha and beta also has performance benefits for, say, native-Julia matmul of 2x2 and 3x3 matrices, or indeed generic_matmul. Is that not also true?

@Micket (Contributor) commented Oct 5, 2022

I could easily be wrong, but I thought this was an inference-stage optimization, so does not rely on the compiler.

Well, all of those steps rely on the call being inlined up to that point. Otherwise, you are calling a separate function: the same separate function everyone else might be calling. There has to be some such cutoff; otherwise we'd be recompiling everything from the ground up with every new function introduced, just to optimize it for any given constants. I failed to make an MWE because Julia really loves inlining, despite my desperate pleading with @noinline.
Pushing the type instability deeper just seems like a code smell to me, even if the compiler might manage to optimize it away in some circumstances.

Edit: Please do correct me if i'm wrong. I love being corrected :)

@amilsted (Contributor, Author) commented Oct 5, 2022

I do agree that pushing the instability deeper is a little stinky. From what I have gleaned, it indeed makes successful const-prop less likely, but I certainly don't understand enough about how the latter works to know what the limitation is exactly.

Anyway, I think I prefer the macro solution I pointed to above, as it should completely eliminate the type instability.

@dkarrasch (Member) commented

What if we pull the small-matrix case up the call chain, i.e., to the place that would call gemm_wrapper!? Perhaps that is where the scalars are potentially still known as constants. The MulAddMul construction would then be left to generic_matmatmul!, which may not be too bad?

@amilsted (Contributor, Author) commented

@dkarrasch We could do that if we are happy making mul!() and co a little fatter (they are marked @inline). It wouldn't fix the type instability for the small-matrix, non-const alpha, beta case, for which I would want to additionally use the @stable_muladdmul macro from #47088.

I can make a hybrid PR that does this, probably via an extra @inline function gemm_wrapper_wrapper!() (with a better name) so that we can easily use it in the various mul!() variants.
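The shape of that helper might be roughly the following (entirely hypothetical: the name, as noted, needs improving, @stable_muladdmul is from #47088, and the signatures of gemm_wrapper! and the small-matrix kernels are simplified here):

```julia
using LinearAlgebra: MulAddMul

@inline function gemm_wrapper_wrapper!(C, tA, tB, A, B, alpha, beta)
    # Small-matrix cases pulled up the call chain, with value branching so
    # that MulAddMul construction stays type-stable even without const-prop.
    if size(C) == (2, 2)
        @stable_muladdmul matmul2x2!(C, tA, tB, A, B, MulAddMul(alpha, beta))
    elseif size(C) == (3, 3)
        @stable_muladdmul matmul3x3!(C, tA, tB, A, B, MulAddMul(alpha, beta))
    else
        gemm_wrapper!(C, tA, tB, A, B, alpha, beta)
    end
    return C
end
```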

@dkarrasch (Member) commented

Superseded by #52439.

@dkarrasch closed this May 22, 2024