Conversation

@amilsted (Contributor) commented Oct 3, 2022

As documented in #46865, MulAddMul() slows down BLAS mul!() unnecessarily when constant propagation fails, or when alpha and beta are simply not constants.

This is an attempt to work around MulAddMul() in the BLAS.gemm! case, where it is apparently not used for anything anyway.
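For concreteness, the overhead can be observed with something like the following (a rough sketch using BenchmarkTools; sizes are illustrative and exact timings depend on the Julia version):

```julia
using LinearAlgebra, BenchmarkTools

n = 16
A, B, C = randn(n, n), randn(n, n), randn(n, n)
α, β = 2.0, 0.5

# Runtime alpha/beta: const-prop cannot fire, so the MulAddMul construction
# inside mul! may allocate and dispatch at runtime.
@btime mul!($C, $A, $B, $α, $β)

# Literal constants: const-prop usually succeeds and the overhead vanishes.
@btime mul!($C, $A, $B, 2.0, 0.5)
```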

@kshyatt added the "linear algebra" label Oct 3, 2022
@amilsted (Contributor, Author) commented Oct 3, 2022

It would be great to see this added to 1.8.x if approved.

@N5N3 added the "performance (Must go faster)" label Oct 4, 2022
@N5N3 (Member) commented Oct 4, 2022

IIUC this overhead should be negligible for large matrices. Out of curiosity, does StaticArrays perform better than BLAS at the sizes you're concerned with?

BTW, other wrapper functions (syrk, herk) should have a similar problem. If we really want this change, then we'd better make their APIs consistent.

@amilsted (Contributor, Author) commented Oct 4, 2022

Yes, I suppose this would become negligible with large matrices. It is not negligible at, say, 16x16.

It does indeed look like StaticArrays is faster here for my artificial tests, which is cool! We can try to use them in such cases in real code, but it puts some extra burden on the users of the tools we are providing right now. I think it doesn't hurt to also fix these superfluous allocations for the Array case :)

Edit: It seems using StaticArrays in the problems we actually care about (basically just ODE integration where the steps are GEMM or GEMV ops) currently produces more allocations and is slower, even at D=16. I'm not sure why just yet.

Btw, the overhead of these small allocations can be particularly non-negligible when doing many such tasks in parallel using threading, due to GC contention.

@amilsted (Contributor, Author) commented Oct 4, 2022

Of course it would be nice to remove the reliance on constant-propagation from the whole of gemm_wrapper!() and friends, including the 2x2 and 3x3 specializations, since

  1. it seems to be very fragile and
  2. it is not wrong for alpha and beta to not be constant and these cases should be performant too!

Unfortunately, this seems like a bit of a minefield! Seems like there ought to be branching on alpha and beta values at a fairly high level, then maybe generated functions can be used to avoid duplicating the native matmul cases?

@Micket (Contributor) commented Oct 4, 2022

When I looked around in the code, this change is actually what I expected to see. The MulAddMul mechanism, which dispatches out to several different methods based on runtime values, seemed very confusing.

Since it comes with significant downsides (performance hits like this one when constant prop fails, additional compilation due to dispatch, and making the code quite a bit more confusing on a first read), I imagine there were compelling reasons to write the code like this in the first place?

Assuming constant prop worked as it should, the only benefit I could see is that the dispatched code would possibly avoid one or two conditional statements (when optimizing for alpha or beta = 0).
But surely this would only matter for very tiny systems, and I suspect even then it probably wouldn't be detectable.
And I suspect that in most cases where we actually have constants, they would rarely be 0.
And for large systems, it surely matters even less.
So I have a hard time seeing the benefit of MulAddMul anywhere.

And, unless gemm_wrapper! is inlined, wouldn't this

return matmul2x2!(C, tA, tB, A, B, MulAddMul(α, β))

surely always result in a dispatch? Because that sounds like a big ouch

@amilsted (Contributor, Author) commented Oct 4, 2022

Is it possible to detect propagated constants at compile time? If so, a hacky (but I think type-stable?) workaround might be to restore MulAddMul to mul!() but construct MulAddMul{false, false, typeof(alpha), typeof(beta)}(alpha, beta) whenever alpha and beta are not compile-time constant.

Of course, this is not optimal when alpha and beta are not constant but alpha == 1 and/or beta == 0.
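In code, that fallback would be roughly the following (a sketch only; MulAddMul is an internal of LinearAlgebra and its layout may change, and the helper name here is hypothetical):

```julia
using LinearAlgebra: MulAddMul

# Fully generic fallback: {ais1 = false, bis0 = false} assumes nothing about
# the values, so it is correct for any alpha and beta, and the resulting type
# is concrete regardless of whether const-prop succeeded.
@inline fallback_muladdmul(alpha::TA, beta::TB) where {TA<:Number, TB<:Number} =
    MulAddMul{false, false, TA, TB}(alpha, beta)
```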

@amilsted (Contributor, Author) commented Oct 4, 2022

And, unless gemm_wrapper! is inlined, wouldn't this

return matmul2x2!(C, tA, tB, A, B, MulAddMul(α, β))

surely always result in a dispatch? Because that sounds like a big ouch

Apologies, but I don't see why this would always cause a dispatch. If constants get propagated down to this level, then it should just work as before, no?

@Micket (Contributor) commented Oct 5, 2022

Apologies, but I don't see why this would always cause a dispatch. If constants get propagated down to this level, then it should just work as before, no?

Maybe there is some black magic I don't understand here, but this is what I mean by the function needing to be inlined for the compiler to even be able to perform this optimization. Otherwise, there is just one compiled gemm_wrapper!(...) method, which is called for all values; no optimization can penetrate a non-inlined function boundary.
Perhaps this function is still small enough that the compiler will inline it regardless, but there are no @inline hints here to suggest it, and it doesn't look all that small.

So I just get worried when I see a source of type instability creeping deeper into the code. It's one thing if the instability is right here:

@inline mul!(C::AbstractMatrix, A::AbstractVecOrMat, B::AbstractVecOrMat,
             alpha::Number, beta::Number) =
    generic_matmatmul!(C, 'N', 'N', A, B, MulAddMul(alpha, beta))

because the dispatch is just one (inlined) function call deep.
I tried to build an MWE, but I can't get my tiny examples to not inline, despite liberal use of @noinline.
Please correct me if I'm wrong. I'm still learning and I'm certainly no authority on this!


More broadly on this PR: whatever applies to gemm here surely applies to most of the rest of the code that relies on MulAddMul, so I don't think anything should be done for just this one case; that leads to needlessly inconsistent behavior.
Based on the source for MulAddMul

A callable for operating short-circuiting version of `x * alpha + y * beta`.

my takeaway is that it's specifically made to help optimize this scenario, and partly for some added convenience.
I see what this is going for, but my gut feeling is that the compiler would optimize away an iszero(x) check at runtime just as well, or that the iszero cost isn't appreciable anyway.
Still, I can only imagine there were good reasons to add this to start with, by someone who knows way more about this code than I do.

Regardless, any change here is bound to open a big can of worms.

@amilsted (Contributor, Author) commented Oct 5, 2022

I think the following might be a nicer solution than my first attempt in this PR. We keep MulAddMul in mul!() but branch on alpha and beta values when calling gemm_wrapper!() and friends, constructing a concrete MulAddMul in each branch. For example:

@inline function mul!(C::StridedMatrix{Complex{T}}, A::StridedVecOrMat{Complex{T}}, B::StridedVecOrMat{T},
                    alpha::TA, beta::TB) where {T<:BlasReal,TA<:Number,TB<:Number}
    ais1 = isone(alpha)
    bis0 = iszero(beta)
    ais1 && bis0 && return gemm_wrapper!(C, 'N', 'N', A, B, MulAddMul{true,true,TA,TB}(alpha, beta))
    ais1 && return gemm_wrapper!(C, 'N', 'N', A, B, MulAddMul{true,false,TA,TB}(alpha, beta))
    bis0 && return gemm_wrapper!(C, 'N', 'N', A, B, MulAddMul{false,true,TA,TB}(alpha, beta))
    return gemm_wrapper!(C, 'N', 'N', A, B, MulAddMul{false,false,TA,TB}(alpha, beta))
end

Rather than write this every time we call gemm_wrapper!() (there are a lot of mul!() variants), we could have a macro to create the branches.

In case const-prop succeeds, this should behave identically to the previous code (?). If const-prop fails, there is still no runtime dispatch and we get the efficiency gains in the native-julia matmul cases.

It's kind of a shame to branch here on alpha and beta values when this probably happens again inside BLAS in the gemm cases, but I'm guessing it's better than dispatch!
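A function-based version of the same branching, roughly equivalent to what such a macro would expand to (a sketch; the helper name is illustrative and MulAddMul is a LinearAlgebra internal):

```julia
using LinearAlgebra: MulAddMul

# Branch once on the runtime values of alpha and beta, and hand a concretely
# typed MulAddMul to the continuation f in each branch, so that the callee
# sees a concrete type and no runtime dispatch is needed.
@inline function with_stable_muladdmul(f, alpha::TA, beta::TB) where {TA<:Number, TB<:Number}
    if isone(alpha)
        iszero(beta) ? f(MulAddMul{true, true, TA, TB}(alpha, beta)) :
                       f(MulAddMul{true, false, TA, TB}(alpha, beta))
    else
        iszero(beta) ? f(MulAddMul{false, true, TA, TB}(alpha, beta)) :
                       f(MulAddMul{false, false, TA, TB}(alpha, beta))
    end
end

# e.g. with_stable_muladdmul(mam -> gemm_wrapper!(C, 'N', 'N', A, B, mam), α, β)
```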

@dkarrasch (Member) commented

As can be seen from the documentation in the code, the goal of MulAddMul was to handle a few edge cases. For instance, think of two BigFloat matrices A and B; allocating a similar output array C will yield undefined references. So, in 5-arg mul! you don't even want to touch elements of C when beta == 0. Similarly, if C contains NaNs, then these should not propagate unless really requested.

With your local PR built, you should perhaps run the addmul.jl test file, changing the value of n_samples in its line 122 to Inf. That may take quite a while, but it will make sure the behavior described above is preserved. In any case, the tests in that file indicate the motivation behind MulAddMul.
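A small illustration of the beta == 0 edge case: 0.0 * NaN is NaN, so scaling by a zero beta is not the same as skipping the read of C entirely.

```julia
C = [NaN 0.0; 0.0 NaN]
β = 0.0

β .* C      # still contains NaNs, because 0.0 * NaN == NaN
zero.(C)    # what 5-arg mul! must effectively do to C when beta == 0

# For a freshly allocated BigFloat array, reading C[i, j] at all would hit an
# undefined reference, so mul!(C, A, B, alpha, 0) must overwrite C without
# reading it.
```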

@amilsted (Contributor, Author) commented Oct 5, 2022

See #46865 (comment) for a minimal macro-based alt solution to this problem that does not change the depth of MulAddMul() and, hence, should preserve any const-prop gains.

@amilsted (Contributor, Author) commented Oct 5, 2022

Maybe there is some black magic i don't understand here, but this is what i mean by the function needing to be inlined for the compiler to even be able to perform this optimization.

I could easily be wrong, but I thought this was an inference-stage optimization, so it does not rely on inlining. I think constants are internally replaced by Const types that can be passed between functions during inference, so that constant values are already available at compile time.
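One way to see whether the constants survive into the callee is to inspect the inferred code (illustrative; what you see will depend on the Julia version):

```julia
using InteractiveUtils, LinearAlgebra

f(C, A, B) = mul!(C, A, B, 2.0, 0.0)       # literal constants at the call site
g(C, A, B, a, b) = mul!(C, A, B, a, b)     # runtime values

A = randn(4, 4); B = randn(4, 4); C = similar(A)

# If const-prop fires, the MulAddMul type parameters appear as concrete in
# the inferred body of f; in g they may be unknown at this call site.
@code_warntype f(C, A, B)
@code_warntype g(C, A, B, 2.0, 0.0)
```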

@amilsted (Contributor, Author) commented Oct 5, 2022

@dkarrasch I had guessed that dispatching on the value of alpha and beta also has performance benefits for, say, native-Julia matmul of 2x2 and 3x3 matrices, or indeed generic_matmul. Is that not also true?

@Micket (Contributor) commented Oct 5, 2022

I could easily be wrong, but I thought this was an inference-stage optimization, so does not rely on the compiler.

Well, all of those steps rely on the call being inlined up to that point. Otherwise, you are calling a separate function: the same separate function everyone else might be calling. There has to be some such cutoff; otherwise we'd be recompiling everything from the ground up with every new function introduced, just to optimize it for any given constants. I failed to make an MWE because Julia really loves inlining, despite my desperate pleading with @noinline.
Pushing the type instability deeper just seems like a code smell to me, even if the compiler might manage to optimize it away in some circumstances.

Edit: Please do correct me if i'm wrong. I love being corrected :)

@amilsted (Contributor, Author) commented Oct 5, 2022

I do agree that pushing the instability deeper is a little stinky. From what I have gleaned, it indeed makes successful const-prop less likely, but I certainly don't understand enough about how the latter works to know what the limitation is exactly.

Anyway, I think I prefer the macro solution I pointed to above, as it should completely eliminate the type instability.

@dkarrasch (Member) commented

What if we pull the small-matrix case up the call chain, i.e., to the place that would call gemm_wrapper!? Perhaps that is where the scalars are potentially still known as constants. The MulAddMul construction would then be left to generic_matmatmul!, which may not be too bad?

@amilsted (Contributor, Author) commented

@dkarrasch We could do that if we are happy making mul!() and co a little fatter (they are marked @inline). It wouldn't fix the type instability for the small-matrix, non-const alpha, beta case, for which I would want to additionally use the @stable_muladdmul macro from #47088.

I can make a hybrid PR that does this, probably via an extra @inline function gemm_wrapper_wrapper!() (with a better name) so that we can easily use it in the various mul!() variants.
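The shape of that helper might be roughly the following (entirely hypothetical: the name, as noted, needs improving, @stable_muladdmul is from #47088, and the signatures of gemm_wrapper! and the small-matrix kernels are simplified here):

```julia
using LinearAlgebra: MulAddMul

@inline function gemm_wrapper_wrapper!(C, tA, tB, A, B, alpha, beta)
    # Small-matrix cases pulled up the call chain, with value branching so
    # that MulAddMul construction stays type-stable even without const-prop.
    if size(C) == (2, 2)
        @stable_muladdmul matmul2x2!(C, tA, tB, A, B, MulAddMul(alpha, beta))
    elseif size(C) == (3, 3)
        @stable_muladdmul matmul3x3!(C, tA, tB, A, B, MulAddMul(alpha, beta))
    else
        gemm_wrapper!(C, tA, tB, A, B, alpha, beta)
    end
    return C
end
```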

@dkarrasch (Member) commented

Superseded by #52439.

@dkarrasch closed this May 22, 2024