Push MulAddMul away from BLAS.gemm! #47026
Conversation
---
It would be great to see this added to 1.8.x if approved.
---
IIUC this overhead should be negligible for large matrices. Out of curiosity, does StaticArrays perform better than BLAS at the size you are concerned with? BTW, other wrapper functions (
---
Yes, I suppose this would become negligible with large matrices. It is not negligible at, say, 16x16. It does indeed look like StaticArrays is faster here in my artificial tests, which is cool! We can try to use it in such cases in real code, but that puts some extra burden on the users of the tools we are providing right now. I think it doesn't hurt to also fix these superfluous allocations for the Array case :)

Edit: It seems that using StaticArrays in the problems we actually care about (basically just ODE integration, where the steps are GEMM or GEMV ops) currently produces more allocations and is slower, even at D=16. Not sure why just now.

Btw, the overhead of these small allocations can be particularly non-negligible when doing many such tasks in parallel using threading, due to GC contention.
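To make the claim above concrete, here is a minimal sketch (my own micro-benchmark, not code from this PR) of how one can observe the per-call allocations of 5-argument `mul!` at 16x16 when alpha and beta are runtime values, so that constant propagation cannot specialize `MulAddMul`:

```julia
using LinearAlgebra

# Small matrices, where per-call allocation overhead is not negligible.
A = rand(16, 16); B = rand(16, 16); C = zeros(16, 16)

# Parse the scalars from strings so the compiler cannot see them as
# constants; this models the "const-prop fails" scenario discussed above.
alpha = parse(Float64, "2.0")
beta  = parse(Float64, "0.0")

mul!(C, A, B, alpha, beta)           # warm-up call, excludes compilation cost
nbytes = @allocated mul!(C, A, B, alpha, beta)
println("bytes allocated per mul! call: ", nbytes)
```

On a build without this fix, `nbytes` would be expected to be nonzero; the exact figure depends on the Julia version.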
---
Of course it would be nice to remove the reliance on constant propagation from the whole of
Unfortunately, this seems like a bit of a minefield! It seems like there ought to be branching on the alpha and beta values at a fairly high level; then maybe generated functions could be used to avoid duplicating the native matmul cases?
---
When I looked around in the code, this change was actually what I expected to see. The MulAddMul thing, dispatching out to several different methods based on the values, seemed very confusing. Since it comes with significant downsides (performance hits like this one when constant prop fails, additional compilation due to dispatching, as well as making the code quite a bit more confusing on a first read), I imagine there were compelling reasons to write the code like this in the first place? Assuming the constant prop worked as it should, the only benefit I could see is that the dispatched code would possibly avoid one or two conditional statements (when optimizing for alpha or beta = 0). And, unless gemm_wrapper! is inlined, wouldn't this surely always result in a dispatch? Because that sounds like a big ouch.
---
Is it possible to detect propagated constants at compile time? If so, a hacky (but I think type-stable?) workaround might be to restore
Of course, this is not optimal when alpha and beta are not constant but happen to be alpha = 1 and/or beta = 0.
Apologies, but I don't see why this would always cause a dispatch. If constants get propagated down to this level, then it should just work as before, no? |
Maybe there is some black magic I don't understand here, but this is what I mean by the function needing to be inlined for the compiler to even be able to perform this optimization. Otherwise, there is just one compiled gemm_wrapper!(...) method, which will be called for all values. I mean, no optimization can penetrate a non-inlined function boundary. So I just get worried when I see a source of type instability creeping deeper into the code. It's one thing if the instability is right here, because then the dispatch is just one (inlined) function call deep.

More broadly on this PR: whatever applies to gemm here surely applies to most of the rest of the code that relies on MulAddMul, so I don't think anything should be done just for this one specific case; that leads to needlessly inconsistent behavior.

From the comment on MulAddMul (julia/stdlib/LinearAlgebra/src/generic.jl, line 16 in ec5fe69), my takeaway is that it's specifically made to help optimize this scenario, and partly for some added convenience. I see what this is going for, but my gut feeling is that the compiler would optimize away an iszero(x) at runtime just as well, or that the iszero cost isn't appreciable anyway, and, in any scenario where we
Still, I can also only imagine there were good reasons to add this to start with, by someone who knows way more about this code than I do. Regardless, any change here is bound to open a big can of worms.
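For readers unfamiliar with the construct under discussion: the following is a simplified toy sketch of the MulAddMul design (the real struct lives in LinearAlgebra/src/generic.jl; the toy below keeps only the core idea that `isone(alpha)` and `iszero(beta)` are lifted into type parameters, so the branches compile away for each concrete instantiation). All names here are illustrative, not the stdlib's:

```julia
# Toy version of MulAddMul: the first two type parameters record, in the
# type domain, whether alpha == 1 and whether beta == 0.
struct ToyMulAddMul{ais1, bis0, TA, TB}
    alpha::TA
    beta::TB
end

# Because the flags are type parameters, dispatch resolves these branches
# at compile time for each concrete ToyMulAddMul type:
@inline scale(p::ToyMulAddMul{true},  x) = x              # alpha == 1: skip multiply
@inline scale(p::ToyMulAddMul{false}, x) = p.alpha * x

@inline combine(p::ToyMulAddMul{ais1, true},  x, y) where {ais1} = scale(p, x)                 # beta == 0: skip add
@inline combine(p::ToyMulAddMul{ais1, false}, x, y) where {ais1} = scale(p, x) + p.beta * y
```

The cost of this design is exactly what the comment above describes: the flags are computed from runtime values, so constructing the right concrete type requires either constant propagation or a runtime dispatch.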
---
I think the following might be a nicer solution than my first attempt in this PR. We keep

```julia
@inline function mul!(C::StridedMatrix{Complex{T}}, A::StridedVecOrMat{Complex{T}}, B::StridedVecOrMat{T},
                      alpha::TA, beta::TB) where {T<:BlasReal,TA<:Number,TB<:Number}
    ais1 = isone(alpha)
    bis0 = iszero(beta)
    ais1 && bis0 && return gemm_wrapper!(C, 'N', 'N', A, B, MulAddMul{true,true,TA,TB}(alpha, beta))
    ais1 && return gemm_wrapper!(C, 'N', 'N', A, B, MulAddMul{true,false,TA,TB}(alpha, beta))
    bis0 && return gemm_wrapper!(C, 'N', 'N', A, B, MulAddMul{false,true,TA,TB}(alpha, beta))
    return gemm_wrapper!(C, 'N', 'N', A, B, MulAddMul{false,false,TA,TB}(alpha, beta))
end
```

Rather than write this every time we call
In case const-prop succeeds, this should behave identically to the previous code (?). If const-prop fails, there is still no runtime dispatch, and we get the efficiency gains in the native-Julia matmul cases. It's kind of a shame to branch here on the alpha and beta values when this probably happens again inside BLAS in the gemm cases, but I'm guessing it's better than dispatch!
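One way to avoid repeating the four-way branch at every call site would be a small higher-order helper. The sketch below is hypothetical (the helper name and the toy struct are mine, not the stdlib's): the closure receives a value whose flags are concrete type parameters, so each branch makes a type-stable call even when const-prop fails.

```julia
# Toy stand-in for LinearAlgebra's internal MulAddMul (illustrative only).
struct MulAddMulToy{ais1, bis0, TA, TB}
    alpha::TA
    beta::TB
end

# Centralize the value-to-type branching once; `f` is called with a fully
# concrete MulAddMulToy in every branch, so no runtime dispatch remains.
@inline function with_muladdmul(f::F, alpha::TA, beta::TB) where {F, TA, TB}
    ais1 = isone(alpha)
    bis0 = iszero(beta)
    ais1 && bis0 && return f(MulAddMulToy{true, true, TA, TB}(alpha, beta))
    ais1 &&         return f(MulAddMulToy{true, false, TA, TB}(alpha, beta))
    bis0 &&         return f(MulAddMulToy{false, true, TA, TB}(alpha, beta))
    return f(MulAddMulToy{false, false, TA, TB}(alpha, beta))
end

# Hypothetical call-site usage, mirroring the mul! method above:
# with_muladdmul(alpha, beta) do addmul
#     gemm_wrapper!(C, 'N', 'N', A, B, addmul)
# end
```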
---
As can be seen from the documentation in the code, the goal of
With your local PR built, you should perhaps run the addmul.jl test file and change the value of
---
See #46865 (comment) for a minimal macro-based alternative solution to this problem that does not change the depth of
I could easily be wrong, but I thought this was an inference-stage optimization, so it does not rely on the compiler. I think constants are internally replaced by
---
@dkarrasch I had guessed that dispatching on the values of alpha and beta also has performance benefits for, say, native-Julia matmul of 2x2 and 3x3 matrices, or indeed
Well, any of those steps rely on everything being inlined up to that point. Otherwise, you are calling a separate function: the same separate function everyone else is (or might be) calling. There has to be some such cutoff; otherwise, we'd be rebuilding the entire source code from the ground up with every new function introduced, just to optimize it for any given constants. I failed to make a MWE because Julia really loves inlining despite my desperate pleading with

Edit: Please do correct me if I'm wrong. I love being corrected :)
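A small MWE of the boundary being described can still be sketched (this is my own toy example, not stdlib code): the return type of `make` below depends on a runtime value, so a call into a non-inlined function must resolve dispatch at runtime, whereas a literal constant argument lets const-prop resolve it statically.

```julia
# A type whose parameter encodes a value-derived flag, like MulAddMul does.
struct Flag{B} end

# Value-dependent return type: Flag{true} or Flag{false}.
make(x) = iszero(x) ? Flag{true}() : Flag{false}()

# Forcing a function boundary: the flag must be recovered via dispatch here.
@noinline consume(::Flag{B}) where {B} = B

runtime_x = parse(Int, "0")          # opaque to the compiler
println(consume(make(runtime_x)))    # dynamic dispatch on the Flag type
println(consume(make(0)))            # const-prop can resolve this statically
```

Tools like `@code_warntype consume(make(runtime_x))` make the instability visible at the boundary.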
---
I do agree that pushing the instability deeper is a little stinky. From what I have gleaned, it indeed makes successful const-prop less likely, but I certainly don't understand enough about how the latter works to know what the limitation is exactly. Anyway, I think I prefer the macro solution I pointed to above, as it should completely eliminate the type instability.
---
What if we pull the small-matrix case up the call chain, i.e., to the place that would call
---
@dkarrasch We could do that if we are happy making
I can make a hybrid PR that does this, probably via an extra
---
Superseded by #52439. |
As documented in #46865, `MulAddMul()` slows down BLAS `mul!()` unnecessarily when constant propagation fails, or when we're just not using constant alpha and beta. This is an attempt to work around `MulAddMul()` in the `BLAS.gemm!` case, where it is apparently not used for anything anyway.