[NVIDIA][Backend] Add CoalesceAsyncCopy Pass for in-DotOpEnc Upcasting (#5222)
This is a follow-up to the dotOp hoisting optimization for WGMMA
(MMAv3). See
triton-lang/triton#5003 (comment)
In short, when upcasting operand A in registers prior to WGMMA and when
pipelining is enabled, `AsyncCopyGlobalToLocal`'s src gmem blocked
encoding will have `sizePerThread` > smem view's `vec` (along the
contiguous dimension). This results in multiple `cp.async`
instructions being generated for a contiguous global data segment,
which in turn leads to uncoalesced loads. This was previously confirmed in ncu.
See above comment for an example.
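To make the access pattern concrete, here is a minimal Python sketch (not Triton code). It assumes each thread owns `size_per_thread` contiguous elements and issues one copy per `vec` elements; the helper names `cp_async_ranges` and `is_coalesced` are hypothetical and exist only for this illustration.

```python
def cp_async_ranges(size_per_thread, vec, num_threads=4):
    """Model the element ranges touched by each copy instruction.

    Assumption for illustration: thread t owns the contiguous elements
    [t * size_per_thread, (t + 1) * size_per_thread) and splits them into
    size_per_thread // vec copies of vec elements each.
    Returns one list per instruction with a (start, end) range per thread.
    """
    assert size_per_thread % vec == 0
    num_instrs = size_per_thread // vec
    return [
        [
            (t * size_per_thread + i * vec, t * size_per_thread + (i + 1) * vec)
            for t in range(num_threads)
        ]
        for i in range(num_instrs)
    ]


def is_coalesced(ranges):
    """True if neighbouring threads touch back-to-back element ranges."""
    return all(a[1] == b[0] for a, b in zip(ranges, ranges[1:]))


# sizePerThread (8) > vec (4): two copies per thread, each strided across threads.
split = cp_async_ranges(size_per_thread=8, vec=4)
# sizePerThread == vec: one copy per thread, contiguous across threads.
fused = cp_async_ranges(size_per_thread=4, vec=4)
```

In this model, the first instruction of `split` touches `(0, 4), (8, 12), ...`, leaving gaps between neighbouring threads, while the single instruction of `fused` covers `(0, 4), (4, 8), ...` with no gaps.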
I've added a generalized fix in a new pass after the pipeliner. I've
reused the logic in the LLVM lowering for `AsyncCopyGlobalToLocal` to
calculate the max contiguous copy size. I compare that to the blockEnc's
`sizePerThread` along the inner (contiguous) dimension. If the former is
less than the latter, I set the latter to the former.
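The clamping rule can be sketched in Python as follows. This is an illustration of the comparison described above, not the pass itself: the function name `clamp_size_per_thread` is hypothetical, the max contiguous copy size is taken as a given input rather than computed, and `order[0]` is assumed to name the inner (contiguous) dimension.

```python
def clamp_size_per_thread(size_per_thread, order, max_contig_copy):
    """Return a copy of size_per_thread, clamped along the inner dimension.

    Assumptions for illustration: order[0] is the contiguous dimension, and
    max_contig_copy is the largest contiguous copy size in elements, as
    already computed elsewhere.
    """
    inner = order[0]
    clamped = list(size_per_thread)
    if max_contig_copy < clamped[inner]:
        # The copy size is the limiting factor, so shrink the per-thread
        # extent to match it.
        clamped[inner] = max_contig_copy
    return clamped


# Illustrative values only: the inner dimension is dim 1.
shrunk = clamp_size_per_thread([1, 16], order=[1, 0], max_contig_copy=8)
unchanged = clamp_size_per_thread([1, 16], order=[1, 0], max_contig_copy=16)
```

Here `shrunk` becomes `[1, 8]`, while `unchanged` stays `[1, 16]` because the copy size already equals `sizePerThread` along the inner dimension.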
When A is k-major, I can verify a small perf improvement and that ncu no
longer reports uncoalesced loads.
When A is m-major, this pass is a no-op because `copy size ==
sizePerThread == 16`.
ptal, thanks @ThomasRaoux