[AMD] Optimize shared address calculation for async load (#7153)
On GFX9, direct-to-LDS loads write coalesced to LDS and therefore
require the start LDS address as a scalar. This PR refactors the address
calculation to compute a single uniform start address instead of
per-lane addresses. This improves final codegen and reduces register
usage.
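
As a rough illustration of the idea (not the actual lowering code), the
sketch below models a coalesced write in plain Python. The wavefront
size, the bytes written per lane, and all function names are
assumptions made for the example; it only shows that a single scalar
start address carries the same information as the per-lane addresses.

```python
WARP_SIZE = 64       # assumed wavefront size
BYTES_PER_LANE = 4   # assumed: each lane writes one dword


def per_lane_addresses(lds_base, warp_offset):
    """Per-lane style: every lane computes its own LDS address."""
    return [lds_base + warp_offset + lane * BYTES_PER_LANE
            for lane in range(WARP_SIZE)]


def uniform_start_address(lds_base, warp_offset):
    """Uniform style: one scalar start address for the whole warp."""
    return lds_base + warp_offset


def coalesced_write_targets(start):
    """Assumed hardware model: lane i lands at start + i * BYTES_PER_LANE."""
    return [start + lane * BYTES_PER_LANE for lane in range(WARP_SIZE)]


# Both formulations address the same LDS locations, but the uniform one
# needs a single scalar value instead of one value per lane.
start = uniform_start_address(lds_base=1024, warp_offset=256)
assert coalesced_write_targets(start) == per_lane_addresses(1024, 256)
```
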
The swizzling computations are now based on the offset instead of the
final addresses, which further helps codegen.
The lowering can produce incorrect loads in some cases if we store into
a sub-view which slices along the two minor dimensions. Pipelining does
not slice along those dimensions, so it is fine. This was already the
case before the refactoring and will be converted to an error in a
follow-up PR.
---------
Co-authored-by: Lei Zhang <[email protected]>