[NVIDIA] Replace inline assembly for the lowering of ttn::ClusterCTAIdOp (#7512)
This PR refactors the ClusterCTAIdOp conversion, replacing inline PTX
assembly with a sequence of operations (including NVVM ops that lower
to intrinsic calls), which preserves more semantic information at the
LLVM level. Although the new implementation expands the computation into
separate multiply and add operations, the backend typically fuses
them into `mad`, so no performance regression is expected.
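
For illustration only, the arithmetic being expanded can be sketched in
plain C++. This is not the lowering code from the PR. The struct and
function names are hypothetical, and the row-major formula
`(z * ny + y) * nx + x` is an assumption about how the CTA coordinates
within a cluster are linearized.

```cpp
#include <cstdint>

// Hypothetical stand-in for values read from the PTX special registers
// %cluster_ctaid.{x,y,z} and %cluster_nctaid.{x,y}.
struct ClusterCoords {
  uint32_t x, y, z; // CTA position within the cluster
  uint32_t nx, ny;  // cluster extent along x and y
};

// Assumed row-major linearization, written as separate multiplies and
// adds. Each mul/add pair is a candidate for fusion into a single `mad`.
uint32_t linearClusterCTAId(const ClusterCoords &c) {
  uint32_t zy = c.z * c.ny; // mul
  zy = zy + c.y;            // add (first mad candidate)
  uint32_t id = zy * c.nx;  // mul
  id = id + c.x;            // add (second mad candidate)
  return id;
}
```

Keeping the multiply and add as separate operations leaves the
computation visible to LLVM passes, whereas an inline assembly string
is opaque to them.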