[NVIDIA] Rewrite getSM120DotScaledScaleLayout and Refactor MMAv2 (#8482)
### Context
This PR is split out from #8430 and focuses on LinearLayout-related
cleanups that should land before introducing FP4 support.
### Changes
- The existing implementation of `getSM120DotScaledScaleLayout` built the
layout from manually specified bases. This was hard to understand and
had some bugs and questionable behavior. It is now rewritten with LL
(LinearLayout) helpers such as `identity1D` / `zeros1D`, combined with
the direct-sum operator `*`. The result is much clearer and trivially
extends to FP4.
- `MMAv2.cpp` also had questionable behavior, such as duplicating the
same i8 four times into an i32 rather than packing four distinct i8
values. We now simply sign-extend one i8 into an i32 before every
`mma_sync` and hardcode `byteId` to 0. Together with the LL change,
this allowed us to significantly simplify the `MMAv2.cpp` code. We also
replaced non-obvious hardcoded constants with the `NumRegisters` and
`BaseOffset` structs.
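
To illustrate how `identity1D` / `zeros1D` pieces compose under `*`, here is a minimal sketch of a one-dimensional toy model. It is not Triton's `LinearLayout` class and not the layout this PR builds: `ToyLayout`, `toyIdentity1D`, and `toyZeros1D` are hypothetical names, and the model assumes a single input dimension and a single output dimension, with each input bit mapped to one basis value and outputs combined by XOR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical toy model of a 1-D linear layout over GF(2).
// bases[i] is the output produced by input bit i; outputs combine via XOR.
struct ToyLayout {
  std::vector<uint32_t> bases;
  uint32_t outSize = 1; // size of the output dimension (a power of two)

  uint32_t apply(uint32_t in) const {
    assert(in < (1u << bases.size()));
    uint32_t out = 0;
    for (std::size_t i = 0; i < bases.size(); ++i)
      if (in & (1u << i))
        out ^= bases[i];
    return out;
  }
};

inline bool isPowerOfTwo(uint32_t x) { return x != 0 && (x & (x - 1)) == 0; }

// Maps x -> x for x in [0, size).
inline ToyLayout toyIdentity1D(uint32_t size) {
  assert(isPowerOfTwo(size));
  ToyLayout l;
  for (uint32_t b = 1; b < size; b <<= 1)
    l.bases.push_back(b);
  l.outSize = size;
  return l;
}

// Maps every x in [0, size) to 0, i.e. a broadcast along the input.
inline ToyLayout toyZeros1D(uint32_t size) {
  assert(isPowerOfTwo(size));
  ToyLayout l;
  for (uint32_t b = 1; b < size; b <<= 1)
    l.bases.push_back(0);
  return l; // outSize stays 1
}

// Direct sum in this toy model: the outer layout's input bits come after
// the inner layout's, and its outputs are shifted past the inner outputs.
inline ToyLayout operator*(const ToyLayout &inner, const ToyLayout &outer) {
  ToyLayout r = inner;
  for (uint32_t b : outer.bases)
    r.bases.push_back(b * inner.outSize);
  r.outSize = inner.outSize * outer.outSize;
  return r;
}
```

In this model, `toyIdentity1D(4) * toyZeros1D(2) * toyIdentity1D(2)` has bases `{1, 2, 0, 4}`: the third input bit is a broadcast bit that never changes the output, which is the kind of structure that is tedious to write out as manual bases.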
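
The difference between the i8 handling strategies mentioned above can be sketched in plain C++. This is an illustration of the bit patterns only, not the code in `MMAv2.cpp`; the helper names `replicateByte`, `packFour`, `signExtend`, and `extractByte` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Duplicate one i8 into all four bytes of a 32-bit word.
inline uint32_t replicateByte(int8_t v) {
  uint32_t b = static_cast<uint8_t>(v);
  return b | (b << 8) | (b << 16) | (b << 24);
}

// Pack four distinct i8 values into one 32-bit word, `a` in the low byte.
inline uint32_t packFour(int8_t a, int8_t b, int8_t c, int8_t d) {
  return static_cast<uint32_t>(static_cast<uint8_t>(a)) |
         (static_cast<uint32_t>(static_cast<uint8_t>(b)) << 8) |
         (static_cast<uint32_t>(static_cast<uint8_t>(c)) << 16) |
         (static_cast<uint32_t>(static_cast<uint8_t>(d)) << 24);
}

// Sign-extend a single i8 into an i32; the value lives in byte 0.
inline int32_t signExtend(int8_t v) { return static_cast<int32_t>(v); }

// Read byte `byteId` (0 = least significant) of a 32-bit word as an i8.
inline int8_t extractByte(uint32_t word, unsigned byteId) {
  assert(byteId < 4);
  return static_cast<int8_t>((word >> (8 * byteId)) & 0xFFu);
}
```

With a sign-extended word, the original value is always recoverable from byte 0, so a fixed byte index of 0 is sufficient. With replication, every byte index yields the same value, so the extra copies carry no information.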
### Notes
- This PR does not change performance.
- We will follow up with FP4 support for sm_120 shortly.