[BACKEND] Don't accumulate the offsets from memdesc_subview on the base (#7515)
Layouts coming from `memdesc_subview` are at heart affine layouts.
This is because partial evaluation of a linear map on some variables
gives you an affine map. More concretely, if `c` is a constant
and `x` is a variable, we have that `A(x ^ c) = A(x) ^ A(c)` where
`A(c)` is a constant. In `memdesc_subview`, `A` is the map from the
matrix into shared memory (the inverse of the shared memory linear
layout)
and `c` are the offsets given in the IR of `memdesc_subview`.
Previously `memdesc_subview` would advance the pointer as `ptr += A(c)`.
This is incorrect, as the actual formula for the address is given by
`ptr + (A(c) ^ A(x))`, so we compensated for this when computing the
address by subtracting the offsets and adding them back as an xor:
https://github.com/triton-lang/triton/blob/8e52b2e483eb072149801443e2e33b0b72d32bc5/lib/Conversion/TritonGPUToLLVM/Utility.cpp#L604-L605
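For intuition, here is a minimal standalone sketch (plain C++, not Triton's
`LinearLayout` API; the basis vectors and the base address are made up) showing
both facts: a map built from XORs of basis vectors satisfies
`A(x ^ c) = A(x) ^ A(c)`, and folding `A(c)` into the base pointer with a plain
`+` only gives the right address if it is later subtracted back out and
re-applied as an xor:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy GF(2)-linear map: apply(x) XORs the basis vector of every set bit of x.
static uint32_t apply(const std::vector<uint32_t> &basis, uint32_t x) {
  uint32_t out = 0;
  for (size_t i = 0; i < basis.size(); ++i)
    if (x & (1u << i))
      out ^= basis[i];
  return out;
}

int main() {
  std::vector<uint32_t> basis = {0b00011, 0b00101, 0b01100, 0b10010};
  uint32_t c = 0b0110;     // constant offsets, as in memdesc_subview
  uint32_t basePtr = 1024; // illustrative base address

  for (uint32_t x = 0; x < 16; ++x) {
    // Affine decomposition: A(x ^ c) == A(x) ^ A(c).
    assert(apply(basis, x ^ c) == (apply(basis, x) ^ apply(basis, c)));

    // Correct address: the constant and variable parts combine with XOR.
    uint32_t correct = basePtr + (apply(basis, c) ^ apply(basis, x));

    // Old behaviour: advance the base by A(c) with a plain add...
    uint32_t advanced = basePtr + apply(basis, c);
    // ...then compensate at address-computation time by subtracting the
    // offsets and re-adding them as an xor (the linked Utility.cpp code).
    uint32_t compensated =
        advanced - apply(basis, c) + (apply(basis, c) ^ apply(basis, x));
    assert(compensated == correct);
    // Note: `advanced + apply(basis, x)` would be wrong whenever A(c) and
    // A(x) share a set bit, since + and ^ then disagree.
  }
  return 0;
}
```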
In this PR we untangle these issues by:
1. Just advancing the base ptr when we index over the pipelining
dimensions (this will be split out into its own op in a future PR)
2. Exposing two methods on SharedMemoryObject to compute the affine part
of the layout (what we called `A(c)` above) and to compute the bits that
`A(c)` may have set. This second part is useful for performing
optimisations.
3. Decreeing that shared layouts will always return the layout of the
full allocShape (minus the pipelining dimension). This means that
`toLinearLayout(MemDescType)` may return a layout whose output shape
differs from that of the tensor it represents. This should not be a
problem because of the next point
4. To account for the previous point, we generalise `invertAndCompose`
to solve systems `AX = B` where `A` may have an output dimension larger
than `B`. In this case we still compute the linear map composition
`A^{-1}B`, which is well defined since `Im(B) \subset Im(A)` (see the
sketch after this list)
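To make point 4 concrete, here is a small hand-rolled GF(2) example (plain
C++, not the actual `LinearLayout::invertAndCompose` implementation; matrices
and the brute-force solver are purely illustrative): `A` has three output bits
but its image is only a two-dimensional subspace, and since `Im(B)` lies
inside `Im(A)` a solution `X` to `AX = B` still exists and plays the role of
the composition `A^{-1}B`:

```cpp
// Toy GF(2) linear algebra: a matrix is a list of columns, each column a
// bitmask of output bits. apply() computes A * x over GF(2) (XOR arithmetic).
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

using Mat = std::vector<uint32_t>; // Mat[j] = j-th column

static uint32_t apply(const Mat &A, uint32_t x) {
  uint32_t out = 0;
  for (size_t j = 0; j < A.size(); ++j)
    if (x & (1u << j))
      out ^= A[j];
  return out;
}

// Solve A * x = b by brute force over all inputs (fine for a toy example).
static std::optional<uint32_t> solveColumn(const Mat &A, uint32_t b) {
  for (uint32_t x = 0; x < (1u << A.size()); ++x)
    if (apply(A, x) == b)
      return x;
  return std::nullopt; // b is outside Im(A)
}

int main() {
  // A: 2 inputs -> 3 output bits, but Im(A) only spans a 2-dim subspace.
  // B: maps into that same subspace, so Im(B) is contained in Im(A).
  Mat A = {0b011, 0b101};
  Mat B = {0b011, 0b110}; // 0b110 = 0b011 ^ 0b101

  // X solves A * X = B column by column; it is the composition A^{-1} B.
  Mat X;
  for (uint32_t b : B) {
    auto x = solveColumn(A, b);
    assert(x && "requires Im(B) to be a subspace of Im(A)");
    X.push_back(*x);
  }

  // Check A * (X * v) == B * v for every input v.
  for (uint32_t v = 0; v < (1u << B.size()); ++v)
    assert(apply(A, apply(X, v)) == apply(B, v));
  return 0;
}
```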
In the future we could consider always describing shared layouts as maps
from the tensor into the offsets. This would allow us to represent the
linear part of the affine map above via linear layouts, and we could
simply return a layout of the correct shape from
`toLinearLayout(MemDescType)`, rather than an oversized one.
This also makes intuitive sense: we never want to store the same
element in two different parts of shared memory, so this map is
always well-defined, while its inverse may not be, as is the case
here.
We also take the chance to thoroughly clean up
`emitTransferBetweenRegistersAndShared` now that the logic is better
defined. The generated code should be comparable to or better than the
previous one.
---------
Co-authored-by: apgoucher <[email protected]>