[Backend] Improve warp-local layout conversion algo using shuffles (#7558)
This PR replaces the existing `transferWithinWarp` warp-shuffle
algorithm for the lowering of `ConvertLayoutOp` with a more precise
algorithm which allows for broadcasting in layouts and which emits fewer
`select` and `shuffle` instructions in general. We additionally
implement register packing for sub-32-bit data types.
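To illustrate what register packing means for sub-32-bit types, here is a minimal Python sketch that packs several narrow values into one 32-bit word and unpacks them again. The helper names `pack_word` and `unpack_word` are hypothetical and the little-end-first element order is an assumption; this PR's actual packing code may be organised differently. The presumed benefit is that a single 32-bit shuffle then moves several elements at once.

```python
def pack_word(values, bits):
    """Pack 32 // bits unsigned values of width `bits` into one 32-bit word.

    Element 0 lands in the lowest bits (an assumption made for this sketch).
    """
    per_word = 32 // bits
    assert len(values) == per_word, "need exactly one word's worth of values"
    mask = (1 << bits) - 1
    word = 0
    for i, value in enumerate(values):
        word |= (value & mask) << (i * bits)
    return word


def unpack_word(word, bits):
    """Inverse of pack_word: split a 32-bit word into 32 // bits values."""
    mask = (1 << bits) - 1
    return [(word >> (i * bits)) & mask for i in range(32 // bits)]


# Four 8-bit elements share one 32-bit register, so one shuffle moves all four.
packed = pack_word([0x11, 0x22, 0x33, 0x44], bits=8)
assert unpack_word(packed, bits=8) == [0x11, 0x22, 0x33, 0x44]
```

Under this scheme, $R$ registers of 8-bit data would occupy $R / 4$ packed words, and 16-bit data would occupy $R / 2$.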
### Combinatorial point of view
The new implementation describes a layout conversion as a permutation
$P$ of hardware index bits and decomposes it as
```math
P = P_{\text{mixed}} \circ P_{\text{lane}} \circ P_{\text{reg}}
```
where
- $P_{\text{lane}}$ is a permutation of lane bits,
- $P_{\text{reg}}$ is a permutation of register bits,
- $P_{\text{mixed}}$ is a product of disjoint transpositions which swap
register and lane bits.
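A single mixed transposition can be simulated on a toy warp to make the definition concrete. The sketch below is not this PR's code: the function names, the list-of-lists warp model, and the use of an XOR-style lane exchange are all assumptions made for illustration. It swaps one register bit with one lane bit using one simulated shuffle and three simulated selects per register pair.

```python
WARP_SIZE = 32  # lanes per warp


def destination(lane, reg, reg_bit, lane_bit):
    """Reference mapping: swap bit `reg_bit` of reg with bit `lane_bit` of lane."""
    rb = (reg >> reg_bit) & 1
    lb = (lane >> lane_bit) & 1
    new_lane = (lane & ~(1 << lane_bit)) | (rb << lane_bit)
    new_reg = (reg & ~(1 << reg_bit)) | (lb << reg_bit)
    return new_lane, new_reg


def swap_reg_lane_bit(regs, reg_bit, lane_bit):
    """Apply one mixed transposition to regs[lane][reg] (hypothetical sketch).

    Returns the permuted registers plus the number of simulated select and
    shuffle instructions issued per lane.
    """
    num_regs = len(regs[0])
    reg_mask, lane_mask = 1 << reg_bit, 1 << lane_bit
    out = [list(lane_regs) for lane_regs in regs]
    selects = shuffles = 0
    for lo in range(num_regs):
        if lo & reg_mask:
            continue  # visit each (lo, hi) register pair once
        hi = lo | reg_mask
        # Select 1: each lane picks the element whose two bits differ,
        # since only those elements have to change lanes.
        send = [regs[lane][lo] if lane & lane_mask else regs[lane][hi]
                for lane in range(WARP_SIZE)]
        selects += 1
        # Shuffle: exchange with the partner lane that differs in `lane_bit`.
        recv = [send[lane ^ lane_mask] for lane in range(WARP_SIZE)]
        shuffles += 1
        # Selects 2 and 3: write the received value into the correct slot.
        for lane in range(WARP_SIZE):
            if lane & lane_mask:
                out[lane][lo] = recv[lane]
            else:
                out[lane][hi] = recv[lane]
        selects += 2
    return out, selects, shuffles


# Tag every element with its starting (lane, reg) position and check the result.
start = [[(lane, reg) for reg in range(4)] for lane in range(WARP_SIZE)]
result, n_select, n_shuffle = swap_reg_lane_bit(start, reg_bit=0, lane_bit=0)
for lane in range(WARP_SIZE):
    for reg in range(4):
        new_lane, new_reg = destination(lane, reg, 0, 0)
        assert result[new_lane][new_reg] == (lane, reg)
```

With $R = 4$ registers this sketch issues 6 selects and 2 shuffles, that is $1.5 \cdot R$ and $0.5 \cdot R$. These happen to coincide with the $m = 1$ out-of-place figures quoted in this description, though whether the PR realises that variant in exactly this way is not stated here.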
### Instruction count differences
Letting $m$ denote the number of such mixed transpositions and $R$
denote the number of registers, the existing algorithm:
- Uses $2 \cdot (2^m - 1) \cdot R$ `select`s
- Uses $R$ `shuffle`s
while the new algorithm, using the in-place bit-swap implementation:
- Uses $2 \cdot m \cdot R$ `select`s
- Uses $(1 - (0.5)^m) \cdot R$ `shuffle`s if $P_{\text{lane}}$ is the
trivial permutation and $R$ `shuffle`s otherwise.
and in the case $m = 1$ with trivial $P_{\text{lane}}$, using the
out-of-place bit-swap implementation:
- Uses $1.5 \cdot R$ `select`s
- Uses $0.5 \cdot R$ `shuffle`s
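The formulas above can be tabulated with a few lines of Python. This is only a calculator for the counts stated in this description; the function names are hypothetical, and the sketch assumes $R$ is divisible by $2^m$ so that the shuffle counts are whole numbers.

```python
def old_counts(m, num_regs):
    """Instruction counts for the existing algorithm."""
    return {"select": 2 * (2**m - 1) * num_regs, "shuffle": num_regs}


def new_in_place_counts(m, num_regs, trivial_lane_perm):
    """Instruction counts for the new algorithm, in-place bit-swap variant."""
    assert num_regs % (2**m) == 0, "sketch assumes R is divisible by 2^m"
    if trivial_lane_perm:
        shuffles = num_regs - num_regs // 2**m  # (1 - 0.5^m) * R
    else:
        shuffles = num_regs
    return {"select": 2 * m * num_regs, "shuffle": shuffles}


def new_out_of_place_counts(num_regs):
    """Counts for the out-of-place variant (m = 1, trivial lane permutation)."""
    assert num_regs % 2 == 0, "sketch assumes an even register count"
    return {"select": 3 * num_regs // 2, "shuffle": num_regs // 2}


if __name__ == "__main__":
    R = 8
    for m in (1, 2, 3):
        print(m, old_counts(m, R), new_in_place_counts(m, R, True))
    print("out-of-place, m = 1:", new_out_of_place_counts(R))
```

The old `select` count grows exponentially in $m$ while the new one grows linearly, so the gap widens quickly as more register and lane bits are exchanged.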
Despite these improvements, empirical results on Turing T4 GPUs, using a
modification of the test from a comment on triton-lang/triton#5419,
show that the shared memory approach is faster in most cases when $m >
1$. The only case measured to be faster with shuffles was $m = 2$ with
trivial $P_{\text{lane}}$, applying the out-of-place implementation
twice in succession.
For this reason, in the new algorithm we still bail out when $m > 1$.
Due to the higher arithmetic pressure, I believe it’s unlikely the
shuffle approach would be profitable compared to the shared memory
approach in these cases, but testing with actual kernels rather than
microbenchmarks and using different hardware may reveal otherwise.
The new algorithm also handles layout conversions that have been
hand-implemented in some cases, such as `convertMMAV3To8BitsDotOperand`.
However, this PR does not attempt to remove or modify any of these code
paths.
# New contributor declaration
- [x] I am not making a trivial change, such as fixing a typo in a
comment.
- [x] I have written a PR description following these
[rules](https://cbea.ms/git-commit/#why-not-how).
- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.
- Select one of the following.
  - [x] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
  - [ ] This PR does not need a test because `FILL THIS IN`.
- Select one of the following.
  - [ ] I have not added any `lit` tests.
  - [x] The `lit` tests I have added follow these [best
    practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices),
    including the "tests should be minimal" section. (Usually running Python
    code and using the instructions it generates is not minimal.)
---------
Co-authored-by: apgoucher <[email protected]>