Commit 68752b6
[TLX] Multi-buffer epilogue TMA stores in Blackwell GEMM
Summary:
Use double-buffering for epilogue TMA stores on the non-interleaved
path (used by 32 of the 48 benchmark shapes). Instead of using a single
SMEM buffer per MMA group and waiting for all outstanding stores to
complete (wait(0)) before reusing it, alternate between two SMEM buffers
and wait only until at most one store is still pending (wait(1)). The
buffer index is computed as
(group_id * EPILOGUE_SUBTILE + slice_id) % 2 so that consecutive stores
never share a buffer, including across MMA group boundaries.
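The indexing rule can be checked with a small host-side simulation. The sketch below is plain Python, not TLX. The names `buffer_index`, `naive_index`, and `simulate` are hypothetical, and the model of TMA stores is deliberately loose: it assumes stores complete in issue order and that wait(n) blocks until at most n stores are pending. Under those assumptions, the per-group index `slice_id % 2` reuses a buffer across a group boundary whenever EPILOGUE_SUBTILE is odd, while the global index does not.

```python
from collections import deque


def buffer_index(group_id, slice_id, epilogue_subtile):
    """Global store counter modulo 2, as described in the commit message."""
    return (group_id * epilogue_subtile + slice_id) % 2


def naive_index(group_id, slice_id, epilogue_subtile):
    """Per-group alternative (hypothetical), shown only for comparison."""
    return slice_id % 2


def simulate(index_fn, num_mma_groups, epilogue_subtile, max_pending=1):
    """Return True if no slice is written into a buffer that is still in flight.

    Models wait(max_pending): before a buffer is written, the oldest stores
    are retired until at most `max_pending` remain pending.
    """
    in_flight = deque()  # buffer indices of issued, not-yet-complete stores
    for group_id in range(num_mma_groups):
        for slice_id in range(epilogue_subtile):
            buf = index_fn(group_id, slice_id, epilogue_subtile)
            while len(in_flight) > max_pending:
                in_flight.popleft()  # oldest store has completed
            if buf in in_flight:
                return False  # would overwrite data a pending store still reads
            in_flight.append(buf)
    return True


if __name__ == "__main__":
    for subtile in (1, 2, 3, 4):
        print(
            subtile,
            simulate(buffer_index, 2, subtile),
            simulate(naive_index, 2, subtile),
        )
```

With an even EPILOGUE_SUBTILE the two index functions agree, so the global counter only changes behavior for odd subtile counts in this model.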
Also ensure at least 2 SMEM epilogue buffers are allocated so
multi-buffering works even when NUM_MMA_GROUPS == 1.
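A minimal sketch of the allocation rule, in plain Python. The helper name `num_epilogue_buffers` and the use of `max` are assumptions for illustration only. The kernel code itself is not reproduced on this page, and the sketch further assumes the buffer count would otherwise equal NUM_MMA_GROUPS.

```python
def num_epilogue_buffers(num_mma_groups: int, min_buffers: int = 2) -> int:
    """Hypothetical helper: one buffer per MMA group, but never fewer than two,
    so that a buffer index taken modulo 2 always refers to an allocated buffer."""
    return max(num_mma_groups, min_buffers)


if __name__ == "__main__":
    for groups in (1, 2, 4):
        print(groups, num_epilogue_buffers(groups))
```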
The interleaved epilogue path (used by the remaining 16/48 shapes)
already had this optimization via its two-group interleaving pattern.
Also disable heuristic config selection by default, falling back to
autotuning.
On an internal L2 benchmark suite (48 shapes, autotuning):
- Average TFLOPS: 713.1 -> 717.6 (+0.63%)
- Average speedup vs aten: 0.899 -> 0.903
- Biggest wins on bandwidth-bound shapes (small N/K):
  - (1142784, 256, 256): +9.0%
  - (1060571, 512, 512): +8.7%
  - (3159809, 128, 128): +7.6%
  - (589824, 256, 256): +7.0%
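A rough way to see why these shapes are bandwidth-bound is to estimate their arithmetic intensity. The sketch below is an illustration under stated assumptions, not part of the benchmark: it assumes the tuples are (M, N, K), that elements are 2 bytes wide (for example bf16), and that each operand is read once and the output written once. The function name `arithmetic_intensity` is hypothetical.

```python
def arithmetic_intensity(m, n, k, bytes_per_element=2):
    """FLOPs per byte of ideal memory traffic for an (M, N, K) GEMM.

    Assumes A (M x K) and B (K x N) are each read once and C (M x N) is
    written once; real kernels may move more data than this lower bound.
    """
    flops = 2 * m * n * k
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved


if __name__ == "__main__":
    shapes = [
        (1142784, 256, 256),
        (1060571, 512, 512),
        (3159809, 128, 128),
        (589824, 256, 256),
        (8192, 8192, 8192),  # square shape, added only for contrast
    ]
    for shape in shapes:
        print(shape, round(arithmetic_intensity(*shape), 1))
```

For very large M with N = K, the intensity approaches N/2 FLOPs per byte, and the output write accounts for about half of the traffic. That is consistent with the epilogue store path mattering most on these shapes.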
The fallback to full autotuning is temporary, pending work to stabilize the heuristics.
Differential Revision: D950743211
Parent 638e25e, commit 68752b6
1 file changed, 6 insertions(+), 6 deletions(-)

Diff hunks (line content not preserved):
- original lines 706-716: 3 lines removed, 2 added
- original lines 943-954: 2 lines removed, 3 added
- original lines 1139-1145: 1 line removed, 1 added