[TLX] Multi-buffer epilogue TMA stores in Blackwell GEMM#1028
Closed
htyu wants to merge 1 commit into facebookexperimental:main
htyu added a commit to htyu/triton-1 that referenced this pull request on Mar 3, 2026.
htyu added a commit to htyu/triton-1 that referenced this pull request on Mar 4, 2026.
This pull request has been merged in 4f2f0c0.
Summary:
Use double-buffering for epilogue TMA stores on the non-interleaved
path (used by 32/48 shapes in benchmarks). Instead of using a single
SMEM buffer per MMA group and waiting for all stores to complete
(wait(0)), alternate between two SMEM buffers and wait for all-but-one
(wait(1)). The buffer index is computed as
(group_id * EPILOGUE_SUBTILE + slice_id) % 2 to avoid collisions
across MMA group boundaries.
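The indexing scheme above can be illustrated with a small host-side sketch. The names `EPILOGUE_SUBTILE` and the `(group_id * EPILOGUE_SUBTILE + slice_id) % 2` formula come from this summary; the loop structure, values, and helper function are illustrative, not the actual kernel code:

```python
# Illustrative sketch of the double-buffered epilogue index scheme.
# EPILOGUE_SUBTILE and the buffer-index formula follow the PR summary;
# the rest (loop shape, constants) is hypothetical.

EPILOGUE_SUBTILE = 3   # slices per MMA group (odd on purpose, see below)
NUM_MMA_GROUPS = 4

def store_order():
    """Yield (group_id, slice_id, buf_idx) in epilogue emission order."""
    for group_id in range(NUM_MMA_GROUPS):
        for slice_id in range(EPILOGUE_SUBTILE):
            # This is the global slice index mod 2, so consecutive stores
            # always alternate buffers, even across MMA group boundaries.
            buf_idx = (group_id * EPILOGUE_SUBTILE + slice_id) % 2
            yield group_id, slice_id, buf_idx

seq = [b for _, _, b in store_order()]
# No two consecutive stores reuse the same SMEM buffer, which is what
# makes it safe to wait for all-but-one outstanding store (wait(1))
# instead of draining everything (wait(0)) before overwriting a buffer.
assert all(a != b for a, b in zip(seq, seq[1:]))

# By contrast, a group-local index (slice_id % 2) would collide at group
# boundaries whenever EPILOGUE_SUBTILE is odd:
naive = [s % 2 for _ in range(NUM_MMA_GROUPS) for s in range(EPILOGUE_SUBTILE)]
assert any(a == b for a, b in zip(naive, naive[1:]))
```

The second assertion shows why `group_id` enters the formula: dropping it reintroduces back-to-back reuse of the same buffer at group boundaries.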
Also ensure at least 2 SMEM epilogue buffers are allocated so
multi-buffering works even when NUM_MMA_GROUPS == 1.
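A minimal sketch of that allocation rule, assuming the buffer count would otherwise scale with `NUM_MMA_GROUPS` (the variable name and clamping expression are hypothetical):

```python
# Hypothetical allocation rule: reserve at least two SMEM epilogue
# buffers so double-buffering still works when NUM_MMA_GROUPS == 1.
NUM_MMA_GROUPS = 1
num_epilogue_buffers = max(2, NUM_MMA_GROUPS)
assert num_epilogue_buffers >= 2  # wait(1) needs a second buffer to rotate into
```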
The interleaved epilogue path (used by the remaining 16/48 shapes)
already had this optimization via its two-group interleaving pattern.
Also disable heuristic config selection by default, falling back to
full autotuning while the heuristics are being stabilized.
On an internal L2 benchmark suite (48 shapes, autotuning):
- Average TFLOPS: 713.1 -> 717.6 (+0.63%)
- Average speedup vs aten: 0.899 -> 0.903
- Biggest wins on bandwidth-bound shapes (small N/K):
  (1142784, 256, 256): +9.0%, (1060571, 512, 512): +8.7%,
  (3159809, 128, 128): +7.6%, (589824, 256, 256): +7.0%
Differential Revision: D95074321