[TLX] Multi-buffer epilogue TMA stores in Blackwell GEMM#1028
Closed
htyu wants to merge 1 commit into facebookexperimental:main
htyu added a commit to htyu/triton-1 that referenced this pull request on Mar 3, 2026.
htyu added a commit to htyu/triton-1 that referenced this pull request on Mar 4, 2026.
This pull request has been merged in 4f2f0c0.
Summary:
Use double-buffering for epilogue TMA stores on the non-interleaved
path (used by 32/48 shapes in benchmarks). Instead of using a single
SMEM buffer per MMA group and waiting for all stores to complete
(wait(0)), alternate between two SMEM buffers and wait for all-but-one
(wait(1)). The buffer index is computed as
(group_id * EPILOGUE_SUBTILE + slice_id) % 2 to avoid collisions
across MMA group boundaries.
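The indexing scheme above can be illustrated with a small host-side sketch. The names `EPILOGUE_SUBTILE` and the `(group_id * EPILOGUE_SUBTILE + slice_id) % 2` formula come from this summary; the loop structure, values, and helper function are illustrative, not the actual kernel code:

```python
# Illustrative sketch of the double-buffered epilogue index scheme.
# EPILOGUE_SUBTILE and the buffer-index formula follow the PR summary;
# the rest (loop shape, constants) is hypothetical.

EPILOGUE_SUBTILE = 3   # slices per MMA group (odd on purpose, see below)
NUM_MMA_GROUPS = 4

def store_order():
    """Yield (group_id, slice_id, buf_idx) in epilogue emission order."""
    for group_id in range(NUM_MMA_GROUPS):
        for slice_id in range(EPILOGUE_SUBTILE):
            # This is the global slice index mod 2, so consecutive stores
            # always alternate buffers, even across MMA group boundaries.
            buf_idx = (group_id * EPILOGUE_SUBTILE + slice_id) % 2
            yield group_id, slice_id, buf_idx

seq = [b for _, _, b in store_order()]
# No two consecutive stores reuse the same SMEM buffer, which is what
# makes it safe to wait for all-but-one outstanding store (wait(1))
# instead of draining everything (wait(0)) before overwriting a buffer.
assert all(a != b for a, b in zip(seq, seq[1:]))

# By contrast, a group-local index (slice_id % 2) would collide at group
# boundaries whenever EPILOGUE_SUBTILE is odd:
naive = [s % 2 for _ in range(NUM_MMA_GROUPS) for s in range(EPILOGUE_SUBTILE)]
assert any(a == b for a, b in zip(naive, naive[1:]))
```

The second assertion shows why `group_id` enters the formula: dropping it reintroduces back-to-back reuse of the same buffer at group boundaries.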
Also ensure at least 2 SMEM epilogue buffers are allocated so
multi-buffering works even when NUM_MMA_GROUPS == 1.
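A minimal sketch of that allocation rule, assuming the buffer count would otherwise scale with `NUM_MMA_GROUPS` (the variable name and clamping expression are hypothetical):

```python
# Hypothetical allocation rule: reserve at least two SMEM epilogue
# buffers so double-buffering still works when NUM_MMA_GROUPS == 1.
NUM_MMA_GROUPS = 1
num_epilogue_buffers = max(2, NUM_MMA_GROUPS)
assert num_epilogue_buffers >= 2  # wait(1) needs a second buffer to rotate into
```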
The interleaved epilogue path (used by the remaining 16/48 shapes)
already had this optimization via its two-group interleaving pattern.
Also disable heuristic config selection by default, falling back to
full autotuning while the heuristics are being stabilized.
On an internal L2 benchmark suite (48 shapes, autotuning):
- Average TFLOPS: 713.1 -> 717.6 (+0.63%)
- Average speedup vs aten: 0.899 -> 0.903
- Biggest wins on bandwidth-bound shapes (small N/K):
  (1142784, 256, 256): +9.0%, (1060571, 512, 512): +8.7%,
  (3159809, 128, 128): +7.6%, (589824, 256, 256): +7.0%
Differential Revision: D95074321