
[TLX] Multi-buffer epilogue TMA stores in Blackwell GEMM#1028

Closed
htyu wants to merge 1 commit into facebookexperimental:main from htyu:export-D95074321

Conversation


@htyu htyu commented Mar 3, 2026

Summary:
Use double-buffering for epilogue TMA stores on the non-interleaved
path (used by 32/48 shapes in benchmarks). Instead of using a single
SMEM buffer per MMA group and waiting for all stores to complete
(wait(0)), alternate between two SMEM buffers and wait for all-but-one
(wait(1)). The buffer index is computed as
(group_id * EPILOGUE_SUBTILE + slice_id) % 2 to avoid collisions
across MMA group boundaries.
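As a rough sketch of the indexing scheme described above (plain Python, with assumed values for EPILOGUE_SUBTILE and NUM_MMA_GROUPS; this is not the actual TLX kernel code): consecutive epilogue slices always land in alternating buffers, including across MMA group boundaries, which is what makes waiting for all-but-one outstanding store safe.

```python
# Hypothetical values; the real kernel derives these from its config.
EPILOGUE_SUBTILE = 2   # epilogue slices per MMA group (assumed)
NUM_MMA_GROUPS = 2     # MMA groups per program (assumed)
NUM_EPILOGUE_BUFS = max(NUM_MMA_GROUPS, 2)  # ensure >= 2 so double-buffering
                                            # works even when NUM_MMA_GROUPS == 1

def epilogue_buffer_index(group_id: int, slice_id: int) -> int:
    # Alternate between two SMEM buffers; including the group term keeps
    # the alternation consistent across MMA group boundaries.
    return (group_id * EPILOGUE_SUBTILE + slice_id) % 2

# Walk the epilogue slices in issue order and check that consecutive
# stores never reuse the same buffer, so wait(1) (all-but-one complete)
# guarantees the buffer about to be refilled is free.
order = [(g, s) for g in range(NUM_MMA_GROUPS) for s in range(EPILOGUE_SUBTILE)]
idxs = [epilogue_buffer_index(g, s) for g, s in order]
assert all(a != b for a, b in zip(idxs, idxs[1:]))
print(idxs)  # → [0, 1, 0, 1]
```

With a single buffer and wait(0), each TMA store must fully drain before the next slice can be written into SMEM; with two buffers and wait(1), filling buffer B overlaps the in-flight store from buffer A.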

Also ensure at least 2 SMEM epilogue buffers are allocated so
multi-buffering works even when NUM_MMA_GROUPS == 1.

The interleaved epilogue path (used by the remaining 16/48 shapes)
already had this optimization via its two-group interleaving pattern.

Also disable heuristic config selection by default, falling back to
autotuning.

On an internal L2 benchmark suite (48 shapes, autotuning):

  • Average TFLOPS: 713.1 -> 717.6 (+0.63%)
  • Average speedup vs aten: 0.899 -> 0.903
  • Biggest wins on bandwidth-bound shapes (small N/K):
    (1142784, 256, 256): +9.0%, (1060571, 512, 512): +8.7%,
    (3159809, 128, 128): +7.6%, (589824, 256, 256): +7.0%

Also falling back to full autotune while working on stabilizing the heuristics.

Differential Revision: D95074321

meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) on Mar 3, 2026

meta-codesync bot commented Mar 3, 2026

@htyu has exported this pull request. If you are a Meta employee, you can view the originating Diff in D95074321.

htyu added a commit to htyu/triton-1 that referenced this pull request Mar 3, 2026
…erimental#1028)

@htyu htyu force-pushed the export-D95074321 branch from 68752b6 to dd46ece Compare March 3, 2026 17:27
htyu added a commit to htyu/triton-1 that referenced this pull request Mar 4, 2026
…erimental#1028)

Reviewed By: levendlee
@htyu htyu force-pushed the export-D95074321 branch from dd46ece to 6fc4d8e Compare March 4, 2026 01:53

meta-codesync bot commented Mar 4, 2026

This pull request has been merged in 4f2f0c0.


Labels

CLA Signed (managed by the Meta Open Source bot), fb-exported, Merged, meta-exported
