
Commit 68752b6

htyu authored and facebook-github-bot committed
[TLX] Multi-buffer epilogue TMA stores in Blackwell GEMM
Summary: Use double-buffering for epilogue TMA stores on the non-interleaved path (used by 32 of the 48 shapes in benchmarks). Instead of using a single SMEM buffer per MMA group and waiting for all stores to complete (wait(0)), alternate between two SMEM buffers and wait for all-but-one (wait(1)). The buffer index is computed as (group_id * EPILOGUE_SUBTILE + slice_id) % 2 to avoid collisions across MMA group boundaries. At least 2 SMEM epilogue buffers are now allocated so multi-buffering works even when NUM_MMA_GROUPS == 1. The interleaved epilogue path (used by the remaining 16 of the 48 shapes) already had this optimization via its two-group interleaving pattern.

Also disable heuristic config selection by default, falling back to full autotuning while the heuristics are being stabilized.

On an internal L2 benchmark suite (48 shapes, autotuning):
- Average TFLOPS: 713.1 -> 717.6 (+0.63%)
- Average speedup vs aten: 0.899 -> 0.903
- Biggest wins on bandwidth-bound shapes (small N/K):
  - (1142784, 256, 256): +9.0%
  - (1060571, 512, 512): +8.7%
  - (3159809, 128, 128): +7.6%
  - (589824, 256, 256): +7.0%

Differential Revision: D95074321
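The buffer-index formula above alternates between the two SMEM buffers on every slice, including across MMA group boundaries, because the linear index group_id * EPILOGUE_SUBTILE + slice_id advances by exactly one per slice. A minimal Python sketch (not the TLX kernel itself; the function name is illustrative) checks this property:

```python
def epilogue_buffer_sequence(num_groups: int, epilogue_subtile: int) -> list[int]:
    """SMEM buffer index chosen for each epilogue slice, in issue order.

    Mirrors the commit's formula: (group_id * EPILOGUE_SUBTILE + slice_id) % 2.
    """
    return [
        (group_id * epilogue_subtile + slice_id) % 2
        for group_id in range(num_groups)
        for slice_id in range(epilogue_subtile)
    ]

# The linear index increments by 1 each slice, so consecutive slices
# always land in different buffers -- no collision at group boundaries.
seq = epilogue_buffer_sequence(num_groups=4, epilogue_subtile=2)
print(seq)  # -> [0, 1, 0, 1, 0, 1, 0, 1]
```

This holds for any EPILOGUE_SUBTILE, odd or even, which is why the formula is safe regardless of how many slices each MMA group emits.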
1 parent 638e25e commit 68752b6

File tree

1 file changed: 6 additions, 6 deletions


third_party/tlx/tutorials/blackwell_gemm_ws.py

Lines changed: 6 additions & 6 deletions
@@ -706,11 +706,10 @@ def _process_tile_epilogue_inner(
                 [BLOCK_M_SPLIT, slice_size],
             )
             result = tlx.local_load(acc_tmem_subslice)
-            # Signal MMA consumer after each slice
             tlx.barrier_arrive(tmem_empty_bars[buf_idx], 1)
             c = result.to(tlx.dtype_of(c_desc))
-            c_smem = c_smem_buffers[group_id]
-            tlx.async_descriptor_store_wait(0)
+            c_smem = c_smem_buffers[(group_id * EPILOGUE_SUBTILE + slice_id) % 2]
+            tlx.async_descriptor_store_wait(1)
             tlx.local_store(c_smem, c)
             tlx.fence_async_shared()
             tlx.async_descriptor_store(c_desc, c_smem, [offs_am, offs_bn + slice_id * slice_size], store_reduce=STORE_REDUCE, eviction_policy="evict_first")
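The switch from wait(0) to wait(1) is what enables overlap: the producer no longer drains every in-flight store before reusing a buffer, only all but the most recent one. A small simulation (a hypothetical model, not TLX code; it assumes stores complete in FIFO order) shows why wait(1) is safe with two alternating buffers while a looser wait would overwrite a buffer still being read:

```python
from collections import deque


def simulate(num_iters: int, wait_outstanding: int) -> int:
    """Model the store/wait loop; return count of unsafe buffer overwrites.

    Each iteration: wait until at most `wait_outstanding` async stores are
    pending, write the next double-buffer slot, then issue a store from it.
    """
    in_flight = deque()  # buffer indices with pending async stores (FIFO)
    violations = 0
    for i in range(num_iters):
        buf = i % 2  # two-buffer rotation
        while len(in_flight) > wait_outstanding:
            in_flight.popleft()  # oldest store completes
        if buf in in_flight:
            violations += 1  # overwrote a buffer a store is still reading
        in_flight.append(buf)  # issue async store from buf
    return violations


# wait(1): the single allowed in-flight store is always from the *other*
# buffer, so reuse is safe while one store overlaps with the next write.
print(simulate(num_iters=8, wait_outstanding=1))  # -> 0
```

With wait_outstanding=2 (or more) the model reports violations, matching the intuition that two buffers support exactly one overlapped store.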
@@ -943,12 +942,13 @@ def matmul_kernel_tma_ws_blackwell(
         tlx.storage_kind.tmem,
     )

-    # Allocate SMEM buffer for epilogue TMA store (one per MMA group)
+    # Allocate SMEM buffers for epilogue TMA store (at least 2 for multi-buffering)
+    NUM_EPILOGUE_SMEM_BUFFERS: tl.constexpr = NUM_MMA_GROUPS if NUM_MMA_GROUPS > 2 else 2
     slice_size: tl.constexpr = BLOCK_SIZE_N // EPILOGUE_SUBTILE
     c_smem_buffers = tlx.local_alloc(
         (BLOCK_M_SPLIT, slice_size),
         tlx.dtype_of(c_desc),
-        NUM_MMA_GROUPS,
+        NUM_EPILOGUE_SMEM_BUFFERS,
     )

     # CTA pairs are placed along M dim
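The allocation rule above keeps one buffer per MMA group but never drops below two, so the two-buffer rotation still has a second slot when NUM_MMA_GROUPS == 1 or 2. A plain-Python restatement of the constexpr expression (an illustrative helper, not part of the kernel):

```python
def num_epilogue_smem_buffers(num_mma_groups: int) -> int:
    """At least 2 epilogue SMEM buffers, or one per MMA group if more."""
    return num_mma_groups if num_mma_groups > 2 else 2


# 1 group -> 2 buffers (multi-buffering still possible), 4 groups -> 4 buffers.
print([num_epilogue_smem_buffers(g) for g in (1, 2, 3, 4)])  # -> [2, 2, 3, 4]
```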
@@ -1139,7 +1139,7 @@ def matmul_kernel_tma_ws_blackwell(
         tile_id += NUM_SMS


-def matmul(a, b, config=None, use_heuristic=True):
+def matmul(a, b, config=None, use_heuristic=False):
     """Matrix multiplication using TLX GEMM kernel.

     Args:
