[TRTLLM-11289][fix] Fix MMA accumulation bug in BF16 dense GEMM kernel

peaceh-nv · peaceh-nv · commit 8454f55510ba · 2026-03-10T00:01:26.000-07:00
When mma_inst_tile_k > 1, cute.gemm() generates multiple sub-MMA
instructions that all share the same ACCUMULATE flag. With
ACCUMULATE=False on the first K tile, every sub-MMA cleared the
accumulator, so only the last sub-MMA's result survived. Each output
tile therefore lost the contribution of
(mma_inst_tile_k - 1) * mma_inst_shape_k elements along K.

This caused GSM8K accuracy to drop from 64.7% to 28.5%.

Fix by adding an inner kblock loop that iterates sub-MMA instructions
individually and sets ACCUMULATE=True after the first cute.gemm() call,
matching the pattern used by blockscaled_contiguous_grouped_gemm.py.
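
The flag behavior described above can be reproduced with a toy scalar
model. This is only a sketch in plain Python, not the kernel code: mma(),
gemm_buggy(), gemm_fixed(), and the tile sizes are hypothetical names and
values chosen for illustration, and each "MMA" is reduced to a dot
product over one kblock.

```python
def mma(acc, a_blk, b_blk, accumulate):
    """Toy stand-in for one sub-MMA: overwrite acc unless accumulate is set."""
    partial = sum(x * y for x, y in zip(a_blk, b_blk))
    return acc + partial if accumulate else partial


def gemm_buggy(a_tiles, b_tiles):
    """One ACCUMULATE value per K tile, shared by every kblock in that tile."""
    acc = 0.0
    for k_tile, (a_tile, b_tile) in enumerate(zip(a_tiles, b_tiles)):
        accumulate = k_tile != 0  # False for ALL kblocks of the first K tile
        for a_blk, b_blk in zip(a_tile, b_tile):
            acc = mma(acc, a_blk, b_blk, accumulate)
    return acc


def gemm_fixed(a_tiles, b_tiles):
    """Clear once per output tile, then accumulate after the first sub-MMA."""
    acc = 0.0
    accumulate = False  # reset for each new output tile
    for a_tile, b_tile in zip(a_tiles, b_tiles):
        for a_blk, b_blk in zip(a_tile, b_tile):
            acc = mma(acc, a_blk, b_blk, accumulate)
            accumulate = True  # every later sub-MMA adds to the accumulator
    return acc


# Hypothetical sizes: 2 K tiles, 4 kblocks per tile, 2 K elements per kblock.
K_TILE_CNT, MMA_INST_TILE_K, MMA_INST_SHAPE_K = 2, 4, 2
a_tiles = [[[1.0] * MMA_INST_SHAPE_K for _ in range(MMA_INST_TILE_K)]
           for _ in range(K_TILE_CNT)]
b_tiles = a_tiles  # all-ones operands, so the result counts K elements

reference = float(K_TILE_CNT * MMA_INST_TILE_K * MMA_INST_SHAPE_K)
buggy = gemm_buggy(a_tiles, b_tiles)
fixed = gemm_fixed(a_tiles, b_tiles)
lost = (MMA_INST_TILE_K - 1) * MMA_INST_SHAPE_K

print(f"reference={reference} fixed={fixed} buggy={buggy} lost={lost}")
```

With all-ones operands the result counts how many K elements reached the
accumulator, so the gap between the fixed and buggy results equals
(mma_inst_tile_k - 1) * mma_inst_shape_k for these toy sizes.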

GSM8K accuracy restored to 64.86% (reference: 64.74%).

Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>
diff --git a/tensorrt_llm/_torch/cute_dsl_kernels/blackwell/dense_gemm_persistent.py b/tensorrt_llm/_torch/cute_dsl_kernels/blackwell/dense_gemm_persistent.py
@@ -598,15 +598,27 @@ class SharedStorage:
                 if is_leader_cta:
                     acc_pipeline.producer_acquire(acc_producer_state)
 
+                # Reset ACCUMULATE for each new output tile
+                tiled_mma.set(tcgen05.Field.ACCUMULATE, False)
+
                 for k_tile in range(k_tile_cnt):
                     if is_leader_cta:
                         handle = ab_consumer.wait_and_advance(
                             peek_ab_full_status)
 
-                        tiled_mma.set(tcgen05.Field.ACCUMULATE, k_tile != 0)
-                        tile_crd = (None, None, None, handle.index)
-                        cute.gemm(tiled_mma, tCtAcc, tCrA[tile_crd],
-                                  tCrB[tile_crd], tCtAcc)
+                        # Inner loop over kblocks within each K tile.
+                        # Set ACCUMULATE=True after first gemm call to
+                        # avoid clearing the accumulator on each sub-MMA.
+                        num_kblocks = cute.size(tCrA, mode=[2])
+                        for kblock_idx in cutlass.range(
+                                num_kblocks, unroll_full=True):
+                            kblock_crd = (None, None, kblock_idx,
+                                          handle.index)
+                            cute.gemm(tiled_mma, tCtAcc,
+                                      tCrA[kblock_crd],
+                                      tCrB[kblock_crd], tCtAcc)
+                            tiled_mma.set(
+                                tcgen05.Field.ACCUMULATE, True)
 
                         handle.release()