Commit 71562ec

[FlashAttn Backward] Change the config of flex attn bwd kernel (#5152)
Change num_warps and num_stages to 16 and 3, which gives better compile-time and runtime performance. Signed-off-by: Lu,Chengjun <[email protected]>
1 parent 9290e9a · commit 71562ec

File tree

1 file changed: +1 −1 lines

benchmarks/triton_kernels_benchmark/flash_attention_benchmark.py

Lines changed: 1 addition & 1 deletion
@@ -508,7 +508,7 @@ def backward(ctx, do):
         dv = torch.empty_like(v)
         BATCH, N_HEAD, N_CTX = q.shape[:3]
         PRE_BLOCK = 128
-        NUM_WARPS, NUM_STAGES = 4, 5
+        NUM_WARPS, NUM_STAGES = 16, 3
         BLOCK_M1, BLOCK_N1, BLOCK_M2, BLOCK_N2 = 32, 128, 128, 32
         BLK_SLICE_FACTOR = 2
         RCP_LN2 = 1.4426950408889634  # = 1.0 / ln(2)
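For context, Triton JIT kernels accept these values as launch meta-parameters, e.g. `kernel[grid](..., num_warps=NUM_WARPS, num_stages=NUM_STAGES)`. A minimal, dependency-free sketch of the new configuration (the `launch_config` helper is hypothetical, for illustration only; the real launch lives in `flash_attention_benchmark.py`):

```python
# Values introduced by this commit (previously 4 and 5).
NUM_WARPS, NUM_STAGES = 16, 3

def launch_config(num_warps: int, num_stages: int) -> dict:
    """Hypothetical helper: collect the meta-parameters a Triton kernel
    launch would receive as keyword arguments."""
    return {"num_warps": num_warps, "num_stages": num_stages}

config = launch_config(NUM_WARPS, NUM_STAGES)
print(config)  # {'num_warps': 16, 'num_stages': 3}
```

Roughly speaking, `num_warps` sets the number of warps per program instance and `num_stages` the software-pipelining depth; dropping from 5 to 3 stages can offset the extra register and shared-memory pressure that the wider 16-warp configuration introduces.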
