Commit fdc8338

misc: Various Updates to Attention Microbenchmark Suite (#1891)
## 📌 Description

This PR brings a host of updates to the attention microbenchmark suites in `flashinfer_benchmark.py`:

* `testBatchPrefillWithPagedKVCacheWrapper`:
  * `trtllm-gen-native`, which calls `flashinfer.prefill.trtllm_batch_context_with_kv_cache`, is added as a backend. It is disabled for batch size 1 due to various errors; an issue will be filed to track them.
  * The `trtllm-gen` and `trtllm-gen-native` backends can now be benchmarked with FP8.
  * `trtllm-gen` and `trtllm-gen-native` are now disabled for `causal=False`; previously the flag was silently ignored and the run used `causal=True`.
* `testBatchPrefillWithRaggedKVCacheWrapper`:
  * `trtllm-gen-native`, which calls `flashinfer.prefill.trtllm_ragged_attention_deepseek`, is added as a backend. It is disabled for batch size 1 due to various errors; an issue will be filed to track them.
* `testBatchMLAPagedAttentionWrapper`:
  * The `cutlass` backend can now be benchmarked (a usage sketch follows this description).
* Miscellaneous minor fixes, such as correcting the refcheck failure messages.

Examples:

```
# python3 flashinfer_benchmark.py --routine BatchMLAPagedAttentionWrapper --backends trtllm-gen-native fa2 cutlass --page_size 32 --batch_size 16 --s_qo 1 --s_kv 8192 --num_qo_heads 128 --num_kv_heads 128 --head_dim_ckv 512 --head_dim_kpe 64 --random_actual_seq_len --refcheck --q_dtype bfloat16 --kv_dtype bfloat16
[PERF] trtllm-gen-nati:: median time 0.031 ms; std 0.000 ms; achieved tflops 553.684 TFLOPs/sec; achieved tb_per_sec 4.960 TB/sec
[PERF] fa2            :: median time 0.091 ms; std 0.001 ms; achieved tflops 190.364 TFLOPs/sec; achieved tb_per_sec 1.705 TB/sec
[PERF] cutlass        :: median time 0.221 ms; std 0.000 ms; achieved tflops 78.342 TFLOPs/sec; achieved tb_per_sec 0.702 TB/sec

# python3 flashinfer_benchmark.py --routine BatchPrefillWithPagedKVCacheWrapper --backends fa2 cudnn trtllm-gen trtllm-gen-native --page_size 16 --batch_size 16 --s_qo 8192 --s_kv 8192 --num_qo_heads 64 --num_kv_heads 8 --head_dim_qk 128 --head_dim_vo 128 --random_actual_seq_len --causal --refcheck --q_dtype bfloat16 --kv_dtype bfloat16
[PERF] fa2            :: median time 17.342 ms; std 0.011 ms; achieved tflops 397.579 TFLOPs/sec; achieved tb_per_sec 0.161 TB/sec
[PERF] cudnn          :: median time 6.230 ms; std 0.032 ms; achieved tflops 1106.685 TFLOPs/sec; achieved tb_per_sec 0.449 TB/sec
[PERF] trtllm-gen     :: median time 7.181 ms; std 0.040 ms; achieved tflops 960.135 TFLOPs/sec; achieved tb_per_sec 0.390 TB/sec
[PERF] trtllm-gen-nati:: median time 6.453 ms; std 0.012 ms; achieved tflops 1068.434 TFLOPs/sec; achieved tb_per_sec 0.434 TB/sec

# python3 flashinfer_benchmark.py --routine BatchPrefillWithRaggedKVCacheWrapper --backends fa2 cutlass cudnn trtllm-gen-native --batch_size 16 --s_qo 8192 --s_kv 8192 --num_qo_heads 128 --num_kv_heads 128 --head_dim_qk 192 --head_dim_vo 128 --random_actual_seq_len --refcheck --causal --q_dtype bfloat16 --kv_dtype bfloat16
[PERF] fa2            :: median time 39.797 ms; std 0.023 ms; achieved tflops 433.137 TFLOPs/sec; achieved tb_per_sec 0.312 TB/sec
[PERF] cutlass        :: median time 18.509 ms; std 0.348 ms; achieved tflops 931.281 TFLOPs/sec; achieved tb_per_sec 0.672 TB/sec
[PERF] cudnn          :: median time 14.778 ms; std 0.336 ms; achieved tflops 1166.391 TFLOPs/sec; achieved tb_per_sec 0.841 TB/sec
[PERF] trtllm-gen-nati:: median time 14.339 ms; std 0.291 ms; achieved tflops 1202.155 TFLOPs/sec; achieved tb_per_sec 0.867 TB/sec
```

**No changes to library code.**
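For reviewers who want to see the new MLA backend in isolation: below is a minimal, hedged sketch of how the `cutlass` path is wired up, mirroring the keyword arguments used in `benchmarks/routines/attention.py`. The helper name is illustrative, tensor preparation is left to the caller, and `use_cuda_graph=False` is an assumption for the sketch; it is not part of the PR itself.

```python
# Illustrative sketch only (not benchmark code): the cutlass MLA path constructs the
# wrapper with backend="cutlass", skips plan(), and passes the KV lengths and page
# table directly to run(). Tensors (q_nope, q_pe, caches, block tables) are assumed
# to be prepared by the caller exactly as the benchmark does.
import flashinfer


def run_cutlass_mla(workspace_buffer, actual_seq_lens_kv, block_tables,
                    q_nope, q_pe, ckv_cache, kpe_cache):
    wrapper = flashinfer.mla.BatchMLAPagedAttentionWrapper(
        float_workspace_buffer=workspace_buffer,
        use_cuda_graph=False,  # assumption: eager mode for this sketch
        backend="cutlass",
    )
    # Unlike the fa2/fa3 paths, no plan() call is made for cutlass; the per-request
    # KV lengths and the page table go straight into run().
    return wrapper.run(
        q_nope,
        q_pe,
        ckv_cache,
        kpe_cache,
        kv_len=actual_seq_lens_kv.flatten(),
        page_table=block_tables,
        return_lse=False,
    )
```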
## 🔍 Related Issues

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes
1 parent d910f9a commit fdc8338

File tree

3 files changed: +125 −36 lines


benchmarks/README.md

Lines changed: 3 additions & 3 deletions
@@ -217,9 +217,9 @@ Legend:
 | Routine | 7.5 | 8.0 | 8.6 | 8.9 | 9.0 | 10.0 | 10.3 | 12.0 |
 |---------|-----|-----|-----|-----|-----|-------|-------|-------|
 | **BatchDecodeWithPagedKVCacheWrapper** | fa2 | fa2, fa2_tc, cudnn | fa2, fa2_tc, cudnn | fa2, fa2_tc, cudnn | fa2, fa2_tc, cudnn | fa2, fa2_tc, cudnn, trtllm-gen, trtllm-gen-native | fa2, fa2_tc, cudnn, trtllm-gen, trtllm-gen-native | fa2, fa2_tc, cudnn |
-| **BatchPrefillWithPagedKVCacheWrapper** | | fa2, cudnn | fa2, cudnn | fa2, cudnn | fa2, fa3, cudnn | fa2, cudnn, trtllm-gen | fa2, cudnn, trtllm-gen | fa2, cudnn |
-| **BatchPrefillWithRaggedKVCacheWrapper** | | fa2, cudnn | fa2, cudnn | fa2, cudnn | fa2, fa3, cudnn | fa2, cudnn, cutlass | fa2, cudnn, cutlass | fa2, cudnn |
-| **BatchMLAPagedAttentionWrapper** | | fa2 | fa2 | fa2 | fa2, fa3 | fa2, trtllm-gen-native | fa2, trtllm-gen-native | fa2 |
+| **BatchPrefillWithPagedKVCacheWrapper** | | fa2, cudnn | fa2, cudnn | fa2, cudnn | fa2, fa3, cudnn | fa2, cudnn, trtllm-gen, trtllm-gen-native | fa2, cudnn, trtllm-gen, trtllm-gen-native | fa2, cudnn |
+| **BatchPrefillWithRaggedKVCacheWrapper** | | fa2, cudnn | fa2, cudnn | fa2, cudnn | fa2, fa3, cudnn | fa2, cudnn, cutlass, trtllm-gen-native | fa2, cudnn, cutlass, trtllm-gen-native | fa2, cudnn |
+| **BatchMLAPagedAttentionWrapper** | | fa2 | fa2 | fa2 | fa2, fa3 | fa2, cutlass, trtllm-gen-native | fa2, cutlass, trtllm-gen-native | fa2 |
 | **gemm_fp8_nt_groupwise** | | | | | | cutlass | cutlass | |
 | **group_gemm_fp8_nt_groupwise** | | | | | | cutlass | cutlass | |
 | **bmm_fp8** | | | | cudnn, cublas | cudnn, cublas | cudnn, cublas, cutlass | cudnn, cublas, cutlass | cudnn, cublas |

benchmarks/routines/attention.py

Lines changed: 116 additions & 27 deletions
@@ -545,7 +545,7 @@ def run_backend_wrapper(backend):
             ) = is_close_stats(reference_output, tested_outputs[i], rtol, atol)
             if num_different_elements > 0:
                 print(
-                    f"[ERROR] Output tensor mismatch between backends {tested_backends[0]} and {tested_backends[i]}: "
+                    f"[ERROR] Output tensor mismatch between backends fa2 and {tested_backends[i]}: "
                     f"{num_different_elements} / {num_elements} ({num_different_elements_percentage:.2f}%) elements are different"
                 )
                 if not args.allow_output_mismatch:
@@ -689,14 +689,22 @@ def testBatchPrefillWithPagedKVCacheWrapper(args):
 
     if "trtllm-gen" in backends:
         remove_trtllm = False
-        if q_dtype in [torch.float8_e4m3fn, torch.float8_e5m2] or kv_dtype in [
-            torch.float8_e4m3fn,
-            torch.float8_e5m2,
-        ]:
-            print("[INFO] trtllm-gen backend does not support FP8. Skipping.")
+        if not causal:
+            print("[INFO] trtllm-gen backend currently requires causal = True")
             remove_trtllm = True
         if remove_trtllm:
             backends.remove("trtllm-gen")
+    if "trtllm-gen-native" in backends:
+        remove_trtllm_native = False
+        if batch_size == 1:
+            # TO-DO: trtllm-gen-native hits IMA on batch size 1. Investigate and fix.
+            print("[INFO] trtllm-gen-native backend currently requires batch size > 1")
+            remove_trtllm_native = True
+        if not causal:
+            print("[INFO] trtllm-gen-native backend currently requires causal = True")
+            remove_trtllm_native = True
+        if remove_trtllm_native:
+            backends.remove("trtllm-gen-native")
 
     if "cutlass" in backends:
         print("[INFO] CUTLASS backend does not support prefill. Skipping.")
@@ -1006,7 +1014,7 @@ def run_backend_wrapper(backend):
             ) = is_close_stats(reference_output, tested_outputs[i], rtol, atol)
             if num_different_elements > 0:
                 print(
-                    f"[ERROR] Output tensor mismatch between backends {tested_backends[0]} and {tested_backends[i]}: "
+                    f"[ERROR] Output tensor mismatch between backends fa2 and {tested_backends[i]}: "
                     f"{num_different_elements} / {num_elements} ({num_different_elements_percentage:.2f}%) elements are different"
                 )
                 if not args.allow_output_mismatch:
@@ -1129,6 +1137,13 @@ def testBatchPrefillWithRaggedKVCacheWrapper(args):
 
     backends = filter_backends_by_compute_capability(backends, args.routine, device)
     # Check for backend-specific constraints
+    if "fa2" in backends:
+        remove_fa2 = False
+        if q_dtype in [torch.float8_e4m3fn, torch.float8_e5m2]:
+            print("[INFO] FA2 backend does not support FP8. Skipping.")
+            remove_fa2 = True
+        if remove_fa2:
+            backends.remove("fa2")
     if "cudnn" in backends:
         remove_cudnn = False
         if q_dtype in [torch.float8_e4m3fn, torch.float8_e5m2] or kv_dtype in [
@@ -1161,6 +1176,25 @@ def testBatchPrefillWithRaggedKVCacheWrapper(args):
             remove_trtllm = True
         if remove_trtllm:
             backends.remove("trtllm-gen")
+    if "trtllm-gen-native" in backends:
+        remove_trtllm_native = False
+        if q_dtype in [torch.float8_e4m3fn, torch.float8_e5m2] or kv_dtype in [
+            torch.float8_e4m3fn,
+            torch.float8_e5m2,
+        ]:
+            print("[INFO] trtllm-gen-native backend does not support FP8. Skipping.")
+            remove_trtllm_native = True
+        if batch_size == 1:
+            # TO-DO: trtllm-gen-native hits IMA on batch size 1. Investigate and fix.
+            print("[INFO] trtllm-gen-native backend currently requires batch size > 1")
+            remove_trtllm_native = True
+        if not (head_dim_qk == 192 and head_dim_vo == 128):
+            print(
+                "[INFO] trtllm-gen-native backend requires head_dim_qk == 192 and head_dim_vo == 128"
+            )
+            remove_trtllm_native = True
+        if remove_trtllm_native:
+            backends.remove("trtllm-gen-native")
 
     if len(backends) == 0:
         print("[ERROR] No backends to test. Exiting.")
@@ -1372,6 +1406,26 @@ def run_backend_wrapper(backend):
                 batch_offsets_stats=batch_offsets_stats,
                 is_cuda_graph_compatible=True,
             )[0]
+        elif backend == "trtllm-gen-native":
+            return flashinfer.prefill.trtllm_ragged_attention_deepseek(
+                query=q,
+                key=k,
+                value=v,
+                workspace_buffer=workspace_buffer,
+                seq_lens=actual_seq_lens_kv_device,
+                max_q_len=s_qo,
+                max_kv_len=s_kv,
+                bmm1_scale=scale,
+                bmm2_scale=1.0,
+                o_sf_scale=-1,
+                batch_size=batch_size,
+                window_left=-1,
+                cum_seq_lens_q=qo_indptr,
+                cum_seq_lens_kv=kv_indptr,
+                enable_pdl=False,
+                is_causal=causal,
+                return_lse=True,
+            )[0]
         else:
             print(f"[ERROR] Backend {backend} not supported")
             return res
@@ -1416,7 +1470,7 @@ def run_backend_wrapper(backend):
             ) = is_close_stats(reference_output, tested_outputs[i], rtol, atol)
             if num_different_elements > 0:
                 print(
-                    f"[ERROR] Output tensor mismatch between backends {tested_backends[0]} and {tested_backends[i]}: "
+                    f"[ERROR] Output tensor mismatch between backends fa2 and {tested_backends[i]}: "
                     f"{num_different_elements} / {num_elements} ({num_different_elements_percentage:.2f}%) elements are different"
                 )
                 if not args.allow_output_mismatch:
@@ -1484,7 +1538,7 @@ def run_backend_wrapper(backend):
 def testBatchMLAPagedAttentionWrapper(args):
     """
     Test BatchMLAPagedAttentionWrapper and equivalent APIs.
-    Supports fa2. and trtllm-gen-native.
+    Supports fa2, fa3, cutlass, and trtllm-gen-native.
 
     This test:
     1. Creates paged query and key-value cache tensors
@@ -1565,6 +1619,30 @@ def testBatchMLAPagedAttentionWrapper(args):
             remove_fa3 = True
         if remove_fa3:
             backends.remove("fa3")
+    if "cutlass" in backends:
+        remove_cutlass = False
+        if page_size not in [32, 64]:
+            print(
+                "[INFO] Cutlass MLA backend only supports page size 32 or 64. Skipping."
+            )
+            remove_cutlass = True
+        if q_dtype in [torch.float8_e4m3fn, torch.float8_e5m2] or kv_dtype in [
+            torch.float8_e4m3fn,
+            torch.float8_e5m2,
+        ]:
+            print("[INFO] Cutlass MLA backend does not support FP8. Skipping.")
+            remove_cutlass = True
+        if remove_cutlass:
+            backends.remove("cutlass")
+    if "trtllm-gen-native" in backends:
+        remove_trtllm_native = False
+        if page_size not in [32, 64]:
+            print(
+                "[INFO] trtllm-gen-native backend only supports page size 32 or 64. Skipping."
+            )
+            remove_trtllm_native = True
+        if remove_trtllm_native:
+            backends.remove("trtllm-gen-native")
     if len(backends) == 0:
         print("[ERROR] No backends to test. Exiting.")
         return res
@@ -1629,7 +1707,7 @@ def testBatchMLAPagedAttentionWrapper(args):
         page_size,
         head_dim_kpe,
     )
-    kpe_cache = torch.randn(size=kpe_cache_shape, dtype=q_init_dtype, device=device)
+    kpe_cache = torch.randn(size=kpe_cache_shape, dtype=kv_init_dtype, device=device)
     kv_cache = torch.cat([ckv_cache, kpe_cache], dim=2)
 
     qo_indptr = torch.arange(0, batch_size + 1, device=device).int()
@@ -1657,7 +1735,7 @@
         device=device,
     )
 
-    sm_scale = 1.0 / ((head_dim_ckv + head_dim_kpe) ** 0.5)
+    sm_scale = 1.0 / ((128 + 64) ** 0.5)  # For DeepSeek-R1
     workspace_buffer = torch.empty(128 * 1024 * 1024, dtype=torch.int8, device=device)
 
     if args.verbose >= 2:
@@ -1674,7 +1752,7 @@
     # Create wrapper
     backend_wrappers = {}
    for backend in backends:
-        if backend in ["fa2", "fa3"]:
+        if backend in ["fa2", "fa3", "cutlass"]:
            backend_wrappers[backend] = flashinfer.mla.BatchMLAPagedAttentionWrapper(
                float_workspace_buffer=workspace_buffer,
                use_cuda_graph=is_cuda_graph_compatible,
@@ -1684,20 +1762,21 @@
                 kv_len_arr=actual_seq_lens_kv,
                 backend=backend,
             )
-            backend_wrappers[backend].plan(
-                qo_indptr=qo_indptr,
-                kv_indptr=kv_indptr,
-                kv_indices=kv_indices,
-                kv_len_arr=actual_seq_lens_kv,
-                num_heads=num_qo_heads,
-                head_dim_ckv=head_dim_ckv,
-                head_dim_kpe=head_dim_kpe,
-                page_size=page_size,
-                causal=causal,
-                sm_scale=sm_scale,
-                q_data_type=q_dtype,
-                kv_data_type=kv_dtype,
-            )
+            if backend != "cutlass":
+                backend_wrappers[backend].plan(
+                    qo_indptr=qo_indptr,
+                    kv_indptr=kv_indptr,
+                    kv_indices=kv_indices,
+                    kv_len_arr=actual_seq_lens_kv,
+                    num_heads=num_qo_heads,
+                    head_dim_ckv=head_dim_ckv,
+                    head_dim_kpe=head_dim_kpe,
+                    page_size=page_size,
+                    causal=causal,
+                    sm_scale=sm_scale,
+                    q_data_type=q_dtype,
+                    kv_data_type=kv_dtype,
+                )
 
     if q_dtype in [torch.float8_e4m3fn, torch.float8_e5m2]:
         q = q.to(q_dtype)
@@ -1713,6 +1792,16 @@ def run_backend_wrapper(backend):
             return backend_wrappers[backend].run(
                 q_nope, q_pe, ckv_cache, kpe_cache, return_lse=False
             )
+        elif backend == "cutlass":
+            return backend_wrappers[backend].run(
+                q_nope,
+                q_pe,
+                ckv_cache,
+                kpe_cache,
+                kv_len=actual_seq_lens_kv.flatten(),
+                page_table=block_tables,
+                return_lse=False,
+            )
         if backend == "trtllm-gen-native":
             return flashinfer.decode.trtllm_batch_decode_with_kv_cache_mla(
                 query=q.unsqueeze(1),
@@ -1767,7 +1856,7 @@
             ) = is_close_stats(reference_output, tested_outputs[i], rtol, atol)
             if num_different_elements > 0:
                 print(
-                    f"[ERROR] Output tensor mismatch between backends {tested_backends[0]} and {tested_backends[i]}: "
+                    f"[ERROR] Output tensor mismatch between backends fa2 and {tested_backends[i]}: "
                     f"{num_different_elements} / {num_elements} ({num_different_elements_percentage:.2f}%) elements are different"
                 )
                 if not args.allow_output_mismatch:
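To make the new ragged `trtllm-gen-native` path easier to review outside the benchmark harness, here is a hedged sketch of the call added in `run_backend_wrapper` above. The keyword arguments are copied from the diff; the helper name and the assumption that `q`/`k`/`v` are packed (ragged) tensors indexed by the cumulative indptrs are illustrative and not verified against the API documentation.

```python
# Hedged sketch, not benchmark code: direct call to the trtllm-gen-native ragged
# prefill path added above. Keyword arguments mirror the diff; the ragged/packed
# tensor layout is an assumption based on the indptr arguments the benchmark passes.
import flashinfer


def trtllm_native_ragged_prefill(q, k, v, qo_indptr, kv_indptr, seq_lens_kv,
                                 workspace_buffer, batch_size, s_qo, s_kv,
                                 scale, causal):
    result = flashinfer.prefill.trtllm_ragged_attention_deepseek(
        query=q,
        key=k,
        value=v,
        workspace_buffer=workspace_buffer,
        seq_lens=seq_lens_kv,          # per-request KV lengths on device
        max_q_len=s_qo,
        max_kv_len=s_kv,
        bmm1_scale=scale,              # softmax scale folded into the first GEMM
        bmm2_scale=1.0,
        o_sf_scale=-1,
        batch_size=batch_size,
        window_left=-1,                # no sliding window
        cum_seq_lens_q=qo_indptr,
        cum_seq_lens_kv=kv_indptr,
        enable_pdl=False,
        is_causal=causal,
        return_lse=True,
    )
    return result[0]  # the benchmark keeps only the first element (attention output)
```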

benchmarks/routines/flashinfer_benchmark_utils.py

Lines changed: 6 additions & 6 deletions
@@ -177,8 +177,8 @@ def dtype_str_to_torch_dtype(dtype_str):
         "8.6": ["fa2", "cudnn"],
         "8.9": ["fa2", "cudnn"],
         "9.0": ["fa2", "fa3", "cudnn"],
-        "10.0": ["fa2", "cudnn", "trtllm-gen"],
-        "10.3": ["fa2", "cudnn", "trtllm-gen"],
+        "10.0": ["fa2", "cudnn", "trtllm-gen", "trtllm-gen-native"],
+        "10.3": ["fa2", "cudnn", "trtllm-gen", "trtllm-gen-native"],
         "12.0": ["fa2", "cudnn"],
     },
     "BatchPrefillWithRaggedKVCacheWrapper": {
@@ -187,8 +187,8 @@ def dtype_str_to_torch_dtype(dtype_str):
         "8.6": ["fa2", "cudnn"],
         "8.9": ["fa2", "cudnn"],
         "9.0": ["fa2", "fa3", "cudnn"],
-        "10.0": ["fa2", "cudnn", "cutlass"],
-        "10.3": ["fa2", "cudnn", "cutlass"],
+        "10.0": ["fa2", "cudnn", "cutlass", "trtllm-gen-native"],
+        "10.3": ["fa2", "cudnn", "cutlass", "trtllm-gen-native"],
         "12.0": ["fa2", "cudnn"],
     },
     "BatchMLAPagedAttentionWrapper": {
@@ -197,8 +197,8 @@ def dtype_str_to_torch_dtype(dtype_str):
         "8.6": ["fa2"],
         "8.9": ["fa2"],
         "9.0": ["fa2", "fa3"],
-        "10.0": ["fa2", "trtllm-gen-native"],
-        "10.3": ["fa2", "trtllm-gen-native"],
+        "10.0": ["fa2", "cutlass", "trtllm-gen-native"],
+        "10.3": ["fa2", "cutlass", "trtllm-gen-native"],
         "12.0": ["fa2"],
     },
     # GEMM
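The per-compute-capability lists above are consumed by `filter_backends_by_compute_capability`, which the attention routines call before applying backend-specific checks. Its implementation is not part of this diff, so the following is only a hedged sketch of how such a filter might work; the dictionary name `ROUTINE_SUPPORTED_BACKENDS` and the function body are hypothetical.

```python
# Hypothetical sketch of how per-compute-capability backend lists like those above
# could be consumed. This is NOT the actual filter_backends_by_compute_capability
# from the benchmark utils; names and structure here are assumptions.
import torch

ROUTINE_SUPPORTED_BACKENDS = {  # hypothetical name; shape mirrors the dict in the diff
    "BatchMLAPagedAttentionWrapper": {
        "9.0": ["fa2", "fa3"],
        "10.0": ["fa2", "cutlass", "trtllm-gen-native"],
    },
}


def filter_backends_sketch(backends, routine, device):
    # Map the device's SM version (e.g. (10, 0)) to the string keys used above.
    major, minor = torch.cuda.get_device_capability(device)
    cc = f"{major}.{minor}"
    supported = ROUTINE_SUPPORTED_BACKENDS.get(routine, {}).get(cc, [])
    for backend in backends:
        if backend not in supported:
            print(f"[INFO] {backend} is not supported on SM {cc} for {routine}. Skipping.")
    return [b for b in backends if b in supported]
```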
