[BUG FIX] Improve error reporting and occupancy in benchmarks by LoserCheems · Pull Request #251 · HKUSTDial/flash-sparse-attention

LoserCheems · 2026-03-18T17:26:06Z

Summary

This update enhances error reporting in benchmark tests and improves launch configuration logic for better occupancy in forward and sparse kernels.

Root Cause

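The benchmark scripts previously stored only `str(exc)` in the result's error message, which dropped the traceback and made failures hard to debug. Additionally, the split-KV launch path used `tile_m = 1` whenever GQA packing did not apply, which gave suboptimal occupancy for some configurations.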

Changes

Added detailed error reporting with traceback in benchmark tests.
Updated launch configuration logic to adjust tile sizes based on qheads_per_kvhead for improved occupancy.

Reproduction

Run benchmark tests with various configurations to observe error messages and occupancy metrics.

Tests

Validated changes by running existing benchmark tests and confirming improved error output and occupancy metrics.

Compatibility

No backward compatibility issues identified.

Checklist

Linked issue provided
Adds or updates tests
Updates docs if needed
No perf regressions

Copilot

Pull request overview

This PR improves benchmark failure diagnostics by including full Python tracebacks in error messages, and updates Triton forward-kernel launch configuration heuristics intended to improve occupancy (especially for split-KV / decode-like cases).

Changes:

Include traceback details in BenchmarkResult.error_message and print multi-line failure output in benchmark scripts.
Adjust split-KV TILE_M selection based on qheads_per_kvhead and retune some SM90 (H100) forward launch configs.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| `tests/benchmark_forward.py` | Adds traceback capture to benchmark failures and prints full error details. |
| `tests/benchmark_decode.py` | Same traceback-enhanced error reporting for decode benchmarks. |
| `tests/benchmark_backward.py` | Same traceback-enhanced error reporting for backward benchmarks. |
| `flash_sparse_attn/ops/triton/launch_template.py` | Updates forward (dense/sparse/gated) launch config heuristics, including split-KV `TILE_M` and SM90 configs. |

flash_sparse_attn/ops/triton/launch_template.py

        if is_split_kv:
-            if pack_gqa and qheads_per_kvhead > 1:
+            if pack_gqa and qheads_per_kvhead > 16:
                tile_m = triton.next_power_of_2(qheads_per_kvhead)
            else:
-                tile_m = 1
+                tile_m = 16


flash_sparse_attn/ops/triton/launch_template.py

                tile_m = triton.next_power_of_2(qheads_per_kvhead)
            else:
-                tile_m = 1
+                tile_m = 16


flash_sparse_attn/ops/triton/launch_template.py

        if is_split_kv:
-            if pack_gqa and qheads_per_kvhead > 1:
+            if pack_gqa and qheads_per_kvhead > 16:
                tile_m = triton.next_power_of_2(qheads_per_kvhead)
            else:
-                tile_m = 1
+                tile_m = 16
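
The effect of the three hunks above can be sketched in plain Python. This is only an illustration under stated assumptions: `split_kv_tile_m_old` and `split_kv_tile_m_new` are hypothetical helper names (the real logic sits inline in the `is_split_kv` branch of `launch_template.py`), and `next_power_of_2` is a pure-Python stand-in for `triton.next_power_of_2` so the snippet runs without Triton installed.

```python
def next_power_of_2(n: int) -> int:
    """Stand-in for triton.next_power_of_2: smallest power of two >= n (n >= 1)."""
    return 1 if n <= 1 else 1 << (n - 1).bit_length()


def split_kv_tile_m_old(pack_gqa: bool, qheads_per_kvhead: int) -> int:
    # Hypothetical helper mirroring the removed lines of the diff.
    if pack_gqa and qheads_per_kvhead > 1:
        return next_power_of_2(qheads_per_kvhead)
    return 1


def split_kv_tile_m_new(pack_gqa: bool, qheads_per_kvhead: int) -> int:
    # Hypothetical helper mirroring the added lines of the diff:
    # tile_m now has a floor of 16 and only grows beyond it for
    # packed GQA with more than 16 query heads per KV head.
    if pack_gqa and qheads_per_kvhead > 16:
        return next_power_of_2(qheads_per_kvhead)
    return 16


if __name__ == "__main__":
    for pack_gqa, ratio in [(False, 1), (True, 4), (True, 8), (True, 16), (True, 24)]:
        old = split_kv_tile_m_old(pack_gqa, ratio)
        new = split_kv_tile_m_new(pack_gqa, ratio)
        print(f"pack_gqa={pack_gqa!s:5} qheads_per_kvhead={ratio:2d} old={old:2d} new={new:2d}")
```

Under this reading, the only behavioral change is for split-KV launches that previously resolved to a tile smaller than 16; ratios above 16 with `pack_gqa` enabled still round up to the next power of two as before.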


tests/benchmark_forward.py

@@ -224,7 +226,7 @@ def run_benchmark(cfg: BenchmarkConfig) -> BenchmarkResult:
            triton_gated_tflops=None,
            fa_dense_tflops=None,
            cudnn_dense_tflops=None,
-            error_message=str(exc),
+            error_message=full_error,


tests/benchmark_decode.py

@@ -221,7 +223,7 @@ def run_benchmark(cfg: BenchmarkConfig) -> BenchmarkResult:
            triton_gated_tflops=None,
            fa_dense_tflops=None,
            cudnn_dense_tflops=None,
-            error_message=str(exc),
+            error_message=full_error,


tests/benchmark_backward.py

@@ -281,7 +283,7 @@ def run_benchmark(cfg: BenchmarkConfig) -> BenchmarkResult:
            triton_gated_tflops=None,
            fa_dense_tflops=None,
            cudnn_dense_tflops=None,
-            error_message=str(exc),
+            error_message=full_error,
        )
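
The hunks above only show `str(exc)` being replaced by `full_error`; how `full_error` is built is not visible in the diff. A minimal sketch of one common way to do it, assuming the standard-library `traceback` module is used, could look like the following. The trimmed-down `BenchmarkResult`, the `run_benchmark` signature, and `failing_kernel` are hypothetical stand-ins for illustration, not the repository's actual definitions.

```python
import traceback
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class BenchmarkResult:
    # Hypothetical, reduced version of the benchmark result type.
    tflops: Optional[float] = None
    error_message: Optional[str] = None


def run_benchmark(fn: Callable[[], float]) -> BenchmarkResult:
    try:
        return BenchmarkResult(tflops=fn())
    except Exception as exc:
        # Assumption: combine the short message with the formatted traceback
        # so the failing frame is visible in the benchmark report.
        full_error = f"{type(exc).__name__}: {exc}\n{traceback.format_exc()}"
        return BenchmarkResult(error_message=full_error)


def failing_kernel() -> float:
    # Hypothetical workload that always fails, to exercise the error path.
    raise ValueError("bad tile size")


result = run_benchmark(failing_kernel)
if result.error_message is not None:
    # Print the multi-line error as-is rather than collapsing it to one line.
    print("Benchmark failed:")
    print(result.error_message)
```

Storing the full string in `error_message` keeps the result type unchanged, so callers that only check whether the field is set should keep working; only code that prints it needs to handle multiple lines.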


LoserCheems added 2 commits March 19, 2026 00:05

Add detailed error reporting in benchmark tests with traceback

1e7d6d6

Update launch configuration logic for forward and sparse kernels to improve occupancy

0424884

Copilot AI review requested due to automatic review settings March 18, 2026 17:26

LoserCheems merged commit 26fe0af into main Mar 18, 2026
3 checks passed

Copilot started reviewing on behalf of LoserCheems March 18, 2026 17:26

Copilot AI reviewed Mar 18, 2026

LoserCheems merged 2 commits into main from optim_triton_version
