Refactor masking logic in backward kernel functions #255
Conversation
Pull request overview
Refactors masking placement in Triton backward inner kernels so masking is applied immediately after score computation rather than later in the softmax/gradient path. This targets clearer control flow and aligns masking with downstream max/threshold computations used for block skipping.
Changes:
- Move `mask.apply_mask(...)` to directly follow the `acc_s` computation in the sparse backward inner kernel.
- Move `mask.apply_mask(...)` to directly follow the `acc_s` computation in the gated backward inner kernel.
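The relocation can be pictured with a simplified NumPy sketch of the inner-kernel flow. This is illustrative only, not the actual Triton code; the function names and the toy `mask_fn` are placeholders:

```python
import numpy as np

def inner_kernel_before(q, k, mask_fn):
    """Old ordering: block-max logic sees the raw, unmasked scores."""
    acc_s = q @ k.T              # score computation
    block_max = acc_s.max()      # block-max / skip logic on raw scores
    acc_s = mask_fn(acc_s)       # mask applied later, in the softmax path
    return acc_s, block_max

def inner_kernel_after(q, k, mask_fn):
    """New ordering: mask is applied immediately after score computation."""
    acc_s = q @ k.T              # score computation
    acc_s = mask_fn(acc_s)       # mask applied right after acc_s
    block_max = acc_s.max()      # block-max / skip logic on masked scores
    return acc_s, block_max

# Toy data: the second key position is masked out.
q = np.array([[1.0, 1.0]])
k = np.array([[2.0, 0.0], [0.0, 3.0]])
keep = np.array([[True, False]])
mask_fn = lambda s: np.where(keep, s, float("-inf"))

s_old, max_old = inner_kernel_before(q, k, mask_fn)
s_new, max_new = inner_kernel_after(q, k, mask_fn)
# The final masked scores agree, but the block max differs:
# max_old includes the masked-out score; max_new does not.
```

The final `acc_s` is identical either way; what moves is which scores the block-max sees, which is exactly the reviewers' concern below.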
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `flash_sparse_attn/ops/triton/flash_sparse_bwd.py` | Applies the attention mask earlier in `_bwd_inner_sparse_base_kernel`, before block-max/skip logic. |
| `flash_sparse_attn/ops/triton/flash_gated_bwd.py` | Applies the attention mask earlier in `_bwd_inner_gated_base_kernel`, before block-max/skip logic. |
```python
if IS_MASK:
    # Apply mask
    acc_s = mask.apply_mask(
        acc_s=acc_s,
        m_block=m_block,
```
Moving mask.apply_mask ahead of block_max_curr = tl.max(acc_s) changes the semantics of the block-level skip_softmax decision and block_max updates (masked-out scores can no longer influence the running max). This can change which blocks are skipped and potentially the resulting gradients, so the PR description (“No bugs were fixed…”) may be inaccurate; please confirm this behavioral change is intended and update the description / add a correctness note or test coverage accordingly.
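The behavioral change the reviewer describes can be made concrete with a small NumPy sketch. This is a hypothetical model of the skip decision, not the kernel itself; `skip_threshold` and `mask_before_max` are illustrative names:

```python
import numpy as np

NEG_INF = float("-inf")

def block_max_and_skip(acc_s, keep, skip_threshold, mask_before_max):
    """Toy model of the block-skip decision: returns (block_max, skip_softmax)
    depending on whether masking happens before or after the block max."""
    scores = acc_s.copy()
    if mask_before_max:
        scores = np.where(keep, scores, NEG_INF)  # new order: mask first
    block_max = scores.max()
    skip_softmax = block_max < skip_threshold     # hypothetical skip rule
    if not mask_before_max:
        scores = np.where(keep, scores, NEG_INF)  # old order: mask later
    return block_max, skip_softmax

# A block whose only large score falls on a masked-out position:
acc_s = np.array([[0.1, 9.0], [0.2, 0.3]])
keep = np.array([[True, False], [True, True]])

old = block_max_and_skip(acc_s, keep, 1.0, mask_before_max=False)
new = block_max_and_skip(acc_s, keep, 1.0, mask_before_max=True)
# Old order: block_max = 9.0, so the block is NOT skipped.
# New order: block_max = 0.3, so the block IS skipped.
```

Under this model the same block is processed in one ordering and skipped in the other, which is why the reordering can change which blocks run the softmax path.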
```python
if IS_MASK:
    # Apply mask
    acc_s = mask.apply_mask(
        acc_s=acc_s,
        m_block=m_block,
        n_block=n_block,
```
As in the sparse kernel, applying the mask before computing block_max_curr changes the skip_softmax gating behavior and running block_max for gated backward. This can affect which blocks run the softmax path and therefore can change outputs; please confirm this is intended and reflect it in the PR description / add a targeted correctness check.
This refactor improves the clarity of the masking logic in the backward kernel functions by applying the mask immediately after score computation, consolidating the mask application to reduce redundancy and improve maintainability. No bugs were fixed; the change is intended to clean up the code structure.