
[FEATURE SUPPORT] Add Triton backward support#235

Merged
LoserCheems merged 12 commits into main from optim-triton-version
Mar 9, 2026
Conversation


@LoserCheems (Collaborator) commented Mar 8, 2026

Summary

This PR introduces end-to-end backward support for the Triton flash attention path, including:

  • Backward launch configuration selection by GPU architecture.
  • Backward grid helpers for main, preprocess, and postprocess kernels.
  • A backward preprocess kernel to compute dPsum, convert LSE to log2 space, and initialize the dQ accumulation buffer.
  • A backward core kernel to compute dQ, dK, and dV with support for causal/local masking, varlen inputs, and GQA accumulation behavior.
  • A backward postprocess kernel to scale and cast accumulated dQ to output dtype.

The goal is to make backward computation available in the same Triton stack as forward, with architecture-aware launch behavior and varlen-compatible memory layout.

Design

The implementation follows the same staged design as the reference cute pipeline:

  • Stage 1 (preprocess): prepare numerically stable intermediate tensors and reset dQ accumulation.
  • Stage 2 (main backward): iterate over N blocks, compute attention-gradient math, atomically accumulate dQ tiles, and produce dK/dV accumulators.
  • Stage 3 (postprocess): apply scale and cast dQ accumulation into final gradient tensor.
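For illustration, the grid sizing behind the three stages can be sketched in plain Python. The block sizes and the flattened batch*heads axis below are assumptions for the sketch, not the repo's actual grid helpers:

```python
def cdiv(a: int, b: int) -> int:
    # Ceiling division, the usual Triton grid-sizing primitive.
    return (a + b - 1) // b

def bwd_grids(batch: int, nheads: int, max_seqlen_q: int, max_seqlen_k: int,
              block_m: int = 64, block_n: int = 64):
    # Hypothetical layout: preprocess and postprocess tile over Q rows,
    # while the main kernel tiles over K/V columns so each program owns a
    # dK/dV tile and accumulates dQ atomically across N blocks.
    pre = (cdiv(max_seqlen_q, block_m), batch * nheads)
    main = (cdiv(max_seqlen_k, block_n), batch * nheads)
    post = (cdiv(max_seqlen_q, block_m), batch * nheads)
    return pre, main, post
```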

Key design choices:

  • Keep backward orchestration modular across three files for easier tuning and debugging.
  • Reuse shared seqlen/padded-offset helpers for fixed-length and varlen consistency.
  • Use architecture-based launch templates to avoid hardcoding one-size-fits-all kernel configs.
  • Use float32 accumulators for numerical stability in intermediate reductions.
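A minimal sketch of what architecture-based config selection looks like; the specific block sizes, warp counts, and stage counts below are illustrative placeholders, not the tuned values from the launch templates:

```python
def select_bwd_config(capability):
    # Map a CUDA compute capability (major, minor) to a backward launch
    # config. All numbers here are placeholder stand-ins for the tuned
    # per-architecture templates.
    major, _minor = capability
    if major >= 10:  # Blackwell-class (e.g. B200)
        return {"BLOCK_M": 128, "BLOCK_N": 128, "num_warps": 8, "num_stages": 3}
    if major == 9:   # Hopper-class (e.g. H100)
        return {"BLOCK_M": 128, "BLOCK_N": 64, "num_warps": 8, "num_stages": 3}
    # Ampere-class (e.g. A100) and older: smaller tiles, fewer warps.
    return {"BLOCK_M": 64, "BLOCK_N": 64, "num_warps": 4, "num_stages": 2}
```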

Alternatives considered:

  • A monolithic backward kernel without preprocess/postprocess splitting was rejected: it would be harder to maintain and harder to control numerically and to debug.
  • A device-agnostic fixed launch config was rejected because a single configuration risks poor occupancy and performance across Ampere-, Hopper-, and Blackwell-class GPUs.

Changes

New/updated functionality includes:

  • Added backward launch config API in launch templates.
  • Added backward grid builders for main, preprocess, and postprocess kernels.
  • Added Triton backward preprocess implementation.
  • Added Triton backward postprocess implementation.
  • Added Triton backward main kernel and Python entrypoints for:
  1. Fixed-length backward.
  2. Varlen backward with cu_seqlens and optional seqused.

Public behavior:

  • Backward path now exists in Triton backend and returns dq, dk, dv outputs for both fixed and varlen use cases.
  • Varlen API expects max sequence lengths for launch sizing.
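The cumulative-offset and max-length inputs that the varlen launch sizing expects can be derived as follows (the helper name is illustrative, not part of the API):

```python
def cu_seqlens_and_max(seqlens):
    # Build cumulative sequence offsets (cu_seqlens-style) and the max
    # sequence length that the varlen entrypoint needs for grid sizing.
    cu = [0]
    for s in seqlens:
        cu.append(cu[-1] + s)
    return cu, max(seqlens)
```

A fallback like `max(seqlens)` is essentially the automatic inference suggested as follow-up hardening in the implementation notes below.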

Implementation notes

  • dQ is accumulated in a float32 buffer and finalized in the postprocess stage to improve numerical stability.
  • dK is scaled by softmax_scale before final write, consistent with backward derivation.
  • GQA path uses atomic accumulation for dK/dV when multiple Q heads map to one KV head.
  • Current varlen path depends on provided max_seqlen_q/max_seqlen_k for grid sizing.
  • Follow-up hardening recommended: add automatic fallback inference of max sequence lengths when they are not explicitly provided.
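The preprocess/postprocess semantics described above can be written as a small NumPy reference; shapes and function names are illustrative, and plain float32 arrays stand in for the on-device accumulators:

```python
import numpy as np

def bwd_preprocess_ref(o, do, lse):
    # dPsum: per-row dot product of O and dO, computed in float32.
    dpsum = (o.astype(np.float32) * do.astype(np.float32)).sum(axis=-1)
    # LSE converted from natural-log to log2 space for exp2-based kernels.
    lse_log2 = lse.astype(np.float32) / np.float32(np.log(2.0))
    # dQ accumulator starts zeroed so atomic adds can run across N blocks.
    dq_acc = np.zeros(o.shape, dtype=np.float32)
    return dpsum, lse_log2, dq_acc

def bwd_postprocess_ref(dq_acc, softmax_scale, out_dtype=np.float16):
    # Apply the softmax scale once at the end, then cast to output dtype.
    return (dq_acc * np.float32(softmax_scale)).astype(out_dtype)
```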

Tests

Validation completed:

  • Reference parity review against cute backward staging and dataflow:
  1. preprocess semantic parity (dPsum/LSELog2/dQ init).
  2. core backward math flow parity (p, ds, dq/dk/dv accumulation).
  3. postprocess scaling/cast parity for dQ finalization.
  • Static consistency checks:
  1. fixed-length stride and shape mapping.
  2. varlen padded offset usage for intermediate buffers.
  3. mask/no-mask block range splitting.

Pending runtime validation:

  • Multi-arch performance sanity checks (A100/H100/B200-class paths).

Documentation

  • Inline code comments added in kernels for major computation stages.
  • Recommended follow-up: add a short backward architecture section to developer docs describing:
  1. preprocess/main/postprocess pipeline.
  2. varlen max sequence length requirements.
  3. architecture launch config rationale.

Copilot AI review requested due to automatic review settings March 8, 2026 15:11

Copilot AI left a comment


Pull request overview

This PR aims to address correctness issues in Triton attention tiling/masking by introducing new helper utilities for block boundary computations and pointer construction, and by adjusting masking index computation for the SWAP_AB (swapped Q/K) case.

Changes:

  • Added a generic Triton JIT pointer-construction helper (make_ptrs) in seqlen_info.py.
  • Added new Triton JIT helpers in block_info.py for m-block boundary computations under causal/local constraints.
  • Fixed index assignments in mask.apply_mask for the SWAP_AB branch; adjusted comments in flash_fwd.py.
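As a rough illustration of the m-block boundary math such helpers cover (the signature and the bottom-right alignment convention are assumptions for the sketch, not the repo's exact code):

```python
def m_block_min_causal(n_block, block_m, block_n, seqlen_q, seqlen_k):
    # Bottom-right-aligned causal mask: a query row q_idx may attend to a
    # key k_idx when q_idx + (seqlen_k - seqlen_q) >= k_idx.
    k_start = n_block * block_n             # first key index in this N block
    q_min = max(k_start - (seqlen_k - seqlen_q), 0)
    return q_min // block_m                 # first M block with any unmasked row
```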

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File descriptions:

  • flash_sparse_attn/ops/triton/seqlen_info.py: adds the make_ptrs helper for pointer creation (currently no call sites in the repo).
  • flash_sparse_attn/ops/triton/mask.py: adjusts q_idx/k_idx construction for SWAP_AB masking.
  • flash_sparse_attn/ops/triton/flash_fwd.py: comment-only refactor around pointer-creation sections.
  • flash_sparse_attn/ops/triton/block_info.py: adds new m-block boundary helpers for causal/local scheduling (currently no call sites in the repo).
Comments suppressed due to low confidence (1)

flash_sparse_attn/ops/triton/flash_fwd.py:265

  • The PR description says new tests were added and existing tests validated, but this diff doesn’t include any test changes/additions. Please either add the corresponding tests in this PR or update the description/checklist to reflect what was actually changed.
    # Create pointers
    if not PACK_GQA:
        lse_ptrs = tl.make_block_ptr(
            base=lse_base,
            shape=(actual_seqlen_q,),


@LoserCheems changed the title from "[BUG FIX] Improve block calculations and tensor operations" to "[FEATURE SUPPORT] Add Triton backward support" Mar 9, 2026
@LoserCheems requested a review from Copilot March 9, 2026 08:01
Copy link
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.



@LoserCheems merged commit 3491059 into main Mar 9, 2026
