[FEATURE] Enhance forward combine kernel and split attention #227
LoserCheems merged 5 commits into main
Conversation
Improves split-attention merging with stable normalization. Supports variable-length sequences and autotuning.
…d optimize output handling
Pull request overview
This PR introduces a forward combine kernel for efficiently merging split attention outputs and enhances the FlashDecoding mechanism by adding support for KV-split parallelization. The changes enable better GPU utilization through parallel processing of attention across the KV sequence dimension, particularly beneficial for long-context scenarios.
Changes:
- Introduces a new combine kernel for numerically stable merging of split attention outputs using log-sum-exp normalization
- Adds autotuning configurations and heuristics for determining optimal KV split counts
- Refactors forward attention kernels to support split KV mechanism with intermediate float32 accumulation for numerical precision
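The log-sum-exp merge described above can be sketched in plain NumPy. This is a hedged illustration of the technique, not the PR's Triton kernel: each split produces a partial output and a per-row log-sum-exp (LSE), and the combine step reweights each partial output by `exp(lse_i - lse_global)` so the result matches attention over the full KV sequence. The function name and shapes are assumptions for the sketch.

```python
import numpy as np

def combine_splits(o_partial, lse_partial):
    """Merge per-split attention outputs with log-sum-exp normalization.

    o_partial:   (num_splits, seqlen, head_dim) float32 partial outputs
    lse_partial: (num_splits, seqlen) per-split log-sum-exp values
    """
    # Global LSE across splits, computed stably around the running max.
    lse_max = lse_partial.max(axis=0)                    # (seqlen,)
    sumexp = np.exp(lse_partial - lse_max).sum(axis=0)   # (seqlen,)
    lse = lse_max + np.log(sumexp)                       # (seqlen,)
    # Reweight each split by its share of the global normalizer and sum.
    scale = np.exp(lse_partial - lse)                    # (num_splits, seqlen)
    return (o_partial * scale[..., None]).sum(axis=0)    # (seqlen, head_dim)
```

Because each split's softmax denominator `Z_i` satisfies `exp(lse_i - lse) = Z_i / Z`, the weighted sum reproduces exactly the softmax over the concatenated KV range, regardless of how the sequence was partitioned.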
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| flash_sparse_attn/ops/triton/utils.py | Adds num_splits_heuristic function, FWD_COMBINE_AUTOTUNE_KEYS, get_fwd_combine_autotune_configs, updates get_fwd_base_grid to support num_splits parameter, adds get_fwd_combine_grid, and extends input validation for num_splits |
| flash_sparse_attn/ops/triton/flash_fwd_combine.py | New file implementing the combine kernel for merging split attention outputs with stable softmax normalization across splits |
| flash_sparse_attn/ops/triton/flash_fwd.py | Modifies _fwd_base_kernel and forward functions to support split KV mechanism, including stride calculations, tensor allocation for partial outputs, and integration with the combine kernel |
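The table above mentions a `num_splits_heuristic` function for choosing the KV split count. The PR does not show its body here; the following is a hypothetical sketch of how such a heuristic commonly works (modeled on the split-KV heuristic popularized by FlashAttention's decoding path): split only when the grid is too small to fill the GPU, and prefer the smallest split count whose wave efficiency is close to the best achievable, so combine overhead is not paid for marginal gains. The signature and thresholds are assumptions.

```python
import math

def num_splits_heuristic(total_blocks, num_sms, max_splits=128):
    """Hypothetical sketch: pick a KV split count that fills the SMs.

    total_blocks: grid size without splitting (batch * heads * M-blocks)
    num_sms:      number of streaming multiprocessors on the device
    """
    if total_blocks >= 0.8 * num_sms:
        return 1  # already enough parallelism; splitting only adds overhead
    max_splits = min(max_splits, num_sms, total_blocks)

    def efficiency(n):
        # Fraction of the last wave that does useful work at n splits.
        waves = total_blocks * n / num_sms
        return waves / math.ceil(waves)

    best = max(efficiency(n) for n in range(1, max_splits + 1))
    # Smallest split count within 85% of the best achievable efficiency.
    for n in range(1, max_splits + 1):
        if efficiency(n) >= 0.85 * best:
            return n
```

For a decode-time grid of 10 blocks on a 108-SM GPU this picks a large split count, while an already-saturated grid returns 1 and skips the combine pass entirely.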
```python
if arch == "cuda:sm80":
    return [
        triton.Config(
            {"TILE_M": 32, "TILE_N": 128},
            num_warps=4,
            num_stages=1,
        )
    ]
elif arch == "cuda:sm90":
    return [
        triton.Config(
            {"TILE_M": 32, "TILE_N": 128},
            num_warps=4,
            num_stages=1,
        )
    ]
elif arch == "cuda:sm100":
    return [
        triton.Config(
            {"TILE_M": 32, "TILE_N": 128},
            num_warps=4,
            num_stages=1,
        )
    ]
elif arch == "cuda:sm120":
    return [
        triton.Config(
            {"TILE_M": 32, "TILE_N": 128},
            num_warps=4,
            num_stages=1,
        )
    ]
```
The non-autotune configurations specify "TILE_N" in the config dictionary (lines 191, 199, 207, 215), but the autotune configurations use "TILE_K" (line 237). Since the combine kernel only uses TILE_M and TILE_K parameters (as seen in the kernel signature), the TILE_N in the non-autotune configs should be renamed to TILE_K for consistency. This mismatch could cause the kernel to fail when autotune is disabled.
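A minimal sketch of the suggested fix, with plain dicts standing in for `triton.Config` so it stays self-contained (the function name and dict layout are assumptions for illustration): rename `"TILE_N"` to `"TILE_K"` so the non-autotune path hands the kernel the parameter it actually reads, and collapse the four identical per-arch branches into one lookup.

```python
def get_fwd_combine_configs(arch):
    """Hypothetical fixed version: every supported arch currently shares one
    configuration, keyed by TILE_K (the name the combine kernel reads)."""
    supported = {"cuda:sm80", "cuda:sm90", "cuda:sm100", "cuda:sm120"}
    if arch not in supported:
        raise ValueError(f"unsupported arch: {arch}")
    # In the real code this would be triton.Config({...}, num_warps=4,
    # num_stages=1); a dict keeps the sketch runnable without Triton.
    return [{"kwargs": {"TILE_M": 32, "TILE_K": 128},
             "num_warps": 4, "num_stages": 1}]
```

Collapsing the branches also means a future tile-size change only has to be made once, instead of four times in lockstep.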
…rward combine kernel
Summary
Root Cause
Changes
Reproduction
Tests
Compatibility
Checklist