feat(dflash): add flex_attention with BlockMask support #435

sleepcoo merged 1 commit into sgl-project:main
Conversation
- Add attention_backend parameter to OnlineDFlashModel
- Implement BlockMask creation for flex_attention optimization
- Default to flex_attention backend (~30% speedup)
- Add iter_time display in training progress bar
- Clean up comments and simplify code
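A minimal sketch of how such a backend switch might be wired. The names `attention_backend` and `OnlineDFlashModel` come from the PR description above; the mask function, helper names, and overall structure are illustrative assumptions, not the PR's actual diff.

```python
# Illustrative sketch (assumed, not the PR's exact code) of a flex_attention
# backend with a BlockMask, alongside an SDPA fallback.
import torch
from torch.nn.attention.flex_attention import create_block_mask, flex_attention


def causal_mask(b, h, q_idx, kv_idx):
    # Standard causal mask_mod: a query position may attend only to
    # positions at or before itself.
    return q_idx >= kv_idx


def build_block_mask(seq_len: int, device: str = "cuda"):
    # B=None and H=None broadcast the block structure over batch and heads,
    # so only one (Q_LEN x KV_LEN) block layout is materialized.
    return create_block_mask(causal_mask, B=None, H=None,
                             Q_LEN=seq_len, KV_LEN=seq_len, device=device)


def attend(q, k, v, attention_backend: str = "flex_attention", block_mask=None):
    # q, k, v: (batch, num_heads, seq_len, head_dim)
    if attention_backend == "flex_attention":
        return flex_attention(q, k, v, block_mask=block_mask)
    # Fallback backend: PyTorch SDPA with an implicit causal mask.
    return torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)
```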
Thank you for your PR. While training with a long context (40K), I found that the flex_attention backend causes OOM, whereas SDPA works fine. Upon investigation, the extra H = num_heads setting in create_block_mask seems to be the cause.
I think you’re right — this was an oversight on my part. |
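A short sketch of the fix discussed above, under the assumption that the OOM comes from per-head block-mask materialization; this is not the exact diff. With `H=num_heads`, `create_block_mask` builds a block structure per head, which at ~40K context can exhaust memory; passing `H=None` broadcasts a single mask over all heads.

```python
# Assumed shape of the fix, not the merged change itself.
from torch.nn.attention.flex_attention import create_block_mask

def causal_mask(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

seq_len = 40_000  # long-context case reported above

# Before (per-head block structure, OOM at long context):
# block_mask = create_block_mask(causal_mask, B=1, H=num_heads,
#                                Q_LEN=seq_len, KV_LEN=seq_len, device="cuda")

# After (H=None broadcasts one block structure across all heads):
block_mask = create_block_mask(causal_mask, B=None, H=None,
                               Q_LEN=seq_len, KV_LEN=seq_len, device="cuda")
```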
Motivation
Modifications
Related Issues
Accuracy Test
Benchmark & Profiling
eager: (profiling screenshot)

flex: (profiling screenshot)
Checklist