
feat(dflash): add flex_attention with BlockMask support #435

Merged
sleepcoo merged 1 commit into sgl-project:main from eigen-ai-labs:feature/dflash on Jan 19, 2026
Conversation

@xiaomin-D
Contributor

  • Add attention_backend parameter to OnlineDFlashModel
  • Implement BlockMask creation for flex_attention optimization (see the sketch after this list)
  • Default to flex_attention backend (~30% speedup)
  • Add iter_time display in training progress bar
  • Clean up comments and simplify code
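For context, a minimal sketch of how a flex_attention backend with a precomputed BlockMask is typically wired up. The causal mask_mod, tensor shapes, and names below are illustrative assumptions, not this PR's exact code; in practice flex_attention is usually wrapped in torch.compile to get the kernel fusion behind the reported speedup.

```python
# Illustrative sketch only (assumes a CUDA device and a plain causal mask;
# the dflash-specific mask_mod in this PR may differ).
import torch
from torch.nn.attention.flex_attention import create_block_mask, flex_attention

def causal_mask(b, h, q_idx, kv_idx):
    # A query position may only attend to itself and earlier key positions.
    return q_idx >= kv_idx

B, H, S, D = 1, 8, 4096, 64
q, k, v = (torch.randn(B, H, S, D, device="cuda") for _ in range(3))

# Precompute the sparse BlockMask once per sequence length. flex_attention
# skips tiles the mask rules out entirely, which is where the speedup over
# a dense eager/SDPA pass comes from.
block_mask = create_block_mask(causal_mask, B=None, H=None,
                               Q_LEN=S, KV_LEN=S, device="cuda")

out = flex_attention(q, k, v, block_mask=block_mask)
```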

Motivation

Modifications

Related Issues

Accuracy Test

Benchmark & Profiling

eager: (benchmark screenshot)
flex: (benchmark screenshot)

Checklist


@sleepcoo merged commit b85f89c into sgl-project:main on Jan 19, 2026
2 checks passed
@Ximingwang-09
Contributor

Thank you for your PR. While training with a long context (40K), I found that the flex_attention backend causes OOM, whereas SDPA works fine. Upon investigation, the extra H = num_heads setting in create_block_mask seems to be the cause.
Is there any special reason why H must be set to num_heads in create_block_mask? In the implementation of eagle3, H is set to 1.
Related Issue: #452
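For reference, a minimal sketch of the two create_block_mask call shapes under discussion (the causal mask_mod and sizes are placeholder assumptions; create_block_mask allocates on a CUDA device here). With H=num_heads the mask is built and stored per head, so its footprint grows with the head count; with H=None (or H=1, as in eagle3) a single mask is broadcast across all heads.

```python
# Placeholder sizes; at 40K context the per-head variant is reportedly
# what runs out of memory, per the discussion above.
from torch.nn.attention.flex_attention import create_block_mask

def causal_mask(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

seq_len, num_heads = 4096, 32  # hypothetical model config

# Per-head mask: mask data is materialized for each of the num_heads heads.
mask_per_head = create_block_mask(causal_mask, B=None, H=num_heads,
                                  Q_LEN=seq_len, KV_LEN=seq_len, device="cuda")

# Head-broadcast mask (H=None here; eagle3 uses H=1): one mask shared by
# all heads, cutting the mask's memory by roughly a factor of num_heads.
mask_shared = create_block_mask(causal_mask, B=None, H=None,
                                Q_LEN=seq_len, KV_LEN=seq_len, device="cuda")
```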

@xiaomin-D
Contributor Author

> Thank you for your PR. While training with a long context (40K), I found that the flex_attention backend causes OOM, whereas SDPA works fine. Upon investigation, the extra H = num_heads setting in create_block_mask seems to be the cause. Is there any special reason why H must be set to num_heads in create_block_mask? In the implementation of eagle3, H is set to 1. Related Issue: #452

I think you’re right — this was an oversight on my part.
