feat(dflash): add flex_attention with BlockMask support #435

sleepcoo merged 1 commit into sgl-project:main
Conversation
- Add attention_backend parameter to OnlineDFlashModel
- Implement BlockMask creation for flex_attention optimization
- Default to flex_attention backend (~30% speedup)
- Add iter_time display in training progress bar
- Clean up comments and simplify code
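A minimal sketch of how such a backend switch might be wired. The names `attention_backend` and `OnlineDFlashModel` come from the PR description above; the mask function, helper names, and overall structure are illustrative assumptions, not the PR's actual diff.

```python
# Illustrative sketch (assumed, not the PR's exact code) of a flex_attention
# backend with a BlockMask, alongside an SDPA fallback.
import torch
from torch.nn.attention.flex_attention import create_block_mask, flex_attention


def causal_mask(b, h, q_idx, kv_idx):
    # Standard causal mask_mod: a query position may attend only to
    # positions at or before itself.
    return q_idx >= kv_idx


def build_block_mask(seq_len: int, device: str = "cuda"):
    # B=None and H=None broadcast the block structure over batch and heads,
    # so only one (Q_LEN x KV_LEN) block layout is materialized.
    return create_block_mask(causal_mask, B=None, H=None,
                             Q_LEN=seq_len, KV_LEN=seq_len, device=device)


def attend(q, k, v, attention_backend: str = "flex_attention", block_mask=None):
    # q, k, v: (batch, num_heads, seq_len, head_dim)
    if attention_backend == "flex_attention":
        return flex_attention(q, k, v, block_mask=block_mask)
    # Fallback backend: PyTorch SDPA with an implicit causal mask.
    return torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)
```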
Thank you for your PR. While training with a long context (40K), I found that the flex_attention backend causes OOM, whereas SDPA works fine. Upon investigation, the extra H = num_heads setting in create_block_mask seems to be the cause.
I think you’re right — this was an oversight on my part. |
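A short sketch of the fix discussed above, under the assumption that the OOM comes from per-head block-mask materialization; this is not the exact diff. With `H=num_heads`, `create_block_mask` builds a block structure per head, which at ~40K context can exhaust memory; passing `H=None` broadcasts a single mask over all heads.

```python
# Assumed shape of the fix, not the merged change itself.
from torch.nn.attention.flex_attention import create_block_mask

def causal_mask(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

seq_len = 40_000  # long-context case reported above

# Before (per-head block structure, OOM at long context):
# block_mask = create_block_mask(causal_mask, B=1, H=num_heads,
#                                Q_LEN=seq_len, KV_LEN=seq_len, device="cuda")

# After (H=None broadcasts one block structure across all heads):
block_mask = create_block_mask(causal_mask, B=None, H=None,
                               Q_LEN=seq_len, KV_LEN=seq_len, device="cuda")
```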
Motivation
Modifications
Related Issues
Accuracy Test
Benchmark & Profiling
eager: (profiling screenshot)

flex: (profiling screenshot)
Checklist