
[Feature] USP: Replace SDPA with Flash Attention for memory optimization & Add Online Mode#425

Merged
sleepcoo merged 6 commits into sgl-project:main from uygnef:tmp/sp
Jan 14, 2026
Conversation

@uygnef
Collaborator

@uygnef uygnef commented Jan 13, 2026

waiting for #400 to be merged

Motivation

To accelerate long-context training, this PR integrates Flash Attention into the Unified Sequence Parallelism (USP) framework. By replacing standard PyTorch operations with optimized kernels inside the Ring Attention loop, this implementation significantly improves memory efficiency for Eagle3 draft models.

Additionally, this PR adds support for online mode within the USP framework.

Modifications

  • Implemented LlamaUSPFlashAttention: Added a hybrid sequence parallel attention layer that combines Ulysses, Ring Attention, and Flash Attention (flash_attn_func).
  • Online mode for Sequence Parallelism: Enabled online-mode training within the USP framework.
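The key numerical trick behind combining Ring Attention with Flash Attention is merging per-block partial attention outputs with a running log-sum-exp, so no block ever needs the full softmax denominator. A minimal single-process numpy sketch of that merge is below; `flash_attn_func` itself is a fused GPU kernel, so plain matmuls stand in for it here, and all function names are illustrative rather than the PR's actual code:

```python
import numpy as np

def attn_reference(q, k, v):
    # Standard softmax attention for one head: q, k, v are (seq, dim).
    s = q @ k.T / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    return p @ v

def ring_attn_sketch(q, k_blocks, v_blocks):
    # Process K/V one block at a time (one block per ring step) and
    # merge partial outputs with a running max / running denominator,
    # the same online-softmax trick Flash and Ring Attention rely on.
    d = q.shape[-1]
    out = np.zeros_like(q)
    m = np.full((q.shape[0], 1), -np.inf)   # running row-wise max
    l = np.zeros((q.shape[0], 1))           # running softmax denominator
    for kb, vb in zip(k_blocks, v_blocks):
        s = q @ kb.T / np.sqrt(d)
        m_new = np.maximum(m, s.max(axis=-1, keepdims=True))
        p = np.exp(s - m_new)
        scale = np.exp(m - m_new)           # rescale old accumulators
        l = l * scale + p.sum(axis=-1, keepdims=True)
        out = out * scale + p @ vb
        m = m_new
    return out / l

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 16))
k = rng.standard_normal((8, 16))
v = rng.standard_normal((8, 16))
blockwise = ring_attn_sketch(q, np.split(k, 4), np.split(v, 4))
assert np.allclose(blockwise, attn_reference(q, k, v), atol=1e-6)
```

Because the merge is exact (not an approximation), the blockwise result matches full attention to floating-point tolerance, which is why the accuracy test below only needs to bound the bf16/fp16 kernel difference.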

Usage

torchrun \
    ...
    scripts/train_eagle3.py \
    ...
    --attention-backend usp_fa \
    --sp-ulysses-size $ULYSSES_SIZE \
    --sp-ring-size $RING_SIZE

Related Issues

Accuracy Test

python tests/test_layers/test_decoder.py

Compared to the SDPA implementation, the diff is less than 2e-2 for bf16 and less than 5e-3 for fp16.
(screenshots of accuracy test output)

Benchmark & Profiling

todo

TODO

  1. Optimal Loss Aggregation: Currently uses all_gather within the SP group. Will be optimized to local calculation + reduce_sum to save VRAM.
  2. Enhance Online Mode: Set draft micro-batch size.
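The VRAM saving in TODO 1 follows from summation commuting with concatenation: instead of every rank materializing the full gathered loss vector, each rank can reduce its local shard to a scalar and combine scalars with `reduce_sum`. A minimal sketch, simulating SP ranks in one process with no actual collectives (shard sizes and values are illustrative):

```python
import numpy as np

# Simulate an SP group of 4 ranks, each holding the per-token loss
# for its local sequence shard.
rng = np.random.default_rng(1)
local_losses = [rng.random(16) for _ in range(4)]  # one array per "rank"

# Current approach: all_gather the full loss vectors, then sum.
# Every rank holds a copy of the concatenated (4 * 16,) vector.
gathered = np.concatenate(local_losses)
loss_via_gather = gathered.sum()

# Planned approach: each rank sums locally, then only scalars are
# combined (a reduce_sum over one float per rank).
loss_via_reduce = sum(shard.sum() for shard in local_losses)

# The totals are identical; only the communicated/stored volume differs.
assert np.isclose(loss_via_gather, loss_via_reduce)
```

Per rank, this shrinks the gathered buffer from O(total sequence length) to O(1), which is where the VRAM saving comes from.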

Checklist


@sleepcoo sleepcoo marked this pull request as ready for review January 13, 2026 06:24

@uygnef uygnef force-pushed the tmp/sp branch 2 times, most recently from 11a2e13 to bf7f659 Compare January 13, 2026 12:13
@uygnef uygnef changed the title [feature] Sequence Parallelism: Replace SDPA with LlamaUSPFlashAttention for memory optimization [Feature] USP: Replace SDPA with Flash Attention for memory optimization & Add Online Mode Jan 13, 2026
@sleepcoo sleepcoo merged commit e515403 into sgl-project:main Jan 14, 2026
2 of 5 checks passed
@jiapingW
Collaborator

I tried training with seqlen=65536, ring=8, ulysses=2 and it costs 94G. With ring=4, ulysses=2 it also costs 94G. Is this normal?
