Support paged attention for eagle overlap by timmy-feng · Pull Request #12 · modal-labs/sglang

timmy-feng · 2025-08-24T18:03:39Z

Added support for paged attention by doing the following:

Pre-allocate pages in the scheduler thread before calling run_batch. The fill level of the most recent page is unknown at that point because the previous batch is still running on the GPU, so we allocate the worst-case number of pages, assuming the new tokens start on a fresh page.
Modify the assign_draft_cache_locs kernel used in draft decode to prepend the remaining unused cache locs from the previous page. Freeing the excess is unnecessary here because the allocator state is restored after drafting.
Add a merge_cache_loc kernel to verify that prepends the remaining unused cache locs from the previous page. The excess pages are stored in an evict_cache_loc tensor, which is combined with the other pages that are evicted after tokens are accepted.
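The page arithmetic behind these steps can be modeled on the CPU as follows. This is a minimal sketch, not the PR's kernels: worst_case_num_pages and merge_cache_loc_sketch are hypothetical pure-Python stand-ins operating on plain lists instead of GPU tensors. The sketch assumes page p owns slots [p * page_size, (p + 1) * page_size) and that the next free slot in a partially filled page is last_loc + 1. The draft path (assign_draft_cache_locs) would reuse the same prepend logic but skip the eviction bookkeeping, since the allocator state is restored after drafting.

```python
from typing import List, Tuple


def worst_case_num_pages(num_new_tokens: int, page_size: int) -> int:
    """Pages the scheduler pre-allocates when the last page's fill is unknown.

    Worst case: the last page is already full, so every new token must land
    in a fresh page.
    """
    return -(-num_new_tokens // page_size)  # ceiling division


def merge_cache_loc_sketch(
    seq_len: int,
    last_loc: int,
    new_pages: List[int],
    num_new_tokens: int,
    page_size: int,
) -> Tuple[List[int], List[int]]:
    """Hypothetical CPU model of prepending the previous page's unused slots.

    seq_len: tokens already cached (known only once the GPU has finished).
    last_loc: slot index of the sequence's last cached token.
    new_pages: page ids pre-allocated by the scheduler (worst case).
    Returns (cache_loc, evict_pages): slots for the new tokens, and the
    pre-allocated pages that turned out to be unnecessary.
    """
    # Unused slots left in the partially filled last page (0 if it is full).
    num_leftover = (page_size - seq_len % page_size) % page_size
    leftover = [last_loc + 1 + i for i in range(num_leftover)]

    # All slots of the pre-allocated pages, in page order.
    fresh = [p * page_size + i for p in new_pages for i in range(page_size)]

    # Fill the old page first, then spill into the fresh pages.
    cache_loc = (leftover + fresh)[:num_new_tokens]

    # Whole pre-allocated pages that received no tokens are excess.
    remaining = max(0, num_new_tokens - num_leftover)
    pages_used = -(-remaining // page_size)
    evict_pages = new_pages[pages_used:]
    return cache_loc, evict_pages


if __name__ == "__main__":
    n_pages = worst_case_num_pages(5, page_size=4)  # 2 pages reserved
    loc, evict = merge_cache_loc_sketch(
        seq_len=6, last_loc=9, new_pages=[7, 8], num_new_tokens=5, page_size=4
    )
    print(n_pages, loc, evict)
```

With page_size=4, a 6-token sequence still has 2 free slots in its last page, so a 5-token verify needs only one of the two worst-case pages; the unused page is returned for eviction, where it would be combined with any pages freed after acceptance.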

TODO

Correctness has been achieved for all attention backends other than FA3.

With FA3, the code is correct for draft decode and draft extend, but not for verify.
timmy-feng and others added 30 commits July 22, 2025 17:30

eagle overlap thread scaffold

f275b14

static shapes

01a9871

Merge pull request #1 from modal-labs/eagle_overlap

29d7665

Eagle overlap

filter batch in scheduler

5023e5f

check for finish in schedule

44aef49

Merge pull request #2 from modal-labs/timmy/overlap_eagle_eos

d4a928b

Timmy/overlap eagle eos

scatter instead of index_put to avoid sync

bf0d437

free kv on scheduler for page_size=1

2e12eb3

change accept length copy to be unblocking

760345d

req does not have free_cache_loc_cpu property

6451366

debugging

13a7c89

fixed accept length bug

e17b2a9

removed prints

013a267

refactoring

63ffb45

undo accept length copy change

cd8f799

undo fixed accept length bug

e23e2d0

Merge pull request #4 from modal-labs/timmy/free_kv_on_scheduler

e96297e

free kv on scheduler

remove accept length cpu

b440952

remove field from output struct

3ddfd27

Merge pull request #5 from modal-labs/timmy/remove_accept_length_cpu

08ec169

Timmy/remove accept length cpu

untested seq lens cpu caching

22930bf

Merge pull request #6 from modal-labs/timmy/seq_lens_cpu

139de96

seq lens cpu caching (tested)

save and use spec_steps in EagleDraftInput

066ad2d

Add overlap scheduling

d5666f6

support batch size > 1

4db7efb

fix non-spec stale request bug

610adaf

Merge branch 'nathan/eagle_overlap' into timmy/overlap

cb46350

refactor verify fields

900d02a

pre allocate kv cache

2eb8e06

move cpu seq lens to worker batch

6c130e2

timmy-feng and others added 24 commits August 5, 2025 00:26

refactored FutureSpecInfo

1bb3972

Merge pull request #7 from modal-labs/timmy/overlap

84cd946

overlap scheduling

Fix typo in import path

cf2348f

Merge https://github.com/sgl-project/sglang

7cb5193

remove mrope position sync

6264614

torch compile mrope

edafab3

fork and merge seq-lens

a8229ae

avoid unnecessary backup

0244292

dealias accept length

5af7f68

Merge pull request #8 from modal-labs/timmy/fix-seq-lens-race

1dd6ffa

Fix seq_lens race

Merge pull request #9 from modal-labs/timmy/mrope-positions-on-gpu

c40099a

remove mrope position sync

Merge pull request #10 from modal-labs/timmy/fused-mrope

e8f941a

fused mrope

undo faulty update

9eb686a

allocate draft kv cache speculatively

2d6c9c4

change evict cache loc to gpu tensor

653e9d6

allocate verify kv separately

265c6e9

free verify kv cache

e42086b

fix last_loc

3d00118

add page size to benchmarks

e66b524

more fixes

b097f15

scheduler cache free fix

88cbdb7

fix merge_cache_loc bug

28089eb

fix assign_draft

0fe1bee

avoid random write addresses

804455e

timmy-feng mentioned this pull request Aug 25, 2025

Support overlap scheduling for speculative decoding sgl-project/sglang#9588 (Open, 9 tasks)

dcw02 force-pushed the main branch from 9f21071 to 20282f5 on March 2, 2026 01:41

timmy-feng wants to merge 54 commits into main from timmy/paged-attn