Draft
Eagle overlap
Timmy/overlap eagle eos
free kv on scheduler
Timmy/remove accept length cpu
seq lens cpu caching (tested)
overlap scheduling
Fix seq_lens race
remove mrope position sync
Added support for paged attention by doing the following:
- `run_batch`: Since we do not know the fill status of the most recent page (it is still running on the GPU), we allocate for the worst-case number of pages, starting from a new page.
- `assign_draft_cache_locs` kernel in the draft decode, to prepend the remaining unused cache locs from the previous page. We don't have to worry about freeing excess here because the allocator state is restored after draft.
- `merge_cache_loc` kernel in the verify, to prepend the remaining unused cache locs from the previous page. We store the excess pages in an `evict_cache_loc` tensor, which is combined with the other pages that are evicted after accepting tokens.

TODO
Correctness has been achieved for all attention backends other than FA3.
With FA3, the code is correct when FA3 is used for the draft decode and extend, but not when it is used for verify.
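
To illustrate the allocation scheme described in the list above, here is a minimal CPU-side sketch in plain Python. It is not the PR's implementation: the actual `assign_draft_cache_locs` and `merge_cache_loc` are GPU kernels operating on tensors, and the helper names, arguments, and list-based layout below are hypothetical. The sketch assumes that slot `i` of page `p` has cache loc `p * page_size + i`, and that only freshly allocated pages left completely unused are evicted.

```python
def worst_case_new_pages(num_new_tokens: int, page_size: int) -> int:
    """Pages to reserve when the fill level of the last page is unknown.

    Assumes none of the last page's free slots can be counted on,
    so the reservation starts from a fresh page (ceiling division).
    """
    return -(-num_new_tokens // page_size)


def merge_cache_locs_sketch(seq_len, page_size, last_page_id, new_pages, num_new_tokens):
    """Hypothetical stand-in for the merge step, run once seq_len is known.

    Prepends the unused slots of the last page to the slots of the freshly
    allocated pages, takes the first num_new_tokens locs, and reports which
    fresh pages ended up completely unused so they can be evicted.
    """
    used_in_last = seq_len % page_size
    # A remainder of 0 means the last page is full (or the sequence is empty).
    leftover = (
        []
        if used_in_last == 0
        else [last_page_id * page_size + off for off in range(used_in_last, page_size)]
    )
    fresh = [p * page_size + off for p in new_pages for off in range(page_size)]
    locs = (leftover + fresh)[:num_new_tokens]
    touched = {loc // page_size for loc in locs}
    evict_pages = [p for p in new_pages if p not in touched]
    return locs, evict_pages


# Example: page_size 4, sequence of 6 tokens, so page 1 has slots 6 and 7 free.
n_pages = worst_case_new_pages(num_new_tokens=5, page_size=4)
locs, evict_pages = merge_cache_locs_sketch(
    seq_len=6, page_size=4, last_page_id=1, new_pages=[5, 9], num_new_tokens=5
)
```

In this example two pages are reserved up front, but two of the five tokens fit in the leftover slots of page 1, so the second reserved page is never touched and is returned for eviction.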