[Perf] Early return in KVCacheManager.allocate_slots #29206
base: main
```diff
@@ -267,16 +267,6 @@ def allocate_slots(
         else:
             new_computed_block_list = self.empty_kv_cache_blocks.blocks

-        # Free the blocks that are skipped during the attention computation
-        # (e.g., tokens outside the sliding window).
-        # We can do this even if we cannot schedule this request due to
-        # insufficient free blocks.
-        # Should call this function before allocating new blocks to reduce
-        # the number of evicted blocks.
-        self.coordinator.remove_skipped_blocks(
-            request.request_id, request.num_computed_tokens
-        )
-
         # The number of computed tokens is the number of computed tokens plus
         # the new prefix caching hits
         num_computed_tokens = request.num_computed_tokens + num_new_computed_tokens
```
```diff
@@ -292,6 +282,23 @@ def allocate_slots(
             num_encoder_tokens=num_encoder_tokens,
         )

+        if (
+            num_blocks_to_allocate == 0
+            and new_computed_block_list is self.empty_kv_cache_blocks.blocks
+        ):
+            # Early return as no new blocks needed to be allocated
+            return self.empty_kv_cache_blocks
+
+        # Free the blocks that are skipped during the attention computation
+        # (e.g., tokens outside the sliding window).
+        # We can do this even if we cannot schedule this request due to
+        # insufficient free blocks.
+        # Should call this function before allocating new blocks to reduce
+        # the number of evicted blocks.
+        self.coordinator.remove_skipped_blocks(
+            request.request_id, request.num_computed_tokens
+        )
+
         if num_blocks_to_allocate > self.block_pool.get_num_free_blocks():
             # Cannot allocate new blocks
             return None
```
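The effect of the early return can be illustrated with a standalone sketch (hypothetical simplified classes — not vLLM's actual `KVCacheManager` API): when a request's tokens already fit in the blocks it owns and there are no new prefix-cache hits, `allocate_slots` can return a shared empty sentinel immediately, skipping the coordinator and block-pool bookkeeping entirely.

```python
from dataclasses import dataclass

BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative value)

@dataclass
class Request:
    request_id: str
    num_computed_tokens: int   # tokens already materialized in blocks
    num_allocated_blocks: int  # blocks currently owned by this request

@dataclass
class KVCacheManagerSketch:
    num_free_blocks: int
    empty_kv_cache_blocks: tuple = ()  # shared "no new blocks" sentinel

    def allocate_slots(self, request: Request, num_new_tokens: int):
        total_tokens = request.num_computed_tokens + num_new_tokens
        blocks_needed = -(-total_tokens // BLOCK_SIZE)  # ceil division
        num_blocks_to_allocate = max(
            0, blocks_needed - request.num_allocated_blocks
        )

        # Early return: nothing new to allocate, so skip the
        # remove_skipped_blocks / block-pool work on this fast path.
        if num_blocks_to_allocate == 0:
            return self.empty_kv_cache_blocks

        if num_blocks_to_allocate > self.num_free_blocks:
            return None  # cannot schedule this request right now

        # Slow path: take blocks from the pool and record ownership.
        self.num_free_blocks -= num_blocks_to_allocate
        request.num_allocated_blocks += num_blocks_to_allocate
        return tuple(range(num_blocks_to_allocate))  # placeholder block ids
```

In the common decode case — one new token and the request's current block not yet full — this hits the early return without touching the pool, which is the fast path the PR adds by moving the check ahead of `remove_skipped_blocks`.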
prefer this slightly.

And do we need to add `num_computed_tokens > request.num_prompt_tokens` to make `remove_skipped_blocks` be called on the first decode step? This can help free the prefill tokens that were used by the last prefill step but are outside the sliding window of the first decode step. Would be grateful if you can try gpt-oss and gemma3, two models with small sliding window sizes.
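The guard the reviewer suggests could be sketched roughly as follows (a hypothetical helper for illustration — the real change would live inline in `allocate_slots` and use the actual `Request` fields): even when no new blocks are needed, block freeing should still run on the first decode step, because the last prefill step may have left blocks outside the sliding window.

```python
def should_free_skipped_blocks(num_computed_tokens: int,
                               num_prompt_tokens: int,
                               num_blocks_to_allocate: int) -> bool:
    """Hypothetical sketch of the reviewer's suggested condition.

    Returns True when remove_skipped_blocks should run before any early
    return: either new blocks are actually needed, or we are past the
    prompt (first decode step), where prefill blocks may have just fallen
    outside the sliding window and should be freed promptly.
    """
    first_decode_step = num_computed_tokens > num_prompt_tokens
    return num_blocks_to_allocate > 0 or first_decode_step
```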
And can you add a comment noting that this may delay `remove_skipped_blocks` and `cache_blocks`, and give some analysis of why that is fine?