
Conversation

@Jialin (Collaborator) commented Nov 21, 2025

Purpose

As the title says, add an early return in KVCacheManager.allocate_slots to speed up scheduling, since most requests only need to allocate a new block once every 16 steps.
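
For intuition, here is a toy calculation (hypothetical helper, not vLLM code; BLOCK_SIZE mirrors the default 16-token block size) of why most decode steps allocate nothing:

```python
# Toy illustration (not vLLM code): with 16-token blocks, a request
# generating one token per step only crosses a block boundary once
# every 16 steps; the other 15 steps need no new KV-cache block.
BLOCK_SIZE = 16

def needs_new_block(num_computed_tokens: int, num_new_tokens: int = 1) -> bool:
    held = -(-num_computed_tokens // BLOCK_SIZE)  # ceil division
    needed = -(-(num_computed_tokens + num_new_tokens) // BLOCK_SIZE)
    return needed > held

# 15 out of every 16 decode steps can skip allocation entirely:
print(sum(not needs_new_block(t) for t in range(16)))  # -> 15
```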

Test Plan && Test Results

With this change, the allocate_slots cost is roughly cut in half, from 17.24% to 8.24% of total runtime (with async scheduling enabled).
(screenshot)

Total runtime distribution for allocate_slots per engine step (screenshot)

Public benchmarks

  • Before: 171.53s
  • After: 165.90s (~3.3% total runtime reduction)
```shell
vllm bench latency --model facebook/opt-125m \
  --dtype bfloat16 \
  --batch-size 3000 \
  --input-len 128 \
  --output-len 1800 \
  -tp 1 \
  --num-iters-warmup 1 \
  --num-iters 3
```

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mergify mergify bot added the v1 label Nov 21, 2025
@Jialin Jialin marked this pull request as ready for review November 21, 2025 22:23

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines 285 to 290
```python
if (
    num_blocks_to_allocate == 0
    and new_computed_block_list is self.empty_kv_cache_blocks.blocks
):
    # Early return: no new blocks need to be allocated
    return self.empty_kv_cache_blocks
```


P1: Skipped sliding-window cleanup when no new blocks are allocated

allocate_slots now returns early when num_blocks_to_allocate == 0 and there are no prefix-cache hits, so coordinator.remove_skipped_blocks() is never called on those steps. For sliding-window/ChunkedLocal attention, this call is what frees blocks that have fallen outside the window; skipping it leaves those blocks held until the request hits a block boundary or finishes, inflating block_pool usage and causing unnecessary evictions or scheduling failures for long prompts that keep generating within an existing block.
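
To make the concern concrete, here is a toy model (assumed window and block sizes, not the real coordinator logic) of how many blocks remove_skipped_blocks would reclaim while the early return keeps firing:

```python
# Toy model of sliding-window cleanup (illustrative only): leading
# blocks whose tokens have all fallen out of the attention window can
# be returned to the pool even while the request keeps generating.
BLOCK_SIZE = 16
WINDOW = 128  # hypothetical sliding-window size in tokens

def reclaimable_blocks(num_computed_tokens: int) -> int:
    first_in_window_token = max(0, num_computed_tokens - WINDOW)
    return first_in_window_token // BLOCK_SIZE

# With the early return, these blocks stay pinned until the request
# next crosses a block boundary (up to 15 extra steps at block size 16):
print(reclaimable_blocks(300))  # -> 10 blocks held longer than needed
```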



Collaborator

Suggested change:

```diff
-    and new_computed_block_list is self.empty_kv_cache_blocks.blocks
+    and new_computed_blocks is None
```

prefer this slightly.

And do we need to add num_computed_tokens > request.num_prompt_tokens so that remove_skipped_blocks is still called in the first decode step? This can help free the prefill tokens that were used by the last prefill step but fall outside the sliding window of the first decode step. Would be grateful if you could try gpt-oss and gemma3, two models with small sliding-window sizes.
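
Concretely, the guard I have in mind would be roughly (a sketch only, using the names from this thread, not the merged code):

```python
# Sketch of the suggested guard: skip only when nothing new is
# allocated, there are no prefix-cache hits to record, and we are
# past the first decode step.
if (
    num_blocks_to_allocate == 0
    and new_computed_blocks is None
    and num_computed_tokens > request.num_prompt_tokens
):
    return self.empty_kv_cache_blocks
```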

Collaborator

And can you add a comment noting that this may delay remove_skipped_blocks and cache_blocks, with some analysis of why that is fine?

Collaborator Author

Hmmm, great call. I didn't notice the cache_blocks logic. Let me think about the early return logic more :/

Collaborator Author

  • Updated per your suggestion
  • Added request.num_output_tokens > 0 to ensure we DO NOT early return for the prefill and first decode steps

Do you feel that's sufficient?
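
For clarity, the updated guard now reads roughly like this (a sketch reconstructed from the comments above, not the exact diff):

```python
# Sketch of the updated early return (reconstructed from this thread):
# request.num_output_tokens > 0 keeps the prefill and first decode
# steps on the normal path, so remove_skipped_blocks and cache_blocks
# still run for them.
if (
    num_blocks_to_allocate == 0
    and new_computed_blocks is None
    and request.num_output_tokens > 0
):
    return self.empty_kv_cache_blocks
```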
