
Conversation

@Jialin (Collaborator) commented Nov 21, 2025

Purpose

As the title says, add an early return in KVCacheManager.allocate_slots to speed up scheduling, since most requests only need to allocate a new block once every 16 steps.
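
Conceptually, the fast path checks whether the request needs any new blocks and whether there are any new prefix-cache hits; if not, it returns a shared empty KVCacheBlocks object instead of running the full allocation bookkeeping. A minimal, self-contained sketch of the pattern (hypothetical SimpleKVCacheManager; the real KVCacheManager carries more state and arguments, see the quoted diff in the review comment below):

from dataclasses import dataclass

@dataclass
class KVCacheBlocks:
    blocks: tuple = ()

class SimpleKVCacheManager:
    def __init__(self):
        # Shared singleton returned on the fast path, so the common decode
        # step does not even allocate a fresh result object.
        self.empty_kv_cache_blocks = KVCacheBlocks()

    def allocate_slots(self, num_blocks_to_allocate, new_computed_block_list):
        if (
            num_blocks_to_allocate == 0
            and new_computed_block_list is self.empty_kv_cache_blocks.blocks
        ):
            # Fast path: no new blocks needed and no new prefix-cache hits,
            # so skip the rest of the allocation bookkeeping.
            return self.empty_kv_cache_blocks
        # Slow path: the real allocation logic lives here (elided).
        return KVCacheBlocks(blocks=tuple(range(num_blocks_to_allocate)))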

Test Plan & Test Results

With the change, the allocate_slots cost is roughly cut in half, from 17.24% to 8.24% of total runtime (with async scheduling enabled).

[Screenshot: total runtime distribution for allocate_slots per engine step]

Public benchmarks

  • Before: 171.53s
  • After: 165.90s (~3.3% total runtime reduction)
vllm bench latency --model facebook/opt-125m \
  --dtype bfloat16 \
  --batch-size 3000 \
  --input-len 128 \
  --output-len 1800 \
  -tp 1 \
  --num-iters-warmup 1 \
  --num-iters 3

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mergify mergify bot added the v1 label Nov 21, 2025
@Jialin Jialin marked this pull request as ready for review November 21, 2025 22:23
@chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines +285 to +290
if (
    num_blocks_to_allocate == 0
    and new_computed_block_list is self.empty_kv_cache_blocks.blocks
):
    # Early return as no new blocks needed to be allocated
    return self.empty_kv_cache_blocks


P1: Skipped sliding-window cleanup when no new blocks allocated

allocate_slots now returns early when num_blocks_to_allocate == 0 and there are no prefix-cache hits, so coordinator.remove_skipped_blocks() is never called on those steps. For sliding-window/ChunkedLocal attention, this call is what frees blocks that have fallen outside the window; skipping it leaves those blocks held until the request hits a block boundary or finishes, inflating block_pool usage and causing unnecessary evictions or scheduling failures for long prompts that keep generating within an existing block.
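
A possible adjustment, sketched under the assumption that remove_skipped_blocks takes the request id and the number of computed tokens (verify against the actual coordinator API), is to run the sliding-window cleanup before taking the fast path:

# Sketch only: the remove_skipped_blocks arguments are assumed, not confirmed.
if (
    num_blocks_to_allocate == 0
    and new_computed_block_list is self.empty_kv_cache_blocks.blocks
):
    # Still free blocks that slid out of the attention window before
    # returning early, so sliding-window/ChunkedLocal requests do not
    # pin stale blocks between block boundaries.
    self.coordinator.remove_skipped_blocks(
        request.request_id, request.num_computed_tokens
    )
    return self.empty_kv_cache_blocks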


