27 changes: 17 additions & 10 deletions vllm/v1/core/kv_cache_manager.py
@@ -267,16 +267,6 @@ def allocate_slots(
         else:
             new_computed_block_list = self.empty_kv_cache_blocks.blocks

-        # Free the blocks that are skipped during the attention computation
-        # (e.g., tokens outside the sliding window).
-        # We can do this even if we cannot schedule this request due to
-        # insufficient free blocks.
-        # Should call this function before allocating new blocks to reduce
-        # the number of evicted blocks.
-        self.coordinator.remove_skipped_blocks(
-            request.request_id, request.num_computed_tokens
-        )
-
         # The number of computed tokens is the number of computed tokens plus
         # the new prefix caching hits
         num_computed_tokens = request.num_computed_tokens + num_new_computed_tokens
@@ -292,6 +282,23 @@ def allocate_slots(
             num_encoder_tokens=num_encoder_tokens,
         )

+        if (
+            num_blocks_to_allocate == 0
+            and new_computed_block_list is self.empty_kv_cache_blocks.blocks
+        ):
+            # Early return as no new blocks needed to be allocated
+            return self.empty_kv_cache_blocks
Comment on lines +285 to +290

P1: Sliding-window cleanup is skipped when no new blocks are allocated

allocate_slots now returns early when num_blocks_to_allocate == 0 and there are no prefix-cache hits, so coordinator.remove_skipped_blocks() is never called on those steps. For sliding-window or ChunkedLocal attention, this call is what frees blocks that have fallen outside the window. Skipping it leaves those blocks held until the request crosses a block boundary or finishes, which inflates block_pool usage and can cause unnecessary evictions or scheduling failures for long prompts that keep generating within an existing block.
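
The concern can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyManager`, `simulate`, the block size of 4, the window of 2 tokens, and the one-token-per-step decode loop are all hypothetical and do not reproduce vLLM's actual coordinator or block pool. It compares how many blocks one request holds per step with and without the early return.

```python
from dataclasses import dataclass, field


@dataclass
class ToyManager:
    """Hypothetical single-request model; not vLLM's KVCacheManager."""

    block_size: int = 4
    window: int = 2          # sliding-window length in tokens (assumed)
    num_allocated: int = 0   # blocks ever allocated to the request
    held: set = field(default_factory=set)  # block indices still held

    def remove_skipped_blocks(self, num_computed_tokens: int) -> None:
        # Drop every block that lies entirely before the attention window.
        first_needed_token = max(0, num_computed_tokens - self.window + 1)
        first_needed_block = first_needed_token // self.block_size
        self.held = {b for b in self.held if b >= first_needed_block}

    def allocate_slots(self, num_computed_tokens: int, early_return: bool) -> None:
        total_tokens = num_computed_tokens + 1  # one new token per step
        blocks_needed = -(-total_tokens // self.block_size)  # ceiling division
        num_to_allocate = max(0, blocks_needed - self.num_allocated)
        if early_return and num_to_allocate == 0:
            return  # the cleanup below is skipped on this step
        self.remove_skipped_blocks(num_computed_tokens)
        new_blocks = range(self.num_allocated, self.num_allocated + num_to_allocate)
        self.held.update(new_blocks)
        self.num_allocated += num_to_allocate


def simulate(early_return: bool, steps: int = 10) -> list:
    """Return the number of held blocks after each decode step."""
    manager = ToyManager()
    counts = []
    for num_computed_tokens in range(steps):
        manager.allocate_slots(num_computed_tokens, early_return)
        counts.append(len(manager.held))
    return counts


print(simulate(early_return=False))  # [1, 1, 1, 1, 2, 1, 1, 1, 2, 1]
print(simulate(early_return=True))   # [1, 1, 1, 1, 2, 2, 2, 2, 2, 2]
```

In this toy setup the early-return variant keeps the out-of-window block until the next block boundary, which is the behavior the comment describes. Whether the extra held block matters in practice depends on the real block size, window length, and pool pressure.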


+        # Free the blocks that are skipped during the attention computation
+        # (e.g., tokens outside the sliding window).
+        # We can do this even if we cannot schedule this request due to
+        # insufficient free blocks.
+        # Should call this function before allocating new blocks to reduce
+        # the number of evicted blocks.
+        self.coordinator.remove_skipped_blocks(
+            request.request_id, request.num_computed_tokens
+        )
+
         if num_blocks_to_allocate > self.block_pool.get_num_free_blocks():
             # Cannot allocate new blocks
             return None
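
The code comment's point about calling the cleanup before the free-block check can be sketched as follows. `ToyBlockPool` and `allocate` are hypothetical names for illustration only. The model is a bare free-block counter and ignores prefix-cache eviction, so it shows the ordering effect only, not vLLM's real accounting.

```python
from typing import Optional


class ToyBlockPool:
    """Hypothetical free-block counter; not vLLM's BlockPool."""

    def __init__(self, num_free_blocks: int) -> None:
        self.num_free_blocks = num_free_blocks


def allocate(
    pool: ToyBlockPool,
    num_skipped: int,
    num_to_allocate: int,
    cleanup_first: bool,
) -> Optional[int]:
    """Return the number of blocks granted, or None if the pool is too small."""
    if cleanup_first:
        # Out-of-window blocks go back to the pool before the capacity check.
        pool.num_free_blocks += num_skipped
    if num_to_allocate > pool.num_free_blocks:
        return None  # cannot allocate new blocks
    pool.num_free_blocks -= num_to_allocate
    if not cleanup_first:
        # Cleanup happens only after a successful allocation in this variant.
        pool.num_free_blocks += num_skipped
    return num_to_allocate


# A full pool, one skippable block held by the request, one new block needed.
print(allocate(ToyBlockPool(0), num_skipped=1, num_to_allocate=1, cleanup_first=False))  # None
print(allocate(ToyBlockPool(0), num_skipped=1, num_to_allocate=1, cleanup_first=True))   # 1
```

With cleanup first, the request's own stale block covers its new allocation. With the check first, the same request fails even though it is holding a block it no longer needs.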