
Commit ddf4d53

[bugfix] Fix bugs in _dummy_run and re-initialize kv-cache. (#3262)
### What this PR does / why we need it?

Currently we run an extra `profile_run` with `num_tokens == self.mc2_tokens_capacity`. However, when `max_num_batched_tokens < self.mc2_tokens_capacity`, this triggers an assertion error because `_dummy_run` requires `num_tokens` to be smaller than `max_num_batched_tokens`. This PR skips the extra `profile_run` when `self.max_num_tokens <= self.mc2_tokens_capacity` to avoid this bug.

This PR also fixes a bug where `kernel_block_sizes` never equals `[self.cache_config.block_size]`: since `kernel_block_sizes` is of type `List[List[int]]`, the condition should be `kernel_block_sizes != [[self.cache_config.block_size]]` (a minimal sketch follows below). This also resolves an issue where `cpu_offload_gb` cannot be enabled.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- vLLM version: v0.10.2
- vLLM main: vllm-project/vllm@releases/v0.11.0

Signed-off-by: Angazenn <[email protected]>
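To see why the original comparison could never succeed, here is a minimal, self-contained sketch (the variable names mirror the PR description, but the snippet is illustrative rather than the actual vLLM Ascend code):

```python
from typing import List

block_size = 128  # hypothetical cache_config.block_size
# kernel_block_sizes is a List[List[int]]: one inner list per KV-cache group.
kernel_block_sizes: List[List[int]] = [[block_size]]

# Buggy condition: a nested list never equals a flat list, so this is
# always True and the input batch is re-initialized unconditionally.
print(kernel_block_sizes != [block_size])    # True, always

# Fixed condition: compare against a value of the same nested shape.
print(kernel_block_sizes != [[block_size]])  # False when the sizes match
```

Because the buggy condition was always true, the downstream `cpu_offload_gb == 0` assertion always fired, which is why CPU offloading could not be enabled.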

File tree: 1 file changed, +4 -3 lines changed


vllm_ascend/worker/model_runner_v1.py

Lines changed: 4 additions & 3 deletions
@@ -502,7 +502,7 @@ def __init__(self, vllm_config: VllmConfig, device: torch.device):
                 self.is_pooling_model,
                 self.vllm_config.model_config.logits_processors),
             is_pooling_model=self.is_pooling_model,
-            kernel_block_sizes=None,
+            kernel_block_sizes=[[self.vllm_config.cache_config.block_size]],
         )
         self.num_accepted_tokens = self._make_buffer(self.max_num_reqs,
                                                      dtype=torch.int64)
@@ -2511,7 +2511,8 @@ def profile_run(self) -> None:
         # MC2 will consume additional NPU memory.
         # Therefore, we need to run the MC2 path once here to complete its initialization,
         # allowing vLLM to correctly estimate the maximum memory required.
-        if self._select_moe_comm_method(
+        if self.max_num_tokens > self.mc2_tokens_capacity and \
+            self._select_moe_comm_method(
                 self.mc2_tokens_capacity,
                 with_prefill=True) == MoECommType.MC2:
             self._dummy_run(self.mc2_tokens_capacity, with_prefill=True)
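As a hedged illustration of the new guard (stand-in values and a stripped-down dummy run; the real logic lives in `profile_run` in `model_runner_v1.py`):

```python
def dummy_run(num_tokens: int, max_num_tokens: int) -> None:
    # Mirrors the assertion that previously fired inside _dummy_run.
    assert num_tokens <= max_num_tokens, (
        "num_tokens must not exceed max_num_batched_tokens")

max_num_tokens = 2048       # hypothetical max_num_batched_tokens
mc2_tokens_capacity = 4096  # hypothetical MC2 capacity

# Before the fix, the MC2 warm-up ran unconditionally and tripped the
# assertion whenever mc2_tokens_capacity > max_num_tokens.
# The fix gates the warm-up, so the dummy run only executes when its
# token count fits within the batch limit.
if max_num_tokens > mc2_tokens_capacity:
    dummy_run(mc2_tokens_capacity, max_num_tokens)  # safe by construction
```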
@@ -3140,7 +3141,7 @@ def may_reinitialize_input_batch(self,
         # of mamba block. In this case, BlockTable.block_size will never equal
         # to kernel_block_sizes[0]
         kernel_block_sizes.append([0])
-        if kernel_block_sizes != [self.cache_config.block_size]:
+        if kernel_block_sizes != [[self.cache_config.block_size]]:
             assert self.cache_config.cpu_offload_gb == 0, (
                 "Cannot re-initialize the input batch when CPU weight "
                 "offloading is enabled. See https://github.com/vllm-project/vllm/pull/18298 "  # noqa: E501
