[Long Sequence Feat] support chunk prefill #3734
LookAround0301 wants to merge 79 commits into vllm-project:main from
Conversation
Signed-off-by: LookAround <lixushi@huawei.com>
Signed-off-by: chenjie <chenjie137@huawei.com>
model runner support cp: input ids, position ids and slot mapping
model runner support cp: metadata, logits indices
Signed-off-by: Delphine-Nic <tanwenqin@huawei.com>
…_dev

# Conflicts:
#	vllm_ascend/attention/attention_v1.py
#	vllm_ascend/attention/mla_v1.py
#	vllm_ascend/distributed/parallel_state.py
#	vllm_ascend/envs.py
#	vllm_ascend/ops/fused_moe.py
#	vllm_ascend/platform.py
#	vllm_ascend/worker/model_runner_v1.py
…group initialization
Signed-off-by: zhangsicheng5 <zhangsicheng5@huawei.com>
support cp_kv_cache_interleave_size and pd disaggregate
Signed-off-by: LookAround <lixushi@huawei.com>
…_dev

# Conflicts:
#	vllm_ascend/attention/attention_v1.py
#	vllm_ascend/attention/mla_v1.py
#	vllm_ascend/attention/utils.py
#	vllm_ascend/distributed/llmdatadist_c_mgr_connector.py
#	vllm_ascend/envs.py
#	vllm_ascend/patch/worker/patch_common/patch_distributed.py
#	vllm_ascend/platform.py
#	vllm_ascend/utils.py
#	vllm_ascend/worker/model_runner_v1.py
Signed-off-by: Feng Liu <liufeng248@huawei.com>
Signed-off-by: Apocalypse990923-qshi <qiushixu@usc.edu>
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
Code Review
This PR introduces support for chunked prefill for long sequences, a significant feature involving extensive changes to attention mechanisms and the model runner for distributed context parallelism on Ascend NPUs. While the overall implementation appears robust, I have identified a critical bug that could lead to a runtime crash, along with two high-severity performance bottlenecks stemming from inefficient tensor manipulations and unnecessary CPU-GPU synchronizations. Addressing these issues is crucial for ensuring the correctness and performance of the new feature.
vllm_ascend/worker/block_table.py (outdated)

```python
# Get starting rank for this chunk
if request_start_rank_dict is not None:
    start_rank, tokens_blank = request_start_rank_dict.get(req_id, 0)
```
There is a potential TypeError here. If req_id is not found in request_start_rank_dict, request_start_rank_dict.get(req_id, 0) will return the integer 0. The subsequent attempt to unpack this integer into start_rank, tokens_blank will cause a crash.
While the current call sites might ensure req_id is always present, this code is fragile. To make it more robust, the default value should be a tuple (0, 0) to match the expected unpacking.
```diff
- start_rank, tokens_blank = request_start_rank_dict.get(req_id, 0)
+ start_rank, tokens_blank = request_start_rank_dict.get(req_id, (0, 0))
```
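A minimal standalone sketch of the failure mode (the dict contents and request ids here are hypothetical, for illustration only):

```python
# A present key maps to a (start_rank, tokens_blank) tuple.
request_start_rank_dict = {"req-0": (2, 5)}

crashed = False
try:
    # Missing key with an int default: .get returns 0, and unpacking
    # the integer into two names raises TypeError.
    start_rank, tokens_blank = request_start_rank_dict.get("req-1", 0)
except TypeError:
    crashed = True

# With a tuple default, unpacking succeeds whether or not the key exists.
start_rank, tokens_blank = request_start_rank_dict.get("req-1", (0, 0))
```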
```python
k_nope, v = kv_nope.split([self.qk_nope_head_dim, self.v_head_dim], dim=-1)
k_pe = k_pe.expand((*k_nope.shape[:-1], -1))

seq_len = torch.stack([seq_len1.cpu(), seq_len2.cpu()])
```
In _compute_prefill_context, seq_len is constructed by moving seq_len1 and seq_len2 to the CPU in every iteration of the loop. This CPU-GPU synchronization inside a loop can be a significant performance bottleneck, especially since this is in the critical prefill path. It appears the npu_ring_mla kernel requires seqlen on the CPU.
To optimize this, consider moving seq_len1.cpu() out of the loop, as seq_len1 is not modified within it. This would reduce the number of GPU-to-CPU transfers by half within this hot loop.
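The hoisting suggested above can be sketched as follows (a minimal illustration with a hypothetical helper name; the real loop body in `_compute_prefill_context` does more work per iteration):

```python
import torch

def stack_seq_lens(seq_len1: torch.Tensor, seq_len2_chunks: list) -> list:
    # seq_len1 is loop-invariant, so transfer it to the CPU once
    # instead of once per chunk as in the original loop body.
    seq_len1_cpu = seq_len1.cpu()
    # seq_len2 still changes per chunk, so its transfer stays in the loop.
    return [torch.stack([seq_len1_cpu, s2.cpu()]) for s2 in seq_len2_chunks]
```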
```python
cp_kv_recover_idx_for_chunk = torch.from_numpy(
    np.concatenate(self.cp_kv_recover_idx_for_chunk)).to(device=self.device)
cp_kv_recover_idx_for_chunk.copy_(torch.tensor(
    np.array(self.cp_kv_recover_idx_for_chunk).flatten().tolist()),
    non_blocking=True)
self.cp_kv_recover_idx_for_chunk = cp_kv_recover_idx_for_chunk.to(
    torch.float32).argsort().to(torch.int32)
```
The creation of cp_kv_recover_idx_for_chunk in generate_kv_idx involves multiple inefficient conversions between Python lists, NumPy arrays, and PyTorch tensors (e.g., np.concatenate, np.array, .flatten().tolist(), torch.tensor). This happens in _prepare_inputs, which is a critical path executed frequently. These expensive conversions can introduce a significant performance bottleneck.
Consider simplifying this logic to use PyTorch operations directly to avoid these conversions and improve performance.
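One possible torch-only sketch, assuming `cp_kv_recover_idx_for_chunk` holds a list of per-request index lists (the float32 cast in the original may be an NPU-kernel workaround and is omitted here):

```python
import torch

def build_recover_idx(chunks: list, device: str = "cpu") -> torch.Tensor:
    # Concatenate the per-request index lists directly as torch tensors,
    # skipping the list -> numpy -> list -> tensor round trips.
    idx = torch.cat([torch.as_tensor(c, dtype=torch.int32) for c in chunks])
    # argsort yields the permutation that restores the original token order.
    return idx.to(device).argsort().to(torch.int32)
```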
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to the Contributing and Testing guides.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Signed-off-by: Apocalypse990923-qshi <qiushixu@usc.edu>
```python
).world_size if prefill_context_parallel_enable() else 1
self.dcp_world_size = get_dcp_group().world_size
num_requests = len(num_computed_tokens)
if request_ids is None:
```
Can request_ids ever be None here? If not, this branch can probably be removed.
```python
        req_id, 0)
else:
    start_rank = 0
    tokens_blank = 0
```
How about restructuring this as:

```python
start_rank = 0
tokens_blank = 0
if request_start_rank_dict is not None:
    start_rank, tokens_blank = request_start_rank_dict.get(
        req_id, 0)
```

Wouldn't that make the logic clearer?
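Combining this restructuring with the tuple-default fix raised earlier, one possible shape (`resolve_start_rank` is a hypothetical helper name for illustration):

```python
def resolve_start_rank(request_start_rank_dict, req_id):
    # Flat control flow, plus a (0, 0) tuple default so a missing
    # req_id cannot raise TypeError during tuple unpacking.
    start_rank, tokens_blank = 0, 0
    if request_start_rank_dict is not None:
        start_rank, tokens_blank = request_start_rank_dict.get(req_id, (0, 0))
    return start_rank, tokens_blank
```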
```python
from vllm_ascend.utils import (ACL_FORMAT_FRACTAL_ND, ACL_FORMAT_FRACTAL_NZ,
                               is_enable_nz, prefill_context_parallel_enable)
from vllm_ascend.worker.npu_input_batch import InputBatch
from vllm.logger import logger
```
```python
self.sin_cache = None
self.pcp_size = get_prefill_context_model_parallel_world_size(
) if prefill_context_parallel_enable() else 1
self.cp_rank = get_prefill_context_model_parallel_rank(
```
Rename self.cp_rank to self.pcp_rank for consistency.
```python
).device_group if self.tp_size > 1 else None

# Step indices for chunked prefill tracking
self._prefill_step_idx: int = 0
```
```python
seq_len1 = torch.tensor(prefill_metadata.query_lens,
                        dtype=torch.int32,
                        device=q_nope.device)
seq_len1_rank = seq_len1.cpu()  # q for each cp rank
```
```python
self.dcp_rank = get_decode_context_model_parallel_rank(
) if self.dcp_size > 1 else 0
decode_max_num_seqs = getattr(scheduler_config, 'decode_max_num_seqs',
                              0)
```
Is this check needed? This is only used in the prefill path.
What this PR does / why we need it?
Does this PR introduce any user-facing change?
How was this patch tested?