
[Long Sequence Feat] support chunk prefill #3734

Closed
LookAround0301 wants to merge 79 commits into vllm-project:main from LookAround0301:chunk_prefill

Conversation

@LookAround0301
Contributor

@LookAround0301 LookAround0301 commented Oct 25, 2025

What this PR does / why we need it?

Does this PR introduce any user-facing change?

How was this patch tested?

LookAround0301 and others added 30 commits September 24, 2025 22:16
Signed-off-by: LookAround <lixushi@huawei.com>
Signed-off-by: chenjie <chenjie137@huawei.com>
model runner support cp: input ids, position ids and slot mapping
Signed-off-by: chenjie <chenjie137@huawei.com>
Signed-off-by: LookAround <lixushi@huawei.com>
model runner support cp: metadata, logits indices
Signed-off-by: LookAround <lixushi@huawei.com>
Signed-off-by: LookAround <lixushi@huawei.com>
Signed-off-by: LookAround <lixushi@huawei.com>
Signed-off-by: LookAround <lixushi@huawei.com>
Signed-off-by: LookAround <lixushi@huawei.com>
Signed-off-by: LookAround <lixushi@huawei.com>
Signed-off-by: LookAround <lixushi@huawei.com>
Signed-off-by: Delphine-Nic <tanwenqin@huawei.com>
Signed-off-by: LookAround <lixushi@huawei.com>
…_dev

# Conflicts:
#	vllm_ascend/attention/attention_v1.py
#	vllm_ascend/attention/mla_v1.py
#	vllm_ascend/distributed/parallel_state.py
#	vllm_ascend/envs.py
#	vllm_ascend/ops/fused_moe.py
#	vllm_ascend/platform.py
#	vllm_ascend/worker/model_runner_v1.py
Signed-off-by: Delphine-Nic <tanwenqin@huawei.com>
…group initialization

Signed-off-by: Delphine-Nic <tanwenqin@huawei.com>
Signed-off-by: LookAround <lixushi@huawei.com>
Signed-off-by: LookAround <lixushi@huawei.com>
Signed-off-by: Delphine-Nic <tanwenqin@huawei.com>
Signed-off-by: zhangsicheng5 <zhangsicheng5@huawei.com>
support cp_kv_cache_interleave_size and pd disaggregate
Signed-off-by: LookAround <lixushi@huawei.com>
…_dev

# Conflicts:
#	vllm_ascend/attention/attention_v1.py
#	vllm_ascend/attention/mla_v1.py
#	vllm_ascend/attention/utils.py
#	vllm_ascend/distributed/llmdatadist_c_mgr_connector.py
#	vllm_ascend/envs.py
#	vllm_ascend/patch/worker/patch_common/patch_distributed.py
#	vllm_ascend/platform.py
#	vllm_ascend/utils.py
#	vllm_ascend/worker/model_runner_v1.py
Signed-off-by: LookAround <lixushi@huawei.com>
Signed-off-by: LookAround <lixushi@huawei.com>
Signed-off-by: zhangsicheng5 <zhangsicheng5@huawei.com>
Signed-off-by: LookAround <lixushi@huawei.com>
Signed-off-by: Feng Liu <liufeng248@huawei.com>
LookAround0301 and others added 15 commits October 22, 2025 23:44
Signed-off-by: LookAround <lixushi@huawei.com>
Signed-off-by: LookAround <lixushi@huawei.com>
Signed-off-by: LookAround <lixushi@huawei.com>
Signed-off-by: Delphine-Nic <tanwenqin@huawei.com>
Signed-off-by: Delphine-Nic <tanwenqin@huawei.com>
Signed-off-by: LookAround <lixushi@huawei.com>
Signed-off-by: Apocalypse990923-qshi <qiushixu@usc.edu>
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
Signed-off-by: Apocalypse990923-qshi <qiushixu@usc.edu>
Signed-off-by: Apocalypse990923-qshi <qiushixu@usc.edu>
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This PR introduces support for chunked prefill for long sequences, a significant feature involving extensive changes to attention mechanisms and the model runner for distributed context parallelism on Ascend NPUs. While the overall implementation appears robust, I have identified a critical bug that could lead to a runtime crash, along with two high-severity performance bottlenecks stemming from inefficient tensor manipulations and unnecessary CPU-GPU synchronizations. Addressing these issues is crucial for ensuring the correctness and performance of the new feature.


        # Get starting rank for this chunk
        if request_start_rank_dict is not None:
            start_rank, tokens_blank = request_start_rank_dict.get(req_id, 0)
Contributor


critical

There is a potential TypeError here. If req_id is not found in request_start_rank_dict, request_start_rank_dict.get(req_id, 0) will return the integer 0. The subsequent attempt to unpack this integer into start_rank, tokens_blank will cause a crash.

While the current call sites might ensure req_id is always present, this code is fragile. To make it more robust, the default value should be a tuple (0, 0) to match the expected unpacking.

Suggested change
start_rank, tokens_blank = request_start_rank_dict.get(req_id, 0)
start_rank, tokens_blank = request_start_rank_dict.get(req_id, (0, 0))
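The failure mode and the fix can be demonstrated in isolation. A minimal sketch (the dict contents here are invented for illustration):

```python
# Stand-in for the PR's dict mapping request id -> (start_rank, tokens_blank).
request_start_rank_dict = {"req-0": (2, 5)}

# With an integer default, a missing key crashes on tuple unpacking:
# `x, y = 0` raises "TypeError: cannot unpack non-iterable int object".
crashed = False
try:
    start_rank, tokens_blank = request_start_rank_dict.get("missing", 0)
except TypeError:
    crashed = True

# With a tuple default, the missing-key case degrades gracefully to zeros.
start_rank, tokens_blank = request_start_rank_dict.get("missing", (0, 0))
assert crashed and (start_rank, tokens_blank) == (0, 0)
```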

k_nope, v = kv_nope.split([self.qk_nope_head_dim, self.v_head_dim], dim=-1)
k_pe = k_pe.expand((*k_nope.shape[:-1], -1))

seq_len = torch.stack([seq_len1.cpu(), seq_len2.cpu()])
Contributor


high

In _compute_prefill_context, seq_len is constructed by moving seq_len1 and seq_len2 to the CPU in every iteration of the loop. This CPU-GPU synchronization inside a loop can be a significant performance bottleneck, especially since this is in the critical prefill path. It appears the npu_ring_mla kernel requires seqlen on the CPU.

To optimize this, consider moving seq_len1.cpu() out of the loop, as seq_len1 is not modified within it. This would reduce the number of GPU-to-CPU transfers by half within this hot loop.
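The hoisting idea can be sketched without torch; `to_cpu` below is a hypothetical stand-in for `Tensor.cpu()` that just counts device-to-host transfers, so the savings are visible:

```python
transfer_count = 0

def to_cpu(t):
    """Stand-in for tensor.cpu(): counts each device-to-host transfer."""
    global transfer_count
    transfer_count += 1
    return list(t)

seq_len1 = [4, 8]                          # loop-invariant: q lens per rank
chunk_seq_len2 = [[1, 2], [3, 4], [5, 6]]  # varies per context chunk

# Flagged pattern: seq_len1 crosses the device boundary on every iteration.
transfer_count = 0
for seq_len2 in chunk_seq_len2:
    seq_len = [to_cpu(seq_len1), to_cpu(seq_len2)]
naive_transfers = transfer_count  # 2 transfers x 3 iterations = 6

# Hoisted pattern: transfer the invariant tensor once, before the loop.
transfer_count = 0
seq_len1_cpu = to_cpu(seq_len1)
for seq_len2 in chunk_seq_len2:
    seq_len = [seq_len1_cpu, to_cpu(seq_len2)]
hoisted_transfers = transfer_count  # 1 + 3 = 4

assert naive_transfers == 6 and hoisted_transfers == 4
```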

Comment on lines +1336 to +1342
        cp_kv_recover_idx_for_chunk = torch.from_numpy(
            np.concatenate(self.cp_kv_recover_idx_for_chunk)).to(
                device=self.device)
        cp_kv_recover_idx_for_chunk.copy_(torch.tensor(
            np.array(self.cp_kv_recover_idx_for_chunk).flatten().tolist()),
            non_blocking=True)
        self.cp_kv_recover_idx_for_chunk = cp_kv_recover_idx_for_chunk.to(
            torch.float32).argsort().to(torch.int32)
Contributor


high

The creation of cp_kv_recover_idx_for_chunk in generate_kv_idx involves multiple inefficient conversions between Python lists, NumPy arrays, and PyTorch tensors (e.g., np.concatenate, np.array, .flatten().tolist(), torch.tensor). This happens in _prepare_inputs, which is a critical path executed frequently. These expensive conversions can introduce a significant performance bottleneck.

Consider simplifying this logic to use PyTorch operations directly to avoid these conversions and improve performance.
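Functionally, the flagged pipeline reduces to "flatten the per-chunk indices once, then argsort". A torch-free sketch (names mirror the PR, but this is an illustration, not the actual implementation; in the PR this could be a single `torch.argsort` on a tensor built once on the device):

```python
# Per-chunk KV recovery indices, as a list of lists (invented sample data).
cp_kv_recover_idx_for_chunk = [[3, 0], [2, 1]]

# One flattening pass replaces the np.concatenate / np.array /
# .flatten().tolist() / torch.tensor conversion chain.
flat = [i for chunk in cp_kv_recover_idx_for_chunk for i in chunk]

# argsort: the permutation that sorts `flat` ascending, i.e. the
# recovery order; the float32 cast in the original is unnecessary.
recover_idx = sorted(range(len(flat)), key=flat.__getitem__)

assert flat == [3, 0, 2, 1]
assert recover_idx == [1, 3, 2, 0]
```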

@github-actions
Contributor

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description, to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@github-actions
Contributor

This pull request has conflicts, please resolve those before we can evaluate the pull request.

Signed-off-by: Apocalypse990923-qshi <qiushixu@usc.edu>
Signed-off-by: Apocalypse990923-qshi <qiushixu@usc.edu>
).world_size if prefill_context_parallel_enable() else 1
self.dcp_world_size = get_dcp_group().world_size
num_requests = len(num_computed_tokens)
if request_ids is None:
Collaborator


Can request_ids actually be None here? If not, can this branch be removed?

req_id, 0)
else:
start_rank = 0
tokens_blank = 0
Collaborator

@zhenwenqi2024 zhenwenqi2024 Oct 27, 2025


How about restructuring this block as:

start_rank = 0
tokens_blank = 0
if request_start_rank_dict is not None:
    start_rank, tokens_blank = request_start_rank_dict.get(
        req_id, 0)

That way the logic reads more clearly.

from vllm_ascend.utils import (ACL_FORMAT_FRACTAL_ND, ACL_FORMAT_FRACTAL_NZ,
is_enable_nz, prefill_context_parallel_enable)
from vllm_ascend.worker.npu_input_batch import InputBatch
from vllm.logger import logger
Collaborator


Remove the logger import.

self.sin_cache = None
self.pcp_size = get_prefill_context_model_parallel_world_size(
) if prefill_context_parallel_enable() else 1
self.cp_rank = get_prefill_context_model_parallel_rank(
Collaborator


Rename self.cp_rank to self.pcp_rank.

).device_group if self.tp_size > 1 else None

# Step indices for chunked prefill tracking
self._prefill_step_idx: int = 0
Collaborator


Are these two parameters actually used? If not, remove them.

seq_len1 = torch.tensor(prefill_metadata.query_lens,
dtype=torch.int32,
device=q_nope.device)
seq_len1_rank = seq_len1.cpu() # q for each cp rank
Collaborator


This triggers a synchronization; does it hurt performance?

self.dcp_rank = get_decode_context_model_parallel_rank(
) if self.dcp_size > 1 else 0
decode_max_num_seqs = getattr(scheduler_config, 'decode_max_num_seqs',
0)
Collaborator


Is this check needed? This is only used during prefill.

@LookAround0301 LookAround0301 deleted the chunk_prefill branch November 12, 2025 01:44


7 participants