[Long Sequence Feat] support chunk prefill #3734
LookAround0301 wants to merge 79 commits into vllm-project:main from
Conversation
Signed-off-by: LookAround <lixushi@huawei.com>
Signed-off-by: chenjie <chenjie137@huawei.com>
model runner support cp: input ids, position ids and slot mapping
model runner support cp: metadata, logits indices
Signed-off-by: Delphine-Nic <tanwenqin@huawei.com>
…_dev

# Conflicts:
#	vllm_ascend/attention/attention_v1.py
#	vllm_ascend/attention/mla_v1.py
#	vllm_ascend/distributed/parallel_state.py
#	vllm_ascend/envs.py
#	vllm_ascend/ops/fused_moe.py
#	vllm_ascend/platform.py
#	vllm_ascend/worker/model_runner_v1.py
…group initialization
Signed-off-by: zhangsicheng5 <zhangsicheng5@huawei.com>
support cp_kv_cache_interleave_size and pd disaggregate
Signed-off-by: LookAround <lixushi@huawei.com>
…_dev

# Conflicts:
#	vllm_ascend/attention/attention_v1.py
#	vllm_ascend/attention/mla_v1.py
#	vllm_ascend/attention/utils.py
#	vllm_ascend/distributed/llmdatadist_c_mgr_connector.py
#	vllm_ascend/envs.py
#	vllm_ascend/patch/worker/patch_common/patch_distributed.py
#	vllm_ascend/platform.py
#	vllm_ascend/utils.py
#	vllm_ascend/worker/model_runner_v1.py
Signed-off-by: Feng Liu <liufeng248@huawei.com>
Signed-off-by: Apocalypse990923-qshi <qiushixu@usc.edu>
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
Code Review
This PR introduces support for chunked prefill for long sequences, a significant feature involving extensive changes to attention mechanisms and the model runner for distributed context parallelism on Ascend NPUs. While the overall implementation appears robust, I have identified a critical bug that could lead to a runtime crash, along with two high-severity performance bottlenecks stemming from inefficient tensor manipulations and unnecessary CPU-GPU synchronizations. Addressing these issues is crucial for ensuring the correctness and performance of the new feature.
vllm_ascend/worker/block_table.py (outdated)

```python
# Get starting rank for this chunk
if request_start_rank_dict is not None:
    start_rank, tokens_blank = request_start_rank_dict.get(req_id, 0)
```
There is a potential TypeError here. If req_id is not found in request_start_rank_dict, request_start_rank_dict.get(req_id, 0) will return the integer 0. The subsequent attempt to unpack this integer into start_rank, tokens_blank will cause a crash.
While the current call sites might ensure req_id is always present, this code is fragile. To make it more robust, the default value should be a tuple (0, 0) to match the expected unpacking.
```diff
- start_rank, tokens_blank = request_start_rank_dict.get(req_id, 0)
+ start_rank, tokens_blank = request_start_rank_dict.get(req_id, (0, 0))
```
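A minimal standalone sketch of the failure mode (the dict contents and request ids here are hypothetical, for illustration only):

```python
# A present key maps to a (start_rank, tokens_blank) tuple.
request_start_rank_dict = {"req-0": (2, 5)}

crashed = False
try:
    # Missing key with an int default: .get returns 0, and unpacking
    # the integer into two names raises TypeError.
    start_rank, tokens_blank = request_start_rank_dict.get("req-1", 0)
except TypeError:
    crashed = True

# With a tuple default, unpacking succeeds whether or not the key exists.
start_rank, tokens_blank = request_start_rank_dict.get("req-1", (0, 0))
```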
```python
k_nope, v = kv_nope.split([self.qk_nope_head_dim, self.v_head_dim], dim=-1)
k_pe = k_pe.expand((*k_nope.shape[:-1], -1))

seq_len = torch.stack([seq_len1.cpu(), seq_len2.cpu()])
```
In _compute_prefill_context, seq_len is constructed by moving seq_len1 and seq_len2 to the CPU in every iteration of the loop. This CPU-GPU synchronization inside a loop can be a significant performance bottleneck, especially since this is in the critical prefill path. It appears the npu_ring_mla kernel requires seqlen on the CPU.
To optimize this, consider moving seq_len1.cpu() out of the loop, as seq_len1 is not modified within it. This would reduce the number of GPU-to-CPU transfers by half within this hot loop.
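The hoisting suggested above can be sketched as follows (a minimal illustration with a hypothetical helper name; the real loop body in `_compute_prefill_context` does more work per iteration):

```python
import torch

def stack_seq_lens(seq_len1: torch.Tensor, seq_len2_chunks: list) -> list:
    # seq_len1 is loop-invariant, so transfer it to the CPU once
    # instead of once per chunk as in the original loop body.
    seq_len1_cpu = seq_len1.cpu()
    # seq_len2 still changes per chunk, so its transfer stays in the loop.
    return [torch.stack([seq_len1_cpu, s2.cpu()]) for s2 in seq_len2_chunks]
```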
```python
cp_kv_recover_idx_for_chunk = torch.from_numpy(
    np.concatenate(self.cp_kv_recover_idx_for_chunk)).to(device=self.device)
cp_kv_recover_idx_for_chunk.copy_(torch.tensor(
    np.array(self.cp_kv_recover_idx_for_chunk).flatten().tolist()),
    non_blocking=True)
self.cp_kv_recover_idx_for_chunk = cp_kv_recover_idx_for_chunk.to(
    torch.float32).argsort().to(torch.int32)
```
The creation of cp_kv_recover_idx_for_chunk in generate_kv_idx involves multiple inefficient conversions between Python lists, NumPy arrays, and PyTorch tensors (e.g., np.concatenate, np.array, .flatten().tolist(), torch.tensor). This happens in _prepare_inputs, which is a critical path executed frequently. These expensive conversions can introduce a significant performance bottleneck.
Consider simplifying this logic to use PyTorch operations directly to avoid these conversions and improve performance.
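One possible torch-only sketch, assuming `cp_kv_recover_idx_for_chunk` holds a list of per-request index lists (the float32 cast in the original may be an NPU-kernel workaround and is omitted here):

```python
import torch

def build_recover_idx(chunks: list, device: str = "cpu") -> torch.Tensor:
    # Concatenate the per-request index lists directly as torch tensors,
    # skipping the list -> numpy -> list -> tensor round trips.
    idx = torch.cat([torch.as_tensor(c, dtype=torch.int32) for c in chunks])
    # argsort yields the permutation that restores the original token order.
    return idx.to(device).argsort().to(torch.int32)
```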
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to the Contributing and Testing guides.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Signed-off-by: Apocalypse990923-qshi <qiushixu@usc.edu>
```python
).world_size if prefill_context_parallel_enable() else 1
self.dcp_world_size = get_dcp_group().world_size
num_requests = len(num_computed_tokens)
if request_ids is None:
```
Can request_ids ever be None here? If not, this branch can probably be removed.
```python
        req_id, 0)
else:
    start_rank = 0
    tokens_blank = 0
```
How about restructuring this as:

```python
start_rank = 0
tokens_blank = 0
if request_start_rank_dict is not None:
    start_rank, tokens_blank = request_start_rank_dict.get(
        req_id, 0)
```

Wouldn't that make the logic clearer?
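Combining this restructuring with the tuple-default fix raised earlier, one possible shape (`resolve_start_rank` is a hypothetical helper name for illustration):

```python
def resolve_start_rank(request_start_rank_dict, req_id):
    # Flat control flow, plus a (0, 0) tuple default so a missing
    # req_id cannot raise TypeError during tuple unpacking.
    start_rank, tokens_blank = 0, 0
    if request_start_rank_dict is not None:
        start_rank, tokens_blank = request_start_rank_dict.get(req_id, (0, 0))
    return start_rank, tokens_blank
```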
```python
from vllm_ascend.utils import (ACL_FORMAT_FRACTAL_ND, ACL_FORMAT_FRACTAL_NZ,
                               is_enable_nz, prefill_context_parallel_enable)
from vllm_ascend.worker.npu_input_batch import InputBatch
from vllm.logger import logger
```
```python
self.sin_cache = None
self.pcp_size = get_prefill_context_model_parallel_world_size(
) if prefill_context_parallel_enable() else 1
self.cp_rank = get_prefill_context_model_parallel_rank(
```
Rename self.cp_rank to self.pcp_rank for consistency.
```python
).device_group if self.tp_size > 1 else None

# Step indices for chunked prefill tracking
self._prefill_step_idx: int = 0
```
```python
seq_len1 = torch.tensor(prefill_metadata.query_lens,
                        dtype=torch.int32,
                        device=q_nope.device)
seq_len1_rank = seq_len1.cpu()  # q for each cp rank
```
```python
self.dcp_rank = get_decode_context_model_parallel_rank(
) if self.dcp_size > 1 else 0
decode_max_num_seqs = getattr(scheduler_config, 'decode_max_num_seqs',
                              0)
```
Is this check needed? This is only used in the prefill path.
What this PR does / why we need it?
Does this PR introduce any user-facing change?
How was this patch tested?