
Conversation

@LookAround0301 (Contributor) commented Sep 29, 2025

What this PR does / why we need it?

Does this PR introduce any user-facing change?

How was this patch tested?

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling out the PR description, so reviewers and future developers can understand the change.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@gemini-code-assist bot left a comment

Code Review

This pull request introduces support for context parallelism (CP) and decode context parallelism (DCP) for Ascend NPUs, which is a significant feature addition. The changes are extensive, touching attention mechanisms, worker logic, and distributed state management. While the core implementation for CP/DCP seems thorough, I've identified several critical issues. These include a potential performance regression due to the removal of a workaround for tensor.tolist(), bugs in the new example script that lead to incorrect performance measurements, and the removal of important configuration logic for non-MLA models that could cause issues. Additionally, there are opportunities for performance improvements in newly added helper functions and some leftover debugging code that should be removed.

     if max_gen_len == 1:
         # No spec decode tokens.
-        valid_sampled_token_ids = self._to_list(sampled_token_ids)
+        valid_sampled_token_ids = sampled_token_ids.tolist()

critical

The custom _to_list method, which was a workaround for a performance issue with tensor.tolist() causing an implicit device-wide synchronization, has been removed. The call site now uses sampled_token_ids.tolist() directly. This likely reintroduces the performance problem that the workaround was meant to solve. Unless the underlying issue in torch_npu has been resolved, the original workaround should be restored to avoid a performance regression.

Comment on lines 3541 to 3822
     def _build_drafter_prepare_inputs_torchair_param(self):
-        return False
-
-    def _to_list(self, sampled_token_ids: torch.Tensor) -> list[list[int]]:
-        # This is a short term mitigation for issue mentioned in
-        # https://github.com/vllm-project/vllm/issues/22754.
-        # `tolist` would trigger a npu wise stream sync, which
-        # would block other copy ops from other npu streams.
-        # A npu event sync would avoid such a situation. Since
-        # this is in the critical path of every single model
-        # forward loop, this has caused perf issue for a disagg
-        # setup.
-        pinned = self.sampled_token_ids_pinned_cpu[:sampled_token_ids.shape[0]]
-        pinned.copy_(sampled_token_ids, non_blocking=True)
-        self.transfer_event.record()
-        self.transfer_event.synchronize()
-        return pinned.tolist()
+        return False

critical

The _to_list method, which contained a performance-critical workaround for tensor.tolist(), has been removed. This is likely to cause a performance regression. Please restore this method and its usage at call sites.
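
For reference, a minimal sketch of what restoring the usage at the call site could look like, assuming _to_list is re-added exactly as in the removed code above:

    if max_gen_len == 1:
        # No spec decode tokens; route through the event-synced pinned-memory
        # helper instead of calling tolist() directly on the device tensor.
        valid_sampled_token_ids = self._to_list(sampled_token_ids)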

ensure_model_parallel_initialized(
    self.parallel_config.tensor_parallel_size,
    self.parallel_config.pipeline_parallel_size)
print(f"context_parallel_enable:{context_parallel_enable}")

high

A print statement has been left in the code. This will produce verbose and unnecessary output in the logs, which is undesirable in production environments and can make debugging more difficult.
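
If the value is still useful for diagnostics, one option is to emit it at debug level instead; a minimal sketch, assuming the module follows vLLM's usual logger setup (init_logger from vllm.logger):

    from vllm.logger import init_logger

    logger = init_logger(__name__)
    logger.debug("context_parallel_enable: %s", context_parallel_enable)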

Signed-off-by: LookAround <[email protected]>

This pull request has conflicts, please resolve those before we can evaluate the pull request.

tokens = [scheduler_output.num_scheduled_tokens[i] for i in req_ids]
original_num_scheduled_tokens = np.array(tokens, dtype=np.int32)
original_total_num_scheduled_tokens = total_num_scheduled_tokens
tokens = self._update_tokens_for_cp(tokens, scheduler_output)

Will this modification lead to a performance degradation when CP is not enabled?
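
If it does, one possible mitigation is to guard the call so the non-CP path is untouched; a sketch only, where the attribute name context_parallel_size is an assumption rather than something taken from this PR:

    # Hypothetical guard at the call site: only rewrite the per-request token
    # counts when context parallelism is actually enabled.
    if self.parallel_config.context_parallel_size > 1:
        tokens = self._update_tokens_for_cp(tokens, scheduler_output)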

Signed-off-by: Delphine-Nic <[email protected]>