support cp&dcp #3260
base: main
Conversation
Signed-off-by: LookAround <[email protected]>
Signed-off-by: chenjie <[email protected]>
model runner support cp: input ids, position ids and slot mapping
Signed-off-by: chenjie <[email protected]>
Signed-off-by: LookAround <[email protected]>
model runner support cp: metadata, logits indices
Signed-off-by: LookAround <[email protected]>
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request introduces support for context parallelism (CP) and decode context parallelism (DCP) for Ascend NPUs, which is a significant feature addition. The changes are extensive, touching attention mechanisms, worker logic, and distributed state management. While the core implementation for CP/DCP seems thorough, I've identified several critical issues. These include a potential performance regression due to the removal of a workaround for tensor.tolist(), bugs in the new example script that lead to incorrect performance measurements, and the removal of important configuration logic for non-MLA models that could cause issues. Additionally, there are opportunities for performance improvements in newly added helper functions and some leftover debugging code that should be removed.
if max_gen_len == 1:
    # No spec decode tokens.
-   valid_sampled_token_ids = self._to_list(sampled_token_ids)
+   valid_sampled_token_ids = sampled_token_ids.tolist()
The custom _to_list method, which was a workaround for a performance issue with tensor.tolist() causing an implicit device-wide synchronization, has been removed. The call site now uses sampled_token_ids.tolist() directly. This likely reintroduces the performance problem that the workaround was meant to solve. Unless the underlying issue in torch_npu has been resolved, the original workaround should be restored to avoid a performance regression.
def _build_drafter_prepare_inputs_torchair_param(self):
    return False

-def _to_list(self, sampled_token_ids: torch.Tensor) -> list[list[int]]:
-    # This is a short term mitigation for issue mentioned in
-    # https://github.com/vllm-project/vllm/issues/22754.
-    # `tolist` would trigger a npu wise stream sync, which
-    # would block other copy ops from other npu streams.
-    # A npu event sync would avoid such a situation. Since
-    # this is in the critical path of every single model
-    # forward loop, this has caused perf issue for a disagg
-    # setup.
-    pinned = self.sampled_token_ids_pinned_cpu[:sampled_token_ids.shape[0]]
-    pinned.copy_(sampled_token_ids, non_blocking=True)
-    self.transfer_event.record()
-    self.transfer_event.synchronize()
-    return pinned.tolist()
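For context, here is a minimal, self-contained sketch of the same event-synchronized device-to-host copy pattern, in case the workaround needs to be restored. The class name, constructor parameters, and the use of torch.cuda.Event as a stand-in for the Ascend/torch_npu event API are illustrative assumptions, not the project's actual code:

import torch


class SamplerOutputCopier:
    """Minimal sketch of the event-synchronized device-to-host copy pattern."""

    def __init__(self, max_tokens: int, width: int) -> None:
        # Pinned host memory enables a truly asynchronous (non_blocking) D2H copy.
        self.sampled_token_ids_pinned_cpu = torch.empty(
            (max_tokens, width), dtype=torch.int64, pin_memory=True)
        # torch.cuda.Event is used as a stand-in here; on Ascend the
        # corresponding torch_npu event type would be used instead.
        self.transfer_event = torch.cuda.Event()

    def to_list(self, sampled_token_ids: torch.Tensor) -> list[list[int]]:
        # Copy only the rows we need into the pinned buffer.
        pinned = self.sampled_token_ids_pinned_cpu[:sampled_token_ids.shape[0]]
        pinned.copy_(sampled_token_ids, non_blocking=True)
        # Wait only for this copy via an event, instead of letting
        # .tolist() trigger a device-wide stream synchronization.
        self.transfer_event.record()
        self.transfer_event.synchronize()
        return pinned.tolist()

The key design point is that the event synchronizes only the copy that was just recorded, so other streams (for example KV-cache transfer streams in a disaggregated setup) are not blocked.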
ensure_model_parallel_initialized(
    self.parallel_config.tensor_parallel_size,
    self.parallel_config.pipeline_parallel_size)
print(f"context_parallel_enable:{context_parallel_enable}")
Signed-off-by: LookAround <[email protected]>
This pull request has conflicts, please resolve those before we can evaluate the pull request.
tokens = [scheduler_output.num_scheduled_tokens[i] for i in req_ids]
original_num_scheduled_tokens = np.array(tokens, dtype=np.int32)
original_total_num_scheduled_tokens = total_num_scheduled_tokens
tokens = self._update_tokens_for_cp(tokens, scheduler_output) |
Will this modification lead to a performance degradation when CP is not enabled?
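To illustrate the concern, here is a hedged sketch of an early-exit fast path that would leave the non-CP case untouched. The function name mirrors the PR's _update_tokens_for_cp, but the cp_size parameter and the even-split logic are assumptions for illustration only:

def update_tokens_for_cp(tokens: list[int], cp_size: int) -> list[int]:
    """Split each request's scheduled token count across CP ranks (sketch)."""
    # Fast path: with context parallelism disabled there is nothing to
    # redistribute, so the non-CP path pays no extra per-step cost.
    if cp_size <= 1:
        return tokens
    # Illustrative even split with ceiling division so no token is dropped.
    return [(t + cp_size - 1) // cp_size for t in tokens]


# Example: two CP ranks, two requests with 17 and 9 scheduled tokens.
print(update_tokens_for_cp([17, 9], cp_size=2))  # -> [9, 5]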
Signed-off-by: Delphine-Nic <[email protected]>
What this PR does / why we need it?
Does this PR introduce any user-facing change?
How was this patch tested?