
Commit 6a2f792

move with_prefill allreduce from cpu to npu (#2230)
### What this PR does / why we need it?

This PR optimizes distributed performance by migrating the `with_prefill` allreduce operation from the CPU to the NPU. Offloading this operation to the NPU leverages the NPU's network topology acceleration, significantly reducing communication overhead and synchronization delays.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

The patch was validated by benchmarking in a production-like environment.

Test environment:
- Hardware: 4-unit A3 NPU cluster

Performance metrics:
- Before: the allreduce operation averaged 11 ms per iteration
- After: the allreduce operation averaged 0.3 ms per iteration

Signed-off-by: liziyu <[email protected]>
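For context, here is a minimal micro-benchmark sketch of how the before/after latency numbers above could be reproduced. The timing helper, warmup and iteration counts, and the `torch_npu` import are illustrative assumptions, not part of the PR; `get_dp_group()`, `cpu_group`, and `device_group` are the same handles the diff below switches between. It must run on every data-parallel rank inside an initialized vLLM distributed setup.

```python
# Hypothetical micro-benchmark (not part of the PR): time a 1-element bool
# allreduce on the host-side cpu_group vs. the NPU device_group.
import time

import torch
import torch.distributed as dist
import torch_npu  # noqa: F401  (assumption: registers the "npu" device on Ascend)

from vllm.distributed.parallel_state import get_dp_group


def time_allreduce(device: str, group, iters: int = 100) -> float:
    """Return the mean latency in ms of a bool MAX-allreduce on `group`."""
    flag = torch.tensor([True], device=device, dtype=torch.bool)
    for _ in range(5):  # warmup: keep lazy group/stream init out of the timing
        dist.all_reduce(flag, group=group, op=dist.ReduceOp.MAX)
    if device != "cpu":
        torch.npu.synchronize()  # device collectives are async w.r.t. the host
    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(flag, group=group, op=dist.ReduceOp.MAX)
    if device != "cpu":
        torch.npu.synchronize()
    return (time.perf_counter() - start) / iters * 1e3


# Run on every data-parallel rank after vLLM's process groups are initialized.
cpu_ms = time_allreduce("cpu", get_dp_group().cpu_group)
npu_ms = time_allreduce("npu", get_dp_group().device_group)
print(f"cpu_group: {cpu_ms:.3f} ms/iter, device_group: {npu_ms:.3f} ms/iter")
```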
1 parent 31bf6f0

File tree

1 file changed: +2 −2 lines


vllm_ascend/worker/model_runner_v1.py

Lines changed: 2 additions & 2 deletions

```diff
@@ -637,10 +637,10 @@ def _get_forward_metadata_across_dp(
                     self.torchair_graph_batch_sizes
             ) == 1 and not self.in_profile_run:
                 with_prefill_tensor = torch.tensor([with_prefill],
-                                                   device="cpu",
+                                                   device="npu",
                                                    dtype=torch.bool)
                 dist.all_reduce(with_prefill_tensor,
-                                group=get_dp_group().cpu_group,
+                                group=get_dp_group().device_group,
                                 op=dist.ReduceOp.MAX)
                 if not with_prefill_tensor.item():
                     max_num_decode_tokens = self.torchair_graph_batch_sizes[0]
```
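Design note: because the tensor is `torch.bool`, the `ReduceOp.MAX` allreduce acts as a logical OR across data-parallel ranks, so every rank observes `with_prefill=True` whenever any rank has a prefill batch. Switching from `cpu_group` (a host-side group, typically gloo-backed in vLLM) to `device_group` keeps that synchronization on the NPU interconnect instead of round-tripping through host memory.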
