
Commit 6a2f792

move with_prefill allreduce from cpu to npu (#2230)
### What this PR does / why we need it?

This PR optimizes distributed performance by migrating the `with_prefill` allreduce operation from the CPU to the NPU. Offloading this operation to the NPU leverages the NPU's network topology acceleration, significantly reducing communication overhead and synchronization delays.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

The patch was validated by benchmarking in a production-like environment.

Test environment:
- Hardware: 4-unit A3 NPU cluster

Performance metrics:
- Before: the allreduce operation averaged 11 ms per iteration
- After: the allreduce operation averaged 0.3 ms per iteration

Signed-off-by: liziyu <[email protected]>
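For context, here is a minimal micro-benchmark sketch of how the before/after latency numbers above could be reproduced. The timing helper, warmup and iteration counts, and the `torch_npu` import are illustrative assumptions, not part of the PR; `get_dp_group()`, `cpu_group`, and `device_group` are the same handles the diff below switches between. It must run on every data-parallel rank inside an initialized vLLM distributed setup.

```python
# Hypothetical micro-benchmark (not part of the PR): time a 1-element bool
# allreduce on the host-side cpu_group vs. the NPU device_group.
import time

import torch
import torch.distributed as dist
import torch_npu  # noqa: F401  (assumption: registers the "npu" device on Ascend)

from vllm.distributed.parallel_state import get_dp_group


def time_allreduce(device: str, group, iters: int = 100) -> float:
    """Return the mean latency in ms of a bool MAX-allreduce on `group`."""
    flag = torch.tensor([True], device=device, dtype=torch.bool)
    for _ in range(5):  # warmup: keep lazy group/stream init out of the timing
        dist.all_reduce(flag, group=group, op=dist.ReduceOp.MAX)
    if device != "cpu":
        torch.npu.synchronize()  # device collectives are async w.r.t. the host
    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(flag, group=group, op=dist.ReduceOp.MAX)
    if device != "cpu":
        torch.npu.synchronize()
    return (time.perf_counter() - start) / iters * 1e3


# Run on every data-parallel rank after vLLM's process groups are initialized.
cpu_ms = time_allreduce("cpu", get_dp_group().cpu_group)
npu_ms = time_allreduce("npu", get_dp_group().device_group)
print(f"cpu_group: {cpu_ms:.3f} ms/iter, device_group: {npu_ms:.3f} ms/iter")
```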
1 parent 31bf6f0

File tree

1 file changed: +2 −2 lines


vllm_ascend/worker/model_runner_v1.py

Lines changed: 2 additions & 2 deletions

```diff
@@ -637,10 +637,10 @@ def _get_forward_metadata_across_dp(
                     self.torchair_graph_batch_sizes
             ) == 1 and not self.in_profile_run:
                 with_prefill_tensor = torch.tensor([with_prefill],
-                                                   device="cpu",
+                                                   device="npu",
                                                    dtype=torch.bool)
                 dist.all_reduce(with_prefill_tensor,
-                                group=get_dp_group().cpu_group,
+                                group=get_dp_group().device_group,
                                 op=dist.ReduceOp.MAX)
                 if not with_prefill_tensor.item():
                     max_num_decode_tokens = self.torchair_graph_batch_sizes[0]
```
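Design note: because the tensor is `torch.bool`, the `ReduceOp.MAX` allreduce acts as a logical OR across data-parallel ranks, so every rank observes `with_prefill=True` whenever any rank has a prefill batch. Switching from `cpu_group` (a host-side group, typically gloo-backed in vLLM) to `device_group` keeps that synchronization on the NPU interconnect instead of round-tripping through host memory.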
