move with_prefill allreduce from cpu to npu (#2230)
### What this PR does / why we need it?
This PR improves distributed performance by moving the with_prefill
allreduce operation from the CPU to the NPU. Running the operation on
the NPU lets it use NPU network topology acceleration, which reduces
communication overhead and synchronization delays.
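For readers unfamiliar with the operation, the sketch below simulates in plain Python what an allreduce over a per-rank flag computes. It assumes, which this description does not state, that with_prefill is a 0/1 flag per rank and that the reduction is a maximum, so every rank learns whether any rank is in prefill. The helper `allreduce_max` is hypothetical and performs no real communication; it is not the code changed by this PR.

```python
def allreduce_max(per_rank_values):
    """Simulate a MAX allreduce: every rank receives the maximum over all ranks."""
    reduced = max(per_rank_values)
    # An allreduce returns the same reduced value to each participant.
    return [reduced for _ in per_rank_values]


# Hypothetical local flags on 4 ranks: only rank 2 is running a prefill step.
local_with_prefill = [0, 0, 1, 0]
global_with_prefill = allreduce_max(local_with_prefill)
print(global_with_prefill)  # [1, 1, 1, 1]
```

Under these assumptions the payload is tiny, so the cost per iteration is dominated by latency and synchronization rather than bandwidth, which is where the choice of communication path matters.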
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
The patch was benchmarked in a production-like environment:
Test Environment:
- Hardware: 4-unit A3 NPU cluster

Performance Metrics:
- Before: allreduce averaged 11 ms per iteration
- After: allreduce averaged 0.3 ms per iteration
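
To put the two averages in perspective, the short calculation below derives the speedup and the relative time saved. It uses only the 11 ms and 0.3 ms figures reported above; the variable names are illustrative.

```python
before_ms = 11.0  # reported average allreduce time per iteration on CPU
after_ms = 0.3    # reported average allreduce time per iteration on NPU

speedup = before_ms / after_ms
reduction_pct = 100 * (before_ms - after_ms) / before_ms

print(f"{speedup:.1f}x faster, {reduction_pct:.1f}% less time per iteration")
# prints "36.7x faster, 97.3% less time per iteration"
```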
Signed-off-by: liziyu <[email protected]>