Motivation.
The community is continuously optimizing the performance of the Alltoall EP approach (e.g., #1802). However, for Atlas 800I/T A2, the Allgather EP solution demonstrates superior performance compared to Alltoall [1].
First, the implementation of Flash Comm1 needs to be completed: PR #1802 only covered the Alltoall EP path, while Flash Comm1 for the FFN layer and for the MoE layer with Allgather EP still needs to be implemented.
After implementing Flash Comm1, further optimization of DP communication is needed. The diagram below illustrates the computation-communication flow under TP8 DP2. Taking the two communications before MoE as an example, the total communication volume of TP8 Allgather is the same as that of DP2 Broadcast. However, the former operates within a single machine, while the latter involves inter-machine communication. Since inter-machine bandwidth is only about 1/7 of intra-machine bandwidth, the latency of DP2 Broadcast is significantly higher than that of TP8 Allgather (empirical measurements show a ~6x difference for a sequence length of 32k).
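As a back-of-the-envelope check of that ratio (the hidden size and dtype below are illustrative assumptions, not part of the measurement):

```python
# Rough latency comparison of the two pre-MoE collectives under TP8 DP2.
# Only the ~1/7 bandwidth ratio and the 32k sequence length come from the
# text above; the other numbers are illustrative assumptions.
seq_len = 32 * 1024          # tokens per DP rank (32k sequence)
hidden = 7168                # assumed hidden size
bytes_per_elem = 2           # bf16

payload = seq_len * hidden * bytes_per_elem   # volume each collective must move

intra_node_bw = 1.0                     # normalized intra-machine bandwidth
inter_node_bw = intra_node_bw / 7       # inter-machine bandwidth ~1/7 of intra-machine

# Same total volume in both cases (see the paragraph above), but over different links:
t_tp8_allgather = payload / intra_node_bw   # stays inside one machine
t_dp2_broadcast = payload / inter_node_bw   # crosses machines

print(t_dp2_broadcast / t_tp8_allgather)    # ~7x, consistent with the measured ~6x
```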

The optimization approach is shown in the diagram below, again using the pre-MoE communication as an example. The key idea is to convert the DP2 Broadcast into a DP2 Allgather via padding and then merge it with the TP8 Allgather into a single EP16 Allgather. HCCL has optimized communication for this scenario; taking the Pipeline algorithm [3] as an example, the latency of an EP16 Allgather is approximately twice that of a TP8 Allgather. Because the TP8 Allgather required by the shared experts (previously TP8) is merged as well, the shared experts must be adjusted to TP16. The optimization principle for the post-MoE communication is similar: the communication operators are merged into an EP16 ReduceScatter.
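A minimal sketch of the merged pre-MoE collective in PyTorch terms, assuming an already-initialized process group `ep_group` spanning all TP x DP ranks and a caller-provided padded length `max_tokens_per_rank` (both hypothetical names, not the actual implementation):

```python
import torch
import torch.distributed as dist

def pre_moe_allgather(hidden_states: torch.Tensor,
                      max_tokens_per_rank: int,
                      ep_group: dist.ProcessGroup) -> torch.Tensor:
    """Sketch: one EP16 all-gather instead of TP8 all-gather + DP2 broadcast.

    Each rank pads its local token shard to a fixed length, which is what
    turns the DP2 broadcast into a DP2 all-gather that can be fused with the
    TP8 all-gather into a single collective over the merged group.
    """
    num_tokens, hidden = hidden_states.shape
    padded = torch.zeros(max_tokens_per_rank, hidden,
                         dtype=hidden_states.dtype, device=hidden_states.device)
    padded[:num_tokens] = hidden_states

    ep_world_size = dist.get_world_size(ep_group)   # 16 in the TP8 DP2 example
    gathered = torch.empty(ep_world_size * max_tokens_per_rank, hidden,
                           dtype=hidden_states.dtype, device=hidden_states.device)
    # Single collective over the merged EP group replaces the two collectives.
    dist.all_gather_into_tensor(gathered, padded, group=ep_group)
    return gathered
```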

We have implemented the aforementioned optimizations in our private repository, achieving approximately 15% throughput improvement over Flash Comm1.
Proposed Change.
- Support DP for Allgather EP: [Bugfix] fix bug in AllgatherEP #2288
- Support static EPLB for Allgather EP: [Feat] support static EPLB for allgather EP #2366
- Implement Flash Comm1 for MoE layer with Allgather EP and FFN layer: eager mode deepseekv3 flashcomm1 #2257
- Optimize performance in DP scenarios
  - Replace the DP multi-broadcast with an allgather, and merge it with the TP allgather
  - Shared experts TP16
  - Replace the DP allreduce + slice with a reducescatter, and merge it with the TP reducescatter (see the sketch after this list)
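For the post-MoE direction (last sub-item above), a mirror-image sketch under the same assumptions (`ep_group` spanning all TP x DP ranks, input in the padded, gathered layout produced before MoE):

```python
import torch
import torch.distributed as dist

def post_moe_reduce_scatter(expert_output: torch.Tensor,
                            ep_group: dist.ProcessGroup) -> torch.Tensor:
    """Sketch: one EP16 reduce-scatter instead of DP allreduce + slice and TP reduce-scatter.

    Each rank gets back only its own (still padded) token shard, summed over
    all ranks in the merged group.
    """
    ep_world_size = dist.get_world_size(ep_group)
    total_tokens, hidden = expert_output.shape
    assert total_tokens % ep_world_size == 0
    local = torch.empty(total_tokens // ep_world_size, hidden,
                        dtype=expert_output.dtype, device=expert_output.device)
    # Single collective over the merged EP group replaces the two collectives.
    dist.reduce_scatter_tensor(local, expert_output, group=ep_group)
    return local
```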
[1] https://gitcode.com/ascend-tribe/ascend-inference-cluster/blob/main/Overview/%E5%8D%8E%E4%B8%BA%E6%98%87%E8%85%BE%E6%9C%8D%E5%8A%A1%E5%99%A8_DeepSeek_V3_R1_%E6%8E%A8%E7%90%86%E9%83%A8%E7%BD%B2%E6%9C%80%E4%BD%B3%E5%AE%9E%E8%B7%B5.pdf
[2] https://gitcode.com/ascend-tribe/ascend-inference-cluster/blob/main/FlashComm/FlashComm%E5%A4%A7%E6%A8%A1%E5%9E%8B%E6%8E%A8%E7%90%86%E4%B8%AD%E7%9A%84AllReduce%E9%80%9A%E4%BF%A1%E4%BC%98%E5%8C%96%E6%8A%80%E6%9C%AF.pdf
[3] https://gitee.com/ascend/cann-hccl#/ascend/cann-hccl/blob/master/docs/Pipeline.md
Feedback Period.
No response
CC List.
@jianzs @ApsarasX @zzzzwwjj @ganyi1996ppo @zhaozx-cn
Any Other Things.
No response