Hi Megatron team:
During MoE training, there might be opportunities to combine communication operators when simultaneously using sequence and expert parallelism, though I'm uncertain if this hypothesis is accurate.
In the original sequence parallelism + expert parallelism process, assuming sp=ep=4, the activations undergo an all-gather phase after dropout, followed by gating operations over all sequences, whose tokens are then selectively routed to different GPUs' experts. However, if I move the gating function before the all-gather phase, each GPU would perform gating on only its own sequence shard, followed by an all-to-all communication based on the gating results. The rationale behind this approach is that, unlike FFN tensor parallelism, during the MoE forward pass each GPU only needs to handle a subset of tokens, so theoretically the all-gather phase could be eliminated entirely (as shown in the figure below).
I want to understand whether this approach is correct, as the MoE training process typically incorporates some load-balancing-related loss functions. Does altering the order affect the backward process? Moreover, if this approach is correct, does Megatron-LM support this communication concept?
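To make the intuition concrete, here is a minimal back-of-envelope sketch of the per-rank communication volume under the two orderings. All names and numbers are illustrative assumptions for this question, not Megatron-LM internals; `topk=1` routing and perfectly even sharding are assumed.

```python
# Hypothetical per-rank communication volume, measured in token
# activations, for the two schemes discussed above (sp = ep = world).

def allgather_then_gate(seq_len, world):
    # Scheme A: all-gather first. Each rank holds seq_len / world
    # tokens and receives every other rank's shard before gating.
    local = seq_len // world
    return local * (world - 1)          # tokens received per rank

def gate_then_all_to_all(seq_len, world, topk=1):
    # Scheme B: gate locally, then all-to-all only the routed tokens.
    # Worst case upper bound: every local token is sent to a remote
    # expert (top-k routing multiplies the volume by k).
    local = seq_len // world
    return local * topk                 # tokens sent per rank

if __name__ == "__main__":
    s, w = 4096, 4
    print(allgather_then_gate(s, w))    # 3072
    print(gate_then_all_to_all(s, w))   # 1024
```

Under these assumptions, gating before the collective cuts the per-rank volume from `(world - 1)` shards to at most `topk` shards, which is the saving the question is asking about; whether the auxiliary load-balancing loss still sees the statistics it needs after the reordering is a separate question.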
This discussion was converted from issue #1039 on September 04, 2024 18:26.