Hi Megatron team:
During MoE training, there might be opportunities to combine communication operators when simultaneously using sequence and expert parallelism, though I'm uncertain if this hypothesis is accurate.
In the original sequence parallelism + expert parallelism process, assuming sp=ep=4, the activations undergo an all-gather phase after dropout, followed by gating operations over all sequences, whose tokens are then selectively routed to different GPUs' experts. However, if I move the gating function before the all-gather phase, each GPU would perform gating on only its own sequence shard, followed by an all-to-all communication based on the gating results. The rationale behind this approach is that, unlike FFN tensor parallelism, during the MoE forward pass each GPU only needs to handle a subset of tokens, so theoretically the all-gather phase could be eliminated entirely (as shown in the figure below).
I want to understand whether this approach is correct, as the MoE training process typically incorporates some load-balancing-related loss functions. Does altering the order affect the backward process? Moreover, if this approach is correct, does Megatron-LM support this communication concept?
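To make the intuition concrete, here is a minimal back-of-envelope sketch of the per-rank communication volume under the two orderings. All names and numbers are illustrative assumptions for this question, not Megatron-LM internals; `topk=1` routing and perfectly even sharding are assumed.

```python
# Hypothetical per-rank communication volume, measured in token
# activations, for the two schemes discussed above (sp = ep = world).

def allgather_then_gate(seq_len, world):
    # Scheme A: all-gather first. Each rank holds seq_len / world
    # tokens and receives every other rank's shard before gating.
    local = seq_len // world
    return local * (world - 1)          # tokens received per rank

def gate_then_all_to_all(seq_len, world, topk=1):
    # Scheme B: gate locally, then all-to-all only the routed tokens.
    # Worst case upper bound: every local token is sent to a remote
    # expert (top-k routing multiplies the volume by k).
    local = seq_len // world
    return local * topk                 # tokens sent per rank

if __name__ == "__main__":
    s, w = 4096, 4
    print(allgather_then_gate(s, w))    # 3072
    print(gate_then_all_to_all(s, w))   # 1024
```

Under these assumptions, gating before the collective cuts the per-rank volume from `(world - 1)` shards to at most `topk` shards, which is the saving the question is asking about; whether the auxiliary load-balancing loss still sees the statistics it needs after the reordering is a separate question.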
This discussion was converted from issue #1039 on September 04, 2024 18:26.