Conversation
Have you tested that this PR works correctly with EP > 1? Did it show the expected behavior?
Hi, I just tested it with EP=4 on the GAIA cluster on 2 nodes, and it does show the expected behavior.
Could you review this PR @youngeunkwon0405 @shjwudp? Thanks a lot!
/ok to test 33ede31 |
Seems to have a lint error. |
Force-pushed from 33ede31 to ce44736.
I fixed the linting; could you retrigger the CI? It should be fine now. Thanks!
/ok to test ce44736 |
Force-pushed from ce44736 to c5b9670.
Could we also push this PR to final review? @youngeunkwon0405
Hi @jeffnvidia, this PR depends on PR #2663, so we'll need to wait for that one to be merged first. Thanks for your patience.
Please resolve conflicts now that 2663 has gone in. |
Hey, actually I added everything to the new PR that uses the process-group collection, so I think I can close this current PR: #3249
What does this PR do?
This PR extends the all-gather/reduce-scatter overlap optimization to support Expert Parallelism (EP), enabling communication overlap for MoE (Mixture-of-Experts) models. It creates a separate process group for expert all-gather operations to avoid head-of-line blocking with gradient reduce-scatter, following the same pattern as the base implementation for regular data parallelism.
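As a rough illustration of the head-of-line effect mentioned above (the function names and millisecond figures here are purely hypothetical, not taken from the PR), queuing the expert all-gather behind the gradient reduce-scatter on a shared communicator serializes the two operations, while a dedicated process group lets them overlap:

```python
# Hypothetical back-of-the-envelope model: collectives issued on the same
# communicator execute in order, so on a shared process group the expert
# all-gather must wait for the gradient reduce-scatter to finish.
def finish_time_shared_group(reduce_scatter_ms: float, all_gather_ms: float) -> float:
    return reduce_scatter_ms + all_gather_ms

# With a separate process group for the expert all-gather, the two
# collectives can proceed concurrently, so the critical path is just
# the slower of the two.
def finish_time_separate_groups(reduce_scatter_ms: float, all_gather_ms: float) -> float:
    return max(reduce_scatter_ms, all_gather_ms)

print(finish_time_shared_group(10.0, 6.0))     # 16.0
print(finish_time_separate_groups(10.0, 6.0))  # 10.0
```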
Background
This PR builds on #2663 (add all-gather process-group for overlapping), which implemented AG/RS overlap for regular data parallelism. That PR explicitly excluded expert parameters from the optimization (via an `if not group.is_expert_param` check). This PR removes that limitation and adds full support for expert parallelism.

Changes
Core Changes:
- Add a `_EXPERT_DATA_PARALLEL_GROUP_AG` process group in `parallel_state.py`, created in `initialize_model_parallel()` when `--create-all-gather-group` is enabled
- Extend `get_expert_data_parallel_group()` with an `independent_all_gather` parameter
- Add a `has_separate_expert_all_gather_group()` helper function

FSDP Integration:

- Extend `FSDPDistributedIndex` to store `expt_fsdp_group_ag`
- Update `get_fsdp_group()` to handle all 4 combinations of (is_expert_parallel × independent_all_gather)
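A minimal sketch of the 4-way dispatch described above, assuming the attribute and parameter names from this summary; the group objects are plain strings standing in for `torch.distributed` process groups, so this illustrates the selection logic only, not the actual implementation:

```python
from dataclasses import dataclass

@dataclass
class FSDPDistributedIndex:
    # Stand-ins for real process groups; strings are used here only to
    # make the selection logic easy to inspect. Names are illustrative.
    fsdp_group: str = "dp_group"
    fsdp_group_ag: str = "dp_group_ag"
    expt_fsdp_group: str = "expt_dp_group"
    expt_fsdp_group_ag: str = "expt_dp_group_ag"

    def get_fsdp_group(self, is_expert_parallel: bool = False,
                       independent_all_gather: bool = False) -> str:
        # Cover all 4 combinations of (is_expert_parallel x independent_all_gather):
        # expert params get their own groups, and independent_all_gather selects
        # the dedicated all-gather communicator within each family.
        if is_expert_parallel:
            return self.expt_fsdp_group_ag if independent_all_gather else self.expt_fsdp_group
        return self.fsdp_group_ag if independent_all_gather else self.fsdp_group

idx = FSDPDistributedIndex()
print(idx.get_fsdp_group(is_expert_parallel=True, independent_all_gather=True))
```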
Testing:

- Add a `test_separate_expert_all_gather_group()` unit test

Contribution process
Pre-checks
Code review
The following process is enforced via the CODEOWNERS file for changes into `megatron/core`. For changes outside of `megatron/core`, it is up to the PR author whether or not to tag the Final Reviewer team.

For MRs into `main` branch
Feel free to message or comment @mcore-oncall to help accelerate your merge into `main`. The less complex your PR is, the faster it will be approved and merged!
(Step 1): Add the `Expert Review` PR label

(Step 2): Collect the expert reviewers' reviews

Add the `Expert Review` label when your PR is ready for review. Final Review might get declined if these requirements are not fulfilled.
(Step 3): Final Review

Add the `Final Review` label.

(Optional Step 4): Cherry-pick into release branch
If this PR also needs to be merged into `core_r*` release branches, then after this PR has been merged, select `Cherry-pick` to open a new PR into the release branch.

For MRs into `dev` branch
The proposed review process for the `dev` branch is under active discussion. MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

Merging your PR
Any member of core-adlr and core-nemo will be able to merge your PR.