[Dev] feat: hybrid-cp for dev branch (part 2) #2000
base: dev
Conversation
Force-pushed from f33edcd to 48e91d2
Is there any difference between this and #2054?

This is the second MR; we need to merge #2054 first and then this one (#2000). The reason the second MR is #2000 while the first is #2054 (> 2000) is that they were migrated from GitLab at different times.

Got it, thanks! Could you please update the title to reflect this?
Force-pushed from 983e5f3 to 11d9960
/ok to test e0c90c5
Force-pushed from e0c90c5 to 501a5f6
/ok to test a9def58

/ok to test d12ccf1

/ok to test 86581cd
```python
# during pipeline parallelism, it should not be set if sequence length
# is constant during training.
args.variable_seq_lengths = False
if args.sequence_packing:
```
Please move these validations into transformer_config.
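For context, a minimal sketch of what the suggested move might look like; the field names mirror the diff above, but the standalone dataclass and the placement in `__post_init__` are assumptions, not the PR's actual change:

```python
# Hypothetical sketch only: a stand-in for megatron.core's
# TransformerConfig, validating the sequence-packing flags at
# construction time instead of in argument parsing.
from dataclasses import dataclass


@dataclass
class TransformerConfig:
    sequence_packing: bool = False
    variable_seq_lengths: bool = False

    def __post_init__(self):
        # During pipeline parallelism, variable_seq_lengths should not
        # be set if the sequence length is constant during training,
        # so only sequence packing may leave it enabled.
        if not self.sequence_packing:
            self.variable_seq_lengths = False
```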
```python
max_seqlen = torch.empty(
    1,
    dtype=torch.int32,
    device=torch.cuda.current_device(),
```
Could you please clarify why these were removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To support PP it would become more complex, so the THD-related logic was moved to a new separate function, get_batch_on_this_rank_for_sequence_packing.
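(A rough sketch of the shape such a helper could take; only the function name comes from the comment above, while the signature and body below are illustrative assumptions:)

```python
# Rough sketch: only the function name appears in the PR discussion;
# the signature and body are illustrative assumptions.
import torch


def get_batch_on_this_rank_for_sequence_packing(data: dict) -> dict:
    """Move a packed (thd-format) micro-batch to the current device and
    derive max_seqlen from the packed sequence boundaries."""
    batch = {k: v.cuda(non_blocking=True) for k, v in data.items()}
    # For the thd layout, the longest packed sequence is recovered from
    # cu_seqlens (cumulative sequence lengths).
    max_seqlen = torch.empty(
        1, dtype=torch.int32, device=torch.cuda.current_device()
    )
    if 'cu_seqlens' in batch:
        seq_lens = batch['cu_seqlens'][1:] - batch['cu_seqlens'][:-1]
        max_seqlen[0] = seq_lens.max()
    batch['max_seqlen'] = max_seqlen
    return batch
```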
The THD logic was added in the part-1 PR; this just reverts it to how it was before.
```python
# Copyright (c) 2025 NVIDIA CORPORATION. All rights reserved.

from typing import Any, List, Optional
import enum
```
Could we put this big change in a separate file?
This data_schedule.py should be the separate file you want; it was included in the part-1 PR.
@asolergi-nv FYI
Signed-off-by: xiaoyao0115 <1804647152@qq.com>
Force-pushed from 3f9564f to ffe8f94
This PR is the second part of hybrid-cp. The first part is #2054 (PR for the main branch: #2304).
Compared to part 1, this PR adds the following:
- `max_seqlen` is set to 12288, and `max_seqlen_per_dp_cp_rank` is set to 3072. In the figure below, 'bshd' refers to running with CP=4, where sequences are padded to `max_seqlen` and executed in the same bshd format as in pretraining; 'thd-packing' refers to using CP=4 while packing variable-length sequences; in 'hybrid-cp', the maximum CP group size is also 4.
- A `dataiterator_wrapper` to minimize code changes: adding a new scheduling algorithm now only requires adding a new scheduler class, which keeps the logic clear and easier to maintain.

There are many improvements that we want to make in future releases.
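To illustrate the scheduler-class design described above, here is a hypothetical sketch. The names `SequenceScheduler`, `NaivePackingScheduler`, and `DataIteratorWrapper` are illustrative (only `dataiterator_wrapper` appears in the PR description), and this is not the PR's actual API:

```python
# Illustrative sketch of the "new scheduler class" extension point;
# all class and method names here are assumptions.
from abc import ABC, abstractmethod
from typing import Any, Iterator, List


class SequenceScheduler(ABC):
    """Decides how variable-length samples are grouped into micro-batches."""

    @abstractmethod
    def schedule(self, samples: List[Any]) -> List[List[Any]]:
        ...


class NaivePackingScheduler(SequenceScheduler):
    def schedule(self, samples: List[Any]) -> List[List[Any]]:
        # Trivial strategy: one micro-batch per sample. A hybrid-cp
        # scheduler would instead balance token counts per CP group.
        return [[s] for s in samples]


class DataIteratorWrapper:
    """Wraps an existing data iterator so that adding a new scheduling
    algorithm only requires a new SequenceScheduler subclass."""

    def __init__(self, data_iterator: Iterator, scheduler: SequenceScheduler):
        self.data_iterator = data_iterator
        self.scheduler = scheduler

    def __iter__(self):
        for samples in self.data_iterator:
            yield from self.scheduler.schedule(samples)


# Usage: wrapped = DataIteratorWrapper(loader, NaivePackingScheduler())
```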
Contribution process
```mermaid
flowchart LR
    A[Pre-checks] --> B[PR Tests]
    subgraph Code Review/Approval
        C1[Expert Review] --> C2[Final Review]
    end
    B --> C1
    C2 --> D[Merge]
```

Pre-checks
Code review
The following process is enforced via the CODEOWNERS file for changes into megatron/core. For changes outside of megatron/core, it is up to the PR author whether or not to tag the Final Reviewer team.

For MRs into `main` branch
(Step 1): Add PR label Expert Review

(Step 2): Collect the expert reviewers' reviews

Add the Expert Review label when your PR is ready for review. Final Review might get declined if these requirements are not fulfilled.
(Step 3): Final Review
Add the Final Review label.

(Optional Step 4): Cherry-pick into release branch
If this PR also needs to be merged into core_r* release branches, then after this PR has been merged, select Cherry-pick to open a new PR into the release branch.

For MRs into `dev` branch
The proposed review process for the `dev` branch is under active discussion. MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

Merging your PR
Any member of core-adlr and core-nemo will be able to merge your PR.