Commit cba67e3

Merge branch 'main' into kunlunl/megatron-fsdp-fp8-params_main
2 parents 31624f7 + 7e5e16b commit cba67e3

56 files changed: +1901 -469 lines changed


.github/ISSUE_TEMPLATE/bug_report.md

Lines changed: 3 additions & 2 deletions
@@ -9,7 +9,8 @@ assignees: ''
 
 **Describe the bug**
 
-A clear and concise description of what the bug is.
+A clear and concise description of what the bug is. Tag the [@megatron-oncall](https://github.com/orgs/NVIDIA/teams/megatron-oncall)
+to get oncall's attention to this issue.
 
 **Steps/Code to reproduce bug**
 
@@ -25,4 +26,4 @@ A clear and concise description of what you expected to happen.
 
 **Additional context**
 
-Add any other context about the problem here.
+Add any other context about the problem here.

.github/ISSUE_TEMPLATE/feature_request.md

Lines changed: 3 additions & 0 deletions
@@ -10,6 +10,9 @@ assignees: ''
 **Is your feature request related to a problem? Please describe.**
 A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
 
+Tag the [@megatron-oncall](https://github.com/orgs/NVIDIA/teams/megatron-oncall)
+to get oncall's attention to this issue.
+
 **Describe the solution you'd like**
 A clear and concise description of what you want to happen.
 

.github/ISSUE_TEMPLATE/question.md

Lines changed: 2 additions & 1 deletion
@@ -9,4 +9,5 @@ assignees: ''
 ---
 
 **Your question**
-Ask a clear and concise question about Megatron-LM.
+Ask a clear and concise question about Megatron-LM. Tag the [@megatron-oncall](https://github.com/orgs/NVIDIA/teams/megatron-oncall)
+to get oncall's attention to this issue.

.github/ISSUE_TEMPLATE/regression.md

Lines changed: 2 additions & 1 deletion
@@ -8,7 +8,8 @@ assignees: ''
 ---
 
 **Describe the regression**
-A clear and concise description of what the regression is.
+A clear and concise description of what the regression is. Tag the [@megatron-oncall](https://github.com/orgs/NVIDIA/teams/megatron-oncall)
+to get oncall's attention to this issue.
 
 **To Reproduce**
 Steps to reproduce the behavior. The easier it is to reproduce the faster it will get maintainer attention.

.github/copy-pr-bot.yaml

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
 enabled: true
 auto_sync_draft: false
 auto_sync_ready: true
-trustees_override: ["AAnoosheh", "ArEsKay3", "Autumn1998", "BestJuly", "BoxiangW", "ChenhanYu", "FDecaYed", "HaochenYuan", "ISEEKYAN", "JRD971000", "Phlip79", "QiZhangNV", "ShriyaRishab", "Victarry", "Wohox", "ZhiyuLi-Nvidia", "ahmadki", "aklife97", "ananthsub", "asolergi-nv", "buptzyb", "chtruong814", "cspades", "cuichenx", "deepakn94", "dimapihtar", "duncanriach", "erhoo82", "ericharper", "fanshiqing", "frsun-nvda", "gautham-kollu", "gdengk", "guyueh1", "hxbai", "jalbericiola", "jaredcasper", "jenchen13", "jiemingz", "jkamalu", "jon-barker", "kanz-nv", "kevalmorabia97", "ko3n1g", "kunlunl", "kvareddy", "layalir", "lhb8125", "lmcafee-nvidia", "maanug-nv", "mathemakitten", "matthieule", "mehraakash", "mkhona-nvidia", "pablo-garay", "parthmannan", "pthombre", "rogerwaleffe", "sanandaraj5597", "santhnm2", "sbak5", "shanmugamr1992", "shifangx", "shjwudp", "sidsingh-nvidia", "skyw", "sudhakarsingh27", "tdene", "theothermike", "thomasdhc", "trintamaki", "tylerpoon", "wdykas", "xiaoyao0115", "xuwchen", "yanring", "yaox12", "yaoyu-33", "yashaswikarnati", "yeyu-nvidia", "yobibyte", "youngeunkwon0405", "yuzhongw-nvidia", "zhongbozhu"]
+trustees_override: ["AAnoosheh", "ArEsKay3", "Autumn1998", "BestJuly", "BoxiangW", "ChenhanYu", "FDecaYed", "HaochenYuan", "ISEEKYAN", "JRD971000", "Phlip79", "QiZhangNV", "ShriyaRishab", "Victarry", "Wohox", "ZhiyuLi-Nvidia", "ahmadki", "aklife97", "ananthsub", "asolergi-nv", "buptzyb", "chtruong814", "cspades", "cuichenx", "deepakn94", "dimapihtar", "duncanriach", "erhoo82", "ericharper", "fanshiqing", "frsun-nvda", "gautham-kollu", "gdengk", "guyueh1", "hxbai", "jalbericiola", "jaredcasper", "jenchen13", "jiemingz", "jkamalu", "jon-barker", "jstjohn", "kanz-nv", "kevalmorabia97", "ko3n1g", "kunlunl", "kvareddy", "layalir", "lhb8125", "lmcafee-nvidia", "maanug-nv", "mathemakitten", "matthieule", "mehraakash", "mkhona-nvidia", "pablo-garay", "parthmannan", "pthombre", "rogerwaleffe", "sanandaraj5597", "santhnm2", "sbak5", "shanmugamr1992", "shifangx", "shjwudp", "sidsingh-nvidia", "skyw", "sudhakarsingh27", "tdene", "theothermike", "thomasdhc", "trintamaki", "tylerpoon", "wdykas", "xiaoyao0115", "xuwchen", "yanring", "yaox12", "yaoyu-33", "yashaswikarnati", "yeyu-nvidia", "yobibyte", "youngeunkwon0405", "yuzhongw-nvidia", "zhongbozhu"]

.github/oncall_schedule.json

Lines changed: 4 additions & 4 deletions
@@ -1,8 +1,4 @@
 [
-  {
-    "user": "Phlip79",
-    "date": "2025-12-31"
-  },
   {
     "user": "Phlip79",
     "date": "2026-01-07"
@@ -46,5 +42,9 @@
   {
     "user": "ko3n1g",
     "date": "2026-03-18"
+  },
+  {
+    "user": "Phlip79",
+    "date": "2026-03-25"
   }
 ]
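
The schedule is a JSON list of `user`/`date` pairs spaced one week apart. As a rough illustration of how such a file might be consumed, here is a minimal Python sketch. It assumes each `date` marks the start of that user's rotation, which the diff does not confirm. The helper name `current_oncall` is hypothetical, and the sample data contains only the three entries visible in the hunks above.

```python
import json
from datetime import date

# Only the entries visible in the diff; the real file has more in between.
SCHEDULE_JSON = """
[
  {"user": "Phlip79", "date": "2026-01-07"},
  {"user": "ko3n1g", "date": "2026-03-18"},
  {"user": "Phlip79", "date": "2026-03-25"}
]
"""


def current_oncall(schedule, today):
    """Return the user of the latest entry dated on or before `today`.

    Assumption: `date` is the first day of a rotation. Returns None if
    no rotation has started yet.
    """
    started = [e for e in schedule if date.fromisoformat(e["date"]) <= today]
    if not started:
        return None
    latest = max(started, key=lambda e: date.fromisoformat(e["date"]))
    return latest["user"]


schedule = json.loads(SCHEDULE_JSON)
print(current_oncall(schedule, date(2026, 3, 20)))
```

If the dates instead mark the end of a rotation, the comparison would need to pick the earliest entry dated on or after `today`.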

.github/workflows/_build_test_publish_wheel.yml

Lines changed: 15 additions & 3 deletions
@@ -157,7 +157,7 @@ jobs:
 - PACKAGE: megatron-core
 PLATFORM: amd64
 - PACKAGE: megatron-fsdp
-IMAGE: quay.io/pypa/manylinux_2_28_x86_64
+PLATFORM: amd64
 env:
 PACKAGE: ${{ matrix.PACKAGE }}
 steps:
@@ -173,7 +173,19 @@ jobs:
 TWINE_USERNAME: ${{ secrets.TWINE_USERNAME }}
 TWINE_PASSWORD: ${{ secrets.TWINE_PASSWORD }}
 TWINE_REPOSITORY: ${{ (github.ref == 'refs/heads/main' || startsWith(github.ref, 'refs/heads/r')) && 'pypi' || 'testpypi' }}
+PLATFORM: ${{ matrix.PLATFORM }}
 run: |
-ls -al dist/$PACKAGE*
+
+# Delete sdist for arm64 since we already upload it with amd64.
+if [ "$PLATFORM" == "arm64" ]; then
+  rm dist/*.tar.gz
+fi
+
+ls -al dist/
 pip install twine
-twine upload -r $TWINE_REPOSITORY -u $TWINE_USERNAME -p $TWINE_PASSWORD dist/$PACKAGE*
+twine upload \
+  --verbose \
+  -r $TWINE_REPOSITORY \
+  -u $TWINE_USERNAME \
+  -p $TWINE_PASSWORD \
+  dist/*
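
The new upload step deletes the source distribution on arm64 so that it is uploaded only once, from the amd64 job, and then uploads everything left in `dist/`. The selection logic can be sketched in Python as follows. The function name `files_to_upload` and the file names are hypothetical and are not part of the workflow.

```python
def files_to_upload(dist_files, platform):
    """Mirror the shell step: on arm64, skip the sdist (*.tar.gz)."""
    if platform == "arm64":
        return [name for name in dist_files if not name.endswith(".tar.gz")]
    return list(dist_files)


# Hypothetical build outputs, for illustration only.
example_dist = [
    "example_pkg-1.0.0.tar.gz",
    "example_pkg-1.0.0-py3-none-any.whl",
]

print(files_to_upload(example_dist, "amd64"))
print(files_to_upload(example_dist, "arm64"))
```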

.github/workflows/_release_library.yml

Lines changed: 6 additions & 2 deletions
@@ -60,6 +60,7 @@ jobs:
 with:
 dry-run: true
 ref: ${{ inputs.release-ref }}
+no-publish: true
 secrets:
 TWINE_USERNAME: ${{ secrets.TWINE_USERNAME }}
 TWINE_PASSWORD: ${{ secrets.TWINE_PASSWORD }}
@@ -74,7 +75,7 @@ jobs:
 )
 && !cancelled()
 outputs:
-version: ${{ needs.bump-version-mcore.outputs.release-version }}
+release-version: ${{ needs.bump-version-mcore.outputs.release-version }}
 env:
 IS_DRY_RUN: ${{ inputs.dry-run }}
 steps:
@@ -92,6 +93,7 @@ jobs:
 SRC_DIR: ''
 PYPROJECT_NAME: 'megatron.core'
 run: |
+set +u
 cd ${{ github.run_id }}
 
 PACKAGE_INFO_FILE="$SRC_DIR${PYPROJECT_NAME//.//}/package_info.py"
@@ -101,7 +103,7 @@ jobs:
 PATCH=$(cat $PACKAGE_INFO_FILE | awk '/^PATCH = /' | awk -F"= " '{print $2}')
 PRERELEASE=$(cat $PACKAGE_INFO_FILE | awk '/^PRE_RELEASE = /' | awk -F"= " '{print $2}' | tr -d '"' | tr -d "'")
 
-echo "release-version=$MAJOR.$MINOR.$NEXT_PATCH$NEXT_PRERELEASE" | tee -a "$GITHUB_OUTPUT"
+echo "release-version=$MAJOR.$MINOR.$PATCH$PRERELEASE" | tee -a "$GITHUB_OUTPUT"
 
 if [[ "$PRERELEASE" != "" ]]; then
   if [[ "$PRERELEASE" == *rc* ]]; then
@@ -130,6 +132,8 @@ jobs:
 SRC_DIR: 'megatron/core/distributed/fsdp/src/'
 PYPROJECT_NAME: 'megatron_fsdp'
 run: |
+set +u
+
 cd ${{ github.run_id }}
 
 PACKAGE_INFO_FILE="$SRC_DIR${PYPROJECT_NAME//.//}/package_info.py"
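
The version step reads fields such as `PATCH` and `PRE_RELEASE` out of `package_info.py` with `awk`, and after this change emits `$MAJOR.$MINOR.$PATCH$PRERELEASE` as `release-version`. The same extraction can be sketched in Python as follows. This assumes `MAJOR` and `MINOR` are written in the same `NAME = value` form as `PATCH`; the sample values and the helper names `read_field` and `release_version` are hypothetical.

```python
import re

# Hypothetical package_info.py contents, for illustration only.
SAMPLE_PACKAGE_INFO = """
MAJOR = 0
MINOR = 16
PATCH = 0
PRE_RELEASE = 'rc0'
"""


def read_field(text, name):
    """Return the value after 'NAME = ' with surrounding quotes removed."""
    match = re.search(rf"^{name} = (.*)$", text, flags=re.MULTILINE)
    if match is None:
        return ""
    return match.group(1).strip().strip("\"'")


def release_version(text):
    """Build MAJOR.MINOR.PATCH followed directly by the pre-release tag."""
    major = read_field(text, "MAJOR")
    minor = read_field(text, "MINOR")
    patch = read_field(text, "PATCH")
    prerelease = read_field(text, "PRE_RELEASE")
    return f"{major}.{minor}.{patch}{prerelease}"


print(release_version(SAMPLE_PACKAGE_INFO))
```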

docs/developer/contribute.md

Lines changed: 2 additions & 2 deletions
@@ -46,11 +46,11 @@ You should receive a response within 2 business days.
 
 ### I need help, who should I ping?
 
-Use `@megatron-oncall`.
+Use [@megatron-oncall](https://github.com/orgs/NVIDIA/teams/megatron-oncall).
 
 ### If my issue or PR isn't getting attention, what should I do?
 
-After 2 business days, use `@megatron-oncall`.
+After 2 business days, tag the user [@megatron-oncall](https://github.com/orgs/NVIDIA/teams/megatron-oncall).
 
 ### Is there a policy for issues and PRs that haven't been touched in X days? Should they be closed?
 

docs/developer/oncall.md

Lines changed: 30 additions & 18 deletions
@@ -1,33 +1,45 @@
 # Oncall Overview
 
-During your oncall week, you will be assigned to all PRs marked “Ready for Review”. From a high-level, your responsibilities include:
+During your oncall week, you will be assigned to all PRs marked “Ready for
+Review”. From a high-level, your responsibilities include:
 
-- Review all new PRs
-- Uphold and enforce the Megatron coding standard
-- Accelerate the review process for expert reviewers (when necessary)
+- Review all new PRs
+- Accelerate the review process
+- Ensure issues and discussion questions are answered
 
-## Checklist
+## PR Responsibilities
 
 Below is the checklist that the oncall needs to go through for each PR.
 
-- [ ] Should the PR remain a single PR?
+- Should the PR remain a single PR?
   - Each PR should have at most 1 expert reviewer, although there will be some outlier cases
-- [ ] Label PR as “complexity: low”, “complexity: medium”, or “complexity: high” depending on complexity
-  - Low: <100 lines changed
-  - Medium: 100 < lines changed < 500
-  - High: > 500 lines changed
-- [ ] Does this PR have proper testing coverage?
+- Label PR as “complexity: low”, “complexity: medium”, or “complexity: high” depending on complexity
+  - Expert reviewers have final say, oncall just sets the initial complexity level
+  - Initial complexity level guideline
+    - Low: <100 lines changed
+    - Medium: 100 < lines changed < 500
+    - High: > 500 lines changed
+- Does this PR have proper testing coverage?
   - If new logic is added, is the new logic tested?
-- [ ] Should the PR add documentation for any new features?
-- [ ] Does the PR conform to our style guidelines?
+- Should the PR add documentation for any new features?
+- Does the PR conform to our style guidelines?
   - Code structure
   - Cleanliness
   - Comments
   - File structure
-- [ ] Do all tests pass?
+- Do all tests pass?
   - Oncall will need to kick off testing suite for external reviewers
   - Comment “/ok to test commid_id” to kick off testing suite
-- [ ] Add the “Expert Review” label
-  - Expert reviewers should review within 1 business day. Message the assigned reviewer if it is taking longer.
-  - After 2 business days, the expert reviewer waives the right to review.
-- [ ] Add the “Final Review” label after experts approve
+- Add the “Expert Review” label
+  - Select an expert reviewer from each expert group as a reviewer. If you’re unsure who to select, pick a “maintainer” or manager.
+  - **Expert reviewers should review within 1 business day.** Message the assigned reviewer if it is taking longer. The reviewer either needs to review the PR or suggest an alternate reviewer.
+  - If the reviewer is not responding after 2 business days, escalate to the reviewer's manager.
+- Add the “Final Review” label after experts approve
+  - Final reviewers should review within 1 business day. Message the assigned reviewer if it is taking longer.
+  - If the reviewer is not responding after 2 business days, escalate to the reviewer's manager.
+
+## Issues and Discussion Questions
+
+On a daily basis, check for new [issues](https://github.com/NVIDIA/Megatron-LM/issues)
+and [discussions](https://github.com/NVIDIA/Megatron-LM/discussions). If you
+do not know how to answer that's ok! Delegate the issue or discussion to someone who does.
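
The initial complexity guideline in the PR checklist (Low under 100 lines changed, Medium between 100 and 500, High over 500) can be sketched as a small helper. The guideline does not say how to label exactly 100 or 500 lines; this sketch assumes both count as medium. The function name `initial_complexity_label` is hypothetical, and expert reviewers keep the final say on the label.

```python
def initial_complexity_label(lines_changed):
    """Map a PR's changed-line count to an initial complexity label.

    Assumption: the boundary values 100 and 500, which the guideline
    leaves unspecified, are treated as medium.
    """
    if lines_changed < 0:
        raise ValueError("lines_changed must be non-negative")
    if lines_changed < 100:
        return "complexity: low"
    if lines_changed <= 500:
        return "complexity: medium"
    return "complexity: high"


for count in (42, 250, 1901):
    print(count, initial_complexity_label(count))
```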
