Skip to content

fix: Fix grad norm metric in mcore path#1426

Merged
terrykong merged 3 commits intomainfrom
yifu/fix_mcore_gradnorm_metric
Oct 27, 2025
Merged

fix: Fix grad norm metric in mcore path#1426
terrykong merged 3 commits intomainfrom
yifu/fix_mcore_gradnorm_metric

Conversation

@yfw
Copy link
Copy Markdown
Contributor

@yfw yfw commented Oct 24, 2025

What does this PR do ?

Previously, the grad_norm metric in mcore was added as a microbatch metric and then summed during reporting, meaning the value we were reporting was being multiplied by the number of microbatches. This PR fixes this issue by removing grad_norm from the microbatch metrics and is more consistent with how we report grad norm in the dtensor path.

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
@yfw yfw requested a review from a team as a code owner October 24, 2025 19:59
@yfw yfw requested review from ashors1, gshennvm and terrykong October 24, 2025 19:59
Copy link
Copy Markdown
Collaborator

@terrykong terrykong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any chance to add a unit test for this?

would it be sufficient to test case when using micro_bs=1 mini_bs=2 and micro_bs=2 mini_bs=2 and check grad norm equal?

Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
@yfw
Copy link
Copy Markdown
Contributor Author

yfw commented Oct 25, 2025

any chance to add a unit test for this?

would it be sufficient to test case when using micro_bs=1 mini_bs=2 and micro_bs=2 mini_bs=2 and check grad norm equal?

Added in c951fb9

@yfw yfw requested a review from a team as a code owner October 25, 2025 00:46
@yfw yfw added the CI:L1 Run doctests, unit tests, and functional tests label Oct 25, 2025
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
@yfw yfw added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Oct 26, 2025
@terrykong terrykong enabled auto-merge (squash) October 27, 2025 16:36
@terrykong terrykong merged commit 9475e7b into main Oct 27, 2025
41 of 42 checks passed
@terrykong terrykong deleted the yifu/fix_mcore_gradnorm_metric branch October 27, 2025 16:36
chtruong814 pushed a commit that referenced this pull request Oct 27, 2025
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
lbliii pushed a commit that referenced this pull request Nov 3, 2025
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
PrinsYin pushed a commit to PrinsYin/RL that referenced this pull request Nov 30, 2025
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
yuanhangsu1986 pushed a commit to yuanhangsu1986/RL-Nemontron-Edge-Omni that referenced this pull request Feb 21, 2026
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CI:L1 Run doctests, unit tests, and functional tests r0.4.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants