
Conversation

@chtruong814 (Contributor) commented on Jan 5, 2026

beep boop [🤖]: Hi @hemildesai 👋,

we've cherry-picked #1708 into r0.5.0 for you! 🚀

Please review and approve this cherry-pick at your convenience!

Summary by CodeRabbit

  • Tests
    • Updated the gradient norm validation threshold in a training test, replacing a single upper-bound check with a bounded range that better reflects expected training behavior.


@coderabbitai (coderabbitai bot) commented on Jan 5, 2026

📝 Walkthrough

A test script assertion for the gradient norm at step 50 is updated to check a bounded range (10.0–17.5) instead of the previous single upper bound (< 2.5), constraining the metric to an expected window rather than enforcing only a strict maximum.

Changes

Cohort: Test Assertion Update
File(s): tests/test_suites/llm/sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel.sh
Summary: Modifies the gradient norm validation at training step 50 from a single upper-bound check (< 2.5) to a bounded range check (> 10.0 and < 17.5), establishing both lower and upper bounds for the metric.
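The exact form of the new assertion is not shown in this thread. As a rough sketch, a bounded-range check of this kind in a bash test script could look like the following; the variable name and placeholder value are hypothetical, and the real script's metric-extraction step is omitted.

```sh
# Hypothetical sketch only: how the grad-norm value is extracted from the run's
# metrics is not shown in this PR thread, so a placeholder value is used here.
grad_norm="12.3"   # stands in for the measured gradient norm at step 50

# New check: require 10.0 < grad_norm < 17.5 instead of only grad_norm < 2.5.
if awk -v g="${grad_norm}" 'BEGIN { exit !(g > 10.0 && g < 17.5) }'; then
  echo "grad norm at step 50 within expected range: ${grad_norm}"
else
  echo "grad norm at step 50 out of range (expected 10.0-17.5): ${grad_norm}" >&2
  exit 1
fi
```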

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Possibly related PRs

Suggested labels

CI:L0, r0.5.0

Suggested reviewers

  • hemildesai
  • terrykong

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

Test Results For Major Changes (⚠️ Warning)
Explanation: The PR modifies the gradient norm test threshold from < 2.5 to 10.0 < x < 17.5 without explaining the rationale, providing test results, or showing evidence of no regression.
Resolution: Update the PR description to include: (1) the specific issue prompting the threshold change, (2) test results from training runs, (3) an explanation of why the original threshold was insufficient, and (4) context about test flakiness or updated expected behavior.

✅ Passed checks (3 passed)

Description Check (✅ Passed): Check skipped; CodeRabbit's high-level summary is enabled.
Title check (✅ Passed): The title accurately describes the main change: fixing gradient norm checks for the automodel gpt-oss nightly tests. It is clear, specific, and directly related to the changeset.
Docstring Coverage (✅ Passed): No functions found in the changed files to evaluate; skipping the docstring coverage check.

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0fef58c and f443763.

📒 Files selected for processing (1)
  • tests/test_suites/llm/sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel.sh
🧰 Additional context used
📓 Path-based instructions (4)
**/*.sh

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.sh: Use uv run instead of python to execute scripts
Follow the Google Shell Style Guide for shell scripts

Files:

  • tests/test_suites/llm/sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel.sh
tests/test_suites/**/*.sh

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

tests/test_suites/**/*.sh: When adding support for a new model, create a corresponding driver shell script under tests/test_suites/ in the matching domain
Driver shell scripts should match the YAML base name with .sh extension and invoke training entrypoint with uv run

Files:

  • tests/test_suites/llm/sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel.sh
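As an illustration of the driver-script convention these guidelines describe, a minimal skeleton might look like the sketch below; the entrypoint, config path, and overrides are assumptions for illustration, not the contents of the actual gpt-oss script.

```sh
#!/bin/bash
# Hypothetical driver-script skeleton; entrypoint, config, and overrides are
# illustrative assumptions, not the contents of the real test script.
set -euo pipefail

SCRIPT_DIR=$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &>/dev/null && pwd)
PROJECT_ROOT=$(realpath "${SCRIPT_DIR}/../../..")
cd "${PROJECT_ROOT}"

# Guideline: invoke the training entrypoint with `uv run`, not bare `python`.
uv run examples/run_sft.py \
  --config examples/configs/sft_gpt_oss_20b.yaml \
  cluster.num_nodes=1 \
  cluster.gpus_per_node=8
```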
!(**/tests/**|**/test_*.py|**/test_*.sh)

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Add the NVIDIA copyright header to all Python files and shell scripts (excluding tests). The header should include the current year

Files:

  • tests/test_suites/llm/sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel.sh
**/*.{py,sh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

The NVIDIA copyright header should appear at the top of all Python files and shell scripts (excluding tests)

Files:

  • tests/test_suites/llm/sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel.sh
🧠 Learnings (1)
📚 Learning: 2025-10-12T14:46:57.171Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:6-11
Timestamp: 2025-10-12T14:46:57.171Z
Learning: Test scripts in tests/test_suites/llm/ follow a standard configuration pattern that includes NUM_NODES, STEPS_PER_RUN, MAX_STEPS, NUM_RUNS (calculated as `$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))`), and NUM_MINUTES. These variables are part of the test infrastructure's standard interface and should not be flagged as unused even if not directly referenced within the individual script, as they are consumed by external launch tooling or common.env.

Applied to files:

  • tests/test_suites/llm/sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel.sh
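For concreteness, a minimal sketch of the standard configuration block described in this learning, with placeholder values, is:

```sh
# Standard test-infrastructure variables; consumed by external launch tooling
# or common.env even when not referenced directly in the script itself.
NUM_NODES=1        # placeholder
STEPS_PER_RUN=50   # placeholder
MAX_STEPS=100      # placeholder
# Ceiling division: number of back-to-back runs needed to cover MAX_STEPS.
NUM_RUNS=$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))   # -> 2
NUM_MINUTES=60     # placeholder
```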
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Lint check
  • GitHub Check: Lint check
  • GitHub Check: Lint check
  • GitHub Check: Post automodel integration comment / Comment on PR
  • GitHub Check: Post submodule check comment / Comment on PR
🔇 Additional comments (1)
tests/test_suites/llm/sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel.sh (1)

39-40: The review comment is based on unverified assumptions about the previous state of this assertion.

The review claims the gradient norm bound changed from < 2.5 to < 17.5, but the previous version of this file is unavailable in git history. The commit message explicitly labels this as a "fix," not a relaxation or adjustment to accommodate observed behavior. Additionally, gradient norm ranges vary significantly across models—other test files show ranges from [0.1, 0.5] to [10.0, 17.5] depending on the model and configuration, which is expected and not indicative of issues.

Without verification that the previous threshold was indeed < 2.5, and given that the commit is explicitly a "fix," the characterization that this change masks training instability is not supported by available evidence.

Likely an incorrect or invalid review comment.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@terrykong added the CI:docs (Run doctest) label on Jan 5, 2026
@terrykong enabled auto-merge (squash) on January 5, 2026, 07:22
@terrykong merged commit 6e7a2f7 into r0.5.0 on Jan 5, 2026
66 of 69 checks passed
@terrykong deleted the cherry-pick-1708-r0.5.0 branch on January 5, 2026, 08:33
