Skip to content

fix: qwen32 nightly metric check more stable#1271

Merged
terrykong merged 3 commits intomainfrom
tk/qwen32-fix
Oct 8, 2025
Merged

fix: qwen32 nightly metric check more stable#1271
terrykong merged 3 commits intomainfrom
tk/qwen32-fix

Conversation

@terrykong
Copy link
Copy Markdown
Collaborator

@terrykong terrykong commented Oct 4, 2025

Similar to #1221

The root cause was the num_workers that changed the determinism slightly b/c of seed inheritance between the main process and the subprocess (shuffling was on for this test).

here's the current results:

                                                                    Metric Checks                                                                     
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Status     ┃ Check                                              ┃ Value                     ┃ Message     ┃ Range Details                          ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ PASS       │ mean(data["train/loss"], 16) < 0.31                │ 0.299751216173172         │             │ {                                      │
│            │                                                    │                           │             │   "16": 0.29737550020217896,           │
│            │                                                    │                           │             │   "17": 0.28047552704811096,           │
│            │                                                    │                           │             │   "18": 0.33757448196411133,           │
│            │                                                    │                           │             │   "19": 0.27592864632606506,           │
│            │                                                    │                           │             │   "20": 0.3074019253253937             │
│            │                                                    │                           │             │ }                                      │
└────────────┴────────────────────────────────────────────────────┴───────────────────────────┴─────────────┴────────────────────────────────────────┘

Summary by CodeRabbit

  • Tests
    • Updated loss validation in the LLM training test: replaced a single-step loss threshold with a moving average across recent steps to determine pass/fail.
    • Maintains all other existing test checks unchanged.
    • This adjustment enhances consistency of test outcomes across runs, reducing sensitivity to transient fluctuations.

Signed-off-by: Terry Kong <terryk@nvidia.com>
@terrykong terrykong requested a review from a team as a code owner October 4, 2025 05:44
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Oct 4, 2025

📝 Walkthrough

Walkthrough

Updated a test script’s loss validation: replaced a single-step loss threshold check at step 20 with a moving average check over the last 16 loss values against a 0.31 threshold. No other test logic or public interfaces changed.

Changes

Cohort / File(s) Summary
LLM SFT test script
tests/test_suites/llm/sft-qwen2.5-32b-4n8g-fsdp2tp8sp-actckpt.v3.sh
Changed metric assertion from single-step loss (<0.3 at step 20) to moving average of last 16 losses (<0.31). Other checks unchanged.

Sequence Diagram(s)

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested labels

CI:L0

Suggested reviewers

  • ashors1
  • chtruong814

Pre-merge checks and finishing touches

✅ Passed checks (4 passed)
Check name Status Explanation
Title Check ✅ Passed The title succinctly describes the primary change of stabilizing the qwen32 nightly metric check by making the test more stable, which directly reflects the modifications made in the script.
Docstring Coverage ✅ Passed No functions found in the changes. Docstring coverage check skipped.
Test Results For Major Changes ✅ Passed The PR only adjusts a stability threshold in an existing test script, which is a minor change, and the PR description already documents the resulting metric values (overall mean and per-seed ranges) to show the rule still passes, satisfying the requirement for test evidence.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch tk/qwen32-fix

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d0e203c and 1a2b701.

📒 Files selected for processing (1)
  • tests/test_suites/llm/sft-qwen2.5-32b-4n8g-fsdp2tp8sp-actckpt.v3.sh (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.sh

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.sh: Follow the Google Shell Style Guide for all shell scripts
Use uv run to execute Python scripts in shell/driver scripts instead of activating virtualenvs and calling python directly
Add the NVIDIA copyright header (with current year) at the top of all shell scripts, excluding tests/ and test-only scripts

Files:

  • tests/test_suites/llm/sft-qwen2.5-32b-4n8g-fsdp2tp8sp-actckpt.v3.sh
tests/test_suites/llm/*.sh

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

LLM driver script filenames must mirror the YAML base name and follow the same pattern with .sh extension

Files:

  • tests/test_suites/llm/sft-qwen2.5-32b-4n8g-fsdp2tp8sp-actckpt.v3.sh
tests/test_suites/**

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Place driver shell scripts and common.env under tests/test_suites// and list nightly tests in tests/test_suites/nightly.txt

Files:

  • tests/test_suites/llm/sft-qwen2.5-32b-4n8g-fsdp2tp8sp-actckpt.v3.sh
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Lint check
  • GitHub Check: Post submodule check comment / Comment on PR
  • GitHub Check: Post automodel integration comment / Comment on PR

@terrykong terrykong enabled auto-merge (squash) October 5, 2025 02:27
…v3.sh

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Signed-off-by: Terry Kong <terrycurtiskong@gmail.com>
@terrykong terrykong added CI:L0 Run doctests and unit tests and removed CI:L0 Run doctests and unit tests labels Oct 7, 2025
@terrykong terrykong added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L0 Run doctests and unit tests labels Oct 8, 2025
@terrykong terrykong merged commit 9a1e3df into main Oct 8, 2025
40 of 41 checks passed
@terrykong terrykong deleted the tk/qwen32-fix branch October 8, 2025 10:26
chtruong814 pushed a commit that referenced this pull request Oct 8, 2025
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
PrinsYin pushed a commit to PrinsYin/RL that referenced this pull request Nov 30, 2025
Signed-off-by: Terry Kong <terryk@nvidia.com>
yuanhangsu1986 pushed a commit to yuanhangsu1986/RL-Nemontron-Edge-Omni that referenced this pull request Feb 21, 2026
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CI:L1 Run doctests, unit tests, and functional tests r0.4.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants