fix: qwen32 nightly metric check more stable #1271
Signed-off-by: Terry Kong <terryk@nvidia.com>
📝 Walkthrough
Updated a test script’s loss validation: replaced a single-step loss threshold check at step 20 with a moving average check over the last 16 loss values against a 0.31 threshold. No other test logic or public interfaces changed.
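The difference between the two checks can be sketched as follows. This is a minimal illustration, not the script's actual code: the function name `moving_average_ok`, the sample loss values, the `<` comparison direction, and reusing 0.31 for the single-step check are all assumptions made for the example.

```python
from statistics import fmean


def moving_average_ok(losses, window=16, threshold=0.31):
    """Return True if the mean of the last `window` losses is below `threshold`.

    Hypothetical helper: the window of 16 and the 0.31 threshold come from the
    walkthrough; the name and the `<` direction are assumptions.
    """
    if len(losses) < window:
        raise ValueError(f"need at least {window} loss values, got {len(losses)}")
    return fmean(losses[-window:]) < threshold


# Made-up losses for 20 steps: steady at 0.30 with one noisy final step.
losses = [0.30] * 19 + [0.32]

# A single-step check on the last value fails because of the one noisy step.
single_step_ok = losses[-1] < 0.31

# The moving average over the last 16 values absorbs that noise.
window_ok = moving_average_ok(losses)
```

Averaging over a window trades sensitivity to a single bad step for robustness to small run-to-run variation, which is the stability the PR title refers to.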
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Pre-merge checks and finishing touches
✅ Passed checks (4 passed)
Actionable comments posted: 1
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tests/test_suites/llm/sft-qwen2.5-32b-4n8g-fsdp2tp8sp-actckpt.v3.sh (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.sh
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.sh: Follow the Google Shell Style Guide for all shell scripts
Use uv run to execute Python scripts in shell/driver scripts instead of activating virtualenvs and calling python directly
Add the NVIDIA copyright header (with current year) at the top of all shell scripts, excluding tests/ and test-only scripts
Files:
tests/test_suites/llm/sft-qwen2.5-32b-4n8g-fsdp2tp8sp-actckpt.v3.sh
tests/test_suites/llm/*.sh
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
LLM driver script filenames must mirror the YAML base name and follow the same pattern with .sh extension
Files:
tests/test_suites/llm/sft-qwen2.5-32b-4n8g-fsdp2tp8sp-actckpt.v3.sh
tests/test_suites/**
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Place driver shell scripts and common.env under tests/test_suites// and list nightly tests in tests/test_suites/nightly.txt
Files:
tests/test_suites/llm/sft-qwen2.5-32b-4n8g-fsdp2tp8sp-actckpt.v3.sh
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Lint check
- GitHub Check: Post submodule check comment / Comment on PR
- GitHub Check: Post automodel integration comment / Comment on PR
tests/test_suites/llm/sft-qwen2.5-32b-4n8g-fsdp2tp8sp-actckpt.v3.sh
…v3.sh Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Signed-off-by: Terry Kong <terrycurtiskong@gmail.com>
Signed-off-by: Terry Kong <terryk@nvidia.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com> Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
Similar to #1221
The root cause was the num_workers setting, which slightly changed determinism because of seed inheritance between the main process and the subprocess (shuffling was enabled for this test).
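A toy model of this effect, using only the standard library: it assumes the main process draws one seed per worker from its own RNG before shuffling, so the number of workers shifts the RNG state the shuffle sees. This is an illustration only; `epoch_order` is a hypothetical function and does not reproduce the actual data loader's seeding logic.

```python
import random


def epoch_order(seed, num_workers, n=16):
    """Toy model of seed inheritance (assumed mechanism, not the real loader).

    The main-process RNG hands one seed to each worker, then shuffles the
    sample indices. More workers means more draws before the shuffle.
    """
    rng = random.Random(seed)
    worker_seeds = [rng.getrandbits(32) for _ in range(num_workers)]
    order = list(range(n))
    rng.shuffle(order)
    return order, worker_seeds


# Same seed and same worker count: the sample order is reproducible.
order_a, _ = epoch_order(seed=42, num_workers=1)
order_b, _ = epoch_order(seed=42, num_workers=1)

# Same seed, different worker count: the shuffle starts from a different
# RNG state, so the order will typically differ even though the seed did not.
order_c, _ = epoch_order(seed=42, num_workers=4)
```

Under this model a change in num_workers perturbs the sample order without touching the seed, which is enough to move a loss value at a single step slightly.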
Here are the current results: