@njzjz njzjz commented Jan 7, 2026

Fix #4985.

This implementation is much simpler than #4986.

Summary by CodeRabbit

  • Bug Fixes
    • Improved training-metric validation to detect NaN total RMSE, logging a clear error and halting runs to avoid silent failures.
  • Documentation
    • Added documentation for the new option that controls NaN checking so users can enable or disable the validation as needed.


Signed-off-by: Jinzhe Zeng <[email protected]>
coderabbitai bot commented Jan 7, 2026

Walkthrough

Added an optional NaN check to format_training_message_per_task that logs and raises a RuntimeError when the total RMSE is NaN; also initialized a module-level logger and imported math.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Training Logger NaN Validation: deepmd/loggers/training.py | Added import logging and import math plus log = logging.getLogger(__name__). Extended format_training_message_per_task signature with check_total_rmse_nan: bool = True and updated its docstring. Builds the message locally and, if check_total_rmse_nan is True and rmse["rmse"] is NaN, logs an error and raises RuntimeError. |
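
The behavior summarized above can be sketched roughly as follows. This is a simplified stand-in, not the merged code: the function name `format_task_message`, the message layout, and the error text are assumptions made for illustration. Only the `check_total_rmse_nan` parameter, the module-level `log`, and the log-then-raise behavior come from the walkthrough.

```python
import logging
import math
from typing import Optional

log = logging.getLogger(__name__)


def format_task_message(
    batch: int,
    task_name: str,
    rmse: dict[str, float],
    learning_rate: Optional[float],
    check_total_rmse_nan: bool = True,
) -> str:
    """Hypothetical stand-in for format_training_message_per_task."""
    prefix = f"{task_name}: " if task_name else ""
    lr = "" if learning_rate is None else f"   lr = {learning_rate:8.2e}"
    metrics = ", ".join(f"{k} = {v:8.2e}" for k, v in sorted(rmse.items()))
    # Build the message first so it can be logged if the check fails.
    msg = f"batch {batch:7d}: {prefix}{metrics}{lr}"
    if check_total_rmse_nan and "rmse" in rmse and math.isnan(rmse["rmse"]):
        log.error(msg)
        # Illustrative wording; the real error message may differ.
        raise RuntimeError("Total RMSE is NaN; stopping training.")
    return msg
```

With the check enabled (the default), a NaN total RMSE raises; passing `check_total_rmse_nan=False` returns the formatted message unchanged, NaN included.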

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Linked Issues check | ❓ Inconclusive | The PR implements NaN detection with a new parameter in format_training_message_per_task that raises RuntimeError when NaN is detected, addressing the core requirement from issue #4985 to detect and stop on NaN loss. | Clarify implementation scope: verify whether backend-specific integration (TensorFlow, PyTorch, PaddlePaddle) and checkpoint prevention are handled elsewhere, as only logging function changes are present. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'feat: add NaN detection during training' accurately and concisely summarizes the main change: adding NaN detection capability to the training logging functionality. |
| Out of Scope Changes check | ✅ Passed | All changes are focused on NaN detection in the logging module, directly addressing the feature request without introducing unrelated modifications. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |


📜 Recent review details

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c4dc138 and db5e195.

📒 Files selected for processing (1)
  • deepmd/loggers/training.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • deepmd/loggers/training.py



Copilot AI left a comment


Pull request overview

This PR adds NaN (Not a Number) detection during model training to prevent wasting time training models that have already diverged. When the total RMSE becomes NaN, training is immediately stopped with a descriptive error message. The implementation adds a new parameter check_total_rmse_nan to the format_training_message_per_task function with a default value of True.

  • Adds NaN detection logic to check the "rmse" key in the RMSE dictionary
  • Logs an error message and raises RuntimeError when NaN is detected
  • Adds comprehensive docstring to document the new functionality
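
As a brief aside on why a dedicated NaN test is needed at all: the snippet below is a generic Python illustration, not code from the PR. It shows that an equality comparison cannot detect NaN, whereas `math.isnan` (the module the PR imports) can.

```python
import math

nan = float("nan")

# NaN compares unequal to everything, including itself,
# so `value == float("nan")` can never detect it.
equal_to_itself = nan == nan

# math.isnan is the reliable check.
detected_by_isnan = math.isnan(nan)

# NaN also propagates through arithmetic, which is why a diverged
# loss keeps producing NaN metrics on every later step.
propagated = math.isnan(nan + 1.0)

print(equal_to_itself, detected_by_isnan, propagated)
```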


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In @deepmd/loggers/training.py:
- Around line 56-62: The current NaN guard uses rmse.get("rmse", 0.0) which
silently defaults and skips detection when the "rmse" key is absent; change the
condition in the check_total_rmse_nan branch to explicitly test for the key and
then check NaN (e.g., only evaluate math.isnan on rmse["rmse"] when "rmse" in
rmse), i.e., replace the get(...) usage with an explicit membership check of the
rmse dict before calling math.isnan, keeping the existing log.error and
RuntimeError behavior for true NaN.
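
The two guard styles discussed in this comment can be compared in a small sketch. The helper names are hypothetical. Note that both forms evaluate to False when the "rmse" key is absent; what the explicit membership test adds is that the missing-key case becomes visible in the code and can be handled separately if desired.

```python
import math


def guard_with_default(rmse: dict[str, float]) -> bool:
    # A missing key is replaced by 0.0, so it looks like a healthy value.
    return math.isnan(rmse.get("rmse", 0.0))


def total_rmse_status(rmse: dict[str, float]) -> str:
    # Explicit membership test: a missing key is reported as such.
    if "rmse" not in rmse:
        return "missing"
    return "nan" if math.isnan(rmse["rmse"]) else "not_nan"


for sample in ({}, {"rmse": 0.1}, {"rmse": float("nan")}):
    print(sample, guard_with_default(sample), total_rmse_status(sample))
```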
🧹 Nitpick comments (1)
deepmd/loggers/training.py (1)

28-42: Fix grammar in docstring.

Line 41 has a grammatical error. It should read "Whether to throw an error" instead of "Whether throw the error".

📝 Proposed fix
-    check_total_rmse_nan : bool
-        Whether throw the error if the total RMSE is NaN
+    check_total_rmse_nan : bool
+        Whether to throw an error if the total RMSE is NaN
📜 Review details

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cc85a6d and c4dc138.

📒 Files selected for processing (1)
  • deepmd/loggers/training.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-12-12T13:40:14.334Z
Learnt from: CR
Repo: deepmodeling/deepmd-kit PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-12T13:40:14.334Z
Learning: Verify PyTorch training output shows decreasing loss with 'batch X: trn: rmse' messages

Applied to files:

  • deepmd/loggers/training.py
🧬 Code graph analysis (1)
deepmd/loggers/training.py (3)
deepmd/pd/utils/stat.py (1)
  • rmse (536-537)
deepmd/pt/utils/stat.py (1)
  • rmse (536-537)
deepmd/utils/pair_tab.py (1)
  • get (207-209)
🔇 Additional comments (2)
deepmd/loggers/training.py (2)

3-6: LGTM!

The import additions and logger initialization follow Python best practices. Using logging.getLogger(__name__) ensures proper logger hierarchy.


51-63: Good implementation approach with clear error handling.

The implementation correctly:

  • Constructs the message with all RMSE values for visibility
  • Logs the message before raising the error so users can see the problematic values
  • Raises a descriptive RuntimeError that stops training
  • Defaults to enabled (safer choice)

The sequence of operations is appropriate: format the message, check for NaN, log if found, then raise. This provides good visibility into what went wrong.

Co-authored-by: Copilot <[email protected]>
Signed-off-by: Jinzhe Zeng <[email protected]>

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.


codecov bot commented Jan 7, 2026

Codecov Report

❌ Patch coverage is 66.66667% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.14%. Comparing base (cc85a6d) to head (db5e195).
⚠️ Report is 5 commits behind head on master.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| deepmd/loggers/training.py | 66.66% | 3 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5135      +/-   ##
==========================================
- Coverage   82.15%   82.14%   -0.01%     
==========================================
  Files         709      709              
  Lines       72470    72478       +8     
  Branches     3616     3615       -1     
==========================================
+ Hits        59535    59540       +5     
- Misses      11771    11775       +4     
+ Partials     1164     1163       -1     
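
The report does not say which three lines are uncovered. If they are the NaN branch (an assumption, not something the report confirms), a test along these lines could exercise it. The function `check_total_rmse` and the logger name `training_sketch` are hypothetical stand-ins so that the sketch runs without DeePMD-kit installed.

```python
import logging
import math
import unittest

log = logging.getLogger("training_sketch")


def check_total_rmse(rmse: dict[str, float], message: str) -> None:
    """Hypothetical stand-in for the NaN guard described in this PR."""
    if "rmse" in rmse and math.isnan(rmse["rmse"]):
        log.error(message)
        raise RuntimeError("Total RMSE is NaN; stopping training.")


class TestNanBranch(unittest.TestCase):
    def test_nan_logs_then_raises(self):
        # assertLogs captures records emitted on the named logger.
        with self.assertLogs("training_sketch", level="ERROR") as captured:
            with self.assertRaises(RuntimeError):
                check_total_rmse({"rmse": float("nan")}, "batch 1: rmse = nan")
        self.assertIn("batch 1: rmse = nan", captured.output[0])

    def test_finite_value_passes(self):
        # No exception is expected for an ordinary value.
        check_total_rmse({"rmse": 0.1}, "batch 1: rmse = 1.00e-01")
```

Saved to a file, the test case can be run with `python -m unittest`.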


@iProzd iProzd enabled auto-merge January 9, 2026 08:19
@iProzd iProzd added this pull request to the merge queue Jan 9, 2026
Merged via the queue into deepmodeling:master with commit 6012b4d Jan 9, 2026
70 checks passed
Development

Successfully merging this pull request may close these issues.

[Feature Request] throw an error when the total loss is NaN

3 participants