feat: add NaN detection during training #5135

njzjz · 2026-01-07T17:28:41Z

This implementation is much simpler than #4986.

Summary by CodeRabbit

Bug Fixes
- Improved training-metric validation to detect NaN total RMSE, logging a clear error and halting runs to avoid silent failures.
Documentation
- Added documentation for the new option that controls NaN checking so users can enable or disable the validation as needed.

Fix deepmodeling#4985. This implementation is much simpler than deepmodeling#4986. Signed-off-by: Jinzhe Zeng <[email protected]>

for more information, see https://pre-commit.ci

coderabbitai · 2026-01-07T17:30:30Z

📝 Walkthrough

Added an optional NaN check to format_training_message_per_task that logs and raises a RuntimeError when the total RMSE is NaN; also initialized a module-level logger and imported math.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Training Logger NaN Validation `deepmd/loggers/training.py` | Added `import logging` and `import math` plus `log = logging.getLogger(__name__)`. Extended `format_training_message_per_task` signature with `check_total_rmse_nan: bool = True` and updated its docstring. Builds the message locally and, if `check_total_rmse_nan` is True and `rmse["rmse"]` is NaN, logs an error and raises `RuntimeError`. |
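The change summarized above can be sketched roughly as follows. This is a minimal illustration, not the code merged in the PR: the function name `format_message_sketch`, its reduced parameter list, the message layout, and the error text are all assumptions made for the example. Only the overall flow (build the message, check the total RMSE, log, raise) comes from the summary.

```python
import logging
import math

log = logging.getLogger(__name__)


def format_message_sketch(batch, task_name, rmse, check_total_rmse_nan=True):
    """Hypothetical stand-in for format_training_message_per_task."""
    # Build the full message first so every RMSE term is available for logging.
    terms = ", ".join(
        f"{key} = {value:8.2e}" for key, value in sorted(rmse.items())
    )
    prefix = f"{task_name}: " if task_name else ""
    msg = f"batch {batch:7d}: {prefix}{terms}"

    # Only the total RMSE (the "rmse" key) is checked, and only when enabled.
    if check_total_rmse_nan and "rmse" in rmse and math.isnan(rmse["rmse"]):
        log.error(msg)
        # Assumed error text; the real message may differ.
        raise RuntimeError("NaN detected in total RMSE; stopping training.")
    return msg
```

With the flag left at its default, a NaN total RMSE raises `RuntimeError`; passing `check_total_rmse_nan=False` returns the formatted message unchanged.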

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Linked Issues check | ❓ Inconclusive | The PR implements NaN detection with a new parameter in format_training_message_per_task that raises RuntimeError when NaN is detected, addressing the core requirement from issue #4985 to detect and stop on NaN loss. | Clarify implementation scope: verify whether backend-specific integration (TensorFlow, PyTorch, PaddlePaddle) and checkpoint prevention are handled elsewhere, as only logging function changes are present. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'feat: add NaN detection during training' accurately and concisely summarizes the main change: adding NaN detection capability to the training logging functionality. |
| Out of Scope Changes check | ✅ Passed | All changes are focused on NaN detection in the logging module, directly addressing the feature request without introducing unrelated modifications. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |

📜 Recent review details

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c4dc138 and db5e195.

📒 Files selected for processing (1)

deepmd/loggers/training.py

🚧 Files skipped from review as they are similar to previous changes (1)

deepmd/loggers/training.py

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (40)

GitHub Check: Test Python (12, 3.13)
GitHub Check: Test Python (12, 3.10)
GitHub Check: Test Python (10, 3.13)
GitHub Check: Test Python (7, 3.13)
GitHub Check: Test Python (3, 3.10)
GitHub Check: Test Python (11, 3.13)
GitHub Check: Test Python (11, 3.10)
GitHub Check: Test Python (7, 3.10)
GitHub Check: Test Python (9, 3.13)
GitHub Check: Test Python (2, 3.10)
GitHub Check: Test Python (8, 3.10)
GitHub Check: Test Python (9, 3.10)
GitHub Check: Test Python (5, 3.10)
GitHub Check: Test Python (1, 3.10)
GitHub Check: Test Python (4, 3.13)
GitHub Check: Test Python (6, 3.13)
GitHub Check: Test Python (10, 3.10)
GitHub Check: Test Python (1, 3.13)
GitHub Check: Test Python (3, 3.13)
GitHub Check: Test Python (4, 3.10)
GitHub Check: Test Python (5, 3.13)
GitHub Check: Test Python (2, 3.13)
GitHub Check: Test Python (6, 3.10)
GitHub Check: Test Python (8, 3.13)
GitHub Check: Test C++ (true, true, true, false)
GitHub Check: Test C++ (false, true, true, false)
GitHub Check: Test C++ (false, false, false, true)
GitHub Check: Test C++ (true, false, false, true)
GitHub Check: Analyze (python)
GitHub Check: Analyze (c-cpp)
GitHub Check: Build wheels for cp311-win_amd64
GitHub Check: Build wheels for cp311-macosx_x86_64
GitHub Check: Build wheels for cp310-manylinux_aarch64
GitHub Check: Build wheels for cp311-macosx_arm64
GitHub Check: Build wheels for cp311-manylinux_x86_64
GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)
GitHub Check: Build C++ (cpu, cpu)
GitHub Check: Build C++ (cuda120, cuda)
GitHub Check: Build C++ (rocm, rocm)
GitHub Check: Build C++ (clang, clang)

Copilot

Pull request overview

This PR adds NaN (Not a Number) detection during model training to prevent wasting time training models that have already diverged. When the total RMSE becomes NaN, training is immediately stopped with a descriptive error message. The implementation adds a new parameter check_total_rmse_nan to the format_training_message_per_task function with a default value of True.

- Adds NaN detection logic to check the "rmse" key in the RMSE dictionary
- Logs an error message and raises RuntimeError when NaN is detected
- Adds comprehensive docstring to document the new functionality

deepmd/loggers/training.py

coderabbitai

Actionable comments posted: 1

🤖 Fix all issues with AI agents

In @deepmd/loggers/training.py:
- Around line 56-62: The current NaN guard uses rmse.get("rmse", 0.0) which
silently defaults and skips detection when the "rmse" key is absent; change the
condition in the check_total_rmse_nan branch to explicitly test for the key and
then check NaN (e.g., only evaluate math.isnan on rmse["rmse"] when "rmse" in
rmse), i.e., replace the get(...) usage with an explicit membership check of the
rmse dict before calling math.isnan, keeping the existing log.error and
RuntimeError behavior for true NaN.
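The two guard styles discussed in this comment can be compared side by side. The helper names `guard_with_get` and `guard_with_membership` are made up for this sketch. Under these assumptions both guards return the same answer in all three cases, including the missing-key case, so the suggested change makes the intent explicit rather than altering when the check fires.

```python
import math


def guard_with_get(rmse):
    # A missing "rmse" key falls back to 0.0, which is never NaN.
    return math.isnan(rmse.get("rmse", 0.0))


def guard_with_membership(rmse):
    # A missing "rmse" key short-circuits before math.isnan is called.
    return "rmse" in rmse and math.isnan(rmse["rmse"])


cases = {
    "finite": {"rmse": 0.12},
    "nan": {"rmse": float("nan")},
    "missing": {"rmse_e": 0.05},
}

# Each entry is (result of the get-based guard, result of the membership guard).
results = {
    name: (guard_with_get(rmse), guard_with_membership(rmse))
    for name, rmse in cases.items()
}
```

If a missing total RMSE should itself be treated as an error, neither guard covers that; it would need a separate, explicit check.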

🧹 Nitpick comments (1)

deepmd/loggers/training.py (1)
28-42: Fix grammar in docstring.

Line 41 has a grammatical error. It should read "Whether to throw an error" instead of "Whether throw the error".
📝 Proposed fix
-    check_total_rmse_nan : bool
-        Whether throw the error if the total RMSE is NaN
+    check_total_rmse_nan : bool
+        Whether to throw an error if the total RMSE is NaN

📜 Review details

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cc85a6d and c4dc138.

📒 Files selected for processing (1)

deepmd/loggers/training.py

🧰 Additional context used

🧠 Learnings (1)

📚 Learning: 2025-12-12T13:40:14.334Z

Learnt from: CR
Repo: deepmodeling/deepmd-kit PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-12T13:40:14.334Z
Learning: Verify PyTorch training output shows decreasing loss with 'batch X: trn: rmse' messages

Applied to files:

deepmd/loggers/training.py

🧬 Code graph analysis (1)

deepmd/loggers/training.py (3)

deepmd/pd/utils/stat.py (1)

rmse (536-537)

deepmd/pt/utils/stat.py (1)

rmse (536-537)

deepmd/utils/pair_tab.py (1)

get (207-209)

🔇 Additional comments (2)

deepmd/loggers/training.py (2)

3-6: LGTM!

The import additions and logger initialization follow Python best practices. Using logging.getLogger(__name__) ensures proper logger hierarchy.

51-63: Good implementation approach with clear error handling.

The implementation correctly:

- Constructs the message with all RMSE values for visibility
- Logs the message before raising the error so users can see the problematic values
- Raises a descriptive RuntimeError that stops training
- Defaults to enabled (safer choice)

The sequence of operations is appropriate: format the message, check for NaN, log if found, then raise. This provides good visibility into what went wrong.
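The ordering described here (format, check, log, raise) can be observed with a small capturing handler. Everything in this sketch is illustrative: the logger name `nan_check_sketch`, the `ListHandler` class, the `report` function, and the error text are assumptions, not names from the PR.

```python
import logging
import math


class ListHandler(logging.Handler):
    """Collects formatted log messages in a list for later inspection."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


log = logging.getLogger("nan_check_sketch")
log.propagate = False  # keep the demo output out of the root logger
handler = ListHandler()
log.addHandler(handler)


def report(batch, rmse):
    # Step 1: format the message with every RMSE term.
    terms = ", ".join(f"{k} = {v:8.2e}" for k, v in sorted(rmse.items()))
    msg = f"batch {batch:7d}: {terms}"
    # Steps 2-4: check the total RMSE, log the message, then raise.
    if math.isnan(rmse.get("rmse", 0.0)):
        log.error(msg)
        raise RuntimeError("total RMSE is NaN")
    return msg


try:
    report(100, {"rmse": float("nan"), "rmse_e": 0.05})
    raised = False
except RuntimeError:
    raised = True
```

After this runs, `raised` is true and `handler.messages` holds the single formatted line, so the offending values are recorded even though the exception ends the call.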

deepmd/loggers/training.py

Co-authored-by: Copilot <[email protected]> Signed-off-by: Jinzhe Zeng <[email protected]>

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

codecov · 2026-01-07T18:13:23Z

Codecov Report

❌ Patch coverage is 66.66667% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.14%. Comparing base (cc85a6d) to head (db5e195).
⚠️ Report is 5 commits behind head on master.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| deepmd/loggers/training.py | 66.66% | 3 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #5135      +/-   ##
==========================================
- Coverage   82.15%   82.14%   -0.01%     
==========================================
  Files         709      709              
  Lines       72470    72478       +8     
  Branches     3616     3615       -1     
==========================================
+ Hits        59535    59540       +5     
- Misses      11771    11775       +4     
+ Partials     1164     1163       -1

feat: add NaN detection during training

6e8bca5

Fix deepmodeling#4985. This implementation is much simpler than deepmodeling#4986. Signed-off-by: Jinzhe Zeng <[email protected]>

njzjz requested review from Copilot, iProzd and wanghan-iapcm January 7, 2026 17:28

github-actions bot added the Python label Jan 7, 2026

njzjz mentioned this pull request Jan 7, 2026

feat: add NaN detection during training #4986

Closed

dosubot bot added the new feature label Jan 7, 2026

[pre-commit.ci] auto fixes from pre-commit.com hooks

c4dc138

for more information, see https://pre-commit.ci

Copilot started reviewing on behalf of njzjz January 7, 2026 17:30

Copilot AI reviewed Jan 7, 2026

deepmd/loggers/training.py (outdated, resolved)

deepmd/loggers/training.py (resolved)

coderabbitai bot reviewed Jan 7, 2026

deepmd/loggers/training.py (resolved)

Update deepmd/loggers/training.py

db5e195

Co-authored-by: Copilot <[email protected]> Signed-off-by: Jinzhe Zeng <[email protected]>

Copilot AI reviewed Jan 7, 2026

wanghan-iapcm approved these changes Jan 8, 2026

iProzd approved these changes Jan 9, 2026

iProzd enabled auto-merge January 9, 2026 08:19

iProzd added this pull request to the merge queue Jan 9, 2026

Merged via the queue into deepmodeling:master with commit 6012b4d Jan 9, 2026
70 checks passed

3 participants