
Conversation

@OutisLi (Collaborator) commented Jan 9, 2026

Summary by CodeRabbit

  • New Features

    • Model-summary logging at training start: shows descriptor type and trainable-parameter counts (per-model and multi-task).
  • Style

    • Unified capitalization and clearer display labels across backend, system and summary outputs for improved readability.
  • Tests

    • Added unit tests covering descriptor type detection and trainable-parameter counting.


Copilot AI review requested due to automatic review settings January 9, 2026 05:27
@github-actions github-actions bot added the Python label Jan 9, 2026
@OutisLi OutisLi requested review from iProzd and njzjz January 9, 2026 05:28
@dosubot dosubot bot added the new feature label Jan 9, 2026
coderabbitai bot (Contributor) commented Jan 9, 2026

📝 Walkthrough

Walkthrough

This PR standardizes many display keys/capitalization across logging and summary outputs, adds PyTorch Trainer model-summary logging (descriptor type and trainable parameter counts), and introduces unit tests for model-summary helpers.

Changes

Cohort / File(s) / Summary
Training Log Formatting
deepmd/loggers/training.py
Capitalized leading word in batch log messages in format_training_message and format_training_message_per_task.
Backend Info Key Standardization
deepmd/pd/entrypoints/main.py, deepmd/pt/entrypoints/main.py, deepmd/tf/train/run_options.py
Renamed backend info dictionary keys to Title Case (examples: "PD ver" → "PD Ver", "Enable custom OP" → "Custom OP Enabled", "build with PT ver" → "Built With PT Ver").
System Info Formatting
deepmd/utils/data_system.py
Capitalized system name and summary header: e.g., "found %d system(s):" → "Found %d System(s):".
Summary Output Formatting
deepmd/utils/summary.py
Extensive key renames to Title/Upper Case (e.g., "installed to" → "Installed To", "world size" → "World Size", "computing device" → "Computing Device"), value case normalization, and conditional addition of "Device Name" when detectable.
Model Summary Logging
deepmd/pt/train/training.py
Added Trainer._log_model_summary() to compute and log descriptor type and trainable parameter counts (supports single and multi-model/multi-task flows); invoked during Trainer init on rank 0 after profiler setup.
Model Summary Tests
source/tests/pt/test_model_summary.py
New unit tests for descriptor-type extraction (standard and ZBL-like models, and edge cases) and trainable-parameter counting (all trainable, partially frozen, all frozen).

Sequence Diagram(s)

sequenceDiagram
    participant Trainer
    participant Model
    participant Logger

    Trainer->>Model: inspect model structure (get_descriptor / atomic_model)
    Trainer->>Model: iterate parameters to count trainable params
    Trainer-->>Trainer: assemble summary entries (descriptor type, params)
    Trainer->>Logger: emit summary logs per model / model_key
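To make the walkthrough concrete, here is a minimal sketch of this kind of summary logging. The function name, the class-name fallback for the descriptor type, and the log wording are illustrative assumptions; the PR itself reads the type from the descriptor's serialize() output and lives inside the Trainer.

import logging

import torch

log = logging.getLogger(__name__)


def log_model_summary(model: torch.nn.Module, model_key: str = "Default") -> None:
    # Descriptor type: only reported when the model exposes get_descriptor();
    # otherwise fall back to "UNKNOWN", mirroring the fallback described above.
    if hasattr(model, "get_descriptor"):
        descriptor_type = type(model.get_descriptor()).__name__
    else:
        descriptor_type = "UNKNOWN"
    # Count only parameters that will receive gradients during training.
    n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    log.info(
        "Model %s: Descriptor = %s, Trainable Parameters = %.3f M",
        model_key,
        descriptor_type,
        n_trainable / 1e6,
    )

In the multi-task flow described in the diagram, this would simply be called once per model_key.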

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested reviewers

  • njzjz
  • wanghan-iapcm
🚥 Pre-merge checks: ✅ 3 passed
  • Description Check ✅ Passed: check skipped because CodeRabbit’s high-level summary is enabled.
  • Title Check ✅ Passed: the title comprehensively summarizes the main changes: adding descriptor names and parameter counts for PyTorch, GPU name detection for CUDA, and capitalization updates across all backends.
  • Docstring Coverage ✅ Passed: docstring coverage is 91.30%, above the required threshold of 80.00%.


coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (3)
deepmd/pt/train/training.py (1)

731-748: Consider adding defensive checks for descriptor extraction.

The get_descriptor_type function attempts to extract descriptor information with fallback to "UNKNOWN", which is good for resilience. However, consider these refinements:

  1. Empty ZBL models list: Line 742 checks if models:, which is good, but you could also validate the structure more explicitly.
  2. Type key presence: The function checks "type" in serialized but doesn't validate the structure beforehand.
  3. Silent failures: Returning "UNKNOWN" silently might hide configuration issues during development.
♻️ Optional: Add logging for unknown descriptor types
             if hasattr(model, "get_descriptor"):
                 descriptor = model.get_descriptor()
                 serialized = descriptor.serialize()
                 if isinstance(serialized, dict) and "type" in serialized:
                     return serialized["type"].upper()
+                else:
+                    log.debug("Descriptor serialization did not return expected dict with 'type' key")
             # ZBL models: descriptor is in atomic_model.models[0]
             if hasattr(model, "atomic_model") and hasattr(model.atomic_model, "models"):
                 models = model.atomic_model.models
                 if models:  # Check non-empty
                     dp_model = models[0]
                     if hasattr(dp_model, "descriptor"):
                         serialized = dp_model.descriptor.serialize()
                         if isinstance(serialized, dict) and "type" in serialized:
                             return serialized["type"].upper() + " (with ZBL)"
+                        else:
+                            log.debug("ZBL descriptor serialization did not return expected dict with 'type' key")
+            log.debug("Could not determine descriptor type for model")
             return "UNKNOWN"
deepmd/utils/summary.py (2)

77-81: Device name may not match the actual compute device in multi-GPU setups.

torch.cuda.get_device_name(0) always returns the name of device 0, but the actual compute device (from get_compute_device()) could be a different GPU index (e.g., cuda:1). Consider extracting the device index from the compute device string:

♻️ Suggested improvement
         if build_info["Backend"] == "PyTorch":
             import torch

             if torch.cuda.is_available():
-                build_info["Device Name"] = torch.cuda.get_device_name(0)
+                compute_device = self.get_compute_device()
+                if compute_device.startswith("cuda"):
+                    # Extract device index, default to 0
+                    try:
+                        device_idx = int(compute_device.split(":")[-1]) if ":" in compute_device else 0
+                    except ValueError:
+                        device_idx = 0
+                    build_info["Device Name"] = torch.cuda.get_device_name(device_idx)
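An alternative parse, sketched under the assumption that the compute-device string is something torch.device accepts (e.g. "cuda" or "cuda:1"); cuda_device_name is a hypothetical helper, not code from the PR:

import torch


def cuda_device_name(compute_device: str) -> str | None:
    # torch.device handles both "cuda" (index None) and "cuda:1" (index 1),
    # avoiding manual string splitting.
    dev = torch.device(compute_device)
    if dev.type != "cuda" or not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_name(dev.index if dev.index is not None else 0)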

94-95: Minor: Naming style inconsistency.

"NUM_INTRA_THREADS" and "NUM_INTER_THREADS" use SCREAMING_SNAKE_CASE while other keys in this PR use Title Case (e.g., "Visible GPU Count", "Device Name"). If this is intentional to mirror environment variable naming, consider adding a comment explaining the convention. Otherwise, consider "Intra Threads" and "Inter Threads" for consistency.

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 5f73113 and d9c3db7.

📒 Files selected for processing (8)
  • deepmd/loggers/training.py
  • deepmd/pd/entrypoints/main.py
  • deepmd/pt/entrypoints/main.py
  • deepmd/pt/train/training.py
  • deepmd/tf/train/run_options.py
  • deepmd/utils/data_system.py
  • deepmd/utils/summary.py
  • source/tests/pt/test_model_summary.py
🧰 Additional context used
🧬 Code graph analysis (3)
source/tests/pt/test_model_summary.py (1)
deepmd/pt/train/training.py (2)
  • get_descriptor_type (731-748)
  • count_parameters (750-752)
deepmd/utils/data_system.py (1)
deepmd/utils/path.py (3)
  • name (143-144)
  • name (264-266)
  • name (466-468)
deepmd/utils/summary.py (4)
deepmd/pd/entrypoints/main.py (4)
  • get_compute_device (209-211)
  • is_built_with_cuda (201-203)
  • is_built_with_rocm (205-207)
  • get_ngpus (213-215)
deepmd/pt/entrypoints/main.py (4)
  • get_compute_device (230-232)
  • is_built_with_cuda (222-224)
  • is_built_with_rocm (226-228)
  • get_ngpus (234-236)
deepmd/tf/train/run_options.py (4)
  • get_compute_device (58-60)
  • is_built_with_cuda (50-52)
  • is_built_with_rocm (54-56)
  • get_ngpus (62-64)
deepmd/env.py (1)
  • get_default_nthreads (120-143)
⏰ Context from checks skipped due to timeout of 90000 ms: 42 GitHub checks (Python and C++ test matrices, wheel and library builds, CodeQL/Analyze jobs, Agent).
🔇 Additional comments (13)
deepmd/utils/data_system.py (2)

716-716: LGTM: Consistent capitalization for log message.

Note that .capitalize() converts the first character to uppercase and the rest to lowercase, which may alter the display of system names with mixed case (e.g., "MySystem" → "Mysystem"). This appears intentional for standardizing the format.
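A quick illustration of the difference in plain Python:

name = "MySystem"
name.capitalize()              # 'Mysystem'  (rest of the string is lowercased)
name[:1].upper() + name[1:]    # 'MySystem'  (first letter upper, original casing kept)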


718-718: LGTM: Capitalization aligns with PR standardization goals.

The updated log message follows the title-case convention applied across the codebase.

deepmd/pt/train/training.py (3)

724-726: LGTM: Proper placement and rank guard.

The model summary is logged only on rank 0, avoiding duplicate logs in distributed training scenarios. The placement after profiling setup is appropriate.


750-752: LGTM: Correct parameter counting.

The function correctly counts only trainable parameters using p.requires_grad, which is the appropriate metric for model capacity reporting.


754-767: LGTM: Well-structured model summary logging.

The implementation provides clear, readable output for both single-task and multi-task scenarios. The empty line improves log readability, and parameter counts in millions with 3 decimal places strike a good balance between precision and conciseness.
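For concreteness, the "millions with 3 decimal places" formatting amounts to something like the following (the surrounding log text is the PR's and is not reproduced here):

n_params = 1_234_567
print(f"{n_params / 1e6:.3f} M")  # -> "1.235 M": millions, three decimal places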

deepmd/pd/entrypoints/main.py (1)

222-223: LGTM: Backend info keys standardized to Title Case.

The key renames ("PD ver" → "PD Ver", "Enable custom OP" → "Custom OP Enabled") align with the consistent capitalization pattern applied across all backends in this PR.

deepmd/tf/train/run_options.py (1)

70-73: LGTM: TensorFlow backend info keys standardized.

The updated keys follow Title Case convention, matching the standardization applied to PyTorch and Paddle backends. Values remain unchanged, preserving all functionality.

deepmd/loggers/training.py (2)

11-11: LGTM: Training message capitalization standardized.

The capitalization of "Batch" aligns with the PR's goal of standardizing log message formatting across the codebase.


32-32: LGTM: Consistent capitalization in per-task messages.

The change matches the formatting update in format_training_message, ensuring consistency across all training log messages.

deepmd/pt/entrypoints/main.py (1)

238-253: LGTM!

The key renaming to Title Case format is consistent with the changes in other backends (Paddle, TensorFlow) as noted in the AI summary. The "Custom OP Enabled" phrasing is clearer than "Enable custom OP" as it describes the current state rather than an action.

source/tests/pt/test_model_summary.py (2)

103-140: Good test coverage for parameter counting.

The tests correctly verify:

  • All trainable parameters (Linear(10,5) = 55 params ✓)
  • Mixed trainable/frozen (only unfrozen layer counted ✓)
  • All frozen (0 trainable params ✓)
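A minimal reproduction of the partially-frozen case above with plain torch.nn modules; the counting expression mirrors the requires_grad rule described for the helper, but this is not the PR's test code:

import torch

model = torch.nn.Sequential(torch.nn.Linear(10, 5), torch.nn.Linear(5, 2))
for p in model[0].parameters():
    p.requires_grad_(False)  # freeze the first layer (10*5 + 5 = 55 params)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
assert trainable == 5 * 2 + 2  # only the unfrozen second layer (12 params) counts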

15-33: The suggested import is not feasible; the functions are nested inside _log_model_summary().

While code duplication is a valid concern, these functions cannot be imported directly since they're defined as nested functions inside the _log_model_summary() method in training.py (lines 731 and 750). The import statement from deepmd.pt.train.training import get_descriptor_type will fail.

If duplication elimination is desired, consider extracting these functions to module-level helpers in training.py so they can be shared between the implementation and tests. However, the current test approach with isolated static methods and mocks/real models is reasonable for unit testing this specific logic.

Likely an incorrect or invalid review comment.
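If the helpers were lifted to module level as suggested, the layout could look roughly like this (a hypothetical sketch of the refactor, not the current code):

# training.py (sketch of the suggested refactor)
def get_descriptor_type(model) -> str:
    """Same logic as the current nested helper, now importable by tests."""
    ...


def count_parameters(model) -> int:
    """Same logic as the current nested helper, now importable by tests."""
    ...


class Trainer:
    def _log_model_summary(self) -> None:
        # The method body would call the module-level helpers instead of
        # defining them inline, so tests could do:
        #   from deepmd.pt.train.training import get_descriptor_type, count_parameters
        ...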

deepmd/utils/summary.py (1)

49-57: LGTM!

The BUILD dictionary keys are consistently renamed to Title Case format, aligning with the overall PR objective of standardizing output formatting.

Copilot AI (Contributor) left a comment

Pull request overview

This PR enhances the DeePMD-kit training output by adding descriptor type and trainable parameter count information for PyTorch models, displaying GPU device names for CUDA setups, and standardizing the capitalization of various information fields across all backends (PyTorch, TensorFlow, Paddle) for better consistency and readability.

Key changes:

  • Added model summary logging that displays descriptor type and trainable parameter count in millions
  • Added GPU device name display for PyTorch CUDA setups
  • Standardized capitalization of build, runtime, and training information across all backends

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 10 comments.

Show a summary per file
  • source/tests/pt/test_model_summary.py: new test file for descriptor type detection and parameter counting functions
  • deepmd/utils/summary.py: capitalized info keys, added GPU device name for PyTorch CUDA, changed environment variable keys to uppercase
  • deepmd/utils/data_system.py: capitalized DataSystem summary title and system count message
  • deepmd/tf/train/run_options.py: capitalized TensorFlow backend information keys
  • deepmd/pt/train/training.py: added _log_model_summary method with descriptor type and parameter count logging for single and multi-task models
  • deepmd/pt/entrypoints/main.py: capitalized PyTorch backend information keys
  • deepmd/pd/entrypoints/main.py: capitalized Paddle backend information keys
  • deepmd/loggers/training.py: capitalized "batch" to "Batch" in training log messages


coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
deepmd/utils/summary.py (2)

80-84: Consider adding ROCm device detection for PyTorch.

Currently, the code only checks CUDA devices via torch.cuda.is_available(). For PyTorch installations with ROCm support, consider also checking torch.hip.is_available() and torch.hip.get_device_name(0) to provide device names for AMD GPUs.

♻️ Optional enhancement for ROCm support
 if backend == "PyTorch":
     import torch
 
-    if torch.cuda.is_available():
-        device_name = torch.cuda.get_device_name(0)
+    if torch.cuda.is_available():
+        device_name = torch.cuda.get_device_name(0)
+    elif hasattr(torch, "hip") and torch.hip.is_available():
+        device_name = torch.hip.get_device_name(0)
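For reference, stock ROCm wheels of PyTorch usually expose the AMD GPU through the existing torch.cuda namespace, with torch.version.hip set instead of torch.version.cuda, so a detection sketch can stay within torch.cuda (this is an observation about upstream PyTorch, not part of the PR):

import torch

if torch.cuda.is_available():
    # On ROCm wheels this already returns the AMD GPU name.
    device_name = torch.cuda.get_device_name(0)
    gpu_backend = "ROCm" if torch.version.hip is not None else "CUDA"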

97-99: Consider logging device detection failures at debug level.

The broad exception handler silently suppresses all errors during device name detection. While this best-effort approach is reasonable, logging failures at debug level would help troubleshoot issues without cluttering normal output.

Based on static analysis hints, the broad exception catch could be improved:

♻️ Optional enhancement with debug logging
 except Exception:
-    # Best-effort device name detection; ignore failures silently
-    pass
+    # Best-effort device name detection; log failures at debug level
+    log.debug("Failed to detect device name", exc_info=True)
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between d9c3db7 and 218cdc9.

📒 Files selected for processing (1)
  • deepmd/utils/summary.py
🧰 Additional context used
🪛 Ruff (0.14.10)
deepmd/utils/summary.py

97-99: try-except-pass detected, consider logging the exception

(S110)


97-97: Do not catch blind exception: Exception

(BLE001)

⏰ Context from checks skipped due to timeout of 90000 ms: 40 GitHub checks (Python and C++ test matrices, wheel and library builds, Analyze jobs).
🔇 Additional comments (4)
deepmd/utils/summary.py (4)

49-57: LGTM! Improved display formatting.

The capitalization of keys and formatting of values makes the build information more readable and professional.


67-68: LGTM! Consistent capitalization.

The key names now follow the same Title Case convention as the BUILD dictionary, improving overall consistency.

Also applies to: 73-74


109-109: LGTM! Consistent naming for GPU and thread information.

The key renames maintain consistency with the overall capitalization improvements. Using uppercase for thread variables follows common environment variable naming conventions.

Also applies to: 114-115


85-91: The code snippet referenced in this review does not exist in the current codebase.

Lines 85-91 in deepmd/utils/summary.py contain thread configuration logic, not TensorFlow device detection. The TensorFlow SummaryPrinter implementation (in deepmd/tf/train/run_options.py, lines 42-74) does not contain any tf.config.list_physical_devices() calls or device name extraction logic. It only passes pre-computed device information from the constructor.

Likely an incorrect or invalid review comment.

codecov bot commented Jan 9, 2026

Codecov Report

❌ Patch coverage is 81.48148% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.94%. Comparing base (5f73113) to head (218cdc9).
⚠️ Report is 2 commits behind head on master.

Files with missing lines (patch coverage, missing lines):
  • deepmd/utils/summary.py: 55.00%, 9 missing ⚠️
  • deepmd/pt/train/training.py: 96.87%, 1 missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##           master    #5141   +/-   ##
=======================================
  Coverage   81.94%   81.94%           
=======================================
  Files         712      712           
  Lines       72887    72938   +51     
  Branches     3616     3616           
=======================================
+ Hits        59725    59767   +42     
- Misses      11998    12007    +9     
  Partials     1164     1164           


@njzjz (Member) left a comment

I don't get the key point of this PR. A single PR should not have multiple purposes that are not related to each other.

    def _log_model_summary(self) -> None:
        """Log model summary information including descriptor type and parameter count."""

        def get_descriptor_type(model: Any) -> str:
Member commented:

Why annotate `Any`?

@OutisLi (Collaborator, Author) replied:

since

def get_model_for_wrapper(
    _model_params: dict[str, Any],
    resuming: bool = False,
    _loss_params: dict[str, Any] | None = None,
) -> Any

returns Any

                    return serialized["type"].upper()
            # ZBL models: descriptor is in atomic_model.models[0]
            if hasattr(model, "atomic_model") and hasattr(model.atomic_model, "models"):
                models = model.atomic_model.models
Member commented:

It's not good practice to reach into an inner attribute like this.

@OutisLi OutisLi closed this Jan 10, 2026
@OutisLi OutisLi deleted the pr/display branch January 10, 2026 01:44