feat(pt): add descriptor name & parameter numbers output & gpu name (only for cuda) & Capitalise some infos (all backends) #5141
Conversation
📝 Walkthrough

This PR standardizes many display keys and their capitalization across logging and summary outputs, adds PyTorch Trainer model-summary logging (descriptor type and trainable parameter counts), and introduces unit tests for the model-summary helpers.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Trainer
    participant Model
    participant Logger
    Trainer->>Model: inspect model structure (get_descriptor / atomic_model)
    Trainer->>Model: iterate parameters to count trainable params
    Trainer-->>Trainer: assemble summary entries (descriptor type, params)
    Trainer->>Logger: emit summary logs per model / model_key
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 0
🧹 Nitpick comments (3)
deepmd/pt/train/training.py (1)
731-748: Consider adding defensive checks for descriptor extraction.

The `get_descriptor_type` function attempts to extract descriptor information with a fallback to "UNKNOWN", which is good for resilience. However, consider these refinements:

- Empty ZBL models list: Line 742 checks `if models:`, which is good, but you could also validate the structure more explicitly.
- Type key presence: The function checks `"type" in serialized` but doesn't validate the structure beforehand.
- Silent failures: Returning "UNKNOWN" silently might hide configuration issues during development.
♻️ Optional: Add logging for unknown descriptor types
```diff
 if hasattr(model, "get_descriptor"):
     descriptor = model.get_descriptor()
     serialized = descriptor.serialize()
     if isinstance(serialized, dict) and "type" in serialized:
         return serialized["type"].upper()
+    else:
+        log.debug("Descriptor serialization did not return expected dict with 'type' key")
 # ZBL models: descriptor is in atomic_model.models[0]
 if hasattr(model, "atomic_model") and hasattr(model.atomic_model, "models"):
     models = model.atomic_model.models
     if models:  # Check non-empty
         dp_model = models[0]
         if hasattr(dp_model, "descriptor"):
             serialized = dp_model.descriptor.serialize()
             if isinstance(serialized, dict) and "type" in serialized:
                 return serialized["type"].upper() + " (with ZBL)"
+            else:
+                log.debug("ZBL descriptor serialization did not return expected dict with 'type' key")
+log.debug("Could not determine descriptor type for model")
 return "UNKNOWN"
```

deepmd/utils/summary.py (2)
77-81: Device name may not match the actual compute device in multi-GPU setups.

`torch.cuda.get_device_name(0)` always returns the name of device 0, but the actual compute device (from `get_compute_device()`) could be a different GPU index (e.g., `cuda:1`). Consider extracting the device index from the compute device string:

♻️ Suggested improvement
if build_info["Backend"] == "PyTorch": import torch if torch.cuda.is_available(): - build_info["Device Name"] = torch.cuda.get_device_name(0) + compute_device = self.get_compute_device() + if compute_device.startswith("cuda"): + # Extract device index, default to 0 + try: + device_idx = int(compute_device.split(":")[-1]) if ":" in compute_device else 0 + except ValueError: + device_idx = 0 + build_info["Device Name"] = torch.cuda.get_device_name(device_idx)
94-95: Minor: Naming style inconsistency.

"NUM_INTRA_THREADS" and "NUM_INTER_THREADS" use SCREAMING_SNAKE_CASE while other keys in this PR use Title Case (e.g., "Visible GPU Count", "Device Name"). If this is intentional to mirror environment variable naming, consider adding a comment explaining the convention. Otherwise, consider "Intra Threads" and "Inter Threads" for consistency.
📜 Review details
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- deepmd/loggers/training.py
- deepmd/pd/entrypoints/main.py
- deepmd/pt/entrypoints/main.py
- deepmd/pt/train/training.py
- deepmd/tf/train/run_options.py
- deepmd/utils/data_system.py
- deepmd/utils/summary.py
- source/tests/pt/test_model_summary.py
🧰 Additional context used
🧬 Code graph analysis (3)
source/tests/pt/test_model_summary.py (1)
deepmd/pt/train/training.py (2)
- get_descriptor_type (731-748)
- count_parameters (750-752)
deepmd/utils/data_system.py (1)
deepmd/utils/path.py (3)
- name (143-144)
- name (264-266)
- name (466-468)
deepmd/utils/summary.py (4)
deepmd/pd/entrypoints/main.py (4)
- get_compute_device (209-211)
- is_built_with_cuda (201-203)
- is_built_with_rocm (205-207)
- get_ngpus (213-215)

deepmd/pt/entrypoints/main.py (4)
- get_compute_device (230-232)
- is_built_with_cuda (222-224)
- is_built_with_rocm (226-228)
- get_ngpus (234-236)

deepmd/tf/train/run_options.py (4)
- get_compute_device (58-60)
- is_built_with_cuda (50-52)
- is_built_with_rocm (54-56)
- get_ngpus (62-64)

deepmd/env.py (1)
- get_default_nthreads (120-143)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (42)
- GitHub Check: Agent
- GitHub Check: CodeQL analysis (python)
- GitHub Check: Test Python (11, 3.10)
- GitHub Check: Test Python (12, 3.13)
- GitHub Check: Test Python (12, 3.10)
- GitHub Check: Test Python (11, 3.13)
- GitHub Check: Test Python (10, 3.13)
- GitHub Check: Test Python (5, 3.10)
- GitHub Check: Test Python (10, 3.10)
- GitHub Check: Test Python (5, 3.13)
- GitHub Check: Test Python (6, 3.13)
- GitHub Check: Test Python (9, 3.13)
- GitHub Check: Test Python (8, 3.10)
- GitHub Check: Test Python (9, 3.10)
- GitHub Check: Test Python (8, 3.13)
- GitHub Check: Test Python (7, 3.13)
- GitHub Check: Test Python (6, 3.10)
- GitHub Check: Test Python (7, 3.10)
- GitHub Check: Test Python (2, 3.10)
- GitHub Check: Test Python (3, 3.10)
- GitHub Check: Test Python (4, 3.10)
- GitHub Check: Test Python (4, 3.13)
- GitHub Check: Test Python (1, 3.10)
- GitHub Check: Test Python (3, 3.13)
- GitHub Check: Test Python (2, 3.13)
- GitHub Check: Test Python (1, 3.13)
- GitHub Check: Test C++ (false, true, true, false)
- GitHub Check: Test C++ (true, true, true, false)
- GitHub Check: Test C++ (true, false, false, true)
- GitHub Check: Test C++ (false, false, false, true)
- GitHub Check: Build wheels for cp310-manylinux_aarch64
- GitHub Check: Build wheels for cp311-macosx_arm64
- GitHub Check: Build wheels for cp311-macosx_x86_64
- GitHub Check: Build wheels for cp311-win_amd64
- GitHub Check: Build wheels for cp311-manylinux_x86_64
- GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)
- GitHub Check: Analyze (python)
- GitHub Check: Analyze (c-cpp)
- GitHub Check: Build C++ (rocm, rocm)
- GitHub Check: Build C++ (cpu, cpu)
- GitHub Check: Build C++ (cuda120, cuda)
- GitHub Check: Build C++ (clang, clang)
🔇 Additional comments (13)
deepmd/utils/data_system.py (2)
716-716: LGTM: Consistent capitalization for log message.

Note that `.capitalize()` converts the first character to uppercase and the rest to lowercase, which may alter the display of system names with mixed case (e.g., "MySystem" → "Mysystem"). This appears intentional for standardizing the format.
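For reference, a quick illustration of Python's standard `str.capitalize()` behavior (nothing project-specific):

```python
>>> "MySystem".capitalize()   # the rest of the string is lowercased
'Mysystem'
>>> "my system".capitalize()  # only the first character is uppercased
'My system'
```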
718-718: LGTM: Capitalization aligns with PR standardization goals.

The updated log message follows the title-case convention applied across the codebase.
deepmd/pt/train/training.py (3)
724-726: LGTM: Proper placement and rank guard.

The model summary is logged only on rank 0, avoiding duplicate logs in distributed training scenarios. The placement after profiling setup is appropriate.
750-752: LGTM: Correct parameter counting.

The function correctly counts only trainable parameters using `p.requires_grad`, which is the appropriate metric for model capacity reporting.
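For readers unfamiliar with the pattern, here is a minimal sketch of such a trainable-parameter counter in PyTorch; the function name mirrors the PR's helper, but the body is an assumption based on the review description:

```python
import torch

def count_parameters(model: torch.nn.Module) -> int:
    # Sum element counts of parameters that participate in gradient updates
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```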
754-767: LGTM: Well-structured model summary logging.

The implementation provides clear, readable output for both single-task and multi-task scenarios. The empty line improves log readability, and parameter counts in millions with 3 decimal places strike a good balance between precision and conciseness.
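As an illustration of the millions-with-3-decimals convention described above (a sketch, not the PR's exact log line):

```python
n_params = 1_234_567
print(f"Trainable Parameters: {n_params / 1e6:.3f} M")  # Trainable Parameters: 1.235 M
```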
deepmd/pd/entrypoints/main.py (1)
222-223: LGTM: Backend info keys standardized to Title Case.

The key renames ("PD ver" → "PD Ver", "Enable custom OP" → "Custom OP Enabled") align with the consistent capitalization pattern applied across all backends in this PR.
deepmd/tf/train/run_options.py (1)
70-73: LGTM: TensorFlow backend info keys standardized.

The updated keys follow Title Case convention, matching the standardization applied to PyTorch and Paddle backends. Values remain unchanged, preserving all functionality.
deepmd/loggers/training.py (2)
11-11: LGTM: Training message capitalization standardized.

The capitalization of "Batch" aligns with the PR's goal of standardizing log message formatting across the codebase.
32-32: LGTM: Consistent capitalization in per-task messages.

The change matches the formatting update in `format_training_message`, ensuring consistency across all training log messages.

deepmd/pt/entrypoints/main.py (1)
238-253: LGTM!

The key renaming to Title Case format is consistent with the changes in other backends (Paddle, TensorFlow) as noted in the AI summary. The "Custom OP Enabled" phrasing is clearer than "Enable custom OP" as it describes the current state rather than an action.
source/tests/pt/test_model_summary.py (2)
103-140: Good test coverage for parameter counting. The arithmetic is verified in the sketch after this list.

The tests correctly verify:
- All trainable parameters (Linear(10,5) = 55 params ✓)
- Mixed trainable/frozen (only unfrozen layer counted ✓)
- All frozen (0 trainable params ✓)
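The 55-parameter figure is easy to verify: `torch.nn.Linear(10, 5)` holds a 10×5 weight matrix plus 5 biases. A self-contained check (illustrative only, not taken from the test file):

```python
import torch

layer = torch.nn.Linear(10, 5)
total = sum(p.numel() for p in layer.parameters())
assert total == 55  # 10 * 5 weights + 5 biases

# Freezing all parameters drops the trainable count to zero
for p in layer.parameters():
    p.requires_grad = False
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
assert trainable == 0
```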
15-33: The suggested import is not feasible; the functions are nested inside `_log_model_summary()`.

While code duplication is a valid concern, these functions cannot be imported directly since they're defined as nested functions inside the `_log_model_summary()` method in training.py (lines 731 and 750). The import statement `from deepmd.pt.train.training import get_descriptor_type` will fail.

If duplication elimination is desired, consider extracting these functions to module-level helpers in training.py so they can be shared between the implementation and tests. However, the current test approach with isolated static methods and mocks/real models is reasonable for unit testing this specific logic.

Likely an incorrect or invalid review comment.
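To make the import problem concrete, a minimal reproduction with hypothetical module and class names mirroring the structure described above:

```python
# mymodule.py (hypothetical)
class Trainer:
    def _log_model_summary(self):
        def get_descriptor_type(model):  # nested: exists only while the method runs
            return "UNKNOWN"

# test.py
from mymodule import get_descriptor_type  # ImportError: not a module-level name
```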
deepmd/utils/summary.py (1)
49-57: LGTM!The BUILD dictionary keys are consistently renamed to Title Case format, aligning with the overall PR objective of standardizing output formatting.
Pull request overview
This PR enhances the DeePMD-kit training output by adding descriptor type and trainable parameter count information for PyTorch models, displaying GPU device names for CUDA setups, and standardizing the capitalization of various information fields across all backends (PyTorch, TensorFlow, Paddle) for better consistency and readability.
Key changes:
- Added model summary logging that displays descriptor type and trainable parameter count in millions
- Added GPU device name display for PyTorch CUDA setups
- Standardized capitalization of build, runtime, and training information across all backends
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
| source/tests/pt/test_model_summary.py | New test file for descriptor type detection and parameter counting functions |
| deepmd/utils/summary.py | Capitalized info keys, added GPU device name for PyTorch CUDA, changed environment variable keys to uppercase |
| deepmd/utils/data_system.py | Capitalized DataSystem summary title and system count message |
| deepmd/tf/train/run_options.py | Capitalized TensorFlow backend information keys |
| deepmd/pt/train/training.py | Added _log_model_summary method with descriptor type and parameter count logging for single and multi-task models |
| deepmd/pt/entrypoints/main.py | Capitalized PyTorch backend information keys |
| deepmd/pd/entrypoints/main.py | Capitalized Paddle backend information keys |
| deepmd/loggers/training.py | Capitalized "batch" to "Batch" in training log messages |
Actionable comments posted: 0
🧹 Nitpick comments (2)
deepmd/utils/summary.py (2)
80-84: Consider adding ROCm device detection for PyTorch.

Currently, the code only checks CUDA devices via `torch.cuda.is_available()`. For PyTorch installations with ROCm support, consider also checking `torch.hip.is_available()` and `torch.hip.get_device_name(0)` to provide device names for AMD GPUs.

♻️ Optional enhancement for ROCm support

```diff
 if backend == "PyTorch":
     import torch
-    if torch.cuda.is_available():
-        device_name = torch.cuda.get_device_name(0)
+    if torch.cuda.is_available():
+        device_name = torch.cuda.get_device_name(0)
+    elif hasattr(torch, "hip") and torch.hip.is_available():
+        device_name = torch.hip.get_device_name(0)
```
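A side note on the suggestion above: on ROCm builds of PyTorch, AMD GPUs are normally exposed through the regular `torch.cuda` API (with `torch.version.hip` set), so a single code path may suffice. A sketch under that assumption, not a verified fix for this PR:

```python
import torch

def gpu_device_name(index: int = 0) -> str | None:
    # ROCm builds of PyTorch reuse the torch.cuda namespace
    # (torch.version.hip is non-None there), so this covers
    # both NVIDIA and AMD GPUs.
    if torch.cuda.is_available():
        return torch.cuda.get_device_name(index)
    return None
```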
97-99: Consider logging device detection failures at debug level.

The broad exception handler silently suppresses all errors during device name detection. While this best-effort approach is reasonable, logging failures at debug level would help troubleshoot issues without cluttering normal output.
Based on static analysis hints, the broad exception catch could be improved:
♻️ Optional enhancement with debug logging
```diff
 except Exception:
-    # Best-effort device name detection; ignore failures silently
-    pass
+    # Best-effort device name detection; log failures at debug level
+    log.debug("Failed to detect device name", exc_info=True)
```
📜 Review details
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
deepmd/utils/summary.py
🧰 Additional context used
🪛 Ruff (0.14.10)
deepmd/utils/summary.py
97-99: try-except-pass detected, consider logging the exception
(S110)
97-97: Do not catch blind exception: Exception
(BLE001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (40)
- GitHub Check: Test Python (2, 3.10)
- GitHub Check: Test Python (9, 3.10)
- GitHub Check: Test Python (8, 3.10)
- GitHub Check: Test Python (7, 3.10)
- GitHub Check: Test Python (7, 3.13)
- GitHub Check: Test Python (10, 3.13)
- GitHub Check: Test Python (5, 3.10)
- GitHub Check: Test Python (12, 3.13)
- GitHub Check: Test Python (6, 3.10)
- GitHub Check: Test Python (11, 3.10)
- GitHub Check: Test Python (10, 3.10)
- GitHub Check: Test Python (5, 3.13)
- GitHub Check: Test Python (6, 3.13)
- GitHub Check: Test Python (9, 3.13)
- GitHub Check: Test Python (4, 3.10)
- GitHub Check: Test Python (11, 3.13)
- GitHub Check: Test Python (12, 3.10)
- GitHub Check: Test Python (3, 3.13)
- GitHub Check: Test Python (4, 3.13)
- GitHub Check: Test Python (8, 3.13)
- GitHub Check: Test Python (1, 3.13)
- GitHub Check: Test Python (3, 3.10)
- GitHub Check: Test Python (2, 3.13)
- GitHub Check: Test Python (1, 3.10)
- GitHub Check: Build wheels for cp311-win_amd64
- GitHub Check: Build wheels for cp311-macosx_arm64
- GitHub Check: Build wheels for cp310-manylinux_aarch64
- GitHub Check: Build wheels for cp311-manylinux_x86_64
- GitHub Check: Build wheels for cp311-macosx_x86_64
- GitHub Check: Build C++ (cpu, cpu)
- GitHub Check: Build C++ (cuda120, cuda)
- GitHub Check: Build C++ (clang, clang)
- GitHub Check: Build C++ (rocm, rocm)
- GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)
- GitHub Check: Test C++ (false, false, false, true)
- GitHub Check: Test C++ (false, true, true, false)
- GitHub Check: Test C++ (true, true, true, false)
- GitHub Check: Test C++ (true, false, false, true)
- GitHub Check: Analyze (c-cpp)
- GitHub Check: Analyze (python)
🔇 Additional comments (4)
deepmd/utils/summary.py (4)
49-57: LGTM! Improved display formatting.

The capitalization of keys and formatting of values makes the build information more readable and professional.
67-68: LGTM! Consistent capitalization.

The key names now follow the same Title Case convention as the BUILD dictionary, improving overall consistency.

Also applies to: 73-74
109-109: LGTM! Consistent naming for GPU and thread information.

The key renames maintain consistency with the overall capitalization improvements. Using uppercase for thread variables follows common environment variable naming conventions.

Also applies to: 114-115
85-91: The code snippet referenced in this review does not exist in the current codebase.

Lines 85-91 in `deepmd/utils/summary.py` contain thread configuration logic, not TensorFlow device detection. The TensorFlow `SummaryPrinter` implementation (in `deepmd/tf/train/run_options.py`, lines 42-74) does not contain any `tf.config.list_physical_devices()` calls or device name extraction logic. It only passes pre-computed device information from the constructor.

Likely an incorrect or invalid review comment.
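For context, TensorFlow device detection of the kind the bot expected would typically look like the following generic sketch (no such code exists in the files under review):

```python
import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
if gpus:
    # Device details may include a human-readable name,
    # e.g. {"device_name": "NVIDIA A100", ...}
    details = tf.config.experimental.get_device_details(gpus[0])
    print(details.get("device_name"))
```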
Codecov Report

❌ Patch coverage is

Additional details and impacted files

```
@@           Coverage Diff           @@
##           master    #5141   +/-   ##
=======================================
  Coverage   81.94%   81.94%
=======================================
  Files         712      712
  Lines       72887    72938     +51
  Branches     3616     3616
=======================================
+ Hits        59725    59767     +42
- Misses      11998    12007      +9
  Partials     1164     1164
```

☔ View full report in Codecov by Sentry.
njzjz
left a comment
I don't get the key point of this PR. A single PR should not have multiple purposes that are not related to each other.
```python
def _log_model_summary(self) -> None:
    """Log model summary information including descriptor type and parameter count."""

    def get_descriptor_type(model: Any) -> str:
```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why annotate `Any`?
since

```python
def get_model_for_wrapper(
    _model_params: dict[str, Any],
    resuming: bool = False,
    _loss_params: dict[str, Any] | None = None,
) -> Any
```

returns `Any`
| return serialized["type"].upper() | ||
| # ZBL models: descriptor is in atomic_model.models[0] | ||
| if hasattr(model, "atomic_model") and hasattr(model.atomic_model, "models"): | ||
| models = model.atomic_model.models |
It's not good practice to access an inner attribute like this.
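One way to address this point, sketched with hypothetical names: give the composite model a public accessor so callers never reach through `atomic_model.models[0]` directly.

```python
# Hypothetical accessor on the ZBL composite model
class LinearAtomicModel:
    def __init__(self, models: list) -> None:
        self.models = models

    def get_descriptor(self):
        # Expose the DP sub-model's descriptor without callers
        # needing to know the internal list layout
        return self.models[0].descriptor
```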
Summary by CodeRabbit

New Features
- PyTorch training now logs a model summary with descriptor type and trainable parameter count, and displays the GPU device name on CUDA setups.

Style
- Build, runtime, and training information keys are capitalized consistently across all backends.

Tests
- New unit tests cover the model-summary helpers.