
Conversation

@OutisLi (Collaborator) commented Jan 15, 2026

refactor: unify learning rate schedulers with array API

  • Refactor BaseLR in dpmodel to use array_api_compat for backend-agnostic implementation
  • Consolidate learning rate logic from TF/PT/PD backends into unified dpmodel layer
  • Use array API operations (xp.where, xp.clip, etc.) for JIT compatibility across backends (a hedged sketch follows this list)
  • Add warmup support (warmup_steps, warmup_ratio, warmup_start_factor) during refactoring
  • Add stop_ratio parameter as alternative to stop_lr for flexible configuration
  • Implement mutual exclusion validation for stop_lr/stop_ratio and warmup_steps/warmup_ratio
  • Update all backends to use unified BaseLR implementation
  • Add comprehensive consistency tests across NumPy/PyTorch/JAX/array_api_strict backends
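
A minimal sketch of what a backend-agnostic scheduler of this shape could look like, assuming a `BaseLR` base class with a `_decay_value` hook and the parameter names listed above; illustrative only, not the merged code:

```python
import array_api_compat


class BaseLR:
    """Sketch of an array-API learning-rate scheduler with linear warmup."""

    def __init__(self, start_lr, stop_lr, stop_steps,
                 warmup_steps=0, warmup_start_factor=0.0):
        self.start_lr = start_lr
        self.stop_lr = stop_lr
        self.stop_steps = stop_steps
        self.warmup_steps = warmup_steps
        self.warmup_start_factor = warmup_start_factor

    def _decay_value(self, xp, step):
        """Post-warmup decay; implemented by concrete schedulers."""
        raise NotImplementedError

    def value(self, step):
        # The namespace is taken from the input array, so the same code
        # runs on NumPy, PyTorch, JAX, or array_api_strict tensors.
        xp = array_api_compat.array_namespace(step)
        step = xp.astype(step, xp.float64)
        decay_lr = self._decay_value(xp, step - self.warmup_steps)
        if self.warmup_steps <= 0:
            return decay_lr
        # Linear ramp from warmup_start_factor * start_lr up to start_lr.
        frac = xp.clip(step / self.warmup_steps, 0.0, 1.0)
        warmup_lr = self.start_lr * (
            self.warmup_start_factor + (1.0 - self.warmup_start_factor) * frac
        )
        # xp.where keeps both branches traceable, which is what makes the
        # scheduler JIT-friendly across backends.
        return xp.where(step < self.warmup_steps, warmup_lr, decay_lr)
```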

Summary by CodeRabbit

Release Notes

  • New Features

    • Added debugging utilities for model compression, training, and inference profiling
    • Enhanced learning rate schedulers with optional warmup phase support
  • Documentation

    • Added comprehensive guides for DPA3 architecture and model compression
    • Added installation guide for building from source
    • Added development guidance documentation
  • Refactor

    • Refactored learning rate configuration to use dict-based parameters
    • Simplified training warmup handling through unified scheduler logic
    • Updated pre-commit hooks to use mirror endpoints
  • Tests

    • Added extensive learning rate scheduler test coverage including warmup scenarios
    • Enhanced test validation for parameter interactions and edge cases
  • Chores

    • Extended repository ignore patterns for development tools and build artifacts
    • Updated docstring type hints for improved documentation accuracy


feat: add multiple runs for infer scripts

feat: add pytorch profiler for infer debug

ignore profiler files in gitignore
Copilot AI review requested due to automatic review settings January 15, 2026 04:25
@OutisLi OutisLi closed this Jan 15, 2026
@dosubot dosubot bot added the enhancement label Jan 15, 2026
@OutisLi OutisLi deleted the self/lr branch January 15, 2026 04:27
@coderabbitai bot (Contributor) commented Jan 15, 2026

Caution

Review failed

The pull request is closed.

📝 Walkthrough

This PR introduces a comprehensive learning rate scheduler refactor with warmup support, adds debug utilities for model training/compression/inference/testing, expands documentation (DPA3, compression, installation), updates configuration files, and modernizes learning rate implementations across TensorFlow, PyTorch, and dpmodel backends.

Changes

Cohort / File(s) Summary
Configuration & Build
.gitignore, .pre-commit-config.yaml, examples/water/.gitignore, source/CMakeLists.txt, source/lmp/builtin.cmake
Added ignore patterns (.claude, .spec-workflow, .serena, tfevents, torch traces); replaced GitHub URLs with gh-proxy mirrors in pre-commit config; enabled CMAKE_EXPORT_COMPILE_COMMANDS; added Torch as build requirement for LAMMPS.
Documentation - Project Guidance
AGENTS.md, CLAUDE.md, README.md
Renamed/restructured AGENTS.md into comprehensive CLAUDE.md with architecture, data flow, compression concepts, and development patterns; added CLAUDE.md file; appended usage example to README.md.
Documentation - Technical Guides
doc/outisli/DPA3.md, doc/outisli/compress.md, doc/outisli/install.md
Added extensive documentation for DPA3 PyTorch implementation (2630 lines), compression workflow (650 lines), and source installation guide (307 lines) covering Linux/macOS and GPU/CPU configurations.
Debug Utilities
debug/train_debug.py, debug/compress_debug.py, debug/dptest_debug.py, debug/inference_debug.py, debug/train_debug_gradient.py
Added five new debug scripts: train_debug.py (basic training), compress_debug.py (model compression), dptest_debug.py (model inference testing with timing), inference_debug.py (detailed inference profiling with PyTorch profiler), train_debug_gradient.py (NaN/Inf detection during training).
Learning Rate System - Core Refactor
deepmd/dpmodel/utils/learning_rate.py, deepmd/tf/utils/learning_rate.py
Introduced warmup support in BaseLR with _decay_value abstract method; LearningRateExp and LearningRateCosine now implement post-warmup decay logic; TensorFlow's LearningRateExp renamed to LearningRateSchedule with dict-based configuration and lazy backend instantiation (an illustrative decay sketch follows this table).
Learning Rate System - Validation
deepmd/utils/argcheck.py
Refactored argument validation: added _check_lr_stop_args, _check_warmup_args, _learning_rate_common_args helpers; removed legacy warmup args from training_args; integrated mutual-exclusion checks for LR stop/warmup parameters.
Learning Rate System - Integration
deepmd/tf/train/trainer.py, deepmd/tf/utils/__init__.py
Updated trainer to use LearningRateSchedule instead of LearningRateExp; refactored get_lr_and_coef to accept dict-based parameters; changed public export from LearningRateExp to LearningRateSchedule.
Training Code Updates - Warmup Removal
deepmd/pd/train/training.py, deepmd/pt/train/training.py
Removed dynamic warmup configuration from both PD and PT trainers; stop_steps now equals total steps; simplified LR scheduling via lr_exp.value(step); replaced hardcoded "cpu" with env.DEVICE; removed inlined warmup_linear function.
Utility Type Expansions
deepmd/pd/utils/utils.py, deepmd/pt/utils/utils.py
Enhanced to_numpy_array to accept float/int/np.ndarray inputs in addition to torch.Tensor; returns 0-d arrays for scalars; updated type hints.
Type Annotations - Docstrings
deepmd/tf/fit/dipole.py, deepmd/tf/fit/dos.py, deepmd/tf/fit/ener.py, deepmd/tf/fit/fitting.py, deepmd/tf/fit/polar.py
Updated docstrings: changed lr parameter type from LearningRateExp to LearningRateSchedule in get_loss methods (no runtime changes).
Test Suite - Learning Rate
source/tests/consistent/test_learning_rate.py, source/tests/universal/dpmodel/utils/test_learning_rate.py, source/tests/tf/test_lr.py, source/tests/pd/test_lr.py, source/tests/pt/test_lr.py
Added warmup test support to consistent tests; added 240-line universal test suite for LR with/without warmup, array broadcasting, boundary conditions; new TensorFlow LearningRateSchedule validation tests (build, start_lr accessor, error handling); updated existing PD/PT test suites to use LearningRateSchedule dict-based config.
Test Suite - Model Integration
source/tests/pd/model/test_model.py, source/tests/pt/model/test_model.py
Updated _get_dp_lr to construct LearningRateSchedule with dict-based parameters instead of LearningRateExp direct instantiation.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes


Suggested labels

learning-rate, refactoring, documentation, testing

Suggested reviewers

  • wanghan-iapcm
  • njzjz


📜 Recent review details

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2a9667e and 573a7e3.

📒 Files selected for processing (37)
  • .gitignore
  • .pre-commit-config.yaml
  • AGENTS.md
  • CLAUDE.md
  • README.md
  • debug/compress_debug.py
  • debug/dptest_debug.py
  • debug/inference_debug.py
  • debug/train_debug.py
  • debug/train_debug_gradient.py
  • deepmd/dpmodel/utils/learning_rate.py
  • deepmd/pd/train/training.py
  • deepmd/pd/utils/utils.py
  • deepmd/pt/train/training.py
  • deepmd/pt/utils/utils.py
  • deepmd/tf/fit/dipole.py
  • deepmd/tf/fit/dos.py
  • deepmd/tf/fit/ener.py
  • deepmd/tf/fit/fitting.py
  • deepmd/tf/fit/polar.py
  • deepmd/tf/train/trainer.py
  • deepmd/tf/utils/__init__.py
  • deepmd/tf/utils/learning_rate.py
  • deepmd/utils/argcheck.py
  • doc/outisli/DPA3.md
  • doc/outisli/compress.md
  • doc/outisli/install.md
  • examples/water/.gitignore
  • source/CMakeLists.txt
  • source/lmp/builtin.cmake
  • source/tests/consistent/test_learning_rate.py
  • source/tests/pd/model/test_model.py
  • source/tests/pd/test_lr.py
  • source/tests/pt/model/test_model.py
  • source/tests/pt/test_lr.py
  • source/tests/tf/test_lr.py
  • source/tests/universal/dpmodel/utils/test_learning_rate.py


Copilot AI (Contributor) left a comment

Pull request overview

This pull request refactors learning rate schedulers to use array API for backend-agnostic implementation, consolidating previously backend-specific logic into a unified dpmodel layer. The refactoring adds warmup support and flexible configuration options while maintaining JIT compatibility across all backends.

Changes:

  • Unified learning rate implementation using array_api_compat for backend independence
  • Added warmup support with warmup_steps, warmup_ratio, and warmup_start_factor parameters
  • Introduced stop_ratio as an alternative to stop_lr for flexible configuration
  • Moved warmup logic from training code to the learning rate scheduler itself (see the trainer sketch after this list)
  • Updated TensorFlow backend to wrap the unified implementation
  • Added comprehensive test coverage for all new features
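
As a rough illustration of the points above, a trainer step under the unified scheduler could reduce to something like the following; the helper name, the `param_groups` loop, and the scheduler interface are assumptions for the sketch, not the actual trainer code:

```python
import numpy as np


def update_lr(optimizer, scheduler, step: int) -> float:
    """Set the optimizer LR from the unified scheduler (sketch only)."""
    # scheduler.value() now covers both the warmup ramp and the decay,
    # so the trainer no longer needs an inlined warmup_linear helper.
    cur_lr = float(scheduler.value(np.asarray(step, dtype=np.float64)))
    for group in optimizer.param_groups:  # PyTorch-style param groups (assumed)
        group["lr"] = cur_lr
    return cur_lr
```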

Reviewed changes

Copilot reviewed 36 out of 37 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
deepmd/dpmodel/utils/learning_rate.py Unified learning rate scheduler with array API and warmup support
deepmd/tf/utils/learning_rate.py TensorFlow wrapper using tf.numpy_function
deepmd/utils/argcheck.py Added validation for mutual exclusion of stop_lr/stop_ratio and warmup parameters (a sketch follows this table)
deepmd/pt/train/training.py Removed warmup logic now handled by scheduler
deepmd/pd/train/training.py Removed warmup logic now handled by scheduler
source/tests/universal/dpmodel/utils/test_learning_rate.py Comprehensive unit tests for new features
source/tests/tf/test_lr.py TensorFlow wrapper tests
source/tests/pt/test_lr.py Updated PyTorch tests
source/tests/pd/test_lr.py Updated Paddle tests
source/tests/consistent/test_learning_rate.py Cross-backend consistency tests
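
A minimal sketch of the mutual-exclusion checks referenced for argcheck.py above; the helper names come from the walkthrough, but their bodies, signatures, and error wording here are assumptions:

```python
def _check_lr_stop_args(lr_dict: dict) -> None:
    # At most one of stop_lr / stop_ratio may be given.
    if lr_dict.get("stop_lr") is not None and lr_dict.get("stop_ratio") is not None:
        raise ValueError("stop_lr and stop_ratio are mutually exclusive; set only one.")


def _check_warmup_args(lr_dict: dict) -> None:
    # At most one of warmup_steps / warmup_ratio may be given.
    if lr_dict.get("warmup_steps") is not None and lr_dict.get("warmup_ratio") is not None:
        raise ValueError(
            "warmup_steps and warmup_ratio are mutually exclusive; set only one."
        )
```
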
Comments suppressed due to low confidence (20)

deepmd/utils/argcheck.py:2055

  • This comment appears to contain commented-out code.
# def fitting_global_polar():
#    return fitting_polar()

deepmd/tf/fit/polar.py:171

  • This comment appears to contain commented-out code.
        # if type(self.diag_shift) is not list:
        #    self.diag_shift = [self.diag_shift]

deepmd/tf/train/trainer.py:675

  • This comment appears to contain commented-out code.
    # def print_head (self) :  # depreciated
    #     if self.run_opt.is_chief:
    #         fp = open(self.disp_file, "a")
    #         print_str = "# %5s" % 'batch'
    #         print_str += self.loss.print_header()
    #         print_str += '   %8s\n' % 'lr'
    #         fp.write(print_str)
    #         fp.close ()

deepmd/pd/utils/utils.py:177

  • This comment appears to contain commented-out code.
        # if paddle.any(mask).item():
        #     tanh_part = paddle.tanh(self.slope * (x - self.threshold)) + self.const
        #     return paddle.where(x < self.threshold, silu_part, tanh_part)
        # else:
        #     return silu_part

deepmd/tf/fit/dos.py:452

  • Variable t_dfparam is not used.
            t_dfparam = tf.constant(self.numb_fparam, name="dfparam", dtype=tf.int32)

deepmd/tf/fit/dos.py:453

  • Variable t_daparam is not used.
            t_daparam = tf.constant(self.numb_aparam, name="daparam", dtype=tf.int32)

deepmd/tf/fit/dos.py:454

  • Variable t_numb_dos is not used.
            t_numb_dos = tf.constant(self.numb_dos, name="numb_dos", dtype=tf.int32)

deepmd/tf/fit/ener.py:549

  • Variable t_dfparam is not used.
            t_dfparam = tf.constant(self.numb_fparam, name="dfparam", dtype=tf.int32)

deepmd/tf/fit/ener.py:550

  • Variable t_daparam is not used.
            t_daparam = tf.constant(self.numb_aparam, name="daparam", dtype=tf.int32)

deepmd/tf/fit/polar.py:244

  • Variable mean_polar is not used.
            mean_polar = np.zeros([len(self.sel_type), 9])  # pylint: disable=no-explicit-dtype

source/tests/pd/model/test_model.py:420

  • Variable bdata is not used.

source/tests/pt/model/test_model.py:420

  • Variable bdata is not used.

source/tests/pt/model/test_model.py:419

  • Variable step is not used.

source/tests/pd/model/test_model.py:419

  • Variable step is not used.

deepmd/tf/train/trainer.py:461

  • Variable tb_valid_writer is not used.
            tb_valid_writer = tf.summary.FileWriter(self.tensorboard_log_dir + "/test")

deepmd/tf/train/trainer.py:464

  • Variable tb_valid_writer is not used.
            tb_valid_writer = None

deepmd/tf/train/trainer.py:490

  • Variable fitting_key is not used.
                fitting_key = next_fitting_key

deepmd/pt/train/training.py:1123

  • Variable module is not used.
                module = (

deepmd/tf/train/trainer.py:21

  • Import of 'deepmd' is not used.
import deepmd.tf.op  # noqa: F401

deepmd/pd/train/training.py:403

  • This statement is unreachable.
            self.model = paddle.jit.to_static(self.model)



file(GLOB DEEPMD_LMP_SRC ${CMAKE_CURRENT_LIST_DIR}/*.cpp)

find_package(Torch REQUIRED)
Copilot AI Jan 15, 2026

This line appears to be unrelated to learning rate refactoring. It adds a required Torch dependency to LAMMPS which seems out of scope for this PR focused on learning rate schedulers. If this is intentional, it should be explained in the PR description or moved to a separate PR.

Suggested change
- find_package(Torch REQUIRED)

project(DeePMD)

# generate compile_commands.json
set(CMAKE_EXPORT_COMPILE_COMMANDS ON)
Copilot AI Jan 15, 2026

This change is unrelated to the learning rate refactoring. This developer convenience setting should either be documented in the PR description or moved to a separate PR.

Suggested change
- set(CMAKE_EXPORT_COMPILE_COMMANDS ON)
+ if(NOT DEFINED CMAKE_EXPORT_COMPILE_COMMANDS)
+   set(CMAKE_EXPORT_COMPILE_COMMANDS ON)
+ endif()

# for training dirs
*.out
*.pb
*.hdf5
Copilot AI Jan 15, 2026

This change is unrelated to learning rate refactoring and should be moved to a separate PR or explained in the description.

Suggested change
- *.hdf5

[3]: https://arxiv.org/abs/1805.09003
[4]: https://aip.scitation.org/doi/full/10.1063/1.5027645

Use this command to generate json schema: `python -c "from deepmd.utils.argcheck import gen_json_schema; import json; json.dump(json.loads(gen_json_schema(multi_task=True)), open('/home/outisli/Research/dpmd/deepmd_json_schema.json', 'w'), indent=2)"`
Copilot AI Jan 15, 2026

This line contains a hardcoded personal file path ('/home/outisli/Research/...') which should not be committed. Either remove this line or use a generic path placeholder like '$HOME/deepmd_json_schema.json'.

Suggested change
- Use this command to generate json schema: `python -c "from deepmd.utils.argcheck import gen_json_schema; import json; json.dump(json.loads(gen_json_schema(multi_task=True)), open('/home/outisli/Research/dpmd/deepmd_json_schema.json', 'w'), indent=2)"`
+ Use this command to generate json schema: `python -c "from deepmd.utils.argcheck import gen_json_schema; import json; json.dump(json.loads(gen_json_schema(multi_task=True)), open('$HOME/deepmd_json_schema.json', 'w'), indent=2)"`

error_if_nonfinite=True,
)
with torch.device("cpu"):
with torch.device(env.DEVICE):
Copilot AI Jan 15, 2026

This change from torch.device('cpu') to torch.device(env.DEVICE) is unrelated to learning rate refactoring. While it may be a bug fix, it should be documented in the PR description or moved to a separate commit with proper explanation.

Suggested change
- with torch.device(env.DEVICE):
+ with torch.device(DEVICE):

skip_neighbor_stat=skip_neighbor_stat,
use_pretrain_script=use_pretrain_script,
force_load=force_load,
compile_model=compile_model,
Copilot AI Jan 15, 2026

Keyword argument 'compile_model' is not a supported parameter name of function train.

Suggested change
- compile_model=compile_model,

skip_neighbor_stat=True,
use_pretrain_script=False,
force_load=False,
compile_model=False,
Copilot AI Jan 15, 2026

Keyword argument 'compile_model' is not a supported parameter name of function train.

Suggested change
- compile_model=False,

