
Conversation

@OutisLi (Collaborator) commented Jan 15, 2026

refactor: unify learning rate schedulers with array API

  • Refactor BaseLR in dpmodel to use array_api_compat for backend-agnostic implementation
  • Consolidate learning rate logic from TF/PT/PD backends into unified dpmodel layer
  • Use array API operations (xp.where, xp.clip, etc.) for JIT compatibility across backends (a hedged sketch follows this list)
  • Add warmup support (warmup_steps, warmup_ratio, warmup_start_factor) during refactoring
  • Add stop_ratio parameter as alternative to stop_lr for flexible configuration
  • Implement mutual exclusion validation for stop_lr/stop_ratio and warmup_steps/warmup_ratio
  • Update all backends to use unified BaseLR implementation
  • Add comprehensive consistency tests across NumPy/PyTorch/JAX/array_api_strict backends
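
A minimal sketch of what a backend-agnostic scheduler of this shape could look like, assuming a `BaseLR` base class with a `_decay_value` hook and the parameter names listed above; illustrative only, not the merged code:

```python
import array_api_compat


class BaseLR:
    """Sketch of an array-API learning-rate scheduler with linear warmup."""

    def __init__(self, start_lr, stop_lr, stop_steps,
                 warmup_steps=0, warmup_start_factor=0.0):
        self.start_lr = start_lr
        self.stop_lr = stop_lr
        self.stop_steps = stop_steps
        self.warmup_steps = warmup_steps
        self.warmup_start_factor = warmup_start_factor

    def _decay_value(self, xp, step):
        """Post-warmup decay; implemented by concrete schedulers."""
        raise NotImplementedError

    def value(self, step):
        # The namespace is taken from the input array, so the same code
        # runs on NumPy, PyTorch, JAX, or array_api_strict tensors.
        xp = array_api_compat.array_namespace(step)
        step = xp.astype(step, xp.float64)
        decay_lr = self._decay_value(xp, step - self.warmup_steps)
        if self.warmup_steps <= 0:
            return decay_lr
        # Linear ramp from warmup_start_factor * start_lr up to start_lr.
        frac = xp.clip(step / self.warmup_steps, 0.0, 1.0)
        warmup_lr = self.start_lr * (
            self.warmup_start_factor + (1.0 - self.warmup_start_factor) * frac
        )
        # xp.where keeps both branches traceable, which is what makes the
        # scheduler JIT-friendly across backends.
        return xp.where(step < self.warmup_steps, warmup_lr, decay_lr)
```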

Summary by CodeRabbit

Release Notes

  • New Features

    • Added debugging utilities for model compression, training, and inference profiling
    • Enhanced learning rate schedulers with optional warmup phase support
  • Documentation

    • Added comprehensive guides for DPA3 architecture and model compression
    • Added installation guide for building from source
    • Added development guidance documentation
  • Refactor

    • Refactored learning rate configuration to use dict-based parameters
    • Simplified training warmup handling through unified scheduler logic
    • Updated pre-commit hooks to use mirror endpoints
  • Tests

    • Added extensive learning rate scheduler test coverage including warmup scenarios
    • Enhanced test validation for parameter interactions and edge cases
  • Chores

    • Extended repository ignore patterns for development tools and build artifacts
    • Updated docstring type hints for improved documentation accuracy


feat: add multiple runs for infer scripts

feat: add pytorch profiler for infer debug

ignore profiler files in gitignore
Copilot AI review requested due to automatic review settings January 15, 2026 04:25
@OutisLi OutisLi closed this Jan 15, 2026
@dosubot dosubot bot added the enhancement label Jan 15, 2026
@OutisLi OutisLi deleted the self/lr branch January 15, 2026 04:27
@coderabbitai bot (Contributor) commented Jan 15, 2026

Caution

Review failed

The pull request is closed.

📝 Walkthrough

This PR introduces a comprehensive learning rate scheduler refactor with warmup support, adds debug utilities for model training/compression/inference/testing, expands documentation (DPA3, compression, installation), updates configuration files, and modernizes learning rate implementations across TensorFlow, PyTorch, and dpmodel backends.

Changes

Cohort / File(s) Summary
Configuration & Build
.gitignore, .pre-commit-config.yaml, examples/water/.gitignore, source/CMakeLists.txt, source/lmp/builtin.cmake
Added ignore patterns (.claude, .spec-workflow, .serena, tfevents, torch traces); replaced GitHub URLs with gh-proxy mirrors in pre-commit config; enabled CMAKE_EXPORT_COMPILE_COMMANDS; added Torch as build requirement for LAMMPS.
Documentation - Project Guidance
AGENTS.md, CLAUDE.md, README.md
Renamed/restructured AGENTS.md into comprehensive CLAUDE.md with architecture, data flow, compression concepts, and development patterns; added CLAUDE.md file; appended usage example to README.md.
Documentation - Technical Guides
doc/outisli/DPA3.md, doc/outisli/compress.md, doc/outisli/install.md
Added extensive documentation for DPA3 PyTorch implementation (2630 lines), compression workflow (650 lines), and source installation guide (307 lines) covering Linux/macOS and GPU/CPU configurations.
Debug Utilities
debug/train_debug.py, debug/compress_debug.py, debug/dptest_debug.py, debug/inference_debug.py, debug/train_debug_gradient.py
Added five new debug scripts: train_debug.py (basic training), compress_debug.py (model compression), dptest_debug.py (model inference testing with timing), inference_debug.py (detailed inference profiling with PyTorch profiler), train_debug_gradient.py (NaN/Inf detection during training).
Learning Rate System - Core Refactor
deepmd/dpmodel/utils/learning_rate.py, deepmd/tf/utils/learning_rate.py
Introduced warmup support in BaseLR with _decay_value abstract method; LearningRateExp and LearningRateCosine now implement post-warmup decay logic; TensorFlow's LearningRateExp renamed to LearningRateSchedule with dict-based configuration and lazy backend instantiation (an illustrative decay sketch follows this table).
Learning Rate System - Validation
deepmd/utils/argcheck.py
Refactored argument validation: added _check_lr_stop_args, _check_warmup_args, _learning_rate_common_args helpers; removed legacy warmup args from training_args; integrated mutual-exclusion checks for LR stop/warmup parameters.
Learning Rate System - Integration
deepmd/tf/train/trainer.py, deepmd/tf/utils/__init__.py
Updated trainer to use LearningRateSchedule instead of LearningRateExp; refactored get_lr_and_coef to accept dict-based parameters; changed public export from LearningRateExp to LearningRateSchedule.
Training Code Updates - Warmup Removal
deepmd/pd/train/training.py, deepmd/pt/train/training.py
Removed dynamic warmup configuration from both PD and PT trainers; stop_steps now equals total steps; simplified LR scheduling via lr_exp.value(step); replaced hardcoded "cpu" with env.DEVICE; removed inlined warmup_linear function.
Utility Type Expansions
deepmd/pd/utils/utils.py, deepmd/pt/utils/utils.py
Enhanced to_numpy_array to accept float/int/np.ndarray inputs in addition to torch.Tensor; returns 0-d arrays for scalars; updated type hints.
Type Annotations - Docstrings
deepmd/tf/fit/dipole.py, deepmd/tf/fit/dos.py, deepmd/tf/fit/ener.py, deepmd/tf/fit/fitting.py, deepmd/tf/fit/polar.py
Updated docstrings: changed lr parameter type from LearningRateExp to LearningRateSchedule in get_loss methods (no runtime changes).
Test Suite - Learning Rate
source/tests/consistent/test_learning_rate.py, source/tests/universal/dpmodel/utils/test_learning_rate.py, source/tests/tf/test_lr.py, source/tests/pd/test_lr.py, source/tests/pt/test_lr.py
Added warmup test support to consistent tests; added 240-line universal test suite for LR with/without warmup, array broadcasting, boundary conditions; new TensorFlow LearningRateSchedule validation tests (build, start_lr accessor, error handling); updated existing PD/PT test suites to use LearningRateSchedule dict-based config.
Test Suite - Model Integration
source/tests/pd/model/test_model.py, source/tests/pt/model/test_model.py
Updated _get_dp_lr to construct LearningRateSchedule with dict-based parameters instead of LearningRateExp direct instantiation.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes


Suggested labels

learning-rate, refactoring, documentation, testing

Suggested reviewers

  • wanghan-iapcm
  • njzjz


📜 Recent review details

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2a9667e and 573a7e3.

📒 Files selected for processing (37)
  • .gitignore
  • .pre-commit-config.yaml
  • AGENTS.md
  • CLAUDE.md
  • README.md
  • debug/compress_debug.py
  • debug/dptest_debug.py
  • debug/inference_debug.py
  • debug/train_debug.py
  • debug/train_debug_gradient.py
  • deepmd/dpmodel/utils/learning_rate.py
  • deepmd/pd/train/training.py
  • deepmd/pd/utils/utils.py
  • deepmd/pt/train/training.py
  • deepmd/pt/utils/utils.py
  • deepmd/tf/fit/dipole.py
  • deepmd/tf/fit/dos.py
  • deepmd/tf/fit/ener.py
  • deepmd/tf/fit/fitting.py
  • deepmd/tf/fit/polar.py
  • deepmd/tf/train/trainer.py
  • deepmd/tf/utils/__init__.py
  • deepmd/tf/utils/learning_rate.py
  • deepmd/utils/argcheck.py
  • doc/outisli/DPA3.md
  • doc/outisli/compress.md
  • doc/outisli/install.md
  • examples/water/.gitignore
  • source/CMakeLists.txt
  • source/lmp/builtin.cmake
  • source/tests/consistent/test_learning_rate.py
  • source/tests/pd/model/test_model.py
  • source/tests/pd/test_lr.py
  • source/tests/pt/model/test_model.py
  • source/tests/pt/test_lr.py
  • source/tests/tf/test_lr.py
  • source/tests/universal/dpmodel/utils/test_learning_rate.py


Copilot AI (Contributor) left a comment

Pull request overview

This pull request refactors learning rate schedulers to use array API for backend-agnostic implementation, consolidating previously backend-specific logic into a unified dpmodel layer. The refactoring adds warmup support and flexible configuration options while maintaining JIT compatibility across all backends.

Changes:

  • Unified learning rate implementation using array_api_compat for backend independence
  • Added warmup support with warmup_steps, warmup_ratio, and warmup_start_factor parameters
  • Introduced stop_ratio as an alternative to stop_lr for flexible configuration
  • Moved warmup logic from training code to the learning rate scheduler itself (see the trainer sketch after this list)
  • Updated TensorFlow backend to wrap the unified implementation
  • Added comprehensive test coverage for all new features
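
As a rough illustration of the points above, a trainer step under the unified scheduler could reduce to something like the following; the helper name, the `param_groups` loop, and the scheduler interface are assumptions for the sketch, not the actual trainer code:

```python
import numpy as np


def update_lr(optimizer, scheduler, step: int) -> float:
    """Set the optimizer LR from the unified scheduler (sketch only)."""
    # scheduler.value() now covers both the warmup ramp and the decay,
    # so the trainer no longer needs an inlined warmup_linear helper.
    cur_lr = float(scheduler.value(np.asarray(step, dtype=np.float64)))
    for group in optimizer.param_groups:  # PyTorch-style param groups (assumed)
        group["lr"] = cur_lr
    return cur_lr
```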

Reviewed changes

Copilot reviewed 36 out of 37 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
deepmd/dpmodel/utils/learning_rate.py Unified learning rate scheduler with array API and warmup support
deepmd/tf/utils/learning_rate.py TensorFlow wrapper using tf.numpy_function
deepmd/utils/argcheck.py Added validation for mutual exclusion of stop_lr/stop_ratio and warmup parameters (a sketch follows this table)
deepmd/pt/train/training.py Removed warmup logic now handled by scheduler
deepmd/pd/train/training.py Removed warmup logic now handled by scheduler
source/tests/universal/dpmodel/utils/test_learning_rate.py Comprehensive unit tests for new features
source/tests/tf/test_lr.py TensorFlow wrapper tests
source/tests/pt/test_lr.py Updated PyTorch tests
source/tests/pd/test_lr.py Updated Paddle tests
source/tests/consistent/test_learning_rate.py Cross-backend consistency tests
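
A minimal sketch of the mutual-exclusion checks referenced for argcheck.py above; the helper names come from the walkthrough, but their bodies, signatures, and error wording here are assumptions:

```python
def _check_lr_stop_args(lr_dict: dict) -> None:
    # At most one of stop_lr / stop_ratio may be given.
    if lr_dict.get("stop_lr") is not None and lr_dict.get("stop_ratio") is not None:
        raise ValueError("stop_lr and stop_ratio are mutually exclusive; set only one.")


def _check_warmup_args(lr_dict: dict) -> None:
    # At most one of warmup_steps / warmup_ratio may be given.
    if lr_dict.get("warmup_steps") is not None and lr_dict.get("warmup_ratio") is not None:
        raise ValueError(
            "warmup_steps and warmup_ratio are mutually exclusive; set only one."
        )
```
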
Comments suppressed due to low confidence (20)

deepmd/utils/argcheck.py:2055

  • This comment appears to contain commented-out code.
# def fitting_global_polar():
#    return fitting_polar()

deepmd/tf/fit/polar.py:171

  • This comment appears to contain commented-out code.
        # if type(self.diag_shift) is not list:
        #    self.diag_shift = [self.diag_shift]

deepmd/tf/train/trainer.py:675

  • This comment appears to contain commented-out code.
    # def print_head (self) :  # depreciated
    #     if self.run_opt.is_chief:
    #         fp = open(self.disp_file, "a")
    #         print_str = "# %5s" % 'batch'
    #         print_str += self.loss.print_header()
    #         print_str += '   %8s\n' % 'lr'
    #         fp.write(print_str)
    #         fp.close ()

deepmd/pd/utils/utils.py:177

  • This comment appears to contain commented-out code.
        # if paddle.any(mask).item():
        #     tanh_part = paddle.tanh(self.slope * (x - self.threshold)) + self.const
        #     return paddle.where(x < self.threshold, silu_part, tanh_part)
        # else:
        #     return silu_part

deepmd/tf/fit/dos.py:452

  • Variable t_dfparam is not used.
            t_dfparam = tf.constant(self.numb_fparam, name="dfparam", dtype=tf.int32)

deepmd/tf/fit/dos.py:453

  • Variable t_daparam is not used.
            t_daparam = tf.constant(self.numb_aparam, name="daparam", dtype=tf.int32)

deepmd/tf/fit/dos.py:454

  • Variable t_numb_dos is not used.
            t_numb_dos = tf.constant(self.numb_dos, name="numb_dos", dtype=tf.int32)

deepmd/tf/fit/ener.py:549

  • Variable t_dfparam is not used.
            t_dfparam = tf.constant(self.numb_fparam, name="dfparam", dtype=tf.int32)

deepmd/tf/fit/ener.py:550

  • Variable t_daparam is not used.
            t_daparam = tf.constant(self.numb_aparam, name="daparam", dtype=tf.int32)

deepmd/tf/fit/polar.py:244

  • Variable mean_polar is not used.
            mean_polar = np.zeros([len(self.sel_type), 9])  # pylint: disable=no-explicit-dtype

source/tests/pd/model/test_model.py:420

  • Variable bdata is not used.

source/tests/pt/model/test_model.py:420

  • Variable bdata is not used.

source/tests/pt/model/test_model.py:419

  • Variable step is not used.

source/tests/pd/model/test_model.py:419

  • Variable step is not used.

deepmd/tf/train/trainer.py:461

  • Variable tb_valid_writer is not used.
            tb_valid_writer = tf.summary.FileWriter(self.tensorboard_log_dir + "/test")

deepmd/tf/train/trainer.py:464

  • Variable tb_valid_writer is not used.
            tb_valid_writer = None

deepmd/tf/train/trainer.py:490

  • Variable fitting_key is not used.
                fitting_key = next_fitting_key

deepmd/pt/train/training.py:1123

  • Variable module is not used.
                module = (

deepmd/tf/train/trainer.py:21

  • Import of 'deepmd' is not used.
import deepmd.tf.op  # noqa: F401

deepmd/pd/train/training.py:403

  • This statement is unreachable.
            self.model = paddle.jit.to_static(self.model)



file(GLOB DEEPMD_LMP_SRC ${CMAKE_CURRENT_LIST_DIR}/*.cpp)

find_package(Torch REQUIRED)
Copilot AI Jan 15, 2026

This line appears to be unrelated to learning rate refactoring. It adds a required Torch dependency to LAMMPS which seems out of scope for this PR focused on learning rate schedulers. If this is intentional, it should be explained in the PR description or moved to a separate PR.

Suggested change
- find_package(Torch REQUIRED)

project(DeePMD)

# generate compile_commands.json
set(CMAKE_EXPORT_COMPILE_COMMANDS ON)
Copilot AI Jan 15, 2026

This change is unrelated to the learning rate refactoring. This developer convenience setting should either be documented in the PR description or moved to a separate PR.

Suggested change
- set(CMAKE_EXPORT_COMPILE_COMMANDS ON)
+ if(NOT DEFINED CMAKE_EXPORT_COMPILE_COMMANDS)
+   set(CMAKE_EXPORT_COMPILE_COMMANDS ON)
+ endif()

# for training dirs
*.out
*.pb
*.hdf5
Copilot AI Jan 15, 2026

This change is unrelated to learning rate refactoring and should be moved to a separate PR or explained in the description.

Suggested change
- *.hdf5

[3]: https://arxiv.org/abs/1805.09003
[4]: https://aip.scitation.org/doi/full/10.1063/1.5027645

Use this command to generate json schema: `python -c "from deepmd.utils.argcheck import gen_json_schema; import json; json.dump(json.loads(gen_json_schema(multi_task=True)), open('/home/outisli/Research/dpmd/deepmd_json_schema.json', 'w'), indent=2)"`
Copilot AI Jan 15, 2026

This line contains a hardcoded personal file path ('/home/outisli/Research/...') which should not be committed. Either remove this line or use a generic path placeholder like '$HOME/deepmd_json_schema.json'.

Suggested change
- Use this command to generate json schema: `python -c "from deepmd.utils.argcheck import gen_json_schema; import json; json.dump(json.loads(gen_json_schema(multi_task=True)), open('/home/outisli/Research/dpmd/deepmd_json_schema.json', 'w'), indent=2)"`
+ Use this command to generate json schema: `python -c "from deepmd.utils.argcheck import gen_json_schema; import json; json.dump(json.loads(gen_json_schema(multi_task=True)), open('$HOME/deepmd_json_schema.json', 'w'), indent=2)"`

error_if_nonfinite=True,
)
with torch.device("cpu"):
with torch.device(env.DEVICE):
Copilot AI Jan 15, 2026

This change from torch.device('cpu') to torch.device(env.DEVICE) is unrelated to learning rate refactoring. While it may be a bug fix, it should be documented in the PR description or moved to a separate commit with proper explanation.

Suggested change
- with torch.device(env.DEVICE):
+ with torch.device(DEVICE):

skip_neighbor_stat=skip_neighbor_stat,
use_pretrain_script=use_pretrain_script,
force_load=force_load,
compile_model=compile_model,
Copilot AI Jan 15, 2026

Keyword argument 'compile_model' is not a supported parameter name of function train.

Suggested change
- compile_model=compile_model,

skip_neighbor_stat=True,
use_pretrain_script=False,
force_load=False,
compile_model=False,
Copilot AI Jan 15, 2026

Keyword argument 'compile_model' is not a supported parameter name of function train.

Suggested change
- compile_model=False,

