refactor: unify learning rate schedulers with array API #5153
Conversation
feat: add inference script
feat: add multiple runs for infer scripts
feat: add pytorch profiler for infer debug
ignore profiler files in gitignore
- Refactor BaseLR in dpmodel to use array_api_compat for backend-agnostic implementation
- Consolidate learning rate logic from TF/PT/PD backends into unified dpmodel layer
- Use array API operations (xp.where, xp.clip, etc.) for JIT compatibility across backends
- Add warmup support (warmup_steps, warmup_ratio, warmup_start_factor) during refactoring
- Add stop_ratio parameter as alternative to stop_lr for flexible configuration
- Implement mutual exclusion validation for stop_lr/stop_ratio and warmup_steps/warmup_ratio
- Update all backends to use unified BaseLR implementation
- Add comprehensive consistency tests across NumPy/PyTorch/JAX/array_api_strict backends
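For orientation, here is a minimal sketch of what such a backend-agnostic scheduler can look like. It is not the actual `BaseLR` from this PR; the class name, defaults, and decay formula are illustrative assumptions. It only demonstrates the technique named above: dispatching through `array_api_compat` and using `xp.where`/`xp.clip` so one implementation serves NumPy, PyTorch, JAX, and array_api_strict.

```python
# Minimal sketch (NOT the real deepmd BaseLR): a single exponential-decay
# scheduler with linear warmup, written against the array API so the same
# code runs on NumPy, PyTorch, JAX, and array_api_strict inputs.
import array_api_compat


class ExpDecayWithWarmup:
    def __init__(
        self,
        start_lr,
        stop_lr=None,
        stop_ratio=None,
        decay_steps=5000,
        stop_steps=1_000_000,
        warmup_steps=0,
        warmup_start_factor=0.0,
    ):
        # mutual exclusion: exactly one of stop_lr / stop_ratio may be given
        if (stop_lr is None) == (stop_ratio is None):
            raise ValueError("set exactly one of stop_lr and stop_ratio")
        self.start_lr = start_lr
        self.stop_lr = stop_lr if stop_lr is not None else start_lr * stop_ratio
        self.decay_steps = decay_steps
        self.warmup_steps = warmup_steps
        self.warmup_start_factor = warmup_start_factor
        # decay rate chosen so the LR reaches stop_lr around stop_steps
        decay_span = max(stop_steps - warmup_steps, decay_steps)
        self.decay_rate = (self.stop_lr / self.start_lr) ** (decay_steps / decay_span)

    def value(self, step):
        """Learning rate at `step`, where `step` is an array from any backend."""
        xp = array_api_compat.array_namespace(step)
        step = xp.astype(step, xp.float64)
        # linear warmup from warmup_start_factor * start_lr up to start_lr
        warm = self.start_lr * (
            self.warmup_start_factor
            + (1.0 - self.warmup_start_factor) * step / max(self.warmup_steps, 1)
        )
        # stepwise exponential decay after warmup, floored at stop_lr
        n_decay = xp.floor((step - self.warmup_steps) / self.decay_steps)
        decayed = xp.clip(
            self.start_lr * self.decay_rate**n_decay, self.stop_lr, self.start_lr
        )
        # xp.where keeps the computation branch-free and JIT/trace friendly
        return xp.where(step < self.warmup_steps, warm, decayed)
```

Because every operation is resolved through the namespace of the input `step`, the same `value()` call can be traced or JIT-compiled by each backend, which is the JIT-compatibility point made in the description.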
Caution: Review failed. The pull request is closed.

📝 Walkthrough
This PR introduces a comprehensive learning rate scheduler refactor with warmup support, adds debug utilities for model training/compression/inference/testing, expands documentation (DPA3, compression, installation), updates configuration files, and modernizes learning rate implementations across TensorFlow, PyTorch, and dpmodel backends.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
Pull request overview
This pull request refactors learning rate schedulers to use array API for backend-agnostic implementation, consolidating previously backend-specific logic into a unified dpmodel layer. The refactoring adds warmup support and flexible configuration options while maintaining JIT compatibility across all backends.
Changes:
- Unified learning rate implementation using `array_api_compat` for backend independence
- Added warmup support with `warmup_steps`, `warmup_ratio`, and `warmup_start_factor` parameters
- Introduced `stop_ratio` as an alternative to `stop_lr` for flexible configuration
- Moved warmup logic from training code to the learning rate scheduler itself
- Updated TensorFlow backend to wrap the unified implementation
- Added comprehensive test coverage for all new features
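As a rough usage illustration (the key names follow the PR description, while the `exp` type and the concrete values are assumptions, not taken from this PR's inputs), a `learning_rate` section using the new options might look like:

```python
# Hypothetical input snippet, written as the Python-dict form of the
# "learning_rate" section. Key names follow the PR description; the
# concrete values are only illustrative.
learning_rate = {
    "type": "exp",
    "start_lr": 1e-3,
    "stop_ratio": 1e-5,         # alternative to "stop_lr"; set only one of them
    "decay_steps": 5000,
    "warmup_steps": 1000,       # or "warmup_ratio"; likewise mutually exclusive
    "warmup_start_factor": 0.1,
}
```

Setting both `stop_lr` and `stop_ratio`, or both `warmup_steps` and `warmup_ratio`, is rejected by the mutual-exclusion validation added in `deepmd/utils/argcheck.py`.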
Reviewed changes
Copilot reviewed 36 out of 37 changed files in this pull request and generated 7 comments.
Summary per file:
| File | Description |
|---|---|
| deepmd/dpmodel/utils/learning_rate.py | Unified learning rate scheduler with array API and warmup support |
| deepmd/tf/utils/learning_rate.py | TensorFlow wrapper using tf.numpy_function (see the sketch after this table) |
| deepmd/utils/argcheck.py | Added validation for mutual exclusion of stop_lr/stop_ratio and warmup parameters |
| deepmd/pt/train/training.py | Removed warmup logic now handled by scheduler |
| deepmd/pd/train/training.py | Removed warmup logic now handled by scheduler |
| source/tests/universal/dpmodel/utils/test_learning_rate.py | Comprehensive unit tests for new features |
| source/tests/tf/test_lr.py | TensorFlow wrapper tests |
| source/tests/pt/test_lr.py | Updated PyTorch tests |
| source/tests/pd/test_lr.py | Updated Paddle tests |
| source/tests/consistent/test_learning_rate.py | Cross-backend consistency tests |
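The `tf.numpy_function` wrapper mentioned for `deepmd/tf/utils/learning_rate.py` can be pictured roughly as follows. This is a hedged sketch rather than the actual wrapper: the `_lr_numpy` helper, the `build_lr` function, and the constants are hypothetical, and only the pattern of calling a NumPy-backed scheduler from the TensorFlow graph via `tf.numpy_function` reflects the file description above.

```python
# Sketch only: expose a NumPy-backed scheduler to the TensorFlow graph via
# tf.numpy_function. Names and constants here are hypothetical.
import numpy as np
import tensorflow as tf

START_LR, STOP_LR, DECAY_STEPS = 1.0e-3, 1.0e-8, 5000
DECAY_RATE = (STOP_LR / START_LR) ** (DECAY_STEPS / 1_000_000)


def _lr_numpy(step):
    # plain NumPy exponential decay standing in for the unified dpmodel scheduler
    lr = START_LR * DECAY_RATE ** np.floor(step / DECAY_STEPS)
    return np.asarray(max(float(lr), STOP_LR), dtype=np.float64)


def build_lr(global_step):
    # the Python callback runs outside the graph, so the same NumPy scheduler
    # drives TF training without reimplementing the decay in TF ops
    lr = tf.numpy_function(_lr_numpy, [tf.cast(global_step, tf.float64)], tf.float64)
    lr.set_shape([])  # static shape is lost across the py-callback boundary
    return lr
```

The trade-off of this pattern is that the Python callback sits outside TensorFlow's graph optimizations, which is usually acceptable for a once-per-step scalar such as the learning rate.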
Comments suppressed due to low confidence (20)
deepmd/utils/argcheck.py:2055
- This comment appears to contain commented-out code.
# def fitting_global_polar():
# return fitting_polar()
deepmd/tf/fit/polar.py:171
- This comment appears to contain commented-out code.
# if type(self.diag_shift) is not list:
# self.diag_shift = [self.diag_shift]
deepmd/tf/train/trainer.py:675
- This comment appears to contain commented-out code.
# def print_head (self) : # depreciated
# if self.run_opt.is_chief:
# fp = open(self.disp_file, "a")
# print_str = "# %5s" % 'batch'
# print_str += self.loss.print_header()
# print_str += ' %8s\n' % 'lr'
# fp.write(print_str)
# fp.close ()
deepmd/pd/utils/utils.py:177
- This comment appears to contain commented-out code.
# if paddle.any(mask).item():
# tanh_part = paddle.tanh(self.slope * (x - self.threshold)) + self.const
# return paddle.where(x < self.threshold, silu_part, tanh_part)
# else:
# return silu_part
deepmd/tf/fit/dos.py:452
- Variable t_dfparam is not used.
t_dfparam = tf.constant(self.numb_fparam, name="dfparam", dtype=tf.int32)
deepmd/tf/fit/dos.py:453
- Variable t_daparam is not used.
t_daparam = tf.constant(self.numb_aparam, name="daparam", dtype=tf.int32)
deepmd/tf/fit/dos.py:454
- Variable t_numb_dos is not used.
t_numb_dos = tf.constant(self.numb_dos, name="numb_dos", dtype=tf.int32)
deepmd/tf/fit/ener.py:549
- Variable t_dfparam is not used.
t_dfparam = tf.constant(self.numb_fparam, name="dfparam", dtype=tf.int32)
deepmd/tf/fit/ener.py:550
- Variable t_daparam is not used.
t_daparam = tf.constant(self.numb_aparam, name="daparam", dtype=tf.int32)
deepmd/tf/fit/polar.py:244
- Variable mean_polar is not used.
mean_polar = np.zeros([len(self.sel_type), 9]) # pylint: disable=no-explicit-dtype
source/tests/pd/model/test_model.py:420
- Variable bdata is not used.
source/tests/pt/model/test_model.py:420
- Variable bdata is not used.
source/tests/pt/model/test_model.py:419
- Variable step is not used.
source/tests/pd/model/test_model.py:419
- Variable step is not used.
deepmd/tf/train/trainer.py:461
- Variable tb_valid_writer is not used.
tb_valid_writer = tf.summary.FileWriter(self.tensorboard_log_dir + "/test")
deepmd/tf/train/trainer.py:464
- Variable tb_valid_writer is not used.
tb_valid_writer = None
deepmd/tf/train/trainer.py:490
- Variable fitting_key is not used.
fitting_key = next_fitting_key
deepmd/pt/train/training.py:1123
- Variable module is not used.
module = (
deepmd/tf/train/trainer.py:21
- Import of 'deepmd' is not used.
import deepmd.tf.op # noqa: F401
deepmd/pd/train/training.py:403
- This statement is unreachable.
self.model = paddle.jit.to_static(self.model)
file(GLOB DEEPMD_LMP_SRC ${CMAKE_CURRENT_LIST_DIR}/*.cpp)
find_package(Torch REQUIRED)
Copilot AI (Jan 15, 2026)
This line appears to be unrelated to learning rate refactoring. It adds a required Torch dependency to LAMMPS which seems out of scope for this PR focused on learning rate schedulers. If this is intentional, it should be explained in the PR description or moved to a separate PR.
project(DeePMD)

# generate compile_commands.json
set(CMAKE_EXPORT_COMPILE_COMMANDS ON)
Copilot AI (Jan 15, 2026)
This change is unrelated to the learning rate refactoring. This developer convenience setting should either be documented in the PR description or moved to a separate PR.
Suggested change: replace
set(CMAKE_EXPORT_COMPILE_COMMANDS ON)
with
if(NOT DEFINED CMAKE_EXPORT_COMPILE_COMMANDS)
  set(CMAKE_EXPORT_COMPILE_COMMANDS ON)
endif()
# for training dirs
*.out
*.pb
*.hdf5
Copilot AI (Jan 15, 2026)
This change is unrelated to learning rate refactoring and should be moved to a separate PR or explained in the description.
[3]: https://arxiv.org/abs/1805.09003
[4]: https://aip.scitation.org/doi/full/10.1063/1.5027645

Use this command to generate json schema: `python -c "from deepmd.utils.argcheck import gen_json_schema; import json; json.dump(json.loads(gen_json_schema(multi_task=True)), open('/home/outisli/Research/dpmd/deepmd_json_schema.json', 'w'), indent=2)"`
Copilot AI (Jan 15, 2026)
This line contains a hardcoded personal file path ('/home/outisli/Research/...') which should not be committed. Either remove this line or use a generic path placeholder like '$HOME/deepmd_json_schema.json'.
Suggested change:
Use this command to generate json schema: `python -c "from deepmd.utils.argcheck import gen_json_schema; import json; json.dump(json.loads(gen_json_schema(multi_task=True)), open('$HOME/deepmd_json_schema.json', 'w'), indent=2)"`
 error_if_nonfinite=True,
 )
-with torch.device("cpu"):
+with torch.device(env.DEVICE):
Copilot AI (Jan 15, 2026)
This change from torch.device('cpu') to torch.device(env.DEVICE) is unrelated to learning rate refactoring. While it may be a bug fix, it should be documented in the PR description or moved to a separate commit with proper explanation.
Suggested change:
with torch.device(DEVICE):
skip_neighbor_stat=skip_neighbor_stat,
use_pretrain_script=use_pretrain_script,
force_load=force_load,
compile_model=compile_model,
Copilot AI (Jan 15, 2026)
Keyword argument 'compile_model' is not a supported parameter name of function train.
skip_neighbor_stat=True,
use_pretrain_script=False,
force_load=False,
compile_model=False,
Copilot AI (Jan 15, 2026)
Keyword argument 'compile_model' is not a supported parameter name of function train.
Summary by CodeRabbit
Release Notes: New Features, Documentation, Refactor, Tests, Chores