docs: update learning rate documentation for warmup and stop_lr_rate
- Add comprehensive documentation for warmup parameters (warmup_steps, warmup_ratio, warmup_start_factor)
- Document stop_lr_rate as an alternative to stop_lr for ratio-based specification
- Add complete documentation for cosine annealing scheduler
- Update theory section with separate warmup and decay phase formulas
- Provide 7 configuration examples covering both exp and cosine types with warmup variants
@@ -6,44 +6,215 @@ In this section, we will take `$deepmd_source_dir/examples/water/se_e2_a/input.j
### Theory
The learning rate schedule consists of two phases: an optional warmup phase followed by a decay phase.
#### Warmup phase (optional)
During the warmup phase (steps $0 \leq \tau < \tau^{\text{warmup}}$), the learning rate increases linearly from an initial warmup learning rate to the target starting learning rate:
```math
\gamma(\tau) = \gamma^{\text{warmup}} + \left(\gamma^0 - \gamma^{\text{warmup}}\right) \frac{\tau}{\tau^{\text{warmup}}},
```
where $\gamma^{\text{warmup}} = f^{\text{warmup}} \cdot \gamma^0$ is the initial warmup learning rate, $f^{\text{warmup}} \in [0, 1]$ is the warmup start factor (default 0.0), and $\tau^{\text{warmup}} \in \mathbb{N}$ is the number of warmup steps.
#### Decay phase
After the warmup phase (steps $\tau \geq \tau^{\text{warmup}}$), the learning rate decays according to the selected schedule type.

**Exponential decay (`type: "exp"`):**

The learning rate decays exponentially:
```math
\gamma(\tau) = \gamma^0 r ^ {\lfloor (\tau - \tau^{\text{warmup}})/s \rfloor},
```
where $\tau \in \mathbb{N}$ is the index of the training step, $\gamma^0 \in \mathbb{R}$ is the learning rate at the start of the decay phase (i.e., after warmup), and the decay rate $r$ is given by
```math
r = {\left(\frac{\gamma^{\text{stop}}}{\gamma^0}\right )} ^{\frac{s}{\tau^{\text{decay}}}},
```
where $\tau^{\text{decay}} = \tau^{\text{stop}} - \tau^{\text{warmup}}$ is the length of the decay phase, $\tau^{\text{stop}} \in \mathbb{N}$ is the total number of training steps, $\gamma^{\text{stop}} \in \mathbb{R}$ is the stopping learning rate, and $s \in \mathbb{N}$ is the decay steps, i.e., the interval at which the learning rate is updated.
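For example, with $\gamma^0 = 10^{-3}$, $\gamma^{\text{stop}} = 10^{-6}$, $s = 5000$, and $\tau^{\text{decay}} = 10^{6}$ (values matching the configuration examples below, assuming $10^{6}$ total training steps and no warmup), the decay rate is $r = (10^{-3})^{5000/10^{6}} = 10^{-0.015} \approx 0.966$, i.e., the learning rate is multiplied by roughly 0.966 every 5000 steps.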
**Cosine annealing (`type: "cosine"`):**

The learning rate follows a cosine annealing schedule: it smoothly decreases from $\gamma^0$ to $\gamma^{\text{stop}}$ following a cosine curve over the decay phase.
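A minimal sketch of this schedule, assuming the standard cosine annealing form over the decay phase:

```math
\gamma(\tau) = \gamma^{\text{stop}} + \frac{1}{2}\left(\gamma^0 - \gamma^{\text{stop}}\right)\left[1 + \cos\left(\pi \frac{\tau - \tau^{\text{warmup}}}{\tau^{\text{decay}}}\right)\right], \qquad \tau^{\text{warmup}} \leq \tau \leq \tau^{\text{stop}}.
```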
For both schedule types, the stopping learning rate can be specified directly as $\gamma^{\text{stop}}$ or as a ratio: $\gamma^{\text{stop}} = \rho^{\text{stop}} \cdot \gamma^0$, where $\rho^{\text{stop}} \in (0, 1]$ is the stopping learning rate ratio.
[^1]
[^1]: This section is built upon Jinzhe Zeng, Duo Zhang, Denghui Lu, Pinghui Mo, Zeyu Li, Yixiao Chen, Marián Rynik, Li'ang Huang, Ziyao Li, Shaochen Shi, Yingze Wang, Haotian Ye, Ping Tuo, Jiabin Yang, Ye Ding, Yifan Li, Davide Tisi, Qiyu Zeng, Han Bao, Yu Xia, Jiameng Huang, Koki Muraoka, Yibo Wang, Junhan Chang, Fengbo Yuan, Sigbjørn Løland Bore, Chun Cai, Yinnian Lin, Bo Wang, Jiayan Xu, Jia-Xin Zhu, Chenxing Luo, Yuzhi Zhang, Rhys E. A. Goodall, Wenshuo Liang, Anurag Kumar Singh, Sikai Yao, Jingchao Zhang, Renata Wentzcovitch, Jiequn Han, Jie Liu, Weile Jia, Darrin M. York, Weinan E, Roberto Car, Linfeng Zhang, Han Wang, [J. Chem. Phys. 159, 054801 (2023)](https://doi.org/10.1063/5.0155600) licensed under a [Creative Commons Attribution (CC BY) license](http://creativecommons.org/licenses/by/4.0/).
### Instructions
DeePMD-kit supports two types of learning rate schedules: exponential decay (`type: "exp"`) and cosine annealing (`type: "cosine"`). Both types support optional warmup and can use either absolute stopping learning rate or a ratio-based specification.
#### Exponential decay schedule
The {ref}`learning_rate <learning_rate>` section for exponential decay in `input.json` is given as follows
```json
"learning_rate" :{
    "type": "exp",
    "start_lr": 0.001,
    "stop_lr": 1e-6,
    "decay_steps": 5000,
    "_comment": "that's all"
}
```
#### Basic parameters
**Common parameters for both `exp` and `cosine` types:**
- {ref}`start_lr <learning_rate[exp]/start_lr>` gives the learning rate at the start of the decay phase (i.e., after warmup if enabled). It should be set appropriately based on the model architecture and dataset.
- {ref}`stop_lr <learning_rate[exp]/stop_lr>` gives the target learning rate at the end of the training. It should be small enough to ensure that the network parameters satisfactorily converge. This parameter is mutually exclusive with {ref}`stop_lr_rate <learning_rate[exp]/stop_lr_rate>`.
- {ref}`stop_lr_rate <learning_rate[exp]/stop_lr_rate>` (optional) specifies the stopping learning rate as a ratio of {ref}`start_lr <learning_rate[exp]/start_lr>`. For example, `stop_lr_rate: 1e-3` means `stop_lr = start_lr * 1e-3`. This parameter is mutually exclusive with {ref}`stop_lr <learning_rate[exp]/stop_lr>`. Either {ref}`stop_lr <learning_rate[exp]/stop_lr>` or {ref}`stop_lr_rate <learning_rate[exp]/stop_lr_rate>` must be provided.
**Additional parameter for `exp` type only:**
- {ref}`decay_steps <learning_rate[exp]/decay_steps>` specifies the interval (in training steps) at which the learning rate is decayed. The learning rate is updated every {ref}`decay_steps <learning_rate[exp]/decay_steps>` steps during the decay phase.
**Learning rate formula for `exp` type:**
During the decay phase, the learning rate decays exponentially from {ref}`start_lr <learning_rate[exp]/start_lr>` to {ref}`stop_lr <learning_rate[exp]/stop_lr>`; the decay phase spans `numb_steps - warmup_steps` training steps.
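In terms of the configuration keys, this corresponds to the Theory-section formulas (a sketch, with the parameter names standing in for the symbols used there):

```math
\mathrm{lr}(t) = \mathrm{start\_lr} \cdot r^{\lfloor (t - \mathrm{warmup\_steps}) / \mathrm{decay\_steps} \rfloor}, \qquad r = \left(\frac{\mathrm{stop\_lr}}{\mathrm{start\_lr}}\right)^{\mathrm{decay\_steps} / (\mathrm{numb\_steps} - \mathrm{warmup\_steps})}.
```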
#### Warmup parameters (optional)
Warmup is a technique to stabilize training in the early stages by gradually increasing the learning rate from a small initial value to the target {ref}`start_lr <learning_rate[exp]/start_lr>`. The warmup parameters are optional and can be configured as follows (a short schedule sketch combining them with the decay phase appears after the list):
- {ref}`warmup_steps <learning_rate[exp]/warmup_steps>` (optional, default: 0) specifies the number of steps for learning rate warmup. During warmup, the learning rate increases linearly from `warmup_start_factor * start_lr` to {ref}`start_lr <learning_rate[exp]/start_lr>`. This parameter is mutually exclusive with {ref}`warmup_ratio <learning_rate[exp]/warmup_ratio>`.
- {ref}`warmup_ratio <learning_rate[exp]/warmup_ratio>` (optional) specifies the warmup duration as a ratio of the total training steps. For example, `warmup_ratio: 0.1` means the warmup phase will last for 10% of the total training steps. The actual number of warmup steps is computed as `int(warmup_ratio * numb_steps)`. This parameter is mutually exclusive with {ref}`warmup_steps <learning_rate[exp]/warmup_steps>`.
- {ref}`warmup_start_factor <learning_rate[exp]/warmup_start_factor>` (optional, default: 0.0) specifies the factor for the initial warmup learning rate. The warmup learning rate starts from `warmup_start_factor * start_lr` and increases linearly to {ref}`start_lr <learning_rate[exp]/start_lr>`. A value of 0.0 means the learning rate starts from zero.
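The sketch below illustrates how the warmup and decay parameters fit together. It is a standalone, hypothetical helper (`lr_at_step` is not part of the DeePMD-kit API) that mirrors the schedules described above:

```python
import math


def lr_at_step(step, numb_steps, start_lr, stop_lr, decay_steps=5000,
               warmup_steps=0, warmup_start_factor=0.0, schedule="exp"):
    """Illustrative learning-rate schedule with optional linear warmup.

    A sketch of the documented behavior, not the DeePMD-kit implementation.
    """
    if step < warmup_steps:
        # Linear warmup from warmup_start_factor * start_lr up to start_lr.
        warmup_lr = warmup_start_factor * start_lr
        return warmup_lr + (start_lr - warmup_lr) * step / warmup_steps
    decay_period = numb_steps - warmup_steps  # length of the decay phase
    t = step - warmup_steps
    if schedule == "exp":
        # Decay rate chosen so the learning rate reaches stop_lr at numb_steps.
        decay_rate = (stop_lr / start_lr) ** (decay_steps / decay_period)
        return start_lr * decay_rate ** (t // decay_steps)
    # Cosine annealing from start_lr down to stop_lr over the decay phase.
    return stop_lr + 0.5 * (start_lr - stop_lr) * (1.0 + math.cos(math.pi * t / decay_period))
```

For instance, `lr_at_step(5000, 1000000, 0.001, 1e-6, warmup_steps=10000, warmup_start_factor=0.1)` (the settings of Example 3 below) returns `0.00055`, halfway up the warmup ramp.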
#### Configuration examples
**Example 1: Basic exponential decay without warmup**
```json
"learning_rate": {
    "type": "exp",
    "start_lr": 0.001,
    "stop_lr": 1e-6,
    "decay_steps": 5000
}
```
**Example 2: Using stop_lr_rate instead of stop_lr**
```json
"learning_rate": {
    "type": "exp",
    "start_lr": 0.001,
    "stop_lr_rate": 1e-3,
    "decay_steps": 5000
}
```
This is equivalent to setting `stop_lr: 1e-6` (i.e., `0.001 * 1e-3`).

**Example 3: Exponential decay with warmup (using warmup_steps)**
```json
"learning_rate": {
    "type": "exp",
    "start_lr": 0.001,
    "stop_lr": 1e-6,
    "decay_steps": 5000,
    "warmup_steps": 10000,
    "warmup_start_factor": 0.1
}
```
In this example, the learning rate starts from `0.0001` (i.e., `0.1 * 0.001`) and increases linearly to `0.001` over the first 10,000 steps. After that, it decays exponentially to `1e-6`.
**Example 4: Exponential decay with warmup (using warmup_ratio)**
```json
"learning_rate": {
    "type": "exp",
    "start_lr": 0.001,
    "stop_lr_rate": 1e-3,
    "decay_steps": 5000,
    "warmup_ratio": 0.05
}
```
In this example, if the total training steps (`numb_steps`) is 1,000,000, the warmup phase will last for 50,000 steps (i.e., `0.05 * 1,000,000`). The learning rate starts from `0.0` (default `warmup_start_factor: 0.0`) and increases linearly to `0.001` over the first 50,000 steps, then decays exponentially.
#### Cosine annealing schedule
The {ref}`learning_rate <learning_rate>` section for cosine annealing in `input.json` is given as follows
```json
"learning_rate": {
    "type": "cosine",
    "start_lr": 0.001,
    "stop_lr": 1e-6
}
```
Cosine annealing provides a smooth decay curve that often works well for training neural networks. Unlike exponential decay, it does not require the `decay_steps` parameter.

**Example 5: Basic cosine annealing without warmup**
```json
"learning_rate": {
    "type": "cosine",
    "start_lr": 0.001,
    "stop_lr": 1e-6
}
```
**Example 6: Cosine annealing with stop_lr_rate**
```json
"learning_rate": {
    "type": "cosine",
    "start_lr": 0.001,
    "stop_lr_rate": 1e-3
}
```
This is equivalent to setting `stop_lr: 1e-6` (i.e., `0.001 * 1e-3`).
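A seventh example combines cosine annealing with warmup.

**Example 7: Cosine annealing with warmup**

A configuration consistent with the description below would be, for instance (`warmup_steps: 5000` is assumed here from the 5,000-step ramp described):

```json
"learning_rate": {
    "type": "cosine",
    "start_lr": 0.001,
    "stop_lr": 1e-6,
    "warmup_steps": 5000
}
```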
In this example, the learning rate starts from `0.0` and increases linearly to `0.001` over the first 5,000 steps, then follows a cosine annealing curve down to `1e-6`.