docs: update learning rate documentation for warmup and stop_lr_rate
- Add comprehensive documentation for warmup parameters (warmup_steps, warmup_ratio, warmup_start_factor)
- Document stop_lr_rate as an alternative to stop_lr for ratio-based specification
- Add complete documentation for cosine annealing scheduler
- Update theory section with separate warmup and decay phase formulas
- Provide 7 configuration examples covering both exp and cosine types with warmup variants
@@ -6,44 +6,215 @@ In this section, we will take `$deepmd_source_dir/examples/water/se_e2_a/input.j
### Theory
The learning rate schedule consists of two phases: an optional warmup phase followed by a decay phase.
#### Warmup phase (optional)
During the warmup phase (steps $0 \leq \tau < \tau^{\text{warmup}}$), the learning rate increases linearly from an initial warmup learning rate to the target starting learning rate:
```math
\gamma(\tau) = \gamma^{\text{warmup}} + \left(\gamma^0 - \gamma^{\text{warmup}}\right) \frac{\tau}{\tau^{\text{warmup}}},
```
where $\gamma^{\text{warmup}} = f^{\text{warmup}} \cdot \gamma^0$ is the initial warmup learning rate, $f^{\text{warmup}} \in [0, 1]$ is the warmup start factor (default 0.0), and $\tau^{\text{warmup}} \in \mathbb{N}$ is the number of warmup steps.
#### Decay phase
After the warmup phase (steps $\tau \geq \tau^{\text{warmup}}$), the learning rate decays according to the selected schedule type.

**Exponential decay (`type: "exp"`):**

The learning rate decays exponentially:
```math
\gamma(\tau) = \gamma^0 r ^ {\lfloor (\tau - \tau^{\text{warmup}})/s \rfloor},
```
where $\tau \in \mathbb{N}$ is the index of the training step, $\gamma^0 \in \mathbb{R}$ is the learning rate at the start of the decay phase (i.e., after warmup), and the decay rate $r$ is given by
```math
r = {\left(\frac{\gamma^{\text{stop}}}{\gamma^0}\right )} ^{\frac{s}{\tau^{\text{decay}}}},
```
where $\tau^{\text{decay}} = \tau^{\text{stop}} - \tau^{\text{warmup}}$ is the length of the decay phase, $\tau^{\text{stop}} \in \mathbb{N}$ is the total number of training steps, $\gamma^{\text{stop}} \in \mathbb{R}$ is the stopping learning rate, and $s \in \mathbb{N}$ is the decay steps, i.e., the interval at which the learning rate is updated.
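For example, with $\gamma^0 = 10^{-3}$, $\gamma^{\text{stop}} = 10^{-6}$, $s = 5000$, and $\tau^{\text{decay}} = 10^{6}$ (values matching the configuration examples below, assuming $10^{6}$ total training steps and no warmup), the decay rate is $r = (10^{-3})^{5000/10^{6}} = 10^{-0.015} \approx 0.966$, i.e., the learning rate is multiplied by roughly 0.966 every 5000 steps.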
**Cosine annealing (`type: "cosine"`):**

The learning rate follows a cosine annealing schedule: it smoothly decreases from $\gamma^0$ to $\gamma^{\text{stop}}$ following a cosine curve over the decay phase.
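A minimal sketch of this schedule, assuming the standard cosine annealing form over the decay phase:

```math
\gamma(\tau) = \gamma^{\text{stop}} + \frac{1}{2}\left(\gamma^0 - \gamma^{\text{stop}}\right)\left[1 + \cos\left(\pi \frac{\tau - \tau^{\text{warmup}}}{\tau^{\text{decay}}}\right)\right], \qquad \tau^{\text{warmup}} \leq \tau \leq \tau^{\text{stop}}.
```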
For both schedule types, the stopping learning rate can be specified directly as $\gamma^{\text{stop}}$ or as a ratio: $\gamma^{\text{stop}} = \rho^{\text{stop}} \cdot \gamma^0$, where $\rho^{\text{stop}} \in (0, 1]$ is the stopping learning rate ratio.
[^1]
[^1]: This section is built upon Jinzhe Zeng, Duo Zhang, Denghui Lu, Pinghui Mo, Zeyu Li, Yixiao Chen, Marián Rynik, Li'ang Huang, Ziyao Li, Shaochen Shi, Yingze Wang, Haotian Ye, Ping Tuo, Jiabin Yang, Ye Ding, Yifan Li, Davide Tisi, Qiyu Zeng, Han Bao, Yu Xia, Jiameng Huang, Koki Muraoka, Yibo Wang, Junhan Chang, Fengbo Yuan, Sigbjørn Løland Bore, Chun Cai, Yinnian Lin, Bo Wang, Jiayan Xu, Jia-Xin Zhu, Chenxing Luo, Yuzhi Zhang, Rhys E. A. Goodall, Wenshuo Liang, Anurag Kumar Singh, Sikai Yao, Jingchao Zhang, Renata Wentzcovitch, Jiequn Han, Jie Liu, Weile Jia, Darrin M. York, Weinan E, Roberto Car, Linfeng Zhang, Han Wang, [J. Chem. Phys. 159, 054801 (2023)](https://doi.org/10.1063/5.0155600) licensed under a [Creative Commons Attribution (CC BY) license](http://creativecommons.org/licenses/by/4.0/).
### Instructions
DeePMD-kit supports two types of learning rate schedules: exponential decay (`type: "exp"`) and cosine annealing (`type: "cosine"`). Both types support optional warmup and can use either absolute stopping learning rate or a ratio-based specification.
#### Exponential decay schedule
The {ref}`learning_rate <learning_rate>` section for exponential decay in `input.json` is given as follows
```json
"learning_rate" :{
    "type": "exp",
    "start_lr": 0.001,
    "stop_lr": 1e-6,
    "decay_steps": 5000,
    "_comment": "that's all"
}
```
#### Basic parameters
**Common parameters for both `exp` and `cosine` types:**
- {ref}`start_lr <learning_rate[exp]/start_lr>` gives the learning rate at the start of the decay phase (i.e., after warmup if enabled). It should be set appropriately based on the model architecture and dataset.
- {ref}`stop_lr <learning_rate[exp]/stop_lr>` gives the target learning rate at the end of the training. It should be small enough to ensure that the network parameters satisfactorily converge. This parameter is mutually exclusive with {ref}`stop_lr_rate <learning_rate[exp]/stop_lr_rate>`.
- {ref}`stop_lr_rate <learning_rate[exp]/stop_lr_rate>` (optional) specifies the stopping learning rate as a ratio of {ref}`start_lr <learning_rate[exp]/start_lr>`. For example, `stop_lr_rate: 1e-3` means `stop_lr = start_lr * 1e-3`. This parameter is mutually exclusive with {ref}`stop_lr <learning_rate[exp]/stop_lr>`. Either {ref}`stop_lr <learning_rate[exp]/stop_lr>` or {ref}`stop_lr_rate <learning_rate[exp]/stop_lr_rate>` must be provided.
**Additional parameter for `exp` type only:**
- {ref}`decay_steps <learning_rate[exp]/decay_steps>` specifies the interval (in training steps) at which the learning rate is decayed. The learning rate is updated every {ref}`decay_steps <learning_rate[exp]/decay_steps>` steps during the decay phase.
**Learning rate formula for `exp` type:**
During the decay phase, the learning rate decays exponentially from {ref}`start_lr <learning_rate[exp]/start_lr>` to {ref}`stop_lr <learning_rate[exp]/stop_lr>`; the decay phase spans `numb_steps - warmup_steps` training steps.
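In terms of the configuration keys, this corresponds to the Theory-section formulas (a sketch, with the parameter names standing in for the symbols used there):

```math
\mathrm{lr}(t) = \mathrm{start\_lr} \cdot r^{\lfloor (t - \mathrm{warmup\_steps}) / \mathrm{decay\_steps} \rfloor}, \qquad r = \left(\frac{\mathrm{stop\_lr}}{\mathrm{start\_lr}}\right)^{\mathrm{decay\_steps} / (\mathrm{numb\_steps} - \mathrm{warmup\_steps})}.
```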
#### Warmup parameters (optional)
Warmup is a technique to stabilize training in the early stages by gradually increasing the learning rate from a small initial value to the target {ref}`start_lr <learning_rate[exp]/start_lr>`. The warmup parameters are optional and can be configured as follows (a short schedule sketch combining them with the decay phase appears after the list):
- {ref}`warmup_steps <learning_rate[exp]/warmup_steps>` (optional, default: 0) specifies the number of steps for learning rate warmup. During warmup, the learning rate increases linearly from `warmup_start_factor * start_lr` to {ref}`start_lr <learning_rate[exp]/start_lr>`. This parameter is mutually exclusive with {ref}`warmup_ratio <learning_rate[exp]/warmup_ratio>`.
- {ref}`warmup_ratio <learning_rate[exp]/warmup_ratio>` (optional) specifies the warmup duration as a ratio of the total training steps. For example, `warmup_ratio: 0.1` means the warmup phase will last for 10% of the total training steps. The actual number of warmup steps is computed as `int(warmup_ratio * numb_steps)`. This parameter is mutually exclusive with {ref}`warmup_steps <learning_rate[exp]/warmup_steps>`.
- {ref}`warmup_start_factor <learning_rate[exp]/warmup_start_factor>` (optional, default: 0.0) specifies the factor for the initial warmup learning rate. The warmup learning rate starts from `warmup_start_factor * start_lr` and increases linearly to {ref}`start_lr <learning_rate[exp]/start_lr>`. A value of 0.0 means the learning rate starts from zero.
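The sketch below illustrates how the warmup and decay parameters fit together. It is a standalone, hypothetical helper (`lr_at_step` is not part of the DeePMD-kit API) that mirrors the schedules described above:

```python
import math


def lr_at_step(step, numb_steps, start_lr, stop_lr, decay_steps=5000,
               warmup_steps=0, warmup_start_factor=0.0, schedule="exp"):
    """Illustrative learning-rate schedule with optional linear warmup.

    A sketch of the documented behavior, not the DeePMD-kit implementation.
    """
    if step < warmup_steps:
        # Linear warmup from warmup_start_factor * start_lr up to start_lr.
        warmup_lr = warmup_start_factor * start_lr
        return warmup_lr + (start_lr - warmup_lr) * step / warmup_steps
    decay_period = numb_steps - warmup_steps  # length of the decay phase
    t = step - warmup_steps
    if schedule == "exp":
        # Decay rate chosen so the learning rate reaches stop_lr at numb_steps.
        decay_rate = (stop_lr / start_lr) ** (decay_steps / decay_period)
        return start_lr * decay_rate ** (t // decay_steps)
    # Cosine annealing from start_lr down to stop_lr over the decay phase.
    return stop_lr + 0.5 * (start_lr - stop_lr) * (1.0 + math.cos(math.pi * t / decay_period))
```

For instance, `lr_at_step(5000, 1000000, 0.001, 1e-6, warmup_steps=10000, warmup_start_factor=0.1)` (the settings of Example 3 below) returns `0.00055`, halfway up the warmup ramp.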
#### Configuration examples
**Example 1: Basic exponential decay without warmup**
```json
"learning_rate": {
    "type": "exp",
    "start_lr": 0.001,
    "stop_lr": 1e-6,
    "decay_steps": 5000
}
```
**Example 2: Using stop_lr_rate instead of stop_lr**
```json
"learning_rate": {
    "type": "exp",
    "start_lr": 0.001,
    "stop_lr_rate": 1e-3,
    "decay_steps": 5000
}
```
This is equivalent to setting `stop_lr: 1e-6` (i.e., `0.001 * 1e-3`).

**Example 3: Exponential decay with warmup (using warmup_steps)**
```json
"learning_rate": {
    "type": "exp",
    "start_lr": 0.001,
    "stop_lr": 1e-6,
    "decay_steps": 5000,
    "warmup_steps": 10000,
    "warmup_start_factor": 0.1
}
```
In this example, the learning rate starts from `0.0001` (i.e., `0.1 * 0.001`) and increases linearly to `0.001` over the first 10,000 steps. After that, it decays exponentially to `1e-6`.
**Example 4: Exponential decay with warmup (using warmup_ratio)**
```json
"learning_rate": {
    "type": "exp",
    "start_lr": 0.001,
    "stop_lr_rate": 1e-3,
    "decay_steps": 5000,
    "warmup_ratio": 0.05
}
```
In this example, if the total training steps (`numb_steps`) is 1,000,000, the warmup phase will last for 50,000 steps (i.e., `0.05 * 1,000,000`). The learning rate starts from `0.0` (default `warmup_start_factor: 0.0`) and increases linearly to `0.001` over the first 50,000 steps, then decays exponentially.
#### Cosine annealing schedule
The {ref}`learning_rate <learning_rate>` section for cosine annealing in `input.json` is given as follows
```json
"learning_rate": {
    "type": "cosine",
    "start_lr": 0.001,
    "stop_lr": 1e-6
}
```
Cosine annealing provides a smooth decay curve that often works well for training neural networks. Unlike exponential decay, it does not require the `decay_steps` parameter.

**Example 5: Basic cosine annealing without warmup**
```json
"learning_rate": {
    "type": "cosine",
    "start_lr": 0.001,
    "stop_lr": 1e-6
}
```
**Example 6: Cosine annealing with stop_lr_rate**
```json
"learning_rate": {
    "type": "cosine",
    "start_lr": 0.001,
    "stop_lr_rate": 1e-3
}
```
This is equivalent to setting `stop_lr: 1e-6` (i.e., `0.001 * 1e-3`).
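A seventh example combines cosine annealing with warmup.

**Example 7: Cosine annealing with warmup**

A configuration consistent with the description below would be, for instance (`warmup_steps: 5000` is assumed here from the 5,000-step ramp described):

```json
"learning_rate": {
    "type": "cosine",
    "start_lr": 0.001,
    "stop_lr": 1e-6,
    "warmup_steps": 5000
}
```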
In this example, the learning rate starts from `0.0` and increases linearly to `0.001` over the first 5,000 steps, then follows a cosine annealing curve down to `1e-6`.