
Commit 5ba70db

Allow to scale LR in different ways. (#1167)
1 parent a2708ea commit 5ba70db

File tree: 3 files changed, +27 −4 lines

- deepmd/train/trainer.py
- deepmd/utils/argcheck.py
- doc/train/parallel-training.md

deepmd/train/trainer.py

Lines changed: 12 additions & 1 deletion
```diff
@@ -157,6 +157,13 @@ def _init_param(self, jdata):
 
         # learning rate
         lr_param = j_must_have(jdata, 'learning_rate')
+        scale_by_worker = lr_param.get('scale_by_worker', 'linear')
+        if scale_by_worker == 'linear':
+            self.scale_lr_coef = float(self.run_opt.world_size)
+        elif scale_by_worker == 'sqrt':
+            self.scale_lr_coef = np.sqrt(self.run_opt.world_size).real
+        else:
+            self.scale_lr_coef = 1.
         lr_type = lr_param.get('type', 'exp')
         if lr_type == 'exp':
             self.lr = LearningRateExp(lr_param['start_lr'],
@@ -330,7 +337,11 @@ def _build_network(self, data):
     def _build_training(self):
         trainable_variables = tf.trainable_variables()
         if self.run_opt.is_distrib:
-            optimizer = tf.train.AdamOptimizer(learning_rate = self.learning_rate*self.run_opt.world_size)
+            if self.scale_lr_coef > 1.:
+                log.info('Scale learning rate by coef: %f', self.scale_lr_coef)
+                optimizer = tf.train.AdamOptimizer(self.learning_rate*self.scale_lr_coef)
+            else:
+                optimizer = tf.train.AdamOptimizer(self.learning_rate)
             optimizer = self.run_opt._HVD.DistributedOptimizer(optimizer)
         else:
             optimizer = tf.train.AdamOptimizer(learning_rate = self.learning_rate)
```
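Read in isolation, the new branch reduces to a single coefficient applied to the base learning rate. The following is a minimal sketch of that rule outside the trainer class; the standalone function name `lr_scale_coef` and the worked numbers are illustrative only, not part of the commit:

```python
import numpy as np

def lr_scale_coef(world_size: int, scale_by_worker: str = 'linear') -> float:
    """Sketch of the scaling rule added in Trainer._init_param.

    'linear' multiplies the learning rate by the number of workers,
    'sqrt' by its square root, and any other value (e.g. 'none')
    leaves the learning rate unchanged.
    """
    if scale_by_worker == 'linear':
        return float(world_size)
    elif scale_by_worker == 'sqrt':
        return float(np.sqrt(world_size))
    return 1.

# Worked example: base learning rate 1e-3 with 4 Horovod workers.
for mode in ('linear', 'sqrt', 'none'):
    print(mode, 1e-3 * lr_scale_coef(4, mode))
# linear -> 0.004, sqrt -> 0.002, none -> 0.001
```

Note that the trainer only applies the coefficient in the distributed (`is_distrib`) branch; serial runs keep the configured learning rate unchanged.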

deepmd/utils/argcheck.py

Lines changed: 3 additions & 1 deletion
```diff
@@ -452,8 +452,10 @@ def learning_rate_variant_type_args():
 
 
 def learning_rate_args():
+    doc_scale_by_worker = 'When parallel training or batch size scaled, how to alter learning rate. Valid values are `linear`(default), `sqrt` or `none`.'
     doc_lr = "The definitio of learning rate"
-    return Argument("learning_rate", dict, [],
+    return Argument("learning_rate", dict,
+                    [Argument("scale_by_worker", str, optional=True, default='linear', doc=doc_scale_by_worker)],
                     [learning_rate_variant_type_args()],
                     doc = doc_lr)
 
```
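As a usage illustration (not taken from the commit), the new key sits next to the usual exponential-decay fields in an input script's `learning_rate` section; the numeric values below are placeholders:

```json
"learning_rate": {
    "type": "exp",
    "scale_by_worker": "sqrt",
    "start_lr": 0.001,
    "stop_lr": 3.51e-8,
    "decay_steps": 5000
}
```

Because `scale_by_worker` is optional and defaults to `linear`, existing input files keep the previous behaviour of multiplying the learning rate by the worker count.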

doc/train/parallel-training.md

Lines changed: 12 additions & 2 deletions
````diff
@@ -3,9 +3,19 @@
 Currently, parallel training is enabled in a sychoronized way with help of [Horovod](https://github.com/horovod/horovod).
 Depend on the number of training processes (according to MPI context) and number of GPU cards avaliable, DeePMD-kit will decide whether to launch the training in parallel (distributed) mode or in serial mode. Therefore, no additional options is specified in your JSON/YAML input file.
 
-Horovod works in the data-parallel mode, resulting in a larger global batch size. For example, the real batch size is 8 when `batch_size` is set to 2 in the input file and you launch 4 workers. Thus, `learning_rate` is automatically scaled by the number of workers for better convergence. The number of decay steps required to achieve same accuracy will also reduce based on the number of cards (e.g., 1/4 of steps in the above case), but needs to be scaled manually in the input file.
+## Tuning learning rate
 
-Technical details of such heuristic rule are discussed at [Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour](https://arxiv.org/abs/1706.02677).
+Horovod works in the data-parallel mode, resulting in a larger global batch size. For example, the real batch size is 8 when `batch_size` is set to 2 in the input file and you launch 4 workers. Thus, `learning_rate` is automatically scaled by the number of workers for better convergence. Technical details of such heuristic rule are discussed at [Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour](https://arxiv.org/abs/1706.02677).
+
+The number of decay steps required to achieve same accuracy can decrease by the number of cards (e.g., 1/2 of steps in the above case), but needs to be scaled manually in the input file.
+
+In some cases, it won't work well when scale learning rate by worker count in a `linear` way. Then you can try `sqrt` or `none` by setting argument `scale_by_worker` like below.
+```json
+"learning_rate" :{
+    "scale_by_worker": "none",
+    "type": "exp"
+}
+```
 
 ## Scaling test
 
````
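For context on the 4-worker example above, a data-parallel run is started through Horovod's launcher; the line below is a plausible sketch (the `input.json` file name is a placeholder, and the exact invocation depends on your MPI/GPU setup):

```sh
# Launch 4 data-parallel DeePMD-kit training workers via Horovod.
horovodrun -np 4 dp train input.json
```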
