In the latest deepmd-kit, one does not need to set the decay rate; instead, set stop_lr.
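A minimal sketch of how specifying stop_lr can stand in for an explicit decay rate, assuming an exponential schedule of the form lr(t) = start_lr * decay_rate ** (t / decay_steps) and that decay_rate is chosen so the learning rate reaches stop_lr at the final training step. The function name and the numeric values are illustrative, not deepmd-kit's actual API:

```python
def decay_rate_from_stop_lr(start_lr, stop_lr, stop_steps, decay_steps):
    """Derive the implicit decay rate from stop_lr.

    With lr(t) = start_lr * decay_rate ** (t / decay_steps),
    requiring lr(stop_steps) == stop_lr and solving for decay_rate gives
    decay_rate = (stop_lr / start_lr) ** (decay_steps / stop_steps).
    """
    return (stop_lr / start_lr) ** (decay_steps / stop_steps)

# Illustrative numbers: decay from 1e-3 to 1e-8 over 1,000,000 steps,
# with the rate applied per 5000-step block.
rate = decay_rate_from_stop_lr(1e-3, 1e-8, 1_000_000, 5_000)
```

Under this assumption the user only chooses the endpoints (start_lr, stop_lr) and the step counts; the decay rate falls out of them.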
With Horovod, the effective batch size is 8 when batch_size is set to 2 in the input file and we launch 4 workers. According to the Linear Scaling Rule in the attached paper Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour: when the minibatch size is multiplied by k, multiply the learning rate by k.
How, then, should we manually change the value of decay_steps?
Why should decay_steps be reduced to 1/2 of its original value in the above case?
Given the relationship lr(t) = start_lr * decay_rate ^ (t / decay_steps), if t / decay_steps always equals 200, then we could not possibly multiply the learning rate by k without changing decay_rate...
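To make the question concrete, here is a small sketch of the schedule above together with one common (but not authoritative) convention for adjusting it under data parallelism: scale start_lr by the worker count k per the Linear Scaling Rule, and shrink decay_steps by the same factor so the decay stays aligned with the (now k-times fewer) optimizer steps per epoch. All names and numbers here are made up for illustration:

```python
def lr(t, start_lr, decay_rate, decay_steps):
    # Exponential decay schedule from the discussion:
    # lr(t) = start_lr * decay_rate ** (t / decay_steps)
    return start_lr * decay_rate ** (t / decay_steps)

# Illustrative single-worker settings.
start_lr, decay_rate, decay_steps = 1e-3, 0.95, 5_000

# With k workers the effective batch size is k times larger,
# so the Linear Scaling Rule suggests scaling the learning rate by k.
k = 4
scaled_start_lr = k * start_lr

# Each worker now performs 1/k as many optimizer steps per epoch,
# so one convention is to shrink decay_steps by the same factor,
# keeping the decay tied to epochs rather than raw steps.
scaled_decay_steps = decay_steps // k
```

Note this does not resolve the poster's concern: if t / decay_steps is held fixed, the schedule's shape is unchanged, and only rescaling start_lr (or decay_rate) actually changes the learning rate at a given point in training.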