Training steps vs training batches #6979
Unanswered
del2z asked this question in DDP / multi-GPU / multi-node
Replies: 1 comment
- For anyone finding this in the future, the answers can be found here: #6984
I have some doubts about training steps and training batches.
1. When training a model with `max_steps`, the number of training batches shown in the progress bar on 1 GPU is the same as on 8 GPUs. If using X GPUs, is the actual batch size during one optimization step X times larger, resulting in a smaller number of training batches?
2. The batch size should be expanded in the multi-GPU setting according to PyTorch's DDP design ("PyTorch Distributed: Experiences on Accelerating Data Parallel Training"). From my observation in 1, the number of training batches equals the total number of steps (I don't think they should be the same). I wonder whether the DDP training mechanisms of PyTorch and PyTorch Lightning are still the same.
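To make the batch-size arithmetic in question 1 concrete, here is a small sketch. All numbers (per-GPU batch size, GPU count, `max_steps`) are hypothetical, not taken from the discussion:

```python
# Hypothetical values for illustration only.
per_gpu_batch_size = 32
num_gpus = 8
max_steps = 1000

# Under DDP, each GPU processes its own batch during one optimization step,
# so the effective (global) batch size per step scales with the GPU count.
effective_batch_size = per_gpu_batch_size * num_gpus
print(effective_batch_size)  # 256

# With a fixed max_steps, the number of optimization steps stays the same,
# but the number of samples consumed grows with the GPU count.
samples_seen_1_gpu = per_gpu_batch_size * 1 * max_steps
samples_seen_8_gpu = per_gpu_batch_size * num_gpus * max_steps
print(samples_seen_1_gpu, samples_seen_8_gpu)  # 32000 256000
```

This is why, with `max_steps` fixed, a run on 8 GPUs covers 8x more data per step even though the progress bar shows the same number of batches.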
In PyTorch, each device gets a copy of the model and other states identical to the original. After the backward pass, gradients from all devices are accumulated and broadcast to every device, so every model replica is updated with the same gradients. This process acts like splitting one large batch: each GPU runs the forward and backward pass on its smaller split, then all gradients are accumulated and the model parameters are updated once.
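The equivalence described above can be checked in a single process. This sketch simulates DDP's gradient averaging for a linear model with a mean-squared-error loss (the data, shapes, and device count are all made up for illustration):

```python
import numpy as np

# Random, hypothetical data: a global batch of 64 samples with 4 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 4))
y = rng.normal(size=64)
w = rng.normal(size=4)  # the identical model replica on every "device"

def grad_mse(Xb, yb, w):
    # Gradient of mean((Xb @ w - yb)**2) with respect to w.
    return 2.0 / len(yb) * Xb.T @ (Xb @ w - yb)

# Full-batch gradient: what one GPU holding the whole batch would compute.
g_full = grad_mse(X, y, w)

# Split the batch across 8 simulated devices, compute local gradients,
# then all-reduce (average) them -- what DDP does after backward().
shards = np.array_split(np.arange(64), 8)
local_grads = [grad_mse(X[idx], y[idx], w) for idx in shards]
g_avg = np.mean(local_grads, axis=0)

# With equal shard sizes, the averaged gradient matches the full-batch one,
# so one DDP step on 8 GPUs equals one step on the 8x larger batch.
print(np.allclose(g_full, g_avg))  # True
```

Note the equality holds exactly only when every device gets the same number of samples; with uneven shards the average of per-shard mean-loss gradients differs slightly from the full-batch gradient.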
It seems like PyTorch Lightning treats a single optimization step on X GPUs as X steps. If so, the number of training batches becomes the number of steps; the only difference is that the model parameters are updated once, not X times as that counting would suggest. Another coincidence is that a learning rate scheduler configured with `interval='step'` updates the learning rate on every batch. I hope someone can explain this.
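For reference, the `interval='step'` setting mentioned above is configured in the LightningModule's `configure_optimizers` hook. A minimal sketch; the choice of optimizer, scheduler type, and `step_size`/`gamma` values are hypothetical, not from the discussion:

```python
import torch
import pytorch_lightning as pl

class MyModel(pl.LightningModule):
    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        # Hypothetical scheduler; any torch.optim.lr_scheduler works here.
        scheduler = torch.optim.lr_scheduler.StepLR(
            optimizer, step_size=100, gamma=0.5
        )
        return {
            "optimizer": optimizer,
            "lr_scheduler": {
                "scheduler": scheduler,
                # 'step' asks Lightning to call scheduler.step() once per
                # training step (batch) rather than once per epoch.
                "interval": "step",
            },
        }
```

Whether "step" here means one batch per GPU or one global optimization step across GPUs is exactly the ambiguity the question is asking about.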