Training steps vs training batches #6979
Unanswered
del2z asked this question in DDP / multi-GPU / multi-node
Replies: 1 comment
- For anyone finding this in the future, the answers can be found here: #6984
I have some doubts about training steps and training batches.
1. When training a model with `max_steps`, the number of training batches shown in the progress bar on 1 GPU is the same as on 8 GPUs. If using X GPUs, is the actual batch size during one optimization step X times larger, resulting in a smaller number of training batches?
2. The batch size should be expanded in the multi-GPU setting according to PyTorch's DDP design ("PyTorch Distributed: Experiences on Accelerating Data Parallel Training"). From my observation in 1, the number of training batches equals the total number of steps (I don't think they should be the same). I wonder whether the DDP training mechanisms of PyTorch and PyTorch Lightning are still the same.
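To make the batch-size arithmetic in question 1 concrete, here is a small sketch. All numbers (per-GPU batch size, GPU count, `max_steps`) are hypothetical, not taken from the discussion:

```python
# Hypothetical values for illustration only.
per_gpu_batch_size = 32
num_gpus = 8
max_steps = 1000

# Under DDP, each GPU processes its own batch during one optimization step,
# so the effective (global) batch size per step scales with the GPU count.
effective_batch_size = per_gpu_batch_size * num_gpus
print(effective_batch_size)  # 256

# With a fixed max_steps, the number of optimization steps stays the same,
# but the number of samples consumed grows with the GPU count.
samples_seen_1_gpu = per_gpu_batch_size * 1 * max_steps
samples_seen_8_gpu = per_gpu_batch_size * num_gpus * max_steps
print(samples_seen_1_gpu, samples_seen_8_gpu)  # 32000 256000
```

This is why, with `max_steps` fixed, a run on 8 GPUs covers 8x more data per step even though the progress bar shows the same number of batches.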
In PyTorch, each device gets a copy of the model and other states identical to the original. After the backward pass, gradients from all devices are accumulated and broadcast to every device, so every model replica is updated with the same gradients. This process acts like splitting one large batch: each GPU runs the forward and backward pass on its smaller split, then all gradients are accumulated and the model parameters are updated once.
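The equivalence described above can be checked in a single process. This sketch simulates DDP's gradient averaging for a linear model with a mean-squared-error loss (the data, shapes, and device count are all made up for illustration):

```python
import numpy as np

# Random, hypothetical data: a global batch of 64 samples with 4 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 4))
y = rng.normal(size=64)
w = rng.normal(size=4)  # the identical model replica on every "device"

def grad_mse(Xb, yb, w):
    # Gradient of mean((Xb @ w - yb)**2) with respect to w.
    return 2.0 / len(yb) * Xb.T @ (Xb @ w - yb)

# Full-batch gradient: what one GPU holding the whole batch would compute.
g_full = grad_mse(X, y, w)

# Split the batch across 8 simulated devices, compute local gradients,
# then all-reduce (average) them -- what DDP does after backward().
shards = np.array_split(np.arange(64), 8)
local_grads = [grad_mse(X[idx], y[idx], w) for idx in shards]
g_avg = np.mean(local_grads, axis=0)

# With equal shard sizes, the averaged gradient matches the full-batch one,
# so one DDP step on 8 GPUs equals one step on the 8x larger batch.
print(np.allclose(g_full, g_avg))  # True
```

Note the equality holds exactly only when every device gets the same number of samples; with uneven shards the average of per-shard mean-loss gradients differs slightly from the full-batch gradient.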
It seems like PyTorch Lightning treats a single optimization step on X GPUs as X steps. If so, the number of training batches becomes the number of steps; the only difference is that the model parameters are updated once, not X times as that counting would suggest. Another coincidence is that a learning rate scheduler configured with `interval='step'` updates the learning rate on every batch. I hope someone can explain this.
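For reference, the `interval='step'` setting mentioned above is configured in the LightningModule's `configure_optimizers` hook. A minimal sketch; the choice of optimizer, scheduler type, and `step_size`/`gamma` values are hypothetical, not from the discussion:

```python
import torch
import pytorch_lightning as pl

class MyModel(pl.LightningModule):
    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        # Hypothetical scheduler; any torch.optim.lr_scheduler works here.
        scheduler = torch.optim.lr_scheduler.StepLR(
            optimizer, step_size=100, gamma=0.5
        )
        return {
            "optimizer": optimizer,
            "lr_scheduler": {
                "scheduler": scheduler,
                # 'step' asks Lightning to call scheduler.step() once per
                # training step (batch) rather than once per epoch.
                "interval": "step",
            },
        }
```

Whether "step" here means one batch per GPU or one global optimization step across GPUs is exactly the ambiguity the question is asking about.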