Should the total epoch size be less when using multi-gpu DDP? #7175
-
If I have 100 training examples and 100 validation examples and run on a single GPU with a batch size of 10, the tqdm bar shows 20 iterations per epoch. If I run on 2 GPUs with DDP and the same batch size, the tqdm bar still shows 20 iterations per epoch, but isn't the effective batch size now 20 instead of 10 because there are 2 GPUs? Shouldn't the total number of iterations be halved? Thanks for any clarification.
-
Hi @jipson7,
First of all: you're right, that's how it should be.
We tried to reproduce this, but for us it produced the correct output. Do you have a minimal reproduction example?
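For reference, here is a minimal sketch in plain PyTorch (not the exact Lightning internals) of why the per-rank iteration count should be halved under DDP; the dataset and numbers just mirror your example:

```python
import torch
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

dataset = TensorDataset(torch.arange(100).float())  # 100 samples, as in the example

# Single process: 100 samples / batch size 10 = 10 batches per epoch.
single_loader = DataLoader(dataset, batch_size=10)
print(len(single_loader))  # -> 10

# 2-GPU DDP: the DistributedSampler hands each rank 50 of the 100 samples,
# so each rank's loader yields only 5 batches (effective batch size is 20).
sampler = DistributedSampler(dataset, num_replicas=2, rank=0, shuffle=False)
ddp_loader = DataLoader(dataset, batch_size=10, sampler=sampler)
print(len(ddp_loader))  # -> 5
```

Lightning injects a `DistributedSampler` like this into your dataloaders automatically when running DDP, which is why the progress bar should shrink accordingly.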
-
Hi again. I figured it out. I was using a prefetching dataloader adapted from NVIDIA Apex. It wrapped the DataLoader in a custom generator to pipeline data loading and GPU transfer. That wrapper was breaking DDP and causing the behavior described above.
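For anyone who hits the same thing, the failure mode looked roughly like the sketch below (the class name and details are a reconstruction, not the actual Apex code): Lightning only sees the opaque wrapper, so it cannot inject a `DistributedSampler` into the underlying DataLoader, and every rank ends up iterating the full, unsharded dataset.

```python
from torch.utils.data import DataLoader

class PrefetchLoader:
    """Hypothetical wrapper that moves batches to the GPU ahead of time."""

    def __init__(self, loader: DataLoader, device: str = "cuda"):
        self.loader = loader    # Lightning sees only this opaque object,
        self.device = device    # not a DataLoader whose sampler it can replace.

    def __iter__(self):
        # Iterates the *unsharded* loader on every rank.
        for batch in self.loader:
            yield [t.to(self.device, non_blocking=True) for t in batch]

    def __len__(self):
        return len(self.loader)  # still the single-process length (10, not 5)
```

Returning a plain `DataLoader` from the dataloader hooks and letting Lightning handle the distributed sampling resolved it for me.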