Should the total epoch size be less when using multi-gpu DDP? #7175
-
If I have 100 training examples and 100 validation examples and run on a single GPU with a batch size of 10, the tqdm bar shows 20 iterations per epoch. If I run on 2 GPUs with DDP and the same batch size, the tqdm bar still shows 20 iterations per epoch, but isn't the effective batch size now 20 instead of 10 because there are 2 GPUs? Shouldn't the total number of iterations be halved? Thanks for any clarification.
-
Hi @jipson7,
First of all: you're right, that's how it should be.
We tried to reproduce this, but for us it produced the correct output. Do you have a minimal reproduction example?
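For reference, here is a minimal sketch in plain PyTorch (not the exact Lightning internals) of why the per-rank iteration count should be halved under DDP; the dataset and numbers just mirror your example:

```python
import torch
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

dataset = TensorDataset(torch.arange(100).float())  # 100 samples, as in the example

# Single process: 100 samples / batch size 10 = 10 batches per epoch.
single_loader = DataLoader(dataset, batch_size=10)
print(len(single_loader))  # -> 10

# 2-GPU DDP: the DistributedSampler hands each rank 50 of the 100 samples,
# so each rank's loader yields only 5 batches (effective batch size is 20).
sampler = DistributedSampler(dataset, num_replicas=2, rank=0, shuffle=False)
ddp_loader = DataLoader(dataset, batch_size=10, sampler=sampler)
print(len(ddp_loader))  # -> 5
```

Lightning injects a `DistributedSampler` like this into your dataloaders automatically when running DDP, which is why the progress bar should shrink accordingly.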
-
Hi again. I figured it out. I was using a prefetching dataloader adapted from NVIDIA Apex. It wrapped the DataLoader in a custom generator to pipeline data loading and GPU transfer. That wrapper was breaking DDP and causing the behavior described above.
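For anyone who hits the same thing, the failure mode looked roughly like the sketch below (the class name and details are a reconstruction, not the actual Apex code): Lightning only sees the opaque wrapper, so it cannot inject a `DistributedSampler` into the underlying DataLoader, and every rank ends up iterating the full, unsharded dataset.

```python
from torch.utils.data import DataLoader

class PrefetchLoader:
    """Hypothetical wrapper that moves batches to the GPU ahead of time."""

    def __init__(self, loader: DataLoader, device: str = "cuda"):
        self.loader = loader    # Lightning sees only this opaque object,
        self.device = device    # not a DataLoader whose sampler it can replace.

    def __iter__(self):
        # Iterates the *unsharded* loader on every rank.
        for batch in self.loader:
            yield [t.to(self.device, non_blocking=True) for t in batch]

    def __len__(self):
        return len(self.loader)  # still the single-process length (10, not 5)
```

Returning a plain `DataLoader` from the dataloader hooks and letting Lightning handle the distributed sampling resolved it for me.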