Understanding batch size with multiple dataloaders #14235
Unanswered
Michael-Geuenich asked this question in Lightning Trainer API: Trainer, LightningModule, LightningDataModule
Replies: 1 comment · 4 replies
- This is slightly incorrect: it applies only to the progress bar, not to the actual global step.
I'm currently working on a model with multiple dataloaders of different sizes. My first dataset has 176 samples that are loaded in their entirety in my training loop (`batch_size=176`). My second dataset has ~22,000 samples, though I am using the same batch size of 176. When I implemented this in plain PyTorch, I went through one batch (176 samples) per epoch, which is what I wanted (3 total epochs, three batches sampled, three total steps).

I'm only testing things out at the moment, so I've run this in PL for 3 epochs. I was expecting to run through 3 global steps; however, PL runs through 332 global steps and I don't understand how it arrives at this number. If the end of an epoch is defined as having sampled the entirety of the larger loader, then I would expect to go through 22,000 / 176 * 3 = 375 global steps.
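For reference, here is roughly how I arrive at that number. This is just back-of-the-envelope arithmetic, assuming the default behaviour where the training epoch follows the largest dataloader ("max_size_cycle") and that neither loader drops its last partial batch:

```python
import math

# Sizes from my setup (assuming drop_last=False on both loaders)
small_len, large_len = 176, 22_000
batch_size = 176
epochs = 3

# Plain PyTorch loop over the small dataset only: 1 batch per epoch
steps_plain = math.ceil(small_len / batch_size) * epochs        # 1 * 3 = 3

# Lightning with two train dataloaders and the epoch length tied to
# the largest loader ("max_size_cycle"): the smaller loader is cycled
steps_max_size = math.ceil(large_len / batch_size) * epochs     # 125 * 3 = 375

# If the epoch instead ended with the smallest loader ("min_size")
steps_min_size = math.ceil(small_len / batch_size) * epochs     # 1 * 3 = 3

print(steps_plain, steps_max_size, steps_min_size)              # 3 375 3
```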
Note that I also have a validation step with two dataloaders (datasets with 61 and ~3,000 samples, and a batch size of 61 for both), and I've specified `val_check_interval=1` in my Trainer call. I am aware of this question/answer (https://forums.pytorchlightning.ai/t/weird-number-of-steps-per-epoch/773), which states that global steps = total train steps + total val steps. However, if that is the case, shouldn't the total number of steps PL goes through be even higher than 332 or 375?
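For completeness, here is a stripped-down sketch of the kind of setup I mean. The model, the random tensors, and the `LitModel` class name are placeholders rather than my actual code; only the dataset sizes, batch sizes, epoch count, and `val_check_interval` match the numbers above:

```python
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset


class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(10, 1)

    def training_step(self, batch, batch_idx):
        # With multiple train dataloaders returned as a dict, each training
        # batch arrives as a dict keyed like train_dataloader()'s return value.
        x_small, y_small = batch["small"]
        return torch.nn.functional.mse_loss(self.layer(x_small), y_small)

    def validation_step(self, batch, batch_idx, dataloader_idx=0):
        x, y = batch
        self.log("val_loss", torch.nn.functional.mse_loss(self.layer(x), y))

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

    def train_dataloader(self):
        small = TensorDataset(torch.randn(176, 10), torch.randn(176, 1))
        large = TensorDataset(torch.randn(22_000, 10), torch.randn(22_000, 1))
        # Two training dataloaders, both with a batch size of 176
        return {
            "small": DataLoader(small, batch_size=176),
            "large": DataLoader(large, batch_size=176),
        }

    def val_dataloader(self):
        val_a = TensorDataset(torch.randn(61, 10), torch.randn(61, 1))
        val_b = TensorDataset(torch.randn(3_000, 10), torch.randn(3_000, 1))
        # Two validation dataloaders, batch size 61 for both
        return [DataLoader(val_a, batch_size=61), DataLoader(val_b, batch_size=61)]


if __name__ == "__main__":
    # val_check_interval=1 (an int) means "validate after every training batch";
    # a float such as 1.0 would instead mean "once per training epoch".
    trainer = pl.Trainer(max_epochs=3, val_check_interval=1)
    trainer.fit(LitModel())
```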