Handling batch normalization with gradient accumulation #13889
Unanswered
brunomaga asked this question in Lightning Trainer API: Trainer, LightningModule, LightningDataModule
Replies: 0 comments
Hi. How does PyTorch Lightning handle batch normalization combined with gradient accumulation?

As an example, take training a neural network with batch normalization in its layers under three different setups (sketched as Trainer configurations below):

A) model.batch_size=8 and no gradient accumulation;
B) DistributedDataParallel run on 2 GPUs, with model.batch_size=4 and no gradient accumulation, i.e. an effective batch size of 2*4=8; and
C) DistributedDataParallel run on 2 GPUs, model.batch_size=2 and accumulate_grad_batches=2, i.e. also an effective batch size of 2*2*2=8.

When sync_batchnorm=True, executions A and B will produce similar results (right?). What about the results of C: will sync_batchnorm synchronize across all gradient accumulation steps, or only across GPUs?