Handling batch normalization with gradient accumulation #13889
Unanswered
brunomaga asked this question in Lightning Trainer API: Trainer, LightningModule, LightningDataModule
Replies: 0 comments
Hi. How does PyTorch Lightning handle batch normalization combined with gradient accumulation?

As an example, take training a neural network with batch normalization in its layers under three different setups (sketched as Trainer configurations below):

A) model.batch_size=8 and no gradient accumulation;
B) DistributedDataParallel run on 2 GPUs, with model.batch_size=4 and no gradient accumulation, i.e. an effective batch size of 2*4=8; and
C) DistributedDataParallel run on 2 GPUs, model.batch_size=2 and accumulate_grad_batches=2, i.e. also an effective batch size of 2*2*2=8.

When sync_batchnorm=True, executions A and B will produce similar results (right?). What about the results of C: will sync_batchnorm synchronize across all gradient accumulation steps, or only across GPUs?