How should the number of steps be set relative to the amount of data to process when using DDP and multiple GPUs? #11192
-
Hi, I'm a newbie with pytorch-lightning. I want to process 100,000 records, so I set max_steps to 25,000 when using 4 GPUs. Is it correct that the number of steps specified when using multiple GPUs needs to be divided by the number of GPUs, compared with the expected number of steps? (My pytorch-lightning version is 1.5.4.)
Replies: 1 comment 1 reply
-
hey @kash203
with DDP, if batch_size is 32 with 4 GPUs, then the effective batch size is actually 32*4. See the docs here.
now in your case, there are 100_000 samples. Considering batch_size as 1, a single training step with 4 GPUs means 4 calls to the dataloader, so 4 samples are covered in a single step. Thus your max_steps here should be 25_000, as you stated.
In case you want to process all the samples only once, you can just set max_epochs=1.
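For reference, here is a minimal sketch of that setup (assuming PyTorch Lightning 1.5.x; `MyModel` and `train_loader` are hypothetical placeholders, not from the original thread):

```python
import pytorch_lightning as pl

# 100_000 samples, batch_size=1 per GPU, 4 GPUs -> 4 samples per step -> 25_000 steps
trainer = pl.Trainer(
    gpus=4,
    strategy="ddp",
    max_steps=25_000,
)

# Equivalent if you just want a single pass over the whole dataset:
# trainer = pl.Trainer(gpus=4, strategy="ddp", max_epochs=1)

# trainer.fit(MyModel(), train_loader)  # hypothetical LightningModule and DataLoader
```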
max_steps is generally used for sequential learning tasks where the data is created iteratively.
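As an illustration of that case, a hedged sketch using a streaming `IterableDataset` (the dataset class and tensor shapes are made up for the example), where `max_steps` bounds training because there is no fixed epoch length:

```python
import torch
from torch.utils.data import IterableDataset, DataLoader

class StreamingDataset(IterableDataset):
    """Hypothetical dataset that generates samples on the fly (no fixed length)."""
    def __iter__(self):
        while True:
            yield torch.randn(10), torch.randint(0, 2, (1,))

loader = DataLoader(StreamingDataset(), batch_size=1)

# With data like this, max_epochs has no natural meaning; max_steps decides when to stop:
# trainer = pl.Trainer(gpus=4, strategy="ddp", max_steps=25_000)
# trainer.fit(MyModel(), loader)
```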