using dp to speed up training #6751
Unanswered
ironv asked this question in DDP / multi-GPU / multi-node
Replies: 1 comment 2 replies
I am training the model shown here using PyTorch Lightning. My train dataloader looks like this. I am trying to use `dp` and see how that helps (see the sketches after the questions below).

Questions:
1. Should `batch_size` be a multiple of `num_workers`?
2. How do I know whether `num_workers` should be increased, and to what number, without running several experiments? I am not seeing much of a difference when I go up to 8.
3. With `ngpu=1` I get a `RuntimeError: CUDA out of memory` error. While it is good that I can now train larger models, I was hoping to see a reduction in run time. Do I need to use a different setting for that?
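For context, a minimal sketch of the kind of train dataloader being described; since the original snippet is not reproduced above, the dataset, batch size, and worker count here are all placeholder assumptions:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset standing in for the real one (not shown in the question).
dataset = TensorDataset(torch.randn(10_000, 128), torch.randint(0, 10, (10_000,)))

# batch_size and num_workers are the two knobs questions 1 and 2 ask about.
train_loader = DataLoader(
    dataset,
    batch_size=256,   # assumed value; question 1 asks whether this must be a multiple of num_workers
    num_workers=8,    # question 2: raising this beyond 8 showed little difference
    shuffle=True,
    pin_memory=True,  # usually recommended when the training loop runs on GPU
)
```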
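And a sketch of what enabling `dp` looked like in the Lightning 1.x Trainer API current at the time of this discussion; `MyModel` is a placeholder, not the actual model from the question, and it reuses `train_loader` from the sketch above:

```python
import pytorch_lightning as pl
import torch
from torch import nn

# Minimal placeholder LightningModule; the real model is not shown in the question.
class MyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(128, 10)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.cross_entropy(self.net(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

# In Lightning 1.x, accelerator="dp" selected torch.nn.DataParallel: a single
# process that splits each batch across the visible GPUs.
trainer = pl.Trainer(gpus=2, accelerator="dp", max_epochs=10)
trainer.fit(MyModel(), train_loader)
```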
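On question 3: `dp` (torch.nn.DataParallel) keeps a single process that scatters each batch across GPUs and gathers the outputs back on one device, so its speedups are often modest; `ddp` is the commonly recommended alternative when run time matters. A sketch of that switch, under the same placeholder assumptions as above:

```python
# Same placeholder model and dataloader as above; ddp runs one process per GPU,
# which usually scales better than dp's single-process, multi-threaded approach.
trainer = pl.Trainer(gpus=2, accelerator="ddp", max_epochs=10)
trainer.fit(MyModel(), train_loader)
```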