How do I set the steps_per_epoch parameter of a lr scheduler in multi-GPU environment? #2149
-
What is your question?
For some learning rate schedulers there is a required steps_per_epoch parameter; one example is the OneCycleLR scheduler. On a CPU or a single GPU, this parameter should be set to the length of the train dataloader. My question is: how should this parameter be set on a multi-GPU machine using DDP? Does it need to be adjusted for the number of GPUs?

What have you tried?
I've tried manually dividing the steps_per_epoch of the OneCycleLR scheduler by the number of GPUs when training on a multi-GPU machine. The LR doesn't seem to follow the expected update pattern, and I think the scheduler may be the source of the problem.

What's your environment?
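For reference, here is a minimal single-process sketch of how steps_per_epoch is typically wired up; the dataset, model, and hyperparameter values are made up for illustration and are not code from this thread:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy data and model, just to make the sketch self-contained.
dataset = TensorDataset(torch.randn(1024, 16), torch.randn(1024, 1))
train_loader = DataLoader(dataset, batch_size=32)

model = torch.nn.Linear(16, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

epochs = 10
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=0.1,
    epochs=epochs,
    # On CPU / single GPU this is just the dataloader length:
    # one scheduler step per optimizer step.
    steps_per_epoch=len(train_loader),
)
```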
-
After some more investigation, it seems that dividing the dataloader size by the number of GPUs is the correct approach. The documentation could be clearer on this, but I'm closing this now.
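A hedged sketch of that approach inside a LightningModule, assuming the module defines its own train_dataloader(); the trainer attribute used for the device count and the dummy model/data below are assumptions for illustration, not code from this thread:

```python
import math

import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class LitModel(pl.LightningModule):
    def __init__(self, max_epochs: int = 10):
        super().__init__()
        self.max_epochs = max_epochs
        self.net = torch.nn.Linear(16, 1)

    def train_dataloader(self):
        # Dummy data to keep the sketch self-contained.
        dataset = TensorDataset(torch.randn(1024, 16), torch.randn(1024, 1))
        return DataLoader(dataset, batch_size=32)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.net(x), y)

    def configure_optimizers(self):
        optimizer = torch.optim.SGD(self.parameters(), lr=0.01)

        # Under DDP, Lightning injects a DistributedSampler, so each rank only
        # performs roughly len(dataloader) / num_gpus optimizer steps per epoch.
        # The attribute name for the device count may differ between versions.
        num_devices = max(1, getattr(self.trainer, "num_devices", 1))
        steps_per_epoch = math.ceil(len(self.train_dataloader()) / num_devices)

        scheduler = torch.optim.lr_scheduler.OneCycleLR(
            optimizer,
            max_lr=0.1,
            epochs=self.max_epochs,
            steps_per_epoch=steps_per_epoch,
        )
        # OneCycleLR is a per-step scheduler, so step it every optimizer step.
        return {
            "optimizer": optimizer,
            "lr_scheduler": {"scheduler": scheduler, "interval": "step"},
        }
```

As a side note, newer Lightning versions expose self.trainer.estimated_stepping_batches, which (if available in your version) can be passed to OneCycleLR as total_steps instead of computing epochs and steps_per_epoch by hand.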