How to properly configure lr schedulers when using fine tuning and DDP? #8724
Unanswered
Scass0807 asked this question in Lightning Trainer API: Trainer, LightningModule, LightningDataModule
Hi, I am currently trying to train an image classifier with ResNet, using a ReduceLROnPlateau scheduler to improve training accuracy. I have split the model into a backbone and a classifier (the last layer). My goal is to freeze the backbone for the first 10 or so epochs, then unfreeze it and train the whole model. The problem I am having is that the scheduler does not seem to behave as expected, and I am getting errors when using DDP.
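Roughly, the freeze/unfreeze schedule I have in mind looks like this (a minimal sketch; my actual code differs, and `self.backbone`, `self.classifier`, and the epoch count are just how I have things organized):

```python
def on_train_epoch_start(self):
    # LightningModule hook: keep the backbone frozen for the first ~10 epochs,
    # then let the whole model train
    unfreeze_backbone = self.current_epoch >= 10
    for p in self.backbone.parameters():
        p.requires_grad = unfreeze_backbone
    for p in self.classifier.parameters():
        p.requires_grad = True
```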
I am trying to lower the learning rate on both the backbone and the classifier by a factor of 10 when Top-1 validation accuracy plateaus. My expectation is that the backbone and classifier learning rates will be equal at all times and will decrease at the same time. However, this does not appear to be the case: as soon as the backbone unfreezes, its learning rate drops by a factor of 10 even though there is no plateau in accuracy (there is, however, a slight increase in loss). I read here that `scheduler.step` is always called on `val_loss`, but that appears to be an old thread, and I then read here that this is no longer the case. Even then, the interval is set to `epoch` and the patience should be high enough that it has no effect.
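For reference, my scheduler wiring is along these lines (a simplified sketch; the metric name `val_top1_acc` and the optimizer hyperparameters are illustrative, not my exact values):

```python
import torch

def configure_optimizers(self):
    # two parameter groups so the backbone and classifier show up separately,
    # but both start at the same learning rate and should step together
    optimizer = torch.optim.SGD(
        [
            {"params": self.backbone.parameters()},
            {"params": self.classifier.parameters()},
        ],
        lr=8e-3,
        momentum=0.9,
    )
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer,
        mode="max",   # monitoring an accuracy, so higher is better
        factor=0.1,   # drop the lr by 10x on a plateau
        patience=10,  # high enough that it should not fire early
    )
    return {
        "optimizer": optimizer,
        "lr_scheduler": {
            "scheduler": scheduler,
            "monitor": "val_top1_acc",  # whatever I log in validation
            "interval": "epoch",
            "frequency": 1,
        },
    }
```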
This happens both with and without DDP. However, with DDP I am getting a warning when the backbone is unfrozen:
This occurs even though I am already using the filter, and I have also tried setting `find_unused_parameters=True`.
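For context, I am launching DDP roughly like this (a sketch assuming the 1.4-era Trainer API; the GPU count and epochs are illustrative):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.plugins import DDPPlugin

trainer = Trainer(
    gpus=2,
    accelerator="ddp",
    # tried both with and without this
    plugins=DDPPlugin(find_unused_parameters=True),
    max_epochs=90,
)
```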
The other thing that happens is that midway through training the scheduler tries to kick in and causes an error that crashes the program. I have posted figures and limited code below to show what is happening:



Both parameter groups should have a learning rate of 8e-3, not 8e-4.
I have already tried splitting the backbone and classifier into multiple optimizers and lr_schedulers (roughly as sketched below), with no success. I know this is really multiple issues; I'm not sure if it should be multiple threads. Any help would be appreciated.
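For completeness, the multiple-optimizer variant I tried looked roughly like this (again a sketch with illustrative names and values):

```python
import torch

def configure_optimizers(self):
    # separate optimizer + plateau scheduler per sub-module
    opt_backbone = torch.optim.SGD(self.backbone.parameters(), lr=8e-3, momentum=0.9)
    opt_classifier = torch.optim.SGD(self.classifier.parameters(), lr=8e-3, momentum=0.9)

    sched_backbone = torch.optim.lr_scheduler.ReduceLROnPlateau(
        opt_backbone, mode="max", factor=0.1, patience=10
    )
    sched_classifier = torch.optim.lr_scheduler.ReduceLROnPlateau(
        opt_classifier, mode="max", factor=0.1, patience=10
    )

    # note: with two optimizers under automatic optimization, training_step
    # also receives an optimizer_idx argument
    return (
        [opt_backbone, opt_classifier],
        [
            {"scheduler": sched_backbone, "monitor": "val_top1_acc", "interval": "epoch"},
            {"scheduler": sched_classifier, "monitor": "val_top1_acc", "interval": "epoch"},
        ],
    )
```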