Error with loading model checkpoint #12399
Hi everyone. I was recently running a Lightning model and saved a checkpoint to store the intermediate results. When I try to open the checkpoint, I get an error that the positional arguments used to initialize the LightningModule are not present. This wouldn't be a big deal, but one of the positional arguments is the encoder (used for BarlowTwins training). I was worried that if I loaded the model checkpoint with an encoder initialized with starting weights, this would overwrite the weight parameters stored in the checkpoint. See the error log and a block of code below. Any suggestions on how I can appropriately load this stored model to resume training?
original model loaded with:
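Roughly (the module, argument names, and layer sizes below are simplified placeholders rather than the exact code):

```python
import torch
from torch import nn
import pytorch_lightning as pl


class BarlowTwins(pl.LightningModule):
    def __init__(self, encoder: nn.Module, encoder_out_dim: int, lr: float = 1e-3):
        super().__init__()
        self.encoder = encoder  # nn.Module passed as a positional argument
        self.projector = nn.Linear(encoder_out_dim, 128)
        self.lr = lr

    def forward(self, x):
        return self.projector(self.encoder(x))

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.lr)


# freshly built encoder with starting weights
encoder = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32 * 3, 512))
model = BarlowTwins(encoder, encoder_out_dim=512)
```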
Replies: 1 comment 3 replies
hey @dmandair !

did you call `self.save_hyperparameters()` inside your `LM.__init__`? Otherwise the hyperparameters won't be saved inside the checkpoint, and you might need to provide them again using `LMModel.load_from_checkpoint(..., encoder=encoder, encoder_out_dim=encoder_out_dim, ...)`.
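For example, something like this (a sketch using the placeholder names from the snippet above; the checkpoint path is a placeholder too):

```python
from torch import nn

# rebuild an encoder with the same architecture; its fresh starting weights don't
# matter, because load_from_checkpoint constructs the module first and then loads
# the checkpoint's state_dict, which already contains the encoder's parameters
# (the encoder is a registered submodule), so the stored weights overwrite them
encoder = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32 * 3, 512))

model = BarlowTwins.load_from_checkpoint(
    "path/to/checkpoint.ckpt",  # placeholder path
    encoder=encoder,
    encoder_out_dim=512,
)
```

so passing a freshly initialized encoder here doesn't overwrite the stored weights; it's the other way around, the checkpoint weights replace the fresh ones.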
also note that, if you are passing an `nn.Module` inside your LM and calling `self.save_hyperparameters()`, it will save that too inside your hparams, which is not a good thing considering that `nn.Module`s are saved inside the checkpoint `state_dict` and might create issues for you. Ideally, you should ignore them using `self.save_hyperparameters(ignore=['encoder'])`.

Check out this PR: #12068
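The pattern would look roughly like this (again with the illustrative names from above, not your exact module):

```python
import pytorch_lightning as pl
from torch import nn


class BarlowTwins(pl.LightningModule):
    def __init__(self, encoder: nn.Module, encoder_out_dim: int, lr: float = 1e-3):
        super().__init__()
        # keep the encoder out of hparams: its weights already live in the
        # checkpoint's state_dict because it is a registered submodule, while
        # encoder_out_dim and lr are saved as hparams and restored automatically
        self.save_hyperparameters(ignore=['encoder'])
        self.encoder = encoder
        self.projector = nn.Linear(encoder_out_dim, 128)
```

with that, only `encoder` needs to be passed back to `load_from_checkpoint(...)`; the other hyperparameters are read from the checkpoint.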