Dictionary accumulate_grad_batches doesn't work with model checkpoint loading #5334

@jonashaag

🐛 Bug

If you pass Trainer(accumulate_grad_batches={5: 2}) and resume from a model checkpoint using resume_from_checkpoint, checkpoint loading crashes with:

Traceback (most recent call last):
  ...
    trainer.fit(system)
  File "/home/jo/.venvs/au/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 470, in fit
    results = self.accelerator_backend.train()
  File "/home/jo/.venvs/au/lib/python3.8/site-packages/pytorch_lightning/accelerators/gpu_accelerator.py", line 63, in train
    self.trainer.train_loop.setup_training(model)
  File "/home/jo/.venvs/au/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 175, in setup_training
    self.trainer.checkpoint_connector.restore_weights(model)
  File "/home/jo/.venvs/au/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 64, in restore_weights
    self.restore(self.trainer.resume_from_checkpoint, on_gpu=self.trainer.on_gpu)
  File "/home/jo/.venvs/au/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 102, in restore
    self.restore_training_state(checkpoint)
  File "/home/jo/.venvs/au/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 164, in restore_training_state
    expected_steps = self.trainer.num_training_batches / n_accum
TypeError: unsupported operand type(s) for /: 'int' and 'dict'

The relevant code assumes accumulate_grad_batches is an integer, so dividing num_training_batches by a dict raises the TypeError shown in the traceback:

https://github.com/PyTorchLightning/pytorch-lightning/blob/d20fd8e5ab1a52747fee2cd53290a679d8b726d0/pytorch_lightning/trainer/connectors/checkpoint_connector.py#L156-L157
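To make the failure concrete, here is a minimal sketch in plain Python that reproduces the failing division and shows one way a dict schedule could be reduced to an integer before dividing. The helper resolve_accum_factor is hypothetical and is not part of pytorch-lightning. It assumes each dict key is the epoch from which the given factor applies, and that the factor is 1 before the first key. Whether epochs count from 0 or 1 is left to the caller, and the batch count of 100 is made up for illustration.

```python
from typing import Dict, Union


def resolve_accum_factor(
    accumulate_grad_batches: Union[int, Dict[int, int]], epoch: int
) -> int:
    """Hypothetical helper (not a pytorch-lightning API).

    Returns the accumulation factor in effect at `epoch`.
    Assumption: a dict maps "epoch from which the factor applies" to the
    factor, and the factor is 1 before the first scheduled epoch.
    """
    if isinstance(accumulate_grad_batches, int):
        return accumulate_grad_batches
    factor = 1
    # Walk the schedule in epoch order; the last key <= epoch wins.
    for start_epoch in sorted(accumulate_grad_batches):
        if epoch >= start_epoch:
            factor = accumulate_grad_batches[start_epoch]
    return factor


num_training_batches = 100  # illustrative value
schedule = {5: 2}           # same schedule as in the report

# The expression from the traceback, reduced to plain Python:
try:
    num_training_batches / schedule
except TypeError as err:
    print(err)  # unsupported operand type(s) for /: 'int' and 'dict'

# Resolving the schedule to an int first makes the division well defined.
n_accum = resolve_accum_factor(schedule, epoch=7)
expected_steps = num_training_batches / n_accum
print(expected_steps)  # 50.0
```

Note that with a schedule the factor changes over training, so a check that compares the restored global step against a single expected_steps value may need more than this per-epoch lookup. The sketch only removes the type error.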

Environment

* Packages:
	- numpy:             1.19.4
	- pyTorch_debug:     True
	- pyTorch_version:   1.7.0+cu110
	- pytorch-lightning: 1.1.0
	- tqdm:              4.51.0
* System:
	- OS:                Linux
	- architecture:
		- 64bit
		- ELF
	- processor:
	- python:            3.8.5
	- version:           #1 SMP PREEMPT Tue Dec 22 08:14:42 UTC 2020

Labels: bug, help wanted, priority: 1