Callback Checkpointing never called #13800
-
I am trying to save my Callback state along with my model at the end of each epoch. I put print statements/breakpoints in the `state_dict` and `load_state_dict` methods, but they are never called during training, and the `Counter` state is not recovered on resume. I tried checkpointing both with a `ModelCheckpoint` callback in the trainer's callback list and with `enable_checkpointing=True` (with and without the `ModelCheckpoint`), and neither works. For reference, other state is restored correctly (the right epoch, the model weights, etc.). Here is a minimal working example:
```python
import pytorch_lightning as pl
import torch
from pytorch_lightning import Callback, Trainer
from pytorch_lightning.callbacks import ModelCheckpoint
from torch.nn import Parameter
from torch.optim import Adam
from torch.utils.data import DataLoader, Dataset


class Dummy_dataset(Dataset):
    def __init__(self, length):
        self.length = length

    def __len__(self):
        return self.length

    def __getitem__(self, item):
        return 1


class Dummy_Model(pl.LightningModule):
    def __init__(self):
        super().__init__()
        # Dummy parameter so the optimizer has something to hold on to.
        self.cpt = Parameter(torch.tensor(0.), requires_grad=False)

    def configure_optimizers(self):
        return Adam([
            {'params': self.cpt},
        ])

    def forward(self, data):
        return data

    def training_step(self, data, batch_idx):
        self.cpt += batch_idx
        print(f'\n{self.current_epoch=} {self.cpt=}')
        return None  # no loss returned: the optimization step is skipped, which is fine here

    def validation_step(self, data, batch_idx):
        pass

    def test_step(self, data, batch_idx):
        pass


class Counter(Callback):
    """Callback whose state should be saved to / restored from checkpoints."""

    def __init__(self):
        self.state = {"epochs": 0, "batches": 0}

    def on_train_epoch_end(self, *args, **kwargs):
        self.state['epochs'] += 1
        print(self.state)

    def on_train_batch_end(self, *args, **kwargs):
        self.state['batches'] += 1

    def load_state_dict(self, state_dict):
        print('load_state_dict')  # never printed
        self.state.update(state_dict)

    def state_dict(self):
        print('state_dict')  # never printed
        return self.state.copy()


train_dataloader = DataLoader(Dummy_dataset(length=10), batch_size=2)
valid_dataloader = DataLoader(Dummy_dataset(length=10), batch_size=2)

model = Dummy_Model()

checkpoint_callback = ModelCheckpoint(
    dirpath='./',
    monitor=None,
    verbose=True,
    save_last=True,
    every_n_epochs=1,
)

loading = False  # set to True for the second (resuming) run

trainer = Trainer(
    max_epochs=300 if loading else 50,
    callbacks=[Counter(), checkpoint_callback],
    enable_checkpointing=True,
)
trainer.fit(
    model=model,
    train_dataloaders=train_dataloader,
    val_dataloaders=valid_dataloader,
    ckpt_path='last.ckpt' if loading else None,
)
```
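For completeness, in case this turns out to be a version issue: on Lightning releases before 1.6, callback state was persisted through differently named hooks. Here is a minimal sketch of an equivalent `Counter` using that legacy API (the signatures below assume a 1.4/1.5-style install):

```python
from pytorch_lightning import Callback


class LegacyCounter(Callback):
    """Sketch of callback-state persistence on pre-1.6 Lightning (1.4/1.5 signatures)."""

    def __init__(self):
        self.state = {"epochs": 0, "batches": 0}

    def on_save_checkpoint(self, trainer, pl_module, checkpoint):
        # Pre-1.6: the dict returned here is stored in the checkpoint
        # under this callback's state key.
        return self.state.copy()

    def on_load_checkpoint(self, trainer, pl_module, callback_state):
        # Pre-1.6: receives the dict that on_save_checkpoint returned.
        self.state.update(callback_state)
```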
Replies: 2 comments 4 replies
-
Your
-
What's your Lightning version?
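You can check with something like this; as far as I know, `Callback.state_dict()` / `load_state_dict()` were only added in v1.6.0, so on older releases the Trainer never calls them:

```python
import pytorch_lightning as pl

# Callback.state_dict / load_state_dict are only invoked from v1.6.0 onward,
# so the first thing to rule out is an older install.
print(pl.__version__)
```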