How to prevent more than 'save_top_k' checkpoints from being saved across (interrupted) training runs? #13020
Unanswered
xsys-technology asked this question in Lightning Trainer API: Trainer, LightningModule, LightningDataModule
- When trainer.fit() runs to completion, the 'save_top_k' checkpoint callback (below) behaves as expected: it maintains and saves only the 2 best-scoring (minimum 'val_loss') checkpoints, which is exactly what I want.
  When training is interrupted (e.g. by a keyboard interrupt) before trainer.fit() completes and I have to resume training from the best checkpoint, the 'save_top_k' checkpoint callback unfortunately seems to forget the 'top_k' checkpoints it was maintaining during the previous trainer.fit() run, and it saves 'top_k' more checkpoints in each resumed run.
  Does anyone know how to ensure that only 'top_k' checkpoints are maintained across trainer.fit() interruptions like this, so that even if I start and stop trainer.fit() many times, only 2 checkpoints are stored in 'ckpt_dir'?
  Fyi, my lightning version is "1.5.9".
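A minimal sketch of the kind of callback configuration being described (the exact snippet from the question is not shown here); the directory name and filename pattern are assumptions, using the Lightning 1.5.x API:

```python
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

# 'ckpt_dir' is taken from the question; the filename pattern is an assumption.
ckpt_dir = "ckpt_dir/"

# Keep only the 2 checkpoints with the lowest validation loss.
checkpoint_callback = ModelCheckpoint(
    dirpath=ckpt_dir,
    filename="{epoch}-{val_loss:.4f}",
    monitor="val_loss",
    mode="min",
    save_top_k=2,
)

trainer = pl.Trainer(max_epochs=100, callbacks=[checkpoint_callback])

# First run:
#     trainer.fit(model)
# After an interruption, resume from a saved checkpoint, e.g. on 1.5.x:
#     trainer.fit(model, ckpt_path="ckpt_dir/epoch=7-val_loss=0.1234.ckpt")
# The issue described above is that each resumed run then writes
# 'save_top_k' additional files into ckpt_dir instead of continuing the
# bookkeeping from the previous run.
```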
Replies: 1 comment
- For anyone interested, I believe this behavior was caused by my changing 'every_n_epochs' between training runs. There is some logic related to this inside the checkpoint connector code. Since I started using the same value for 'every_n_epochs' across runs, the issue hasn't resurfaced.
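A minimal sketch of that workaround, with hypothetical values; the key point is that the ModelCheckpoint arguments, in particular 'every_n_epochs', stay identical between the original run and any resumed run:

```python
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

# Hypothetical values; what matters is that they do NOT change
# between the original run and any resumed run.
CKPT_DIR = "ckpt_dir/"
EVERY_N_EPOCHS = 1

checkpoint_callback = ModelCheckpoint(
    dirpath=CKPT_DIR,
    monitor="val_loss",
    mode="min",
    save_top_k=2,
    every_n_epochs=EVERY_N_EPOCHS,  # keep this identical across runs
)

trainer = pl.Trainer(max_epochs=100, callbacks=[checkpoint_callback])

# Resume with the exact same callback configuration, e.g.:
#     trainer.fit(model, ckpt_path="ckpt_dir/epoch=7-val_loss=0.1234.ckpt")
```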