How to prevent more than 'save_top_k' checkpoints from being saved across (interrupted) training runs? #13020
Unanswered
xsys-technology asked this question in Lightning Trainer API: Trainer, LightningModule, LightningDataModule
- When trainer.fit() runs to completion, the 'save_top_k' checkpoint callback (below) behaves as expected: it maintains and saves only the 2 best-scoring (minimum 'val_loss') checkpoints, which is exactly what I want.
  When training is interrupted (e.g. by a keyboard interrupt) before trainer.fit() completes and I have to resume training from the best checkpoint, the 'save_top_k' checkpoint callback unfortunately seems to forget the 'top_k' checkpoints it was maintaining during the previous trainer.fit() run, and it saves 'top_k' more checkpoints in each resumed run.
  Does anyone know how to ensure that only 'top_k' checkpoints are maintained across trainer.fit() interruptions like this, so that even if I start and stop trainer.fit() many times, only 2 checkpoints are stored in 'ckpt_dir'?
  Fyi, my lightning version is "1.5.9".
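A minimal sketch of the kind of callback configuration being described (the exact snippet from the question is not shown here); the directory name and filename pattern are assumptions, using the Lightning 1.5.x API:

```python
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

# 'ckpt_dir' is taken from the question; the filename pattern is an assumption.
ckpt_dir = "ckpt_dir/"

# Keep only the 2 checkpoints with the lowest validation loss.
checkpoint_callback = ModelCheckpoint(
    dirpath=ckpt_dir,
    filename="{epoch}-{val_loss:.4f}",
    monitor="val_loss",
    mode="min",
    save_top_k=2,
)

trainer = pl.Trainer(max_epochs=100, callbacks=[checkpoint_callback])

# First run:
#     trainer.fit(model)
# After an interruption, resume from a saved checkpoint, e.g. on 1.5.x:
#     trainer.fit(model, ckpt_path="ckpt_dir/epoch=7-val_loss=0.1234.ckpt")
# The issue described above is that each resumed run then writes
# 'save_top_k' additional files into ckpt_dir instead of continuing the
# bookkeeping from the previous run.
```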
Replies: 1 comment
- For anyone interested, I believe this behavior was caused by my changing 'every_n_epochs' between training runs. There is some logic related to this inside the checkpoint connector code. Since I started using the same value for 'every_n_epochs' across runs, the issue hasn't resurfaced.
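A minimal sketch of that workaround, with hypothetical values; the key point is that the ModelCheckpoint arguments, in particular 'every_n_epochs', stay identical between the original run and any resumed run:

```python
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

# Hypothetical values; what matters is that they do NOT change
# between the original run and any resumed run.
CKPT_DIR = "ckpt_dir/"
EVERY_N_EPOCHS = 1

checkpoint_callback = ModelCheckpoint(
    dirpath=CKPT_DIR,
    monitor="val_loss",
    mode="min",
    save_top_k=2,
    every_n_epochs=EVERY_N_EPOCHS,  # keep this identical across runs
)

trainer = pl.Trainer(max_epochs=100, callbacks=[checkpoint_callback])

# Resume with the exact same callback configuration, e.g.:
#     trainer.fit(model, ckpt_path="ckpt_dir/epoch=7-val_loss=0.1234.ckpt")
```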