Is there a way to save a checkpoint with a time interval and also at the end of epochs? #13226
Unanswered
fishbotics
asked this question in
Lightning Trainer API: Trainer, LightningModule, LightningDataModule
Replies: 1 comment 2 replies
Set val_check_interval=0.1 to run validation 10 times during each epoch.
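For reference, here is a minimal sketch of what that suggestion might look like. It assumes the `pytorch_lightning` package name (newer releases also ship as `lightning.pytorch`) and relies on my understanding that a default `ModelCheckpoint` saves after each validation run when `val_check_interval` is below 1.0; check this against your installed version. The helper name `build_trainer` is made up for illustration.

```python
# Fraction of a training epoch between validation runs (from the reply above).
VAL_CHECK_INTERVAL = 0.1

# Epoch length reported in the question, in hours.
EPOCH_HOURS = 40

# Approximate wall-clock time between validation runs for such an epoch.
hours_between_checks = EPOCH_HOURS * VAL_CHECK_INTERVAL


def build_trainer():
    """Hypothetical helper: a Trainer that validates 10 times per epoch."""
    # Imported lazily so the constants above can be inspected without Lightning.
    import pytorch_lightning as pl
    from pytorch_lightning.callbacks import ModelCheckpoint

    # Default ModelCheckpoint: no monitor, keeps only the latest checkpoint.
    return pl.Trainer(
        val_check_interval=VAL_CHECK_INTERVAL,
        callbacks=[ModelCheckpoint()],
    )
```

With a 40-hour epoch this works out to a validation run (and, under the assumption above, a checkpoint) roughly every 4 hours, at the cost of running the validation loop each time.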
Looking at the code, it seems I have to choose between checkpointing on a time interval and checkpointing at the end of every epoch. My epochs are very long (40 hours), so I need to checkpoint more often than once per epoch. But I'd also like to be able to resume training if a job dies, and that seems to be possible only with fault-tolerant training or with a checkpoint saved at the end of an epoch. I'd like to do the latter. Any ideas?
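One possible workaround, sketched below: a single `ModelCheckpoint` accepts only one of `train_time_interval`, `every_n_train_steps`, or `every_n_epochs`, but the Trainer can take more than one `ModelCheckpoint` callback, so one can handle the time interval and another the epoch end. The 30-minute interval, the directory names, the filenames, and the helper `build_checkpoint_callbacks` are all assumptions for illustration, not something from this thread.

```python
from datetime import timedelta

# Callback 1: save on a wall-clock interval during training.
# With no monitor set, save_top_k=1 keeps only the most recent file.
TIME_CKPT_KWARGS = dict(
    dirpath="checkpoints/time",       # assumed path
    filename="time-{epoch}-{step}",   # assumed naming scheme
    train_time_interval=timedelta(minutes=30),  # assumed interval
    save_top_k=1,
)

# Callback 2: save at the end of every training epoch and keep all of them.
EPOCH_CKPT_KWARGS = dict(
    dirpath="checkpoints/epoch",      # assumed path
    filename="epoch-{epoch}",         # assumed naming scheme
    every_n_epochs=1,
    save_on_train_epoch_end=True,
    save_top_k=-1,
)


def build_checkpoint_callbacks():
    """Hypothetical helper: build both checkpoint callbacks."""
    # Imported lazily so the settings above can be inspected without Lightning.
    from pytorch_lightning.callbacks import ModelCheckpoint

    return [
        ModelCheckpoint(**TIME_CKPT_KWARGS),
        ModelCheckpoint(**EPOCH_CKPT_KWARGS),
    ]
```

The list would then be passed as `Trainer(callbacks=build_checkpoint_callbacks())`. The two callbacks write to separate directories so they do not overwrite each other's files. Whether a mid-epoch checkpoint resumes at the exact batch depends on the Lightning version and the fault-tolerant settings, so treat the time-based checkpoints as a fallback and the epoch-end ones as the clean resume points.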