About resume training #19713
Unanswered
XLR-man
asked this question in
Lightning Trainer API: Trainer, LightningModule, LightningDataModule
My version of PyTorch Lightning is 1.6.
First I trained the model for 100,000 steps (12,500 steps per epoch, so 8 epochs), and the last checkpoint was saved. Now I want to continue training for 5,000 more steps, but with some regularization added to the loss function. So I changed the loss computation in `training_step`, increased the Trainer's `max_steps`, and set the `resume_from_checkpoint` path (I don't know if this is the right way to do it).
But when I rerun the training, it does not continue: it starts at epoch=0, and the loss is nan. Shouldn't it start at epoch=8? Does this mean the training was not successfully resumed?
Also, a new version directory is generated for each training run.
How can I change the code to meet my needs?