About resume training #19713
Unanswered
XLR-man
asked this question in
Lightning Trainer API: Trainer, LightningModule, LightningDataModule
My version of PyTorch Lightning is 1.6.
First I trained the model for 100,000 steps (12,500 steps per epoch, so 8 epochs), and the last checkpoint was saved. Now I want to continue training for 5,000 more steps, but with some regularization added to the loss function. So I changed the loss computation in `training_step`, increased the Trainer's `max_steps`, and set the `resume_from_checkpoint` path (I don't know if this is the right way to do it).
But when I rerun the training, it does not continue: it starts at epoch=0, and the loss is nan. Shouldn't it start at epoch=8? Does this mean the training was not successfully resumed?
Also, a new version directory is generated for each training run.
How can I change the code to meet my needs?