Resuming Training #6740
Unanswered
shtoshni asked this question in Lightning Trainer API: Trainer, LightningModule, LightningDataModule
Replies: 2 comments 1 reply
-
Hi, I'm not sure I understand. Do you want the Trainer to resume from a checkpoint and then stop again immediately if early stopping had already been triggered? If so, early stopping is currently epoch-based, so training has to run until the end of the epoch before the criterion is checked. For per-step early stopping evaluation, please open a feature request here on GitHub. Happy to take a look.
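To illustrate what "epoch-based" means here, below is a toy sketch of a patience counter that is only consulted at the end of each epoch. It is not Lightning's implementation; `PatienceTracker`, `on_epoch_end`, and `epochs_run` are hypothetical names made up for this example, and it assumes the monitored metric is a loss where lower is better.

```python
class PatienceTracker:
    """Toy epoch-based early stopping (hypothetical, not Lightning's class)."""

    def __init__(self, patience: int, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")  # best (lowest) loss seen so far
        self.wait = 0             # epochs since the last improvement

    def on_epoch_end(self, val_loss: float) -> bool:
        """Update the counter; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.wait = 0
        else:
            self.wait += 1
        return self.wait >= self.patience


def epochs_run(losses, patience):
    """Return how many epochs run before the tracker asks to stop."""
    tracker = PatienceTracker(patience)
    for epoch, loss in enumerate(losses, start=1):
        # The stopping check happens only here, once per finished epoch.
        if tracker.on_epoch_end(loss):
            return epoch
    return len(losses)
```

Because the check runs only when an epoch finishes, a run resumed with the counter already at the patience limit would still complete one more epoch before stopping, which matches the behaviour described in the question.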
-
@shtoshni92 Just FYI, I've submitted #8278 regarding this, in case you want to subscribe to it.
-
Hi,
I have a question about best practices for resuming training after it was aborted at some point. Right now I set the `resume_from_checkpoint` argument of Trainer to the last checkpoint's location. This doesn't work when early stopping is what stopped the training, because the trainer starts a new epoch without checking the early stopping criteria when it resumes. I'm wondering whether this is by design or whether there's a way around it. I'm currently using a very hacky workaround: separately loading the checkpoint, which has the `callback_state` for `EarlyStopping`, and checking whether the `wait` number is >= `patience`. `EarlyStopping` does have an `on_load_checkpoint` callback, but that function lacks the `trainer` argument that would allow it to stop training.

Thanks,
Shubham
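The workaround described above could look roughly like the following sketch. The checkpoint layout is an assumption: the top-level key (`"callbacks"` here), the way the `EarlyStopping` entry is keyed, and the field names `"wait_count"` and `"patience"` vary between Lightning versions, so inspect your own checkpoint first. `find_early_stopping_state` and `early_stopping_exhausted` are hypothetical helper names.

```python
from typing import Any, Mapping, Optional


def find_early_stopping_state(
    checkpoint: Mapping[str, Any],
    callbacks_key: str = "callbacks",  # assumed key; check your checkpoint
) -> Optional[Mapping[str, Any]]:
    """Return the saved EarlyStopping state, or None if there is none.

    Assumes callback states are stored in a dict keyed either by the
    callback class or by a string that contains its name.
    """
    for key, state in (checkpoint.get(callbacks_key) or {}).items():
        name = key if isinstance(key, str) else getattr(key, "__name__", str(key))
        if "EarlyStopping" in name:
            return state
    return None


def early_stopping_exhausted(
    checkpoint: Mapping[str, Any],
    wait_key: str = "wait_count",    # assumed field name
    patience_key: str = "patience",  # assumed field name
) -> bool:
    """True if the saved wait counter had already reached the patience."""
    state = find_early_stopping_state(checkpoint)
    if state is None or wait_key not in state or patience_key not in state:
        return False  # nothing to go on, so do not block the resume
    return state[wait_key] >= state[patience_key]


# Usage sketch (path and surrounding objects are placeholders):
#   import torch
#   ckpt = torch.load("path/to/last.ckpt", map_location="cpu")
#   if not early_stopping_exhausted(ckpt):
#       trainer.fit(model)  # resume only if early stopping had not fired
```

Returning `False` when the state is missing keeps the helper conservative: it only skips the resume when the checkpoint clearly shows that patience had run out.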