How to rollback optimizer step (i.e. reload best checkpoint) during training? #13093
Unanswered
xsys-technology
asked this question in Lightning Trainer API: Trainer, LightningModule, LightningDataModule
Replies: 0 comments
During training, if the optimizer takes a really bad step that sends the validation loss through the roof, how can one roll back that step and reload the best model checkpoint (and reset the logging/progress-bar results and metrics so that val_loss remains accurate)?
I imagine the preferred way to do this is via a callback hook (e.g. `on_train_epoch_start`), but I'm not sure how to do this properly and safely when the model is dispatched and sharded across GPUs.
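For context, the core pattern being asked about can be sketched framework-agnostically: snapshot the best state when validation improves, and restore that snapshot when the loss blows past a tolerance. Below is a minimal, hedged sketch; the class name `RollbackGuard`, the `tolerance` factor, and the plain-dict "state" are illustrative assumptions, not Lightning API. In a real setup, `state` would be the model's (and optimizer's) `state_dict`s.

```python
import copy

class RollbackGuard:
    """Keep a snapshot of the best-so-far state; restore it when val loss blows up.

    Illustrative sketch only: in PyTorch/Lightning, `state` would hold the
    model and optimizer state_dicts rather than a plain dict.
    """
    def __init__(self, tolerance=2.0):
        self.tolerance = tolerance          # rollback if loss > tolerance * best
        self.best_loss = float("inf")
        self.best_state = None

    def step(self, val_loss, state):
        """Return (state_to_use, rolled_back)."""
        # New best: snapshot it and continue with the current state.
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.best_state = copy.deepcopy(state)
            return state, False
        # Loss exploded past the tolerance: hand back the snapshot instead.
        if self.best_state is not None and val_loss > self.tolerance * self.best_loss:
            return copy.deepcopy(self.best_state), True
        # Worse but within tolerance: keep going without rollback.
        return state, False

guard = RollbackGuard(tolerance=2.0)
state, rolled = guard.step(1.0, {"w": 0.5})   # first val loss: becomes best
state, rolled = guard.step(0.8, {"w": 0.4})   # improved: new snapshot taken
state, rolled = guard.step(5.0, {"w": 9.9})   # exploded: snapshot restored
print(state, rolled)                          # {'w': 0.4} True
```

In Lightning, this logic would most naturally live in a `Callback` (e.g. in `on_validation_end`, reading the monitored metric from `trainer.callback_metrics`), restoring weights and optimizer state from a saved checkpoint. Note that under sharded/distributed strategies the restore must happen consistently on every rank, which is the tricky part the question raises.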