How to rollback optimizer step (i.e. reload best checkpoint) during training? #13093
Unanswered
xsys-technology
asked this question in Lightning Trainer API: Trainer, LightningModule, LightningDataModule
Replies: 0 comments
During training, if the optimizer takes a really bad step that sends the validation loss through the roof, how can one roll back that step and reload the best model checkpoint (and reset the logging/progress-bar results and metrics so that val_loss remains accurate)?
I imagine the preferred way to do this is via a callback hook (e.g. `on_train_epoch_start`), but I'm not sure how to do this properly and safely when the model is dispatched and sharded across GPUs.
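For context, the core pattern being asked about can be sketched framework-agnostically: snapshot the best state when validation improves, and restore that snapshot when the loss blows past a tolerance. Below is a minimal, hedged sketch; the class name `RollbackGuard`, the `tolerance` factor, and the plain-dict "state" are illustrative assumptions, not Lightning API. In a real setup, `state` would be the model's (and optimizer's) `state_dict`s.

```python
import copy

class RollbackGuard:
    """Keep a snapshot of the best-so-far state; restore it when val loss blows up.

    Illustrative sketch only: in PyTorch/Lightning, `state` would hold the
    model and optimizer state_dicts rather than a plain dict.
    """
    def __init__(self, tolerance=2.0):
        self.tolerance = tolerance          # rollback if loss > tolerance * best
        self.best_loss = float("inf")
        self.best_state = None

    def step(self, val_loss, state):
        """Return (state_to_use, rolled_back)."""
        # New best: snapshot it and continue with the current state.
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.best_state = copy.deepcopy(state)
            return state, False
        # Loss exploded past the tolerance: hand back the snapshot instead.
        if self.best_state is not None and val_loss > self.tolerance * self.best_loss:
            return copy.deepcopy(self.best_state), True
        # Worse but within tolerance: keep going without rollback.
        return state, False

guard = RollbackGuard(tolerance=2.0)
state, rolled = guard.step(1.0, {"w": 0.5})   # first val loss: becomes best
state, rolled = guard.step(0.8, {"w": 0.4})   # improved: new snapshot taken
state, rolled = guard.step(5.0, {"w": 9.9})   # exploded: snapshot restored
print(state, rolled)                          # {'w': 0.4} True
```

In Lightning, this logic would most naturally live in a `Callback` (e.g. in `on_validation_end`, reading the monitored metric from `trainer.callback_metrics`), restoring weights and optimizer state from a saved checkpoint. Note that under sharded/distributed strategies the restore must happen consistently on every rank, which is the tricky part the question raises.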