Recommended way to load previous checkpoints from trainer callbacks? #13021
Unanswered
xsys-technology asked this question in Lightning Trainer API: Trainer, LightningModule, LightningDataModule
Replies: 0 comments
I'd like to reload a previous checkpoint when some condition is met, from within a callback that fires when a training epoch begins:
What is the recommended way to do this, given that the model is sharded across multiple GPUs when on_train_epoch_start is called? I have a hunch that TorchCheckpointIO can probably handle this, but I'm not sure about getting the data mapping correct.
I'm using the DDPPlugin strategy:
I believe the trainer's checkpoint connector can use TorchCheckpointIO via restore(), something like:
Is this safe to do, considering the state of the optimizer(s), LR scheduler(s), etc.?
I imagine the right way to do this looks at least something like this.