Combining periodic checkpointing and fault tolerant training in an environment with short shutdown warnings #15148
Unanswered
SamPruden asked this question in Lightning Trainer API: Trainer, LightningModule, LightningDataModule
Replies: 1 comment
-
There are overrides and callbacks everywhere and I haven't traced them all, but at first glance this looks safe.
-
Hi,
I'm looking at using Fault Tolerant Training (FTT), but the spot instance I'm running on gives only a few seconds of shutdown warning, and I'm not sure that's enough time for FTT to save its checkpoint. I'd like to combine it with periodic checkpointing, so that if an FTT checkpoint doesn't get saved in time I can still restore to a recent point.
What's the best way to go about this?
One option I've considered is manually sending an interrupt signal every five minutes and letting FTT handle the checkpointing. Does that make sense?
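For comparison, a timer-based trigger polled from inside the training loop avoids signals entirely. The following is a minimal sketch only: `PeriodicTrigger` and its `clock` parameter are hypothetical names for illustration, not part of Lightning. Lightning's `ModelCheckpoint` callback also accepts a `train_time_interval` argument that may cover the same need, though how it interacts with FTT is not something this sketch addresses.

```python
import time
from typing import Callable


class PeriodicTrigger:
    """Hypothetical helper: reports True at most once per `interval_s` seconds.

    The clock is injectable so the behaviour can be checked without waiting.
    """

    def __init__(self, interval_s: float, clock: Callable[[], float] = time.monotonic):
        self.interval_s = interval_s
        self.clock = clock
        self.last_fired = clock()  # start counting from construction time

    def due(self) -> bool:
        """Return True if at least `interval_s` seconds passed since the last firing."""
        now = self.clock()
        if now - self.last_fired >= self.interval_s:
            self.last_fired = now
            return True
        return False
```

In a training loop this would be polled at a safe point, for example at the end of each batch, and a checkpoint saved whenever `due()` returns True. Polling at a safe point means the save never starts in the middle of an optimizer step, which a raw signal handler cannot guarantee on its own.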
Is there a danger that an FTT checkpoint may be only partially written to disk during a hurried shutdown, leaving the saved state corrupted?
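One general way to guard against partially written files, independent of what Lightning does internally, is to write to a temporary file and rename it into place. This is a minimal sketch under that assumption: `atomic_write_bytes` is a hypothetical helper, and whether FTT already writes its checkpoints this way is not confirmed here.

```python
import os
import tempfile


def atomic_write_bytes(path: str, data: bytes) -> None:
    """Hypothetical helper: write `data` to `path` so readers never see a partial file.

    The data goes to a temporary file in the same directory, is flushed to disk,
    and is then renamed over the target. On POSIX, os.replace is atomic when the
    source and destination are on the same filesystem.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-ckpt-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If the process is killed mid-write, the previous checkpoint at `path` stays intact and only a stray temporary file is left behind. The sketch does not fsync the containing directory, so durability of the rename itself across a power loss is not guaranteed.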