How to enable fault-tolerant training with Lightning 2.0+ #17708
Unanswered
amorehead asked this question in Lightning Trainer API: Trainer, LightningModule, LightningDataModule
Replies: 0 comments
Hello. I have a (hopefully) simple question. How can one enable fault-tolerant training in version 2.0+ of PyTorch Lightning? I recently migrated my Lightning 1.7 code to 2.0.1 and immediately noticed that the Lightning docs barely mention fault-tolerant training. How can I enable this functionality when training a model on, e.g., Google Kubernetes Engine (GKE)?
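For context, here is a minimal sketch of one common workaround, not an answer confirmed in this thread: write checkpoints periodically to storage that survives a pod restart, and resume from the newest one when the job starts again. The helper names `find_resume_checkpoint` and `fit_with_resume`, the 500-step interval, and the use of the `lightning` package (rather than `pytorch_lightning`) are all assumptions made for illustration. This gives checkpoint-level resumption only; it does not restore mid-epoch dataloader state.

```python
from pathlib import Path
from typing import Optional


def find_resume_checkpoint(ckpt_dir: str) -> Optional[str]:
    """Hypothetical helper: return the most recently modified .ckpt file, or None."""
    directory = Path(ckpt_dir)
    if not directory.is_dir():
        return None
    # Sort by modification time so the last element is the newest checkpoint.
    candidates = sorted(directory.glob("*.ckpt"), key=lambda p: p.stat().st_mtime)
    return str(candidates[-1]) if candidates else None


def fit_with_resume(model, datamodule, ckpt_dir: str) -> None:
    """Hypothetical wrapper: checkpoint periodically and resume if a checkpoint exists."""
    # Imported lazily so the helper above works without Lightning installed.
    # Assumption: the unified `lightning` package is installed.
    from lightning.pytorch import Trainer
    from lightning.pytorch.callbacks import ModelCheckpoint

    # Assumption: ckpt_dir is on a volume that persists across restarts.
    checkpoint_cb = ModelCheckpoint(
        dirpath=ckpt_dir,
        save_last=True,            # also keep a rolling "last" checkpoint
        every_n_train_steps=500,   # illustrative interval, tune as needed
    )
    trainer = Trainer(callbacks=[checkpoint_cb])
    # ckpt_path=None starts from scratch; a path restores model, optimizer, and loop state.
    trainer.fit(model, datamodule=datamodule, ckpt_path=find_resume_checkpoint(ckpt_dir))
```

On a first run the directory is empty, so `find_resume_checkpoint` returns `None` and training starts fresh. After a preemption, the restarted job picks up the newest checkpoint in the same directory.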