How to set PL to work on a SLURM cluster with preemption #16928
Unanswered
knoriy asked this question in Lightning Trainer API: Trainer, LightningModule, LightningDataModule
Replies: 0 comments
Hi,
I have a PL model running on a SLURM cluster. The preemptible nodes can be killed to allow higher-priority jobs to run. Currently, PL seems to restart the training from the beginning. Is it possible to change this behaviour so that PL saves a checkpoint before the job is killed and automatically resumes training from that checkpoint after the job is requeued?
I have read the following, but it is not very clear to me. I would appreciate it if someone could advise me on this.
https://pytorch-lightning.readthedocs.io/en/stable/clouds/cluster_advanced.html#run-on-a-slurm-managed-cluster:~:text=sbatch%20submit.sh-,Enable%20auto%20wall%2Dtime%20resubmitions,-When%20you%20use
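From what I can tell from that section, the setup would look roughly like the sketch below. This is just my understanding, not a confirmed recipe: it assumes PyTorch Lightning >= 1.6 (where `SLURMEnvironment(auto_requeue=True)` is available), and `TinyModel` and the checkpoint path are placeholders I made up. The script is intended to be launched with `sbatch`, not run standalone.

```python
# Sketch of SLURM auto-requeue with PL (assumptions: PL >= 1.6, launched via sbatch).
# The submit script must ask SLURM to send SIGUSR1 before killing the job, e.g.:
#   #SBATCH --signal=SIGUSR1@90
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl
from pytorch_lightning.plugins.environments import SLURMEnvironment


class TinyModel(pl.LightningModule):
    """Placeholder model; substitute the real LightningModule."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=1e-3)


train_loader = DataLoader(
    TensorDataset(torch.randn(256, 32), torch.randn(256, 1)), batch_size=16
)

trainer = pl.Trainer(
    max_epochs=10,
    # On SIGUSR1, Lightning saves an "hpc" checkpoint and requeues the job.
    plugins=[SLURMEnvironment(auto_requeue=True)],
    # Hypothetical path; must be on shared storage so the requeued job sees it.
    default_root_dir="/shared/checkpoints/my_run",
)
trainer.fit(TinyModel(), train_loader)
```

If I understand the docs correctly, when the requeued job starts, the Trainer looks for the hpc checkpoint under `default_root_dir` and resumes from it automatically, so no explicit `ckpt_path` should be needed. Is this the intended way to handle preemption, or is something else required?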
Thank you very much for your time.