HPC resubmit checkpoint location #7804
zhengzangw asked this question in Lightning Trainer API: Trainer, LightningModule, LightningDataModule
On a Slurm environment with a time limit, the program saves `hpc_ckpt_{num}.ckpt`. If `weights_save_path` is not specified and the default logging is used, the checkpoints from several jobs are all saved in the root directory of the project, which causes loading errors. If instead the logging is set to a specific directory (e.g. one named by launch time), the resubmitted job fails to find the checkpoint.

A workaround is to set `weights_save_path` to an absolute location. I wonder if there is a better way to achieve this goal.
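For reference, the workaround can be sketched roughly as follows. This is only an illustration, not a confirmed recommendation: `stable_ckpt_dir`, the `CKPT_ROOT` environment variable, and the experiment name are hypothetical, and the commented `Trainer` call assumes a Lightning version that still accepts the `weights_save_path` argument mentioned above. The idea is to derive the directory from something that stays the same across resubmissions (an experiment name) rather than from the launch time.

```python
import os
from pathlib import Path


def stable_ckpt_dir(experiment_name, base=None):
    """Return an absolute, per-experiment checkpoint directory.

    Hypothetical helper: the path depends only on `experiment_name` and a
    base directory, never on the launch time, so a resubmitted job resolves
    the same location as the job that wrote the checkpoint.
    """
    # CKPT_ROOT is an assumed environment variable, not a Lightning setting.
    root = base or os.environ.get("CKPT_ROOT", "checkpoints")
    path = Path(root).expanduser().resolve() / experiment_name
    path.mkdir(parents=True, exist_ok=True)
    return path


# Assumed usage (requires pytorch_lightning, so it is left as a comment):
#
#   from pytorch_lightning import Trainer
#   trainer = Trainer(weights_save_path=str(stable_ckpt_dir("my_experiment")))
```

Giving each job its own experiment name keeps concurrent jobs from writing into the same directory, while a resubmitted job with the same name still finds its earlier checkpoint.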