How to restart the training in the same version directory? #6594
Unanswered
GuillaumeRochette
asked this question in Lightning Trainer API: Trainer, LightningModule, LightningDataModule
Replies: 1 comment
- Currently not possible, AFAIK. Today, the resumed training will continue in a new directory.
0 replies
-
Hi,
Where I am working, there's a job scheduling system, with a wall time of 72 hours after which the job goes back into the queue and restarts when resources are available.
Some of my models take longer than one run to train, and therefore in the main training script I am globbing on the `experiment_dir` in order to find the latest checkpoint and resume from it using the `resume_from_checkpoint` argument of the `Trainer`. However, every time training resumes, the version number is incremented by one and a new directory is created, so that I end up with something like the following:
This makes it impractical to visualise results in TensorBoard, since a new "experiment" appears each time training restarts from a checkpoint.
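A sketch of the checkpoint-globbing setup described above (the directory layout, paths, and helper name are illustrative assumptions, not taken from this thread):

```python
import os
from glob import glob

def latest_checkpoint(experiment_dir):
    """Return the most recently modified checkpoint under experiment_dir,
    or None on a fresh start. Assumes the layout
    experiment_dir/version_*/checkpoints/*.ckpt."""
    ckpts = glob(os.path.join(experiment_dir, "version_*", "checkpoints", "*.ckpt"))
    return max(ckpts, key=os.path.getmtime) if ckpts else None

# The main training script would then do something like:
# trainer = Trainer(resume_from_checkpoint=latest_checkpoint("experiments/my_model"))
```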
If it is possible, I would like the training to restart in `version_0`, with the logs (images and loss curves) appended to the original `events.out.tfevents.xyz123`. Is it possible to do such a thing?
Thanks in advance,
Best regards,
Guillaume
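As a hedged aside (an assumption based on `TensorBoardLogger`'s `version` parameter, not something confirmed in this thread): passing an explicit version to the logger would make every resumed run write into the same `version_0` directory, and TensorBoard displays all event files within one directory as a single run, so the curves would continue rather than start a new "experiment". A minimal sketch of the resulting layout:

```python
import os

def pinned_log_dir(save_dir, name, version):
    """Directory a Lightning-style logger writes to when its version is pinned,
    mirroring the save_dir/name/version_<n> layout (an assumption for illustration)."""
    return os.path.join(save_dir, name, f"version_{version}")

# e.g. logger = TensorBoardLogger(save_dir="experiments", name="my_model", version=0)
#      trainer = Trainer(logger=logger, resume_from_checkpoint=...)
print(pinned_log_dir("experiments", "my_model", 0))
```

Note this keeps all runs under one directory rather than appending to the original `events.out.tfevents.*` file itself; TensorBoard merges the per-run event files when rendering.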