How to restart the training in the same version directory? #6594
Unanswered
GuillaumeRochette
asked this question in Lightning Trainer API: Trainer, LightningModule, LightningDataModule
Replies: 1 comment
- Currently not possible, AFAIK. Today, the resumed training will continue in a new directory.
0 replies
-
Hi,
Where I am working, there's a job scheduling system, with a wall time of 72 hours after which the job goes back into the queue and restarts when resources are available.
Some of my models take longer than one run to train, and therefore in the main training script I am globbing on the `experiment_dir` in order to find the latest checkpoint and resume from it using the `resume_from_checkpoint` argument of the `Trainer`. However, every time training resumes, the version number is incremented by one and a new directory is created, so that I end up with something like the following:
This makes it impractical to visualise results in TensorBoard, since a new "experiment" appears each time training restarts from a checkpoint.
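A sketch of the checkpoint-globbing setup described above (the directory layout, paths, and helper name are illustrative assumptions, not taken from this thread):

```python
import os
from glob import glob

def latest_checkpoint(experiment_dir):
    """Return the most recently modified checkpoint under experiment_dir,
    or None on a fresh start. Assumes the layout
    experiment_dir/version_*/checkpoints/*.ckpt."""
    ckpts = glob(os.path.join(experiment_dir, "version_*", "checkpoints", "*.ckpt"))
    return max(ckpts, key=os.path.getmtime) if ckpts else None

# The main training script would then do something like:
# trainer = Trainer(resume_from_checkpoint=latest_checkpoint("experiments/my_model"))
```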
If it is possible, I would like the training to restart in `version_0`, with the logs (images and loss curves) appended to the original `events.out.tfevents.xyz123`. Is it possible to do such a thing?
Thanks in advance,
Best regards,
Guillaume
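As a hedged aside (an assumption based on `TensorBoardLogger`'s `version` parameter, not something confirmed in this thread): passing an explicit version to the logger would make every resumed run write into the same `version_0` directory, and TensorBoard displays all event files within one directory as a single run, so the curves would continue rather than start a new "experiment". A minimal sketch of the resulting layout:

```python
import os

def pinned_log_dir(save_dir, name, version):
    """Directory a Lightning-style logger writes to when its version is pinned,
    mirroring the save_dir/name/version_<n> layout (an assumption for illustration)."""
    return os.path.join(save_dir, name, f"version_{version}")

# e.g. logger = TensorBoardLogger(save_dir="experiments", name="my_model", version=0)
#      trainer = Trainer(logger=logger, resume_from_checkpoint=...)
print(pinned_log_dir("experiments", "my_model", 0))
```

Note this keeps all runs under one directory rather than appending to the original `events.out.tfevents.*` file itself; TensorBoard merges the per-run event files when rendering.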