How to set checkpoints so they are saved in the automatically generated version_N directories?
#6821
If the TensorBoard logger is set up as shown,

```python
logger = TensorBoardLogger(name="MyModel")
checkpoint_callback = ModelCheckpoint(
    filename="{epoch}-{step}-{val_loss:.2f}",
    monitor="val_loss",
    save_top_k=5,
)
trainer = pl.Trainer(
    default_root_dir=ROOT_DIR,
    callbacks=[checkpoint_callback],
    logger=[logger],
)
```

how do we configure the checkpoints to be written to the automatically named version_N directories, as they are when the Trainer is built without an explicit logger?

```python
trainer = pl.Trainer(
    default_root_dir=ROOT_DIR,
    callbacks=[checkpoint_callback],
)
```

If we pass in a logger to the Trainer, the checkpoints end up in a different location from the tensorboard logs, and if we do not pass in a logger to the Trainer, the layout is different again. How can both checkpoints and tensorboard files be written to the same version_N directory?
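For reference, a minimal sketch of where each side resolves its output by default. This assumes PyTorch Lightning 1.x (current at the time of this thread) and an illustrative `ROOT_DIR`; the paths shown are the logger defaults, not something stated in the question.

```python
import pytorch_lightning as pl
from pytorch_lightning.loggers import TensorBoardLogger

ROOT_DIR = "here"  # illustrative path, not from the question

# An explicit logger resolves its own versioned directory under <save_dir>/<name>/
logger = TensorBoardLogger(save_dir=ROOT_DIR, name="MyModel")
print(logger.log_dir)  # e.g. here/MyModel/version_0

# With no explicit logger, the Trainer builds a default TensorBoardLogger
# under <default_root_dir>/lightning_logs/
trainer = pl.Trainer(default_root_dir=ROOT_DIR)
print(trainer.logger.log_dir)  # e.g. here/lightning_logs/version_0
```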
Hi
For this you need to set the `default_root_dir` in the Trainer and set the `save_dir` of the Logger to the same path. This works for me (latest PL version):

```python
from argparse import ArgumentParser

import torch
from torch.nn import functional as F

import pytorch_lightning as pl
from pl_examples.basic_examples.mnist_datamodule import MNISTDataModule
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.loggers import TensorBoardLogger


class LitClassifier(pl.LightningModule):
    def __init__(self, hidden_dim=128, learning_rate=1e-3):
        super().__init__()
        self.save_hyperparameters()
        self.l1 = torch.nn.Linear(28 * 28, self.hparams.hidden_dim)
        self.l2 = torch.nn.Linear(self.hparams.hidden_dim, 10)

    def forward(self, x):
        x = x.view(x.size(0), -1)
        x = torch.relu(self.l1(x))
        x = torch.relu(self.l2(x))
        return x

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = F.cross_entropy(y_hat, y)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = F.cross_entropy(y_hat, y)
        self.log('valid_loss', loss)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.hparams.learning_rate)

    @staticmethod
    def add_model_specific_args(parent_parser):
        parser = parent_parser.add_argument_group("LitClassifier")
        parser.add_argument('--hidden_dim', type=int, default=128)
        parser.add_argument('--learning_rate', type=float, default=0.0001)
        return parent_parser


def cli_main():
    pl.seed_everything(1234)

    parser = ArgumentParser()
    parser = pl.Trainer.add_argparse_args(parser)
    parser = LitClassifier.add_model_specific_args(parser)
    parser = MNISTDataModule.add_argparse_args(parser)
    args = parser.parse_args()

    dm = MNISTDataModule.from_argparse_args(args, num_workers=2)
    model = LitClassifier(args.hidden_dim, args.learning_rate)

    # Logger and Trainer share the same root directory, so the checkpoint
    # callback (no dirpath given) saves into the logger's versioned folder,
    # i.e. here/MyModel/version_N/checkpoints/.
    ROOT_DIR = "here"
    mylogger = TensorBoardLogger(name="MyModel", save_dir=ROOT_DIR)
    ckpt_callback = ModelCheckpoint(monitor="valid_loss", filename="{epoch}-{step}-{valid_loss:.2f}")
    trainer = pl.Trainer.from_argparse_args(
        args,
        default_root_dir=ROOT_DIR,
        logger=mylogger,
        callbacks=[ckpt_callback],
        limit_train_batches=2,
        limit_val_batches=2,
    )
    trainer.fit(model, datamodule=dm)


if __name__ == '__main__':
    cli_main()
```
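As a small variant not taken from the thread: instead of relying on the shared root directory, you can point the checkpoint callback at the logger's versioned folder explicitly via `ModelCheckpoint(dirpath=...)` and the logger's `log_dir` property. The paths and names below are illustrative.

```python
import os

import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.loggers import TensorBoardLogger

ROOT_DIR = "here"  # illustrative path
logger = TensorBoardLogger(save_dir=ROOT_DIR, name="MyModel")

# logger.log_dir resolves to <ROOT_DIR>/MyModel/version_N, so the checkpoints
# land in the same version folder as the tensorboard event files.
ckpt_callback = ModelCheckpoint(
    dirpath=os.path.join(logger.log_dir, "checkpoints"),
    monitor="valid_loss",
    filename="{epoch}-{step}-{valid_loss:.2f}",
)

trainer = pl.Trainer(default_root_dir=ROOT_DIR, logger=logger, callbacks=[ckpt_callback])
# then trainer.fit(model, datamodule=dm) as in the example above
```

Either way the idea is the same as in the reply above: keep the logger and the checkpoint callback agreeing on the version_N directory.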