How to checkpoint on validation metrics in PyTorch Lightning? #7123
Unanswered
noamzilo
asked this question in
Lightning Trainer API: Trainer, LightningModule, LightningDataModule
Replies: 1 comment
Hi! I am not able to reproduce the issue. This is what I tried:

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint
# BoringModel is Lightning's minimal test model; in recent releases it is
# importable from pytorch_lightning.demos.boring_classes
from pytorch_lightning.demos.boring_classes import BoringModel


def test_bug(tmpdir):
    class TestModel(BoringModel):
        def training_step(self, batch, batch_idx):
            self.log("train_batch_idx", -batch_idx, on_step=False, on_epoch=True)
            return super().training_step(batch, batch_idx)

        def validation_step(self, batch, batch_idx):
            self.log("val_batch_idx", -batch_idx, on_step=False, on_epoch=True)
            return super().validation_step(batch, batch_idx)

    model = TestModel()
    mc = ModelCheckpoint(dirpath=tmpdir, monitor="val_batch_idx", save_top_k=3)
    trainer = Trainer(default_root_dir=tmpdir, progress_bar_refresh_rate=0, max_epochs=3, callbacks=[mc])
    trainer.fit(model)
    print(mc.best_k_models)
```

What Lightning version are you using? Can you provide a minimal reproducible example?
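For reference, two things make the snippet above work: the key passed to `monitor` must match the key given to `self.log` exactly, and the metric must be logged with `on_epoch=True` so a value exists when checkpoints are evaluated at the end of the validation epoch. Below is a minimal self-contained sketch of the same pattern without BoringModel; the model, metric names, and hyperparameters are illustrative, not from this discussion:

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
from pytorch_lightning import LightningModule, Trainer
from pytorch_lightning.callbacks import ModelCheckpoint


class LitModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = F.cross_entropy(self.layer(x), y)
        self.log("train_loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        loss = F.cross_entropy(self.layer(x), y)
        # on_epoch=True aggregates the metric across the epoch and makes
        # "val_loss" visible to ModelCheckpoint
        self.log("val_loss", loss, on_step=False, on_epoch=True)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


def make_loader():
    x = torch.randn(64, 32)
    y = torch.randint(0, 2, (64,))
    return DataLoader(TensorDataset(x, y), batch_size=16)


# monitor must name the logged key exactly
mc = ModelCheckpoint(monitor="val_loss", mode="min", save_top_k=3)
trainer = Trainer(max_epochs=3, callbacks=[mc])
trainer.fit(LitModel(), make_loader(), make_loader())
```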
Stack Overflow mirror:

I log metrics for both training and validation through a shared helper, __common_epoch_end_report. I verified that __common_epoch_end_report is indeed entered both with mode='train' and with mode='validation'. However, only the metrics logged from train are available for checkpointing, and ModelCheckpoint raises an error when asked to monitor a validation metric. How can I enable checkpointing on validation metrics in PyTorch Lightning?
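In Lightning 1.x (the API current at the time of this discussion; the `*_epoch_end` hooks were removed in 2.0), a shared epoch-end reporter works as long as `self.log` is called from a hook on the LightningModule and the monitored key matches the logged key exactly. Here is a hedged reconstruction of such a layout, since the original fragments are not shown; `LitModel`, `_step`, and the metric names are illustrative:

```python
import torch
import torch.nn.functional as F
from pytorch_lightning import LightningModule


class LitModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def _step(self, batch):
        x, y = batch
        return {"loss": F.cross_entropy(self.layer(x), y)}

    def training_step(self, batch, batch_idx):
        return self._step(batch)

    def validation_step(self, batch, batch_idx):
        return self._step(batch)

    def __common_epoch_end_report(self, outputs, mode):
        # self.log must run inside a LightningModule hook for the metric to
        # reach ModelCheckpoint; the mode prefix keeps train and validation
        # keys distinct.
        avg_loss = torch.stack([o["loss"] for o in outputs]).mean()
        self.log(f"{mode}_loss", avg_loss)

    def training_epoch_end(self, outputs):
        self.__common_epoch_end_report(outputs, mode="train")

    def validation_epoch_end(self, outputs):
        self.__common_epoch_end_report(outputs, mode="validation")

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)
```

With this layout, ModelCheckpoint(monitor="validation_loss", mode="min") should find the metric; a monitored key that does not match any logged key is a typical cause of the error described above.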