Different batch sizes and/or number of GPUs results in different test metrics #6859
Unanswered
carsonmclean asked this question in DDP / multi-GPU / multi-node
Replies: 0 comments
I have noticed that when running trainer.test(), my experiment produces different test metrics for different batch sizes and/or numbers of GPUs. I am fairly certain that the same input sample produces the exact same logit output from the model regardless of batch size or number of GPUs, so the only remaining code in the pipeline is the evaluation and metrics. That code is as follows:

The metrics printout at the end of the test epoch looks as follows:

I believe what is happening is that the displayed metrics are only being calculated on the final batch of one of the GPUs, rather than across all batches on all GPUs in the test epoch. My understanding from the LightningModule documentation is that calling .log() "automatically reduces the requested metrics across the full epoch". I have tried setting both on_epoch and sync_dist to True, but I still see inconsistencies. This is being run on a single machine with 4 GPUs and the DDP accelerator.

What is the simplest (and proper) way of calculating consistent test metrics for any batch size and number of GPUs?
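For reference, here is a minimal sketch of the kind of test_step logging pattern I am describing. The model, metric name, and number of classes are placeholders rather than my actual code, and the Accuracy(task=...) signature assumes torchmetrics >= 0.11:

```python
import torch.nn.functional as F
import torchmetrics
import pytorch_lightning as pl


class LitClassifier(pl.LightningModule):
    def __init__(self, model, num_classes=10):
        super().__init__()
        self.model = model
        # A torchmetrics metric keeps per-rank state and synchronizes it
        # across GPUs when the value is computed at epoch end.
        self.test_acc = torchmetrics.Accuracy(task="multiclass", num_classes=num_classes)

    def test_step(self, batch, batch_idx):
        x, y = batch
        logits = self.model(x)
        loss = F.cross_entropy(logits, y)

        # Accumulate this batch into the metric; the state is reduced over
        # all batches and all ranks at the end of the test epoch.
        self.test_acc.update(logits, y)

        # Logging the Metric object lets Lightning call compute()/reset()
        # at the right time; on_epoch=True reduces the scalar loss over the
        # epoch, and sync_dist=True also averages it across GPUs.
        self.log("test_acc", self.test_acc, on_epoch=True)
        self.log("test_loss", loss, on_epoch=True, sync_dist=True)
        return loss
```

As far as I understand, the torchmetrics route is the one usually recommended under DDP, because the metric accumulates its state per sample rather than averaging per-batch values, but I would like to confirm whether this is the intended approach.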