Using `val_dataloader` after training on multiple gpus seems to return the batches `gpu-count` times #15357

daMichaelB · 2022-10-27T11:31:20Z

daMichaelB
Oct 27, 2022

Hello everyone,

i trained a model on multiple GPU's with the following approach:

trainer = flash.Trainer(accelerator='gpu',  devices=4, strategy="ddp", ...)
trainer.finetune(self.model, datamodule=self.datamodule, ...)

The training on all 4 GPUs works perfect and uses almost 100% GPU.
After the training i want to compute the losses on the validation-set per sample and did it like

    preds = self.trainer.predict(model, datamodule.val_dataloader(), output=LogitsOutput(), ckpt_path="best")
    targets = torch.cat([b["target"] for b in datamodule.val_dataloader()])
    preds_tensor = Tensor(self.flat_list(preds))
    losses_tensor = loss_fn(preds_tensor, targets, reduction="none")

Problem

As long as i train on one GPU this works fine. However since using 4 GPUs i get

File "/usr/local/lib/python3.8/dist-packages/torch/nn/functional.py", line 3014, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
ValueError: Expected input batch_size (1698) to match target batch_size (6791).

It seems the val_dataloader now holds the dataset 4 times.
I think i am doing something completely wrong, but do not really find a solution. I am thankful for any kind of advice. Thank you

Dependencies

lightning-bolts 0.5.0
lightning-flash 0.7.5
pytorch-lightning 1.7.7

akihironitta · 2022-11-06T02:43:25Z

akihironitta
Nov 6, 2022

Hi @daMichaelB! Do you have a full script that reproduces the behaviour?

1 reply

daMichaelB Nov 9, 2022
Author

@akihironitta i will try to create a minimal script that shows the behaviour, Thanks!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Using `val_dataloader` after training on multiple gpus seems to return the batches `gpu-count` times #15357

Uh oh!

{{title}}

Uh oh!

Replies: 1 comment 1 reply

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Using val_dataloader after training on multiple gpus seems to return the batches gpu-count times #15357

Uh oh!

daMichaelB Oct 27, 2022

Problem

Dependencies

Replies: 1 comment · 1 reply

Uh oh!

akihironitta Nov 6, 2022

Uh oh!

daMichaelB Nov 9, 2022 Author

Using `val_dataloader` after training on multiple gpus seems to return the batches `gpu-count` times #15357

daMichaelB
Oct 27, 2022

Replies: 1 comment 1 reply

akihironitta
Nov 6, 2022

daMichaelB Nov 9, 2022
Author