predict with multiple GPUs doesn't aggregate the predictions even with on_predict_end or on_predict_batch_end #7852
-
I train a model with 2 GPUs. To see the issue, simply create a model like this:

```python
class Model(pl.LightningModule):
    ...
    def predict_step(self, batch, batch_idx, dataloader_idx=None):
        y = self(batch)
        return {"predict": y}
```

then run it with any multi-GPU trainer:

```python
m = Model(...)
trainer = pl.Trainer(gpus=2, accelerator="ddp")
predictions = trainer.predict(model=m, datamodule=dm)
```

but we get two separate sets of predictions (one per process) instead of a single aggregated result, even with `on_predict_end` or `on_predict_batch_end`.
-
You can either sync them by writing them both to disk (we have a `PredictionWriter` for that), or you can use `self.all_gather` inside your `predict_step` to sync across GPUs (note that this can lead to GPU OOM!).
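
A minimal sketch of the disk-writing option, assuming the callback meant here is `BasePredictionWriter` from `pytorch_lightning.callbacks`; the output directory and file naming are illustrative:

```python
import os
import torch
from pytorch_lightning.callbacks import BasePredictionWriter


class CustomWriter(BasePredictionWriter):
    def __init__(self, output_dir, write_interval="epoch"):
        super().__init__(write_interval)
        self.output_dir = output_dir

    def write_on_epoch_end(self, trainer, pl_module, predictions, batch_indices):
        # each rank writes its own shard of predictions to disk;
        # merge the per-rank files afterwards (e.g. on rank 0 or in a separate script)
        torch.save(
            predictions,
            os.path.join(self.output_dir, f"predictions_rank_{trainer.global_rank}.pt"),
        )
```

Pass it to the trainer, e.g. `pl.Trainer(gpus=2, accelerator="ddp", callbacks=[CustomWriter("/some/dir")])`, and load the per-rank files once prediction has finished.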
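
And a minimal sketch of the `self.all_gather` option inside `predict_step`, keeping the `{"predict": ...}` return format from the snippet above (gathering keeps every rank's predictions in GPU memory, hence the OOM warning):

```python
def predict_step(self, batch, batch_idx, dataloader_idx=None):
    y = self(batch)
    # gather this batch's predictions from every process;
    # the result gains a leading dimension of size world_size
    gathered = self.all_gather(y)
    return {"predict": gathered}
```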