CUDA OOM during validation of first epoch #10959
-
Hi all, my model validation code (see below) appears to leak memory, which leads to a rapid increase in GPU memory usage and, eventually, to an OOM error just before the validation loop completes (about 90% of the way through). CUDA memory usage hovers around 8-9 GB during training, then climbs rapidly to ca. 15+ GB during validation, hitting the memory limit of my GPU card. What am I doing wrong here?
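The original validation code is not shown here; as a minimal sketch (hypothetical model and metric names), a `validation_step` along these lines reproduces the pattern described in the reply below, where the batch is returned and therefore retained on the GPU:

```python
import torch
import torch.nn.functional as F
import pytorch_lightning as pl


class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(128, 10)  # hypothetical model

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = F.cross_entropy(self.net(x), y)
        self.log("train_loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        loss = F.cross_entropy(self.net(x), y)
        self.log("val_loss", loss)
        # Returning the batch keeps a reference to these CUDA tensors until
        # the end of the validation epoch, so GPU memory grows batch by batch.
        return batch

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
```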
Decreasing (or increasing) the validation batch size doesn't make the problem go away. Any thoughts?
Later edit: Skipping the validation loop entirely (one way to do this is sketched at the end of this post) gets rid of the OOM error (the trainer makes it past the 1st epoch). Also, I am running in mixed precision (although I suspect precision doesn't have much to do with this issue?). Thank you!
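For reference, since the original snippet is not shown, here is a minimal sketch of one assumed way to skip the validation loop in Lightning, using the `limit_val_batches` Trainer flag:

```python
import pytorch_lightning as pl

# Disabling validation entirely: with limit_val_batches=0 the Trainer runs
# no validation batches at all, which is one assumed way to "skip" the loop.
trainer = pl.Trainer(
    max_epochs=10,
    precision=16,         # mixed precision, as mentioned above
    limit_val_batches=0,  # run zero validation batches
)
```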
-
Dear @mishooax,
You are returning the batch from the validation_step, which means it is stored. As it is currently on the GPU, after X batches you would get an OOM.
Unless you need the batch on epoch end, I would recommend not returning anything from the validation_step.
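A minimal sketch of the suggested change (hypothetical model and metric names): log the scalar with `self.log` and return nothing, so no CUDA tensors are accumulated between validation batches.

```python
import torch
import torch.nn.functional as F
import pytorch_lightning as pl


class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(128, 10)  # hypothetical model

    def validation_step(self, batch, batch_idx):
        x, y = batch
        loss = F.cross_entropy(self.net(x), y)
        # Log the scalar and return nothing; Lightning then has no per-batch
        # outputs to accumulate on the GPU over the validation epoch.
        self.log("val_loss", loss, prog_bar=True)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
```

If per-batch outputs really are needed at epoch end, a common workaround is to return detached CPU copies instead, e.g. `return {"val_loss": loss.detach().cpu()}`, so only host memory grows rather than GPU memory.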