OOM When Training AutoencoderKL with 512x512 Inputs on 4x32GB V100s #393

@joestitty

Hi all,

I'm encountering out-of-memory (OOM) errors when training the AutoencoderKL model (from the latent diffusion codebase) on 512×512 single‐channel patches using 4 V100 GPUs (each 32GB).

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.00 GiB. GPU 2 has a total capacity of 31.73 GiB of which 2.18 GiB is free. Including non-PyTorch memory, this process has 29.54 GiB memory in use.

Hardware: 4 × NVIDIA V100 (32GB each)
Precision: I tried using mixed precision (setting precision: '16' in the YAML) but still encounter OOM errors.
Training Configuration: I’m using DDP with 4 GPUs and manual gradient accumulation (since automatic_optimization is set to False in my custom AutoencoderKL) with an accumulation interval of 4.
Model Details:
The YAML specifies 512×512 inputs, and during the forward pass I see a log message like:
making attention of type 'vanilla' with 512 in_channels
The latent representation z has shape (1, 4, 128, 128), i.e. 65,536 dimensions. What is causing such high memory usage? Can someone help me understand architecturally why this is happening?
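
One possible contributor, offered as a rough estimate rather than a diagnosis: a 'vanilla' self-attention layer builds a score matrix with one entry per pair of spatial positions. Assuming the attention block runs on a 128×128 feature map (the latent size; this is an assumption, not something confirmed above), one sample yields a 16,384 × 16,384 matrix. The helper `attention_scores_bytes` below is hypothetical and only does this arithmetic.

```python
def attention_scores_bytes(batch, height, width, bytes_per_element=4):
    """Size in bytes of a (batch, H*W, H*W) attention score matrix."""
    tokens = height * width  # one token per spatial position
    return batch * tokens * tokens * bytes_per_element


GIB = 2 ** 30

# Assumption: attention is applied on a 128x128 feature map (the latent size).
per_sample_fp32 = attention_scores_bytes(1, 128, 128, bytes_per_element=4)
per_sample_fp16 = attention_scores_bytes(1, 128, 128, bytes_per_element=2)

print(per_sample_fp32 / GIB)  # 1.0
print(per_sample_fp16 / GIB)  # 0.5
```

Under these assumptions, a batch of 4 in fp32 would need 4 GiB for a single score matrix, the same size as the failed allocation in the error above, although the batch size at the time of failure is not stated. The softmax output and the tensors kept for the backward pass would add to that figure.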

Questions/Requests:

  • Is there any recommended strategy for training this autoencoder with 512×512 patches without exceeding GPU memory (e.g., modifications to network architecture, using a different optimizer, or leveraging PyTorch’s checkpointing utilities)?
  • What GPU setup do most people use to get this working well? Is it generally 80GB A100s?
  • Why, when I do get it to work, am I usually restricted to a maximum batch_size of about 4 with 0 workers?
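
For the batch-size question, it may help to separate the micro-batch each GPU holds in memory from the effective batch that contributes to one optimizer step. A minimal sketch, assuming a per-GPU batch of 4 (the maximum mentioned above), 4 DDP processes, and the accumulation interval of 4; `effective_batch_size` is a hypothetical helper, not part of the codebase.

```python
def effective_batch_size(per_gpu_batch, num_gpus, accumulation_steps):
    """Samples contributing to one optimizer step under DDP plus accumulation."""
    return per_gpu_batch * num_gpus * accumulation_steps


# Assumed values: batch_size 4 per GPU, 4 GPUs, accumulation interval 4.
print(effective_batch_size(4, 4, 4))  # 64
```

Peak activation memory on each GPU depends on the per-GPU micro-batch, not on the effective batch. DDP keeps a full copy of the model on every GPU rather than splitting it, so adding GPUs or accumulation steps raises the effective batch without lowering per-GPU memory.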

Any help or insights would be greatly appreciated. Thanks!
