Description
Hi all,
I'm encountering out-of-memory (OOM) errors when training the AutoencoderKL model (from the latent diffusion codebase) on 512×512 single-channel patches using 4 V100 GPUs (32 GB each).
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.00 GiB. GPU 2 has a total capacity of 31.73 GiB of which 2.18 GiB is free. Including non-PyTorch memory, this process has 29.54 GiB memory in use.
Hardware: 4 × NVIDIA V100 (32GB each)
Precision: I tried using mixed precision (setting precision: '16' in the YAML) but still encounter OOM errors.
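For reference, that YAML setting corresponds roughly to constructing the Lightning Trainer like this (simplified; exact argument names depend on the PyTorch Lightning version, and other Trainer options from my YAML are omitted):

```python
import pytorch_lightning as pl

# Roughly what my config resolves to (simplified sketch, not the exact
# codebase call; argument names vary across Lightning versions).
trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,
    strategy="ddp",
    precision=16,  # native AMP mixed precision
)
```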
Training Configuration: I’m using DDP with 4 GPUs and manual gradient accumulation (since automatic_optimization is set to False in my custom AutoencoderKL) with an accumulation interval of 4.
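My manual accumulation in training_step looks roughly like this (simplified sketch; compute_loss is a stand-in for the actual reconstruction/KL/GAN losses, and I've omitted the second optimizer for the discriminator):

```python
# Simplified sketch of my manual gradient accumulation with
# automatic_optimization = False; the real AutoencoderKL training_step
# uses the codebase's losses and a discriminator optimizer, omitted here.
def training_step(self, batch, batch_idx):
    opt = self.optimizers()
    accum = 4                          # accumulation interval

    loss = self.compute_loss(batch)    # placeholder for the actual loss
    self.manual_backward(loss / accum)

    # Step only every `accum` batches, accumulating gradients in between.
    if (batch_idx + 1) % accum == 0:
        opt.step()
        opt.zero_grad()
    return loss
```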
Model Details:
The YAML specifies 512×512 inputs, and during the forward pass I see a log message like:
making attention of type 'vanilla' with 512 in_channels
The latent representation z has shape (1, 4, 128, 128) (≈65,536 dimensions), but I'm wondering what is causing such high memory usage. Can someone help me understand, architecturally, why this is happening?
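If the vanilla attention block really does operate on the full 128×128 bottleneck feature map (my reading of the log line above), a quick back-of-the-envelope check suggests the attention matrix alone accounts for a large share of the allocation:

```python
# Back-of-the-envelope estimate, assuming vanilla (non-flash) self-attention
# over a 128x128 feature map at the bottleneck (my assumption from the log).
h = w = 128                    # bottleneck spatial resolution
tokens = h * w                 # 16,384 tokens
bytes_per_el = 4               # fp32; fp16/AMP would roughly halve this
attn_bytes = tokens * tokens * bytes_per_el
print(f"{attn_bytes / 2**30:.1f} GiB per sample for one attention map")
# -> 1.0 GiB, so a batch of 4 needs ~4 GiB for a single attention map,
#    which lines up with the "Tried to allocate 4.00 GiB" error above.
```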
Questions/Requests:
- Is there a recommended strategy for training this autoencoder on 512×512 patches without exceeding GPU memory (e.g., modifying the network architecture, using a different optimizer, or leveraging PyTorch's checkpointing utilities, as in the sketch after this list)?
- What GPU setup do most people use to train this effectively? Is it generally 80 GB A100s?
- Why, when I do get it to work, am I usually restricted to a batch_size of at most 4 and 0 workers?
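For context on the first bullet, this is the kind of activation checkpointing I had in mind (a minimal sketch assuming the encoder can be treated as a flat sequence of callable blocks, which is not exactly how the real Encoder is structured):

```python
from torch.utils.checkpoint import checkpoint

# Minimal sketch of activation checkpointing: recompute each block's
# activations during the backward pass instead of caching them, trading
# extra compute for lower memory. The flat `blocks` list is an assumption
# for illustration; the codebase's Encoder is nested differently.
def forward_with_checkpointing(blocks, x):
    h = x
    for block in blocks:
        h = checkpoint(block, h, use_reentrant=False)
    return h
```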
Any help or insights would be greatly appreciated. Thanks!