Description
Hi all,
I'm encountering out-of-memory (OOM) errors when training the AutoencoderKL model (from the latent diffusion codebase) on 512×512 single-channel patches using 4 V100 GPUs (32 GB each).
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.00 GiB. GPU 2 has a total capacity of 31.73 GiB of which 2.18 GiB is free. Including non-PyTorch memory, this process has 29.54 GiB memory in use.
Hardware: 4 × NVIDIA V100 (32GB each)
Precision: I tried using mixed precision (setting precision: '16' in the YAML) but still encounter OOM errors.
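For reference, that YAML setting corresponds roughly to constructing the Lightning Trainer like this (simplified; exact argument names depend on the PyTorch Lightning version, and other Trainer options from my YAML are omitted):

```python
import pytorch_lightning as pl

# Roughly what my config resolves to (simplified sketch, not the exact
# codebase call; argument names vary across Lightning versions).
trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,
    strategy="ddp",
    precision=16,  # native AMP mixed precision
)
```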
Training Configuration: I’m using DDP with 4 GPUs and manual gradient accumulation (since automatic_optimization is set to False in my custom AutoencoderKL) with an accumulation interval of 4.
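My manual accumulation in training_step looks roughly like this (simplified sketch; compute_loss is a stand-in for the actual reconstruction/KL/GAN losses, and I've omitted the second optimizer for the discriminator):

```python
# Simplified sketch of my manual gradient accumulation with
# automatic_optimization = False; the real AutoencoderKL training_step
# uses the codebase's losses and a discriminator optimizer, omitted here.
def training_step(self, batch, batch_idx):
    opt = self.optimizers()
    accum = 4                          # accumulation interval

    loss = self.compute_loss(batch)    # placeholder for the actual loss
    self.manual_backward(loss / accum)

    # Step only every `accum` batches, accumulating gradients in between.
    if (batch_idx + 1) % accum == 0:
        opt.step()
        opt.zero_grad()
    return loss
```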
Model Details:
The YAML specifies 512×512 inputs, and during the forward pass I see a log message like:
making attention of type 'vanilla' with 512 in_channels
The latent representation z has shape (1, 4, 128, 128) (≈65,536 dimensions), but I'm wondering what is causing such high memory usage. Can someone help me understand, architecturally, why this is happening?
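If the vanilla attention block really does operate on the full 128×128 bottleneck feature map (my reading of the log line above), a quick back-of-the-envelope check suggests the attention matrix alone accounts for a large share of the allocation:

```python
# Back-of-the-envelope estimate, assuming vanilla (non-flash) self-attention
# over a 128x128 feature map at the bottleneck (my assumption from the log).
h = w = 128                    # bottleneck spatial resolution
tokens = h * w                 # 16,384 tokens
bytes_per_el = 4               # fp32; fp16/AMP would roughly halve this
attn_bytes = tokens * tokens * bytes_per_el
print(f"{attn_bytes / 2**30:.1f} GiB per sample for one attention map")
# -> 1.0 GiB, so a batch of 4 needs ~4 GiB for a single attention map,
#    which lines up with the "Tried to allocate 4.00 GiB" error above.
```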
Questions/Requests:
- Is there a recommended strategy for training this autoencoder on 512×512 patches without exceeding GPU memory (e.g., modifying the network architecture, using a different optimizer, or leveraging PyTorch's checkpointing utilities, as in the sketch after this list)?
- What GPU setup do most people use to train this effectively? Is it generally 80 GB A100s?
- Why, when I do get it to work, am I usually restricted to a batch_size of at most 4 and 0 workers?
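For context on the first bullet, this is the kind of activation checkpointing I had in mind (a minimal sketch assuming the encoder can be treated as a flat sequence of callable blocks, which is not exactly how the real Encoder is structured):

```python
from torch.utils.checkpoint import checkpoint

# Minimal sketch of activation checkpointing: recompute each block's
# activations during the backward pass instead of caching them, trading
# extra compute for lower memory. The flat `blocks` list is an assumption
# for illustration; the codebase's Encoder is nested differently.
def forward_with_checkpointing(blocks, x):
    h = x
    for block in blocks:
        h = checkpoint(block, h, use_reentrant=False)
    return h
```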
Any help or insights would be greatly appreciated. Thanks!