Huge memory leak and execution stuck after on_train_epoch_start but before training_step #15154
Unanswered
malfonsoarquimea asked this question in Lightning Trainer API: Trainer, LightningModule, LightningDataModule
Replies: 1 comment
@rohitgr7 can you provide some insight?
This is a follow-up to issue #12522.
I am facing an issue where, when using more than one GPU with the DDP strategy, the code gets stuck and starts consuming huge amounts of RAM after the execution of on_train_epoch_start but before training_step. I managed to track the memory leak to
https://github.com/Lightning-AI/lightning/blob/dbb5ca8d436a917fc9c2bdd6e9c00e4fd6187735/src/pytorch_lightning/loops/epoch/training_epoch_loop.py#L147
and then to
https://github.com/Lightning-AI/lightning/blob/dbb5ca8d436a917fc9c2bdd6e9c00e4fd6187735/src/pytorch_lightning/utilities/fetching.py#L179
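As far as I can tell, the fetcher at that line pulls the next batch from the DataLoader iterator ahead of training_step, which matches where my run hangs. A simplified sketch of that prefetching pattern (not Lightning's actual implementation, just to illustrate where the iterator gets advanced before training_step ever runs):

```python
# Simplified sketch of a prefetching data fetcher (hypothetical, not the real
# pytorch_lightning.utilities.fetching code). It advances the DataLoader iterator
# one batch ahead of the training loop, so any memory tied to iterating the
# Dataset is already allocated before training_step is called.
class PrefetchingFetcher:
    def __init__(self, dataloader, prefetch_batches=1):
        self.dataloader = dataloader
        self.prefetch_batches = prefetch_batches
        self._buffer = []
        self._iterator = None

    def __iter__(self):
        # Creating the iterator is also where DataLoader workers are spawned.
        self._iterator = iter(self.dataloader)
        for _ in range(self.prefetch_batches):
            try:
                self._buffer.append(next(self._iterator))
            except StopIteration:
                break
        return self

    def __next__(self):
        if not self._buffer:
            raise StopIteration
        batch = self._buffer.pop(0)
        try:
            # Prefetch the following batch before handing the current one back.
            self._buffer.append(next(self._iterator))
        except StopIteration:
            pass
        return batch
```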
I am using PyTorch Lightning version 1.7.7.
Regarding RAM consumption: when using one GPU, this training consumes a certain baseline amount of RAM (due to the data being preloaded in the Dataset). With 2 GPUs and the DDP strategy, running in a Docker container, the consumption is far higher than that baseline.
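For context, my Dataset looks roughly like the sketch below (load_sample and the file list are illustrative placeholders). Under DDP, Lightning starts one process per GPU and each process builds its own Dataset instance, so everything preloaded in __init__ ends up resident once per GPU:

```python
from torch.utils.data import Dataset

class PreloadedDataset(Dataset):
    """Illustrative: all samples are read into host RAM up front."""

    def __init__(self, file_paths):
        # load_sample is a hypothetical helper; the point is that every sample
        # stays in memory for the lifetime of the Dataset instance.
        self.samples = [load_sample(path) for path in file_paths]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]
```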
EDIT: After using a distributed sampler as indicated in #15164, the memory consumption is down by 40% or so.
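For reference, the change amounts to something like the sketch below (batch size and worker count are illustrative). If I understand correctly, Lightning 1.7.x would normally inject this sampler itself when the Trainer's replace_sampler_ddp argument is left at its default of True:

```python
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# `dataset` is the preloading Dataset instance described above (illustrative).
# With the DDP process group already initialised, DistributedSampler picks up
# the rank and world size automatically, so each rank only iterates its shard.
sampler = DistributedSampler(dataset, shuffle=True)

train_loader = DataLoader(
    dataset,
    batch_size=32,    # illustrative value
    sampler=sampler,  # do not also pass shuffle=True when a sampler is given
    num_workers=4,    # illustrative value
)
```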
I would expect the second case to consume roughly twice the RAM of the first one, as the preloading happens once per GPU and there are 2 GPUs instead of one.
Also, I noticed that this problem does not occur during the sanity check steps.
Sadly, I cannot provide a code example to reproduce the issue, but I can provide any information about my system or about what my code does.
Any help on how to solve this would be greatly appreciated. Thanks in advance and have a nice day!