Huge memory leak and execution stuck after on_train_epoch_start but before training_step #15154
Unanswered
malfonsoarquimea asked this question in Lightning Trainer API: Trainer, LightningModule, LightningDataModule
Replies: 1 comment
@rohitgr7 can you provide some insight?
This is a follow-up to issue #12522.
I am facing an issue where, when using more than one GPU with the DDP strategy, the code gets stuck and starts consuming huge amounts of RAM after the execution of on_train_epoch_start but before training_step. I managed to track the memory leak to
https://github.com/Lightning-AI/lightning/blob/dbb5ca8d436a917fc9c2bdd6e9c00e4fd6187735/src/pytorch_lightning/loops/epoch/training_epoch_loop.py#L147
and then to
https://github.com/Lightning-AI/lightning/blob/dbb5ca8d436a917fc9c2bdd6e9c00e4fd6187735/src/pytorch_lightning/utilities/fetching.py#L179
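As far as I can tell, the fetcher at that line pulls the next batch from the DataLoader iterator ahead of training_step, which matches where my run hangs. A simplified sketch of that prefetching pattern (not Lightning's actual implementation, just to illustrate where the iterator gets advanced before training_step ever runs):

```python
# Simplified sketch of a prefetching data fetcher (hypothetical, not the real
# pytorch_lightning.utilities.fetching code). It advances the DataLoader iterator
# one batch ahead of the training loop, so any memory tied to iterating the
# Dataset is already allocated before training_step is called.
class PrefetchingFetcher:
    def __init__(self, dataloader, prefetch_batches=1):
        self.dataloader = dataloader
        self.prefetch_batches = prefetch_batches
        self._buffer = []
        self._iterator = None

    def __iter__(self):
        # Creating the iterator is also where DataLoader workers are spawned.
        self._iterator = iter(self.dataloader)
        for _ in range(self.prefetch_batches):
            try:
                self._buffer.append(next(self._iterator))
            except StopIteration:
                break
        return self

    def __next__(self):
        if not self._buffer:
            raise StopIteration
        batch = self._buffer.pop(0)
        try:
            # Prefetch the following batch before handing the current one back.
            self._buffer.append(next(self._iterator))
        except StopIteration:
            pass
        return batch
```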
I am using PyTorch Lightning version 1.7.7.
Regarding RAM consumption: when using one GPU, this training consumes a certain baseline amount of RAM (due to the data being preloaded in the Dataset). With 2 GPUs and the DDP strategy, running in a Docker container, the consumption is far higher than that baseline.
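For context, my Dataset looks roughly like the sketch below (load_sample and the file list are illustrative placeholders). Under DDP, Lightning starts one process per GPU and each process builds its own Dataset instance, so everything preloaded in __init__ ends up resident once per GPU:

```python
from torch.utils.data import Dataset

class PreloadedDataset(Dataset):
    """Illustrative: all samples are read into host RAM up front."""

    def __init__(self, file_paths):
        # load_sample is a hypothetical helper; the point is that every sample
        # stays in memory for the lifetime of the Dataset instance.
        self.samples = [load_sample(path) for path in file_paths]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]
```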
EDIT: After using a distributed sampler as indicated in #15164, the memory consumption is down by 40% or so.
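For reference, the change amounts to something like the sketch below (batch size and worker count are illustrative). If I understand correctly, Lightning 1.7.x would normally inject this sampler itself when the Trainer's replace_sampler_ddp argument is left at its default of True:

```python
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# `dataset` is the preloading Dataset instance described above (illustrative).
# With the DDP process group already initialised, DistributedSampler picks up
# the rank and world size automatically, so each rank only iterates its shard.
sampler = DistributedSampler(dataset, shuffle=True)

train_loader = DataLoader(
    dataset,
    batch_size=32,    # illustrative value
    sampler=sampler,  # do not also pass shuffle=True when a sampler is given
    num_workers=4,    # illustrative value
)
```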
I would expect the second case to consume roughly twice the RAM of the first one, as the preloading happens once per GPU and there are 2 GPUs instead of one.
Also, I noticed that this problem does not occur during the sanity check steps.
Sadly, I cannot provide a code example to reproduce the issue, but I can provide any information about my system or about what my code does.
Any help on how to solve this would be greatly appreciated. Thanks in advance and have a nice day!