Why would GPU memory always surge after training and cause a CUDA memory error? #9048
-
I use PyTorch Lightning to train a model, but it always fails strangely at the end: after all validations complete, the trainer starts an epoch beyond max_epochs, and a GPU memory allocation failure (CUDA out of memory) occurs right after this extra epoch (which should not run) starts. In my example I set max_epochs=5, so there should only be epochs 0-4, but there is always an additional epoch 5 after the 5 validations are done, and a few seconds later the CUDA memory error occurs. My dataset should be fine, since CUDA memory and system memory are stable throughout training; the only anomaly is the GPU memory surge at the very end. I suspect the problem may be in my code for the LightningModule and the training loop.
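To make the setup concrete, here is a simplified stand-in for that code (a hypothetical toy model and random tensors, not my actual notebook), showing the same Trainer(max_epochs=5) pattern:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class LitClassifier(pl.LightningModule):
    """Hypothetical toy model standing in for the real one."""

    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = nn.functional.cross_entropy(self(x), y)
        self.log("train_loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        self.log("val_loss", nn.functional.cross_entropy(self(x), y))

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


# Random tensors in place of the real dataset.
train_ds = TensorDataset(torch.randn(1000, 32), torch.randint(0, 2, (1000,)))
val_ds = TensorDataset(torch.randn(200, 32), torch.randint(0, 2, (200,)))

# max_epochs=5, so only epochs 0-4 are expected before fit() returns.
trainer = pl.Trainer(max_epochs=5, gpus=1)
trainer.fit(
    LitClassifier(),
    DataLoader(train_ds, batch_size=64),
    DataLoader(val_ds, batch_size=64),
)
```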
Can I get any clue about why this happens and how to avoid it? I'm new to PyTorch Lightning, so there might be problems I'm not aware of. Thanks a lot!
-
Dear @EMUNES, Would you mind sharing your notebook? This would make the investigation much simpler. Best,
-
I am also facing a similar issue. I have an 8 GB GPU and 50 epochs; up to epoch 49 my GPU usage is 4 GB, but after epoch 49 I get a memory error and my GPU memory reaches 8 GB.
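For what it's worth, one way to pin down exactly which epoch the jump happens in is a small callback that prints GPU memory after each training epoch (a rough sketch; hook signatures vary a bit across PL versions, and GPUMemoryLogger is just an illustrative name):

```python
import torch
import pytorch_lightning as pl


class GPUMemoryLogger(pl.Callback):
    """Print allocated and peak GPU memory at the end of each training epoch."""

    def on_train_epoch_end(self, trainer, pl_module):
        allocated = torch.cuda.memory_allocated() / 1024 ** 3
        peak = torch.cuda.max_memory_allocated() / 1024 ** 3
        print(f"epoch {trainer.current_epoch}: "
              f"allocated {allocated:.2f} GiB, peak {peak:.2f} GiB")


# Usage: Trainer(max_epochs=50, gpus=1, callbacks=[GPUMemoryLogger()])
```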
-
The same thing happened to me. I thought downgrading the PL version resolved the issue, but it didn't work in another notebook.
On Wed, Sep 8, 2021, 3:18 PM EMUNES wrote:
Indeed this problem is more complicated than I thought... Today one of my notebooks is normal, but the next version throws the same error. The only difference between those two notebooks is that I reduced the training samples from 3000 to 1000 (I use small samples just to test whether the pipeline works). Now I'm totally confused again...
-
Let's continue discussing in #9441. Locking this thread to avoid discussing in two places.