AMP training overflow #3249
Unanswered
cristinagrs asked this question in Q&A
Replies: 1 comment, 6 replies
-
Hi @cristinagrs, thanks for your interest here.
-
I am training a UNet segmentation model on CT scans, and I tried to use mixed precision as in: https://pytorch.org/docs/stable/notes/amp_examples.html#id2
However, it causes NaN loss and an overflow from which training never recovers.
If I use native AMP in PyTorch Lightning, the same non-recovering overflow occurs.
On the other hand, using apex through PyTorch Lightning solves the issue.
My question is: has anyone encountered this problem, and is there any way to solve it without apex?
I am using PyTorch 1.10.0, Lightning 1.4.9, and CUDA 11.4.
Thank you!
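For reference, a minimal sketch of the `GradScaler` loop from the linked AMP docs, with two common mitigations for immediate, non-recovering overflow: a smaller `init_scale` (the default is 2**16) and gradient clipping after unscaling. The model, data, and hyperparameters here are placeholders, not from the original post; on a CPU-only machine the flags below make it a plain fp32 loop.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"

# Placeholder model and loss standing in for the UNet from the post.
model = torch.nn.Sequential(
    torch.nn.Linear(8, 8), torch.nn.ReLU(), torch.nn.Linear(8, 1)
).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = torch.nn.MSELoss()

# Starting from a smaller init_scale skips the first few overflowing steps.
scaler = torch.cuda.amp.GradScaler(init_scale=2**10, enabled=use_amp)

for step in range(5):
    # Synthetic batch in place of real CT data.
    x = torch.randn(16, 8, device=device)
    y = torch.randn(16, 1, device=device)
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type=device, enabled=use_amp):
        loss = criterion(model(x), y)
    scaler.scale(loss).backward()
    # Unscale first so gradients can be clipped in fp32; clipping often
    # tames the spikes that trigger non-recovering overflows.
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)
    scaler.update()
```

If the scale still collapses, it is also worth checking the data pipeline: un-normalized CT intensities (Hounsfield units in the thousands) can push fp16 activations toward its ~65504 limit.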