Numerical instability in mixed precision (FP16) when training with DDP #19790
Unanswered
WayenVan asked this question in DDP / multi-GPU / multi-node
Hi, while doing research I found that training with FP16 leads to a degradation in accuracy, which does not happen when I use my own DDP code with native PyTorch.
An interesting finding: when I run inference with the trained model (trained with FP16), it outputs NaN if the inference trainer is set to FP32, while FP16 inference still works. I don't know whether this is caused by Lightning itself, or maybe by a bad GPU?
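FP16's narrow dynamic range is a common cause of exactly this kind of degradation and NaN behavior. The sketch below (NumPy, not part of the original post) illustrates the three failure modes: overflow to inf, underflow to zero, and inf turning into NaN under a reduction:

```python
import numpy as np

# FP16's largest finite value is 65504; anything bigger overflows to inf.
overflow = np.float16(70000.0)
print(overflow)  # inf

# Values below roughly 6e-8 underflow to zero, silently losing signal
# (e.g. small gradients or activations).
underflow = np.float16(1e-8)
print(underflow)  # 0.0

# Once an inf appears, common operations (e.g. inf - inf in a reduction)
# produce NaN, which then propagates through the network.
nan_result = overflow - overflow
print(nan_result)  # nan
```

A model whose weights drifted into this regime during FP16 training can behave differently the moment it is evaluated in FP32, which may explain why FP32 inference NaNs while FP16 inference does not.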
Here is a short snippet of my trainer configuration.
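The snippet itself did not survive the page load. As a stand-in, here is a hypothetical sketch of what a Lightning `Trainer` configured for FP16 mixed precision with DDP typically looks like; all concrete values are assumptions, not the author's actual settings:

```python
# Hypothetical reconstruction -- the original snippet failed to load.
# Device count and any other concrete values are placeholders.
trainer_kwargs = dict(
    accelerator="gpu",
    devices=2,              # assumed multi-GPU setup
    strategy="ddp",         # DistributedDataParallel, as in the title
    precision="16-mixed",   # FP16 automatic mixed precision
)

# With pytorch_lightning installed this would be used as:
#   import pytorch_lightning as pl
#   trainer = pl.Trainer(**trainer_kwargs)
print(trainer_kwargs["precision"])  # 16-mixed
```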
I also tried GradScaler with enabled=True, but the problem still exists.
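For reference, the point of GradScaler is to multiply the loss by a large factor so that small gradients survive the FP16 cast, then unscale them in FP32 before the optimizer step. A NumPy simulation of that mechanism (an illustration, not Lightning or torch.cuda.amp code) looks like:

```python
import numpy as np

SCALE = 2.0 ** 16  # a typical initial loss scale

grad_fp32 = np.float32(1e-8)  # a tiny gradient

# Without scaling, the FP16 cast underflows to zero and the update is lost.
naive = np.float16(grad_fp32)
print(naive)  # 0.0

# With loss scaling, the scaled gradient stays representable in FP16 ...
scaled = np.float16(np.float32(grad_fp32 * SCALE))
# ... and unscaling in FP32 recovers an approximation of the true gradient.
recovered = np.float32(scaled) / SCALE
print(abs(recovered - grad_fp32) < 1e-10)  # True
```

If the problem persists even with scaling enabled, the instability may come from overflow (activations or loss exceeding 65504) rather than gradient underflow, which scaling does not help with.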