DeepSpeed NaN loss #9998
Unanswered
linhlevandlu
asked this question in
Lightning Trainer API: Trainer, LightningModule, LightningDataModule
Replies: 1 comment 4 replies
-
Hi, can you check these two threads? https://discuss.pytorch.org/t/problem-about-predict-nan-after-few-batch/47010/2 and https://stackoverflow.com/questions/58457901/pytorch-model-returns-nans-after-first-round
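A first diagnostic for NaN losses like the ones reported in this thread is to find the step at which the loss first stops being finite: NaN from step 0 and NaN after a few updates usually point to different causes. The following is only a minimal sketch in plain Python. The helper name `first_non_finite` is hypothetical (it is not part of Lightning or DeepSpeed), and the sample values are illustrative, not taken from the run discussed here.

```python
import math
from typing import Iterable, Optional


def first_non_finite(losses: Iterable[float]) -> Optional[int]:
    """Return the index of the first NaN/inf loss, or None if all are finite."""
    for step, value in enumerate(losses):
        if not math.isfinite(value):
            return step
    return None


# Illustrative values only (assumption), standing in for logged per-step losses.
logged = [0.91, 0.87, float("inf"), float("nan")]
print(first_non_finite(logged))  # prints 2
```

In a real training loop the list would be filled with Python floats, for example by appending `loss.item()` after each step.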
-
Hello,
I tried to train my model with DeepSpeed (e.g. deepspeed_stage_3_offload) through the PyTorch Lightning plugins.
With it, the model fits on a small GPU, which I cannot use without DeepSpeed.
However, I ran into another problem: all losses (train and validation) are NaN, even after reducing the learning rate.
Note that we get a good loss without DeepSpeed.
Has anybody met the same problem? Or could you explain what causes it?
Thank you.