DeepSpeed Nan loss #9998

linhlevandlu · 2021-10-18T15:13:47Z

Hello,
I have been trying to use DeepSpeed (e.g. deepspeed_stage_3_offload) to train my model through the plugins in PyTorch Lightning.
With DeepSpeed, the model fits on a small GPU that I could not otherwise use.
However, I have run into another problem: all losses (train and validation) are NaN, even after reducing the learning rate.
Note that we get a good loss without DeepSpeed.
Has anybody run into the same problem? Or could you explain what causes it?
Thank you.

Programmer-RD-AI · 2021-10-26T07:40:40Z

Hi, can you check these threads?

https://discuss.pytorch.org/t/problem-about-predict-nan-after-few-batch/47010/2

https://stackoverflow.com/questions/58457901/pytorch-model-returns-nans-after-first-round

4 replies

Rami-Ismael Nov 1, 2021

Are you using 16-bit or 32-bit precision? Does the loss become NaN with either one?

linhlevandlu Nov 2, 2021
Author

I used 16-bit precision for training, and we got NaN loss with every DeepSpeed plugin (deepspeed_stage_2, deepspeed_stage_3, and deepspeed_stage_3_offload).
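
For context, one common way 16-bit training ends up with NaN is the narrow range of half precision: values above 65504 overflow to inf, very small values flush to zero, and inf - inf is NaN. The thread does not confirm that this is the cause here. The sketch below only illustrates the number format with the Python standard library; the helper name `to_fp16` is hypothetical.

```python
import math
import struct

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def to_fp16(x: float) -> float:
    """Round x to the nearest half-precision value, returned as a Python float.

    struct's 'e' format raises OverflowError for values too large for fp16.
    This helper (hypothetical, for illustration) maps that case to +/-inf,
    which is what a round-to-nearest cast to fp16 typically produces.
    """
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


big = to_fp16(70000.0)    # above FP16_MAX, so it overflows to inf
tiny = to_fp16(1e-8)      # below the smallest fp16 subnormal, so it becomes 0.0
nan_from_inf = big - big  # inf - inf is NaN

print(big, tiny, nan_from_inf)
```

A single overflowing activation or gradient is enough to turn every later loss value into NaN, which would match the symptom of NaN for both train and validation losses.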

linhlevandlu Nov 2, 2021
Author

It was very interesting to learn that we can change the precision during training. Unfortunately, we do not have enough memory for the backward pass when we set precision = 32.
Do you think we should use the DeepSpeed optimizer? Currently, I use Adam (from torch).
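
Separately from the optimizer choice, DeepSpeed's fp16 mode has loss-scaling settings that are often worth checking when 16-bit training produces NaN. The sketch below builds the "fp16" section of a DeepSpeed config as a Python dict. The key names follow DeepSpeed's configuration format, but the values are illustrative assumptions, not settings taken from this thread, and how the config is passed to the Lightning plugin depends on the Lightning version and is not shown.

```python
import json

# Illustrative values only; they are assumptions, not recommendations from the thread.
ds_config = {
    "fp16": {
        "enabled": True,
        "loss_scale": 0,            # 0 selects dynamic loss scaling
        "initial_scale_power": 16,  # initial loss scale is 2**16
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1,
    }
}

# The initial loss scale implied by the config above.
initial_scale = 2 ** ds_config["fp16"]["initial_scale_power"]

# DeepSpeed configs are usually stored as JSON.
config_text = json.dumps(ds_config, indent=2)
print(config_text)
```

With dynamic loss scaling, steps whose gradients overflow are skipped and the scale is lowered, so a few skipped steps early in training are expected. A loss that is NaN from the first step onward suggests the overflow happens in the forward pass instead.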

Programmer-RD-AI Nov 2, 2021

Hi, I can't recommend either the DeepSpeed optimizer or Adam, because I don't have much experience with DeepSpeed.

Sorry.
