AMP training overflow #3249
Unanswered
cristinagrs asked this question in Q&A
Replies: 1 comment, 6 replies
-
Hi @cristinagrs, thanks for your interest here.
-
I am training a UNet segmentation model on CT scans, and I tried to use mixed precision as in: https://pytorch.org/docs/stable/notes/amp_examples.html#id2
However, it causes NaN loss and an overflow from which training never recovers.
If I use native AMP in PyTorch Lightning, the same non-recovering overflow occurs.
On the other hand, using apex through PyTorch Lightning solves the issue.
My question is: has anyone encountered this problem, and is there any way to solve it without apex?
I am using PyTorch 1.10.0, Lightning 1.4.9, and CUDA 11.4.
Thank you!
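For reference, a minimal sketch of the `GradScaler` loop from the linked AMP docs, with two common mitigations for immediate, non-recovering overflow: a smaller `init_scale` (the default is 2**16) and gradient clipping after unscaling. The model, data, and hyperparameters here are placeholders, not from the original post; on a CPU-only machine the flags below make it a plain fp32 loop.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"

# Placeholder model and loss standing in for the UNet from the post.
model = torch.nn.Sequential(
    torch.nn.Linear(8, 8), torch.nn.ReLU(), torch.nn.Linear(8, 1)
).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = torch.nn.MSELoss()

# Starting from a smaller init_scale skips the first few overflowing steps.
scaler = torch.cuda.amp.GradScaler(init_scale=2**10, enabled=use_amp)

for step in range(5):
    # Synthetic batch in place of real CT data.
    x = torch.randn(16, 8, device=device)
    y = torch.randn(16, 1, device=device)
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type=device, enabled=use_amp):
        loss = criterion(model(x), y)
    scaler.scale(loss).backward()
    # Unscale first so gradients can be clipped in fp32; clipping often
    # tames the spikes that trigger non-recovering overflows.
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)
    scaler.update()
```

If the scale still collapses, it is also worth checking the data pipeline: un-normalized CT intensities (Hounsfield units in the thousands) can push fp16 activations toward its ~65504 limit.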