Gradient Clipping with mixed precision in case of NaN loss #11413
-
Greetings. I am getting a NaN validation loss.
Also, I am doing regression and I don't know what value of gradient clipping I should use. I checked the Trainer docs and found the following:
Can you explain what this means: 'If using Automatic Mixed Precision (AMP), the gradients will be unscaled before logging them'?
-
You totally can. That's saying that any scaling applied by 16-bit precision training will be undone before clipping the gradients, which means you do not need to worry about changing the gradient clipping value with vs. without precision=16.
Nobody does :P Try some experiments and find out!
Same thing as I explained above. It's just a technical detail; you do not need to worry about it.
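For anyone who wants to see what "unscaled before clipping" means mechanically, here is a minimal sketch of the equivalent raw PyTorch AMP loop, following the unscale-then-clip pattern from the torch.cuda.amp documentation. The toy model and synthetic data are placeholders, and a CUDA device is assumed:

```python
import torch
from torch import nn

# Toy setup purely for illustration (assumes a CUDA device).
model = nn.Linear(10, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()

for step in range(100):
    inputs = torch.randn(2, 10, device="cuda")
    targets = torch.randn(2, 1, device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = nn.functional.mse_loss(model(inputs), targets)
    scaler.scale(loss).backward()   # gradients carry the loss-scale factor here
    scaler.unscale_(optimizer)      # remove the scale factor first...
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
    # ...so max_norm=0.5 means the same thing it would in plain fp32 training
    scaler.step(optimizer)          # step is skipped automatically if grads contain inf/NaN
    scaler.update()
```

Because the clipping happens on the unscaled gradients, the threshold you pass is in the same units as in full-precision training, which is why the value does not need to change.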
-
I have been using both mixed-precision training and gradient clipping (norm value less than 0.5) together on a transformer model, and I get a NaN loss after a certain point. My batch size is only 2, so maybe it's too small? Should I increase my gradient clipping value in that case? Training initially starts out fine but goes to NaN loss after ~1000 iterations.
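One way to narrow down where the NaN comes from is to enable Lightning's built-in debugging flags alongside clipping. This is a sketch assuming a Lightning 1.x Trainer and a GPU; `MyTransformer` and `dm` are hypothetical placeholders for your LightningModule and datamodule:

```python
from pytorch_lightning import Trainer

trainer = Trainer(
    precision=16,                    # mixed-precision training
    gradient_clip_val=0.5,           # same value you would use in fp32
    gradient_clip_algorithm="norm",  # clip by total gradient norm
    track_grad_norm=2,               # log the grad 2-norm each step to spot spikes
    detect_anomaly=True,             # raise at the op that produced the NaN (slow)
)
# trainer.fit(MyTransformer(), datamodule=dm)  # hypothetical module/datamodule
```

Watching the logged gradient norm should tell you whether gradients spike before the loss goes to NaN (a clipping/learning-rate issue) or whether the NaN appears first in the forward pass (an fp16 overflow/underflow issue).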