Gradient Clipping with mixed precision in case of NaN loss #11413
-
Greetings. I am getting a NaN validation loss.
Also, I am doing regression and I don't know what value of gradient clipping I should use. I checked the Trainer docs and found the following:
Can you explain what this means: 'If using Automatic Mixed Precision (AMP), the gradients will be unscaled before logging them'?
-
You totally can. That's saying that any scaling applied by 16-bit precision training will be undone before clipping the gradients, which means you do not need to worry about changing the gradient clipping value with vs. without precision=16.
Nobody does :P Try some experiments and find out!
Same thing as I explained above. It's just a technical detail; you do not need to worry about it.
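For anyone who wants to see what "unscaled before clipping" means mechanically, here is a minimal sketch of the equivalent raw PyTorch AMP loop, following the unscale-then-clip pattern from the torch.cuda.amp documentation. The toy model and synthetic data are placeholders, and a CUDA device is assumed:

```python
import torch
from torch import nn

# Toy setup purely for illustration (assumes a CUDA device).
model = nn.Linear(10, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()

for step in range(100):
    inputs = torch.randn(2, 10, device="cuda")
    targets = torch.randn(2, 1, device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = nn.functional.mse_loss(model(inputs), targets)
    scaler.scale(loss).backward()   # gradients carry the loss-scale factor here
    scaler.unscale_(optimizer)      # remove the scale factor first...
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
    # ...so max_norm=0.5 means the same thing it would in plain fp32 training
    scaler.step(optimizer)          # step is skipped automatically if grads contain inf/NaN
    scaler.update()
```

Because the clipping happens on the unscaled gradients, the threshold you pass is in the same units as in full-precision training, which is why the value does not need to change.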
-
I have been using both mixed-precision training and gradient clipping (norm value less than 0.5) together on a transformer model, and I get a NaN loss after a certain point. My batch size is only 2, so maybe it's too small? Should I increase my gradient clipping value in that case? Training initially starts out fine but goes to NaN loss after ~1000 iterations.
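One way to narrow down where the NaN comes from is to enable Lightning's built-in debugging flags alongside clipping. This is a sketch assuming a Lightning 1.x Trainer and a GPU; `MyTransformer` and `dm` are hypothetical placeholders for your LightningModule and datamodule:

```python
from pytorch_lightning import Trainer

trainer = Trainer(
    precision=16,                    # mixed-precision training
    gradient_clip_val=0.5,           # same value you would use in fp32
    gradient_clip_algorithm="norm",  # clip by total gradient norm
    track_grad_norm=2,               # log the grad 2-norm each step to spot spikes
    detect_anomaly=True,             # raise at the op that produced the NaN (slow)
)
# trainer.fit(MyTransformer(), datamodule=dm)  # hypothetical module/datamodule
```

Watching the logged gradient norm should tell you whether gradients spike before the loss goes to NaN (a clipping/learning-rate issue) or whether the NaN appears first in the forward pass (an fp16 overflow/underflow issue).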