Replies: 1 comment
There are many possible reasons for the loss becoming NaN in FP16 training; you can refer to the PyTorch documentation for advice on diagnosing them. MMEngine uses PyTorch's native AMP (torch.cuda.amp) under the hood.
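For reference, here is a minimal sketch of the standard torch.cuda.amp pattern that MMEngine's AMP support is built on. The model, optimizer, and data below are hypothetical placeholders so the loop runs end to end; note that GradScaler already skips optimizer steps whose gradients contain inf/NaN, which covers one common source of divergence:

```python
import torch
import torch.nn as nn

# Hypothetical toy setup so the loop is runnable end to end;
# swap in your own model, optimizer, and dataloader.
model = nn.Linear(16, 4).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()

for step in range(10):
    inputs = torch.randn(8, 16, device="cuda")
    targets = torch.randint(0, 4, (8,), device="cuda")

    optimizer.zero_grad()
    # autocast runs the forward pass in FP16 where it is safe and
    # keeps numerically sensitive ops in FP32.
    with torch.cuda.amp.autocast():
        loss = nn.functional.cross_entropy(model(inputs), targets)

    # GradScaler scales the loss so small gradients do not underflow
    # in FP16; steps whose unscaled gradients contain inf/NaN are
    # skipped and the scale factor is reduced automatically.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```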
When I use FP16 (the PyTorch AMP API), I run into a NaN-loss bug.
How does your AMP code handle this? Does it prevent the loss from becoming NaN? Or have you also encountered NaN losses in AMP training, and if so, how did you fix the problem?
Thank you for your answer.
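For anyone hitting the same issue, a generic way to narrow down where the NaN first appears (not specific to MMEngine; assert_finite is a made-up helper name):

```python
import torch

# Anomaly detection is slow but reports a traceback to the exact op
# that produced the first NaN/inf in the backward pass; enable it
# only while debugging.
torch.autograd.set_detect_anomaly(True)

def assert_finite(loss: torch.Tensor, step: int) -> None:
    """Hypothetical helper: fail fast instead of training on NaNs."""
    if not torch.isfinite(loss).all():
        raise RuntimeError(f"Non-finite loss {loss.item()} at step {step}")
```

Call assert_finite(loss, step) right after computing the loss, before the backward pass. Once the source is located, common fixes include lowering the learning rate, clipping gradients after scaler.unscale_(optimizer), or keeping the offending layer in FP32.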