Clip norm after scaler.unscale_ in native fp16 training #9599

@del2z

Description

🐛 Bug

I trained a large model using native AMP, but the loss converged very slowly. After carefully checking the backward and optimization code, I found that `clip_gradients` is executed right after `backward`, while `scaler.unscale_` is only called in `pre_optimization_step`.
According to the PyTorch documentation, gradients must be unscaled before they are clipped, so the order of these two steps should be swapped. As it stands, using `gradient_clip_val` together with native AMP can produce a very flat learning curve, because the clipping threshold is applied to the still-scaled gradients.
I hope this can be fixed.
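
For reference, the gradient-clipping recipe in the PyTorch AMP documentation unscales before clipping. A minimal sketch of that order (the model, optimizer, and random data below are hypothetical placeholders, not Lightning internals):

```python
import torch
import torch.nn.functional as F

device = "cuda"
model = torch.nn.Linear(10, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()

for _ in range(10):
    inputs = torch.randn(32, 10, device=device)
    targets = torch.randn(32, 1, device=device)

    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = F.mse_loss(model(inputs), targets)
    scaler.scale(loss).backward()

    # 1) Unscale first, so the gradients carry their true magnitudes ...
    scaler.unscale_(optimizer)
    # 2) ... then clip; clipping before unscaling applies the threshold to
    #    gradients that are still multiplied by the (large) loss scale.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    scaler.step(optimizer)
    scaler.update()
```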

Labels

bug (Something isn't working), help wanted (Open to be worked on), priority: 0 (High priority task)
