Labels: bug (Something isn't working), help wanted (Open to be worked on), priority: 0 (High priority task)
Description
🐛 Bug
I trained a large model using native AMP, but the loss converged very slowly. After carefully checking the backward and optimization code, I found that clip_gradients is executed right after backward, while scaler.unscale_ is only performed in pre_optimization_step.
According to the PyTorch documentation, the order of clipping and unscaling should be swapped: gradients must be unscaled before they are clipped. As it stands, gradient_clip_val can produce a very flat learning curve when used together with native AMP.
I hope this can be fixed.
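For reference, the order recommended in the PyTorch AMP gradient-clipping example is unscale first, then clip. Below is a minimal sketch of that ordering; `model`, `optimizer`, `dataloader`, and the `max_norm` value are placeholders for illustration, not Lightning internals:

```python
import torch

scaler = torch.cuda.amp.GradScaler()

for inputs, targets in dataloader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = torch.nn.functional.mse_loss(model(inputs), targets)

    scaler.scale(loss).backward()

    # 1. Unscale the gradients first so they are back in their true range...
    scaler.unscale_(optimizer)
    # 2. ...then clip. Clipping still-scaled gradients applies the threshold
    #    to values inflated by the loss scale, which is what flattens the
    #    learning curve described above.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    # 3. scaler.step() skips the optimizer step if gradients contain inf/NaN.
    scaler.step(optimizer)
    scaler.update()
```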