
Confusion between the concepts of fixed decay and decoupled decay #39


Description

@zzp1012

Decoupled decay refers to isolating the weight decay from the "gradient". The usual way to apply weight decay is to add an L2 regularization term to the loss function. For SGD this is equivalent to decaying the parameters directly, i.e., $\theta_{t+1} = \theta_t - \eta (\nabla L + \lambda \theta_t)$; however, for Adam and other more complex optimizers, the weight decay ends up hidden inside the 'gradients', where it gets rescaled by the adaptive terms. See the details in the paper https://arxiv.org/abs/1711.05101 .
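To make the difference concrete, here is a minimal NumPy sketch of a single Adam-style step both ways (illustrative names, bias correction omitted for brevity; not the repo's actual code):

```python
import numpy as np

def adam_step_l2(theta, grad, m, v, lr=1e-3, lam=1e-2,
                 beta1=0.9, beta2=0.999, eps=1e-8):
    """Coupled (L2) decay: the decay term is folded into the gradient,
    so it flows through the moments and is rescaled by sqrt(v)."""
    g = grad + lam * theta                      # decay hidden in the 'gradient'
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    theta = theta - lr * m / (np.sqrt(v) + eps)
    return theta, m, v

def adam_step_decoupled(theta, grad, m, v, lr=1e-3, lam=1e-2,
                        beta1=0.9, beta2=0.999, eps=1e-8):
    """Decoupled decay (AdamW): the moments see only the raw gradient;
    the decay is applied directly to the parameters, scaled by lr."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    theta = theta - lr * m / (np.sqrt(v) + eps) - lr * lam * theta
    return theta, m, v
```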

Here, what you implemented is fixed decay, i.e., $\theta \leftarrow \theta (1 - \lambda)$, where the shrink factor is a constant and is not scaled by the learning rate $\eta$, so it does not follow the learning-rate schedule.
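For comparison, a sketch of the two parameter updates side by side (variable names are illustrative):

```python
import numpy as np

theta = np.array([1.0, -2.0])
lam, lr = 1e-2, 1e-3

# Fixed decay: constant shrink factor, independent of the learning rate.
theta_fixed = theta * (1 - lam)

# Decoupled decay (AdamW, arXiv:1711.05101): the shrink factor is
# multiplied by the current learning rate, so it tracks the schedule.
theta_decoupled = theta * (1 - lr * lam)
```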
