
Confusion between the concepts of fixed decay and decoupled decay #39


Description

@zzp1012

Decoupled decay refers to isolating the weight decay from the "gradient". The usual way to apply weight decay is to add an L2 regularization term to the loss function. For SGD this is equivalent to decaying the parameters directly, i.e., $\theta_{t+1} = \theta_t - \eta (\nabla L + \lambda \theta_t)$; however, for Adam and other more complex optimizers, the weight decay ends up hidden inside the 'gradients', where it gets rescaled by the adaptive terms. See the details in the paper https://arxiv.org/abs/1711.05101 .
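To make the difference concrete, here is a minimal NumPy sketch of a single Adam-style step both ways (illustrative names, bias correction omitted for brevity; not the repo's actual code):

```python
import numpy as np

def adam_step_l2(theta, grad, m, v, lr=1e-3, lam=1e-2,
                 beta1=0.9, beta2=0.999, eps=1e-8):
    """Coupled (L2) decay: the decay term is folded into the gradient,
    so it flows through the moments and is rescaled by sqrt(v)."""
    g = grad + lam * theta                      # decay hidden in the 'gradient'
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    theta = theta - lr * m / (np.sqrt(v) + eps)
    return theta, m, v

def adam_step_decoupled(theta, grad, m, v, lr=1e-3, lam=1e-2,
                        beta1=0.9, beta2=0.999, eps=1e-8):
    """Decoupled decay (AdamW): the moments see only the raw gradient;
    the decay is applied directly to the parameters, scaled by lr."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    theta = theta - lr * m / (np.sqrt(v) + eps) - lr * lam * theta
    return theta, m, v
```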

Here, what you implemented is fixed decay, i.e., $\theta \leftarrow \theta (1 - \lambda)$, where the shrink factor is a constant and is not scaled by the learning rate $\eta$, so it does not follow the learning-rate schedule.
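For comparison, a sketch of the two parameter updates side by side (variable names are illustrative):

```python
import numpy as np

theta = np.array([1.0, -2.0])
lam, lr = 1e-2, 1e-3

# Fixed decay: constant shrink factor, independent of the learning rate.
theta_fixed = theta * (1 - lam)

# Decoupled decay (AdamW, arXiv:1711.05101): the shrink factor is
# multiplied by the current learning rate, so it tracks the schedule.
theta_decoupled = theta * (1 - lr * lam)
```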
