@@ -4,6 +4,7 @@ pytorch-optimizer
44| |workflow| |Documentation Status| |PyPI version| |PyPi download| |black|
55
66| Bunch of optimizer implementations in PyTorch with clean code and strict types. It also includes useful optimization ideas.
7+ | Most of the implementations are based on the original papers, but I added some tweaks.
78| Highly inspired by `pytorch-optimizer <https://github.com/jettify/pytorch-optimizer>`__.
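
A minimal usage sketch (assuming the optimizers are exported at the package top level and follow the standard ``torch.optim.Optimizer`` interface):

::

    import torch
    from torch import nn

    from pytorch_optimizer import AdamP  # assumed top-level export

    model = nn.Linear(10, 2)
    optimizer = AdamP(model.parameters(), lr=1e-3, weight_decay=1e-2)

    x, y = torch.randn(8, 10), torch.randn(8, 2)

    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()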
89
910Documentation
@@ -53,6 +54,8 @@ Supported Optimizers
5354+--------------+----------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------+
5455| AdamP | *Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights * | `github <https://github.com/clovaai/AdamP >`__ | `https://arxiv.org/abs/2006.08217 <https://arxiv.org/abs/2006.08217 >`__ |
5556+--------------+----------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------+
57+ | diffGrad | *An Optimization Method for Convolutional Neural Networks * | `github <https://github.com/shivram1987/diffGrad >`__ | `https://arxiv.org/abs/1909.11015v3 <https://arxiv.org/abs/1909.11015v3 >`__ |
58+ +--------------+----------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------+
5659| MADGRAD | *A Momentumized, Adaptive, Dual Averaged Gradient Method for Stochastic * | `github <https://github.com/facebookresearch/madgrad >`__ | `https://arxiv.org/abs/2101.11075 <https://arxiv.org/abs/2101.11075 >`__ |
5760+--------------+----------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------+
5861| RAdam | *On the Variance of the Adaptive Learning Rate and Beyond * | `github <https://github.com/LiyuanLucasLiu/RAdam >`__ | `https://arxiv.org/abs/1908.03265 <https://arxiv.org/abs/1908.03265 >`__ |
@@ -70,42 +73,51 @@ of the ideas are applied in ``Ranger21`` optimizer.
7073
7174Also, most of the figures are taken from the ``Ranger21`` paper.
7275
73- Adaptive Gradient Clipping (AGC)
74- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
76+ +------------------------------------------+-------------------------------------+--------------------------------------------+
77+ | `Adaptive Gradient Clipping `_ | `Gradient Centralization `_ | `Softplus Transformation `_ |
78+ +------------------------------------------+-------------------------------------+--------------------------------------------+
79+ | `Gradient Normalization `_ | `Norm Loss `_ | `Positive-Negative Momentum `_ |
80+ +------------------------------------------+-------------------------------------+--------------------------------------------+
81+ | `Linear learning rate warmup `_ | `Stable weight decay `_ | `Explore-exploit learning rate schedule `_ |
82+ +------------------------------------------+-------------------------------------+--------------------------------------------+
83+ | `Lookahead `_ | `Chebyshev learning rate schedule `_ | `(Adaptive) Sharpness-Aware Minimization `_ |
84+ +------------------------------------------+-------------------------------------+--------------------------------------------+
85+ | `On the Convergence of Adam and Beyond `_ | | |
86+ +------------------------------------------+-------------------------------------+--------------------------------------------+
87+
88+ Adaptive Gradient Clipping
89+ --------------------------
7590
7691| This idea was originally proposed in the ``NFNet (Normalizer-Free Networks)`` paper.
77- | AGC (Adaptive Gradient Clipping) clips gradients based on the ``unit-wise ratio of gradient norms to parameter norms``.
92+ | ``AGC (Adaptive Gradient Clipping)`` clips gradients based on the ``unit-wise ratio of gradient norms to parameter norms``.
7893
79- - code :
80- `github <https://github.com/deepmind/deepmind-research/tree/master/nfnets >`__
94+ - code : `github <https://github.com/deepmind/deepmind-research/tree/master/nfnets >`__
8195- paper : `arXiv <https://arxiv.org/abs/2102.06171 >`__
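
A rough sketch of the clipping rule, simplified to whole-parameter norms (the paper applies it unit-wise, e.g. per output row); ``clip_factor=1e-2`` and ``eps=1e-3`` are assumed illustrative values:

::

    import torch

    def agc_(parameters, clip_factor: float = 1e-2, eps: float = 1e-3) -> None:
        # Rescale each gradient so that ||grad|| / ||param|| stays below `clip_factor`.
        for p in parameters:
            if p.grad is None:
                continue
            param_norm = p.detach().norm().clamp(min=eps)  # avoid clipping everything for zero-init weights
            grad_norm = p.grad.detach().norm()
            max_norm = param_norm * clip_factor
            if grad_norm > max_norm:
                p.grad.detach().mul_(max_norm / (grad_norm + 1e-6))

Such a helper would be called between ``loss.backward()`` and ``optimizer.step()``.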
8296
83- Gradient Centralization (GC)
84- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
97+ Gradient Centralization
98+ -----------------------
8599
86100+-----------------------------------------------------------------------------------------------------------------+
87101| .. image:: https://raw.githubusercontent.com/kozistr/pytorch_optimizer/main/assets/gradient_centralization.png |
88102+-----------------------------------------------------------------------------------------------------------------+
89103
90- Gradient Centralization (GC) operates directly on gradients by
91- centralizing the gradient to have zero mean.
104+ ``Gradient Centralization (GC)`` operates directly on gradients by centralizing the gradient to have zero mean.
92105
93- - code :
94- `github <https://github.com/Yonghongwei/Gradient-Centralization >`__
106+ - code : `github <https://github.com/Yonghongwei/Gradient-Centralization >`__
95107- paper : `arXiv <https://arxiv.org/abs/2004.01461 >`__
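
A minimal sketch of the centralization step (applied only to weights with more than one dimension, as in the paper):

::

    import torch

    def centralize_gradient(grad: torch.Tensor) -> torch.Tensor:
        # Subtract the mean over all dimensions except the first (output) one,
        # so each output unit's gradient slice has zero mean.
        if grad.dim() > 1:
            grad = grad - grad.mean(dim=tuple(range(1, grad.dim())), keepdim=True)
        return grad

An optimizer would apply this to each parameter's gradient right before its update.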
96108
97109Softplus Transformation
98- ~~~~~~~~~~~~~~~~~~~~~~~
110+ -----------------------
99111
100112Running the final variance denominator through the softplus function lifts extremely tiny values, keeping them numerically viable.
101113
102114- paper : `arXiv <https://arxiv.org/abs/1908.00700 >`__
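
A small illustration of the transformation on Adam-style second-moment statistics (``beta=50`` is an assumed illustrative value, not necessarily what this library uses):

::

    import torch
    import torch.nn.functional as F

    exp_avg_sq = torch.tensor([1e-12, 1e-6, 1e-2])  # second-moment estimates

    de_nom = exp_avg_sq.sqrt() + 1e-8                           # usual Adam denominator: sqrt(v) + eps
    de_nom_softplus = F.softplus(exp_avg_sq.sqrt(), beta=50.0)  # softplus-transformed denominator

    # softplus lifts the extremely small entries up to roughly log(2) / beta (~0.014 here),
    # while leaving the larger ones almost unchanged.
    print(de_nom)
    print(de_nom_softplus)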
103115
104116Gradient Normalization
105- ~~~~~~~~~~~~~~~~~~~~~~
117+ ----------------------
106118
107119Norm Loss
108- ~~~~~~~~~
120+ ---------
109121
110122+---------------------------------------------------------------------------------------------------+
111123| .. image:: https://raw.githubusercontent.com/kozistr/pytorch_optimizer/main/assets/norm_loss.png |
@@ -114,7 +126,7 @@ Norm Loss
114126- paper : `arXiv <https://arxiv.org/abs/2103.06583 >`__
115127
116128Positive-Negative Momentum
117- ~~~~~~~~~~~~~~~~~~~~~~~~~~
129+ --------------------------
118130
119131+--------------------------------------------------------------------------------------------------------------------+
120132| .. image:: https://raw.githubusercontent.com/kozistr/pytorch_optimizer/main/assets/positive_negative_momentum.png |
@@ -123,8 +135,8 @@ Positive-Negative Momentum
123135- code : `github <https://github.com/zeke-xie/Positive-Negative-Momentum >`__
124136- paper : `arXiv <https://arxiv.org/abs/2103.17182 >`__
125137
126- Linear learning- rate warm-up
127- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
138+ Linear learning rate warmup
139+ ---------------------------
128140
129141+----------------------------------------------------------------------------------------------------------+
130142| .. image:: https://raw.githubusercontent.com/kozistr/pytorch_optimizer/main/assets/linear_lr_warmup.png |
@@ -133,7 +145,7 @@ Linear learning-rate warm-up
133145- paper : `arXiv <https://arxiv.org/abs/1910.04209 >`__
134146
135147Stable weight decay
136- ~~~~~~~~~~~~~~~~~~~
148+ -------------------
137149
138150+-------------------------------------------------------------------------------------------------------------+
139151| .. image:: https://raw.githubusercontent.com/kozistr/pytorch_optimizer/main/assets/stable_weight_decay.png |
@@ -142,8 +154,8 @@ Stable weight decay
142154- code : `github <https://github.com/zeke-xie/stable-weight-decay-regularization >`__
143155- paper : `arXiv <https://arxiv.org/abs/2011.11152 >`__
144156
145- Explore-exploit learning- rate schedule
146- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
157+ Explore-exploit learning rate schedule
158+ --------------------------------------
147159
148160+---------------------------------------------------------------------------------------------------------------------+
149161| .. image:: https://raw.githubusercontent.com/kozistr/pytorch_optimizer/main/assets/explore_exploit_lr_schedule.png |
@@ -153,7 +165,7 @@ Explore-exploit learning-rate schedule
153165- paper : `arXiv <https://arxiv.org/abs/2003.03977 >`__
154166
155167Lookahead
156- ~~~~~~~~~
168+ ---------
157169
158170| ``k`` steps forward, 1 step back. ``Lookahead`` consists of keeping an exponential moving average of the weights, which is
159171| updated and substituted for the current weights every ``k_{lookahead}`` steps (5 by default).
@@ -162,14 +174,14 @@ Lookahead
162174- paper : `arXiv <https://arxiv.org/abs/1907.08610v2 >`__
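
A condensed sketch of that rule as a wrapper around any base optimizer (state handling and ``zero_grad`` passthrough omitted; ``k=5`` and ``alpha=0.5`` are the common defaults):

::

    import torch

    class SimpleLookahead:
        # Minimal sketch: keep slow weights and sync them with the fast weights every k steps.
        def __init__(self, optimizer, k: int = 5, alpha: float = 0.5):
            self.optimizer, self.k, self.alpha, self.counter = optimizer, k, alpha, 0
            self.slow = [
                [p.detach().clone() for p in group['params']]
                for group in optimizer.param_groups
            ]

        def step(self):
            self.optimizer.step()
            self.counter += 1
            if self.counter % self.k == 0:
                for group, slow_params in zip(self.optimizer.param_groups, self.slow):
                    for p, q in zip(group['params'], slow_params):
                        q.add_(p.detach() - q, alpha=self.alpha)  # slow += alpha * (fast - slow)
                        p.data.copy_(q)                           # fast <- slow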
163175
164176Chebyshev learning rate schedule
165- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
177+ --------------------------------
166178
167179Acceleration via Fractal Learning Rate Schedules
168180
169181- paper : `arXiv <https://arxiv.org/abs/2103.01338v1 >`__
170182
171- (Adaptive) Sharpness-Aware Minimization (A/SAM)
172- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
183+ (Adaptive) Sharpness-Aware Minimization
184+ ---------------------------------------
173185
174186| Sharpness-Aware Minimization (SAM) simultaneously minimizes loss value and loss sharpness.
175187| In particular, it seeks parameters that lie in neighborhoods having uniformly low loss.
@@ -178,6 +190,11 @@ Acceleration via Fractal Learning Rate Schedules
178190- ASAM paper : `paper <https://arxiv.org/abs/2102.11600 >`__
179191- A/SAM code : `github <https://github.com/davda54/sam >`__
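
A bare-bones sketch of a single SAM training step (two forward/backward passes; ``rho=0.05`` is an assumed value, and the adaptive per-parameter scaling of ASAM is omitted):

::

    import torch

    def sam_step(model, optimizer, loss_fn, x, y, rho: float = 0.05):
        # 1st pass: gradient at the current weights.
        loss_fn(model(x), y).backward()

        params = [p for p in model.parameters() if p.grad is not None]
        grad_norm = torch.norm(torch.stack([p.grad.norm() for p in params]))

        # Climb to the (approximate) local worst case: w + rho * g / ||g||.
        eps_list = []
        with torch.no_grad():
            for p in params:
                eps = p.grad * (rho / (grad_norm + 1e-12))
                p.add_(eps)
                eps_list.append(eps)
        optimizer.zero_grad()

        # 2nd pass: gradient at the perturbed weights.
        loss = loss_fn(model(x), y)
        loss.backward()

        with torch.no_grad():
            for p, eps in zip(params, eps_list):
                p.sub_(eps)  # restore the original weights

        optimizer.step()      # descend with the sharpness-aware gradient
        optimizer.zero_grad()
        return loss.item()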
180192
193+ On the Convergence of Adam and Beyond
194+ -------------------------------------
195+
196+ - paper : `paper <https://openreview.net/forum?id=ryQu7f-RZ >`__
197+
181198Citations
182199---------
183200
@@ -387,6 +404,32 @@ Adaptive Sharpness-Aware Minimization
387404 year={2021}
388405 }
389406
407+ diffGrad
408+
409+ ::
410+
411+ @article{dubey2019diffgrad,
412+ title={diffgrad: An optimization method for convolutional neural networks},
413+ author={Dubey, Shiv Ram and Chakraborty, Soumendu and Roy, Swalpa Kumar and Mukherjee, Snehasis and Singh, Satish Kumar and Chaudhuri, Bidyut Baran},
414+ journal={IEEE transactions on neural networks and learning systems},
415+ volume={31},
416+ number={11},
417+ pages={4500--4511},
418+ year={2019},
419+ publisher={IEEE}
420+ }
421+
422+ On the Convergence of Adam and Beyond
423+
424+ ::
425+
426+ @article{reddi2019convergence,
427+ title={On the convergence of adam and beyond},
428+ author={Reddi, Sashank J and Kale, Satyen and Kumar, Sanjiv},
429+ journal={arXiv preprint arXiv:1904.09237},
430+ year={2019}
431+ }
432+
390433Author
391434------
392435