Commit f53a2ed

Merge pull request #15 from kozistr/docs/readme

[Docs] Add the useful resources

2 parents b57f9b6 + f001016, commit f53a2ed

7 files changed (+172, -5 lines)

README.md

Lines changed: 172 additions & 5 deletions
@@ -15,13 +15,87 @@ $ pip3 install pytorch-optimizer
| Optimizer | Description | Official Code | Paper |
| :---: | :---: | :---: | :---: |
| AdamP | *Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights* | [github](https://github.com/clovaai/AdamP) | [https://arxiv.org/abs/2006.08217](https://arxiv.org/abs/2006.08217) |
- | Adaptive Gradient Clipping (AGC) | *High-Performance Large-Scale Image Recognition Without Normalization* | [github](https://github.com/deepmind/deepmind-research/tree/master/nfnets) | [https://arxiv.org/abs/2102.06171](https://arxiv.org/abs/2102.06171) |
- | Chebyshev LR Schedules | *Acceleration via Fractal Learning Rate Schedules* | [~~github~~]() | [https://arxiv.org/abs/2103.01338v1](https://arxiv.org/abs/2103.01338v1) |
- | Gradient Centralization (GC) | *A New Optimization Technique for Deep Neural Networks* | [github](https://github.com/Yonghongwei/Gradient-Centralization) | [https://arxiv.org/abs/2004.01461](https://arxiv.org/abs/2004.01461) |
- | Lookahead | *k steps forward, 1 step back* | [github](https://github.com/alphadl/lookahead.pytorch) | [https://arxiv.org/abs/1907.08610v2](https://arxiv.org/abs/1907.08610v2) |
| RAdam | *On the Variance of the Adaptive Learning Rate and Beyond* | [github](https://github.com/LiyuanLucasLiu/RAdam) | [https://arxiv.org/abs/1908.03265](https://arxiv.org/abs/1908.03265) |
| Ranger | *a synergistic optimizer combining RAdam and LookAhead, and now GC in one optimizer* | [github](https://github.com/lessw2020/Ranger-Deep-Learning-Optimizer) | |
- | Ranger21 | *integrating the latest deep learning components into a single optimizer* | [github](https://github.com/lessw2020/Ranger21) | | |
+ | Ranger21 | *a synergistic deep learning optimizer* | [github](https://github.com/lessw2020/Ranger21) | [https://arxiv.org/abs/2106.13731](https://arxiv.org/abs/2106.13731) |

## Useful Resources

Below are several optimization ideas that help regularize & stabilize training. Most of these ideas are applied in the `Ranger21` optimizer.

Also, most of the figures below are taken from the `Ranger21` paper.

### Adaptive Gradient Clipping (AGC)

This idea was originally proposed in the `NFNet (Normalizer-Free Network)` paper.
AGC (Adaptive Gradient Clipping) clips gradients based on the `unit-wise ratio of gradient norms to parameter norms`.

* github : [code](https://github.com/deepmind/deepmind-research/tree/master/nfnets)
* paper : [arXiv](https://arxiv.org/abs/2102.06171)
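
A rough PyTorch sketch of the unit-wise rule above. The function name and the `clip_factor` / `eps` defaults are illustrative assumptions (the NFNet paper uses clipping factors around 0.01), not this repository's API:

```python
import torch


def adaptive_gradient_clipping(parameters, clip_factor: float = 0.01, eps: float = 1e-3):
    """Clip gradients so that, unit-wise, the ratio of the gradient norm to the
    parameter norm never exceeds `clip_factor`."""
    for p in parameters:
        if p.grad is None:
            continue

        if p.ndim > 1:
            # One norm per output unit (dim 0), e.g. per conv filter or linear row.
            dims = tuple(range(1, p.ndim))
            param_norm = p.detach().norm(2, dim=dims, keepdim=True).clamp_(min=eps)
            grad_norm = p.grad.detach().norm(2, dim=dims, keepdim=True)
        else:
            # Biases / normalization parameters: a single norm for the whole tensor.
            param_norm = p.detach().norm(2).clamp_(min=eps)
            grad_norm = p.grad.detach().norm(2)

        max_norm = clip_factor * param_norm
        # Rescale only the units whose gradient norm exceeds the allowed maximum.
        rescaled = p.grad * (max_norm / grad_norm.clamp(min=1e-6))
        p.grad.detach().copy_(torch.where(grad_norm > max_norm, rescaled, p.grad))
```

Call it between `loss.backward()` and `optimizer.step()`; the paper also recommends skipping the final classifier layer.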
### Gradient Centralization (GC)

![gradient_centralization](assets/gradient_centralization.png)

Gradient Centralization (GC) operates directly on gradients by centralizing the gradient to have zero mean.

* github : [code](https://github.com/Yonghongwei/Gradient-Centralization)
* paper : [arXiv](https://arxiv.org/abs/2004.01461)
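
A minimal PyTorch sketch of that operation (illustrative only, not the repository's implementation); it is usually applied to multi-dimensional weights right before, or inside, the optimizer step:

```python
import torch


def centralize_gradients(parameters):
    """Gradient Centralization: remove the mean of each weight gradient,
    computed per output unit over the remaining dimensions. 1-D parameters
    (biases, normalization weights) are left untouched."""
    for p in parameters:
        if p.grad is None or p.grad.ndim <= 1:
            continue
        dims = tuple(range(1, p.grad.ndim))
        p.grad.sub_(p.grad.mean(dim=dims, keepdim=True))
```

Ranger, listed in the table above, folds this step directly into its optimizer update.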
### Softplus Transformation

Running the final variance denominator through the softplus function lifts extremely tiny values so that they stay numerically viable.

* paper : [arXiv](https://arxiv.org/abs/1908.00700)
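
A small sketch of where this sits in an Adam-style update. `softplus_denom` is an illustrative helper rather than part of this library's API, and the `beta` default is an assumption (commonly reported values are around 50):

```python
import torch
import torch.nn.functional as F


def softplus_denom(exp_avg_sq: torch.Tensor, beta: float = 50.0) -> torch.Tensor:
    """Replace Adam's usual `exp_avg_sq.sqrt() + eps` denominator with a
    softplus-transformed one: softplus(x) = log(1 + exp(beta * x)) / beta acts
    as a smooth floor (about log(2) / beta) for tiny x and is roughly x for large x."""
    return F.softplus(exp_avg_sq.sqrt(), beta=beta)


# Tiny second-moment estimates no longer blow up the effective step size:
v = torch.tensor([1e-12, 1e-4, 1e-2])
print(v.sqrt() + 1e-8)    # plain Adam denominator
print(softplus_denom(v))  # softplus-transformed denominator
```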
### Gradient Normalization

### Norm Loss

![norm_loss](assets/norm_loss.png)

* paper : [arXiv](https://arxiv.org/abs/2103.06583)

### Positive-Negative Momentum

![positive_negative_momentum](assets/positive_negative_momentum.png)

* github : [code](https://github.com/zeke-xie/Positive-Negative-Momentum)
* paper : [arXiv](https://arxiv.org/abs/2103.17182)

### Linear learning-rate warm-up

![linear_lr_warmup](assets/linear_lr_warmup.png)

* paper : [arXiv](https://arxiv.org/abs/1910.04209)
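
As a generic sketch of a linear warm-up using PyTorch's `LambdaLR` (the model and optimizer are placeholders; the warm-up length follows the untuned `2 / (1 - beta2)` rule of thumb from the linked paper):

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.999))

# "Untuned" warm-up length: about 2 / (1 - beta2) optimizer steps.
warmup_steps = int(2.0 / (1.0 - 0.999))  # 2000 steps for beta2 = 0.999

# Linearly ramp the learning rate from ~0 up to the base lr, then hold it.
scheduler = LambdaLR(optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps))

# In the training loop, call optimizer.step() and then scheduler.step() once per batch.
```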
### Stable weight decay

![stable_weight_decay](assets/stable_weight_decay.png)

* github : [code](https://github.com/zeke-xie/stable-weight-decay-regularization)
* paper : [arXiv](https://arxiv.org/abs/2011.11152)

### Explore-exploit learning-rate schedule

![explore_exploit_lr_schedule](assets/explore_exploit_lr_schedule.png)

* github : [code](https://github.com/nikhil-iyer-97/wide-minima-density-hypothesis)
* paper : [arXiv](https://arxiv.org/abs/2003.03977)
### Lookahead

`k` steps forward, 1 step back. `Lookahead` keeps an exponential moving average of the weights, which is updated and swapped in for the current weights every `k_{lookahead}` steps (5 by default).

* github : [code](https://github.com/alphadl/lookahead.pytorch)
* paper : [arXiv](https://arxiv.org/abs/1907.08610v2)
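
A condensed sketch of such a wrapper; the linked implementations are more complete (state dicts, closures), and `k=5` / `alpha=0.5` simply mirror the common defaults:

```python
import torch


class Lookahead:
    """Minimal Lookahead wrapper: every `k` inner steps, move the slow weights
    toward the fast weights by `alpha` and copy them back into the model."""

    def __init__(self, optimizer, k: int = 5, alpha: float = 0.5):
        self.optimizer = optimizer
        self.k = k
        self.alpha = alpha
        self.step_count = 0
        # Slow weights start as a copy of the current (fast) weights.
        self.slow_weights = [
            [p.detach().clone() for p in group['params']]
            for group in optimizer.param_groups
        ]

    def zero_grad(self, set_to_none: bool = True):
        self.optimizer.zero_grad(set_to_none=set_to_none)

    @torch.no_grad()
    def step(self):
        self.optimizer.step()  # k steps forward ...
        self.step_count += 1
        if self.step_count % self.k != 0:
            return
        # ... 1 step back: slow += alpha * (fast - slow); fast = slow.
        for group, slow_group in zip(self.optimizer.param_groups, self.slow_weights):
            for fast, slow in zip(group['params'], slow_group):
                slow.add_(fast.detach() - slow, alpha=self.alpha)
                fast.copy_(slow)
```

Usage: wrap any base optimizer, e.g. `opt = Lookahead(torch.optim.AdamW(model.parameters(), lr=3e-4))`, then call `opt.step()` / `opt.zero_grad()` as usual.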
### Chebyshev learning rate schedule

*Acceleration via Fractal Learning Rate Schedules*

* paper : [arXiv](https://arxiv.org/abs/2103.01338v1)

## Citations

@@ -118,6 +192,99 @@ $ pip3 install pytorch-optimizer

</details>

<details>

<summary>Norm Loss</summary>

```
@inproceedings{georgiou2021norm,
  title={Norm Loss: An efficient yet effective regularization method for deep neural networks},
  author={Georgiou, Theodoros and Schmitt, Sebastian and B{\"a}ck, Thomas and Chen, Wei and Lew, Michael},
  booktitle={2020 25th International Conference on Pattern Recognition (ICPR)},
  pages={8812--8818},
  year={2021},
  organization={IEEE}
}
```

</details>

<details>

<summary>Positive-Negative Momentum</summary>

```
@article{xie2021positive,
  title={Positive-Negative Momentum: Manipulating Stochastic Gradient Noise to Improve Generalization},
  author={Xie, Zeke and Yuan, Li and Zhu, Zhanxing and Sugiyama, Masashi},
  journal={arXiv preprint arXiv:2103.17182},
  year={2021}
}
```

</details>

<details>

<summary>Explore-Exploit learning rate schedule</summary>

```
@article{iyer2020wide,
  title={Wide-minima Density Hypothesis and the Explore-Exploit Learning Rate Schedule},
  author={Iyer, Nikhil and Thejas, V and Kwatra, Nipun and Ramjee, Ramachandran and Sivathanu, Muthian},
  journal={arXiv preprint arXiv:2003.03977},
  year={2020}
}
```

</details>

<details>

<summary>Linear learning-rate warm-up</summary>

```
@article{ma2019adequacy,
  title={On the adequacy of untuned warmup for adaptive optimization},
  author={Ma, Jerry and Yarats, Denis},
  journal={arXiv preprint arXiv:1910.04209},
  volume={7},
  year={2019}
}
```

</details>

<details>

<summary>Stable weight decay</summary>

```
@article{xie2020stable,
  title={Stable weight decay regularization},
  author={Xie, Zeke and Sato, Issei and Sugiyama, Masashi},
  journal={arXiv preprint arXiv:2011.11152},
  year={2020}
}
```

</details>

<details>

<summary>Softplus transformation</summary>

```
@article{tong2019calibrating,
  title={Calibrating the adaptive learning rate to improve convergence of adam},
  author={Tong, Qianqian and Liang, Guannan and Bi, Jinbo},
  journal={arXiv preprint arXiv:1908.00700},
  year={2019}
}
```

</details>

## Author

Hyeongchan Kim / [@kozistr](http://kozistr.tech/about)
Image assets (binary files, not shown in the diff):

assets/gradient_centralization.png (96.8 KB)
assets/linear_lr_warmup.png (114 KB)
assets/norm_loss.png (100 KB)
assets/stable_weight_decay.png (138 KB)
