diff --git a/README.md b/README.md
index f5f5532..ccb74bb 100644
--- a/README.md
+++ b/README.md
@@ -11,6 +11,7 @@
 [![CICD NeMo](https://github.com/NVIDIA-NeMo/Emerging-Optimizers/actions/workflows/cicd-main.yml/badge.svg?branch=main)](https://github.com/NVIDIA-NeMo/Emerging-Optimizers/actions/workflows/cicd-main.yml)
 [![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/release/python-3100/)
 ![GitHub Repo stars](https://img.shields.io/github/stars/NVIDIA-NeMo/Emerging-Optimizers)
+[![Documentation](https://img.shields.io/badge/api-reference-blue.svg)](https://docs.nvidia.com/nemo/emerging-optimizers/latest/index.html)
@@ -53,15 +54,13 @@
 pip install .
 
 ## Usage
 
-### Muon Optimizer
+### Example
 
-Muon (MomentUm Orthogonalized by Newton-schulz) uses orthogonalization for 2D parameters.
-
-For a simple usage example, see [`tests/test_orthogonalized_optimizer.py::MuonTest`](tests/test_orthogonalized_optimizer.py).
+Refer to the tests for usage of the different optimizers, e.g. [`tests/test_orthogonalized_optimizer.py::MuonTest`](tests/test_orthogonalized_optimizer.py).
 
 ### Integration with Megatron Core
 
-Integration with Megatron Core is in progress. See the [integration PR](https://github.com/NVIDIA/Megatron-LM/pull/1813) that demonstrates usage with Dense and MoE models.
+Integration with Megatron Core is available in the **dev** branch; see, e.g., [muon.py](https://github.com/NVIDIA/Megatron-LM/blob/dev/megatron/core/optimizer/muon.py).
 
 ## Benchmarks
diff --git a/docs/apidocs/index.md b/docs/apidocs/index.md
index ccd7a57..8f5391e 100644
--- a/docs/apidocs/index.md
+++ b/docs/apidocs/index.md
@@ -6,10 +6,11 @@ NeMo Emerging Optimizers API reference provides comprehensive technical documentation
 :caption: API Documentation
 :hidden:
 
-utils.md
 orthogonalized-optimizers.md
 soap.md
 riemannian-optimizers.md
 psgd.md
 scalar-optimizers.md
+mixin.md
+utils.md
 ```
\ No newline at end of file
diff --git a/docs/apidocs/mixin.md b/docs/apidocs/mixin.md
new file mode 100644
index 0000000..77af82b
--- /dev/null
+++ b/docs/apidocs/mixin.md
@@ -0,0 +1,12 @@
+
+```{eval-rst}
+.. role:: hidden
+    :class: hidden-section
+
+emerging_optimizers.mixin
+==========================
+
+.. automodule:: emerging_optimizers.mixin
+    :members:
+    :private-members:
+```
\ No newline at end of file
diff --git a/docs/apidocs/orthogonalized-optimizers.md b/docs/apidocs/orthogonalized-optimizers.md
index 2fb2b77..388ef48 100644
--- a/docs/apidocs/orthogonalized-optimizers.md
+++ b/docs/apidocs/orthogonalized-optimizers.md
@@ -21,6 +21,13 @@ emerging_optimizers.orthogonalized_optimizers
     :members:
 
+:hidden:`Scion`
+~~~~~~~~~~~~~~~
+
+.. autoclass:: Scion
+    :members:
+
+
 :hidden:`Newton-Schulz`
 ~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. automodule:: emerging_optimizers.orthogonalized_optimizers.muon_utils
diff --git a/docs/apidocs/soap.md b/docs/apidocs/soap.md
index 6dcf3bc..dc107d0 100644
--- a/docs/apidocs/soap.md
+++ b/docs/apidocs/soap.md
@@ -20,6 +20,8 @@ emerging_optimizers.soap
 
 .. autofunction:: update_kronecker_factors
 
+.. autofunction:: update_kronecker_factors_kl_shampoo
+
 .. autofunction:: update_eigenbasis_and_momentum
 
 emerging_optimizers.soap.soap_utils
diff --git a/docs/index.md b/docs/index.md
index 8cc66c6..d90b53b 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -12,7 +12,7 @@ Emerging-Optimizers is under active development. All APIs are experimental and subject to change.
 
 ### Prerequisites
 
-- Python 3.12 or higher
+- Python 3.10 or higher (3.12 recommended)
 - PyTorch 2.0 or higher
 
 ### Install from Source
@@ -33,8 +33,8 @@ Coming soon.
 
 :caption: 🛠️ Development
 :hidden:
 
-documentation.md
 apidocs/index.md
+documentation.md
 ```
 
diff --git a/emerging_optimizers/mixin.py b/emerging_optimizers/mixin.py
index 508166e..4ad2dc6 100644
--- a/emerging_optimizers/mixin.py
+++ b/emerging_optimizers/mixin.py
@@ -25,6 +25,7 @@ class WeightDecayMixin:
     """Mixin for weight decay
 
     Supports different types of weight decay:
+
     - "decoupled": weight decay is applied directly to params without changing gradients
     - "independent": similar to decoupled weight decay, but without tying weight decay and learning rate
     - "l2": classic L2 regularization
diff --git a/emerging_optimizers/orthogonalized_optimizers/__init__.py b/emerging_optimizers/orthogonalized_optimizers/__init__.py
index 8b8f9a4..c809ebb 100644
--- a/emerging_optimizers/orthogonalized_optimizers/__init__.py
+++ b/emerging_optimizers/orthogonalized_optimizers/__init__.py
@@ -14,4 +14,5 @@
 # limitations under the License.
 from emerging_optimizers.orthogonalized_optimizers.muon import *
 from emerging_optimizers.orthogonalized_optimizers.orthogonalized_optimizer import *
+from emerging_optimizers.orthogonalized_optimizers.scion import *
 from emerging_optimizers.orthogonalized_optimizers.spectral_clipping_utils import *
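For reviewers who want to see the README's new pointer in action, here is a minimal, hypothetical usage sketch. `Muon` is re-exported from `emerging_optimizers.orthogonalized_optimizers` (per the `__init__.py` hunk above), but the constructor arguments shown (`lr`, `momentum`) are assumptions rather than the confirmed signature; [`tests/test_orthogonalized_optimizer.py::MuonTest`](tests/test_orthogonalized_optimizer.py) remains the authoritative example.

```python
# Hypothetical sketch only -- the constructor arguments are assumed, not the
# library's confirmed API; see MuonTest for real usage.
import torch

from emerging_optimizers.orthogonalized_optimizers import Muon

# Muon orthogonalizes updates of 2D (matrix) parameters, so a bias-free
# linear layer makes a convenient toy example.
model = torch.nn.Linear(512, 512, bias=False)
opt = Muon(model.parameters(), lr=1e-3, momentum=0.95)  # assumed arguments

for _ in range(3):
    loss = model(torch.randn(8, 512)).square().mean()
    loss.backward()   # populate .grad for the optimizer
    opt.step()        # orthogonalized update
    opt.zero_grad()
```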
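The `WeightDecayMixin` docstring touched above distinguishes three weight decay styles. As a reading aid only, the sketch below shows what each style conventionally does to a parameter/gradient pair; it is not the library's implementation, and the helper name `apply_weight_decay` is made up for illustration.

```python
# Illustrative sketch (not the library's code) of the three weight decay
# styles named in the WeightDecayMixin docstring.
import torch


def apply_weight_decay(
    param: torch.Tensor, grad: torch.Tensor, style: str, lr: float, weight_decay: float
) -> torch.Tensor:
    """Return the (possibly modified) gradient; the decoupled-style
    variants mutate the parameter in place instead."""
    if style == "l2":
        # Classic L2 regularization: fold the penalty into the gradient,
        # so it also flows through momentum/preconditioning.
        return grad + weight_decay * param
    if style == "decoupled":
        # Decoupled (AdamW-style): shrink the weights directly, scaled by
        # the learning rate; the gradient is left untouched.
        param.mul_(1.0 - lr * weight_decay)
        return grad
    if style == "independent":
        # Like decoupled, but the shrinkage is not tied to the learning rate.
        param.mul_(1.0 - weight_decay)
        return grad
    raise ValueError(f"unknown weight decay style: {style}")


# Example: L2 leaves the parameter alone and perturbs the gradient.
p, g = torch.randn(4, 4), torch.randn(4, 4)
g = apply_weight_decay(p, g, style="l2", lr=1e-3, weight_decay=0.01)
```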