Merged
9 changes: 4 additions & 5 deletions README.md
Original file line number Diff line number Diff line change
@@ -11,6 +11,7 @@
[![CICD NeMo](https://github.com/NVIDIA-NeMo/Emerging-Optimizers/actions/workflows/cicd-main.yml/badge.svg?branch=main)](https://github.com/NVIDIA-NeMo/Emerging-Optimizers/actions/workflows/cicd-main.yml)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/release/python-3100/)
![GitHub Repo stars](https://img.shields.io/github/stars/NVIDIA-NeMo/Emerging-Optimizers)
[![Documentation](https://img.shields.io/badge/api-reference-blue.svg)](https://docs.nvidia.com/nemo/emerging-optimizers/latest/index.html)

</div>

@@ -53,15 +54,13 @@ pip install .

## Usage

### Muon Optimizer
### Example

Muon (MomentUm Orthogonalized by Newton-Schulz) uses orthogonalization for 2D parameters.

For a simple usage example, see [`tests/test_orthogonalized_optimizer.py::MuonTest`](tests/test_orthogonalized_optimizer.py).
Refer to the tests for usage examples of the different optimizers, e.g. [`tests/test_orthogonalized_optimizer.py::MuonTest`](tests/test_orthogonalized_optimizer.py).
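The orthogonalization step at the heart of Muon can be sketched with a quintic Newton-Schulz iteration. The following is an illustrative, standalone sketch in plain PyTorch (the coefficients are those commonly used in Muon implementations), not this library's API:

```python
import torch


def newton_schulz_orthogonalize(grad: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximately orthogonalize a 2D tensor via a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315  # quintic coefficients tuned for fast convergence
    x = grad / (grad.norm() + eps)  # scale so all singular values are <= 1
    transposed = x.shape[0] > x.shape[1]
    if transposed:  # iterate on the wide orientation so the Gram matrix is small
        x = x.T
    for _ in range(steps):
        gram = x @ x.T
        x = a * x + (b * gram + c * gram @ gram) @ x  # pushes singular values toward 1
    return x.T if transposed else x


# Illustrative use inside a momentum-SGD-style step:
# update = newton_schulz_orthogonalize(momentum_buffer)
# param.data.add_(update, alpha=-lr)
```

The iteration drives the singular values of the (normalized) input toward 1 without an explicit SVD, which is why it is GPU-friendly.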

### Integration with Megatron Core

Integration with Megatron Core is in progress. See the [integration PR](https://github.com/NVIDIA/Megatron-LM/pull/1813) that demonstrates usage with Dense and MoE models.
Integration with Megatron Core is available in the **dev** branch; see, e.g., [muon.py](https://github.com/NVIDIA/Megatron-LM/blob/dev/megatron/core/optimizer/muon.py).

## Benchmarks

3 changes: 2 additions & 1 deletion docs/apidocs/index.md
@@ -6,10 +6,11 @@ NeMo Emerging Optimizers API reference provides comprehensive technical document
:caption: API Documentation
:hidden:

utils.md
orthogonalized-optimizers.md
soap.md
riemannian-optimizers.md
psgd.md
scalar-optimizers.md
mixin.md
utils.md
```
12 changes: 12 additions & 0 deletions docs/apidocs/mixin.md
@@ -0,0 +1,12 @@

```{eval-rst}
.. role:: hidden
:class: hidden-section

emerging_optimizers.mixin
==========================

.. automodule:: emerging_optimizers.mixin
:members:
:private-members:
```
7 changes: 7 additions & 0 deletions docs/apidocs/orthogonalized-optimizers.md
@@ -21,6 +21,13 @@ emerging_optimizers.orthogonalized_optimizers
:members:


:hidden:`Scion`
~~~~~~~~~~~~~~~

.. autoclass:: Scion
:members:


:hidden:`Newton-Schulz`
~~~~~~~~~~~~~~~~~~~~~~~~
.. automodule:: emerging_optimizers.orthogonalized_optimizers.muon_utils
2 changes: 2 additions & 0 deletions docs/apidocs/soap.md
@@ -20,6 +20,8 @@ emerging_optimizers.soap

.. autofunction:: update_kronecker_factors

.. autofunction:: update_kronecker_factors_kl_shampoo

.. autofunction:: update_eigenbasis_and_momentum

emerging_optimizers.soap.soap_utils
4 changes: 2 additions & 2 deletions docs/index.md
@@ -12,7 +12,7 @@ Emerging-Optimizers is under active development. All APIs are experimental and s

### Prerequisites

- Python 3.12 or higher
- Python 3.10 or higher (3.12 recommended)
- PyTorch 2.0 or higher

### Install from Source
@@ -33,8 +33,8 @@ Coming soon.
:caption: 🛠️ Development
:hidden:

documentation.md
apidocs/index.md
documentation.md
```


1 change: 1 addition & 0 deletions emerging_optimizers/mixin.py
@@ -25,6 +25,7 @@ class WeightDecayMixin:
"""Mixin for weight decay

Supports different types of weight decay:

- "decoupled": weight decay is applied directly to params without changing gradients
- "independent": similar as decoupled weight decay, but without tying weight decay and learning rate
- "l2": classic L2 regularization
1 change: 1 addition & 0 deletions emerging_optimizers/orthogonalized_optimizers/__init__.py
@@ -14,4 +14,5 @@
# limitations under the License.
from emerging_optimizers.orthogonalized_optimizers.muon import *
from emerging_optimizers.orthogonalized_optimizers.orthogonalized_optimizer import *
from emerging_optimizers.orthogonalized_optimizers.scion import *
from emerging_optimizers.orthogonalized_optimizers.spectral_clipping_utils import *