Commit c5217db

Suggest 0.2 as extra_scale_factor in docstring
Signed-off-by: Hao Wu <[email protected]>
1 parent 34131a3 commit c5217db

File tree

1 file changed: +3 -2 lines changed
  • emerging_optimizers/orthogonalized_optimizers

emerging_optimizers/orthogonalized_optimizers/muon.py

Lines changed: 3 additions & 2 deletions
@@ -53,15 +53,16 @@ class Muon(OrthogonalizedOptimizer):
     Warning:
         - This optimizer requires that all parameters passed in are 2D.
         - It should not be used for the embedding layer, the final fully connected layer, or any 1-D
-          parameters; those should all be optimized by a standard method (e.g., AdamW).
+          parameters; those can all be optimized by a standard method (e.g., AdamW).
 
     Args:
         {_args_doc}
         coefficient_type: The type of coefficient set to use for the Newton-Schulz iteration. Can be one of
             ["simple", "quintic", "polar_express"].
         num_ns_steps: The number of iteration steps to use in the Newton-Schulz iteration.
         scale_mode: The type of scale factor to use for the update. Defaults to "spectral" style scaling.
-        extra_scale_factor: The additional scale factor to use for the update.
+        extra_scale_factor: The additional scale factor to use for the update. Set it to 0.2 can closely match
+            the update RMS norm of AdamW as suggested by https://arxiv.org/abs/2502.16982.
         use_syrk: Whether to use the Triton kernel for the Newton-Schulz iteration.
     """
 
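
Note on the 0.2 suggestion: the sketch below illustrates why a factor of 0.2 can give an update RMS of about 0.2. It is not this library's code. It assumes the scaling rule described in the cited paper, where the orthogonalized update is multiplied by 0.2 * sqrt(max(m, n)). How `scale_mode` in muon.py combines with `extra_scale_factor` is not visible in this diff. The helpers `rms`, `semi_orthogonal`, and `scale_update` are hypothetical names, and a zero-padded identity matrix stands in for the output of the Newton-Schulz iteration.

```python
import math


def rms(matrix):
    """Root-mean-square over all entries of a list-of-lists matrix."""
    total = sum(v * v for row in matrix for v in row)
    count = sum(len(row) for row in matrix)
    return math.sqrt(total / count)


def semi_orthogonal(m, n):
    """m x n matrix with orthonormal rows (m <= n): identity padded with zeros.

    Stand-in for an orthogonalized gradient; all singular values equal 1.
    """
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(m)]


def scale_update(update, extra_scale_factor):
    """Multiply by extra_scale_factor * sqrt(max(m, n)).

    Assumed scaling rule (taken from the cited paper, not from muon.py).
    """
    m, n = len(update), len(update[0])
    factor = extra_scale_factor * math.sqrt(max(m, n))
    return [[factor * v for v in row] for row in update]


m, n = 4, 16
ortho = semi_orthogonal(m, n)

# A semi-orthogonal m x n matrix has squared Frobenius norm min(m, n),
# so its RMS is sqrt(min(m, n) / (m * n)) = 1 / sqrt(max(m, n)).
base_rms = rms(ortho)

# After scaling, the RMS equals extra_scale_factor regardless of shape.
scaled_rms = rms(scale_update(ortho, 0.2))

print(base_rms, scaled_rms)
```

Under these assumptions the scaled update has the same RMS for any matrix shape, which is what lets a single factor be tuned against AdamW's typical update magnitude.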
