Add SSO (Spectral Sphere Optimizer) and MuonSphere optimizers #2466

Draft
WhenWen wants to merge 1 commit into marin-community:main from WhenWen:add-sso-optimizer

WhenWen commented Jan 26, 2026

This commit adds two new optimizers for training language models from https://arxiv.org/abs/2601.08393:

  1. SSO (Spectral Sphere Optimizer): Full spectral sphere optimization with lambda solver

    • Retracts 2D weight matrices to spectral sphere with radius R = radius_scaler * sqrt(d_out/d_in)
    • Applies msign update (matrix sign function via Newton-Schulz iteration)
    • Solves for lambda to enforce tangent constraint
  2. MuonSphere: Simplified version with lambda=0

    • Same as SSO but without the lambda solver for faster computation
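
To make the retraction and update concrete, here is a minimal NumPy sketch of a MuonSphere-style step (lambda = 0). It is not the code in this PR: it uses an exact SVD in place of Newton-Schulz and power iteration, assumes weights are stored as (d_out, d_in), and assumes the ordering retract, step, retract. The names `spectral_radius_target`, `retract`, `msign_exact`, and `muonsphere_step` are hypothetical, and the SSO lambda solver is left out entirely.

```python
import numpy as np


def spectral_radius_target(d_out, d_in, radius_scaler=1.0):
    # R = radius_scaler * sqrt(d_out / d_in), as stated in the PR description.
    return radius_scaler * np.sqrt(d_out / d_in)


def retract(W, R):
    # Rescale W so that its largest singular value (spectral norm) equals R.
    sigma_max = np.linalg.norm(W, ord=2)
    return W * (R / sigma_max)


def msign_exact(G):
    # Reference matrix sign (polar factor) U @ Vt from a reduced SVD.
    # The PR approximates this with a Newton-Schulz iteration instead.
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt


def muonsphere_step(W, G, lr, radius_scaler=1.0):
    # Assumed ordering (not confirmed by the PR): retract, step, retract.
    d_out, d_in = W.shape  # assumed layout: (d_out, d_in)
    R = spectral_radius_target(d_out, d_in, radius_scaler)
    W = retract(W, R)
    W = W - lr * msign_exact(G)
    return retract(W, R)


rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))  # toy weight matrix
G = rng.normal(size=(8, 4))  # stand-in for a gradient or momentum buffer
W_new = muonsphere_step(W, G, lr=0.1, radius_scaler=2.0)
```

After the step, the spectral norm of `W_new` equals R again by construction. Full SSO would additionally solve for lambda so the update satisfies the tangent constraint before retracting.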

Key features:

  • Support for scan layers (automatically vmaps over layer dimension)
  • Polar Express Newton-Schulz coefficients for msign computation
  • Power iteration for top singular value/vector estimation
  • Compatible with haliax partitioning system
  • Includes example experiment for radius_scaler sweep
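
As a rough illustration of two of the building blocks listed above, the NumPy sketch below shows power iteration for the top singular triplet and a Newton-Schulz msign that broadcasts over a leading layer axis (standing in for the vmap over scan layers). It uses the textbook cubic iteration rather than the Polar Express coefficients, and the function names are hypothetical rather than taken from the PR.

```python
import numpy as np


def top_singular_triplet(W, num_iters=200, seed=0):
    """Power iteration on W^T W. Returns (sigma, u, v) with W @ v = sigma * u."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=W.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        v = W.T @ (W @ v)
        v /= np.linalg.norm(v)
    u = W @ v
    sigma = np.linalg.norm(u)
    return sigma, u / sigma, v


def msign_newton_schulz_cubic(G, steps=30, eps=1e-7):
    """Approximate msign(G) with the cubic iteration X <- 1.5 X - 0.5 (X X^T) X."""
    # Frobenius-normalize each matrix so its singular values are at most 1.
    X = G / (np.linalg.norm(G, axis=(-2, -1), keepdims=True) + eps)
    for _ in range(steps):
        # Matmul broadcasts over any leading (layer) axes.
        X = 1.5 * X - 0.5 * (X @ np.swapaxes(X, -1, -2)) @ X
    return X


rng = np.random.default_rng(0)
layers = rng.normal(size=(3, 8, 4))  # (num_layers, d_out, d_in), toy sizes
updates = msign_newton_schulz_cubic(layers)       # one msign per layer
sigma, u, v = top_singular_triplet(layers[0])     # top singular value and vectors
```

The cubic iteration is slower to converge than tuned coefficient sets, which is presumably why the PR reaches for Polar Express; it is used here only because its coefficients are unambiguous.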

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings January 26, 2026 07:57

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@WhenWen WhenWen marked this pull request as draft January 26, 2026 07:57

WhenWen commented Jan 26, 2026

A quick test run is here: https://api.wandb.ai/links/marin-community/emank8v9

}


def msign_newton_schulz(

Contributor

@WhenWen why not reuse/replace zeropower_via_newtonschulz5 from levanter.optim.muon?

Oh my, I just remembered we were using the old coefficients this entire time 👀

Contributor Author

Yeah we should likely update Muon to allow selecting a better coefficient. Will write a PR
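
One possible shape for a selectable-coefficient msign, sketched in NumPy under stated assumptions: the iteration takes a per-step schedule of (a, b, c) tuples, so a fixed coefficient set and a set that varies by step go through the same code path. `msign_quintic` is a hypothetical name, not an existing Levanter function. `MUON_QUINTIC` holds the widely circulated Muon reference values; whether these match what `zeropower_via_newtonschulz5` in Levanter currently uses is an assumption. Polar Express values are deliberately not reproduced here.

```python
import numpy as np

# Coefficients (a, b, c) for X <- a X + b (X X^T) X + c (X X^T)^2 X.
MUON_QUINTIC = (3.4445, -4.7750, 2.0315)    # tuned; fast but not exactly convergent
CLASSIC_QUINTIC = (15 / 8, -10 / 8, 3 / 8)  # textbook quintic; converges to msign


def msign_quintic(G, schedule, eps=1e-7):
    """Run one quintic Newton-Schulz step per (a, b, c) tuple in `schedule`."""
    # Frobenius-normalize so all singular values start at or below 1.
    X = G / (np.linalg.norm(G) + eps)
    for a, b, c in schedule:
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X


rng = np.random.default_rng(0)
G = rng.normal(size=(8, 4))
fast = msign_quintic(G, [MUON_QUINTIC] * 5)            # 5 tuned steps
converged = msign_quintic(G, [CLASSIC_QUINTIC] * 20)   # 20 textbook steps
```

With the tuned coefficients the singular values end up only roughly near 1, while the textbook quintic drives them to 1 given enough steps. A schedule argument would let the optimizer config pick between such sets without duplicating the iteration.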

Contributor Author

But for SSO I have been using polar express with step 8 lol

Contributor

We should probably move this function to lib/levanter/src/levanter/optim/utils.py. Wdyt?

Contributor Author

Makes sense! Add #2545

@github-actions

This pull request has been inactive for 23 days and is marked as stale.
If there is no further activity within 7 days, it will be automatically closed.
If you believe this PR should remain open, please add a comment or update the PR.

github-actions bot added the stale label Feb 22, 2026