@mkhona-nvidia
Contributor

Added Scion.

The main change is the parametrization, which allows Frank-Wolfe updates. We now do NOT use weight decay; instead we use the step size (i.e. the learning rate) and the spectral radius, with the unit_rms choice for width scaling.
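
For readers unfamiliar with the Frank-Wolfe form, here is a minimal NumPy sketch of one such update. It is not the code from this PR: the function names `spectral_lmo` and `frank_wolfe_step` are hypothetical, the LMO uses a full SVD for clarity, and it assumes that unit_rms means scaling by sqrt(d_out / d_in). The point it illustrates is that the `(1 - lr)` shrinkage plays the role weight decay would otherwise play, and the radius bounds the weight norm.

```python
import numpy as np


def spectral_lmo(grad, radius):
    """Linear minimization oracle over a scaled spectral-norm ball.

    Assumption: the unit_rms width scaling is sqrt(d_out / d_in).
    """
    d_out, d_in = grad.shape
    # Thin SVD: u is (d_out, k), vt is (k, d_in), with k = min(d_out, d_in).
    u, _, vt = np.linalg.svd(grad, full_matrices=False)
    # u @ vt has all singular values equal to 1 when grad has full rank.
    return -radius * np.sqrt(d_out / d_in) * (u @ vt)


def frank_wolfe_step(weight, grad, lr, radius):
    """Convex combination of the current weight and the LMO output.

    No explicit weight decay term: the (1 - lr) factor shrinks the weight.
    """
    return (1.0 - lr) * weight + lr * spectral_lmo(grad, radius)


# Toy run with random gradients standing in for real ones.
rng = np.random.default_rng(0)
d_out, d_in, radius, lr = 4, 8, 2.0, 0.1
weight = np.zeros((d_out, d_in))
for _ in range(50):
    grad = rng.standard_normal((d_out, d_in))
    weight = frank_wolfe_step(weight, grad, lr=lr, radius=radius)

# Largest singular value of the weight, and the ball it should stay inside.
spec_norm = np.linalg.norm(weight, 2)
bound = radius * np.sqrt(d_out / d_in)
```

Because every step is a convex combination of points inside the ball, `spec_norm` stays at or below `bound` as long as the initial weight starts inside it and `0 <= lr <= 1`.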

@copy-pr-bot

copy-pr-bot bot commented Nov 4, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@mkhona-nvidia mkhona-nvidia self-assigned this Nov 4, 2025
@valentyn1boreiko

Amazing, thanks @mkhona-nvidia! Could you please also

  • add 5D parallelism support (which includes a LayerWiseDistributedOptimizer wrapper and TensorParallel support, similar to what TensorParallelMuon in this commit is doing);
  • support 1D and 2D tensors in Scion, with automatic mapping to the respective LMOs, similar to how it is done here with class Auto and the different norms in norms_dict;
  • allow importing Scion from megatron/core/optimizer/__init__.py as an alternative Megatron optimizer, and allow passing layer-wise different radii to tune per parameter group (for example, separately for the router, other hidden layers, embedding, output layers, and 1D tensors)?
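
To make the second and third requests concrete, here is a rough sketch of dimension-based LMO dispatch with per-group radii. Everything in it is hypothetical: the names `RADII`, `sign_lmo`, `spectral_lmo`, and `auto_lmo`, the group labels, the radius values, and the choice of a sign LMO for 1D tensors are illustrative assumptions, not the `Auto` class or `norms_dict` referenced above and not an existing Megatron API.

```python
import numpy as np

# Hypothetical per-group radii; the group names and values are placeholders.
RADII = {"hidden": 1.0, "embedding": 0.5, "output": 0.5, "vector": 0.1}


def sign_lmo(grad, radius):
    """LMO over an infinity-norm ball: each entry moves to -radius * sign."""
    return -radius * np.sign(grad)


def spectral_lmo(grad, radius):
    """LMO over a spectral-norm ball, assuming sqrt(d_out / d_in) scaling."""
    d_out, d_in = grad.shape
    u, _, vt = np.linalg.svd(grad, full_matrices=False)
    return -radius * np.sqrt(d_out / d_in) * (u @ vt)


def auto_lmo(grad, group):
    """Pick an LMO from the tensor rank and a radius from the group name."""
    radius = RADII[group]
    if grad.ndim == 1:
        return sign_lmo(grad, radius)
    if grad.ndim == 2:
        return spectral_lmo(grad, radius)
    raise ValueError(f"unsupported tensor rank: {grad.ndim}")


bias_update = auto_lmo(np.array([0.3, -2.0, 0.0]), "vector")
matrix_update = auto_lmo(np.eye(2), "hidden")
```

In a real optimizer the group name would more naturally come from the parameter group itself, so that each group carries its own radius alongside its learning rate.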

@mkhona-nvidia mkhona-nvidia changed the title Mkhona/scion Scion optimizer Nov 4, 2025
Contributor

@skyw skyw left a comment

Some minor changes needed; otherwise LGTM.

@skyw skyw enabled auto-merge (squash) November 5, 2025 18:56
@skyw
Contributor

skyw commented Nov 5, 2025

/ok to test 5521b58

@skyw skyw merged commit 9139d55 into NVIDIA-NeMo:main Nov 5, 2025
14 checks passed

3 participants