[wip] Distributed Scion/Muon #1630
Conversation
Thanks for the PR on cutting-edge features!
I haven't read the papers, so please forgive me if my comments don't make sense.
I guess for "core" changes such as this one on optimizers, the recommended path is to first land in pytorch/pytorch, and then expose minimal interfaces to torchtitan. torchtitan shouldn't be a place to host core features.
cc @janeyx99 on interesting optimizer work
Update: the init refactor is done. You can check the diff here; this optimizer is not aggressive at all and doesn't require much hacking on 'components.optimizer', though we still need to add the configs.

I added the debug configs, so you can try it now:

CONFIG_FILE="./torchtitan/experiments/distributed_scion/train_configs/debug_model.toml" NGPU=4 ./run_train.sh

There is a "clean" version where I removed the logging code, which makes the code easier to read and understand.

A random test (note that Scion here uses a higher LR; Muon/Scion allows us to train a model with a high LR).
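On the higher-LR point, here is a rough sketch of the usual Muon/Scion-style parameter split, where 2D hidden weight matrices get the orthogonalized update at a large LR while embeddings, the output head, and 1D parameters fall back to AdamW at a conventional LR. The split rule, group names, and LR values below are illustrative assumptions, not the values in this PR's configs.

```python
# Illustrative parameter grouping only; the split rule and LRs are assumptions,
# not taken from this PR's debug_model.toml.
import torch.nn as nn


def split_param_groups(model: nn.Module):
    """Route 2D hidden weights to the Muon/Scion-style group (high LR) and
    everything else (embeddings, output head, norms, biases) to AdamW."""
    scion_params, adamw_params = [], []
    for name, p in model.named_parameters():
        if p.ndim == 2 and "embed" not in name and "output" not in name:
            scion_params.append(p)
        else:
            adamw_params.append(p)
    return (
        {"params": scion_params, "lr": 2e-2},  # assumed high LR for the orthogonalized update
        {"params": adamw_params, "lr": 3e-4},  # assumed conventional AdamW LR
    )
```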
This is a distributed version of Scion (the Modular Norm optimizer); Muon can be considered a variant of it that uses explicit AdamW for the LLM's embedding/output layers.
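For readers who haven't seen the papers, below is a minimal single-device sketch of the kind of update involved, following the publicly described Muon recipe (momentum plus a Newton-Schulz orthogonalization of the 2D weight update). It is not the distributed implementation in this PR, and the function names are made up for illustration.

```python
# Minimal single-device sketch of a Muon-style orthogonalized update, for
# illustration only -- NOT this PR's distributed implementation.
import torch


def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately map a 2D matrix G to the nearest semi-orthogonal matrix
    using the quintic Newton-Schulz iteration from the public Muon reference."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.to(torch.bfloat16)
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T
    # Frobenius normalization bounds the spectral norm by 1, which the iteration needs.
    X = X / (X.norm() + 1e-7)
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    if transposed:
        X = X.T
    return X.to(G.dtype)


@torch.no_grad()
def scion_like_step(params, momentum_buffers, lr=0.02, beta=0.95):
    """Momentum + orthogonalized update for 2D hidden weight matrices.
    Embedding/output parameters would instead live in a separate AdamW group."""
    for p, buf in zip(params, momentum_buffers):
        buf.mul_(beta).add_(p.grad)
        update = newton_schulz_orthogonalize(buf)
        # Shape-dependent scaling keeps the update magnitude roughly shape-independent.
        p.add_(update, alpha=-lr * max(1.0, p.size(0) / p.size(1)) ** 0.5)
```

The distributed part of this PR presumably concerns how these matrix operations interact with DTensor sharding (FSDP/TP/EP), which the sketch above ignores entirely.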
Works:
Missing:
Needs some extra work to adjust to the EP changes for EP [Shard(1)] and ETP? It also doesn't work for multiple shared_experts yet.
CC @janEbert @ofivite