Optimizer state is not synchronized across replicas like model state is #2433
Conversation
From the `DistributedDataParallel` docs: "The module... assumes that [gradients] will be modified by the optimizer in all processes in the same way." Note that this is "assumed", not enforced. From https://pytorch.org/tutorials/recipes/zero_redundancy_optimizer.html: "each process keeps a dedicated replica of the optimizer. Since DDP has already synchronized gradients in the backward pass, all optimizer replicas will operate on the same parameter and gradient values in every iteration, and this is how DDP keeps model replicas in the same state".
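Roughly, the situation under discussion looks like the following minimal sketch (the toy model, the `gloo`/CPU backend, and the hyperparameters are illustrative assumptions, not code from the tutorial): DDP broadcasts the model state and all-reduces gradients, while each process constructs its own optimizer that DDP never touches.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def train(rank, world_size):
    # Each process sets up the process group and wraps the model in DDP.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = torch.nn.Linear(10, 1)
    ddp_model = DDP(model)  # DDP broadcasts the *model* state from rank 0

    # The optimizer is constructed independently in every process. DDP never
    # touches it; the replicas stay aligned only because each process happens
    # to use the same hyperparameters and sees the same synchronized gradients.
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    inputs = torch.randn(8, 10)   # per-rank data, as in ordinary data parallelism
    targets = torch.randn(8, 1)
    loss = torch.nn.functional.mse_loss(ddp_model(inputs), targets)
    loss.backward()               # gradients are all-reduced here by DDP
    optimizer.step()              # each replica applies the update locally

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2
    mp.spawn(train, args=(world_size,), nprocs=world_size)
```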
Can you please reference an issue you are fixing in the PR description?
It's not technically an issue fix since it is a wording change.
If you'd like, I can file an issue just so that this PR can fix it. Let me know.
The states are indeed not replicated, but I believe the author intended to say that the optimizer's initial hyperparameters (lr, seed, momentum, etc.) are the same across all devices. I suggest adding that clarification here instead of deleting the point about replicating optimizers.
Thanks! I interpret you as making a distinction between optimizer state synchronization at (a) initialization versus (b) after each step, and you argue "yes" for the former and "no" for the latter. But I am arguing that there is no optimizer synchronization in either case. If the optimizers are incidentally initialized to be the same across processes, as you say (which I concede will often be the case), that's only because of a decision the user made (or because of the user's oversight, since this is the default behavior) and NOT because DDP has any effect on the synchronization. That is, if you just create an optimizer like normal inside the function that each process runs, nothing forces the optimizers to agree across processes.

The way the docs were worded before my edit seems to imply that the optimizers are initialized to be identical because DDP enforces this. But if so, then I don't understand the mechanism: the only dependency between DDP and the optimizers is that the optimizer is constructed from the model's parameters.
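To make the argument concrete, here is a hypothetical variant of the training function above in which the learning rate deliberately differs by rank (the function name and the per-rank lr values are made up for illustration). Nothing in DDP flags this, and the model replicas silently drift apart:

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def train_with_mismatched_optimizers(rank, world_size):
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29501")
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    ddp_model = DDP(torch.nn.Linear(10, 1))   # DDP broadcasts rank 0's weights

    # Hypothetical user mistake: the learning rate depends on the rank.
    # Nothing in DDP detects or corrects this.
    lr = 0.01 if rank == 0 else 0.1
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=lr)

    for _ in range(3):
        optimizer.zero_grad()
        loss = ddp_model(torch.ones(8, 10)).sum()
        loss.backward()    # gradients are still identical after the all-reduce...
        optimizer.step()   # ...but the updates now differ, so the replicas diverge

    # Only an explicit check reveals the divergence; training continues silently.
    print(f"rank {rank}: weight sum = {ddp_model.module.weight.sum().item():.4f}")
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(train_with_mismatched_optimizers, args=(2,), nprocs=2)
```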
I agree with you. DDP does not explicitly do anything to enforce any synchronization for optimizers; they end up identical only because the same states are sent to each process. A user might be able to introduce changes to the optimizer in one process and DDP will silently continue (although I haven't encountered this atypical situation myself). Approving from my end, and cc'ing PyTorch distributed wizards @mrshenli @rohan-varma in case they have an opinion on this.
I guess this is technically right: we don't explicitly sync / replicate optimizer states, but they are usually the same since the optimizer acts on the same parameters and gradients. If folks feel this change makes things more accurate, then it should be good to merge.
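The "usually the same" behavior can be illustrated without any distributed machinery at all (a toy sketch with made-up values, not code from the tutorial): two completely independent optimizer instances, given identical parameters, hyperparameters, and gradients, produce identical updates, and that is the only thing keeping DDP's optimizer replicas aligned.

```python
import torch

# Two independent parameter copies with identical values.
p1 = torch.nn.Parameter(torch.ones(3))
p2 = torch.nn.Parameter(torch.ones(3))

# Two independent optimizers with identical hyperparameters; they share nothing.
opt1 = torch.optim.SGD([p1], lr=0.1, momentum=0.9)
opt2 = torch.optim.SGD([p2], lr=0.1, momentum=0.9)

grad = torch.tensor([0.5, -1.0, 2.0])  # stands in for the all-reduced gradient
for _ in range(5):
    p1.grad = grad.clone()
    p2.grad = grad.clone()
    opt1.step()
    opt2.step()

# Identical inputs to identical update rules give identical results, and the
# momentum buffers also end up equal even though they were never synchronized.
assert torch.equal(p1, p2)
```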
Looks like this PR hasn't been updated in a while, so we're going to go ahead and mark this as stale.
This documentation mistake has lingered for a year: https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html
Is there anything else that needs to be addressed before accepting my revision?
Please advise on what is causing the delay, @rohan-varma @subramen |
Please advise on what is causing the delay, @rohan-varma @subramen. It's a very simple change. |