Optimizer state is not synchronized across replicas like model state is #2433
Conversation
From the `DistributedDataParallel` docs: "The module... assumes that [gradients] will be modified by the optimizer in all processes in the same way." Note that this is "assumed", not enforced. From https://pytorch.org/tutorials/recipes/zero_redundancy_optimizer.html: "each process keeps a dedicated replica of the optimizer. Since DDP has already synchronized gradients in the backward pass, all optimizer replicas will operate on the same parameter and gradient values in every iteration, and this is how DDP keeps model replicas in the same state".
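Roughly, the situation under discussion looks like the following minimal sketch (the toy model, the `gloo`/CPU backend, and the hyperparameters are illustrative assumptions, not code from the tutorial): DDP broadcasts the model state and all-reduces gradients, while each process constructs its own optimizer that DDP never touches.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def train(rank, world_size):
    # Each process sets up the process group and wraps the model in DDP.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = torch.nn.Linear(10, 1)
    ddp_model = DDP(model)  # DDP broadcasts the *model* state from rank 0

    # The optimizer is constructed independently in every process. DDP never
    # touches it; the replicas stay aligned only because each process happens
    # to use the same hyperparameters and sees the same synchronized gradients.
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    inputs = torch.randn(8, 10)   # per-rank data, as in ordinary data parallelism
    targets = torch.randn(8, 1)
    loss = torch.nn.functional.mse_loss(ddp_model(inputs), targets)
    loss.backward()               # gradients are all-reduced here by DDP
    optimizer.step()              # each replica applies the update locally

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2
    mp.spawn(train, args=(world_size,), nprocs=world_size)
```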
Can you please reference an issue you are fixing in the PR description?
It's not technically an issue fix since it is a wording change.
If you'd like, I can file an issue just so that this PR can fix it. Let me know.
The states are indeed not replicated, but I believe the author intended to say that the optimizer's initial hyperparameters (lr, seed, momentum, etc.) are the same across all devices. I suggest adding that clarification here instead of deleting the point about replicating optimizers.
Thanks! I interpret you as making a distinction between optimizer state synchronization at (a) initialization versus (b) after each step, and you argue "yes" for the former and "no" for the latter. But I am arguing that there is no optimizer synchronization in either case. If the optimizers are incidentally initialized to be the same across processes, as you say (which I concede will often be the case), that's only because of a decision the user made (or because of the user's oversight, since this is the default behavior) and NOT because DDP has any effect on the synchronization. That is, if you just create an optimizer like normal inside the function that each process runs, nothing forces the optimizers to agree across processes.

The way the docs were worded before my edit seems to imply that the optimizers are initialized to be identical because DDP enforces this. But if so, then I don't understand the mechanism: the only dependency between DDP and the optimizers is that the optimizer is constructed from the model's parameters.
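To make the argument concrete, here is a hypothetical variant of the training function above in which the learning rate deliberately differs by rank (the function name and the per-rank lr values are made up for illustration). Nothing in DDP flags this, and the model replicas silently drift apart:

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def train_with_mismatched_optimizers(rank, world_size):
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29501")
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    ddp_model = DDP(torch.nn.Linear(10, 1))   # DDP broadcasts rank 0's weights

    # Hypothetical user mistake: the learning rate depends on the rank.
    # Nothing in DDP detects or corrects this.
    lr = 0.01 if rank == 0 else 0.1
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=lr)

    for _ in range(3):
        optimizer.zero_grad()
        loss = ddp_model(torch.ones(8, 10)).sum()
        loss.backward()    # gradients are still identical after the all-reduce...
        optimizer.step()   # ...but the updates now differ, so the replicas diverge

    # Only an explicit check reveals the divergence; training continues silently.
    print(f"rank {rank}: weight sum = {ddp_model.module.weight.sum().item():.4f}")
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(train_with_mismatched_optimizers, args=(2,), nprocs=2)
```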
I agree with you. DDP does not explicitly do anything to enforce any synchronization for optimizers; they end up identical only because the same states are sent to each process. A user might be able to introduce changes to the optimizer in one process and DDP will silently continue (although I haven't encountered this atypical situation myself). Approving from my end, and cc'ing PyTorch distributed wizards @mrshenli @rohan-varma in case they have an opinion on this.
I guess this is technically right: we don't explicitly sync / replicate optimizer states, but they are usually the same since the optimizer acts on the same parameters and gradients. If folks feel this change makes things more accurate, then it should be good to merge.
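The "usually the same" behavior can be illustrated without any distributed machinery at all (a toy sketch with made-up values, not code from the tutorial): two completely independent optimizer instances, given identical parameters, hyperparameters, and gradients, produce identical updates, and that is the only thing keeping DDP's optimizer replicas aligned.

```python
import torch

# Two independent parameter copies with identical values.
p1 = torch.nn.Parameter(torch.ones(3))
p2 = torch.nn.Parameter(torch.ones(3))

# Two independent optimizers with identical hyperparameters; they share nothing.
opt1 = torch.optim.SGD([p1], lr=0.1, momentum=0.9)
opt2 = torch.optim.SGD([p2], lr=0.1, momentum=0.9)

grad = torch.tensor([0.5, -1.0, 2.0])  # stands in for the all-reduced gradient
for _ in range(5):
    p1.grad = grad.clone()
    p2.grad = grad.clone()
    opt1.step()
    opt2.step()

# Identical inputs to identical update rules give identical results, and the
# momentum buffers also end up equal even though they were never synchronized.
assert torch.equal(p1, p2)
```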
Looks like this PR hasn't been updated in a while, so we're going to go ahead and mark this as stale.
This documentation mistake has lingered for a year: https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html
Is there anything else that needs to be addressed before accepting my revision?
Please advise on what is causing the delay, @rohan-varma @subramen |
Please advise on what is causing the delay, @rohan-varma @subramen. It's a very simple change. |