Conversation

@timgianitsos timgianitsos commented Jun 6, 2023

From `DistributedDataParallel` docs:
"The module... assumes that [gradients] will be modified by the optimizer in all processes in the same way." Note that this is "assumed", not enforced.
From https://pytorch.org/tutorials/recipes/zero_redundancy_optimizer.html :
"each process keeps a dedicated replica of the optimizer. Since DDP has already synchronized gradients in the backward pass, all optimizer replicas will operate on the same parameter and gradient values in every iteration, and this is how DDP keeps model replicas in the same state"

Checklist

  • No unnecessary issues are included in this pull request.

@facebook-github-bot
Contributor

Hi @timgianitsos!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

@netlify

netlify bot commented Jun 6, 2023

Deploy Preview for pytorch-tutorials-preview ready!

🔨 Latest commit: 9082d4b
🔍 Latest deploy log: https://app.netlify.com/sites/pytorch-tutorials-preview/deploys/647ec92e0fd6490008055080
😎 Deploy Preview: https://deploy-preview-2433--pytorch-tutorials-preview.netlify.app/intermediate/fsdp_tutorial

@facebook-github-bot
Contributor

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

@svekars
Contributor

svekars commented Jun 6, 2023

Can you please reference an issue you are fixing in the PR description?

@svekars svekars requested a review from subramen June 6, 2023 15:26
@timgianitsos
Author

timgianitsos commented Jun 6, 2023 via email

Contributor

@subramen subramen left a comment


The states are indeed not replicated, but I believe the author intended to say that the optimizer's initial hyperparameters (lr, seed, momentum, etc.) are the same across all devices. I suggest adding that clarification here instead of deleting the point about replicating optimizers.

@timgianitsos
Author

timgianitsos commented Jun 6, 2023

Thanks! I read you as drawing a distinction between optimizer state synchronization (a) at initialization and (b) after each step, arguing "yes" for the former and "no" for the latter. But I am arguing that there is no optimizer synchronization in either case.

If the optimizers are incidentally initialized to be the same across processes, as you say (which I concede will often be the case), that is only because of a decision the user made (or the user's oversight, since this is the default behavior) and NOT because DDP has any effect on the synchronization. That is, if you just create an optimizer as normal inside the function that is sent to `multiprocessing.spawn`, e.g. `opt = RMSprop(ddp.parameters(), **opt_kwargs)`, then the optimizers will be identical. The user could make them different by writing `if rank == x: <do something different>`.

The way the docs were worded before my edit seems to imply that the optimizers are initialized to be identical because DDP enforces this. But if so, I don't understand the mechanism: the only dependency between DDP and the optimizers is the passing of `ddp.parameters()`. That is just a generator that yields the model's parameters as `nn.Parameter` objects, carrying no information the optimizer could use to tell whether the model is being distributed. From this, I conclude that the optimizers on different processes are NOT synced with each other, neither at initialization nor after each step.
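
To be concrete, here is a hypothetical helper (the optimizer choices are arbitrary) showing the divergence point I mean; DDP raises no error in either branch:

```python
import torch


def build_optimizer(rank, ddp_model):
    # Per-rank optimizer construction. DDP neither sees nor checks this object.
    if rank == 0:
        return torch.optim.RMSprop(ddp_model.parameters(), lr=1e-3)
    # The other ranks get a different optimizer: gradients are still all-reduced
    # in backward(), but step() now applies different updates, so the model
    # replicas silently drift apart.
    return torch.optim.SGD(ddp_model.parameters(), lr=1e-1, momentum=0.9)
```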

@subramen
Contributor

subramen commented Jun 14, 2023

I agree with you. DDP does not explicitly do anything to enforce synchronization of the optimizers; they end up identical only because the same states are sent to each process. A user could introduce changes to the optimizer in one process and DDP would silently continue (although I haven't encountered this atypical situation myself).

Approving from my end, and ccing PyTorch distributed wizards @mrshenli @rohan-varma in case they have an opinion on this

Contributor

@rohan-varma rohan-varma left a comment


I guess this is technically right: we don't explicitly sync or replicate optimizer states, but they are usually the same since the optimizer acts on the same parameters and gradients. If folks feel this change makes things more accurate, then it should be good to merge.
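
For what it's worth, here is a small hypothetical sanity check (the helper name is made up) that one could drop into a training loop: with gradients all-reduced in backward() and identically configured optimizers, the assertion should hold on every rank after every step.

```python
import torch
import torch.distributed as dist


def assert_replicas_in_sync(ddp_model):
    # Compare each local parameter against rank 0's copy. This only trips if
    # something (e.g. a rank-dependent optimizer) has broken the implicit lockstep.
    for p in ddp_model.parameters():
        reference = p.detach().clone()
        dist.broadcast(reference, src=0)  # every rank receives rank 0's values
        assert torch.allclose(p.detach(), reference), "model replicas have diverged"
```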

@pytorch-bot

pytorch-bot bot commented Aug 14, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/tutorials/2433

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 9beb1d1:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.


Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the stale (Stale PRs) label on Sep 26, 2024
@timgianitsos
Author

This documentation mistake has lingered for a year: https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html

Is there anything else that needs to be addressed before accepting my revision?

@github-actions github-actions bot closed this Oct 27, 2024
@timgianitsos
Author

timgianitsos commented Oct 27, 2024

Please advise on what is causing the delay, @rohan-varma @subramen

@timgianitsos
Author

Please advise on what is causing the delay, @rohan-varma @subramen. It's a very simple change.
