set with model.no_sync() in lightning training step #10792
Replies: 1 comment 3 replies
-
Yes, Lightning handles it for you. A little more on the context: in your code,

```python
if self.is_ddp:
    # Gradients stay local to each process; DDP's all-reduce is skipped.
    with self.pytorch_model.no_sync():
        self.manual_backward(loss)
else:
    self.manual_backward(loss)
optimizer.zero_grad()
optimizer.step()
```

the gradients will never sync up, and thus your model will end up having different weights on each device right after the first optimizer step.
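For comparison, here is a minimal sketch of how the sync is usually kept intact when pairing SAM with DDP under manual optimization: skip the all-reduce only for the first SAM pass and let the second backward synchronize normally. It assumes the davda54/sam implementation (`first_step`/`second_step`) and that the DDP-wrapped module is reachable via `self.trainer.model`; both are assumptions layered on top of this thread, not something it states.

```python
import contextlib

import pytorch_lightning as pl
import torch
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel

from sam import SAM  # assumption: the davda54/sam package providing first_step/second_step


class SAMLitModule(pl.LightningModule):
    def __init__(self, model: torch.nn.Module):
        super().__init__()
        self.model = model
        # SAM needs two forward/backward passes per batch, so use manual optimization.
        self.automatic_optimization = False

    def _maybe_no_sync(self):
        # Disable DDP's gradient all-reduce only when the module is actually DDP-wrapped.
        wrapped = self.trainer.model  # assumption: the DDP wrapper is reachable here
        if isinstance(wrapped, DistributedDataParallel):
            return wrapped.no_sync()
        return contextlib.nullcontext()

    def training_step(self, batch, batch_idx):
        opt = self.optimizers(use_pl_optimizer=False)
        x, y = batch

        # First SAM pass: gradients stay per-device, no all-reduce needed.
        with self._maybe_no_sync():
            loss = F.cross_entropy(self.model(x), y)
            self.manual_backward(loss)
        opt.first_step(zero_grad=True)

        # Second SAM pass: plain backward, so DDP averages gradients across devices.
        second_loss = F.cross_entropy(self.model(x), y)
        self.manual_backward(second_loss)
        opt.second_step(zero_grad=True)

        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return SAM(self.parameters(), torch.optim.SGD, lr=0.1, momentum=0.9)
```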
-
Trying to run some experiments using the SAM optimizer with multiple GPUs and DDP. It is recommended that, when using multiple GPUs, the gradients be computed for each GPU separately; the example code given follows the pattern sketched below.
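A minimal sketch of that recommendation, assuming the davda54/sam implementation of SAM (`first_step`/`second_step` API); the function name and arguments here are illustrative, not taken from the thread:

```python
from torch.nn.parallel import DistributedDataParallel


def sam_train_step(model: DistributedDataParallel, optimizer, loss_fn, x, y):
    """One SAM update where the first pass keeps gradients local to each GPU."""
    # First forward/backward pass: skip DDP's all-reduce so each GPU computes
    # its perturbation from its own gradients.
    with model.no_sync():
        loss_fn(model(x), y).backward()
    optimizer.first_step(zero_grad=True)  # assumption: davda54/sam API

    # Second forward/backward pass: normal backward, DDP averages the gradients.
    loss_fn(model(x), y).backward()
    optimizer.second_step(zero_grad=True)
```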
My question is: is `with model.no_sync()` already handled when using the `ddp` strategy? I found something possibly related in this discussion on using no_sync with DDP. If not, how do I ensure that the model is not syncing the gradients? Do I need to call `with self.hf_model.no_sync():` in the `training_step` before calling `self.manual_backward(loss)`? I have something like the snippet quoted in the reply above.