Is there an all_gather before training_step_end when using DDP? #7934
From the training_step_end() docs it says:
"When using dp or ddp2, the first dimension is equal to the number of GPUs, and it has the per-GPU results." So does this just pass through the single-GPU output in the DDP case? Or, if I define this method, does it add a barrier / gather (as if doing an all_gather)? I just want to make sure I'm not slowing down my code if I define this method for the dp / ddp2 case but then almost always use standard ddp. Sorry if this was already asked or is in the docs; I tried my best to find the answer. Thanks!
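For reference, the dp / ddp2 pattern the docs describe looks roughly like the sketch below (my paraphrase, not the exact docs snippet; the loss computation is just illustrative):

import torch.nn.functional as F

def training_step(self, batch, batch_idx):
    # Under dp / ddp2 this runs on one slice of the batch per GPU.
    x, y = batch
    pred = self(x)
    return {'pred': pred, 'target': y}

def training_step_end(self, batch_parts):
    # Under dp / ddp2 the per-GPU results arrive combined along the
    # first dimension; the question is what happens under plain ddp.
    return F.cross_entropy(batch_parts['pred'], batch_parts['target'])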
Replies: 1 comment
I'll answer my own question. Short answer: no, there is no barrier / gather before training_step_end() when using DDP.

I could be wrong, but it appears these methods just get called through the normal callback mechanism, i.e. PyTorch Lightning doesn't post-process the output beyond what DP / DDP already do. So in the DP case the outputs are automatically aggregated by concatenating along the first dimension, and in the DDP case the outputs are just passed through.

I tried returning a dictionary from training_step() containing (1) a scalar, e.g. loss, and (2) a tensor of output predictions, e.g. pred_odds, of shape (N, K) for batch size N and number of classes K, then inspecting it in training_step_end(). The results were as follows, using 2 GPUs for DP / DDP, with a batch size of 128 for DP and 64 for DDP (maintaining the effective batch size of 128). Rough setup:

from typing import Dict, Tuple

import torch
import torch.nn.functional as F

# Both methods below are defined on a LightningModule subclass.
def training_step(
    self,
    batch: Tuple[torch.Tensor, torch.Tensor],
    batch_idx: int
) -> Dict[str, torch.Tensor]:
    inputs, targets = batch
    pred_odds = self.forward(inputs)
    log_probs = F.log_softmax(pred_odds, dim=1)
    loss = F.nll_loss(log_probs, targets)
    return {'pred_odds': pred_odds, 'loss': loss}

def training_step_end(
    self,
    step_outputs: Dict[str, torch.Tensor]
) -> Dict[str, torch.Tensor]:
    print(step_outputs['loss'])
    print(step_outputs['loss'].shape)
    print(step_outputs['pred_odds'].shape)
    print(step_outputs['pred_odds'].device)
    return step_outputs  # pass everything through unchanged

The prints showed the following.
If you're using a tensor (the only case I actually needed, since with dp / ddp2 the loss should be computed in training_step_end()), the first dimension is the full effective batch (128) in the DP case and the per-GPU batch (64) in the DDP case. If you're using a scalar, it gets converted into a 1D tensor whose length equals the number of GPUs in the no-backend and DP cases, but it is still a scalar (0-dim) tensor in the DDP case. That could throw you off, but then again, you're probably not passing bare scalars to training_step_end() anyway.

Hope this helps someone. Cheers.
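P.S. If you do need outputs from every GPU under plain ddp, you have to gather them yourself. A minimal sketch, assuming a Lightning version that provides LightningModule.all_gather (present in recent releases; treat the details as an assumption), using the pred_odds / loss setup above:

def training_step_end(self, step_outputs):
    # dp: a scalar loss arrives as a 1D tensor (one entry per GPU);
    # ddp: it is still a 0-dim scalar. .mean() reduces the dp case
    # and is a harmless no-op on the ddp scalar.
    loss = step_outputs['loss'].mean()

    # Nothing is gathered automatically under ddp, so do it explicitly.
    # all_gather adds a leading world-size dimension, e.g. (2, 64, K)
    # for the (64, K) per-GPU pred_odds above.
    all_preds = self.all_gather(step_outputs['pred_odds'])

    return {'loss': loss, 'all_preds': all_preds}

Note the explicit gather adds the communication cost the original question was worried about; leaving training_step_end() undefined avoids it entirely under ddp.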