Is there an all_gather before training_step_end when using DDP? #7934
From the training_step_end() docs it says:
"When using dp or ddp2, the first dimension is equal to the number of GPUs, and it has the per-GPU results." So does this just pass through the single-GPU output in the DDP case? Or, if I define this method, does it add a barrier / gather (as if doing an all_gather)? I just want to make sure I'm not slowing down my code if I define this method for the dp / ddp2 case but then almost always use standard ddp. Sorry if this was already asked or is in the docs; I tried my best to find the answer. Thanks!
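For reference, the dp / ddp2 pattern the docs describe looks roughly like the sketch below (my paraphrase, not the exact docs snippet; the loss computation is just illustrative):

import torch.nn.functional as F

def training_step(self, batch, batch_idx):
    # Under dp / ddp2 this runs on one slice of the batch per GPU.
    x, y = batch
    pred = self(x)
    return {'pred': pred, 'target': y}

def training_step_end(self, batch_parts):
    # Under dp / ddp2 the per-GPU results arrive combined along the
    # first dimension; the question is what happens under plain ddp.
    return F.cross_entropy(batch_parts['pred'], batch_parts['target'])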
Replies: 1 comment
I'll answer my own question. Short answer: no, there is no barrier / gather before training_step_end() when using DDP.

I could be wrong, but it appears these methods just get called through the normal callback mechanism, i.e. PyTorch Lightning doesn't post-process the output beyond what DP / DDP already do. So in the DP case the outputs are automatically aggregated by concatenating along the first dimension, and in the DDP case the outputs are just passed through.

I tried returning a dictionary from training_step() containing (1) a scalar, e.g. loss, and (2) a tensor of output predictions, e.g. pred_odds, of shape (N, K) for batch size N and number of classes K, then inspecting it in training_step_end(). The results were as follows, using 2 GPUs for DP / DDP, with a batch size of 128 for DP and 64 for DDP (maintaining the effective batch size of 128). Rough setup:

from typing import Dict, Tuple

import torch
import torch.nn.functional as F

# Both methods below are defined on a LightningModule subclass.
def training_step(
    self,
    batch: Tuple[torch.Tensor, torch.Tensor],
    batch_idx: int
) -> Dict[str, torch.Tensor]:
    inputs, targets = batch
    pred_odds = self.forward(inputs)
    log_probs = F.log_softmax(pred_odds, dim=1)
    loss = F.nll_loss(log_probs, targets)
    return {'pred_odds': pred_odds, 'loss': loss}

def training_step_end(
    self,
    step_outputs: Dict[str, torch.Tensor]
) -> Dict[str, torch.Tensor]:
    print(step_outputs['loss'])
    print(step_outputs['loss'].shape)
    print(step_outputs['pred_odds'].shape)
    print(step_outputs['pred_odds'].device)
    return step_outputs  # pass everything through unchanged

The prints showed the following.
If you're using a tensor (the only case I actually needed, since with dp / ddp2 the loss should be computed in training_step_end()), the first dimension is the full effective batch (128) in the DP case and the per-GPU batch (64) in the DDP case. If you're using a scalar, it gets converted into a 1D tensor whose length equals the number of GPUs in the no-backend and DP cases, but it is still a scalar (0-dim) tensor in the DDP case. That could throw you off, but then again, you're probably not passing bare scalars to training_step_end() anyway.

Hope this helps someone. Cheers.
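P.S. If you do need outputs from every GPU under plain ddp, you have to gather them yourself. A minimal sketch, assuming a Lightning version that provides LightningModule.all_gather (present in recent releases; treat the details as an assumption), using the pred_odds / loss setup above:

def training_step_end(self, step_outputs):
    # dp: a scalar loss arrives as a 1D tensor (one entry per GPU);
    # ddp: it is still a 0-dim scalar. .mean() reduces the dp case
    # and is a harmless no-op on the ddp scalar.
    loss = step_outputs['loss'].mean()

    # Nothing is gathered automatically under ddp, so do it explicitly.
    # all_gather adds a leading world-size dimension, e.g. (2, 64, K)
    # for the (64, K) per-GPU pred_odds above.
    all_preds = self.all_gather(step_outputs['pred_odds'])

    return {'loss': loss, 'all_preds': all_preds}

Note the explicit gather adds the communication cost the original question was worried about; leaving training_step_end() undefined avoids it entirely under ddp.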