
Is there an all_gather before training_step_end when using DDP? #7934

I'll answer my own question. The short answer is no: there is no barrier or all_gather before training_step_end() when using DDP.

I could be wrong, but it appears these methods are simply called through the normal hook mechanism; PyTorch Lightning doesn't post-process the output beyond what DP / DDP themselves do. In the DP case the outputs are automatically aggregated by concatenating along the first dimension, and in the DDP case each process's outputs are just passed through unchanged, as in the sketch below.
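
To make the two behaviors concrete, here is a minimal sketch. The module name, `backbone`, and the returned dictionary keys are hypothetical illustrations, not anything defined in this discussion; the comments describe the shapes you'd observe under each strategy.

```python
import torch.nn.functional as F
import pytorch_lightning as pl


class LitClassifier(pl.LightningModule):
    """Hypothetical module; only the hooks relevant to the question are shown."""

    def __init__(self, backbone):
        super().__init__()
        self.backbone = backbone

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self.backbone(x)
        loss = F.cross_entropy(logits, y)
        # Returned once per GPU replica (DP) or once per process (DDP).
        return {"loss": loss, "pred_outs": logits.detach()}

    def training_step_end(self, outputs):
        # DP:  outputs["pred_outs"] is the concatenation of the per-GPU
        #      chunks along dim 0, i.e. shape (N, K) for the full batch.
        # DDP: outputs["pred_outs"] is unchanged -- shape (N_local, K),
        #      holding only this rank's predictions; no gather, no barrier.
        return outputs
```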

I tried returning a dictionary from training_step_end() containing (1) a scalar, e.g. loss, and (2) a tensor of output predictions, e.g. pred_outs, with shape (N, K) for batch size N and number of classes K.

The results were as follows, using …
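
If a rank needs the full prediction tensor rather than just its local slice, the gather has to be requested explicitly. A minimal sketch, assuming Lightning's LightningModule.all_gather helper, written as a drop-in replacement for the training_step_end hook above; the dimension check hedges against single-process runs, where the helper may return the tensor unchanged:

```python
    def training_step_end(self, outputs):
        # Explicitly collect predictions from all DDP ranks. With world
        # size W, self.all_gather returns shape (W, N_local, K); flatten
        # the first two dims to recover a single (W * N_local, K) tensor.
        gathered = self.all_gather(outputs["pred_outs"])
        if gathered.dim() == outputs["pred_outs"].dim() + 1:
            outputs["pred_outs"] = gathered.flatten(0, 1)
        return outputs
```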

Answer selected by collinmccarthy