Correct way to use all_gather in DDP #14152
Unanswered
cnut1648 asked this question in DDP / multi-GPU / multi-node
Replies: 0 comments
Hello, I wonder if there is a definitive guide on how to use `all_gather` in DDP with a single dataloader, so that we can gather outputs from the model and write them to a file. I am using the latest PL, 1.7.1.

Say in `training_step` I return a dictionary with a field `generated` that is a list of dictionaries. Then in `training_epoch_end(self, outputs)` I want to collect all the dictionaries from every `training_step` in every process and save the whole `List[dict]` to disk. I have the following:
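A minimal sketch of what I mean (the model call and the contents of `generated` are placeholders; only the return/gather structure matters):

```python
import torch
import pytorch_lightning as pl

class MyModule(pl.LightningModule):
    def training_step(self, batch, batch_idx):
        loss, generated = self.step(batch)  # placeholder; `generated` is a list of dicts
        return {"loss": loss, "generated": generated}

    def training_epoch_end(self, outputs):
        # outputs: the dicts returned by training_step, for this process only
        gathered = self.all_gather(outputs)
        if self.trainer.is_global_zero:
            torch.save(gathered, "generated.pt")
```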
I found that `outputs` without `self.all_gather` (i.e. the input of `training_epoch_end`) is a list of lists of dictionaries; am I correct that `len(outputs)` equals the number of `training_step` calls? Moreover, each of the dictionaries returned by `training_step` has a field that is a list of floats, but after `all_gather` this field seems to change (e.g. from `[0.24, 0.38, ...]` to `[[0.24, 0.93], [0.38, 0.84], ...]`), and I have no idea where the second float in each pair comes from.
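Concretely, with 2 GPUs I observe something like the following (the field name `scores` is a placeholder):

```python
# inside training_epoch_end, DDP with 2 processes
gathered = self.all_gather(outputs)
# one entry of outputs on this rank:  {"scores": [0.24, 0.38, ...]}
# the same entry after all_gather:    {"scores": [[0.24, 0.93], [0.38, 0.84], ...]}
```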
Therefore I wonder what the best practice is for gathering outputs from all previous `training_step` calls in DDP. Is `all_gather` still the way to go, and am I using it correctly? I saw in #11019 that using `PredictionWriter` is the best practice. Is that still considered the best practice as of now? Thank you!
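For reference, the `PredictionWriter` pattern from #11019 would, if I understand it, look roughly like this (the class name and output path are my own; each rank writes its own file instead of gathering):

```python
import os
import torch
from pytorch_lightning.callbacks import BasePredictionWriter

class DictWriter(BasePredictionWriter):
    def __init__(self, output_dir):
        super().__init__(write_interval="epoch")
        self.output_dir = output_dir

    def write_on_epoch_end(self, trainer, pl_module, predictions, batch_indices):
        # one file per process; merge the files after the run finishes
        path = os.path.join(self.output_dir, f"predictions_rank{trainer.global_rank}.pt")
        torch.save(predictions, path)
```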