Serialize model predictions when running on multiple GPUs #8699
Unanswered
aleSuglia
asked this question in
DDP / multi-GPU / multi-node
Hi there,

Looking at the way `Trainer.predict` works, it seems that the model stores the predictions internally, with the idea of serialising them at the end of the prediction loop. In my case, however, storing all the predictions is infeasible (they would occupy too much memory). I've therefore implemented a Callback that writes intermediate results to disk after each batch and is then supposed to collect and group them together at the end of the entire prediction loop.

At the moment, I've achieved the first part by implementing `on_predict_batch_end()` to store the intermediate results on disk. I then implemented `on_predict_epoch_end()` to make sure the results are grouped back together. However, I have the impression that `on_predict_epoch_end()` runs whenever a specific GPU finishes and is not synchronised across GPUs. How can I hook this function to the end of the prediction loop across all GPUs? Is adding `@rank_zero_only` a correct solution for this scenario?