How to skip training examples during DeepSpeed multi-GPU training #10824
Unanswered · gahdritz asked this question in DDP / multi-GPU / multi-node
In PyTorch Lightning, you can usually return None from training_step to skip the backward pass for a training example, e.g. when the loss is NaN in half precision. When DeepSpeed is enabled, this is explicitly forbidden.
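To illustrate the pattern (a minimal sketch; `compute_loss` is a hypothetical stand-in for whatever produces the loss):

```python
import torch
import pytorch_lightning as pl

class MyModule(pl.LightningModule):
    def training_step(self, batch, batch_idx):
        loss = self.compute_loss(batch)  # hypothetical loss helper
        # With plain DDP, returning None skips backward for this batch;
        # under the DeepSpeed strategy, this raises an error instead.
        if not torch.isfinite(loss):
            return None
        return loss
```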
A workaround I've tried is simply to return a fresh zero tensor instead. However, if the outputs of each rank don't interact with the same set of parameters, training hangs, so in the likely case that the loss is NaN on only one of the ranks, that workaround doesn't work. Since I'm not aware of a way to coordinate the ranks (for example, to make them all return zero loss whenever any one rank's loss is NaN), I need a different fix.
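For concreteness, the kind of coordination I have in mind would look something like the sketch below: all-reduce a "bad loss" flag so every rank agrees, then have every rank return a zero-valued loss that still touches every parameter, keeping the backward graphs identical across ranks. This is untested; it assumes the strategy has already initialized `torch.distributed`, reuses the hypothetical `compute_loss` helper from above, and may not behave the same under ZeRO stage 3, where parameters are partitioned across ranks:

```python
import torch
import torch.distributed as dist

def training_step(self, batch, batch_idx):
    loss = self.compute_loss(batch)  # hypothetical loss helper

    # 1.0 if this rank's loss is NaN/inf, else 0.0.
    bad = torch.tensor(float(not torch.isfinite(loss)), device=loss.device)
    # MAX across ranks: every rank sees 1.0 if *any* rank had a bad loss.
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(bad, op=dist.ReduceOp.MAX)

    if bad.item() > 0:
        # All ranks take this branch together. A zero-valued loss that
        # still touches every parameter keeps the collectives in sync.
        # (A fresh zero tensor is detached from the graph, and 0 * loss
        # is still NaN when loss is NaN, so neither of those works.)
        loss = sum(p.sum() for p in self.parameters()) * 0.0
    return loss
```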
Is there an elegant way to deal with this problem?