How to skip training examples during DeepSpeed multi-GPU training #10824
Unanswered · gahdritz asked this question in DDP / multi-GPU / multi-node
In PyTorch Lightning, you can usually return None from training_step to skip the backward pass for a training example, e.g. when the loss is NaN in half precision. When DeepSpeed is enabled, this is explicitly forbidden.
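To illustrate the pattern (a minimal sketch; `compute_loss` is a hypothetical stand-in for whatever produces the loss):

```python
import torch
import pytorch_lightning as pl

class MyModule(pl.LightningModule):
    def training_step(self, batch, batch_idx):
        loss = self.compute_loss(batch)  # hypothetical loss helper
        # With plain DDP, returning None skips backward for this batch;
        # under the DeepSpeed strategy, this raises an error instead.
        if not torch.isfinite(loss):
            return None
        return loss
```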
A workaround I've tried is simply to return a fresh zero tensor instead. However, if the outputs of each rank don't interact with the same set of parameters, training hangs, so in the likely case that the loss is NaN on only one of the ranks, that workaround doesn't work. Since I'm not aware of a way to coordinate the ranks (for example, to make them all return zero loss whenever any one rank's loss is NaN), I need a different fix.
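For concreteness, the kind of coordination I have in mind would look something like the sketch below: all-reduce a "bad loss" flag so every rank agrees, then have every rank return a zero-valued loss that still touches every parameter, keeping the backward graphs identical across ranks. This is untested; it assumes the strategy has already initialized `torch.distributed`, reuses the hypothetical `compute_loss` helper from above, and may not behave the same under ZeRO stage 3, where parameters are partitioned across ranks:

```python
import torch
import torch.distributed as dist

def training_step(self, batch, batch_idx):
    loss = self.compute_loss(batch)  # hypothetical loss helper

    # 1.0 if this rank's loss is NaN/inf, else 0.0.
    bad = torch.tensor(float(not torch.isfinite(loss)), device=loss.device)
    # MAX across ranks: every rank sees 1.0 if *any* rank had a bad loss.
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(bad, op=dist.ReduceOp.MAX)

    if bad.item() > 0:
        # All ranks take this branch together. A zero-valued loss that
        # still touches every parameter keeps the collectives in sync.
        # (A fresh zero tensor is detached from the graph, and 0 * loss
        # is still NaN when loss is NaN, so neither of those works.)
        loss = sum(p.sum() for p in self.parameters()) * 0.0
    return loss
```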
Is there an elegant way to deal with this problem?