Accumulate features and process in training_step_end across multiple training steps. #10593
Unanswered
jipson7 asked this question in Lightning Trainer API: Trainer, LightningModule, LightningDataModule
Replies: 1 comment 1 reply
-
Dear @jipson7, I am wondering why you can't implement this directly within your LightningModule?

    from pytorch_lightning import LightningModule

    class Model(LightningModule):
        def __init__(self):
            super().__init__()
            self.train_batches = []

        def training_step(self, batch, batch_idx):
            if batch_idx > 0 and batch_idx % 10 == 0:
                # do something with all the accumulated batches;
                # compute_loss is a placeholder for your own loss over them
                loss = self.compute_loss(self.train_batches)
                self.train_batches.clear()
                return loss
            else:
                self.train_batches.append(batch)
                return None  # returning None skips the optimization step
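If the underlying goal (as in the thread title) is a larger effective batch for an NCE-style loss under ddp, one possible extension of the idea above is to accumulate features rather than raw batches and gather them across processes with self.all_gather before computing the loss. The following is only a rough sketch, not a tested recipe; encoder, loss_fn, and the batch layout are assumptions standing in for your own code.

    import torch
    from pytorch_lightning import LightningModule

    class ContrastiveModel(LightningModule):
        # encoder and loss_fn are hypothetical stand-ins for your own
        # feature extractor and NCE-style loss
        def __init__(self, encoder, loss_fn, accumulate_steps=10):
            super().__init__()
            self.encoder = encoder
            self.loss_fn = loss_fn
            self.accumulate_steps = accumulate_steps
            self.feature_buffer = []

        def training_step(self, batch, batch_idx):
            images, _ = batch
            self.feature_buffer.append(self.encoder(images))

            if (batch_idx + 1) % self.accumulate_steps != 0:
                return None  # returning None skips the optimizer step

            # concatenate the locally accumulated features, then gather
            # across ddp processes; sync_grads=True keeps the autograd graph
            # so the loss can backpropagate into the local features
            local_features = torch.cat(self.feature_buffer, dim=0)
            self.feature_buffer.clear()
            all_features = self.all_gather(local_features, sync_grads=True)

            # under ddp, all_gather returns shape (world_size, n_local, dim)
            all_features = all_features.flatten(0, 1)
            return self.loss_fn(all_features)

Note that buffering features together with their autograd graphs across several steps can use a lot of memory; if gradients through the older features are not needed, they could be detached before being appended to the buffer.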
-
Hi there,
I am processing images and performing an NCELoss-type calculation on the features, where it is desirable to have a bigger effective batch size. Currently I'm processing the features using dp so that in training_step_end I can calculate the loss across features from all GPUs. However, this gets hard to manage when I have a distributed cluster where nodes have different numbers of GPUs, and it is also a bit slower than ddp.
What I'd like is something similar to accumulate_grad_batches, but rather than accumulating gradients I'd like to accumulate the feature output from training_step, and then run training_step_end once every N training_steps. The benefit is that I can leverage ddp and still get a large effective batch size, along with more features to calculate the NCELoss with.
Any suggestions are appreciated, thanks in advance.
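For reference, the dp-based pattern described above would look roughly like this; extract_features and nce_loss are hypothetical placeholders, not methods from any particular project.

    from pytorch_lightning import LightningModule

    class DPModel(LightningModule):
        def training_step(self, batch, batch_idx):
            images, _ = batch
            # with the dp strategy, each GPU runs this on its slice of the batch
            return {"features": self.extract_features(images)}

        def training_step_end(self, step_outputs):
            # dp gathers the per-GPU outputs onto the root device, so this sees
            # features from the full batch across all GPUs on the node
            features = step_outputs["features"]
            return self.nce_loss(features)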