DDP use/access entire effective batch in callback #12076
Unanswered
gustavhartz asked this question in DDP / multi-GPU / multi-node
Replies: 0 comments
Hi, I'm training a model on 1 node with 2-4 GPUs using the DDP strategy. My goal is to log something once from a callback, with access to the entire effective batch across all GPUs. What is a good way of doing this, if there is one?
I have looked at #9259 and #6501, but can't get it to work in my setting, since `all_gather` is only available from the `pl.LightningModule`, not from the `Callback`. Even if that did work, it would only give me the correct size for `outputs`, not for `batch` and `batch_idx`.
I have tried using `trainer.is_global_zero` in the callback so the values are only logged once, but that only gives me 1/num_gpus of the total effective batch. I was also thinking of something along the lines of the code below, linking the output of `training_step_end` to the callback and combining that with `trainer.is_global_zero` in the callback, but it faces the same dimension issues for the remaining callback arguments.

Hope someone can help :)
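To make the idea concrete, here is a rough sketch of what I mean (not my actual code: `compute_loss` is a placeholder, the batch is assumed to be an `(inputs, targets)` tuple of tensors, and the exact hook signatures depend on the Lightning version):

```python
import pytorch_lightning as pl


class MyModel(pl.LightningModule):
    def training_step(self, batch, batch_idx):
        x, y = batch                      # assumed (inputs, targets) structure
        loss = self.compute_loss(x, y)    # placeholder for the real loss
        # Route the local batch to the callbacks via the step output.
        return {"loss": loss, "batch": batch, "batch_idx": batch_idx}

    def training_step_end(self, step_output):
        # Under DDP this hook still only sees the output of the local process.
        return step_output


class LogEffectiveBatch(pl.Callback):
    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        # A collective call like all_gather has to run on every rank, so it
        # cannot sit inside the is_global_zero branch.
        gathered = pl_module.all_gather(batch[0])  # (world_size, local_batch, ...)

        if trainer.is_global_zero:
            # Without the gather, `batch` / outputs["batch"] is only this
            # rank's 1/num_gpus share of the effective batch.
            effective = gathered.reshape(-1, *gathered.shape[2:])
            # ... log `effective`, outputs["batch_idx"], etc. once here ...
```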