Accessing all batches at the end of epoch in callback #12999
Unanswered
kevjn asked this question in Lightning Trainer API: Trainer, LightningModule, LightningDataModule
Hi, I have been using PyTorch Lightning for several months now and have had a great experience overall. I have defined several callbacks that need access to all the batch inputs in the `on_train_epoch_end` callback. I have solved this by overriding the `on_train_batch_end` and `on_train_epoch_end` hooks in each `Callback` class (contrived example):
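Roughly like this (a simplified sketch of the pattern, assuming `batch` is a single tensor and the epoch-end hook only needs the concatenated result):

```python
import torch
from pytorch_lightning import Callback


class MyCallback(Callback):
    def __init__(self):
        self.batches = []

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        # cache every batch seen during the epoch
        self.batches.append(batch.detach().cpu())

    def on_train_epoch_end(self, trainer, pl_module):
        # concatenate the cached batches and post-process them
        all_batches = torch.cat(self.batches)
        ...  # e.g. compute statistics or log a figure from all_batches
        self.batches.clear()
```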
However, with a large number of callbacks I believe this has a sub-optimal memory footprint, since the concatenation allocates new memory in each callback. I would rather do the concatenation once and reference the same array in all of my callbacks.

The documentation for `on_train_epoch_end` (https://pytorch-lightning.readthedocs.io/en/stable/extensions/callbacks.html#on-train-epoch-end) states that to access all batch outputs at the end of the epoch you can either 1) implement `training_epoch_end` in the `LightningModule` and access the outputs via the module, or 2) cache data across the train batch hooks inside the callback implementation and post-process it in this hook. I believe my implementation above is using alternative 2), but how do I use alternative 1)?
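Alternative 1) presumably means something like the sketch below inside the LightningModule (`MyModel` is just a placeholder), but then I still do not see how my callbacks would access the data:

```python
import pytorch_lightning as pl


class MyModel(pl.LightningModule):
    def training_step(self, batch, batch_idx):
        loss = ...  # compute the loss as usual
        # whatever is returned here is collected by Lightning ...
        return {"loss": loss, "batch": batch}

    def training_epoch_end(self, outputs):
        # ... and handed back as a list, one entry per training_step call
        all_batches = [out["batch"] for out in outputs]
```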
I thought there would be some property on the LightningModule that gives access to all the batches in the training loop, but looking at the source code of PyTorch Lightning 1.6.2 inside `fit_loop::on_advance_end`, it looks like the outputs of each `training_step` are only accessible in the `training_epoch_end` method of the LightningModule, because the memory is freed before the callback hooks are called.

What is the recommended way of going about this? I have thought of doing the following:
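Namely, letting one callback do the concatenation and stash the result on `pl_module`, so every other callback reads the same tensor. A sketch of the idea (the names `BatchCollector`, `ConsumerCallback` and `all_train_batches` are just placeholders, and the collector has to be registered before the consumers so the attribute exists when their hook runs):

```python
import torch
from pytorch_lightning import Callback


class BatchCollector(Callback):
    """Caches batches once per epoch and exposes them on the LightningModule."""

    def __init__(self):
        self.batches = []

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        self.batches.append(batch.detach().cpu())

    def on_train_epoch_end(self, trainer, pl_module):
        # concatenate once and park the result on the module
        pl_module.all_train_batches = torch.cat(self.batches)
        self.batches.clear()


class ConsumerCallback(Callback):
    def on_train_epoch_end(self, trainer, pl_module):
        # read the shared tensor instead of re-concatenating
        all_batches = pl_module.all_train_batches
        ...  # e.g. plot or compute metrics from all_batches
```

wired up as `Trainer(callbacks=[BatchCollector(), ConsumerCallback(), ...])` so the collector hook runs first.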
The above code snippets assume that `batch` is a single array, but I have run into the same dilemma when trying to visualize targets and predictions of the model in isolation from the `test_step` and `test_epoch_end` methods and their corresponding callback hooks. What is the recommended way of sharing memory across multiple callbacks? Is using the `pl_module` as a proxy for accessing shared memory considered bad practice? I can't really think of any other way to do it.