Accumulating Batches (not Gradients) via custom Loops and avoid CUDA OOM #15116
Unanswered · myscience asked this question in Lightning Trainer API: Trainer, LightningModule, LightningDataModule · Replies: 0 comments
Hi everyone,
I am facing a CUDA out-of-memory (OOM) error when training my model on a custom dataset via DDP on 4 GPUs, and I would like to hear whether I am missing a simple solution here or where my mistake is.

The problem is that a single example in my dataset is quite big (a tensor of shape `(batch, 80, 105, 85)`), and the model itself has two sub-modules, one of which is a BEiT transformer (taken from HuggingFace) with ~85.7M parameters. The Lightning model summary estimates the full model at 185 MB. On a single GPU (which has 16 GB of memory) I can fit a batch size of 2, which is too small: I am using a contrastive-learning approach where each batch needs positive and negative examples, so a batch size of around 128 would be more reasonable.

My idea for solving this was the following. Before the loss computation, the model produces vector representations of the data that are far smaller (tensors of shape `(batch, 700)`), so I could accumulate several batches, collecting the vector representations until I reach something like `(128, 700)`, and only then compute the loss and update everything. The question is: does this make sense, and if so, how can I achieve this sort of behavior?
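To give a sense of the sizes involved, a back-of-the-envelope comparison for a batch of 128 (float32, illustrative only):

```python
# Raw inputs vs. latent representations for a batch of 128 (4 bytes per float32).
raw_inputs_mib = 128 * 80 * 105 * 85 * 4 / 2**20   # ~348.6 MiB of raw input tensors
latents_mib    = 128 * 700 * 4 / 2**20             # ~0.34 MiB of (128, 700) latents
```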
As I understood it, the Lightning API easily offers gradient accumulation, but I fear it is not useful here: in gradient accumulation the loss is computed on each individual mini-batch separately and only the gradients are accumulated, which in my case would result in very poor individual gradients (a contrastive loss over a mini-batch of 2 sees almost no negatives).
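For reference, the built-in mechanism I am referring to is just the `accumulate_grad_batches` Trainer flag:

```python
import pytorch_lightning as pl

# Built-in gradient accumulation: each tiny mini-batch still gets its own
# forward pass and loss; only the resulting gradients are summed over 64
# steps before the optimizer is stepped, so the contrastive loss itself
# never sees more than 2 examples at once.
trainer = pl.Trainer(accumulate_grad_batches=64)
```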
After some investigation I found out about the Lightning Loop API and thought I could use it to fit my needs. The idea was to subclass `TrainingEpochLoop` and request multiple batches from the `data_fetcher` using a generator (so that only one or two raw examples are in memory at a time), and then use the LightningModule hook `on_train_batch_start` to pre-process each batch, turning the `(1, 80, 105, 85)` tensor into the much more manageable `(1, 700)` tensor, and start accumulating those. What my code is doing at the moment looks something like the following.
In my `LightningModule` I have implemented the `on_train_batch_start` hook as follows (note that the `example2latent` function calls one sub-module of my model):
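(A simplified sketch: the class and sub-module names, the stubbed `contrastive_loss`, and the `training_step` that skips the optimizer step until the buffer is full are filled in here for context and are not my code verbatim.)

```python
import torch
import pytorch_lightning as pl


class ContrastiveModel(pl.LightningModule):
    """Relevant pieces only: `example2latent` calls one sub-module (the
    encoder); the second sub-module and the contrastive loss are stubbed."""

    def __init__(self, encoder: torch.nn.Module, head: torch.nn.Module, accumulate: int = 16):
        super().__init__()
        self.encoder = encoder          # e.g. the BEiT backbone
        self.head = head                # second sub-module used in the loss
        self.accumulate = accumulate
        self.latent_buffer = []         # collects (1, 700) latents

    def example2latent(self, batch: torch.Tensor) -> torch.Tensor:
        # One sub-module reduces the big (1, 80, 105, 85) example to (1, 700).
        return self.encoder(batch)

    def on_train_batch_start(self, batch, batch_idx):
        # Pre-process the incoming batch into its small latent and buffer it,
        # so that only the latents are kept around between batches.
        self.latent_buffer.append(self.example2latent(batch))

    def training_step(self, batch, batch_idx):
        if len(self.latent_buffer) < self.accumulate:
            # Not enough latents yet: returning None makes Lightning skip the
            # optimization step for this batch.
            return None
        latents = torch.cat(self.latent_buffer, dim=0)   # (accumulate, 700)
        self.latent_buffer = []
        return self.contrastive_loss(self.head(latents))

    def contrastive_loss(self, z: torch.Tensor) -> torch.Tensor:
        # Placeholder for the actual contrastive objective over positives/negatives.
        raise NotImplementedError
```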
Finally, in the main script I simply connect the custom loop as:
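(The `fit_loop.replace(...)` call follows the Loop documentation of recent 1.x releases; on other versions connecting an instance via `connect` may be needed instead. The Trainer flags mirror my 4-GPU DDP setup.)

```python
import pytorch_lightning as pl

# `model` and `datamodule` are my LightningModule / LightningDataModule.
trainer = pl.Trainer(accelerator="gpu", devices=4, strategy="ddp")

# Swap the default training epoch loop for the accumulating one.
trainer.fit_loop.replace(epoch_loop=AccumulatingEpochLoop)

trainer.fit(model, datamodule=datamodule)
```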
The problem with all of this is that if I use, for example, `accumulate = 16` (thus aiming for a final latent tensor of shape `(16, 700)`), I still get the out-of-memory error I mentioned at the beginning. How can that be? Is this whole logic wrong? Do you have a more general suggestion on how to tackle this problem? Thanks!

P.S. I also tried turning my `torch.utils.data.Dataset` into a `torch.utils.data.IterableDataset` and playing with `prefetch_factor`, `num_workers`, and so on, without luck.
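(For completeness, the kind of DataLoader tuning I tried looked like this; the numbers are just examples.)

```python
from torch.utils.data import DataLoader

# `dataset` is my map-style Dataset (placeholder here). More workers and a
# larger prefetch only change how many raw batches are prepared ahead of
# time on the CPU side, which is why it did not help with the GPU OOM.
loader = DataLoader(
    dataset,
    batch_size=1,
    num_workers=4,        # prefetch_factor requires num_workers > 0
    prefetch_factor=2,
)
```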