Drastic training slowdown after ~10 epochs #12544
erikfurevik started this conversation in General · Replies: 1 comment
-
I made some changes that were very effective. In total they reduced the time per epoch from ~30s to ~8s, at least for the smaller models I have tested so far.
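For context, a minimal sketch of the kind of DataLoader settings such changes usually revolve around (the values and flags below are illustrative assumptions, not necessarily the exact changes I made):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy data standing in for the real dataset (16 samples with the shapes from the question).
train_dataset = TensorDataset(
    torch.randn(16, 51, 128, 128),  # inputs
    torch.randn(16, 8, 128, 128),   # targets
)

train_loader = DataLoader(
    train_dataset,
    batch_size=8,             # assumed value
    shuffle=True,
    num_workers=4,
    persistent_workers=True,  # keep worker processes alive between epochs
    pin_memory=True,          # faster host-to-GPU copies
    prefetch_factor=2,        # batches each worker preloads ahead of time
)
```

`persistent_workers=True` in particular avoids paying the worker start-up cost again at the beginning of every epoch.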
-
Hi, for most of my runs and models, the training time per epoch at some point increases drastically, as shown in the plots below.
The data comes from a custom logging setup like the one sketched below, but the same pattern shows up in the built-in TensorBoard logging as well.
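Roughly, the logging is a Lightning Callback that times each training epoch (a simplified illustration, not the exact code):

```python
import time
import pytorch_lightning as pl

class EpochTimer(pl.Callback):
    """Log wall-clock seconds per training epoch (simplified illustration)."""

    def on_train_epoch_start(self, trainer, pl_module):
        self._t0 = time.perf_counter()

    def on_train_epoch_end(self, trainer, pl_module):
        elapsed = time.perf_counter() - self._t0
        # Shows up next to the metrics from the built-in TensorBoard logger.
        pl_module.log("epoch_time_sec", elapsed, prog_bar=True)

# usage (model/datamodule names are placeholders):
# trainer = pl.Trainer(callbacks=[EpochTimer()], accelerator="gpu", devices=1)
# trainer.fit(model, datamodule=dm)
```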
[Plots: epoch time over training for a 30 million parameter model, a 50K parameter model, and a model with only 400 parameters.]
Is there a natural explanation for this behaviour, or is it a bug?
Since the models are of very different sizes yet show very similar behaviour and timing, I assume it must be related to data loading. My best guess is that data is pre-cached at the start of training, during the validation sanity check, and that the trainer then runs out of pre-cached data and has to slow down.
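One way to test that guess would be to iterate the DataLoader on its own, with no model attached, and see whether the per-epoch time shows the same jump (a sketch; the loader name in the usage line is a placeholder):

```python
import time
from torch.utils.data import DataLoader

def time_loader_only(loader: DataLoader, epochs: int = 15) -> None:
    """Drain the DataLoader repeatedly with no model attached.

    If the per-epoch time here also jumps after a few passes, the slowdown
    is in data loading rather than in the model or the optimizer step.
    """
    for epoch in range(epochs):
        t0 = time.perf_counter()
        for _ in loader:
            pass  # no forward/backward, just loading and collating
        print(f"epoch {epoch}: {time.perf_counter() - t0:.1f} s of pure data loading")

# usage (placeholder name): time_loader_only(datamodule.train_dataloader())
```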
Is there anything I can do to improve this and speed things up?
My setup:
I train with a GPU on Google Colab. I use 4 workers per dataloader, which I've seen recommended for a GPU (even though PyTorch Lightning warns me I should use only 2). I have a custom dataset module that just stores a list of filenames, one per input and target example. Each file is a pickled NumPy array: 51x128x128 (~3.2 MB) for the input and 8x128x128 (~0.5 MB) for the target. My __getitem__ is sketched below.
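In outline it loads one pickled array from disk per index for the input and one for the target (attribute names and dtypes in this sketch are illustrative, not the exact code):

```python
import pickle
import numpy as np
import torch
from torch.utils.data import Dataset

class FilelistDataset(Dataset):
    """Stores only filenames; arrays are unpickled from disk on every access."""

    def __init__(self, input_files, target_files):
        self.input_files = input_files    # paths to pickled 51x128x128 arrays (~3.2 MB each)
        self.target_files = target_files  # paths to pickled 8x128x128 arrays (~0.5 MB each)

    def __len__(self):
        return len(self.input_files)

    def __getitem__(self, idx):
        with open(self.input_files[idx], "rb") as f:
            x = pickle.load(f)            # numpy array, shape (51, 128, 128)
        with open(self.target_files[idx], "rb") as f:
            y = pickle.load(f)            # numpy array, shape (8, 128, 128)
        return torch.from_numpy(x.astype(np.float32)), torch.from_numpy(y.astype(np.float32))
```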