Why would GPU memory always surge after training and cause a CUDA memory error? #9048
-
I use PyTorch Lightning to train a model, but it always fails strangely at the end: after all validations complete, the trainer starts an epoch beyond max_epochs, and a GPU memory allocation failure (CUDA out of memory) occurs right after this extra epoch (which should not run) starts. In my example I set max_epochs=5, so there should only be epochs 0-4, but there is always an additional epoch 5 after the 5 validations are done, and a few seconds later the CUDA memory error occurs. My dataset should be fine, since CUDA memory and system memory are stable throughout training; the only anomaly is the GPU memory surge at the very end. I suspect the problem may be in my code for the LightningModule and the training loop.
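To make the setup concrete, here is a simplified stand-in for that code (a hypothetical toy model and random tensors, not my actual notebook), showing the same Trainer(max_epochs=5) pattern:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class LitClassifier(pl.LightningModule):
    """Hypothetical toy model standing in for the real one."""

    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = nn.functional.cross_entropy(self(x), y)
        self.log("train_loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        self.log("val_loss", nn.functional.cross_entropy(self(x), y))

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


# Random tensors in place of the real dataset.
train_ds = TensorDataset(torch.randn(1000, 32), torch.randint(0, 2, (1000,)))
val_ds = TensorDataset(torch.randn(200, 32), torch.randint(0, 2, (200,)))

# max_epochs=5, so only epochs 0-4 are expected before fit() returns.
trainer = pl.Trainer(max_epochs=5, gpus=1)
trainer.fit(
    LitClassifier(),
    DataLoader(train_ds, batch_size=64),
    DataLoader(val_ds, batch_size=64),
)
```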
Can I get any clue about why this happens and how to avoid it? I'm new to PyTorch Lightning, so there might be problems I'm not aware of. Thanks a lot!
-
Dear @EMUNES, Would you mind sharing your notebook? This would make the investigation much simpler. Best,
-
I am also facing a similar issue. I have an 8 GB GPU and 50 epochs; up to epoch 49 my GPU usage is 4 GB, but after epoch 49 I get a memory error and my GPU memory reaches 8 GB.
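For what it's worth, one way to pin down exactly which epoch the jump happens in is a small callback that prints GPU memory after each training epoch (a rough sketch; hook signatures vary a bit across PL versions, and GPUMemoryLogger is just an illustrative name):

```python
import torch
import pytorch_lightning as pl


class GPUMemoryLogger(pl.Callback):
    """Print allocated and peak GPU memory at the end of each training epoch."""

    def on_train_epoch_end(self, trainer, pl_module):
        allocated = torch.cuda.memory_allocated() / 1024 ** 3
        peak = torch.cuda.max_memory_allocated() / 1024 ** 3
        print(f"epoch {trainer.current_epoch}: "
              f"allocated {allocated:.2f} GiB, peak {peak:.2f} GiB")


# Usage: Trainer(max_epochs=50, gpus=1, callbacks=[GPUMemoryLogger()])
```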
-
The same thing happened to me. I thought downgrading the PL version resolved the issue, but it didn't work in another notebook.
On Wed, Sep 8, 2021, 3:18 PM EMUNES wrote:
Indeed this problem is more complicated than I thought... Today one of my notebooks is normal, but the next version throws the same error. The only difference between those two notebooks is that I reduced the training samples from 3000 to 1000 (I use small samples just to test whether the pipeline works). Now I'm totally confused again...
-
Let's continue discussing in #9441. Locking this thread to avoid discussing in two places.