No error crash, just a never ending pause #20523

@KeesariVigneshwarReddy

Bug description

Notebook - https://www.kaggle.com/code/vigneshwar472/baseline-residualunetse3d
Github repo (ResidualUNetSE3D implementation) - https://github.com/wolny/pytorch-3dunet/tree/master

Issue - Training starts on 1x P100 GPU but does not start on 2x T4 GPUs.

I want to use 2 GPUs simultaneously for training (ddp_notebook strategy), but training never starts and neither GPU is in use.

I do not know why it is not working.

See the "Error messages and logs" section below.

What version are you seeing the problem on?

v2.4

How to reproduce the bug

Go to the Kaggle notebook https://www.kaggle.com/code/vigneshwar472/baseline-residualunetse3d-train

Click Copy & Edit
Click Run All
You will encounter a never-ending pause

Error messages and logs

When used with 2x T4 GPUs

# folds, ResidualUNetSE3D, CZIILightningModule, Trainer, CSVLogger and
# DataLoader are defined or imported in earlier notebook cells (not shown).
n = len(folds)

for i in range(n):
    print(f'fold {i} started....')
    # Fresh model, LightningModule and logger for every fold
    model = ResidualUNetSE3D(in_channels=1, out_channels=6)
    lm = CZIILightningModule(model=model)
    logger = CSVLogger(save_dir='/kaggle/working/training_results', name=f'fold_{i}')
    # 2 GPUs with the notebook-friendly DDP strategy
    trainer = Trainer(accelerator='gpu',
                      strategy='ddp_notebook',
                      devices=2,
                      precision='32',
                      gradient_clip_val=None,
                      logger=logger,
                      max_epochs=15,
                      enable_checkpointing=True,
                      enable_progress_bar=True,
                      enable_model_summary=False,
                      inference_mode=True,
                      default_root_dir='/kaggle/working/training_results',
                      num_sanity_val_steps=0)
    # folds[i][0] is the training split, folds[i][1] the validation split
    trainer.fit(model=lm,
                train_dataloaders=DataLoader(folds[i][0], batch_size=1, num_workers=4, shuffle=True),
                val_dataloaders=DataLoader(folds[i][1], batch_size=1, num_workers=4, shuffle=False))
    # Drop references before the next fold
    del model, lm, logger, trainer
    print(f'fold {i} completed....')
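
To make the 1x P100 and 2x T4 runs easier to compare, the Trainer configuration in the loop above could be factored into a small helper so that the two runs differ only in `devices` and `strategy`. This is a minimal sketch, not the code from the notebook: the helper name `trainer_kwargs` is hypothetical, and the choice of `strategy='auto'` for the single-GPU case is an assumption, since the report does not show which strategy was used on the P100. The per-fold `logger` is left out and would be passed separately.

```python
from typing import Any, Dict

# Output directory taken from the report.
ROOT_DIR = "/kaggle/working/training_results"


def trainer_kwargs(devices: int) -> Dict[str, Any]:
    """Return Trainer keyword arguments for a run on `devices` GPUs.

    Hypothetical helper: everything except `strategy` and `devices`
    is copied from the configuration in the report.
    """
    # Assumption: the single-GPU run lets Lightning pick the strategy,
    # while the multi-GPU run uses the notebook-specific DDP launcher.
    strategy = "ddp_notebook" if devices > 1 else "auto"
    return {
        "accelerator": "gpu",
        "strategy": strategy,
        "devices": devices,
        "precision": "32",
        "gradient_clip_val": None,
        "max_epochs": 15,
        "enable_checkpointing": True,
        "enable_progress_bar": True,
        "enable_model_summary": False,
        "inference_mode": True,
        "default_root_dir": ROOT_DIR,
        "num_sanity_val_steps": 0,
    }


if __name__ == "__main__":
    # Show the only two settings that differ between the runs.
    for devices in (1, 2):
        kwargs = trainer_kwargs(devices)
        print(devices, kwargs["strategy"])
```

Inside the fold loop this would be used as `Trainer(logger=logger, **trainer_kwargs(devices=2))`, which keeps every other setting identical between the working and the hanging configuration.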

Screenshot 2024-12-29 092937

When used with 1x P100 GPU


Screenshot 2024-12-29 093405

Environment

Please go to the Kaggle notebook and run it.
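
Since the environment details are not listed here, a short snippet like the following could be run in the notebook to capture them as text. It is only a sketch using the Python standard library; the function name `collect_versions` is hypothetical, and the package list is an assumption about which distributions matter for this report.

```python
import platform
import sys
from importlib import metadata


def collect_versions(packages=("torch", "lightning", "pytorch-lightning")):
    """Return a dict with the Python version, platform and package versions."""
    info = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed under this name.
            info[name] = "not installed"
    return info


if __name__ == "__main__":
    for key, value in collect_versions().items():
        print(f"{key}: {value}")
```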

More info

No response

cc @justusschock @lantiga
