Open
Labels
bug (Something isn't working), strategy: ddp (DistributedDataParallel), ver: 2.4.x
Description
Bug description
Notebook - https://www.kaggle.com/code/vigneshwar472/baseline-residualunetse3d
Github repo (ResidualUNetSE3D implementation) - https://github.com/wolny/pytorch-3dunet/tree/master
Issue - Training starts on 1x P100 GPU but does not start on 2x T4 GPUs.
I want to use 2 GPUs simultaneously for training (the ddp_notebook strategy), but training never starts and neither GPU shows any utilization.
I have no idea why it is not working.
See the "Error messages and logs" section below.
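A suggestion not in the original report: before rerunning, enabling torch.distributed and NCCL debug output often turns a silent pause into an explicit error message. Both environment variables below are real PyTorch/NCCL debug knobs; where exactly to set them in the notebook is an assumption:

```shell
# Surface distributed-init logging before launching the Trainer.
export NCCL_DEBUG=INFO                 # NCCL prints each rendezvous/collective step
export TORCH_DISTRIBUTED_DEBUG=DETAIL  # torch.distributed logs init and collective details
```

In a Kaggle notebook, the same effect comes from setting these via `os.environ` in a cell that runs before the Trainer is constructed.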
What version are you seeing the problem on?
v2.4
How to reproduce the bug
Go to the Kaggle notebook https://www.kaggle.com/code/vigneshwar472/baseline-residualunetse3d-train
Copy & Edit
Run All
You will encounter a never-ending pause.
Error messages and logs
When used with 2x T4 GPUs
n = len(folds)
for i in range(n):
    print(f'fold {i} started....')
    model = ResidualUNetSE3D(in_channels=1, out_channels=6)
    lm = CZIILightningModule(model=model)
    logger = CSVLogger(save_dir='/kaggle/working/training_results', name=f'fold_{i}')
    trainer = Trainer(accelerator='gpu',
                      strategy='ddp_notebook',
                      devices=2,
                      precision='32',
                      gradient_clip_val=None,
                      logger=logger,
                      max_epochs=15,
                      enable_checkpointing=True,
                      enable_progress_bar=True,
                      enable_model_summary=False,
                      inference_mode=True,
                      default_root_dir='/kaggle/working/training_results',
                      num_sanity_val_steps=0)
    trainer.fit(model=lm,
                train_dataloaders=DataLoader(folds[i][0], batch_size=1, num_workers=4, shuffle=True),
                val_dataloaders=DataLoader(folds[i][1], batch_size=1, num_workers=4, shuffle=False))
    del model, lm, logger, trainer
    print(f'fold {i} completed....')
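The never-ending pause described above is characteristic of a distributed rendezvous that never completes: each DDP worker blocks waiting for the other rank, and no error is ever raised. As a minimal stdlib sketch (independent of Lightning and NCCL; `rendezvous_check` and `worker` are hypothetical names), this is the pattern a watchdog-style check follows, with an explicit timeout so a hang becomes a reported failure instead of an endless wait:

```python
import multiprocessing as mp
import socket

def worker(rank: int, port: int) -> None:
    """Each 'rank' checks in at a rendezvous socket, mimicking DDP init."""
    with socket.create_connection(("127.0.0.1", port), timeout=5) as s:
        s.sendall(str(rank).encode())

def rendezvous_check(world_size: int = 2, timeout: float = 10.0) -> bool:
    """Return True if all ranks check in before the timeout, else False.

    A real DDP init would block here indefinitely; the explicit timeout
    turns a silent hang into a diagnosable failure.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))        # pick any free port
    server.listen(world_size)
    server.settimeout(timeout)
    port = server.getsockname()[1]

    # fork-started workers, similar in spirit to notebook-friendly DDP launchers
    ctx = mp.get_context("fork")
    procs = [ctx.Process(target=worker, args=(r, port))
             for r in range(world_size)]
    for p in procs:
        p.start()

    seen = set()
    try:
        while len(seen) < world_size:
            conn, _ = server.accept()    # raises socket.timeout on a hang
            seen.add(conn.recv(16).decode())
            conn.close()
    except socket.timeout:
        return False
    finally:
        server.close()
        for p in procs:
            p.join(timeout=5)
    return True

if __name__ == "__main__":
    print("rendezvous ok" if rendezvous_check() else "rendezvous hung")
```

If a check like this times out on the 2x T4 machine but succeeds on the single-GPU one, that points at inter-process communication on the node rather than at the model code.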
When used with 1x P100 GPU
Environment
Please go to the Kaggle notebook above and run it.
More info
No response

