Why does DDP mode continue the program in multiple processes for longer than intended? #13216
-
Hi all, I am starting a trainer with multiple GPUs using the following code:

```python
pl.Trainer(accelerator="gpu", devices=get_num_gpus(), strategy="ddp")
```

and then I have these lines:

```python
inference_outputs = self.trainer.predict(self.embedding_model, inference_dataloader)
print("abc")
```

What I am seeing is that `print("abc")` is executed once per available device, while I would have expected only the `predict` call to run on multiple GPUs/processes, finish, and gather all results before the next line runs. Am I missing something? Is there a way to achieve what I just described?
Replies: 2 comments 1 reply
-
@hfaghihi15 That's how it is! With DDP, Lightning runs the whole script in its subprocesses as described in the doc here: https://pytorch-lightning.readthedocs.io/en/stable/accelerators/gpu.html#distributed-data-parallel
In case you want to run something in only one process, you can use the trainer property `is_global_zero`:

```python
if trainer.is_global_zero:
    print("abc")
```

https://pytorch-lightning.readthedocs.io/en/stable/common/trainer.html#is-global-zero
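To make the behavior concrete: since DDP runs the entire script in every process, any line after `trainer.predict(...)` executes once per rank, and the rank-zero guard is what keeps side effects (printing, saving files) to a single process. Below is a minimal sketch of that pattern using plain `multiprocessing` rather than Lightning, so it is self-contained; the `script` and `run` helpers are illustrative stand-ins, not Lightning APIs.

```python
import multiprocessing as mp

def script(rank, queue):
    # Under DDP, Lightning runs the whole script in every process, so any
    # line placed after trainer.predict() executes once per rank.
    _ = rank * 10              # stand-in for the per-rank predict() work
    if rank == 0:              # analogous to `if trainer.is_global_zero:`
        queue.put("abc")       # only rank 0 reports

def run(world_size=4):
    # Launch world_size processes that all execute the same "script",
    # mimicking how DDP spawns one process per device.
    queue = mp.Queue()
    procs = [mp.Process(target=script, args=(r, queue)) for r in range(world_size)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    # Exactly one message arrives, even though all ranks ran the same code.
    return queue.get(timeout=5)

print(run())  # "abc" — printed once, not once per process
```

Without the rank guard, `queue.put` (like the original `print("abc")`) would fire `world_size` times.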
-
@akihironitta Thanks for your answer, what if I run this with