Why does DDP mode continue the program in multiple processes for longer than intended? #13216
-
Hi all, I am starting a trainer with multiple GPUs using the following code:

```python
pl.Trainer(accelerator="gpu", devices=get_num_gpus(), strategy="ddp")
```

and then I have these lines:

```python
inference_outputs = self.trainer.predict(self.embedding_model, inference_dataloader)
print("abc")
```

What I am seeing is that `print("abc")` is executed once per available device, while I would have expected only the `predict` call to run on multiple GPUs/processes, finish, and gather all results before the next line runs. Am I missing something? Is there a way to achieve what I just described?
Replies: 2 comments 1 reply
-
@hfaghihi15 That's how it is! With DDP, Lightning runs the whole script in its subprocesses as described in the doc here: https://pytorch-lightning.readthedocs.io/en/stable/accelerators/gpu.html#distributed-data-parallel
In case you want to run something in only one process, you can use the trainer property `is_global_zero`:

```python
if trainer.is_global_zero:
    print("abc")
```

https://pytorch-lightning.readthedocs.io/en/stable/common/trainer.html#is-global-zero
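To make the behavior concrete: since DDP runs the entire script in every process, any line after `trainer.predict(...)` executes once per rank, and the rank-zero guard is what keeps side effects (printing, saving files) to a single process. Below is a minimal sketch of that pattern using plain `multiprocessing` rather than Lightning, so it is self-contained; the `script` and `run` helpers are illustrative stand-ins, not Lightning APIs.

```python
import multiprocessing as mp

def script(rank, queue):
    # Under DDP, Lightning runs the whole script in every process, so any
    # line placed after trainer.predict() executes once per rank.
    _ = rank * 10              # stand-in for the per-rank predict() work
    if rank == 0:              # analogous to `if trainer.is_global_zero:`
        queue.put("abc")       # only rank 0 reports

def run(world_size=4):
    # Launch world_size processes that all execute the same "script",
    # mimicking how DDP spawns one process per device.
    queue = mp.Queue()
    procs = [mp.Process(target=script, args=(r, queue)) for r in range(world_size)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    # Exactly one message arrives, even though all ranks ran the same code.
    return queue.get(timeout=5)

print(run())  # "abc" — printed once, not once per process
```

Without the rank guard, `queue.put` (like the original `print("abc")`) would fire `world_size` times.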
-
@akihironitta Thanks for your answer, what if I run this with