
Why does DDP mode continue the program in multiple processes for longer than intended? #13216

@hfaghihi15 That's how it is! With DDP, Lightning runs the whole script once in each of its subprocesses, as described in the docs here: https://pytorch-lightning.readthedocs.io/en/stable/accelerators/gpu.html#distributed-data-parallel

Lightning's implementation of DDP calls your script under the hood multiple times, with the correct environment variables set for each process:

```bash
# example for 3 GPUs DDP
MASTER_ADDR=localhost MASTER_PORT=random() WORLD_SIZE=3 NODE_RANK=0 LOCAL_RANK=0 python my_file.py --accelerator 'gpu' --devices 3 --etc
MASTER_ADDR=localhost MASTER_PORT=random() WORLD_SIZE=3 NODE_RANK=1 LOCAL_RANK=0 python my_file.py --accelerator 'gpu' --devices 3 --etc
MASTER_ADDR=localhost MASTER_PORT=random() WORLD_SIZE=3 NODE_RANK=2 LOCAL_RANK=0 python my_file.py --accelerator 'gpu' --devices 3 --etc
```
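Because each process executes the script from the top, any code placed after `trainer.fit()` also runs once per GPU, which is what you are observing. Here is a minimal sketch of how you could restrict post-training work to a single process (the `ToyModel` and random data are made up for illustration; the `trainer.is_global_zero` check is the relevant part):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class ToyModel(pl.LightningModule):
    """Hypothetical minimal model, only here to keep the example self-contained."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    data = TensorDataset(torch.randn(64, 4), torch.randn(64, 1))
    trainer = pl.Trainer(accelerator="gpu", devices=3, strategy="ddp", max_epochs=1)
    trainer.fit(ToyModel(), DataLoader(data, batch_size=8))

    # All 3 processes reach this point, because each of them ran the
    # whole script. Guard anything that should happen exactly once:
    if trainer.is_global_zero:
        print("post-training work runs only on rank 0")
```

For standalone functions, the `rank_zero_only` decorator from `pytorch_lightning.utilities` achieves the same effect.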
