DDP failing when using multiple nodes #9793

This turned out to be a port problem: the port a node communicated through changed every time the script ran (DDP re-runs the entire script for each GPU). The MASTER_PORT environment variable was set only once and kept its value, but because the node kept getting new ports allocated, the original value of MASTER_PORT became invalid.

Extra info:

The SLURM cluster I use has a utility that allocates a port for you. Because DDP re-runs the entire script, launching a 2-node x 4-GPU job meant that each of the two nodes allocated 4 different ports for communication.

I originally had:

def setup_multinode():
    port = allocate_port_for_node()
    # Set e…
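
The snippet above is cut off, so the actual fix is not visible here. As a hedged illustration only (not necessarily what was done in this thread), one common way to keep the port stable across re-runs is to reuse MASTER_PORT if it is already set, and otherwise derive it deterministically from the SLURM job ID so that every process computes the same value. The helper name `resolve_master_port` and the `base`/`span` range are assumptions made for this sketch; `SLURM_JOB_ID` is the variable SLURM exports inside a job.

```python
import os


def resolve_master_port(env=None, base=20000, span=20000):
    """Return a port that every process in the same job computes identically.

    `base` and `span` are arbitrary values chosen for this sketch.
    """
    env = os.environ if env is None else env
    # Reuse a port that the launch script or a parent process already exported.
    if env.get("MASTER_PORT"):
        return int(env["MASTER_PORT"])
    # Otherwise derive the port from the job ID instead of asking an allocator,
    # so repeated executions of the script agree on the result.
    job_id = int(env.get("SLURM_JOB_ID", "0"))
    return base + job_id % span


def setup_multinode(env=None):
    """Set MASTER_PORT once; later calls leave the existing value untouched."""
    env = os.environ if env is None else env
    port = resolve_master_port(env)
    env["MASTER_PORT"] = str(port)
    return port
```

The trade-off is that a derived port is not guaranteed to be free on the node, unlike one handed out by an allocation utility. If the cluster's allocator must be used, calling it once in the job script before the training processes start and exporting the result would serve the same purpose.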

Answer selected by EricWiener