DDP failing when using multiple nodes #9793
Hi, I'm trying to use multi-node DDP, but my training job never gets past the following errors:

My code works fine using DDP with multiple GPUs on a single node.

Edit: To add some more details, it looks like what is happening (in some cases) is that the rank 1 node is set up before the rank 0 node, and this causes issues:

However, I also get this error if I ensure the rank 0 node is set up first:
This ended up being an issue where the port a node communicated through changed every time the script ran (since DDP re-runs the entire script for each GPU). Note that the MASTER_PORT environment variable was only set once and remained the same, but since the node kept getting new ports allocated, the original value of MASTER_PORT was invalidated.

Extra info: the SLURM cluster I use has a utility that will allocate you a port. My issue was that ddp re-runs the entire script, so if I launched a 2 node x 4 GPU job, each of the two nodes would allocate 4 different ports for communication.

I originally had:

```python
import pytorch_lightning as pl


def setup_multinode():
    port = allocate_port_for_node()
    # Set environment variables ...


if __name__ == "__main__":
    setup_multinode()

    # Initialize a trainer
    trainer = pl.Trainer(
        gpus=2,
        num_nodes=2,
        # Note that you could also set this to "ddp_spawn", which uses
        # torch.multiprocessing.spawn, but performance is worse
        accelerator="ddp",
        max_epochs=3,
        progress_bar_refresh_rate=20,
    )

    # Train the model
    trainer.fit(ae, train, val)
```

However, I needed to change it so the port is only allocated once per node (not once for every GPU):

```python
import os

import pytorch_lightning as pl


def setup_multinode():
    if os.environ.get("LOCAL_RANK", None):
        # This node has already been set up
        return
    port = allocate_port_for_node()
    # Set environment variables ...


if __name__ == "__main__":
    setup_multinode()

    # Initialize a trainer
    trainer = pl.Trainer(
        gpus=2,
        num_nodes=2,
        # Note that you could also set this to "ddp_spawn", which uses
        # torch.multiprocessing.spawn, but performance is worse
        accelerator="ddp",
        max_epochs=3,
        progress_bar_refresh_rate=20,
    )

    # Train the model
    trainer.fit(ae, train, val)
```

This way each node only allocates its communication port once.
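For completeness, the `# Set environment variables ...` step is elided above. Below is a minimal sketch of what it could contain, not the original code: it assumes the standard torch.distributed rendezvous variables are what's being set, uses a hypothetical `master_address()` helper, and substitutes a stand-in implementation for the cluster-specific `allocate_port_for_node()` utility.

```python
import os
import subprocess


def allocate_port_for_node():
    # Stand-in for the cluster-specific port utility mentioned above.
    # Deriving the port from the SLURM job id keeps it identical on every
    # node of the job, which MASTER_PORT requires.
    return 10000 + int(os.environ.get("SLURM_JOB_ID", "0")) % 20000


def master_address():
    # Hypothetical helper: first hostname in the job's node list, i.e. the
    # node that hosts global rank 0. Falls back to localhost.
    nodelist = os.environ.get("SLURM_JOB_NODELIST")
    if not nodelist:
        return "127.0.0.1"
    hosts = subprocess.run(
        ["scontrol", "show", "hostnames", nodelist],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    return hosts[0]


def setup_multinode():
    if os.environ.get("LOCAL_RANK", None):
        # A re-launched DDP process: this node was already configured by the
        # first process, so don't allocate another port.
        return
    os.environ["MASTER_PORT"] = str(allocate_port_for_node())
    os.environ["MASTER_ADDR"] = master_address()
    os.environ["NODE_RANK"] = os.environ.get("SLURM_NODEID", "0")
```

The LOCAL_RANK check is what restricts this to once per node: the processes that ddp re-launches for the other GPUs already have LOCAL_RANK in their environment, so only the first process on each node reaches the port allocation.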