DDP failing when using multiple nodes #9793
Hi, I'm trying to use multi-node DDP, but my training job never gets past the following errors:

My code works fine using DDP with multiple GPUs on a single node.

Edit: To add some more details, it looks like what is happening (in some cases) is that the rank 1 node is set up before the rank 0 node, and this causes issues:

However, I also get this error if I ensure the rank 0 node is set up first:
This ended up being an issue where the port a node communicated through changed every time the script ran (since DDP re-runs the entire script for each GPU). Note that the MASTER_PORT environment variable was only set once and remained the same, but since the node kept getting new ports allocated, the original value of MASTER_PORT was invalidated.

Extra info: the SLURM cluster I use has a utility that will allocate you a port. My issue was that ddp re-runs the entire script, so if I launched a 2 node x 4 GPU job, each of the two nodes would allocate 4 different ports for communication.

I originally had:

```python
import pytorch_lightning as pl


def setup_multinode():
    port = allocate_port_for_node()
    # Set environment variables ...


if __name__ == "__main__":
    setup_multinode()

    # Initialize a trainer
    trainer = pl.Trainer(
        gpus=2,
        num_nodes=2,
        # Note that you could also set this to "ddp_spawn", which uses
        # torch.multiprocessing.spawn, but performance is worse
        accelerator="ddp",
        max_epochs=3,
        progress_bar_refresh_rate=20,
    )

    # Train the model
    trainer.fit(ae, train, val)
```

However, I needed to change it so the port is only allocated once per node (not once for every GPU):

```python
import os

import pytorch_lightning as pl


def setup_multinode():
    if os.environ.get("LOCAL_RANK", None):
        # This node has already been set up
        return
    port = allocate_port_for_node()
    # Set environment variables ...


if __name__ == "__main__":
    setup_multinode()

    # Initialize a trainer
    trainer = pl.Trainer(
        gpus=2,
        num_nodes=2,
        # Note that you could also set this to "ddp_spawn", which uses
        # torch.multiprocessing.spawn, but performance is worse
        accelerator="ddp",
        max_epochs=3,
        progress_bar_refresh_rate=20,
    )

    # Train the model
    trainer.fit(ae, train, val)
```

This way each node only allocates its communication port once.
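For completeness, the `# Set environment variables ...` step is elided above. Below is a minimal sketch of what it could contain, not the original code: it assumes the standard torch.distributed rendezvous variables are what's being set, uses a hypothetical `master_address()` helper, and substitutes a stand-in implementation for the cluster-specific `allocate_port_for_node()` utility.

```python
import os
import subprocess


def allocate_port_for_node():
    # Stand-in for the cluster-specific port utility mentioned above.
    # Deriving the port from the SLURM job id keeps it identical on every
    # node of the job, which MASTER_PORT requires.
    return 10000 + int(os.environ.get("SLURM_JOB_ID", "0")) % 20000


def master_address():
    # Hypothetical helper: first hostname in the job's node list, i.e. the
    # node that hosts global rank 0. Falls back to localhost.
    nodelist = os.environ.get("SLURM_JOB_NODELIST")
    if not nodelist:
        return "127.0.0.1"
    hosts = subprocess.run(
        ["scontrol", "show", "hostnames", nodelist],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    return hosts[0]


def setup_multinode():
    if os.environ.get("LOCAL_RANK", None):
        # A re-launched DDP process: this node was already configured by the
        # first process, so don't allocate another port.
        return
    os.environ["MASTER_PORT"] = str(allocate_port_for_node())
    os.environ["MASTER_ADDR"] = master_address()
    os.environ["NODE_RANK"] = os.environ.get("SLURM_NODEID", "0")
```

The LOCAL_RANK check is what restricts this to once per node: the processes that ddp re-launches for the other GPUs already have LOCAL_RANK in their environment, so only the first process on each node reaches the port allocation.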