How to run code on multi-machine multi-gpu? #9830
ljz756245026
started this conversation in
General
Replies: 0 comments
I am going to train a model on 2 machines (each with 2 GPUs).
I can run DDP on a single machine (2 GPUs) with `python singlenode.py`, and it runs successfully. However, I cannot run DDP across the 2 machines.
The core of my code looks like this:
When I run `python multinodes.py` on the two machines, both print output like the following, and then the program hangs. Both machines report the following GLOBAL_RANK.
Then I tried using a command like:
The output is as follows:
on node 0:
on node 1:
I found that the 2 nodes do not initialize DDP successfully. They stop while initializing the 1st process.
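For reference when reading the GLOBAL_RANK values, here is a minimal sketch of how a global rank is conventionally derived from a node index and a local GPU index in the 2 machines x 2 GPUs setup described above. The function name `global_rank` and its parameters are hypothetical and not taken from any library; the numbering scheme is the common convention, not necessarily what this code does internally.

```python
def global_rank(node_rank: int, local_rank: int, gpus_per_node: int) -> int:
    """Conventional global rank: processes are numbered node by node."""
    if not 0 <= local_rank < gpus_per_node:
        raise ValueError("local_rank must be in [0, gpus_per_node)")
    return node_rank * gpus_per_node + local_rank


NUM_NODES = 2       # two machines, as in the question
GPUS_PER_NODE = 2   # two GPUs on each machine
WORLD_SIZE = NUM_NODES * GPUS_PER_NODE  # total number of processes

# Map every (node, gpu) pair to its global rank.
ranks = {
    (node, gpu): global_rank(node, gpu, GPUS_PER_NODE)
    for node in range(NUM_NODES)
    for gpu in range(GPUS_PER_NODE)
}

for (node, gpu), rank in sorted(ranks.items()):
    print(f"node {node}, gpu {gpu} -> global rank {rank} of {WORLD_SIZE}")
```

Under this convention each of the 4 processes gets a distinct rank. If both machines report the same ranks, one possible (unconfirmed) explanation is that each machine believes it is node 0.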
What should I do to train on multiple machines with multiple GPUs?
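One thing that may be worth checking in a setup like this is whether each machine is told which node it is and where the shared rendezvous address lives. The sketch below only builds the environment each node would be started with. It assumes the launcher reads `MASTER_ADDR`, `MASTER_PORT`, `NODE_RANK`, and `WORLD_SIZE`; the `NODE_RANK` name in particular is an assumption, since some launchers use a different variable. The address `10.0.0.1`, the port `29500`, and the helper `node_env` are hypothetical placeholders.

```python
def node_env(node_rank, num_nodes, gpus_per_node, master_addr, master_port):
    """Build the environment variables for one node (hypothetical helper)."""
    if not 0 <= node_rank < num_nodes:
        raise ValueError("node_rank must be in [0, num_nodes)")
    return {
        "MASTER_ADDR": master_addr,        # same on every node: address of node 0
        "MASTER_PORT": str(master_port),   # same on every node: a free port on node 0
        "NODE_RANK": str(node_rank),       # differs per node (assumed variable name)
        "WORLD_SIZE": str(num_nodes * gpus_per_node),
    }


# Placeholder address and port; replace with values valid for your network.
envs = [node_env(rank, 2, 2, "10.0.0.1", 29500) for rank in range(2)]

# Print the command line each node would run.
for env in envs:
    prefix = " ".join(f"{key}={value}" for key, value in env.items())
    print(f"{prefix} python multinodes.py")
```

In this sketch only `NODE_RANK` differs between the two machines; the address and port are identical on both and must be reachable from node 1.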