ncclSystemError: System call (socket, malloc, munmap, etc) failed. #12256
Unanswered
iqbalfarz asked this question in DDP / multi-GPU / multi-node
Replies: 1 comment
-
Duplicate of #12257.
-
I am running a simple MNIST classification with PyTorch Lightning (PL) and want to run it on multiple nodes.
The code is
I have two nodes:
node 1:
node 2:
Script running on NODE 1:
NCCL_DEBUG=INFO MASTER_ADDR=172.21.12.6 MASTER_PORT=1234 NODE_RANK=0 python train_mnist_light.py
Script running on NODE 2:
NCCL_DEBUG=INFO MASTER_ADDR=172.21.12.6 MASTER_PORT=1234 NODE_RANK=1 python train_mnist_light.py
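Both launch commands rely on the same three environment variables (MASTER_ADDR, MASTER_PORT, NODE_RANK). A minimal sketch of a pre-flight check that could be run on each node before training is shown below. The helper names `read_rendezvous_env` and `can_connect` are hypothetical and are not part of PL or NCCL; the sketch only confirms that the variables are set and that a plain TCP connection to the master address can be opened.

```python
import os
import socket


def read_rendezvous_env(env=None):
    """Collect the rendezvous settings passed in the launch commands.

    Hypothetical helper: reads MASTER_ADDR, MASTER_PORT and NODE_RANK
    from the given mapping (defaults to os.environ).
    """
    env = os.environ if env is None else env
    required = ("MASTER_ADDR", "MASTER_PORT", "NODE_RANK")
    missing = [name for name in required if name not in env]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {
        "master_addr": env["MASTER_ADDR"],
        "master_port": int(env["MASTER_PORT"]),
        "node_rank": int(env["NODE_RANK"]),
    }


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

One possible use: while the script on NODE 1 is waiting for peers, call `can_connect(cfg["master_addr"], cfg["master_port"])` on NODE 2, with `cfg = read_rendezvous_env()`. A `False` result would point toward a firewall or routing problem between the nodes. A `True` result does not rule out an NCCL problem, because NCCL opens its own sockets separately from the rendezvous port.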
Error on NODE 1:
Error on NODE 2: