Restarting parts of cluster #13793
Unanswered · BaruchG asked this question in DDP / multi-GPU / multi-node
I have a situation where I'm trying to automate training across an entire cluster using Lightning: the user submits their code and the system automatically takes care of the infrastructure, distributing the job across multiple nodes using Kubernetes. I'm essentially using the first option at https://pytorch-lightning.readthedocs.io/en/stable/clouds/cluster.html.
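For context, a minimal sketch of that "general purpose cluster" setup: each pod exports the same rendezvous variables, and the training script itself is identical on master and worker. The hostname and values below are placeholders I made up for illustration, not my actual config.

```python
# Hedged sketch: the environment variables the Lightning cluster docs
# describe for the "general purpose cluster" option. Each pod sets these
# before launching the (identical) training script.
import os

os.environ["MASTER_ADDR"] = "master-pod.default.svc.cluster.local"  # hypothetical k8s service name
os.environ["MASTER_PORT"] = "29500"  # rendezvous port on the master pod
os.environ["WORLD_SIZE"] = "2"       # one master pod + one worker pod
os.environ["NODE_RANK"] = "0"        # 0 on the master pod, 1 on the worker

print(os.environ["MASTER_ADDR"], "rank", os.environ["NODE_RANK"])
```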
I'm running into a problem. Say there are two pods: a master pod and a worker pod. Training starts (for, say, half an epoch), but every once in a while, for reasons unrelated to Lightning, the worker pod gets shut down and restarted (typically at the very beginning of training, so there's no real loss of training progress). The master pod pauses training, as expected. However, when the worker pod restarts (both master and worker run the exact same script), it can't reconnect to the master pod, and NCCL returns
NCCL INFO Call to connect returned Connection refused, retrying
constantly until the worker eventually crashes out, gets restarted, and the cycle continues. What would be the expected outcome here? Ideally I would like the worker to reconnect to the master and continue training; is that possible? Or should the master pod shut down and restart itself when it detects that a worker pod has been restarted, essentially restarting the whole training run?
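For anyone trying to reproduce this outside the cluster: the failure the restarted worker sees is an ordinary TCP-level refusal, since the original rendezvous state on the master no longer accepts a fresh connection on that port. A minimal standard-library sketch of the same error (the port here is chosen dynamically, purely for illustration):

```python
# Hedged sketch: what "Connection refused" looks like at the socket level
# when nothing is listening on the rendezvous port anymore.
import errno
import socket

# Reserve a free port, then close it so nothing is listening there.
probe = socket.socket()
probe.bind(("127.0.0.1", 0))
port = probe.getsockname()[1]
probe.close()

s = socket.socket()
try:
    s.connect(("127.0.0.1", port))  # no listener -> ECONNREFUSED
    result = "connected"
except ConnectionRefusedError as exc:
    result = "refused" if exc.errno == errno.ECONNREFUSED else "other"
finally:
    s.close()

print(result)  # -> refused
```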
Thank you