Restarting parts of cluster #13793
Unanswered · BaruchG asked this question in DDP / multi-GPU / multi-node
I have a situation where I'm trying to automate training across an entire cluster using Lightning: the user submits their code and the system automatically takes care of the infrastructure, distributing the job across multiple nodes using Kubernetes. I'm essentially using the first option at https://pytorch-lightning.readthedocs.io/en/stable/clouds/cluster.html.
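For context, a minimal sketch of that "general purpose cluster" setup: each pod exports the same rendezvous variables, and the training script itself is identical on master and worker. The hostname and values below are placeholders I made up for illustration, not my actual config.

```python
# Hedged sketch: the environment variables the Lightning cluster docs
# describe for the "general purpose cluster" option. Each pod sets these
# before launching the (identical) training script.
import os

os.environ["MASTER_ADDR"] = "master-pod.default.svc.cluster.local"  # hypothetical k8s service name
os.environ["MASTER_PORT"] = "29500"  # rendezvous port on the master pod
os.environ["WORLD_SIZE"] = "2"       # one master pod + one worker pod
os.environ["NODE_RANK"] = "0"        # 0 on the master pod, 1 on the worker

print(os.environ["MASTER_ADDR"], "rank", os.environ["NODE_RANK"])
```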
I'm running into a problem. Say there are two pods: a master pod and a worker pod. Training starts (for, say, half an epoch), but every once in a while, for reasons unrelated to Lightning, the worker pod gets shut down and restarted (typically at the very beginning of training, so there's no real loss of training progress). The master pod pauses training, as expected. However, when the worker pod restarts (both master and worker run the exact same script), it can't reconnect to the master pod, and NCCL returns
NCCL INFO Call to connect returned Connection refused, retrying
constantly until the worker eventually crashes out, gets restarted, and the cycle continues. What would be the expected outcome here? Ideally I would like the worker to reconnect to the master and continue training; is that possible? Or should the master pod shut down and restart itself when it detects that a worker pod has been restarted, essentially restarting the whole training run?
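For anyone trying to reproduce this outside the cluster: the failure the restarted worker sees is an ordinary TCP-level refusal, since the original rendezvous state on the master no longer accepts a fresh connection on that port. A minimal standard-library sketch of the same error (the port here is chosen dynamically, purely for illustration):

```python
# Hedged sketch: what "Connection refused" looks like at the socket level
# when nothing is listening on the rendezvous port anymore.
import errno
import socket

# Reserve a free port, then close it so nothing is listening there.
probe = socket.socket()
probe.bind(("127.0.0.1", 0))
port = probe.getsockname()[1]
probe.close()

s = socket.socket()
try:
    s.connect(("127.0.0.1", port))  # no listener -> ECONNREFUSED
    result = "connected"
except ConnectionRefusedError as exc:
    result = "refused" if exc.errno == errno.ECONNREFUSED else "other"
finally:
    s.close()

print(result)  # -> refused
```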
Thank you