Support torchrun multi node on local executor #143
Conversation
Hello @hemildesai, thanks for implementing this! I was trying to run this and the training was not starting. I realized this is because the […]. The simple fix here was to set […]. Hope this helps!
Hi @fdalvi |
Hello @aflah02, I am not sure I understand your question, but I was using the LocalExecutor (similar to https://docs.nvidia.com/nemo-framework/user-guide/latest/nemo-2.0/quickstart.html#execute-locally) with Slurm. The underlying framework was indeed NeMo.
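For reference, a minimal sketch of what that LocalExecutor setup can look like, adapted from the quickstart linked above; `configure_recipe` is a hypothetical placeholder and the exact executor arguments may vary between NeMo-Run versions:

```python
# Rough sketch of a local torchrun run with NeMo-Run, following the linked quickstart.
# `configure_recipe` is a hypothetical stand-in for whichever recipe you actually use.
import nemo_run as run

def local_torchrun_executor(devices: int = 8) -> run.LocalExecutor:
    # launcher="torchrun" makes the local executor spawn one worker per GPU via torchrun.
    return run.LocalExecutor(ntasks_per_node=devices, launcher="torchrun")

if __name__ == "__main__":
    recipe = configure_recipe(nodes=1, gpus_per_node=8)  # hypothetical helper
    executor = local_torchrun_executor(devices=8)
    run.run(recipe, executor=executor)
```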
Thanks @fdalvi |
Yes correct! |
Thanks! |
@fdalvi |
No, I was not able to run it successfully, and I already had quite a bit of experience with this kind of native Slurm + local executor setup, so I just went ahead with that.
Yeah same issues here |
Thanks @hemildesai |
Thanks @hemildesai for taking care of this. One additional thought I had was that the rendezvous ID must be different for different runs happening in parallel. So perhaps instead of setting […].
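To make that concrete, here is a small sketch (my own illustration, not code from this PR) of deriving the rendezvous ID from something run-specific rather than a fixed value:

```python
# Sketch: derive torchrun's --rdzv_id from a run-specific value so parallel runs on the
# same nodes never share a rendezvous. Illustrative only, not this PR's implementation.
import os
import uuid

def unique_rdzv_id() -> str:
    # Prefer the scheduler's job ID when available; otherwise fall back to a random UUID.
    return os.environ.get("SLURM_JOB_ID") or uuid.uuid4().hex

torchrun_args = [
    "torchrun",
    "--rdzv_backend=c10d",
    f"--rdzv_id={unique_rdzv_id()}",
    # --rdzv_endpoint, --nnodes, --nproc_per_node, and the script would be appended here.
]
```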
Just to follow up, I have tried with […].
    mounts=mounts,
    debug=executor.packager.debug,
    max_retries=executor.retries,
    use_env=use_env,
Thanks for this PR, @hemildesai! Just to understand: how is torchrun launched on the multiple nodes? Do we need to launch the nemo-run script in any specific way? (ref: docs)
It would be great if we could add a brief note about this in the docs.
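For context, the usual multi-node pattern is that an essentially identical torchrun command runs on every node, differing only in its node rank; a generic sketch of that pattern (standard torchrun usage, not this PR's code):

```python
# Generic illustration of a per-node torchrun launch; train.py and the port are placeholders.
import subprocess

def launch_on_node(node_rank: int, nnodes: int, master_addr: str) -> None:
    cmd = [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--node_rank={node_rank}",
        "--nproc_per_node=8",
        "--rdzv_backend=c10d",
        f"--rdzv_endpoint={master_addr}:29500",
        "train.py",  # hypothetical training entry point
    ]
    subprocess.run(cmd, check=True)
```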
Signed-off-by: Hemil Desai <hemild@nvidia.com>
Force-pushed from 7a242ec to a6274e0
Thanks @hemildesai! Just wanted to mention that you can use […].
Oh nice, updated the PR |
Hi @hemildesai, looks like you accidentally used […].
Created #209 |