
Support torchrun multi node on local executor #143

Merged
hemildesai merged 3 commits into main from hemil/local-multi-node on Apr 11, 2025

Conversation

@hemildesai (Contributor)

No description provided.

@fdalvi commented Mar 5, 2025

Hello @hemildesai, thanks for implementing this! I was trying to run it and training was not starting; I realized this is because the --rdzv-id (or run_id) must be the same across all distributed instances:

https://github.com/NVIDIA/NeMo-Run/blob/7a242ec7746630fa5c03943c1ae9c84a9a1e9f8b/src/nemo_run/run/torchx_backend/components/torchrun.py#L142-L143

The simple fix here was to set random.seed(...) before the above lines are processed, so all of the torchrun instances get the same run_id.

Hope this helps!
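
For reference, a minimal sketch of the workaround described above. It is illustrative only and does not reproduce the actual code in torchrun.py; the seed value and the make_run_id helper are hypothetical. The only point is that seeding the generator identically on every node makes each launcher draw the same run_id to pass as --rdzv-id.

```python
import random

# Hypothetical sketch of the workaround, not the actual NeMo-Run code.
FIXED_SEED = 1234  # assumption: any value works, as long as every node uses the same one


def make_run_id() -> str:
    # Seed before drawing, so every node's launcher generates the same id.
    random.seed(FIXED_SEED)
    return str(random.getrandbits(32))


# Each node computes the same value and passes it to torchrun via --rdzv-id.
print(make_run_id())  # identical output on every node
```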

@aflah02 commented Mar 14, 2025

Hi @fdalvi
Were you using this with NeMo by any chance to train models? Does it work well with these changes for multinode training?

@fdalvi commented Mar 16, 2025

> Hi @fdalvi
> Were you using this with NeMo by any chance to train models? Does it work well with these changes for multinode training?

Hello @aflah02, I am not sure I understand your question, but I was using the LocalExecutor (similar to https://docs.nvidia.com/nemo-framework/user-guide/latest/nemo-2.0/quickstart.html#execute-locally) and Slurm. The underlying framework was indeed NeMo.

@aflah02 commented Mar 17, 2025

Thanks @fdalvi
These were multinode runs, right?

@fdalvi commented Mar 17, 2025

Yes correct!

@aflah02 commented Mar 17, 2025

> Yes correct!

Thanks!

@aflah02 commented Mar 17, 2025

@fdalvi
Did you also try SLURMExecutor?

@fdalvi commented Mar 17, 2025

No, I was not able to run it successfully. I already had quite a bit of experience with this kind of native Slurm + local executor setup, so I just went ahead with that.

@aflah02 commented Mar 17, 2025

Yeah, same issues here.
Will try this PR then, thanks!

@hemildesai (Contributor, Author)

Thanks @fdalvi and @aflah02. I will add an option to use a fixed random seed for this use case, and then hopefully merge the PR.

@aflah02 commented Mar 19, 2025

Thanks @hemildesai

@fdalvi commented Mar 19, 2025

> Thanks @fdalvi and @aflah02. I will add an option to use a fixed random seed for this use case, and then hopefully merge the PR.

Thanks @hemildesai for taking care of this. One additional thought: the rendezvous ID must be different for different runs happening in parallel. So perhaps, instead of setting random.seed(...) to a fixed value, we could derive it from something unique to a run/experiment (perhaps a hash of the master address + port?).

@fdalvi commented Mar 27, 2025

Just to follow up, I have tried with random.seed(rdzv_endpoint) and parallel runs on the same cluster seem to work fine.
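
A hedged sketch of that idea, assuming the rendezvous endpoint string (master address + port) is available to every launcher; the function name is hypothetical and the actual change in torchrun.py may look different:

```python
import random


def run_id_from_endpoint(rdzv_endpoint: str) -> str:
    # Hypothetical illustration: seed from the endpoint so that all nodes of one
    # run (same endpoint) agree on the id, while parallel runs (different
    # endpoints) get different ids.
    random.seed(rdzv_endpoint)  # str seeds are hashed deterministically
    return str(random.getrandbits(32))


print(run_id_from_endpoint("10.0.0.1:29500"))  # same on every node of this run
print(run_id_from_endpoint("10.0.0.2:29501"))  # a parallel run gets a different id
```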

Review comment on the diff:

    mounts=mounts,
    debug=executor.packager.debug,
    max_retries=executor.retries,
    use_env=use_env,

@pramodk (Contributor) commented Apr 10, 2025


Thanks for this PR, @hemildesai! Just to understand: how is torchrun launched on the multiple nodes? Do we need to launch the nemo-run script in any specific way? (ref: docs)

It would be great if we could add a brief note about this in the docs.

@hemildesai (Contributor, Author)

@fdalvi @aflah02 I have added random.seed and also added an option to provide a custom rdzv_id. This PR is now ready to be merged.

hemildesai requested a review from marcromeyn on Apr 11, 2025 at 18:04
marcromeyn previously approved these changes on Apr 11, 2025
@fdalvi commented Apr 11, 2025

Thanks @hemildesai! Just wanted to mention that you can use rdzv_endpoint as the seed: this keeps the ID the same for a run across nodes (since they all share the same endpoint), but also automatically gives parallel runs different IDs.

@hemildesai (Contributor, Author)

> Thanks @hemildesai! Just wanted to mention that you can use rdzv_endpoint as the seed: this keeps the ID the same for a run across nodes (since they all share the same endpoint), but also automatically gives parallel runs different IDs.

Oh nice, updated the PR

hemildesai merged commit 33c0e0b into main on Apr 11, 2025
20 checks passed
@fdalvi commented Apr 11, 2025

> Thanks @hemildesai! Just wanted to mention that you can use rdzv_endpoint as the seed: this keeps the ID the same for a run across nodes (since they all share the same endpoint), but also automatically gives parallel runs different IDs.
>
> Oh nice, updated the PR

Hi @hemildesai, looks like you accidentally used rdzv_id (which will be None by default) instead of rdzv_endpoint for the seed.
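
A short sketch of the distinction being pointed out, under the assumption that the component takes an optional rdzv_id and an rdzv_endpoint (names follow this thread; the real signature in torchrun.py may differ):

```python
import random
from typing import Optional


def resolve_rdzv_id(rdzv_id: Optional[str], rdzv_endpoint: str) -> str:
    # Hypothetical sketch, not the actual component code.
    if rdzv_id is not None:
        return rdzv_id  # caller supplied an explicit rendezvous id
    # Seed from the endpoint: every node of a run shares it, and it differs
    # between runs. Seeding from rdzv_id would not work here, because it is
    # None by default and every run would end up with the same seed.
    random.seed(rdzv_endpoint)
    return str(random.getrandbits(32))


print(resolve_rdzv_id(None, "nodeA:29500"))  # same id on every node of this run
print(resolve_rdzv_id(None, "nodeB:29500"))  # different run -> different id
```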

hemildesai mentioned this pull request on Apr 11, 2025
@hemildesai (Contributor, Author)

> Thanks @hemildesai! Just wanted to mention that you can use rdzv_endpoint as the seed: this keeps the ID the same for a run across nodes (since they all share the same endpoint), but also automatically gives parallel runs different IDs.
>
> Oh nice, updated the PR
>
> Hi @hemildesai, looks like you accidentally used rdzv_id (which will be None by default) instead of rdzv_endpoint for the seed.

Created #209
