
Support torchrun multi node on local executor #143

Merged
hemildesai merged 3 commits into main from hemil/local-multi-node on Apr 11, 2025

Conversation

@hemildesai (Contributor)

No description provided.

@fdalvi commented Mar 5, 2025

Hello @hemildesai, thanks for implementing this! I was trying to run it and training was not starting; I realized this is because the --rdzv-id (or run_id) must be the same across all distributed instances:

https://github.com/NVIDIA/NeMo-Run/blob/7a242ec7746630fa5c03943c1ae9c84a9a1e9f8b/src/nemo_run/run/torchx_backend/components/torchrun.py#L142-L143

The simple fix here was to set random.seed(...) before the above lines are processed, so all of the torchrun instances get the same run_id.

Hope this helps!
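
For reference, a minimal sketch of the workaround described above. It is illustrative only and does not reproduce the actual code in torchrun.py; the seed value and the make_run_id helper are hypothetical. The only point is that seeding the generator identically on every node makes each launcher draw the same run_id to pass as --rdzv-id.

```python
import random

# Hypothetical sketch of the workaround, not the actual NeMo-Run code.
FIXED_SEED = 1234  # assumption: any value works, as long as every node uses the same one


def make_run_id() -> str:
    # Seed before drawing, so every node's launcher generates the same id.
    random.seed(FIXED_SEED)
    return str(random.getrandbits(32))


# Each node computes the same value and passes it to torchrun via --rdzv-id.
print(make_run_id())  # identical output on every node
```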

@aflah02 commented Mar 14, 2025

Hi @fdalvi
Were you using this with NeMo by any chance to train models? Does it work well with these changes for multinode training?

@fdalvi commented Mar 16, 2025

> Hi @fdalvi
> Were you using this with NeMo by any chance to train models? Does it work well with these changes for multinode training?

Hello @aflah02, I am not sure I understand your question, but I was using the LocalExecutor (similar to https://docs.nvidia.com/nemo-framework/user-guide/latest/nemo-2.0/quickstart.html#execute-locally) and Slurm. The underlying framework was indeed NeMo.

@aflah02 commented Mar 17, 2025

Thanks @fdalvi
These were multinode runs, right?

@fdalvi commented Mar 17, 2025

Yes correct!

@aflah02 commented Mar 17, 2025

> Yes correct!

Thanks!

@aflah02 commented Mar 17, 2025

@fdalvi
Did you also try SLURMExecutor?

@fdalvi commented Mar 17, 2025

No, I was not able to run it successfully. I already had quite a bit of experience with this kind of native Slurm + local executor setup, so I just went ahead with that.

@aflah02 commented Mar 17, 2025

Yeah, same issues here.
Will try this PR then, thanks!

@hemildesai (Contributor, Author)

Thanks @fdalvi and @aflah02. I will add an option to use a fixed random seed for this use case, and then hopefully merge the PR.

@aflah02 commented Mar 19, 2025

Thanks @hemildesai

@fdalvi commented Mar 19, 2025

> Thanks @fdalvi and @aflah02. I will add an option to use a fixed random seed for this use case, and then hopefully merge the PR.

Thanks @hemildesai for taking care of this. One additional thought: the rendezvous ID must be different for different runs happening in parallel. So perhaps, instead of setting random.seed(...) to a fixed value, we could derive it from something unique to a run/experiment (perhaps a hash of the master address + port?).

@fdalvi commented Mar 27, 2025

Just to follow up, I have tried with random.seed(rdzv_endpoint) and parallel runs on the same cluster seem to work fine.
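
A hedged sketch of that idea, assuming the rendezvous endpoint string (master address + port) is available to every launcher; the function name is hypothetical and the actual change in torchrun.py may look different:

```python
import random


def run_id_from_endpoint(rdzv_endpoint: str) -> str:
    # Hypothetical illustration: seed from the endpoint so that all nodes of one
    # run (same endpoint) agree on the id, while parallel runs (different
    # endpoints) get different ids.
    random.seed(rdzv_endpoint)  # str seeds are hashed deterministically
    return str(random.getrandbits(32))


print(run_id_from_endpoint("10.0.0.1:29500"))  # same on every node of this run
print(run_id_from_endpoint("10.0.0.2:29501"))  # a parallel run gets a different id
```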

Review comment on the diff:

    mounts=mounts,
    debug=executor.packager.debug,
    max_retries=executor.retries,
    use_env=use_env,

@pramodk (Contributor) commented Apr 10, 2025


Thanks for this PR, @hemildesai! Just to understand: how is torchrun launched on the multiple nodes? Do we need to launch the nemo-run script in any specific way? (ref: docs)

It would be great if we could add a brief note about this in the docs.

@hemildesai (Contributor, Author)

@fdalvi @aflah02 I have added random.seed and also added an option to provide a custom rdzv_id. This PR is now ready to be merged.

hemildesai requested a review from marcromeyn on Apr 11, 2025 at 18:04
marcromeyn previously approved these changes on Apr 11, 2025
@fdalvi commented Apr 11, 2025

Thanks @hemildesai! Just wanted to mention that you can use rdzv_endpoint as the seed: this keeps the ID the same for a run across nodes (since they all share the same endpoint), but also automatically gives parallel runs different IDs.

@hemildesai (Contributor, Author)

> Thanks @hemildesai! Just wanted to mention that you can use rdzv_endpoint as the seed: this keeps the ID the same for a run across nodes (since they all share the same endpoint), but also automatically gives parallel runs different IDs.

Oh nice, updated the PR

hemildesai merged commit 33c0e0b into main on Apr 11, 2025
20 checks passed
@fdalvi commented Apr 11, 2025

> Thanks @hemildesai! Just wanted to mention that you can use rdzv_endpoint as the seed: this keeps the ID the same for a run across nodes (since they all share the same endpoint), but also automatically gives parallel runs different IDs.
>
> Oh nice, updated the PR

Hi @hemildesai, looks like you accidentally used rdzv_id (which will be None by default) instead of rdzv_endpoint for the seed.
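
A short sketch of the distinction being pointed out, under the assumption that the component takes an optional rdzv_id and an rdzv_endpoint (names follow this thread; the real signature in torchrun.py may differ):

```python
import random
from typing import Optional


def resolve_rdzv_id(rdzv_id: Optional[str], rdzv_endpoint: str) -> str:
    # Hypothetical sketch, not the actual component code.
    if rdzv_id is not None:
        return rdzv_id  # caller supplied an explicit rendezvous id
    # Seed from the endpoint: every node of a run shares it, and it differs
    # between runs. Seeding from rdzv_id would not work here, because it is
    # None by default and every run would end up with the same seed.
    random.seed(rdzv_endpoint)
    return str(random.getrandbits(32))


print(resolve_rdzv_id(None, "nodeA:29500"))  # same id on every node of this run
print(resolve_rdzv_id(None, "nodeB:29500"))  # different run -> different id
```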

hemildesai mentioned this pull request on Apr 11, 2025
@hemildesai (Contributor, Author)

> Thanks @hemildesai! Just wanted to mention that you can use rdzv_endpoint as the seed: this keeps the ID the same for a run across nodes (since they all share the same endpoint), but also automatically gives parallel runs different IDs.
>
> Oh nice, updated the PR
>
> Hi @hemildesai, looks like you accidentally used rdzv_id (which will be None by default) instead of rdzv_endpoint for the seed.

Created #209
