Commit 0a25fd6

Remove breaking torchrun config for single-node runs (#292)
* remove breaking torchrun config for single-node runs

Signed-off-by: Roee Landesman <[email protected]>

* fix lint

Signed-off-by: Roee Landesman <[email protected]>

---------

Signed-off-by: Roee Landesman <[email protected]>
1 parent 00f78ae commit 0a25fd6

1 file changed, 0 additions and 13 deletions

nemo_run/core/execution/skypilot.py

Lines changed: 0 additions & 13 deletions
@@ -27,7 +27,6 @@
     Executor,
     ExecutorMacros,
 )
-from nemo_run.core.execution.launcher import FaultTolerance, Torchrun
 from nemo_run.core.packaging.base import Packager
 from nemo_run.core.packaging.git import GitArchivePackager

@@ -342,18 +341,6 @@ def macro_values(self) -> Optional[ExecutorMacros]:
             het_group_host_var=self.HET_GROUP_HOST_VAR,
         )
 
-    def _setup_launcher(self):
-        super()._setup_launcher()
-        launcher = self.launcher
-        # Dynamic rendezvous has an error in Skypilot Kubernetes currently
-        if (
-            launcher
-            and isinstance(launcher, (Torchrun, FaultTolerance))
-            and self.cloud == "kubernetes"
-        ):
-            launcher.rdzv_backend = "static"
-            launcher.rdzv_port = 49500
-
     def to_task(
         self,
         name: str,
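
Note: callers who still need the static rendezvous on multi-node Kubernetes jobs could reapply the deleted override themselves, guarded by node count. The sketch below is not part of this commit. `Torchrun` here is a stand-in dataclass rather than the nemo_run class, its default values are placeholders, and `force_static_rendezvous` and `num_nodes` are hypothetical names. Only `rdzv_backend`, `rdzv_port`, the "static" value, and port 49500 come from the deleted lines.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Torchrun:
    """Stand-in for the launcher; only the two fields the diff touches.

    The defaults are placeholders, not confirmed nemo_run defaults.
    """

    rdzv_backend: str = "c10d"
    rdzv_port: int = 29500


def force_static_rendezvous(
    launcher: Optional[Torchrun], cloud: Optional[str], num_nodes: int
) -> None:
    """Reapply the removed settings, but only for multi-node Kubernetes runs."""
    if launcher is None or cloud != "kubernetes" or num_nodes <= 1:
        # Single-node runs (the case this commit fixes) are left untouched.
        return
    launcher.rdzv_backend = "static"
    launcher.rdzv_port = 49500


single = Torchrun()
force_static_rendezvous(single, cloud="kubernetes", num_nodes=1)

multi = Torchrun()
force_static_rendezvous(multi, cloud="kubernetes", num_nodes=2)
```

With the guard in place, `single` keeps its original rendezvous settings while `multi` is switched to the static backend on port 49500.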
