Commit b3a2f62
committed
Add support for torchrun to DGXC Executor
DGX Cloud uses the PyTorch Training Operator from KubeFlow under the
hood to launch jobs. This handles many of the variables necessary for
running distributed PyTorch jobs with torchrun and only a subset of
settings are required to launch the job as the original default
settings will conflict with the auto-configured setup from DGX Cloud.
Signed-Off-By: Robert Clark <[email protected]>1 parent 61bb965 commit b3a2f62
File tree
4 files changed
+33
-20
lines changed- src/nemo_run
- core/execution
- run/torchx_backend
- components
4 files changed
+33
-20
lines changed| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
17 | 17 | | |
18 | 18 | | |
19 | 19 | | |
20 | | - | |
| 20 | + | |
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
138 | 138 | | |
139 | 139 | | |
140 | 140 | | |
141 | | - | |
| 141 | + | |
142 | 142 | | |
143 | 143 | | |
144 | 144 | | |
| |||
156 | 156 | | |
157 | 157 | | |
158 | 158 | | |
| 159 | + | |
| 160 | + | |
| 161 | + | |
| 162 | + | |
| 163 | + | |
| 164 | + | |
159 | 165 | | |
160 | 166 | | |
161 | 167 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
59 | 59 | | |
60 | 60 | | |
61 | 61 | | |
| 62 | + | |
62 | 63 | | |
63 | 64 | | |
64 | 65 | | |
| |||
92 | 93 | | |
93 | 94 | | |
94 | 95 | | |
| 96 | + | |
95 | 97 | | |
96 | 98 | | |
97 | 99 | | |
| |||
130 | 132 | | |
131 | 133 | | |
132 | 134 | | |
133 | | - | |
134 | | - | |
135 | | - | |
136 | | - | |
137 | | - | |
138 | | - | |
139 | | - | |
140 | | - | |
141 | | - | |
142 | | - | |
143 | | - | |
144 | | - | |
145 | | - | |
146 | | - | |
147 | | - | |
148 | | - | |
149 | | - | |
150 | | - | |
| 135 | + | |
| 136 | + | |
| 137 | + | |
| 138 | + | |
| 139 | + | |
| 140 | + | |
| 141 | + | |
| 142 | + | |
| 143 | + | |
| 144 | + | |
| 145 | + | |
| 146 | + | |
| 147 | + | |
| 148 | + | |
| 149 | + | |
| 150 | + | |
| 151 | + | |
| 152 | + | |
| 153 | + | |
| 154 | + | |
| 155 | + | |
151 | 156 | | |
152 | 157 | | |
153 | 158 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
23 | 23 | | |
24 | 24 | | |
25 | 25 | | |
| 26 | + | |
26 | 27 | | |
27 | 28 | | |
28 | 29 | | |
| |||
139 | 140 | | |
140 | 141 | | |
141 | 142 | | |
| 143 | + | |
142 | 144 | | |
143 | 145 | | |
144 | 146 | | |
| |||
0 commit comments