Commit d3be9ac
Add RayCluster support for DGX Cloud Lepton (#389)
* Add Ray support for DGX Cloud Lepton
Add support for launching a RayCluster on DGX Cloud Lepton and submitting
RayJobs to it using the Lepton SDK. This uses the new RayCluster feature
on DGX Cloud Lepton to create and tear down clusters dynamically via the
Python SDK, and to submit jobs directly to deployed clusters.
Signed-Off-By: Robert Clark <[email protected]>
* Make name unique and add resource shape for Lepton RayCluster
Signed-off-by: Zoey Zhang <[email protected]>
* Add head node reference in RayCluster, support defining secrets in RayCluster, and apply linting fixes
Signed-off-by: Zoey Zhang <[email protected]>
* Remove Slurm packager comments from RayCluster
Removed the placeholder Slurm packager handling comments from the Lepton
RayCluster code. For now, the "workdir" parameter should be used for
transferring local data to the remote Ray cluster.
Signed-Off-By: Robert Clark <[email protected]>
* Fix RayCluster head resource shape
Ensure the proper head node resource shape is used when the user does
not give one explicitly.
Signed-Off-By: Robert Clark <[email protected]>
* Update LeptonRay comments
Updated the comments in the LeptonRayCluster and LeptonRayJob classes to
accurately reflect the code.
Signed-Off-By: Robert Clark <[email protected]>
* Fix RayJob logs streaming connection dropping
The RayJob log stream would sometimes time out and reset, producing very
long log output in the terminal as it continually reset.
Signed-Off-By: Robert Clark <[email protected]>
* Make RayCluster head resource shape optional
The head node resource shape for a LeptonRayCluster should be optional. If
it isn't specified by the user, it should default to the same shape used
for the worker nodes.
Signed-Off-By: Robert Clark <[email protected]>
* Add doc for DGXC Lepton RayClusters
Added an example to the Ray quick-start guide on how to use RayClusters
and RayJobs with NeMo-Run on DGX Cloud Lepton.
Signed-Off-By: Robert Clark <[email protected]>
* Update license date
Signed-Off-By: Robert Clark <[email protected]>
* Fix Ray guide typo
Signed-Off-By: Robert Clark <[email protected]>
* Make cluster readiness timeout a variable
Allows users to specify how long to wait for a RayCluster to be created
on DGX Cloud Lepton.
Signed-Off-By: Robert Clark <[email protected]>
* Remove implicit returns
Signed-Off-By: Robert Clark <[email protected]>
* Remove unused local variable
Signed-Off-By: Robert Clark <[email protected]>
* Fix linting errors
Signed-Off-By: Robert Clark <[email protected]>
* Fix formatting errors
Signed-Off-By: Robert Clark <[email protected]>
* Move LeptonExecutor parameters to definition
Move the RayCluster-specific settings to the LeptonExecutor class for a
more seamless interface for launching and interacting with RayClusters
on DGX Cloud Lepton.
Signed-Off-By: Robert Clark <[email protected]>
* Updated leptonai package version
A newer version of the leptonai SDK is needed to support RayClusters.
Signed-Off-By: Robert Clark <[email protected]>
* Add Lepton RayCluster tests
Signed-Off-By: Robert Clark <[email protected]>
---------
Signed-off-by: Robert Clark <[email protected]>
Signed-off-by: Zoey Zhang <[email protected]>
Co-authored-by: Zoey Zhang <[email protected]>
1 parent e6c5a5e commit d3be9ac
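
Two behaviors described in the message above lend themselves to a short illustration: the head node resource shape falling back to the worker shape when it is not given, and the cluster readiness wait being bounded by a user-supplied timeout. The sketch below is a minimal stand-in under stated assumptions, not the code from this commit. The names RayClusterSettings, effective_head_shape, ready_timeout_s, and wait_until_ready are hypothetical, the default timeout value is invented for the example, and no leptonai SDK calls are used.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RayClusterSettings:
    """Hypothetical stand-in for the Ray-related executor settings."""

    resource_shape: str                        # shape used for the worker nodes
    head_resource_shape: Optional[str] = None  # optional shape for the head node
    ready_timeout_s: float = 600.0             # assumed default, not from the commit

    def effective_head_shape(self) -> str:
        # Fall back to the worker shape when no head shape was specified.
        return self.head_resource_shape or self.resource_shape


def wait_until_ready(
    is_ready: Callable[[], bool],
    timeout_s: float,
    poll_s: float = 1.0,
) -> bool:
    """Poll is_ready() until it returns True or timeout_s seconds elapse.

    Returns True if the cluster became ready in time, False otherwise.
    is_ready is a caller-supplied check (hypothetical here).
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if is_ready():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

With these assumptions, RayClusterSettings(resource_shape="worker-shape-a").effective_head_shape() evaluates to the worker shape, and wait_until_ready(check, settings.ready_timeout_s) gives up once the configured timeout has passed.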
File tree
8 files changed, +1776 -40 lines changed
- docs/guides
- nemo_run
  - core/execution
  - run/ray
- test/run/ray