
Commit f090ba1

Merge branch 'NVIDIA-NeMo:main' into patch-1
2 parents 655cf30 + 7fc5426 · commit f090ba1

File tree: 11 files changed (+2084, -267 lines)


CHANGELOG.md

Lines changed: 82 additions & 0 deletions
@@ -1,6 +1,88 @@
 # Changelog

 <!-- Next changelog -->
+## NVIDIA Nemo Run 0.7.0
+
+### Detailed Changelogs:
+
+#### Executors
+
+- Add image pull secrets param for lepton [#330](https://github.com/NVIDIA-NeMo/Run/pull/330)
+- Add node reservations for LeptonExecutor [#336](https://github.com/NVIDIA-NeMo/Run/pull/336)
+- [SkyPilot] Fix nodes -> num_nodes for SkyPilotExecutor in docs [#338](https://github.com/NVIDIA-NeMo/Run/pull/338)
+- [SkyPilot] Add retry_until_up as an optional arg to SkyPilot Executor [#340](https://github.com/NVIDIA-NeMo/Run/pull/340)
+- Support SkyPilot Storage configurations in `file_mounts` for automatic cloud sync [#335](https://github.com/NVIDIA-NeMo/Run/pull/335)
+- [SkyPilot] Update YAML dump imports + backward compatibility for SkyPilot <=0.10.3 [#339](https://github.com/NVIDIA-NeMo/Run/pull/339)
+- Create SkypilotJobsExecutor to allow running managed jobs [#343](https://github.com/NVIDIA-NeMo/Run/pull/343)
+- fix: exit code docker runs [#365](https://github.com/NVIDIA-NeMo/Run/pull/365)
+
+#### Ray Integration
+
+- Add ray head start timeout [#324](https://github.com/NVIDIA-NeMo/Run/pull/324)
+- Remove ray deprecated dashboard-grpc-port arg [#325](https://github.com/NVIDIA-NeMo/Run/pull/325)
+
+#### Experiment & Job Management
+
+- add a grace for Jobs that may start in Unknown [#291](https://github.com/NVIDIA-NeMo/Run/pull/291)
+- Create SkypilotJobsExecutor to allow running managed jobs [#343](https://github.com/NVIDIA-NeMo/Run/pull/343)
+
+#### Packaging & Deployment
+
+- Support SkyPilot Storage configurations in `file_mounts` for automatic cloud sync [#335](https://github.com/NVIDIA-NeMo/Run/pull/335)
+- Refactor tar packaging logic to work for submodule and extra repo [#347](https://github.com/NVIDIA-NeMo/Run/pull/347)
+
+#### Documentation
+
+- Add broken links check in docs [#333](https://github.com/NVIDIA-NeMo/Run/pull/333)
+- [SkyPilot] Fix nodes -> num_nodes for SkyPilotExecutor in docs [#338](https://github.com/NVIDIA-NeMo/Run/pull/338)
+- Documentation Restructurting [#350](https://github.com/NVIDIA-NeMo/Run/pull/350)
+- Fix spelling in docstring [#359](https://github.com/NVIDIA-NeMo/Run/pull/359)
+- fix: exit code docker runs [#365](https://github.com/NVIDIA-NeMo/Run/pull/365)
+
+#### CI/CD
+
+- Update cherry-pick workflow to use version 0.63.0 [#344](https://github.com/NVIDIA-NeMo/Run/pull/344)
+- fix: exit code docker runs [#365](https://github.com/NVIDIA-NeMo/Run/pull/365)
+
+#### Bug Fixes
+
+- [SkyPilot] Fix nodes -> num_nodes for SkyPilotExecutor in docs [#338](https://github.com/NVIDIA-NeMo/Run/pull/338)
+- Fix spelling in docstring [#359](https://github.com/NVIDIA-NeMo/Run/pull/359)
+- fix: exit code docker runs [#365](https://github.com/NVIDIA-NeMo/Run/pull/365)
+
+#### Others
+
+- chore: Bump to version 0.7.0rc0.dev0 [#322](https://github.com/NVIDIA-NeMo/Run/pull/322)
+- Update community-bot to add community issues to shared project [#321](https://github.com/NVIDIA-NeMo/Run/pull/321)
+- Bump community-bot to 0.54.4 [#332](https://github.com/NVIDIA-NeMo/Run/pull/332)
+- remove custom dir [#351](https://github.com/NVIDIA-NeMo/Run/pull/351)
+- Bumping to 0.5.0 [#352](https://github.com/NVIDIA-NeMo/Run/pull/352)
+- Update release notes header in changelog build [#355](https://github.com/NVIDIA-NeMo/Run/pull/355)
+- add changelog-config [#356](https://github.com/NVIDIA-NeMo/Run/pull/356)
+- Changelog 0.6.0 [#357](https://github.com/NVIDIA-NeMo/Run/pull/357)
+- feat: new changelog-build [#367](https://github.com/NVIDIA-NeMo/Run/pull/367)
 ## NVIDIA Nemo Run 0.6.0

 ### Detailed Changelogs:

docs/guides/ray.md

Lines changed: 76 additions & 4 deletions
@@ -25,17 +25,19 @@

 | Object | What it abstracts | Back-ends supported |
 |-------------|-------------------|---------------------|
-| `run.ray.cluster.RayCluster` | Lifecycle of a Ray **cluster** (create ⇒ wait ⇢ status ⇢ port-forward ⇢ delete). | `KubeRayExecutor`, `SlurmExecutor` |
+| `run.ray.cluster.RayCluster` | Lifecycle of a Ray **cluster** (create ⇒ wait ⇢ status ⇢ port-forward ⇢ delete). | `KubeRayExecutor`, `SlurmExecutor`, `LeptonExecutor` |
 | `run.ray.job.RayJob` | Lifecycle of a Ray **job** (submit ⇒ monitor ⇢ logs ⇢ cancel). | same |

-The two helpers share a uniform API; the chosen *Executor* decides whether we talk to the **KubeRay** operator (K8s) or a **Slurm** job under the hood.
+The two helpers share a uniform API; the chosen *Executor* decides whether we talk to the **KubeRay** operator (K8s), **DGX Cloud Lepton's RayCluster**, or a **Slurm** job under the hood.

 ```mermaid
 classDiagram
 RayCluster <|-- KubeRayCluster
 RayCluster <|-- SlurmRayCluster
+RayCluster <|-- LeptonRayCluster
 RayJob <|-- KubeRayJob
 RayJob <|-- SlurmRayJob
+RayJob <|-- LeptonRayJob
 ```

 ## 2. KubeRay quick-start
@@ -183,7 +185,77 @@ cluster.stop()
 * `executor.packager = run.GitArchivePackager()` if you prefer packaging a git tree instead of rsync.
 * `cluster.port_forward()` opens an SSH tunnel from *your laptop* to the Ray dashboard running on the head node.

-## 4. API reference cheat-sheet
+## 4. DGX Cloud Lepton RayCluster quick-start
+
+```python
+import os
+from pathlib import Path
+
+import nemo_run as run
+from nemo_run.core.execution.lepton import LeptonExecutor
+from nemo_run.run.ray.cluster import RayCluster
+from nemo_run.run.ray.job import RayJob
+
+# 1) Create a LeptonExecutor and tweak defaults
+mounts = [
+    {
+        "path": "/",
+        "mount_path": "/nemo-workspace",
+        "from": "node-nfs:lepton-shared-fs",
+    }
+]
+
+executor = LeptonExecutor(
+    resource_shape="gpu.8xh100",
+    container_image="rayproject/ray:2.49.2-gpu",
+    nemo_run_dir="/nemo-workspace/nemo-run",
+    head_resource_shape="cpu.large",
+    ray_version="2.49.2",
+    mounts=mounts,
+    node_group="my-node-group",
+    nodes=1,
+    nprocs_per_node=8,
+    env_vars={
+        "TORCH_HOME": "/nemo-workspace/.cache",
+    },
+    secret_vars=[
+        {"WANDB_API_KEY": "WANDB_API_KEY"},
+        {"HF_TOKEN": "HUGGING_FACE_HUB_TOKEN"},
+    ],
+    launcher="torchrun",
+    image_pull_secrets=[],
+    pre_launch_commands=[],
+)
+
+# 2) Bring up the RayCluster on DGX Cloud Lepton and show the status
+cluster = RayCluster(
+    name="lepton-ray-cluster",
+    executor=executor,
+)
+cluster.start(timeout=1800)
+cluster.status(display=True)
+
+# 3) Submit a RayJob that runs inside the created RayCluster
+job = RayJob(
+    name="demo-lepton-ray-job",
+    executor=executor,
+    cluster_name="lepton-ray-cluster",
+)
+job.start(
+    command="uv run python train.py --config cfgs/train.yaml cluster.num_nodes=2",
+    workdir="/path/to/project/",  # rsync'ed from local to the RayCluster
+)
+job.status(display=True)  # Display the RayJob status
+job.logs(follow=True)  # Tail the job logs as it runs
+
+# 4) Tear down the RayCluster and free up resources
+cluster.stop()
+```
+
+### Tips for DGX Cloud Lepton users
+* This assumes the [DGX Cloud Lepton CLI](https://docs.nvidia.com/dgx-cloud/lepton/reference/cli/get-started/) is installed and has been authenticated.
+
+## 5. API reference cheat-sheet

 ```python
 cluster = RayCluster(name, executor)
@@ -201,7 +273,7 @@ job.stop()

 All methods are synchronous and **return immediately** when their work is done; the helpers hide the messy details (kubectl, squeue, ssh, …).

-## 5. Rolling your own CLI
+## 6. Rolling your own CLI

 Because `RayCluster` and `RayJob` are plain Python, you can compose them inside **argparse**, **Typer**, **Click** – anything. Here is a minimal **argparse** script:
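The argparse script itself falls outside this hunk. Purely as a rough sketch (the subcommand names and the `build_executor` helper are hypothetical, and the actual script in the guide may differ), such a wrapper built only from the `RayCluster`/`RayJob` API shown above could look like:

```python
# Hypothetical CLI wrapper around RayCluster/RayJob; names and defaults are illustrative.
import argparse

from nemo_run.core.execution.lepton import LeptonExecutor
from nemo_run.run.ray.cluster import RayCluster
from nemo_run.run.ray.job import RayJob


def build_executor() -> LeptonExecutor:
    # Assumption: a minimal LeptonExecutor; adjust resource shapes, mounts, etc. for your setup.
    return LeptonExecutor(
        resource_shape="gpu.8xh100",
        container_image="rayproject/ray:2.49.2-gpu",
        node_group="my-node-group",
        nodes=1,
        nprocs_per_node=8,
    )


def main() -> None:
    parser = argparse.ArgumentParser(description="Manage a Ray cluster and jobs via NeMo Run")
    sub = parser.add_subparsers(dest="action", required=True)
    sub.add_parser("up")       # create the cluster
    sub.add_parser("down")     # tear it down
    submit = sub.add_parser("submit")
    submit.add_argument("--command", required=True)
    submit.add_argument("--workdir", default=".")

    args = parser.parse_args()
    executor = build_executor()
    cluster = RayCluster(name="cli-ray-cluster", executor=executor)

    if args.action == "up":
        cluster.start(timeout=1800)
        cluster.status(display=True)
    elif args.action == "down":
        cluster.stop()
    elif args.action == "submit":
        job = RayJob(name="cli-ray-job", executor=executor, cluster_name="cli-ray-cluster")
        job.start(command=args.command, workdir=args.workdir)
        job.logs(follow=True)


if __name__ == "__main__":
    main()
```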

nemo_run/core/execution/dgxcloud.py

Lines changed: 52 additions & 74 deletions
@@ -14,13 +14,12 @@
 # limitations under the License.

 import base64
+import glob
 import json
 import logging
 import os
-import queue
 import subprocess
 import tempfile
-import threading
 import time
 from dataclasses import dataclass, field
 from enum import Enum
@@ -323,7 +322,8 @@ def launch(self, name: str, cmd: list[str]) -> tuple[str, str]:
         launch_script = f"""
 ln -s {self.pvc_job_dir}/ /nemo_run
 cd /nemo_run/code
-{" ".join(cmd)}
+mkdir -p {self.pvc_job_dir}/logs
+{" ".join(cmd)} 2>&1 | tee -a {self.pvc_job_dir}/log_$HOSTNAME.out {self.pvc_job_dir}/log-allranks_0.out
 """
         with open(os.path.join(self.job_dir, "launch_script.sh"), "w+") as f:
             f.write(launch_script)
@@ -371,91 +371,69 @@ def status(self, job_id: str) -> Optional[DGXCloudState]:
         r_json = response.json()
         return DGXCloudState(r_json["phase"])

-    def _stream_url_sync(self, url: str, headers: dict, q: queue.Queue):
-        """Stream a single URL using requests and put chunks into the queue"""
-        try:
-            with requests.get(url, stream=True, headers=headers, verify=False) as response:
-                for line in response.iter_lines(decode_unicode=True):
-                    q.put((url, f"{line}\n"))
-        except Exception as e:
-            logger.error(f"Error streaming URL {url}: {e}")
-
-        finally:
-            q.put((url, None))
-
     def fetch_logs(
         self,
         job_id: str,
         stream: bool,
         stderr: Optional[bool] = None,
         stdout: Optional[bool] = None,
     ) -> Iterable[str]:
-        token = self.get_auth_token()
-        if not token:
-            logger.error("Failed to retrieve auth token for fetch logs request.")
-            yield ""
-
-        response = requests.get(
-            f"{self.base_url}/workloads", headers=self._default_headers(token=token)
-        )
-        workload_name = next(
-            (
-                workload["name"]
-                for workload in response.json()["workloads"]
-                if workload["id"] == job_id
-            ),
-            None,
-        )
-        if workload_name is None:
-            logger.error(f"No workload found with id {job_id}")
-            yield ""
-
-        urls = [
-            f"{self.kube_apiserver_url}/api/v1/namespaces/runai-{self.project_name}/pods/{workload_name}-worker-{i}/log?container=pytorch"
-            for i in range(self.nodes)
-        ]
-
-        if stream:
-            urls = [url + "&follow=true" for url in urls]
-
         while self.status(job_id) != DGXCloudState.RUNNING:
             logger.info("Waiting for job to start...")
             time.sleep(15)

-        time.sleep(10)
+        cmd = ["tail"]
+
+        if stream:
+            cmd.append("-f")

-        q = queue.Queue()
-        active_urls = set(urls)
+        # setting linked PVC job directory
+        nemo_run_home = get_nemorun_home()
+        job_subdir = self.job_dir[len(nemo_run_home) + 1 :]  # +1 to remove the initial backslash
+        self.pvc_job_dir = os.path.join(self.pvc_nemo_run_dir, job_subdir)

-        # Start threads
-        threads = [
-            threading.Thread(
-                target=self._stream_url_sync, args=(url, self._default_headers(token=token), q)
+        files = []
+        while len(files) < self.nodes:
+            files = list(glob.glob(f"{self.pvc_job_dir}/log_*.out"))
+            files = [f for f in files if "log-allranks_0" not in f]
+            logger.info(
+                f"Waiting for {self.nodes + 1 - len(files)} log files to be created in {self.pvc_job_dir}..."
             )
-            for url in urls
-        ]
-        for t in threads:
-            t.start()
-
-        # Yield chunks as they arrive
-        while active_urls:
-            url, item = q.get()
-            if item is None or self.status(job_id) in [
-                DGXCloudState.DELETING,
-                DGXCloudState.STOPPED,
-                DGXCloudState.STOPPING,
-                DGXCloudState.DEGRADED,
-                DGXCloudState.FAILED,
-                DGXCloudState.COMPLETED,
-                DGXCloudState.TERMINATING,
-            ]:
-                active_urls.discard(url)
-            else:
-                yield item
-
-        # Wait for threads
-        for t in threads:
-            t.join()
+            time.sleep(3)
+
+        cmd.extend(files)
+
+        logger.info(f"Attempting to stream logs with command: {cmd}")
+
+        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True, bufsize=1)
+
+        if stream:
+            while True:
+                try:
+                    for line in iter(proc.stdout.readline, ""):
+                        if (
+                            line
+                            and not line.rstrip("\n").endswith(".out <==")
+                            and line.rstrip("\n") != ""
+                        ):
+                            yield f"{line}"
+                    if proc.poll() is not None:
+                        break
+                except Exception as e:
+                    logger.error(f"Error streaming logs: {e}")
+                    time.sleep(3)
+                    continue
+
+        else:
+            try:
+                for line in iter(proc.stdout.readline, ""):
+                    if line:
+                        yield line.rstrip("\n")
+                    if proc.poll() is not None:
+                        break
+            finally:
+                proc.terminate()
+                proc.wait(timeout=2)

     def cancel(self, job_id: str):
         # Retrieve the authentication token for the REST calls
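For context, the reworked `fetch_logs` pairs with the `tee` added to the launch script: each node appends its output to `log_$HOSTNAME.out` on the shared PVC, and the executor then simply tails those files instead of streaming pod logs over the Kubernetes API. A self-contained sketch of that tail-and-filter pattern (the `tail_rank_logs` helper, paths, and node count are illustrative, not part of the executor):

```python
# Standalone sketch of the tee-then-tail log pattern; not the actual executor code.
import glob
import subprocess
import time
from typing import Iterator


def tail_rank_logs(log_dir: str, expected_files: int, follow: bool = True) -> Iterator[str]:
    """Wait for per-rank log files to appear, then yield their lines via `tail`."""
    files: list[str] = []
    while len(files) < expected_files:
        # Each rank writes log_<hostname>.out; skip the combined all-ranks file.
        files = [f for f in glob.glob(f"{log_dir}/log_*.out") if "log-allranks_0" not in f]
        time.sleep(3)

    cmd = ["tail"] + (["-f"] if follow else []) + files
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True, bufsize=1)
    try:
        for line in iter(proc.stdout.readline, ""):
            # Drop the "==> <file> <==" headers tail prints when given multiple files.
            if line.strip() and not line.rstrip("\n").endswith(".out <=="):
                yield line.rstrip("\n")
            if not follow and proc.poll() is not None:
                break
    finally:
        proc.terminate()
```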

nemo_run/core/execution/lepton.py

Lines changed: 2 additions & 0 deletions
@@ -81,6 +81,8 @@ class LeptonExecutor(Executor):
     )  # Image pull secrets for container registry authentication
     custom_spec: dict[str, Any] = field(default_factory=dict)
     pre_launch_commands: list[str] = field(default_factory=list)  # Custom commands before launch
+    head_resource_shape: Optional[str] = ""  # Only used for LeptonRayCluster
+    ray_version: Optional[str] = None  # Only used for LeptonRayCluster

     def stop_job(self, job_id: str):
         """

nemo_run/run/ray/cluster.py

Lines changed: 3 additions & 0 deletions
@@ -17,8 +17,10 @@
 from typing import Optional, Type

 from nemo_run.core.execution.base import Executor
+from nemo_run.core.execution.lepton import LeptonExecutor
 from nemo_run.core.execution.slurm import SlurmExecutor
 from nemo_run.core.frontend.console.api import configure_logging
+from nemo_run.run.ray.lepton import LeptonRayCluster
 from nemo_run.run.ray.slurm import SlurmRayCluster

 # Import guard for Kubernetes dependencies
@@ -43,6 +45,7 @@ def __post_init__(self):
         configure_logging(level=self.log_level)
         backend_map: dict[Type[Executor], Type] = {
             SlurmExecutor: SlurmRayCluster,
+            LeptonExecutor: LeptonRayCluster,
         }

         if _KUBERAY_AVAILABLE and KubeRayExecutor is not None and KubeRayCluster is not None:
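The change registers `LeptonExecutor` in the executor-to-backend map that `RayCluster` uses to pick its implementation. A hedged sketch of how such a lookup can be used for dispatch (the `resolve_backend` helper is an illustrative stand-in, not the actual RayCluster internals):

```python
# Hypothetical sketch of executor-to-backend dispatch, mirroring the backend_map above.
from typing import Type

from nemo_run.core.execution.base import Executor
from nemo_run.core.execution.lepton import LeptonExecutor
from nemo_run.core.execution.slurm import SlurmExecutor
from nemo_run.run.ray.lepton import LeptonRayCluster
from nemo_run.run.ray.slurm import SlurmRayCluster


def resolve_backend(executor: Executor) -> Type:
    """Pick the Ray cluster backend class that matches the executor type."""
    backend_map: dict[Type[Executor], Type] = {
        SlurmExecutor: SlurmRayCluster,
        LeptonExecutor: LeptonRayCluster,
    }
    for executor_cls, backend_cls in backend_map.items():
        if isinstance(executor, executor_cls):
            return backend_cls
    raise ValueError(f"No Ray backend registered for {type(executor).__name__}")
```

With this pattern, supporting a new executor only requires registering one more entry in the map.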
