
Commit f090ba1

Merge branch 'NVIDIA-NeMo:main' into patch-1
2 parents 655cf30 + 7fc5426 · commit f090ba1

File tree: 11 files changed (+2084, -267 lines)


CHANGELOG.md

Lines changed: 82 additions & 0 deletions
@@ -1,6 +1,88 @@
 # Changelog

 <!-- Next changelog -->
+## NVIDIA Nemo Run 0.7.0
+
+### Detailed Changelogs:
+
+#### Executors
+
+- Add image pull secrets param for lepton [#330](https://github.com/NVIDIA-NeMo/Run/pull/330)
+- Add node reservations for LeptonExecutor [#336](https://github.com/NVIDIA-NeMo/Run/pull/336)
+- [SkyPilot] Fix nodes -> num_nodes for SkyPilotExecutor in docs [#338](https://github.com/NVIDIA-NeMo/Run/pull/338)
+- [SkyPilot] Add retry_until_up as an optional arg to SkyPilot Executor [#340](https://github.com/NVIDIA-NeMo/Run/pull/340)
+- Support SkyPilot Storage configurations in `file_mounts` for automatic cloud sync [#335](https://github.com/NVIDIA-NeMo/Run/pull/335)
+- [SkyPilot] Update YAML dump imports + backward compatibility for SkyPilot <=0.10.3 [#339](https://github.com/NVIDIA-NeMo/Run/pull/339)
+- Create SkypilotJobsExecutor to allow running managed jobs [#343](https://github.com/NVIDIA-NeMo/Run/pull/343)
+- fix: exit code docker runs [#365](https://github.com/NVIDIA-NeMo/Run/pull/365)
+
+#### Ray Integration
+
+- Add ray head start timeout [#324](https://github.com/NVIDIA-NeMo/Run/pull/324)
+- Remove ray deprecated dashboard-grpc-port arg [#325](https://github.com/NVIDIA-NeMo/Run/pull/325)
+
+#### Experiment & Job Management
+
+- add a grace for Jobs that may start in Unknown [#291](https://github.com/NVIDIA-NeMo/Run/pull/291)
+- Create SkypilotJobsExecutor to allow running managed jobs [#343](https://github.com/NVIDIA-NeMo/Run/pull/343)
+
+#### Packaging & Deployment
+
+- Support SkyPilot Storage configurations in `file_mounts` for automatic cloud sync [#335](https://github.com/NVIDIA-NeMo/Run/pull/335)
+- Refactor tar packaging logic to work for submodule and extra repo [#347](https://github.com/NVIDIA-NeMo/Run/pull/347)
+
+#### Documentation
+
+- Add broken links check in docs [#333](https://github.com/NVIDIA-NeMo/Run/pull/333)
+- [SkyPilot] Fix nodes -> num_nodes for SkyPilotExecutor in docs [#338](https://github.com/NVIDIA-NeMo/Run/pull/338)
+- Documentation Restructurting [#350](https://github.com/NVIDIA-NeMo/Run/pull/350)
+- Fix spelling in docstring [#359](https://github.com/NVIDIA-NeMo/Run/pull/359)
+- fix: exit code docker runs [#365](https://github.com/NVIDIA-NeMo/Run/pull/365)
+
+#### CI/CD
+
+- Update cherry-pick workflow to use version 0.63.0 [#344](https://github.com/NVIDIA-NeMo/Run/pull/344)
+- fix: exit code docker runs [#365](https://github.com/NVIDIA-NeMo/Run/pull/365)
+
+#### Bug Fixes
+
+- [SkyPilot] Fix nodes -> num_nodes for SkyPilotExecutor in docs [#338](https://github.com/NVIDIA-NeMo/Run/pull/338)
+- Fix spelling in docstring [#359](https://github.com/NVIDIA-NeMo/Run/pull/359)
+- fix: exit code docker runs [#365](https://github.com/NVIDIA-NeMo/Run/pull/365)
+
+#### Others
+
+- chore: Bump to version 0.7.0rc0.dev0 [#322](https://github.com/NVIDIA-NeMo/Run/pull/322)
+- Update community-bot to add community issues to shared project [#321](https://github.com/NVIDIA-NeMo/Run/pull/321)
+- Bump community-bot to 0.54.4 [#332](https://github.com/NVIDIA-NeMo/Run/pull/332)
+- remove custom dir [#351](https://github.com/NVIDIA-NeMo/Run/pull/351)
+- Bumping to 0.5.0 [#352](https://github.com/NVIDIA-NeMo/Run/pull/352)
+- Update release notes header in changelog build [#355](https://github.com/NVIDIA-NeMo/Run/pull/355)
+- add changelog-config [#356](https://github.com/NVIDIA-NeMo/Run/pull/356)
+- Changelog 0.6.0 [#357](https://github.com/NVIDIA-NeMo/Run/pull/357)
+- feat: new changelog-build [#367](https://github.com/NVIDIA-NeMo/Run/pull/367)
 ## NVIDIA Nemo Run 0.6.0

 ### Detailed Changelogs:

docs/guides/ray.md

Lines changed: 76 additions & 4 deletions
@@ -25,17 +25,19 @@

 | Object | What it abstracts | Back-ends supported |
 |-------------|-------------------|---------------------|
-| `run.ray.cluster.RayCluster` | Lifecycle of a Ray **cluster** (create ⇒ wait ⇢ status ⇢ port-forward ⇢ delete). | `KubeRayExecutor`, `SlurmExecutor` |
+| `run.ray.cluster.RayCluster` | Lifecycle of a Ray **cluster** (create ⇒ wait ⇢ status ⇢ port-forward ⇢ delete). | `KubeRayExecutor`, `SlurmExecutor`, `LeptonExecutor` |
 | `run.ray.job.RayJob` | Lifecycle of a Ray **job** (submit ⇒ monitor ⇢ logs ⇢ cancel). | same |

-The two helpers share a uniform API; the chosen *Executor* decides whether we talk to the **KubeRay** operator (K8s) or a **Slurm** job under the hood.
+The two helpers share a uniform API; the chosen *Executor* decides whether we talk to the **KubeRay** operator (K8s), **DGX Cloud Lepton's RayCluster**, or a **Slurm** job under the hood.

 ```mermaid
 classDiagram
 RayCluster <|-- KubeRayCluster
 RayCluster <|-- SlurmRayCluster
+RayCluster <|-- LeptonRayCluster
 RayJob <|-- KubeRayJob
 RayJob <|-- SlurmRayJob
+RayJob <|-- LeptonRayJob
 ```

 ## 2. KubeRay quick-start
@@ -183,7 +185,77 @@ cluster.stop()
 * `executor.packager = run.GitArchivePackager()` if you prefer packaging a git tree instead of rsync.
 * `cluster.port_forward()` opens an SSH tunnel from *your laptop* to the Ray dashboard running on the head node.

-## 4. API reference cheat-sheet
+## 4. DGX Cloud Lepton RayCluster quick-start
+
+```python
+import os
+from pathlib import Path
+
+import nemo_run as run
+from nemo_run.core.execution.lepton import LeptonExecutor
+from nemo_run.run.ray.cluster import RayCluster
+from nemo_run.run.ray.job import RayJob
+
+# 1) Create a LeptonExecutor and tweak defaults
+mounts = [
+    {
+        "path": "/",
+        "mount_path": "/nemo-workspace",
+        "from": "node-nfs:lepton-shared-fs",
+    }
+]
+
+executor = LeptonExecutor(
+    resource_shape="gpu.8xh100",
+    container_image="rayproject/ray:2.49.2-gpu",
+    nemo_run_dir="/nemo-workspace/nemo-run",
+    head_resource_shape="cpu.large",
+    ray_version="2.49.2",
+    mounts=mounts,
+    node_group="my-node-group",
+    nodes=1,
+    nprocs_per_node=8,
+    env_vars={
+        "TORCH_HOME": "/nemo-workspace/.cache",
+    },
+    secret_vars=[
+        {"WANDB_API_KEY": "WANDB_API_KEY"},
+        {"HF_TOKEN": "HUGGING_FACE_HUB_TOKEN"},
+    ],
+    launcher="torchrun",
+    image_pull_secrets=[],
+    pre_launch_commands=[],
+)
+
+# 2) Bring up the RayCluster on DGX Cloud Lepton and show the status
+cluster = RayCluster(
+    name="lepton-ray-cluster",
+    executor=executor,
+)
+cluster.start(timeout=1800)
+cluster.status(display=True)
+
+# 3) Submit a RayJob that runs inside the created RayCluster
+job = RayJob(
+    name="demo-lepton-ray-job",
+    executor=executor,
+    cluster_name="lepton-ray-cluster",
+)
+job.start(
+    command="uv run python train.py --config cfgs/train.yaml cluster.num_nodes=2",
+    workdir="/path/to/project/",  # rsync'ed from local to the RayCluster
+)
+job.status(display=True)  # Display the RayJob status
+job.logs(follow=True)  # Tail the job logs as it runs
+
+# 4) Tear down the RayCluster and free up resources
+cluster.stop()
+```
+
+### Tips for DGX Cloud Lepton users
+* This assumes the [DGX Cloud Lepton CLI](https://docs.nvidia.com/dgx-cloud/lepton/reference/cli/get-started/) is installed and has been authenticated.
+
+## 5. API reference cheat-sheet

 ```python
 cluster = RayCluster(name, executor)
@@ -201,7 +273,7 @@ job.stop()

 All methods are synchronous and **return immediately** when their work is done; the helpers hide the messy details (kubectl, squeue, ssh, …).

-## 5. Rolling your own CLI
+## 6. Rolling your own CLI

 Because `RayCluster` and `RayJob` are plain Python, you can compose them inside **argparse**, **Typer**, **Click** – anything. Here is a minimal **argparse** script:
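The argparse script itself falls outside this hunk. Purely as a rough sketch (the subcommand names and the `build_executor` helper are hypothetical, and the actual script in the guide may differ), such a wrapper built only from the `RayCluster`/`RayJob` API shown above could look like:

```python
# Hypothetical CLI wrapper around RayCluster/RayJob; names and defaults are illustrative.
import argparse

from nemo_run.core.execution.lepton import LeptonExecutor
from nemo_run.run.ray.cluster import RayCluster
from nemo_run.run.ray.job import RayJob


def build_executor() -> LeptonExecutor:
    # Assumption: a minimal LeptonExecutor; adjust resource shapes, mounts, etc. for your setup.
    return LeptonExecutor(
        resource_shape="gpu.8xh100",
        container_image="rayproject/ray:2.49.2-gpu",
        node_group="my-node-group",
        nodes=1,
        nprocs_per_node=8,
    )


def main() -> None:
    parser = argparse.ArgumentParser(description="Manage a Ray cluster and jobs via NeMo Run")
    sub = parser.add_subparsers(dest="action", required=True)
    sub.add_parser("up")       # create the cluster
    sub.add_parser("down")     # tear it down
    submit = sub.add_parser("submit")
    submit.add_argument("--command", required=True)
    submit.add_argument("--workdir", default=".")

    args = parser.parse_args()
    executor = build_executor()
    cluster = RayCluster(name="cli-ray-cluster", executor=executor)

    if args.action == "up":
        cluster.start(timeout=1800)
        cluster.status(display=True)
    elif args.action == "down":
        cluster.stop()
    elif args.action == "submit":
        job = RayJob(name="cli-ray-job", executor=executor, cluster_name="cli-ray-cluster")
        job.start(command=args.command, workdir=args.workdir)
        job.logs(follow=True)


if __name__ == "__main__":
    main()
```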

nemo_run/core/execution/dgxcloud.py

Lines changed: 52 additions & 74 deletions
@@ -14,13 +14,12 @@
 # limitations under the License.

 import base64
+import glob
 import json
 import logging
 import os
-import queue
 import subprocess
 import tempfile
-import threading
 import time
 from dataclasses import dataclass, field
 from enum import Enum
@@ -323,7 +322,8 @@ def launch(self, name: str, cmd: list[str]) -> tuple[str, str]:
         launch_script = f"""
 ln -s {self.pvc_job_dir}/ /nemo_run
 cd /nemo_run/code
-{" ".join(cmd)}
+mkdir -p {self.pvc_job_dir}/logs
+{" ".join(cmd)} 2>&1 | tee -a {self.pvc_job_dir}/log_$HOSTNAME.out {self.pvc_job_dir}/log-allranks_0.out
 """
         with open(os.path.join(self.job_dir, "launch_script.sh"), "w+") as f:
             f.write(launch_script)
@@ -371,91 +371,69 @@ def status(self, job_id: str) -> Optional[DGXCloudState]:
         r_json = response.json()
         return DGXCloudState(r_json["phase"])

-    def _stream_url_sync(self, url: str, headers: dict, q: queue.Queue):
-        """Stream a single URL using requests and put chunks into the queue"""
-        try:
-            with requests.get(url, stream=True, headers=headers, verify=False) as response:
-                for line in response.iter_lines(decode_unicode=True):
-                    q.put((url, f"{line}\n"))
-        except Exception as e:
-            logger.error(f"Error streaming URL {url}: {e}")
-
-        finally:
-            q.put((url, None))
-
     def fetch_logs(
         self,
         job_id: str,
         stream: bool,
         stderr: Optional[bool] = None,
         stdout: Optional[bool] = None,
     ) -> Iterable[str]:
-        token = self.get_auth_token()
-        if not token:
-            logger.error("Failed to retrieve auth token for fetch logs request.")
-            yield ""
-
-        response = requests.get(
-            f"{self.base_url}/workloads", headers=self._default_headers(token=token)
-        )
-        workload_name = next(
-            (
-                workload["name"]
-                for workload in response.json()["workloads"]
-                if workload["id"] == job_id
-            ),
-            None,
-        )
-        if workload_name is None:
-            logger.error(f"No workload found with id {job_id}")
-            yield ""
-
-        urls = [
-            f"{self.kube_apiserver_url}/api/v1/namespaces/runai-{self.project_name}/pods/{workload_name}-worker-{i}/log?container=pytorch"
-            for i in range(self.nodes)
-        ]
-
-        if stream:
-            urls = [url + "&follow=true" for url in urls]
-
         while self.status(job_id) != DGXCloudState.RUNNING:
             logger.info("Waiting for job to start...")
             time.sleep(15)

-        time.sleep(10)
+        cmd = ["tail"]
+
+        if stream:
+            cmd.append("-f")

-        q = queue.Queue()
-        active_urls = set(urls)
+        # setting linked PVC job directory
+        nemo_run_home = get_nemorun_home()
+        job_subdir = self.job_dir[len(nemo_run_home) + 1 :]  # +1 to remove the initial backslash
+        self.pvc_job_dir = os.path.join(self.pvc_nemo_run_dir, job_subdir)

-        # Start threads
-        threads = [
-            threading.Thread(
-                target=self._stream_url_sync, args=(url, self._default_headers(token=token), q)
+        files = []
+        while len(files) < self.nodes:
+            files = list(glob.glob(f"{self.pvc_job_dir}/log_*.out"))
+            files = [f for f in files if "log-allranks_0" not in f]
+            logger.info(
+                f"Waiting for {self.nodes + 1 - len(files)} log files to be created in {self.pvc_job_dir}..."
             )
-            for url in urls
-        ]
-        for t in threads:
-            t.start()
-
-        # Yield chunks as they arrive
-        while active_urls:
-            url, item = q.get()
-            if item is None or self.status(job_id) in [
-                DGXCloudState.DELETING,
-                DGXCloudState.STOPPED,
-                DGXCloudState.STOPPING,
-                DGXCloudState.DEGRADED,
-                DGXCloudState.FAILED,
-                DGXCloudState.COMPLETED,
-                DGXCloudState.TERMINATING,
-            ]:
-                active_urls.discard(url)
-            else:
-                yield item
-
-        # Wait for threads
-        for t in threads:
-            t.join()
+            time.sleep(3)
+
+        cmd.extend(files)
+
+        logger.info(f"Attempting to stream logs with command: {cmd}")
+
+        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True, bufsize=1)
+
+        if stream:
+            while True:
+                try:
+                    for line in iter(proc.stdout.readline, ""):
+                        if (
+                            line
+                            and not line.rstrip("\n").endswith(".out <==")
+                            and line.rstrip("\n") != ""
+                        ):
+                            yield f"{line}"
+                    if proc.poll() is not None:
+                        break
+                except Exception as e:
+                    logger.error(f"Error streaming logs: {e}")
+                    time.sleep(3)
+                    continue
+
+        else:
+            try:
+                for line in iter(proc.stdout.readline, ""):
+                    if line:
+                        yield line.rstrip("\n")
+                    if proc.poll() is not None:
+                        break
+            finally:
+                proc.terminate()
+                proc.wait(timeout=2)

     def cancel(self, job_id: str):
         # Retrieve the authentication token for the REST calls
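For context, the reworked `fetch_logs` pairs with the `tee` added to the launch script: each node appends its output to `log_$HOSTNAME.out` on the shared PVC, and the executor then simply tails those files instead of streaming pod logs over the Kubernetes API. A self-contained sketch of that tail-and-filter pattern (the `tail_rank_logs` helper, paths, and node count are illustrative, not part of the executor):

```python
# Standalone sketch of the tee-then-tail log pattern; not the actual executor code.
import glob
import subprocess
import time
from typing import Iterator


def tail_rank_logs(log_dir: str, expected_files: int, follow: bool = True) -> Iterator[str]:
    """Wait for per-rank log files to appear, then yield their lines via `tail`."""
    files: list[str] = []
    while len(files) < expected_files:
        # Each rank writes log_<hostname>.out; skip the combined all-ranks file.
        files = [f for f in glob.glob(f"{log_dir}/log_*.out") if "log-allranks_0" not in f]
        time.sleep(3)

    cmd = ["tail"] + (["-f"] if follow else []) + files
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True, bufsize=1)
    try:
        for line in iter(proc.stdout.readline, ""):
            # Drop the "==> <file> <==" headers tail prints when given multiple files.
            if line.strip() and not line.rstrip("\n").endswith(".out <=="):
                yield line.rstrip("\n")
            if not follow and proc.poll() is not None:
                break
    finally:
        proc.terminate()
```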

nemo_run/core/execution/lepton.py

Lines changed: 2 additions & 0 deletions
@@ -81,6 +81,8 @@ class LeptonExecutor(Executor):
     )  # Image pull secrets for container registry authentication
     custom_spec: dict[str, Any] = field(default_factory=dict)
     pre_launch_commands: list[str] = field(default_factory=list)  # Custom commands before launch
+    head_resource_shape: Optional[str] = ""  # Only used for LeptonRayCluster
+    ray_version: Optional[str] = None  # Only used for LeptonRayCluster

     def stop_job(self, job_id: str):
         """

nemo_run/run/ray/cluster.py

Lines changed: 3 additions & 0 deletions
@@ -17,8 +17,10 @@
 from typing import Optional, Type

 from nemo_run.core.execution.base import Executor
+from nemo_run.core.execution.lepton import LeptonExecutor
 from nemo_run.core.execution.slurm import SlurmExecutor
 from nemo_run.core.frontend.console.api import configure_logging
+from nemo_run.run.ray.lepton import LeptonRayCluster
 from nemo_run.run.ray.slurm import SlurmRayCluster

 # Import guard for Kubernetes dependencies
@@ -43,6 +45,7 @@ def __post_init__(self):
         configure_logging(level=self.log_level)
         backend_map: dict[Type[Executor], Type] = {
             SlurmExecutor: SlurmRayCluster,
+            LeptonExecutor: LeptonRayCluster,
         }

         if _KUBERAY_AVAILABLE and KubeRayExecutor is not None and KubeRayCluster is not None:
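The change registers `LeptonExecutor` in the executor-to-backend map that `RayCluster` uses to pick its implementation. A hedged sketch of how such a lookup can be used for dispatch (the `resolve_backend` helper is an illustrative stand-in, not the actual RayCluster internals):

```python
# Hypothetical sketch of executor-to-backend dispatch, mirroring the backend_map above.
from typing import Type

from nemo_run.core.execution.base import Executor
from nemo_run.core.execution.lepton import LeptonExecutor
from nemo_run.core.execution.slurm import SlurmExecutor
from nemo_run.run.ray.lepton import LeptonRayCluster
from nemo_run.run.ray.slurm import SlurmRayCluster


def resolve_backend(executor: Executor) -> Type:
    """Pick the Ray cluster backend class that matches the executor type."""
    backend_map: dict[Type[Executor], Type] = {
        SlurmExecutor: SlurmRayCluster,
        LeptonExecutor: LeptonRayCluster,
    }
    for executor_cls, backend_cls in backend_map.items():
        if isinstance(executor, executor_cls):
            return backend_cls
    raise ValueError(f"No Ray backend registered for {type(executor).__name__}")
```

With this pattern, supporting a new executor only requires registering one more entry in the map.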
