Commit 78f54ee

Add LeptonExecutor support (#224)
* Add support for NVIDIA DGX Cloud Lepton

  NVIDIA DGX Cloud Lepton is another platform available for launching distributed jobs using NeMo-Run. The new LeptonExecutor leverages the Lepton Python SDK to authenticate with a DGX Cloud Lepton cluster and launch jobs on available resources.

  Signed-Off-By: Robert Clark <[email protected]>

* Add custom mounts to Lepton batch jobs

  Allow users to specify custom mounts using Lepton's Filesystem functionality.

  Signed-Off-By: Robert Clark <[email protected]>

* Add error handling to LeptonExecutor

  Handle more possible failure scenarios for the LeptonExecutor where the code could run into a bad state and the user should be alerted with helpful debug info.

  Signed-Off-By: Robert Clark <[email protected]>

* Use a low-resource pod to move data to Lepton

  Running a low-resource pod to copy experiment data to the Lepton cluster is more reliable and broadly compatible with various cluster types versus the storage API.

  Signed-Off-By: Robert Clark <[email protected]>

* Add unit tests to LeptonExecutor

  Signed-Off-By: Robert Clark <[email protected]>

---------

Signed-off-by: Robert Clark <[email protected]>
1 parent fb86cd0 commit 78f54ee

13 files changed: 1338 additions and 5 deletions


docs/source/guides/execution.md

Lines changed: 39 additions & 0 deletions
@@ -7,6 +7,7 @@ Each execution of a single configured task requires an executor. Nemo-Run provid
 - `run.DockerExecutor`
 - `run.SlurmExecutor` with an optional `SSHTunnel` for executing on Slurm clusters from your local machine
 - `run.SkypilotExecutor` (available under the optional feature `skypilot` in the python package).
+- `run.LeptonExecutor`
 
 A tuple of task and executor form an execution unit. A key goal of NeMo-Run is to allow you to mix and match tasks and executors to arbitrarily define execution units.
 
@@ -41,6 +42,7 @@ The packager support matrix is described below:
 | SlurmExecutor | run.Packager, run.GitArchivePackager, run.PatternPackager, run.HybridPackager |
 | SkypilotExecutor | run.Packager, run.GitArchivePackager, run.PatternPackager, run.HybridPackager |
 | DGXCloudExecutor | run.Packager, run.GitArchivePackager, run.PatternPackager, run.HybridPackager |
+| LeptonExecutor | run.Packager, run.GitArchivePackager, run.PatternPackager, run.HybridPackager |
 
 `run.Packager` is a passthrough base packager.
 
@@ -264,3 +266,40 @@ def your_dgx_executor(nodes: int, gpus_per_node: int, container_image: str):
 ```
 
 For a complete end-to-end example using DGX Cloud with NeMo, refer to the [NVIDIA DGX Cloud NeMo End-to-End Workflow Example](https://docs.nvidia.com/dgx-cloud/run-ai/latest/nemo-e2e-example.html).
+
+#### LeptonExecutor
+
+The `LeptonExecutor` integrates with an NVIDIA DGX Cloud Lepton cluster's Python SDK to launch distributed jobs. It uses API calls behind the Lepton SDK to authenticate, identify the target node group and resource shapes, and submit the job specification which will be launched as a batch job on the cluster.
+
+Here's an example configuration:
+
+```python
+def your_lepton_executor(nodes: int, gpus_per_node: int, container_image: str):
+    # Ensure these are set correctly for your DGX Cloud environment
+    # You might fetch these from environment variables or a config file
+    resource_shape = "gpu.8xh100-80gb" # Replace with your desired resource shape representing the number of GPUs in a pod
+    node_group = "my-node-group" # The node group to run the job in
+    nemo_run_dir = "/nemo-workspace/nemo-run" # The NeMo-Run directory where experiments are saved
+    # Define the remote storage directory that will be mounted in the job pods
+    # Ensure the path specified here contains your NEMORUN_HOME
+    storage_path = "/nemo-workspace" # The remote storage directory to mount in jobs
+    mount_path = "/nemo-workspace" # The path where the remote storage directory will be mounted inside the container
+
+    executor = run.LeptonExecutor(
+        resource_shape=resource_shape,
+        node_group=node_group,
+        container_image=container_image,
+        nodes=nodes,
+        nemo_run_dir=nemo_run_dir,
+        gpus_per_node=gpus_per_node,
+        mounts=[{"path": storage_path, "mount_path": mount_path}],
+        # Optional: Add custom environment variables or PyTorch specs if needed
+        env_vars=common_envs(),
+        # packager=run.GitArchivePackager() # Choose appropriate packager
+    )
+    return executor
+
+# Example usage:
+executor = your_lepton_executor(nodes=4, gpus_per_node=8, container_image="your-nemo-image")
+
+```
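
The added example asks that the mounted storage path contain `NEMORUN_HOME`. A minimal pre-flight sketch of that check follows, using only the standard library. `nemo_run_dir_is_covered` is a hypothetical helper and not part of NeMo-Run; the `path` and `mount_path` keys are taken from the `mounts` entry in the example, and comparing against `path` by default is an assumption based on the comment placed above `storage_path`.

```python
from pathlib import PurePosixPath


def nemo_run_dir_is_covered(nemo_run_dir: str, mounts: list[dict], key: str = "path") -> bool:
    """Return True if nemo_run_dir equals or sits below one of the mount entries.

    Hypothetical helper for illustration only; `key` selects which field of each
    mount dict to compare against ("path" or "mount_path").
    """
    target = PurePosixPath(nemo_run_dir)
    for mount in mounts:
        root = PurePosixPath(mount[key])
        # Compare whole path components so "/nemo-workspace-other" does not
        # count as being inside "/nemo-workspace".
        if target == root or root in target.parents:
            return True
    return False


# Values copied from the example configuration in the diff.
mounts = [{"path": "/nemo-workspace", "mount_path": "/nemo-workspace"}]
print(nemo_run_dir_is_covered("/nemo-workspace/nemo-run", mounts))
print(nemo_run_dir_is_covered("/scratch/nemo-run", mounts))
```

Running a check like this before constructing the executor would surface a mismatched `nemo_run_dir` locally, before a job is submitted.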

docs/source/guides/why-use-nemo-run.md

Lines changed: 1 addition & 0 deletions
@@ -29,6 +29,7 @@ But once defined, it is seamless to launch your tasks. Currently, we support the
 - LocalExecutor
 - SlurmExecutor
 - SkypilotExecutor
+- LeptonExecutor
 
 This means that you can launch your configured task on one slurm cluster or the other, on a Kubernetes cluster, on one cloud or the other, or on all of them at the same time.
 

docs/source/index.rst

Lines changed: 6 additions & 0 deletions
@@ -38,6 +38,12 @@ will install Skypilot w all clouds
 
 You can also manually install Skypilot from https://skypilot.readthedocs.io/en/latest/getting-started/installation.html
 
+If using DGX Cloud Lepton, use the following command to install the Lepton CLI:
+
+``pip install leptonai``
+
+To authenticate with the DGX Cloud Lepton cluster, navigate to the **Settings > Tokens** page in the DGX Cloud Lepton UI and copy the ``lep login`` command shown on the page and run it in the terminal.
+
 Make sure you have `pip` installed and configured properly.
 
 
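
The install and login steps added to `index.rst` can be sanity-checked locally before building a `LeptonExecutor`. The sketch below uses only the standard library. `lepton_tooling_status` is a hypothetical helper, and it assumes the importable module name matches the pip package name `leptonai`. It only reports whether the package and the `lep` command can be found; it does not verify that `lep login` has been run.

```python
import importlib.util
import shutil


def lepton_tooling_status() -> dict[str, bool]:
    """Report whether the leptonai package and the lep CLI can be found locally.

    Hypothetical helper for illustration only; it does not check authentication.
    """
    return {
        # find_spec returns None when a top-level module is not installed.
        "leptonai_package": importlib.util.find_spec("leptonai") is not None,
        # shutil.which returns None when the command is not on PATH.
        "lep_cli": shutil.which("lep") is not None,
    }


status = lepton_tooling_status()
for name, found in status.items():
    print(f"{name}: {'found' if found else 'missing'}")
```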

nemo_run/__init__.py

Lines changed: 3 additions & 1 deletion
@@ -19,19 +19,20 @@
 
 from nemo_run import cli
 from nemo_run.api import autoconvert, dryrun_fn
+from nemo_run.cli.lazy import LazyEntrypoint, lazy_imports
 from nemo_run.config import Config, ConfigurableMixin, Partial, Script
 from nemo_run.core.execution.base import Executor, ExecutorMacros, import_executor
 from nemo_run.core.execution.dgxcloud import DGXCloudExecutor
 from nemo_run.core.execution.docker import DockerExecutor
 from nemo_run.core.execution.launcher import FaultTolerance, SlurmRay, SlurmTemplate, Torchrun
+from nemo_run.core.execution.lepton import LeptonExecutor
 from nemo_run.core.execution.local import LocalExecutor
 from nemo_run.core.execution.skypilot import SkypilotExecutor
 from nemo_run.core.execution.slurm import SlurmExecutor
 from nemo_run.core.packaging import GitArchivePackager, HybridPackager, Packager, PatternPackager
 from nemo_run.core.tunnel.client import LocalTunnel, SSHTunnel
 from nemo_run.devspace.base import DevSpace
 from nemo_run.help import help
-from nemo_run.cli.lazy import LazyEntrypoint, lazy_imports
 from nemo_run.package_info import __package_name__, __version__
 from nemo_run.run.api import run
 from nemo_run.run.experiment import Experiment
@@ -58,6 +59,7 @@
     "GitArchivePackager",
     "PatternPackager",
     "help",
+    "LeptonExecutor",
     "LocalExecutor",
     "LocalTunnel",
     "Packager",

nemo_run/core/execution/__init__.py

Lines changed: 9 additions & 2 deletions
@@ -13,9 +13,16 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
+from nemo_run.core.execution.dgxcloud import DGXCloudExecutor
+from nemo_run.core.execution.lepton import LeptonExecutor
 from nemo_run.core.execution.local import LocalExecutor
 from nemo_run.core.execution.skypilot import SkypilotExecutor
 from nemo_run.core.execution.slurm import SlurmExecutor
-from nemo_run.core.execution.dgxcloud import DGXCloudExecutor
 
-__all__ = ["LocalExecutor", "SlurmExecutor", "SkypilotExecutor", "DGXCloudExecutor"]
+__all__ = [
+    "LocalExecutor",
+    "SlurmExecutor",
+    "SkypilotExecutor",
+    "DGXCloudExecutor",
+    "LeptonExecutor",
+]
