docs/guides/execution.md
Lines changed: 17 additions & 1 deletion
@@ -3,6 +3,7 @@
After configuring NeMo-Run, the next step is to execute it. NeMo-Run decouples configuration from execution, allowing you to configure a function or task once and then execute it across multiple environments. With NeMo-Run, you can choose to execute a single task or multiple tasks simultaneously on different remote clusters, managing them under an experiment. This brings us to the core building blocks for execution: `run.Executor` and `run.Experiment`.

Each execution of a single configured task requires an executor. NeMo-Run provides `run.Executor`, a set of APIs to configure your remote executor and set up the packaging of your code. Currently we support:

- `run.LocalExecutor`
- `run.DockerExecutor`
- `run.SlurmExecutor` with an optional `SSHTunnel` for executing on Slurm clusters from your local machine
@@ -12,7 +13,7 @@ Each execution of a single configured task requires an executor. Nemo-Run provid
A tuple of a task and an executor forms an execution unit. A key goal of NeMo-Run is to allow you to mix and match tasks and executors to arbitrarily define execution units.

Once an execution unit is created, the next step is to run it. The `run.run` function executes a single task, whereas `run.Experiment` offers more fine-grained control to define complex experiments. `run.run` wraps `run.Experiment` with a single task. `run.Experiment` is an API to launch and manage multiple tasks, all using pure Python.
-The `run.Experiment` takes care of storing the run metadata, launching it on the specified cluster, and syncing the logs, etc. Additionally, `run.Experiment` also provides management tools to easily inspect and reproduce past experiments. The `run.Experiment` is inspired from [xmanager](https://github.com/google-deepmind/xmanager/tree/main) and uses [TorchX](https://pytorch.org/torchx/latest/) under the hood to handle execution.
+The `run.Experiment` takes care of storing the run metadata, launching it on the specified cluster, and syncing the logs, etc. Additionally, `run.Experiment` also provides management tools to easily inspect and reproduce past experiments. The `run.Experiment` is inspired from [xmanager](https://github.com/google-deepmind/xmanager/tree/main) and uses [TorchX](https://meta-pytorch.org/torchx/latest/) under the hood to handle execution.

```{note}
NeMo-Run assumes familiarity with Docker and uses a Docker image as the environment for remote execution. This means you must provide a Docker image that includes all necessary dependencies and configurations when using a remote executor.
@@ -23,12 +24,15 @@ All the experiment metadata is stored under `NEMORUN_HOME` env var on the machin
```

## Executors

Executors are dataclasses that configure your remote executor and set up the packaging of your code. All supported executors inherit from the base class `run.Executor`, but have configuration parameters specific to their execution environment. There is an initial cost to understanding the specifics of your executor and setting it up, but this effort is easily amortized over time.

Each `run.Executor` has two attributes: `packager` and `launcher`. The `packager` specifies how to package the code for execution, while the `launcher` determines which tool to use for launching the task.

### Launchers

We support the following `launchers`:

- `default` or `None`: This will directly launch your task without using any special launchers. Set `executor.launcher = None` (which is the default value) if you don't want to use a specific launcher.
- `torchrun` or `run.Torchrun`: This will launch the task using `torchrun`. See the `Torchrun` class for configuration options. Enable it with `executor.launcher = "torchrun"` or `executor.launcher = Torchrun(...)`.
- `ft` or `run.core.execution.FaultTolerance`: This will launch the task using NVIDIA's fault-tolerant launcher. See the `FaultTolerance` class for configuration options. Enable it with `executor.launcher = "ft"` or `executor.launcher = FaultTolerance(...)`.
@@ -54,28 +58,34 @@ The packager support matrix is described below:
`run.GitArchivePackager` uses `git archive` to package your code. Refer to the API reference for `run.GitArchivePackager` to see the exact mechanics of packaging using `git archive`.
At a high level, it works in the following way:

1. base_path = `git rev-parse --show-toplevel`.
2. Optionally define a subpath as `base_path/GitArchivePackager.subpath` by setting the `subpath` attribute on `GitArchivePackager`.

This extracted tar file becomes the working directory for your job. As an example, given the following directory structure with `subpath="src"`:

```
- docs
- src
  - your_library
- tests
```

Your working directory at the time of execution will look like:

```
- your_library
```

If you're executing a Python function, this working directory will automatically be included in your Python path.
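The steps above can be sketched as the shell commands they boil down to. This is a minimal sketch, not NeMo-Run's implementation: the helper name `git_archive_commands`, the `ref` and `output` parameters, and the `--format=tar.gz` choice are assumptions made for illustration. The API reference for `run.GitArchivePackager` remains the authority on the exact command.

```python
import shlex
from pathlib import PurePosixPath


def git_archive_commands(subpath: str = "", ref: str = "HEAD",
                         output: str = "code.tar.gz") -> list[str]:
    """Return illustrative shell commands for the packaging steps (hypothetical helper)."""
    # Step 1: locate the repository root (the base_path).
    find_root = "git rev-parse --show-toplevel"
    # Step 2: an optional subpath narrows the archive to one subtree.
    # A tree-ish such as "HEAD:src" names the tree at src/, so the archived
    # paths are relative to src/ rather than to the repository root.
    tree = f"{ref}:{PurePosixPath(subpath)}" if subpath else ref
    # Assumed archive invocation; only committed content is included.
    archive = (
        f"git archive --format=tar.gz --output={shlex.quote(output)} "
        f"{shlex.quote(tree)}"
    )
    return [find_root, archive]


commands = git_archive_commands(subpath="src")
for command in commands:
    print(command)
```

With `subpath="src"`, the second command archives the `HEAD:src` tree, which is why the extracted archive holds `your_library` at its top level rather than `src/your_library`. Both commands are meant to be run from the repository root.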
```{note}
Git archive doesn't package uncommitted changes. In the future, we may add support for including uncommitted changes while honoring `.gitignore`.
```

`run.PatternPackager` is a packager that uses a pattern to package your code. It is useful for packaging code that is not under version control. For example, if you have a directory structure like this:

```
- docs
- src
@@ -94,6 +104,7 @@ cd {relative_path} && find {relative_include_pattern} -type f
Each sub-packager in the `sub_packagers` dictionary is assigned a key, which becomes the directory name under which its contents are placed in the final archive. If `extract_at_root` is set to `True`, all contents are placed directly in the root of the archive, potentially overwriting files if names conflict.

This would create an archive where the contents of `src` are under a `code/` directory and matched `configs/*.yaml` files are under a `configs/` directory.
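To make the effect of the keys and of `extract_at_root` concrete, here is a small self-contained model of the resulting archive layout. It is only a sketch under stated assumptions: `hybrid_layout` and the file names are hypothetical, it does not call NeMo-Run, and the rule that a later sub-packager overwrites an earlier one on a name conflict is an assumption of this model rather than documented behavior.

```python
from pathlib import PurePosixPath


def hybrid_layout(sub_archives: dict[str, list[str]],
                  extract_at_root: bool = False) -> list[str]:
    """Model where each sub-packager's files land in the combined archive."""
    placed: dict[str, str] = {}  # final path -> key of the sub-packager that wrote it
    for key, files in sub_archives.items():
        for name in files:
            # Without extract_at_root, the key becomes the top-level directory.
            target = name if extract_at_root else str(PurePosixPath(key) / name)
            # Assumption of this model: on a conflict, the later entry wins.
            placed[target] = key
    return sorted(placed)


# Hypothetical file lists produced by two sub-packagers.
sub_archives = {
    "code": ["your_library/__init__.py", "README.md"],
    "configs": ["train.yaml", "README.md"],
}
nested = hybrid_layout(sub_archives)
flat = hybrid_layout(sub_archives, extract_at_root=True)
print(nested)
print(flat)
```

In the nested layout both `README.md` files survive under `code/` and `configs/`. With `extract_at_root=True` they collide on a single `README.md`, which is the overwriting risk described above.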
### Defining Executors

Next, we describe how to set up each of the executors.

#### LocalExecutor
@@ -145,6 +158,7 @@ run.DockerExecutor(
The `SlurmExecutor` enables launching the configured task on a Slurm cluster with Pyxis. Additionally, you can configure a `run.SSHTunnel`, which enables you to execute tasks on the Slurm cluster from your local machine while NeMo-Run manages the SSH connection for you. This setup supports use cases such as launching the same task on multiple Slurm clusters.

Below is an example of configuring a `SlurmExecutor`:
@@ -205,9 +219,11 @@ The `dependency_type` parameter specifies the type of dependency relationship:
This functionality enables you to create complex workflows with proper orchestration between different tasks, such as starting a training job only after data preparation is complete, or running an evaluation only after training finishes successfully.

#### SkypilotExecutor

This executor is used to configure [SkyPilot](https://skypilot.readthedocs.io/en/latest/docs/index.html). Make sure SkyPilot is installed using `pip install "nemo_run[skypilot]"` and at least one cloud is configured using `sky check`.

Here's an example of the `SkypilotExecutor` for Kubernetes: