# Execute NeMo Run
After configuring a task with NeMo-Run, the next step is to execute it. NeMo-Run decouples configuration from execution, so you can configure a function or task once and then execute it across multiple environments. You can execute a single task, or run multiple tasks simultaneously on different remote clusters and manage them together under an experiment. This brings us to the core building blocks for execution: `run.Executor` and `run.Experiment`.

Each execution of a single configured task requires an executor. NeMo-Run provides executors (`run.Executor` and its subclasses), which are APIs to configure your remote executor and set up the packaging of your code. Currently we support:

- `run.LocalExecutor`
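The idea of configuring once and executing anywhere can be illustrated with a plain-Python analogy. This sketch is not NeMo-Run code: it assumes `functools.partial` as a stand-in for a configured task (NeMo-Run itself uses configuration objects such as `run.Partial`), and the hypothetical helpers `run_inline` and `run_in_thread` stand in for two different executors. The task is configured exactly once; each "executor" only decides where it runs.

```python
import functools
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def train(lr: float, steps: int) -> str:
    """Toy training function standing in for a real task."""
    return f"trained for {steps} steps at lr={lr}"


# Configure once: bind the arguments without running anything yet.
# (functools.partial is only an analogy for NeMo-Run's run.Partial.)
task = functools.partial(train, lr=3e-4, steps=100)


# Hypothetical "executors": each decides where the configured task runs.
def run_inline(configured: Callable[[], str]) -> str:
    """Execute the task in the current process."""
    return configured()


def run_in_thread(configured: Callable[[], str]) -> str:
    """Execute the same task on a worker thread."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(configured).result()


# Execute the same configuration in two environments.
results = [run_inline(task), run_in_thread(task)]
print(results)
```

Swapping `run_inline` for `run_in_thread` required no change to `task`; in NeMo-Run the analogous step is handing the same configured task to a different `run.Executor`.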
## Executors
Executors are dataclasses that configure your remote executor and set up the packaging of your code. All supported executors inherit from the base class `run.Executor`, but have configuration parameters specific to their execution environment. There is an initial cost to understanding the specifics of your executor and setting it up, but this effort is easily amortized over time.

Each `run.Executor` has two attributes: `packager` and `launcher`. The `packager` specifies how to package the code for execution, while the `launcher` determines which tool to use for launching the task.
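To make the split between these two attributes concrete, here is a toy model, not NeMo-Run's implementation: a hypothetical `ToyExecutor` whose `packager` is a function choosing which files to ship and whose `launcher` only changes the command used to start the task. The `torchrun` command line shown is an assumption about what a torchrun-style launcher produces.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


def python_sources_only(paths: List[str]) -> List[str]:
    """Hypothetical packager: ship only the Python sources."""
    return [p for p in paths if p.endswith(".py")]


@dataclass
class ToyExecutor:
    """Toy stand-in for run.Executor's two concerns (not the real class)."""

    packager: Callable[[List[str]], List[str]]  # what code to ship
    launcher: Optional[str] = None              # how to start it; None = plain python
    nproc_per_node: int = 1

    def package(self, paths: List[str]) -> List[str]:
        return self.packager(paths)

    def command(self, script: str) -> List[str]:
        if self.launcher == "torchrun":
            # Assumed torchrun-style command: one process per GPU on the node.
            return ["torchrun", f"--nproc_per_node={self.nproc_per_node}", script]
        if self.launcher is None:
            return ["python", script]
        raise ValueError(f"unknown launcher: {self.launcher}")


single = ToyExecutor(packager=python_sources_only)
multi = ToyExecutor(packager=python_sources_only, launcher="torchrun", nproc_per_node=8)

files = ["train.py", "data.csv", "utils.py"]
print(single.package(files), single.command("train.py"))
print(multi.package(files), multi.command("train.py"))
```

Both executors package the same files and differ only in the launch command, which is why packaging and launching are kept as separate attributes.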
### Launchers
We support the following `launchers`:
#### SlurmExecutor
The `SlurmExecutor` enables launching the configured task on a Slurm cluster with Pyxis. Additionally, you can configure a `run.SSHTunnel`, which lets you execute tasks on the Slurm cluster from your local machine while NeMo-Run manages the SSH connection for you. This setup supports use cases such as launching the same task on multiple Slurm clusters.

Below is an example of configuring a Slurm Executor

Use the SSH Tunnel when launching from your local machine, or the Local Tunnel if you're already on the Slurm cluster.
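The rule above can be encoded in a small helper. This is a hypothetical sketch: `choose_tunnel` is not part of NeMo-Run, it only returns the name of the tunnel class to use (the names `SSHTunnel` and `LocalTunnel` are assumed from the prose above), and the heuristic that a visible `sbatch` binary means you are on the cluster is an assumption that may not hold at your site.

```python
import shutil
from typing import Optional


def choose_tunnel(on_cluster: Optional[bool] = None) -> str:
    """Return which tunnel class name to use (hypothetical helper).

    If on_cluster is not given, guess it: assume that a machine where the
    Slurm `sbatch` command is on PATH is a cluster login node.
    """
    if on_cluster is None:
        on_cluster = shutil.which("sbatch") is not None
    # On the cluster, no SSH hop is needed; from a local machine,
    # the SSH tunnel lets NeMo-Run manage the connection.
    return "LocalTunnel" if on_cluster else "SSHTunnel"


print(choose_tunnel())
```

Passing `on_cluster` explicitly avoids relying on the heuristic when you already know where the launch happens.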
##### Job Dependencies
`SlurmExecutor` supports defining dependencies between [jobs](management.md#adding-tasks), allowing you to create workflows where jobs run in a specific order. You can also choose the kind of dependency with the `dependency_type` parameter:

```python
import nemo_run as run

executor = run.SlurmExecutor(
    # ... other parameters ...
    dependency_type="afterok",
)
```
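Slurm itself expresses these relationships through `sbatch --dependency=<type>:<job_id>[:<job_id>...]`. The hypothetical helper below renders that option for a given dependency type, purely to show the Slurm syntax the `dependency_type` values come from; how NeMo-Run builds the option internally is not shown here and may differ.

```python
from typing import List

# Dependency types that take a list of upstream job IDs (see `man sbatch`).
JOB_ID_DEPENDENCY_TYPES = {
    "after",
    "afterany",
    "afterburstbuffer",
    "aftercorr",
    "afternotok",
    "afterok",
}


def dependency_flag(dependency_type: str, job_ids: List[str]) -> str:
    """Render an sbatch --dependency option (hypothetical helper)."""
    if dependency_type not in JOB_ID_DEPENDENCY_TYPES:
        raise ValueError(f"unsupported dependency type: {dependency_type}")
    if not job_ids:
        raise ValueError("at least one upstream job ID is required")
    # Multiple job IDs for the same type are colon-separated (all must be satisfied).
    return f"--dependency={dependency_type}:{':'.join(job_ids)}"


print(dependency_flag("afterok", ["101", "102"]))  # --dependency=afterok:101:102
```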
The `dependency_type` parameter specifies the type of dependency relationship:

- `afterok` (default): The job starts only after the specified jobs have completed successfully.
- `afterany`: The job starts after the specified jobs have terminated, regardless of exit code.
- `afternotok`: The job starts only after the specified jobs have terminated in a failed state (for example, a non-zero exit code or a timeout).
- Other options are available as defined in the [Slurm documentation](https://slurm.schedmd.com/sbatch.html#OPT_dependency).

This functionality enables you to create complex workflows with proper orchestration between different tasks, such as starting a training job only after data preparation is complete, or running an evaluation only after training finishes successfully.
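The behaviour of the three common types can be modelled as a small state check. This is a simplified, hypothetical model of Slurm's semantics, not NeMo-Run or Slurm code: upstream jobs are represented only by an assumed subset of Slurm job-state names, every non-`COMPLETED` terminal state is treated as a failure, and the helper reports whether a dependent job would start, keep waiting, or have a dependency that can never be satisfied (Slurm leaves such a job pending with the reason `DependencyNeverSatisfied` unless configured to cancel it). The calls at the end replay the prep → train → evaluate workflow described above.

```python
from typing import List

# Simplified subset of Slurm terminal job states (an assumption, not exhaustive).
TERMINAL = {"COMPLETED", "FAILED", "CANCELLED", "TIMEOUT", "NODE_FAIL", "OUT_OF_MEMORY"}
# Simplification: every terminal state other than COMPLETED counts as failed.
FAILED = TERMINAL - {"COMPLETED"}


def dependency_status(dependency_type: str, upstream: List[str]) -> str:
    """Return "start", "wait", or "never" for a dependent job (toy model)."""
    if dependency_type == "afterok":
        if any(s in FAILED for s in upstream):
            return "never"  # an upstream job can no longer succeed
        return "start" if all(s == "COMPLETED" for s in upstream) else "wait"
    if dependency_type == "afternotok":
        if any(s == "COMPLETED" for s in upstream):
            return "never"  # an upstream job succeeded, so it can no longer fail
        return "start" if all(s in FAILED for s in upstream) else "wait"
    if dependency_type == "afterany":
        return "start" if all(s in TERMINAL for s in upstream) else "wait"
    raise ValueError(f"not modelled here: {dependency_type}")


# Data prep succeeded, so training (afterok) may start.
print(dependency_status("afterok", ["COMPLETED"]))
# Training is still running, so evaluation (afterok) waits.
print(dependency_status("afterok", ["RUNNING"]))
# Training timed out: evaluation (afterok) never starts,
# while a cleanup job using afternotok would.
print(dependency_status("afterok", ["TIMEOUT"]), dependency_status("afternotok", ["TIMEOUT"]))
```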
#### SkypilotExecutor

This executor is used to configure [SkyPilot](https://skypilot.readthedocs.io/en/latest/docs/index.html). Make sure SkyPilot is installed and at least one cloud is configured using `sky check`.
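A quick preflight check can confirm the installation part of that requirement before you configure the executor. The helper below is a hypothetical sketch: it assumes the SkyPilot Python package is importable as `sky` and that its CLI is named `sky`, and it deliberately stops short of checking cloud credentials, which is what `sky check` is for.

```python
import importlib.util
import shutil
from typing import Dict


def skypilot_preflight() -> Dict[str, bool]:
    """Report whether SkyPilot looks installed (hypothetical helper).

    This does not validate cloud access; run `sky check` for that.
    """
    return {
        # The SkyPilot package is imported as `sky`.
        "python_package": importlib.util.find_spec("sky") is not None,
        # The `sky` command-line tool is on PATH.
        "cli_on_path": shutil.which("sky") is not None,
    }


status = skypilot_preflight()
if not all(status.values()):
    print("SkyPilot not fully installed:", status)
else:
    print("SkyPilot found; run `sky check` next to verify cloud access.")
```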