Commit 41f05d9

Add slurm dependency type section to execution guide (#181)
Signed-off-by: Hemil Desai <[email protected]>
1 parent 5bdba59 commit 41f05d9

1 file changed: +24 -4 lines changed

docs/source/guides/execution.md

Lines changed: 24 additions & 4 deletions
@@ -1,6 +1,6 @@
 # Execute NeMo Run

-After configuring NeMo-Run, the next step is to execute it. Nemo-Run decouples configuration from execution, allowing you to configure a function or task once and then execute it across multiple environments. With Nemo-Run, you can choose to execute a single task or multiple tasks simultaneously on different remote clusters, managing them under an experiment. This brings us to the core building blocks for execution: `run.Executor` and `run.Experiment`.
+After configuring NeMo-Run, the next step is to execute it. Nemo-Run decouples configuration from execution, allowing you to configure a function or task once and then execute it across multiple environments. With Nemo-Run, you can choose to execute a single task or multiple tasks simultaneously on different remote clusters, managing them under an experiment. This brings us to the core building blocks for execution: `run.Executor` and `run.Experiment`.

 Each execution of a single configured task requires an executor. Nemo-Run provides `run.Executor`, which are APIs to configure your remote executor and set up the packaging of your code. Currently we support:
 - `run.LocalExecutor`
@@ -20,7 +20,7 @@ The `run.Experiment` takes care of storing the run metadata, launching it on the
 ## Executors
 Executors are dataclasses that configure your remote executor and set up the packaging of your code. All supported executors inherit from the base class `run.Executor`, but have configuration parameters specific to their execution environment. There is an initial cost to understanding the specifics of your executor and setting it up, but this effort is easily amortized over time.

-Each `run.Executor` has the two attributes: `packager` and `launcher`. The `packager` specifies how to package the code for execution, while the `launcher` determines which tool to use for launching the task.
+Each `run.Executor` has the two attributes: `packager` and `launcher`. The `packager` specifies how to package the code for execution, while the `launcher` determines which tool to use for launching the task.

 ### Launchers
 We support the following `launchers`:
@@ -110,7 +110,7 @@ run.DockerExecutor(

 #### SlurmExecutor

-The SlurmExecutor enables launching the configured task on a Slurm Cluster with Pyxis.  Additionally, you can configure a `run.SSHTunnel`, which enables you to execute tasks on the Slurm cluster from your local machine while NeMo-Run manages the SSH connection for you. This setup supports use cases such as launching the same task on multiple Slurm clusters.
+The SlurmExecutor enables launching the configured task on a Slurm Cluster with Pyxis. Additionally, you can configure a `run.SSHTunnel`, which enables you to execute tasks on the Slurm cluster from your local machine while NeMo-Run manages the SSH connection for you. This setup supports use cases such as launching the same task on multiple Slurm clusters.

 Below is an example of configuring a Slurm Executor
 ```python
@@ -150,7 +150,27 @@ def your_slurm_executor(nodes: int = 1, container_image: str = DEFAULT_IMAGE):
 executor = your_slurm_cluster(nodes=8, container_image="your-nemo-image")
 ```

-Use the SSH Tunnel when launching from your local machine, or the Local Tunnel if you’re already on the Slurm cluster.
+Use the SSH Tunnel when launching from your local machine, or the Local Tunnel if you're already on the Slurm cluster.
+
+##### Job Dependencies
+
+`SlurmExecutor` supports defining dependencies between [jobs](management.md#adding-tasks), allowing you to create workflows where jobs run in a specific order. Additionally, you can specify the `dependency_type` parameter:
+
+```python
+executor = run.SlurmExecutor(
+    # ... other parameters ...
+    dependency_type="afterok",
+)
+```
+
+The `dependency_type` parameter specifies the type of dependency relationship:
+
+- `afterok` (default): Job will start only after the specified jobs have completed successfully
+- `afterany`: Job will start after the specified jobs have terminated (regardless of exit code)
+- `afternotok`: Job will start after the specified jobs have failed
+- Other options are available as defined in the [Slurm documentation](https://slurm.schedmd.com/sbatch.html#OPT_dependency)
+
+This functionality enables you to create complex workflows with proper orchestration between different tasks, such as starting a training job only after data preparation is complete, or running an evaluation only after training finishes successfully.

 #### SkypilotExecutor
 This executor is used to configure [Skypilot](https://skypilot.readthedocs.io/en/latest/docs/index.html). Make sure Skypilot is installed and atleast one cloud is configured using `sky check`.
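
Note on the Job Dependencies section added in this commit: the `dependency_type` values match the dependency types of Slurm's `sbatch --dependency` option, which takes the form `<type>:<job_id>[:<job_id>...]`. The sketch below illustrates only that flag format. It assumes the value is ultimately rendered into such a flag; the helper `build_dependency_flag`, the `KNOWN_TYPES` set, and the job IDs are hypothetical and are not part of NeMo-Run.

```python
from typing import Sequence

# Dependency types named in the Job Dependencies section; Slurm defines more.
KNOWN_TYPES = {"afterok", "afterany", "afternotok"}


def build_dependency_flag(dependency_type: str, job_ids: Sequence[int]) -> str:
    """Hypothetical helper (not NeMo-Run API): format an sbatch dependency flag."""
    if dependency_type not in KNOWN_TYPES:
        raise ValueError(f"unsupported dependency type in this sketch: {dependency_type!r}")
    if not job_ids:
        raise ValueError("at least one upstream job ID is required")
    # Slurm separates the type and each job ID with a colon.
    joined_ids = ":".join(str(job_id) for job_id in job_ids)
    return f"--dependency={dependency_type}:{joined_ids}"


# Example: an evaluation job that should wait for two made-up training job IDs.
flag = build_dependency_flag("afterok", [1001, 1002])
print(flag)  # --dependency=afterok:1001:1002
```

With `afterok`, the dependent job starts only if both upstream jobs succeed; switching to `afterany` would let a cleanup step run whether or not they did.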
