Commit 5eea9cb

Update management guide with more details (#102)
* Update management guide with more details

  Signed-off-by: Hemil Desai <[email protected]>

* Add hello_docker example

  Signed-off-by: Hemil Desai <[email protected]>

---------

Signed-off-by: Hemil Desai <[email protected]>
1 parent 0e87190 commit 5eea9cb

2 files changed: +151 -1 lines changed

docs/source/guides/management.md

Lines changed: 115 additions & 1 deletion
@@ -1,6 +1,119 @@
# Management

3-
NeMo-Run also provides ways to inspect and reproduce past experiments. This allows you to check logs, sync artifacts (in the future), cancel running tasks, and rerun an old experiment. When you run an experiment using `run.run` or `run.Experiment`, it creates a run under the experiment title. Once finished, you see the following output at the end:
3+
The central component for management of tasks in NeMo-Run is the `Experiment` class. It allows you to define, launch, and manage complex workflows consisting of multiple tasks. This guide provides an overview of the `Experiment` class, its methods, and how to use it effectively.

**Creating an Experiment**
---------------------------

To create an experiment, you can instantiate the `Experiment` class by passing in a descriptive title:
```python
exp = Experiment("My Experiment")
```
When executed, it will automatically generate a unique experiment ID for you, which represents one unique run of the experiment.

> [!NOTE]
> `Experiment` is a context manager. The `Experiment.add` and `Experiment.run` methods can currently be used only after entering it.
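
The restriction in the note is a common guard pattern for context managers. The following is a minimal sketch of that pattern using a hypothetical `ToyExperiment` class. It is not the NeMo-Run implementation, and both the exception type and the message are assumptions made for illustration.

```python
class ToyExperiment:
    """Hypothetical stand-in for illustration; not the NeMo-Run Experiment class."""

    def __init__(self, title: str) -> None:
        self.title = title
        self.tasks: list[str] = []
        self._entered = False

    def __enter__(self) -> "ToyExperiment":
        self._entered = True
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # Returning None lets any exception raised in the block propagate.
        self._entered = False

    def add(self, task: str) -> str:
        # Reject calls made before the context manager has been entered.
        if not self._entered:
            raise RuntimeError("add() must be called inside a `with` block")
        self.tasks.append(task)
        return task


exp = ToyExperiment("My Experiment")

try:
    exp.add("task-1")  # outside the context manager: rejected
except RuntimeError as err:
    outside_error = str(err)

with exp:
    task_id = exp.add("task-1")  # inside the context manager: accepted
```
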

**Adding Tasks**
-----------------

You can add tasks to an experiment using the `add` method. This method supports tasks of the following kinds:

1. A single task which is an instance of either `run.Partial` or `run.Script`, along with its executor.
```python
with exp:
    exp.add(task_1, executor=run.LocalExecutor())
```

2. A list of tasks, each of which is an instance of either `run.Partial` or `run.Script`, along with either a single executor for the whole group or a list of executors, one for each task in the group. Currently, all tasks in the group will be executed in parallel.
```python
with exp:
    exp.add([task_2, task_3], executor=run.DockerExecutor(...))
```

You can specify a descriptive name for the task using the `name` keyword argument.

`add` also takes in a list of plugins, each an instance of `run.Plugin`. Plugins are used to make changes to the task and executor together, which is useful in some cases - for example, to enable a config option in the task and set an environment variable in the executor related to the config option.

`add` returns a unique id for the task/job. This unique id can be used to define complex dependencies between a group of tasks as follows:
```python
with run.Experiment("dag-experiment", log_level="INFO") as exp:
    id1 = exp.add([inline_script, inline_script_sleep], tail_logs=False, name="task-1")
    id2 = exp.add([inline_script, inline_script_sleep], tail_logs=False, name="task-2")
    exp.add(
        [inline_script, inline_script_sleep],
        tail_logs=False,
        name="task-3",
        dependencies=[id1, id2],  # task-3 will only run after task-1 and task-2 have completed
    )
```
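
To see what ordering such a dependency list implies, the sketch below models the same three tasks with Python's standard-library `graphlib.TopologicalSorter`. It only illustrates the ordering constraint and is not how NeMo-Run schedules jobs internally. The task names are reused from the example above, and the `waves` variable is introduced here purely for illustration.

```python
from graphlib import TopologicalSorter

# Map each task name to the set of tasks it depends on.
dependencies = {
    "task-1": set(),
    "task-2": set(),
    "task-3": {"task-1", "task-2"},
}

sorter = TopologicalSorter(dependencies)
sorter.prepare()

# Collect tasks in "waves": every task in a wave has all of its
# dependencies satisfied by the waves before it.
waves = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())
    waves.append(ready)
    sorter.done(*ready)

print(waves)
```

In this model, `task-1` and `task-2` land in the first wave because neither depends on anything, and `task-3` lands in the second wave.
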

**Launching an Experiment**
---------------------------

Once you have added all tasks to an experiment, you can launch it using the `run` method. This method takes several optional arguments, including `detach`, `sequential`, `tail_logs`, and `direct`:

* `detach`: If `True`, the experiment will detach from the process executing it. This is useful when launching an experiment on a remote cluster, where you may want to end the process after scheduling the tasks in that experiment.
* `sequential`: If `True`, all tasks will be executed sequentially. This is only applicable when the individual tasks do not have any dependencies on each other.
* `tail_logs`: If `True`, logs will be displayed in real-time.
* `direct`: If `True`, each task in the experiment will be executed directly in the same process on your local machine. This does not support task/job groups.

```python
with exp:
    # Add all tasks
    exp.run(detach=True, sequential=False, tail_logs=True, direct=False)
```

**Experiment Status**
---------------------

You can check the status of an experiment using the `status` method:
```python
exp.status()
```
This method displays the status of each task in the experiment. The following is sample output for the experiment in [hello_scripts.py](../../../examples/hello-world/hello_scripts.py):
```bash
Experiment Status for experiment_with_scripts_1730761155

Task 0: echo.sh
- Status: SUCCEEDED
- Executor: LocalExecutor
- Job id: echo.sh-zggz3tq0kpljs
- Local Directory: /home/your_user/.nemo_run/experiments/experiment_with_scripts/experiment_with_scripts_1730761155/echo.sh

Task 1: env_echo_
- Status: SUCCEEDED
- Executor: LocalExecutor
- Job id: env_echo_-f3fc3fbj1qjtc
- Local Directory: /home/your_user/.nemo_run/experiments/experiment_with_scripts/experiment_with_scripts_1730761155/env_echo_

Task 2: simple.add.add_object
- Status: RUNNING
- Executor: LocalExecutor
- Job id: simple.add.add_object-s1543tt3f7dcm
- Local Directory: /home/your_user/.nemo_run/experiments/experiment_with_scripts/experiment_with_scripts_1730761155/simple.add.add_object
```
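
If you capture this output as plain text, for example from a log file, it can be turned into structured records. The sketch below assumes the exact plain-text layout of the sample above. The real output may include terminal formatting or change between versions, and `parse_status` is a hypothetical helper, not part of NeMo-Run.

```python
import re

# Abbreviated copy of the sample status output shown above.
STATUS_TEXT = """\
Experiment Status for experiment_with_scripts_1730761155

Task 0: echo.sh
- Status: SUCCEEDED
- Executor: LocalExecutor
- Job id: echo.sh-zggz3tq0kpljs

Task 2: simple.add.add_object
- Status: RUNNING
- Executor: LocalExecutor
- Job id: simple.add.add_object-s1543tt3f7dcm
"""

TASK_RE = re.compile(r"^Task (\d+): (.+)$")
FIELD_RE = re.compile(r"^- ([^:]+): (.+)$")


def parse_status(text: str) -> list[dict[str, str]]:
    """Return one dict per task, keyed by the field labels in the output."""
    tasks: list[dict[str, str]] = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if match := TASK_RE.match(line):
            # A "Task N: name" line starts a new record.
            tasks.append({"index": match.group(1), "name": match.group(2)})
        elif tasks and (match := FIELD_RE.match(line)):
            # A "- Label: value" line belongs to the most recent task.
            tasks[-1][match.group(1)] = match.group(2)
    return tasks


tasks = parse_status(STATUS_TEXT)
unfinished = [task["name"] for task in tasks if task["Status"] != "SUCCEEDED"]
print(unfinished)
```
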

**Canceling a Task**
---------------------

You can cancel a task using the `cancel` method:
```python
exp.cancel("task_id")
```

**Viewing Logs**
-----------------

You can view the logs of a task using the `logs` method:
```python
exp.logs("task_id")
```

**Experiment Output**
---------------------
Once an experiment is run, NeMo-Run displays information on ways to inspect and reproduce past experiments. This allows you to check logs, sync artifacts (in the future), cancel running tasks, and rerun an old experiment.

```python
# The experiment was run with the following tasks: ['echo.sh', 'env_echo_', 'simple.add.add_object']
# You can inspect and reconstruct this experiment at a later point in time using:
@@ -17,4 +130,5 @@ nemorun experiment logs experiment_with_scripts_1720556256 0
nemorun experiment cancel experiment_with_scripts_1720556256 0
```
This information is specific to each experiment and describes how to manage it.

See [this notebook](examples/hello-world/hello_experiments.ipynb) for more details and a playable experience.
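
The experiment IDs shown in this guide, such as `experiment_with_scripts_1720556256`, look like the experiment title followed by an underscore and a numeric suffix. Assuming that the suffix is a Unix timestamp in seconds (an assumption, since the guide does not state it), an ID can be split as sketched below. `ExperimentId` and `split_experiment_id` are hypothetical names, not part of NeMo-Run.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class ExperimentId(NamedTuple):
    title: str
    created_at: datetime


def split_experiment_id(experiment_id: str) -> ExperimentId:
    """Split '<title>_<digits>' on the last underscore.

    Assumes the digits are a Unix timestamp in seconds.
    """
    title, _, suffix = experiment_id.rpartition("_")
    if not title or not suffix.isdigit():
        raise ValueError(f"unexpected experiment ID format: {experiment_id!r}")
    created_at = datetime.fromtimestamp(int(suffix), tz=timezone.utc)
    return ExperimentId(title, created_at)


parsed = split_experiment_id("experiment_with_scripts_1720556256")
print(parsed.title, parsed.created_at.isoformat())
```

Splitting on the last underscore keeps titles that themselves contain underscores intact.
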

examples/docker/hello_docker.py

Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
import nemo_run as run

if __name__ == "__main__":
    inline_script = run.Script(
        inline="""
echo "Hello 1"
nvidia-smi
sleep 5
"""
    )
    inline_script_sleep = run.Script(
        inline="""
echo "Hello sleep"
sleep infinity
"""
    )
    executor = run.DockerExecutor(
        container_image="python:3.12",
        num_gpus=-1,
        runtime="nvidia",
        ipc_mode="host",
        shm_size="30g",
        env_vars={"PYTHONUNBUFFERED": "1"},
        packager=run.Packager(),
    )
    with run.Experiment("docker-experiment", executor=executor, log_level="INFO") as exp:
        id1 = exp.add([inline_script, inline_script_sleep], tail_logs=False, name="task-1")
        id2 = exp.add([inline_script, inline_script_sleep], tail_logs=False, name="task-2")
        id3 = exp.add(
            [inline_script, inline_script_sleep],
            tail_logs=False,
            name="task-3",
            dependencies=[id1, id2],
        )

        exp.run(detach=False, tail_logs=True, sequential=False)
