
Commit eb6d09c

Note about multi-GPU training from a single process
1 parent af44be2 commit eb6d09c

File tree

1 file changed

+8 -1 lines changed


docs/access/jupyterlab.md

Lines changed: 8 additions & 1 deletion
@@ -203,7 +203,7 @@ Examples of notebooks with `ipcmagic` can be found [here](https://github.com/
 
 While it is generally recommended to submit long-running machine learning training and inference jobs via `sbatch`, certain use cases can benefit from an interactive Jupyter environment.
 
-A popular approach to run multi-GPU ML workloads is with `accelerate` and `torchrun` as demonstrated in the [tutorials][ref-guides-mlp-tutorials]. In particular, the `accelerate launch` script in the [LLM fine-tuning tutorial][ref-mlp-llm-finetuning-tutorial] can be directly carried over to a Jupyter cell with a `%%bash` header (to run its contents interpreted by bash). For `torchrun`, one can adapt the command from the multi-node [nanotron tutorial][ref-mlp-llm-nanotron-tutorial] to run on a single GH200 node using the following line in a Jupyter cell
+A popular approach to run multi-GPU ML workloads is with [`accelerate`](https://github.com/huggingface/accelerate) and [`torchrun`](https://docs.pytorch.org/docs/stable/elastic/run.html) as demonstrated in the [tutorials][ref-guides-mlp-tutorials]. In particular, the `accelerate launch` script in the [LLM fine-tuning tutorial][ref-mlp-llm-finetuning-tutorial] can be directly carried over to a Jupyter cell with a `%%bash` header (to run its contents interpreted by bash). For `torchrun`, one can adapt the command from the multi-node [nanotron tutorial][ref-mlp-llm-nanotron-tutorial] to run on a single GH200 node using the following line in a Jupyter cell
 
 ```bash
 !python -m torch.distributed.run --standalone --nproc_per_node=4 run_train.py ...
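
For illustration only (not part of this commit's diff), a `%%bash` cell carrying over an `accelerate launch` invocation might look roughly like the sketch below. The script name `train.py` and the launch flags are generic placeholders for a single-node, 4-GPU run, not the exact command from the fine-tuning tutorial.

```bash
%%bash
# Sketch: launch one process per GPU on a single GH200 node (4 GPUs assumed);
# train.py stands in for the tutorial's training script.
accelerate launch --num_processes=4 --num_machines=1 --mixed_precision=bf16 train.py
```
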
@@ -231,6 +231,13 @@ where `/path/to/edf.toml` should be replaced by the TOML file and `train.py` is
 !!! warning "Concurrent usage of resources"
     Subtle bugs can occur when running multiple Jupyter notebooks concurrently that each assume access to the full node. Also, some notebooks may hold on to resources such as spawned child processes or allocated memory despite having completed. In this case, resources such as a GPU may still be busy, blocking another notebook from using it. Therefore, it is good practice to only keep one such notebook running that occupies the full node and restarting a kernel once a notebook has completed. If in doubt, system monitoring with `htop` and [nvdashboard](https://github.com/rapidsai/jupyterlab-nvdashboard) can be helpful for debugging.
 
+!!! warning "Multi-GPU training from a shared Jupyter process"
+    Running multi-GPU training workloads directly from the shared Jupyter process is generally not recommended due to potential inefficiencies and correctness issues (cf. the [Pytorch docs](https://docs.pytorch.org/docs/stable/notes/cuda.html#use-nn-parallel-distributeddataparallel-instead-of-multiprocessing-or-nn-dataparallel)). However, if you need it to e.g. reproduce existing results, it is possible to do so with utilities like `accelerate`'s `notebook_launcher` or [`transformers`](https://github.com/huggingface/transformers)' `Trainer` class. When using these in containers, you will currently need to unset the environment variables `RANK` and `LOCAL_RANK`, that is have the following in a cell at the top of the notebook:
+
+    ```python
+    import os; os.environ.pop("RANK"); os.environ.pop("LOCAL_RANK");
+    ```
+
 
 ## Further documentation
 
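As a companion to the added warning, here is a minimal sketch of the `notebook_launcher` route (illustrative only, not part of the commit). `training_function` is a placeholder, and the environment-variable cleanup mirrors the snippet from the diff, using `pop` with a default so the cell also works when the variables are unset.

```python
import os

# Drop the rank variables inherited from the container environment (see the
# warning above); passing a default keeps this safe if they are already unset.
os.environ.pop("RANK", None)
os.environ.pop("LOCAL_RANK", None)

from accelerate import notebook_launcher

def training_function():
    # Placeholder: build the model, dataloaders and optimizer here, typically
    # via accelerate.Accelerator, and run the training loop.
    ...

# Spawn one training process per GPU of a single GH200 node (4 GPUs assumed).
notebook_launcher(training_function, args=(), num_processes=4)
```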
