docs/access/jupyterlab.md (3 additions, 3 deletions)
@@ -83,7 +83,7 @@ If the default base images do not meet your requirements, you can specify a cust
 ```

 1. Avoid mounting all of `$HOME` to avoid subtle issues with cached files, but mount Jupyter kernels
-2. Enable SLURM commands (together with two subsequent mounts)
+2. Enable Slurm commands (together with two subsequent mounts)
 3. Currently only required on Daint and Santis, not on Clariden
 4. Set working directory of Jupyter session (file browser root directory)
 5. Use environment settings for optimized communication
@@ -215,7 +215,7 @@ A popular approach to run multi-GPU ML workloads is with `accelerate` and `torch
 !!! note "Notebook structure"
     In none of these scenarios any significant memory allocations or background computations are performed on the main Jupyter process. Instead, the resources are kept available for the processes launched by `accelerate` or `torchrun`, respectively.

-Alternatively to using these launchers, it is also possible to use SLURM to obtain more control over resource mappings, e.g. by launching an overlapping SLURM step onto the same node used by the Jupyter process. An example with the container engine looks like this:
+Alternatively to using these launchers, it is also possible to use Slurm to obtain more control over resource mappings, e.g. by launching an overlapping Slurm step onto the same node used by the Jupyter process. An example with the container engine looks like this:
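Note: the cell this paragraph introduces sits mostly outside the changed hunk shown next, so only its last line is visible there. A rough sketch of such an overlapping step launched from a notebook cell, assuming a node with 4 GPUs and with the task count, GPU binding, `MASTER_PORT`, and the `RANK`/`WORLD_SIZE` wiring being assumptions rather than the documented example, could look like:

```bash
%%bash
# Sketch only: reuse the node already held by the Jupyter session via an overlapping step.
# --ntasks/--gpus-per-task assume a 4-GPU node; adjust them to the actual allocation.
srun --overlap -ul --ntasks=4 --gpus-per-task=1 --environment=/path/to/edf.toml bash -c "
    export RANK=\$SLURM_PROCID LOCAL_RANK=\$SLURM_LOCALID WORLD_SIZE=\$SLURM_NTASKS
    export MASTER_ADDR=\$(hostname) MASTER_PORT=29500  # assumed free port
    python train.py  # script arguments would follow here
"
```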
@@ -226,7 +226,7 @@ Alternatively to using these launchers, it is also possible to use SLURM to obta
     python train.py ..."
 ```

-where `/path/to/edf.toml` should be replaced by the TOML file and `train.py` is a script using `torch.distributed` for distributed training. This can be further customized with extra SLURM options.
+where `/path/to/edf.toml` should be replaced by the TOML file and `train.py` is a script using `torch.distributed` for distributed training. This can be further customized with extra Slurm options.

 !!! warning "Concurrent usage of resources"
     Subtle bugs can occur when running multiple Jupyter notebooks concurrently that each assume access to the full node. Also, some notebooks may hold on to resources such as spawned child processes or allocated memory despite having completed. In this case, resources such as a GPU may still be busy, blocking another notebook from using it. Therefore, it is good practice to only keep one such notebook running that occupies the full node and restarting a kernel once a notebook has completed. If in doubt, system monitoring with `htop` and [nvdashboard](https://github.com/rapidsai/jupyterlab-nvdashboard) can be helpful for debugging.
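The `train.py` referenced in the hunk above is only named, never shown. A minimal sketch of such a script, assuming the launcher (`torchrun`, `accelerate`, or the overlapping Slurm step) has populated `RANK`, `WORLD_SIZE`, `MASTER_ADDR`, `MASTER_PORT`, and `LOCAL_RANK`, and using a placeholder model that is not from the docs, could be:

```python
import os

import torch
import torch.distributed as dist


def main() -> None:
    # init_method defaults to "env://": rank and world size are read from the
    # environment variables set by the launcher.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    torch.cuda.set_device(local_rank)

    # Placeholder model; a real training script would build its own model,
    # optimizer, and data loading here.
    model = torch.nn.parallel.DistributedDataParallel(
        torch.nn.Linear(32, 32).cuda(local_rank), device_ids=[local_rank]
    )

    x = torch.randn(8, 32, device=f"cuda:{local_rank}")
    loss = model(x).sum()
    loss.backward()  # gradients are averaged across ranks by DDP

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```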