Commit 6d9ac5a
Fixing CE example
1 parent 7826edb

1 file changed (+19, -14 lines)

docs/access/jupyterlab.md (19 additions & 14 deletions)
@@ -62,29 +62,34 @@ If the default base images do not meet your requirements, you can specify a cust
 mounts = [
     "/capstor",
     "/iopsstor",
-    "/users/${USER}/.local/share/jupyter" # (1)!
+    "/users/${USER}/.local/share/jupyter", # (1)!
+    "/etc/slurm", # (2)!
+    "/usr/lib64/libslurm-uenv-mount.so",
+    "/etc/container_engine_pyxis.conf" # (3)!
 ]
 
-workdir = "/capstor/scratch/cscs/${USER}" # (2)!
+workdir = "/capstor/scratch/cscs/${USER}" # (4)!
 
 writable = true
 
 [annotations]
-com.hooks.aws_ofi_nccl.enabled = "true" # (3)!
+com.hooks.aws_ofi_nccl.enabled = "true" # (5)!
 com.hooks.aws_ofi_nccl.variant = "cuda12"
 
 [env]
-CUDA_CACHE_DISABLE = "1" # (4)!
-TORCH_NCCL_ASYNC_ERROR_HANDLING = "1" # (5)!
-MPICH_GPU_SUPPORT_ENABLED = "0" # (6)!
+CUDA_CACHE_DISABLE = "1" # (6)!
+TORCH_NCCL_ASYNC_ERROR_HANDLING = "1" # (7)!
+MPICH_GPU_SUPPORT_ENABLED = "0" # (8)!
 ```
 
 1. avoid mounting all of `$HOME` to avoid subtle issues with cached files, but mount Jupyter kernels
-2. set working directory of Jupyter session (file browser root directory)
-3. use environment settings for optimized communication
-4. disable CUDA JIT cache
-5. async error handling when an exception is observed in NCCL watchdog: aborting NCCL communicator and tearing down process upon error
-6. Disable GPU support in MPICH, as it can lead to deadlocks when using together with NCCL
+2. enable SLURM commands (together with the two subsequent mounts)
+3. currently only required on Daint and Santis, not on Clariden
+4. set working directory of Jupyter session (file browser root directory)
+5. use environment settings for optimized communication
+6. disable CUDA JIT cache
+7. async error handling when an exception is observed in the NCCL watchdog: aborting the NCCL communicator and tearing down the process upon error
+8. disable GPU support in MPICH, as it can lead to deadlocks when used together with NCCL
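Assembled from the `+` and context lines of this hunk, the resulting EDF fragment would look roughly like this (a sketch; keys outside the hunk, such as the base `image`, are omitted):

```toml
mounts = [
    "/capstor",
    "/iopsstor",
    "/users/${USER}/.local/share/jupyter",
    "/etc/slurm",
    "/usr/lib64/libslurm-uenv-mount.so",
    "/etc/container_engine_pyxis.conf"
]

workdir = "/capstor/scratch/cscs/${USER}"

writable = true

[annotations]
com.hooks.aws_ofi_nccl.enabled = "true"
com.hooks.aws_ofi_nccl.variant = "cuda12"

[env]
CUDA_CACHE_DISABLE = "1"
TORCH_NCCL_ASYNC_ERROR_HANDLING = "1"
MPICH_GPU_SUPPORT_ENABLED = "0"
```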

 ??? tip "Accessing file systems with uenv"
     While Jupyter sessions with CE start in the directory specified with `workdir`, a uenv session always starts in your `$HOME` folder. All non-hidden files and folders in `$HOME` are visible and accessible through the JupyterLab file browser. However, you cannot browse directly to folders above `$HOME`. To enable access to your `$SCRATCH` folder, it is therefore necessary to create a symbolic link to it. This can be done by issuing the following command in a terminal from your `$HOME` directory:
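The command itself falls outside this hunk; a plausible shape of it, written as a hypothetical helper (the name `link_scratch` and the link name `scratch` are assumptions, not taken from the docs):

```shell
# Hypothetical helper: link a scratch directory into a home directory so the
# JupyterLab file browser can reach it. Name and arguments are illustrative.
link_scratch() {
    ln -s "$1" "$2/scratch"   # $1: scratch path, $2: home directory
}

# e.g. from $HOME on the cluster: link_scratch "$SCRATCH" "$HOME"
```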
@@ -126,7 +131,7 @@ python -m ipykernel install \
 
 The `<kernel-name>` can be replaced by a name specific to the base image/virtual environment.
 
-!!! bug "Python packages from uenv shadowing those in a virtual environment"
+??? bug "Python packages from uenv shadowing those in a virtual environment"
     When using uenv with a virtual environment on top, the site-packages under `/user-environment` currently take precedence over those in the activated virtual environment. This is due to the path being included in the `PYTHONPATH` environment variable. As a consequence, despite installing a different version of a package in the virtual environment from what is available in the uenv, the uenv version will still be imported at runtime. A possible workaround is to prepend the virtual environment's site-packages to `PYTHONPATH` whenever activating the virtual environment.
     ```bash
     export PYTHONPATH="$(python -c 'import site; print(site.getsitepackages()[0])'):$PYTHONPATH"
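The exported `PYTHONPATH` can be bundled with the activation step; the wrapper below is a sketch of that idea (the function name `activate_venv_first` is an assumption, not part of the docs):

```shell
# Sketch: activate a virtual environment, then prepend its site-packages to
# PYTHONPATH so the venv's packages shadow the uenv's rather than vice versa.
activate_venv_first() {
    . "$1/bin/activate"
    export PYTHONPATH="$(python -c 'import site; print(site.getsitepackages()[0])')${PYTHONPATH:+:$PYTHONPATH}"
}
```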
@@ -203,10 +208,10 @@ A popular approach to run multi-GPU ML workloads is with `accelerate` and `torch
 !torchrun --standalone --nproc_per_node=4 run_train.py ...
 ```
 
-!!! warning
+!!! warning "torchrun with virtual environments"
     When using a virtual environment on top of a base image with Pytorch, replace `torchrun` with `python -m torch.distributed.run` to pick up the correct Python environment.
 
-!!! note
+!!! note "Notebook structure"
     In none of these scenarios are any significant memory allocations or background computations performed on the main Jupyter process. Instead, the resources are kept available for the processes launched by `accelerate` or `torchrun`, respectively.
 
 As an alternative to these launchers, it is also possible to use SLURM to obtain more control over resource mappings, e.g. by launching an overlapping SLURM step onto the same node used by the Jupyter process. An example with the container engine looks like this
