docs/access/jupyterlab.md
```toml
mounts = [
    "/capstor",
    "/iopsstor",
    "/users/${USER}/.local/share/jupyter", # (1)!
    "/etc/slurm", # (2)!
    "/usr/lib64/libslurm-uenv-mount.so",
    "/etc/container_engine_pyxis.conf" # (3)!
]

workdir = "/capstor/scratch/cscs/${USER}" # (4)!

writable = true

[annotations]
com.hooks.aws_ofi_nccl.enabled = "true" # (5)!
com.hooks.aws_ofi_nccl.variant = "cuda12"

[env]
CUDA_CACHE_DISABLE = "1" # (6)!
TORCH_NCCL_ASYNC_ERROR_HANDLING = "1" # (7)!
MPICH_GPU_SUPPORT_ENABLED = "0" # (8)!
```
1. avoid mounting all of `$HOME` to prevent subtle issues with cached files, but mount the Jupyter kernels
2. enable SLURM commands (together with the two subsequent mounts)
3. currently only required on Daint and Santis, not on Clariden
4. set the working directory of the Jupyter session (the file browser's root directory)
5. use environment settings for optimized communication
6. disable the CUDA JIT cache
7. async error handling when an exception is observed in the NCCL watchdog: abort the NCCL communicator and tear down the process upon error
8. disable GPU support in MPICH, as it can lead to deadlocks when used together with NCCL
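To confirm from a notebook cell that the `[env]` settings above are in effect, a quick sketch like the following can help. Note that inside a running Jupyter session these variables are exported by the container engine; here they are populated manually so the snippet is self-contained.

```python
import os

# Values from the [env] table of the environment definition file above.
expected = {
    "CUDA_CACHE_DISABLE": "1",
    "TORCH_NCCL_ASYNC_ERROR_HANDLING": "1",
    "MPICH_GPU_SUPPORT_ENABLED": "0",
}

# Populate the environment so this sketch runs standalone;
# drop this line when checking a real session.
os.environ.update(expected)

# Collect any variables whose runtime value differs from the EDF.
mismatches = {k: os.environ.get(k) for k, v in expected.items()
              if os.environ.get(k) != v}
print("all settings active" if not mismatches else mismatches)
```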
??? tip "Accessing file systems with uenv"
    While Jupyter sessions with CE start in the directory specified with `workdir`, a uenv session always starts in your `$HOME` folder. All non-hidden files and folders in `$HOME` are visible and accessible through the JupyterLab file browser. However, you cannot browse directly to folders above `$HOME`. To enable access to your `$SCRATCH` folder, it is therefore necessary to create a symbolic link to it. This can be done by issuing the following command in a terminal from your `$HOME` directory:
The `<kernel-name>` can be replaced by a name specific to the base image/virtual environment.
??? bug "Python packages from uenv shadowing those in a virtual environment"
    When using uenv with a virtual environment on top, the site-packages under `/user-environment` currently take precedence over those in the activated virtual environment, because that path is included in the `PYTHONPATH` environment variable. As a consequence, even if you install a different version of a package in the virtual environment than the one available in the uenv, the uenv version will still be imported at runtime. A possible workaround is to prepend the virtual environment's site-packages to `PYTHONPATH` whenever activating the virtual environment.

    When using a virtual environment on top of a base image with PyTorch, replace `torchrun` with `python -m torch.distributed.run` to pick up the correct Python environment.
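The `PYTHONPATH` workaround for uenv package shadowing can be sketched as follows; run it after activating the virtual environment, and note it only affects kernels and subprocesses launched afterwards (the `purelib` path from `sysconfig` assumes a standard venv layout):

```python
import os
import sysconfig

# site-packages of the currently active Python environment
# (the virtual environment, when run after activation).
venv_site = sysconfig.get_paths()["purelib"]

# Prepend it so venv packages shadow those under /user-environment.
os.environ["PYTHONPATH"] = (
    venv_site + os.pathsep + os.environ.get("PYTHONPATH", "")
).rstrip(os.pathsep)

print(os.environ["PYTHONPATH"].split(os.pathsep)[0])
```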
!!! note "Notebook structure"
    In none of these scenarios are any significant memory allocations or background computations performed on the main Jupyter process. Instead, the resources are kept available for the processes launched by `accelerate` or `torchrun`, respectively.
As an alternative to using these launchers, it is also possible to use SLURM to gain more control over resource mappings, e.g. by launching an overlapping SLURM step onto the same node used by the Jupyter process. An example with the container engine looks like this:
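As an illustration only, the general shape of such an overlapping step could be assembled from a notebook cell as below. `--overlap` and `--jobid` are standard SLURM options, while the `--environment` flag and the EDF name `my-edf` are assumptions for this sketch, not a definitive invocation:

```python
import os
import shlex

# Target the Jupyter job's own allocation; --overlap lets the new step
# share resources with the running Jupyter process.
jobid = os.environ.get("SLURM_JOB_ID", "<jobid>")
cmd = [
    "srun",
    "--overlap",
    f"--jobid={jobid}",
    "--environment=my-edf",  # hypothetical EDF name
    "nvidia-smi",            # placeholder payload command
]
print(shlex.join(cmd))
```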