
Commit e59e2f6

code snippets: comments as annotations, venv: reference to mksquashfs
1 parent: d1625f7

File tree

2 files changed: +55 additions, -38 deletions

docs/guides/storage.md

1 addition, 0 deletions

@@ -14,6 +14,7 @@ At first it can seem strange that a "high-performance" file system is significan
 
 Meta data lookups on Lustre are expensive compared to your laptop, where the local file system is able to aggressively cache meta data.
 
+[](){#ref-guides-storage-venv}
 ### Python virtual environments with uenv
 
 Python virtual environments can be very slow on Lustre, for example a simple `import numpy` command run on Lustre might take seconds, compared to milliseconds on your laptop.
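The import-latency claim above is easy to verify on any file system. A minimal, hypothetical sketch for timing a cold import follows; `json` is used as a stand-in module so the snippet runs even where `numpy` is not installed (on Lustre, substitute `numpy` and run from a fresh interpreter to see the effect):

```python
import importlib
import time

def time_import(module_name: str) -> float:
    """Return the wall-clock seconds spent importing `module_name`."""
    start = time.perf_counter()
    importlib.import_module(module_name)
    return time.perf_counter() - start

# First import pays the file-system cost; repeated imports hit sys.modules.
elapsed = time_import("json")
print(f"import json took {elapsed * 1000:.2f} ms")
```

On a local disk this typically reports a few milliseconds; the guide's point is that the same measurement on Lustre can reach seconds because each module import triggers many small metadata lookups.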

docs/software/ml/pytorch.md

54 additions, 38 deletions
@@ -255,27 +255,28 @@ There are two ways to access the software provided by the uenv, once it has been
 
     !!! example "test mpi compilers and python provided by pytorch/v2.6.0"
         ```console
-        # start using the default view
-        $ uenv start pytorch/v2.6.0:v1 --view=default
+        $ uenv start pytorch/v2.6.0:v1 --view=default # (1)!
 
-        # the python executable provided by the uenv is the default, and is a recent version
-        $ which python
+        $ which python # (2)!
         /user-environment/env/default/bin/python
         $ python --version
         Python 3.13.0
 
-        # the mpi compiler wrappers are also available
-        $ which mpicc
+        $ which mpicc # (3)!
         /user-environment/env/default/bin/mpicc
         $ mpicc --version
         gcc (Spack GCC) 13.3.0
         $ gcc --version # the compiler wrapper uses the gcc provided by the uenv
         gcc (Spack GCC) 13.3.0
 
-        # exit the uenv
-        exit
+        $ exit # (4)!
         ```
 
+    1. start using the default view
+    2. the python executable provided by the uenv is the default, and is a recent version
+    3. the mpi compiler wrappers are also available
+    4. exit the uenv
+
 === "Spack"
 
     The pytorch uenv can also be used as a base for building software with
@@ -291,25 +292,36 @@ Uenvs are read-only, and cannot be modified. However, it is possible to add Pyth
 
     !!! example "creating a virtual environment on top of the uenv"
         ```console
-        # start the uenv
-        $ uenv start pytorch/v2.6.0:v1 --view=default
+        $ uenv start pytorch/v2.6.0:v1 --view=default # (1)!
 
-        # create a virtual environment
-        $ python -m venv ./my-venv
+        $ python -m venv ./my-venv # (2)!
 
-        # activate the virtual environment
-        $ source ./my-venv/bin/activate
+        $ source ./my-venv/bin/activate # (3)!
 
-        # install packages using pip
-        (my-venv) $ pip install <package>
+        (my-venv) $ pip install <package> # (4)!
 
-        # deactivate the virtual environment
-        (my-venv) $ deactivate
+        (my-venv) $ deactivate # (5)!
 
-        # exit the uenv
-        exit
+        $ exit # (6)!
         ```
 
+    1. The `default` view is recommended, as it loads all the packages provided by the uenv.
+       This is important for PyTorch to work correctly, as it relies on the CUDA and NCCL libraries provided by the uenv.
+    2. The virtual environment is created in the current working directory, and can be activated and deactivated like any other Python virtual environment.
+    3. Activating the virtual environment will override the Python executable provided by the uenv, and use the one from the virtual environment instead.
+       This is important to ensure that the packages installed in the virtual environment are used.
+    4. The virtual environment can be used to install any Python package.
+    5. The virtual environment can be deactivated using the `deactivate` command.
+       This will restore the original Python executable provided by the uenv.
+    6. The uenv can be exited using the `exit` command or by typing `ctrl-d`.
+
+    !!! note
+        Python virtual environments can be slow on the parallel Lustre file system
+        due to the large number of small files and the potentially many processes
+        accessing them. If this becomes a bottleneck, consider [squashing the
+        venv][ref-guides-storage-venv] into its own memory-mapped, read-only file
+        system.
+
     Alternatively one can use the uenv as [upstream Spack
     instance][ref-building-uenv-spack] to add both Python and non-Python
     packages. However, this workflow is more involved and intended for advanced
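The squashing workflow that the new note points to (and that the commit message references via `mksquashfs`) could look roughly like the sketch below. This is an illustration only: the `-noappend` flag is standard `mksquashfs` usage, but the directory names are hypothetical and the exact mount procedure is documented in the linked storage guide, not here.

```console
$ mksquashfs ./my-venv my-venv.squashfs -noappend
```

The resulting read-only image can then be mounted alongside the uenv, so that imports read from a single memory-mapped file instead of thousands of small files on Lustre.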
@@ -326,36 +338,36 @@ Spack users.
     #SBATCH --ntasks-per-node=4
     #SBATCH --cpus-per-task=72
     #SBATCH --time=00:30:00
-    #SBATCH --uenv=pytorch/v2.6.0:/user-environment
+    #SBATCH --uenv=pytorch/v2.6.0:/user-environment # (1)!
     #SBATCH --view=default
 
     #################################
     # OpenMP environment variables #
     #################################
-    export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK # (1)!
+    export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK # (2)!
 
     #################################
     # PyTorch environment variables #
     #################################
-    export MASTER_ADDR=$(hostname) # (2)!
+    export MASTER_ADDR=$(hostname) # (3)!
     export MASTER_PORT=6000
     export WORLD_SIZE=$SLURM_NPROCS
-    export TORCH_NCCL_ASYNC_ERROR_HANDLING=1 # (3)!
+    export TORCH_NCCL_ASYNC_ERROR_HANDLING=1 # (4)!
 
     #################################
     # MPICH environment variables #
     #################################
-    export MPICH_GPU_SUPPORT_ENABLED=0 # (4)!
+    export MPICH_GPU_SUPPORT_ENABLED=0 # (5)!
 
     #################################
     # CUDA environment variables #
     #################################
-    export CUDA_CACHE_DISABLE=1 # (5)!
+    export CUDA_CACHE_DISABLE=1 # (6)!
 
     ############################################
     # NCCL and Fabric environment variables #
     ############################################
-    export NCCL_NET="AWS Libfabric" # (6)!
+    export NCCL_NET="AWS Libfabric" # (7)!
     export NCCL_NET_GDR_LEVEL=PHB
     export NCCL_CROSS_NIC=1
     export FI_CXI_DISABLE_HOST_REGISTER=1
@@ -364,8 +376,8 @@ Spack users.
     export FI_CXI_DEFAULT_TX_SIZE=32768
     export FI_CXI_RX_MATCH_MODE=software
 
-    # (7)!
     # (8)!
+    # (9)!
     srun bash -c "
     export RANK=\$SLURM_PROCID
     export LOCAL_RANK=\$SLURM_LOCALID
@@ -374,20 +386,24 @@ Spack users.
     "
     ```
 
-    1. Only set `OMP_NUM_THREADS` if you are using OpenMP in your code.
-    2. These variables are used by PyTorch to initialize the distributed
+    1. The `--uenv` option is used to specify the uenv to use for the job. The
+       `--view=default` option is used to load all the packages provided by the
+       uenv.
+    2. Only set `OMP_NUM_THREADS` if you are using OpenMP in your code.
+    3. These variables are used by PyTorch to initialize the distributed
        backend. The `MASTER_ADDR` and `MASTER_PORT` variables are used to
-       determine the address and port of the master node. Additionally we also need
-       `RANK` and `LOCAL_RANK` but these must be set per-process, see below.
-    3. Enable more graceful exception handling, see [PyTorch
+       determine the address and port of the master node. Additionally we also
+       need `RANK` and `LOCAL_RANK` but these must be set per-process, see
+       below.
+    4. Enable more graceful exception handling, see [PyTorch
        documentation](https://pytorch.org/docs/stable/torch_nccl_environment_variables.html)
-    4. Disable GPU support in MPICH, as it [can lead to
+    5. Disable GPU support in MPICH, as it [can lead to
        deadlocks](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/mpi.html#inter-gpu-communication-with-cuda-aware-mpi)
       when using together with nccl.
-    5. Avoid writing JITed binaries to the (distributed) file system, which
+    6. Avoid writing JITed binaries to the (distributed) file system, which
       could lead to performance issues.
-    6. These variables should always be set for correctness and optimal
+    7. These variables should always be set for correctness and optimal
       performance when using NCCL, see [the detailed
       explanation][ref-communication-nccl].
-    7. `RANK` and `LOCAL_RANK` are set per-process by the SLURM job launcher.
-    8. Activate the virtual environment created on top of the uenv (if any).
+    8. `RANK` and `LOCAL_RANK` are set per-process by the SLURM job launcher.
+    9. Activate the virtual environment created on top of the uenv (if any).
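The `srun bash -c` block above sets `RANK` and `LOCAL_RANK` per process from Slurm's variables, while `WORLD_SIZE` is exported once for the whole job. A minimal sketch of that mapping, as a plain Python helper (the function name is illustrative, not part of the documented workflow; a real training script would feed these values to `torch.distributed.init_process_group`):

```python
import os

def torch_dist_env(environ: dict) -> dict:
    """Map Slurm-provided variables to the names PyTorch's env:// init reads."""
    return {
        "RANK": environ["SLURM_PROCID"],         # global rank of this process
        "LOCAL_RANK": environ["SLURM_LOCALID"],  # rank within its node
        "WORLD_SIZE": environ["SLURM_NPROCS"],   # total number of processes
    }

# Example: the second process on the first node of a 2-node x 4-GPU job.
env = torch_dist_env({
    "SLURM_PROCID": "1",
    "SLURM_LOCALID": "1",
    "SLURM_NPROCS": "8",
})
print(env)  # {'RANK': '1', 'LOCAL_RANK': '1', 'WORLD_SIZE': '8'}
```

This is also why the job script defers setting `RANK` and `LOCAL_RANK` to inside `srun`: `SLURM_PROCID` and `SLURM_LOCALID` only exist per launched task, whereas `MASTER_ADDR`, `MASTER_PORT`, and `WORLD_SIZE` are identical for every rank.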
