
Commit 03fdbaa

better use of codeblocks, codeowners entry
1 parent 3dd5f2a commit 03fdbaa

File tree

2 files changed: +117, -119 lines

.github/CODEOWNERS

Lines changed: 1 addition & 0 deletions
@@ -4,3 +4,4 @@ docs/software/communication @msimberg
 docs/software/devtools/linaro @jgphpc
 docs/software/prgenv/linalg.md @finkandreas @msimberg
 docs/software/sciapps/cp2k.md @abussy @RMeli
+docs/software/ml @boeschf

docs/software/ml/pytorch.md

Lines changed: 116 additions & 119 deletions
@@ -249,29 +249,28 @@ There are two ways to access the software provided by the uenv, once it has been
 
 The simplest way to get started is to use the `default` file system view, which automatically loads all of the packages when the uenv is started.
 
-!!! example "test mpi compilers and python provided by pytorch/v2.6.0"
-    ```console
-    $ uenv start pytorch/v2.6.0:v1 --view=default # (1)!
-
-    $ which python # (2)!
-    /user-environment/env/default/bin/python
-    $ python --version
-    Python 3.13.0
-
-    $ which mpicc # (3)!
-    /user-environment/env/default/bin/mpicc
-    $ mpicc --version
-    gcc (Spack GCC) 13.3.0
-    $ gcc --version # the compiler wrapper uses the gcc provided by the uenv
-    gcc (Spack GCC) 13.3.0
-
-    $ exit # (4)!
-    ```
-
-    1. Start using the default view.
-    2. The python executable provided by the uenv is the default, and is a recent version.
-    3. The MPI compiler wrappers are also available.
-    4. Exit the uenv.
+```console title="Test mpi compilers and python provided by pytorch/v2.6.0"
+$ uenv start pytorch/v2.6.0:v1 --view=default # (1)!
+
+$ which python # (2)!
+/user-environment/env/default/bin/python
+$ python --version
+Python 3.13.0
+
+$ which mpicc # (3)!
+/user-environment/env/default/bin/mpicc
+$ mpicc --version
+gcc (Spack GCC) 13.3.0
+$ gcc --version # the compiler wrapper uses the gcc provided by the uenv
+gcc (Spack GCC) 13.3.0
+
+$ exit # (4)!
+```
+
+1. Start using the default view.
+2. The python executable provided by the uenv is the default, and is a recent version.
+3. The MPI compiler wrappers are also available.
+4. Exit the uenv.
 
 === "Spack"
 
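The `--view=default` behaviour shown above comes down to PATH precedence: the view's `bin` directory is placed ahead of the system directories, so `which python` and `which mpicc` resolve to the uenv's binaries. A minimal stand-alone sketch of that mechanism, assuming nothing about the uenv tooling itself (the fake executable and temporary directory are invented for illustration):

```python
# Sketch: executable resolution with a view-like directory prepended to PATH,
# the way --view=default prepends /user-environment/env/default/bin.
import os
import shutil
import stat
import tempfile

with tempfile.TemporaryDirectory() as view_bin:
    # Create a stand-in "python" executable inside the fake view directory.
    fake = os.path.join(view_bin, "python")
    with open(fake, "w") as f:
        f.write("#!/bin/sh\necho fake\n")
    os.chmod(fake, os.stat(fake).st_mode | stat.S_IEXEC)

    # With the view's bin directory first on PATH, lookup resolves there,
    # shadowing any system python further down the PATH.
    path = os.pathsep.join([view_bin, os.environ.get("PATH", "")])
    resolved = shutil.which("python", path=path)
    print(resolved == fake)
```

This is why `exit`ing the uenv restores the system tools: the view directory simply drops off the front of PATH.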
@@ -284,32 +283,31 @@ There are two ways to access the software provided by the uenv, once it has been
 
 Uenvs are read-only, and cannot be modified. However, it is possible to add Python packages on top of the uenv using virtual environments.
 
-!!! example "creating a virtual environment on top of the uenv"
-    ```console
-    $ uenv start pytorch/v2.6.0:v1 --view=default # (1)!
+```console title="Creating a virtual environment on top of the uenv"
+$ uenv start pytorch/v2.6.0:v1 --view=default # (1)!
 
-    $ python -m venv ./my-venv # (2)!
+$ python -m venv ./my-venv # (2)!
 
-    $ source ./my-venv/bin/activate # (3)!
+$ source ./my-venv/bin/activate # (3)!
 
-    (my-venv) $ pip install <package> # (4)!
+(my-venv) $ pip install <package> # (4)!
 
-    (my-venv) $ deactivate # (5)!
+(my-venv) $ deactivate # (5)!
 
-    $ exit # (6)!
-    ```
+$ exit # (6)!
+```
 
-    1. The `default` view is recommended, as it loads all the packages provided by the uenv.
-       This is important for PyTorch to work correctly, as it relies on the CUDA and NCCL libraries provided by the uenv.
-    2. The virtual environment is created in the current working directory, and can be activated and deactivated like any other Python virtual environment.
-    3. Activating the virtual environment will override the Python executable provided by the uenv, and use the one from the virtual environment instead.
-       This is important to ensure that the packages installed in the virtual environment are used.
-    4. The virtual environment can be used to install any Python package.
-    5. The virtual environment can be deactivated using the `deactivate` command.
-       This will restore the original Python executable provided by the uenv.
-    6. The uenv can be exited using the `exit` command or by typing `ctrl-d`.
-
-!!! note
+1. The `default` view is recommended, as it loads all the packages provided by the uenv.
+    This is important for PyTorch to work correctly, as it relies on the CUDA and NCCL libraries provided by the uenv.
+2. The virtual environment is created in the current working directory, and can be activated and deactivated like any other Python virtual environment.
+3. Activating the virtual environment will override the Python executable provided by the uenv, and use the one from the virtual environment instead.
+    This is important to ensure that the packages installed in the virtual environment are used.
+4. The virtual environment can be used to install any Python package.
+5. The virtual environment can be deactivated using the `deactivate` command.
+    This will restore the original Python executable provided by the uenv.
+6. The uenv can be exited using the `exit` command or by typing `ctrl-d`.
+
+!!! note "Squashing the virtual environment"
     Python virtual environments can be slow on the parallel Lustre file system due to the amount of small files and potentially many processes accessing it.
    If this becomes a bottleneck, consider [squashing the venv][ref-guides-storage-venv] into its own memory-mapped, read-only file system to enhance scalability and reduce load times.

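The venv workflow in the hunk above can be exercised with nothing but the Python standard library. The sketch below (paths are invented for illustration) mirrors what `python -m venv ./my-venv` produces; on a uenv, the base interpreter recorded in `pyvenv.cfg` would be the python from the default view:

```python
# Sketch: create a virtual environment programmatically, like `python -m venv`.
import os
import tempfile
import venv

with tempfile.TemporaryDirectory() as tmp:
    env_dir = os.path.join(tmp, "my-venv")
    # with_pip=False keeps the sketch fast; the docs' `python -m venv` bundles pip.
    venv.EnvBuilder(with_pip=False).create(env_dir)

    # pyvenv.cfg records the base interpreter the venv layers on top of.
    has_cfg = os.path.isfile(os.path.join(env_dir, "pyvenv.cfg"))

    # bin/activate is the script that `source ./my-venv/bin/activate` reads;
    # it puts the venv's bin directory first on PATH, overriding the uenv python.
    bindir = "Scripts" if os.name == "nt" else "bin"
    has_activate = os.path.isfile(os.path.join(env_dir, bindir, "activate"))

print(has_cfg, has_activate)
```

Because the venv only layers on top of the base interpreter, packages such as CUDA-enabled PyTorch from the uenv remain importable unless shadowed by a venv install.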
@@ -318,80 +316,79 @@ However, this workflow is more involved and intended for advanced Spack users.
 
 ## Running PyTorch jobs with SLURM
 
-!!! example "slurm sbatch script"
-    ```bash
-    #!/bin/bash
-    #SBATCH --job-name=myjob
-    #SBATCH --nodes=1
-    #SBATCH --ntasks-per-node=4
-    #SBATCH --cpus-per-task=72
-    #SBATCH --time=00:30:00
-    # (1)!
-    #SBATCH --uenv=pytorch/v2.6.0:/user-environment
-    #SBATCH --view=default
-
-    #################################
-    # OpenMP environment variables #
-    #################################
-    export OMP_NUM_THREADS=8 # (2)!
-
-    #################################
-    # PyTorch environment variables #
-    #################################
-    export MASTER_ADDR=$(hostname) # (3)!
-    export MASTER_PORT=6000
-    export WORLD_SIZE=$SLURM_NPROCS
-    export TORCH_NCCL_ASYNC_ERROR_HANDLING=1 # (4)!
-    export TRITON_HOME=/dev/shm/ # (5)!
-
-    #################################
-    # MPICH environment variables #
-    #################################
-    export MPICH_GPU_SUPPORT_ENABLED=0 # (6)!
-
-    #################################
-    # CUDA environment variables #
-    #################################
-    export CUDA_CACHE_DISABLE=1 # (7)!
-
-    ############################################
-    # NCCL and Fabric environment variables #
-    ############################################
-    export NCCL_NET="AWS Libfabric" # (8)!
-    export NCCL_NET_GDR_LEVEL=PHB
-    export NCCL_CROSS_NIC=1
-    export FI_CXI_DISABLE_HOST_REGISTER=1
-    export FI_MR_CACHE_MONITOR=userfaultfd
-    export FI_CXI_DEFAULT_CQ_SIZE=131072
-    export FI_CXI_DEFAULT_TX_SIZE=32768
-    export FI_CXI_RX_MATCH_MODE=software
-
-    # (9)!
-    # (10)!
-    srun bash -c "
-    export RANK=\$SLURM_PROCID
-    export LOCAL_RANK=\$SLURM_LOCALID
-    . ./my-venv/bin/activate
-    python myscript.py
-    "
-    ```
+```bash title="Slurm sbatch script"
+#!/bin/bash
+#SBATCH --job-name=myjob
+#SBATCH --nodes=1
+#SBATCH --ntasks-per-node=4
+#SBATCH --cpus-per-task=72
+#SBATCH --time=00:30:00
+# (1)!
+#SBATCH --uenv=pytorch/v2.6.0:/user-environment
+#SBATCH --view=default
+
+#################################
+# OpenMP environment variables #
+#################################
+export OMP_NUM_THREADS=8 # (2)!
+
+#################################
+# PyTorch environment variables #
+#################################
+export MASTER_ADDR=$(hostname) # (3)!
+export MASTER_PORT=6000
+export WORLD_SIZE=$SLURM_NPROCS
+export TORCH_NCCL_ASYNC_ERROR_HANDLING=1 # (4)!
+export TRITON_HOME=/dev/shm/ # (5)!
+
+#################################
+# MPICH environment variables #
+#################################
+export MPICH_GPU_SUPPORT_ENABLED=0 # (6)!
+
+#################################
+# CUDA environment variables #
+#################################
+export CUDA_CACHE_DISABLE=1 # (7)!
+
+############################################
+# NCCL and Fabric environment variables #
+############################################
+export NCCL_NET="AWS Libfabric" # (8)!
+export NCCL_NET_GDR_LEVEL=PHB
+export NCCL_CROSS_NIC=1
+export FI_CXI_DISABLE_HOST_REGISTER=1
+export FI_MR_CACHE_MONITOR=userfaultfd
+export FI_CXI_DEFAULT_CQ_SIZE=131072
+export FI_CXI_DEFAULT_TX_SIZE=32768
+export FI_CXI_RX_MATCH_MODE=software
+
+# (9)!
+# (10)!
+srun bash -c "
+export RANK=\$SLURM_PROCID
+export LOCAL_RANK=\$SLURM_LOCALID
+. ./my-venv/bin/activate
+python myscript.py
+"
+```
+
+1. The `--uenv` option is used to specify the uenv to use for the job.
+    The `--view=default` option is used to load all the packages provided by the uenv.
+2. Set `OMP_NUM_THREADS` if you are using OpenMP in your code.
+    The number of threads should be no greater than the number of cores per task (`$SLURM_CPUS_PER_TASK`).
+    The optimal number depends on the workload and should be determined by testing.
+    Consider, for example, that typical workloads using PyTorch may fork processes, so the number of threads should be around the number of cores per task divided by the number of processes.
+3. These variables are used by PyTorch to initialize the distributed backend.
+    The `MASTER_ADDR` and `MASTER_PORT` variables are used to determine the address and port of the master node.
+    Additionally we also need `RANK` and `LOCAL_RANK`, but these must be set per-process, see below.
+4. Enable more graceful exception handling, see the [PyTorch documentation](https://pytorch.org/docs/stable/torch_nccl_environment_variables.html).
+5. Set the Triton home to a local path (e.g. `/dev/shm`) to avoid writing to the (distributed) file system.
+    This is important for performance, as writing to the Lustre file system can be slow due to the amount of small files and potentially many processes accessing it.
+6. Disable GPU support in MPICH, as it [can lead to deadlocks](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/mpi.html#inter-gpu-communication-with-cuda-aware-mpi) when used together with NCCL.
+7. Avoid writing JITed binaries to the (distributed) file system, which could lead to performance issues.
+8. These variables should always be set for correctness and optimal performance when using NCCL, see [the detailed explanation][ref-communication-nccl].
+9. `RANK` and `LOCAL_RANK` are set per-process by the SLURM job launcher.
+10. Activate the virtual environment created on top of the uenv (if any).
+    Please follow the guidelines for [python virtual environments with uenv][ref-guides-storage-venv] to enhance scalability and reduce load times.
 
-    1. The `--uenv` option is used to specify the uenv to use for the job.
-       The `--view=default` option is used to load all the packages provided by the uenv.
-    2. Set `OMP_NUM_THREADS` if you are using OpenMP in your code.
-       The number of threads should be not greater than the number of cores per task (`$SLURM_CPUS_PER_TASK`).
-       The optimal number depends on the workload and should be determined by testing.
-       Consider for example that typical workloads using PyTorch may fork the processes, so the number of threads should be around the number of cores per task divided by the number of processes.
-    3. These variables are used by PyTorch to initialize the distributed backend.
-       The `MASTER_ADDR` and `MASTER_PORT` variables are used to determine the address and port of the master node.
-       Additionally we also need `RANK` and `LOCAL_RANK` but these must be set per-process, see below.
-    4. Enable more graceful exception handling, see [PyTorch documentation](https://pytorch.org/docs/stable/torch_nccl_environment_variables.html)
-    5. Set the Trition home to a local path (e.g. `/dev/shm`) to avoid writing to the (distributed) file system.
-       This is important for performance, as writing to the Lustre file system can be slow due to the amount of small files and potentially many processes accessing it.
-    6. Disable GPU support in MPICH, as it [can lead to deadlocks](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/mpi.html#inter-gpu-communication-with-cuda-aware-mpi) when using together with nccl.
-    6. Avoid writing JITed binaries to the (distributed) file system, which could lead to performance issues.
-    7. These variables should always be set for correctness and optimal performance when using NCCL, see [the detailed explanation][ref-communication-nccl].
-    8. `RANK` and `LOCAL_RANK` are set per-process by the SLURM job launcher.
-    10. Activate the virtual environment created on top of the uenv (if any).
-        Please follow the guidelines for [python virtual environments with uenv][ref-guides-storage-venv] to enhance scalability and reduce load times.
-
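The per-process bookkeeping done inside the `srun bash -c` wrapper, and the thread-count rule of thumb from the annotations, can be sketched as plain Python. The helper names below are invented for illustration and are not part of the script or of any SLURM/PyTorch API:

```python
# Sketch: derive the variables PyTorch's distributed backend expects from the
# per-process variables SLURM sets, and apply the OMP_NUM_THREADS rule of thumb.

def torch_rank_env(environ):
    """Map SLURM per-process variables to the names PyTorch expects."""
    return {
        "RANK": environ["SLURM_PROCID"],        # global rank across all tasks
        "LOCAL_RANK": environ["SLURM_LOCALID"], # rank within the node
        "WORLD_SIZE": environ["SLURM_NPROCS"],  # total number of tasks
    }


def omp_threads(cpus_per_task, procs_per_task=1):
    """Rule of thumb: threads <= cores per task / forked worker processes."""
    return max(1, cpus_per_task // procs_per_task)


# Example: global rank 5, second process on its node, 4 tasks in the job.
fake = {"SLURM_PROCID": "5", "SLURM_LOCALID": "1", "SLURM_NPROCS": "4"}
env = torch_rank_env(fake)
print(env["RANK"], env["LOCAL_RANK"])  # 5 1
print(omp_threads(72, 9))              # 72 cores / ~9 forked workers -> 8
```

This mirrors why `RANK` and `LOCAL_RANK` must be exported inside the `srun` wrapper rather than in the sbatch prologue: `SLURM_PROCID` and `SLURM_LOCALID` only take their per-task values once the launcher has started each process.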