Commit 419b286

adhere to text formatting guidelines, added: triton home directory to /dev/shm
1 parent 1fe9ab6 commit 419b286

File tree

2 files changed: +45 −79 lines changed

docs/software/ml/index.md

Lines changed: 17 additions & 33 deletions

````diff
@@ -3,67 +3,51 @@
 ## Overview
 
-CSCS supports a wide range of machine learning (ML) applications and frameworks
-on its systems. Most ML workloads are containerized to ensure portability,
-reproducibility, and ease of use across environments.
+CSCS supports a wide range of machine learning (ML) applications and frameworks on its systems.
+Most ML workloads are containerized to ensure portability, reproducibility, and ease of use across environments.
 
-Users can choose between running containers, using provided uenv software
-stacks, or building custom Python environments tailored to their needs.
+Users can choose between running containers, using provided uenv software stacks, or building custom Python environments tailored to their needs.
 
 ## Running Machine Learning Applications with Containers
 
-Containerization is the recommended approach for ML workloads on Alps, as it
-simplifies software management and maximizes compatibility with other systems.
+Containerization is the recommended approach for ML workloads on Alps, as it simplifies software management and maximizes compatibility with other systems.
 
 * CSCS does not provide ready-to-use ML container images
-* Users are encouraged to build their own containers, starting from popular
-  sources such as the [Nvidia NGC
-  Catalog](https://catalog.ngc.nvidia.com/containers)
+* Users are encouraged to build their own containers, starting from popular sources such as the [Nvidia NGC Catalog](https://catalog.ngc.nvidia.com/containers)
 
 Helpful references:
 
 * Running containers on Alps: [Container Engine Guide][ref-container-engine]
-* Building custom container images: [Container Build
-  Guide][ref-build-containers]
+* Building custom container images: [Container Build Guide][ref-build-containers]
 
 ## Using Provided uenv Software Stacks
 
-Alternatively, CSCS provides pre-configured software stacks ([uenvs][ref-uenv])
-that can serve as a starting point for machine learning projects. These
-environments provide optimized compilers, libraries, and selected ML
-frameworks.
+Alternatively, CSCS provides pre-configured software stacks ([uenvs][ref-uenv]) that can serve as a starting point for machine learning projects.
+These environments provide optimized compilers, libraries, and selected ML frameworks.
 
 Available ML-related uenvs:
 
-* [PyTorch][ref-uenv-pytorch] — available on [Clariden][ref-cluster-clariden]
-  and [Daint][ref-cluster-daint]
+* [PyTorch][ref-uenv-pytorch] — available on [Clariden][ref-cluster-clariden] and [Daint][ref-cluster-daint]
 
-To extend these environments with additional Python packages, it is recommended
-to create a Python Virtual Environment (venv). See this [PyTorch venv
-example][ref-uenv-pytorch-venv] for details.
+To extend these environments with additional Python packages, it is recommended to create a Python Virtual Environment (venv).
+See this [PyTorch venv example][ref-uenv-pytorch-venv] for details.
 
 !!! note
-    While many Python packages provide pre-built binaries for common
-    architectures, some may require building from source.
+    While many Python packages provide pre-built binaries for common architectures, some may require building from source.
 
 ## Building Custom Python Environments
 
-Users may also choose to build entirely custom software stacks using Python
-package managers such as `pip` or `conda`. Most ML libraries are available via
-the [Python Package Index (PyPI)](https://pypi.org/).
+Users may also choose to build entirely custom software stacks using Python package managers such as `pip` or `conda`.
+Most ML libraries are available via the [Python Package Index (PyPI)](https://pypi.org/).
 
-To ensure optimal performance on CSCS systems, we recommend starting from an
-environment that already includes:
+To ensure optimal performance on CSCS systems, we recommend starting from an environment that already includes:
 
 * CUDA, cuDNN
 * MPI, NCCL
 * C/C++ compilers
 
 This can be achieved either by:
 
-* Building a [custom container image][ref-build-containers] based on a suitable
-  ML-ready base image.
-* Starting from a provided uenv (e.g., [PrgEnv GNU][ref-uenv-prgenv-gnu] or
-  [PyTorch uenv][ref-uenv-pytorch]) and extending it with a virtual
-  environment.
+* Building a [custom container image][ref-build-containers] based on a suitable ML-ready base image.
+* Starting from a provided uenv (e.g., [PrgEnv GNU][ref-uenv-prgenv-gnu] or [PyTorch uenv][ref-uenv-pytorch]) and extending it with a virtual environment.
````
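The uenv-plus-venv workflow referenced in the changed text can be sketched as follows. This is an illustrative example, not taken from the CSCS docs: the venv path is arbitrary, and `--system-site-packages` is an assumption that makes the uenv's pre-installed packages visible from inside the venv.

```shell
# Hypothetical sketch: extend an already-started uenv with a venv.
# /tmp/ml-venv is an arbitrary path; on a cluster you would pick a
# project or scratch directory (or a squashed image, per the docs).
python3 -m venv /tmp/ml-venv --system-site-packages
. /tmp/ml-venv/bin/activate
# Extra `pip install`s now land in the venv, not the read-only uenv.
python -c 'import sys; print(sys.prefix)'   # prints /tmp/ml-venv
```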

docs/software/ml/pytorch.md

Lines changed: 28 additions & 46 deletions

````diff
@@ -1,10 +1,8 @@
 [](){#ref-uenv-pytorch}
 # PyTorch
 
-The PyTorch software stack was designed with the intention of being able to run
-[Megatron-LM](https://github.com/NVIDIA/Megatron-LM)-based pre-training
-workloads out of the box. Thus, it comes with batteries included and does not
-just provide the bare [PyTorch framework](https://github.com/pytorch/pytorch).
+The PyTorch software stack was designed with the intention of being able to run [Megatron-LM](https://github.com/NVIDIA/Megatron-LM)-based pre-training workloads out of the box.
+Thus, it comes with batteries included and does not just provide the bare [PyTorch framework](https://github.com/pytorch/pytorch).
 
 !!! note "uenv"
 
@@ -249,8 +247,7 @@ There are two ways to access the software provided by the uenv, once it has been
 === "the default view"
 
-    The simplest way to get started is to use the `default` file system view,
-    which automatically loads all of the packages when the uenv is started.
+    The simplest way to get started is to use the `default` file system view, which automatically loads all of the packages when the uenv is started.
 
     !!! example "test mpi compilers and python provided by pytorch/v2.6.0"
         ```console
@@ -278,9 +275,7 @@ There are two ways to access the software provided by the uenv, once it has been
 === "Spack"
 
-    The pytorch uenv can also be used as a base for building software with
-    Spack, because it provides compilers, MPI, Python and common packages like
-    HDF5.
+    The pytorch uenv can also be used as a base for building software with Spack, because it provides compilers, MPI, Python and common packages like HDF5.
 
     [Check out the guide for using Spack with uenv][ref-building-uenv-spack].
````
````diff
@@ -315,17 +310,11 @@ Uenvs are read-only, and cannot be modified. However, it is possible to add Pyth
 6. The uenv can be exited using the `exit` command or by typing `ctrl-d`.
 
 !!! note
-    Python virtual environments can be slow on the parallel Lustre file system
-    due to the amount of small files and potentially many processes accessing
-    it. If this becomes a bottleneck, consider [squashing the
-    venv][ref-guides-storage-venv] into its own memory-mapped, read-only file
-    system to enhance scalability and reduce load times.
-
-    Alternatively one can use the uenv as [upstream Spack
-    instance][ref-building-uenv-spack] to to add both Python and non-Python
-    packages. However, this workflow is more involved and intended for advanced
-    Spack users.
+    Python virtual environments can be slow on the parallel Lustre file system due to the number of small files and potentially many processes accessing them.
+    If this becomes a bottleneck, consider [squashing the venv][ref-guides-storage-venv] into its own memory-mapped, read-only file system to enhance scalability and reduce load times.
 
+    Alternatively, one can use the uenv as an [upstream Spack instance][ref-building-uenv-spack] to add both Python and non-Python packages.
+    However, this workflow is more involved and intended for advanced Spack users.
 
 ## Running PyTorch jobs with SLURM
````
````diff
@@ -337,7 +326,8 @@ Spack users.
 #SBATCH --ntasks-per-node=4
 #SBATCH --cpus-per-task=72
 #SBATCH --time=00:30:00
-#SBATCH --uenv=pytorch/v2.6.0:/user-environment # (1)!
+# (1)!
+#SBATCH --uenv=pytorch/v2.6.0:/user-environment
 #SBATCH --view=default
 
 #################################
````
````diff
@@ -352,21 +342,22 @@ Spack users.
 export MASTER_PORT=6000
 export WORLD_SIZE=$SLURM_NPROCS
 export TORCH_NCCL_ASYNC_ERROR_HANDLING=1 # (4)!
+export TRITON_HOME=/dev/shm/ # (5)!
 
 #################################
 # MPICH environment variables #
 #################################
-export MPICH_GPU_SUPPORT_ENABLED=0 # (5)!
+export MPICH_GPU_SUPPORT_ENABLED=0 # (6)!
 
 #################################
 # CUDA environment variables #
 #################################
-export CUDA_CACHE_DISABLE=1 # (6)!
+export CUDA_CACHE_DISABLE=1 # (7)!
 
 ############################################
 # NCCL and Fabric environment variables #
 ############################################
-export NCCL_NET="AWS Libfabric" # (7)!
+export NCCL_NET="AWS Libfabric" # (8)!
 export NCCL_NET_GDR_LEVEL=PHB
 export NCCL_CROSS_NIC=1
 export FI_CXI_DISABLE_HOST_REGISTER=1
````
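The two JIT-cache exports added or renumbered in this hunk (the script's `(5)!` and `(7)!` annotations) can be tried stand-alone; this is a minimal illustrative sketch, with values copied from the script above:

```shell
# Keep JIT caches off the shared (Lustre) file system.
export TRITON_HOME=/dev/shm/      # Triton cache in node-local memory
export CUDA_CACHE_DISABLE=1       # disable the on-disk CUDA JIT cache
# Both are typically read when the framework first JIT-compiles a kernel,
# so they must be exported before the training process starts.
echo "$TRITON_HOME $CUDA_CACHE_DISABLE"   # prints /dev/shm/ 1
```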
````diff
@@ -375,8 +366,8 @@ Spack users.
 export FI_CXI_DEFAULT_TX_SIZE=32768
 export FI_CXI_RX_MATCH_MODE=software
 
-# (8)!
 # (9)!
+# (10)!
 srun bash -c "
     export RANK=\$SLURM_PROCID
     export LOCAL_RANK=\$SLURM_LOCALID
@@ -385,28 +376,19 @@ Spack users.
 "
 ```
 
-1. The `--uenv` option is used to specify the uenv to use for the job. The
-   `--view=default` option is used to load all the packages provided by the
-   uenv.
+1. The `--uenv` option is used to specify the uenv to use for the job.
+   The `--view=default` option is used to load all the packages provided by the uenv.
 2. Only set `OMP_NUM_THREADS` if you are using OpenMP in your code.
-3. These variables are used by PyTorch to initialize the distributed
-   backend. The `MASTER_ADDR` and `MASTER_PORT` variables are used to
-   determine the address and port of the master node. Additionally we also
-   need `RANK` and `LOCAL_RANK` but these must be set per-process, see
-   below.
-4. Enable more graceful exception handling, see [PyTorch
-   documentation](https://pytorch.org/docs/stable/torch_nccl_environment_variables.html)
-5. Disable GPU support in MPICH, as it [can lead to
-   deadlocks](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/mpi.html#inter-gpu-communication-with-cuda-aware-mpi)
-   when using together with nccl.
-6. Avoid writing JITed binaries to the (distributed) file system, which
-   could lead to performance issues.
-7. These variables should always be set for correctness and optimal
-   performance when using NCCL, see [the detailed
-   explanation][ref-communication-nccl].
-8. `RANK` and `LOCAL_RANK` are set per-process by the SLURM job launcher.
-9. Activate the virtual environment created on top of the uenv (if any).
-   Please follow the guidelines for [python virtual environments with
-   uenv][ref-guides-storage-venv] to enhance scalability and reduce load
-   times.
+3. These variables are used by PyTorch to initialize the distributed backend.
+   The `MASTER_ADDR` and `MASTER_PORT` variables are used to determine the address and port of the master node.
+   Additionally, we also need `RANK` and `LOCAL_RANK`, but these must be set per-process; see below.
+4. Enable more graceful exception handling; see the [PyTorch documentation](https://pytorch.org/docs/stable/torch_nccl_environment_variables.html).
+5. Set the Triton home to a local path (e.g. `/dev/shm`) to avoid writing to the (distributed) file system.
+   This is important for performance, as writing to the Lustre file system can be slow due to the number of small files and potentially many processes accessing it.
+6. Disable GPU support in MPICH, as it [can lead to deadlocks](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/mpi.html#inter-gpu-communication-with-cuda-aware-mpi) when used together with NCCL.
+7. Avoid writing JITed binaries to the (distributed) file system, which could lead to performance issues.
+8. These variables should always be set for correctness and optimal performance when using NCCL; see [the detailed explanation][ref-communication-nccl].
+9. `RANK` and `LOCAL_RANK` are set per-process by the SLURM job launcher.
+10. Activate the virtual environment created on top of the uenv (if any).
+    Please follow the guidelines for [python virtual environments with uenv][ref-guides-storage-venv] to enhance scalability and reduce load times.
````
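The per-process `export` block inside the `srun bash -c` command can be exercised stand-alone; in this sketch the `SLURM_*` values are made-up stand-ins for what the launcher would set for one task of an eight-task job:

```shell
# Stand-in launcher values; in a real job srun sets these per process.
export SLURM_PROCID=5 SLURM_LOCALID=1 SLURM_NPROCS=8

# The same mapping the script performs: PyTorch's default env:// rendezvous
# reads RANK / LOCAL_RANK / WORLD_SIZE (plus MASTER_ADDR / MASTER_PORT).
export RANK=$SLURM_PROCID         # global rank across all nodes
export LOCAL_RANK=$SLURM_LOCALID  # rank within this node (often the GPU index)
export WORLD_SIZE=$SLURM_NPROCS   # total number of processes

echo "$RANK $LOCAL_RANK $WORLD_SIZE"   # prints 5 1 8
```

With these variables in place, `torch.distributed.init_process_group(backend="nccl")` can typically initialize without explicit rank arguments.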
