
Commit d438dcd

Integrating Fabian's feedback, updating code owners
1 parent 6d22e8d commit d438dcd

5 files changed: +77 -33 lines changed

.github/CODEOWNERS

Lines changed: 1 addition & 1 deletion
@@ -9,6 +9,6 @@ docs/software/prgenv/linalg.md @finkandreas @msimberg
  docs/software/sciapps/cp2k.md @abussy @RMeli
  docs/software/sciapps/lammps.md @nickjbrowning
  docs/software/sciapps/gromacs.md @kanduri
- docs/software/ml @boeschf
+ docs/software/ml @boeschf @henrique @lukasgd
  docs/storage @mpasserini
  docs/alps/storage.md @mpasserini

docs/access/jupyterlab.md

Lines changed: 1 addition & 1 deletion
@@ -86,7 +86,7 @@ If the default base images do not meet your requirements, you can specify a cust
  3. Currently only required on Daint and Santis, not on Clariden
  4. Set working directory of Jupyter session (file browser root directory)
  5. Use environment settings for optimized communication
- 6. Disable CUDA JIT cache
+ 6. Avoid writing JITed binaries to the (distributed) file system, which could lead to performance issues.
  7. Async error handling when an exception is observed in NCCL watchdog: aborting NCCL communicator and tearing down process upon error
  8. Disable GPU support in MPICH, as it can lead to deadlocks when using together with NCCL

docs/software/ml/pytorch.md

Lines changed: 67 additions & 26 deletions
@@ -33,7 +33,7 @@ For most applications, the [PyTorch NGC container](https://catalog.ngc.nvidia.co

  ### Define Container Runtime Environment

- Having built and imported a container image with podman and enroot, the next step is to configure the runtime environment with an environment definition file (EDF). In particular, this includes specifying the image, any directories mounted and a working directory to for the processes in the container to start in as in the [quickstart examples for CE][ref-container-engine].
+ Having built and imported a container image with podman and enroot, the next step is to configure the runtime environment with an environment definition file (EDF). In particular, this includes specifying the image, any directories mounted from the host and a working directory for the process in the container to start in as in the [quickstart examples for CE][ref-container-engine].

  Apart from this, there are specific features relevant for machine learning made available through [annotations][ref-ce-annotations], which customize the container at runtime.
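
For illustration, a minimal EDF and its use could look as follows. This is a sketch only: the image path, mount list and EDF location are placeholders, and the `--environment` flag is assumed to be available as described in the CE quickstart examples.

```bash
# Sketch: write a minimal EDF (TOML) and start an interactive shell with it.
mkdir -p "$HOME/.edf"
cat > "$HOME/.edf/ngc-pytorch.toml" <<'EOF'
image = "/capstor/scratch/cscs/<user>/ngc-pytorch+25.01.sqsh"  # image imported with enroot (placeholder path)
mounts = ["/capstor", "/iopsstor"]                             # host directories visible inside the container
workdir = "/capstor/scratch/cscs/<user>"                       # directory the container processes start in
EOF

# Assumes the Container Engine Slurm integration provides --environment
srun --environment="$HOME/.edf/ngc-pytorch.toml" --pty bash
```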

@@ -68,8 +68,8 @@ MPICH_GPU_SUPPORT_ENABLED = "0" # (8)!
  2. The path `/users` is not mounted as a whole since it often contains user-specific initialization scripts for the host environment and many frameworks leave temporary data behind that can lead to non-trivial runtime errors when swapping container images. Thus, it is recommended to selectively mount specific subfolders under `${HOME}` if needed.
  3. You can use `${PWD}` as an alternative to use the path submitted from when the container is started
  4. This enables NCCL installed in the container to make effective use of the Slingshot interconnect on Alps by interfacing with the [AWS OFI NCCL plugin][ref-ce-aws-ofi-hook]. While not strictly needed for single node workloads, it is good practice to keep it always on.
- 5. This makes NCCL output debug info during initialization, which can be useful to spot communication-related issues in a distributed scenario (see later tutorials). Subsystems with debug log can be configured with `NCCL_DEBUG_SUBSYS`.
- 6. Disable CUDA JIT cache
+ 5. This makes NCCL output debug info during initialization, which can be useful to spot communication-related issues in a distributed scenario. Subsystems with debug log can be configured with [`NCCL_DEBUG_SUBSYS`](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-debug-subsys).
+ 6. Avoid writing JITed binaries to the (distributed) file system, which could lead to performance issues.
  7. Async error handling when an exception is observed in NCCL watchdog: aborting NCCL communicator and tearing down process upon error
  8. Disable GPU support in MPICH, as it can lead to deadlocks when using together with NCCL
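
As a sketch, the annotated settings above correspond to environment variables along these lines (the `NCCL_DEBUG_SUBSYS` value is only an example selection):

```bash
export NCCL_DEBUG=INFO                    # print NCCL debug info during initialization
export NCCL_DEBUG_SUBSYS=INIT,NET         # example subsystem selection for the debug log
export CUDA_CACHE_DISABLE=1               # avoid writing JITed binaries to the (distributed) file system
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1  # abort the NCCL communicator and tear down the process on watchdog errors
export MPICH_GPU_SUPPORT_ENABLED=0        # disable GPU support in MPICH to avoid deadlocks with NCCL
```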

@@ -93,14 +93,15 @@ MPICH_GPU_SUPPORT_ENABLED = "0" # (8)!

  1. Enable Slurm commands (together with two subsequent mounts)

- !!! note "Best practice for large-scale jobs"
+ !!! note "Best practice for production jobs"

-     For stability and reproducibility, use self-contained containers for large scale jobs. Using code mounted from the distributed filesystem may leave compiled artefacts behind that can result in unintentional runtime errors when e.g. swapping the container image. In particular, it is recommended to avoid mounting all of `$HOME`, so that environments are properly isolated and e.g. the Triton cache (that by default ends up in `$HOME/.triton`) resides in an ephemeral location of the filesystem.
+     For stability and reproducibility, use self-contained containers for production jobs. Using code mounted from the distributed filesystem may leave compiled artefacts behind that can result in unintentional runtime errors when e.g. swapping the container image. In particular, it is recommended to avoid mounting all of `$HOME`, so that environments are properly isolated and e.g. the Triton cache (that by default ends up in `$HOME/.triton`) resides in an ephemeral location of the filesystem.

  !!! note "Collaborating in Git"

      For reproducibility, it is recommended to always track the Dockerfile, EDF and an optional virtual environment specification alongside your application code in a Git repository.

+ [](){#ref-ce-pytorch-venv}
  ### (Optionally) extend container with virtual environment

  While production jobs should include as many dependencies as possible in the container image, during development it can be convenient to manage frequently changing packages in a virtual environment built on top of the container image. This can include both dependencies and actively developed packages (that can be installed in editable mode with `pip install -e .`).
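
A minimal sketch of such a virtual environment on top of the container image follows; the paths and names are placeholders, and `--system-site-packages` is assumed so that the packages shipped in the container remain visible:

```bash
# Run inside the running container; paths and package names are placeholders.
python -m venv --system-site-packages /capstor/scratch/cscs/<user>/venv-ngc-pt25.01
source /capstor/scratch/cscs/<user>/venv-ngc-pt25.01/bin/activate
pip install -e .        # install the actively developed package in editable mode
pip install <package>   # or add a frequently changing dependency
```
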
@@ -186,28 +187,57 @@ For further details on execution logic, job monitoring and data management, plea

  * The argument `--ddp-bucket-size` controls the level of grouping of many small data-parallel communications into bigger ones and setting it to a high value can improve throughput (model-dependent, e.g. `10000000000`).

- * If in doubt about communication performance with NCCL at scale, use [nccl-tests](https://github.com/NVIDIA/nccl-tests) with the relevant communication patterns to check if scaling behavior can be reproduced.
+ * If in doubt about communication performance with NCCL at scale, use the [`NCCL_DEBUG`](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-debug) environment variable to validate that the aws-ofi-nccl plugin has been properly initialized and libfabric was recognized (further subsystems can be monitored with [`NCCL_DEBUG_SUBSYS`](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-debug-subsys)). If the issue persists, use [nccl-tests](https://github.com/NVIDIA/nccl-tests) with the relevant communication patterns to check if the scaling behavior can be reproduced and contact CSCS support.

  Additionally, consider the **best practice for checkpointing and data management**:

- * Following the advice on [filesystems][ref-storage-fs], write checkpoints (sequential write) to `/capstor/scratch` and place randomly accessed training data (many small random reads) on `/iopsstor/scratch`. Use the [data transfer instructions][ref-data-xfer] to move data to/from `/capstor/store`. Make sure to apply recommended [Lustre settings][ref-guides-storage-lustre] on all directories containing significant amount of data, including those containing container images and managed by other tools (e.g. the HuggingFace cache, see [`HF_HOME`](https://huggingface.co/docs/huggingface_hub/en/package_reference/environment_variables#hfhome) in the [this tutorial][software-ml-llm-inference-tutorial]).
+ * Following the advice on [filesystems][ref-storage-fs], write checkpoints (sequential write) to `/capstor/scratch` and place randomly accessed training data (many small random reads) on `/iopsstor/scratch`. Use the [data transfer instructions][ref-data-xfer] to move data to/from `/capstor/store`. Make sure to apply recommended [Lustre settings][ref-guides-storage-lustre] on all directories containing significant amounts of data, including those containing container images and managed by other tools (e.g. the HuggingFace cache, see [`HF_HOME`](https://huggingface.co/docs/huggingface_hub/en/package_reference/environment_variables#hfhome) in [this tutorial][software-ml-llm-inference-tutorial]). In case your workload continues to be limited by filesystem performance, contact CSCS support.
+
+ * Regularly adjust checkpoint writing intervals to the current overhead induced by writing a checkpoint ($T_1$) and the mean time between job failures ($T_2$). As a first-order approximation, use a checkpointing interval of $\sqrt{2 T_1 T_2}$ (derived by [Young](https://doi.org/10.1145/361147.361115) and [Daly](https://doi.org/10.1016/j.future.2004.11.016)); see the worked example below.

- * Regularly adjust checkpoint writing intervals to the overhead induced by writing a checkpoint ($T_1$) and the mean time between job failures ($T_2$). As a first order approximation use a checkpointing interval of $\sqrt{2 T_1 T_2}$ (derived by [Young](https://doi.org/10.1145/361147.361115) and [Daly](https://doi.org/10.1016/j.future.2004.11.016)).
+ * Avoid activities that put excessive load on third party services (such as web scraping or bulk downloads) in line with the [guidelines on Internet Access on Alps][ref-guides-internet-access-ext].
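
For example (with purely illustrative numbers): a checkpoint write overhead of $T_1 = 3$ min and a mean time between job failures of $T_2 = 24$ h $= 1440$ min give a suggested checkpointing interval of $\sqrt{2 \cdot 3 \cdot 1440\ \text{min}^2} \approx 93$ min.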

  Adjust for **cluster availability**:

  * Submit your jobs with a Slurm time limit compatible with reservations (such as maintenance windows, cf. `scontrol show res`) to be able to get scheduled.

+ ??? info "Debugging segmentation faults"
+     Application crashes with segmentation faults can be investigated by inspecting core dump files that contain an image of the process memory at the time of the crash. For this purpose, you can load the core dump file with `cuda-gdb` installed in the container and look at the stack trace with `bt`. Note that in order to generate core dump files, the line `ulimit -c 0` must be commented out in the above sbatch script.
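
A sketch of that workflow (file names are hypothetical; the core file name depends on the system's `core_pattern`):

```bash
# In the sbatch script: comment out `ulimit -c 0` so a core dump is written on a crash.
# Afterwards, from a shell inside the container:
cuda-gdb python core.244443    # executable that crashed plus the core dump file (hypothetical names)
# (cuda-gdb) bt                # print the stack trace at the time of the crash
```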
+
+ ### Known Issues
+
+ ??? info "Errors hidden by failures in UCX signal handler"
+     Application errors may trigger the UCX signal handler in the NGC container, which has caused secondary failures in the past, shadowing the initial error trace. These secondary failures may be significantly harder to fix than the initial problem.
+
+     An example is the following trace from the NGC PyTorch 25.01 container with Megatron-LM:
+     ```console
+     640: [nid007306:244443:0:244443] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x455)
+     640: ==== backtrace (tid: 244443) ====
+     640: 0 /opt/hpcx/ucx/lib/libucs.so.0(ucs_handle_error+0x2cc) [0x4000d2b214dc]
+     640: 1 /opt/hpcx/ucx/lib/libucs.so.0(+0x3168c) [0x4000d2b2168c]
+     640: 2 /opt/hpcx/ucx/lib/libucs.so.0(+0x319b8) [0x4000d2b219b8]
+     640: 3 linux-vdso.so.1(__kernel_rt_sigreturn+0) [0x4000347707dc]
+     640: 4 /usr/local/cuda/lib64/libnvrtc.so.12.8.61(+0x935000) [0x400140a25000]
+     640: 5 [0x3d5c5e58]
+     640: =================================
+     srun: error: nid007306: task 640: Segmentation fault
+     srun: Terminating StepId=348680.1
+     ```
+     In this case, the segmentation fault in the UCX signal handler (`ucs_handle_error`) was due to a broken NVRTC in the container. However, to obtain the trace of the initial error (which was unrelated), it was necessary to disable the UCX signal handler by setting the following environment variable in the sbatch script:
+     ```bash
+     export UCX_HANDLE_ERRORS=none
+     ```
+

  [](){#ref-uenv-pytorch}
  ## Running PyTorch with a uenv

- The PyTorch software stack was designed with the intention of being able to run [Megatron-LM](https://github.com/NVIDIA/Megatron-LM)-based pre-training workloads out of the box.
+ The PyTorch uenv software stack was designed with the intention of being able to run [Megatron-LM](https://github.com/NVIDIA/Megatron-LM)-based pre-training workloads out of the box.
  Thus, it comes with batteries included and does not just provide the bare [PyTorch framework](https://github.com/pytorch/pytorch).

  !!! note "uenv"

-     [PyTorch][ref-uenv-pytorch] is provided via [uenv][ref-uenv].
+     The [PyTorch uenv][ref-uenv-pytorch] is provided via the tool [uenv][ref-uenv].
      Please have a look at the [uenv documentation][ref-uenv] for more information about uenvs and how to use them.

  ### Versioning
@@ -520,6 +550,12 @@ $ exit # (6)!
  ```
  It is recommended to apply this workaround if you are constrained by a Python package version installed in the uenv that you need to change for your application.

+ !!! note
+     Keep in mind that
+
+     * this virtual environment is _specific_ to this particular uenv and won't actually work unless you are using it from inside this uenv - it relies on the resources packaged inside the uenv.
+     * every Slurm job making use of this virtual environment will need to activate it first (_inside_ the `srun` command), as in the sketch below.
+
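A sketch of the second point, reusing the uenv and venv names from this page (the training script is a placeholder, and the `--uenv`/`--view` options are assumed to be accepted by `srun` as they are by `sbatch`):

```bash
srun --uenv=pytorch/v2.6.0:/user-environment --view=default bash -c "
    . ./venv-uenv-pt2.6-v1/bin/activate
    python train.py   # placeholder for your application
"
```
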
  Alternatively one can use the uenv as [upstream Spack instance][ref-building-uenv-spack] to add both Python and non-Python packages.
  However, this workflow is more involved and intended for advanced Spack users.

@@ -537,36 +573,40 @@ However, this workflow is more involved and intended for advanced Spack users.
  #SBATCH --uenv=pytorch/v2.6.0:/user-environment
  #SBATCH --view=default

+ set -x
+
+ ulimit -c 0 # (2)!
+
  #################################
  # OpenMP environment variables #
  #################################
- export OMP_NUM_THREADS=8 # (2)!
+ export OMP_NUM_THREADS=8 # (3)!

  #################################
  # PyTorch environment variables #
  #################################
- export TORCH_NCCL_ASYNC_ERROR_HANDLING=1 # (3)!
- export TRITON_HOME=/dev/shm/ # (4)!
+ export TORCH_NCCL_ASYNC_ERROR_HANDLING=1 # (4)!
+ export TRITON_HOME=/dev/shm/ # (5)!

  #################################
  # MPICH environment variables #
  #################################
- export MPICH_GPU_SUPPORT_ENABLED=0 # (5)!
+ export MPICH_GPU_SUPPORT_ENABLED=0 # (6)!

  #################################
  # CUDA environment variables #
  #################################
- export CUDA_CACHE_DISABLE=1 # (6)!
+ export CUDA_CACHE_DISABLE=1 # (7)!

  ############################################
  # NCCL and Fabric environment variables #
  ############################################
- # (7)!
+ # (8)!
  --8<-- "docs/software/communication/nccl_env_vars"

- # (8)!
  # (9)!
- srun bash -c "
+ # (10)!
+ srun -ul bash -c "
  . ./venv-uenv-pt2.6-v1/bin/activate

  --8<-- "docs/software/ml/torch_distributed_env_vars"
576616

577617
1. The `--uenv` option is used to specify the uenv to use for the job.
578618
The `--view=default` option is used to load all the packages provided by the uenv.
579-
2. Set `OMP_NUM_THREADS` if you are using OpenMP in your code.
619+
2. In case the application crashes, it may leave behind large core dump files that contain an image of the process memory at the time of the crash. While these can be useful for debugging the reason of a specific crash (by e.g. loading them with `cuda-gdb` and looking at the stack trace with `bt`), they may accumulate over time and occupy a large space on the filesystem. For this reason, it is recommended to disable their creation (unless needed) by adding this line.
620+
3. Set `OMP_NUM_THREADS` if you are using OpenMP in your code.
580621
The number of threads should be not greater than the number of cores per task (`$SLURM_CPUS_PER_TASK`).
581622
The optimal number depends on the workload and should be determined by testing.
582623
Consider for example that typical workloads using PyTorch may fork the processes, so the number of threads should be around the number of cores per task divided by the number of processes.
583-
3. Enable more graceful exception handling, see [PyTorch documentation](https://pytorch.org/docs/stable/torch_nccl_environment_variables.html)
584-
4. Set the Triton home to a local path (e.g. `/dev/shm`) to avoid writing to the (distributed) file system.
624+
4. Enable more graceful exception handling, see [PyTorch documentation](https://pytorch.org/docs/stable/torch_nccl_environment_variables.html)
625+
5. Set the Triton home to a local path (e.g. `/dev/shm`) to avoid writing to the (distributed) file system.
585626
This is important for performance, as writing to the Lustre file system can be slow due to the amount of small files and potentially many processes accessing it. Avoid this setting with the container engine as it may lead to errors related to mount settings of `/dev/shm` (use a filesystem path inside the container instead).
586-
5. Disable GPU support in MPICH, as it [can lead to deadlocks](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/mpi.html#inter-gpu-communication-with-cuda-aware-mpi) when using together with nccl.
587-
6. Avoid writing JITed binaries to the (distributed) file system, which could lead to performance issues.
588-
7. These variables should always be set for correctness and optimal performance when using NCCL with uenv, see [the detailed explanation][ref-communication-nccl].
589-
8. Activate the virtual environment created on top of the uenv (if any).
627+
6. Disable GPU support in MPICH, as it [can lead to deadlocks](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/mpi.html#inter-gpu-communication-with-cuda-aware-mpi) when using together with nccl.
628+
7. Avoid writing JITed binaries to the (distributed) file system, which could lead to performance issues.
629+
8. These variables should always be set for correctness and optimal performance when using NCCL with uenv, see [the detailed explanation][ref-communication-nccl].
630+
9. Activate the virtual environment created on top of the uenv (if any).
590631
Please follow the guidelines for [python virtual environments with uenv][ref-guides-storage-venv] to enhance scalability and reduce load times.
591-
9. The environment variables are used by PyTorch to initialize the distributed backend.
632+
10. The environment variables are used by PyTorch to initialize the distributed backend.
592633
The `MASTER_ADDR`, `MASTER_PORT` variables are used to determine the address and port of the master node.
593634
Additionally we also need `RANK` and `LOCAL_RANK` and `WORLD_SIZE` to identify the position of each rank within the Slurm step and node, respectively.
594635

docs/software/ml/tutorials/llm-inference.md

Lines changed: 2 additions & 2 deletions
@@ -165,8 +165,8 @@ MPICH_GPU_SUPPORT_ENABLED = "0" # (8)!
  2. The path `/users` is not mounted since it often contains user-specific initialization scripts for the host environment and many frameworks leave temporary data behind that can lead to non-trivial runtime errors when swapping container images. Thus, it is recommended to selectively mount specific subfolders under `${HOME}` if needed.
  3. You can use `${PWD}` as an alternative to use the path submitted from when the container is started
  4. This enables NCCL installed in the container to make effective use of the Slingshot interconnect on Alps by interfacing with the [AWS OFI NCCL plugin][ref-ce-aws-ofi-hook] with libfabric. While not strictly needed for single node workloads, it is good practice to keep it always on.
- 5. This makes NCCL output debug info during initialization, which can be useful to spot communication-related issues in a distributed scenario (see later tutorials). Subsystems with debug log can be configured with `NCCL_DEBUG_SUBSYS`.
- 6. Disable CUDA JIT cache
+ 5. This makes NCCL output debug info during initialization, which can be useful to spot communication-related issues in a distributed scenario (see later tutorials). Subsystems with debug log can be configured with [`NCCL_DEBUG_SUBSYS`](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-debug-subsys).
+ 6. Avoid writing JITed binaries to the (distributed) file system, which could lead to performance issues.
  7. Async error handling when an exception is observed in NCCL watchdog: aborting NCCL communicator and tearing down process upon error
  8. Disable GPU support in MPICH, as it can lead to deadlocks when using together with NCCL
