docs/software/ml/pytorch.md (67 additions, 26 deletions)
@@ -33,7 +33,7 @@ For most applications, the [PyTorch NGC container](https://catalog.ngc.nvidia.co
### Define Container Runtime Environment
- Having built and imported a container image with podman and enroot, the next step is to configure the runtime environment with an environment definition file (EDF). In particular, this includes specifying the image, any directories mounted and a working directory to for the processes in the container to start in as in the [quickstart examples for CE][ref-container-engine].
+ Having built and imported a container image with podman and enroot, the next step is to configure the runtime environment with an environment definition file (EDF). In particular, this includes specifying the image, any directories mounted from the host, and a working directory for the processes in the container to start in, as in the [quickstart examples for CE][ref-container-engine].
Apart from this, there are specific features relevant for machine learning made available through [annotations][ref-ce-annotations], which customize the container at runtime.
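As a minimal sketch of how such an EDF is then used (the EDF name `ngc-pytorch` and the commands are illustrative assumptions, not taken from the page), a job step can be launched inside the container with the Container Engine:

```bash
# Minimal sketch, assuming an EDF at ~/.edf/ngc-pytorch.toml (name is an assumption)
srun --environment=ngc-pytorch python -c "import torch; print(torch.__version__)"
# or open an interactive shell inside the container
srun --environment=ngc-pytorch --pty bash
```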
2. The path `/users` is not mounted as a whole since it often contains user-specific initialization scripts for the host environment and many frameworks leave temporary data behind that can lead to non-trivial runtime errors when swapping container images. Thus, it is recommended to selectively mount specific subfolders under `${HOME}` if needed.
3. As an alternative, you can use `${PWD}` to use the path the container was started from as the working directory
4. This enables NCCL installed in the container to make effective use of the Slingshot interconnect on Alps by interfacing with the [AWS OFI NCCL plugin][ref-ce-aws-ofi-hook]. While not strictly needed for single-node workloads, it is good practice to always keep it enabled.
- 5. This makes NCCL output debug info during initialization, which can be useful to spot communication-related issues in a distributed scenario (see later tutorials). Subsystems with debug log can be configured with `NCCL_DEBUG_SUBSYS`.
- 6.Disable CUDA JIT cache
+ 5. This makes NCCL output debug info during initialization, which can be useful to spot communication-related issues in a distributed scenario. Subsystems with debug log can be configured with [`NCCL_DEBUG_SUBSYS`](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-debug-subsys).
+ 6. Avoid writing JITed binaries to the (distributed) file system, which could lead to performance issues.
7. Asynchronous error handling when an exception is observed in the NCCL watchdog: the NCCL communicator is aborted and the process is torn down upon error
8. Disable GPU support in MPICH, as it can lead to deadlocks when used together with NCCL (a sketch of the environment settings behind points 5-8 follows this list)
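As a rough sketch of points 5-8 expressed as the standard upstream environment variables (listed here as an assumption; the actual EDF or sbatch script on the page may set them differently):

```bash
# Sketch only: upstream environment variables corresponding to points 5-8 above
export NCCL_DEBUG=INFO                      # 5: print NCCL debug info during initialization
export NCCL_DEBUG_SUBSYS=INIT,NET           # 5 (optional): restrict debug output to selected subsystems
export CUDA_CACHE_DISABLE=1                 # 6: do not write JIT-compiled CUDA binaries to the file system
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1    # 7: abort the NCCL communicator and tear down the process on error
export MPICH_GPU_SUPPORT_ENABLED=0          # 8: disable GPU support in MPICH to avoid deadlocks with NCCL
```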
1. Enable Slurm commands (together with two subsequent mounts)
- !!! note "Best practice for large-scale jobs"
+ !!! note "Best practice for production jobs"
- For stability and reproducibility, use self-contained containers for large scale jobs. Using code mounted from the distributed filesystem may leave compiled artefacts behind that can result in unintentional runtime errors when e.g. swapping the container image. In particular, it is recommended to avoid mounting all of `$HOME`, so that environments are properly isolated and e.g. the Triton cache (that by default ends up in `$HOME/.triton`) resides in an ephemeral location of the filesystem.
+ For stability and reproducibility, use self-contained containers for production jobs. Using code mounted from the distributed filesystem may leave compiled artefacts behind that can result in unintentional runtime errors when e.g. swapping the container image. In particular, it is recommended to avoid mounting all of `$HOME`, so that environments are properly isolated and e.g. the Triton cache (that by default ends up in `$HOME/.triton`) resides in an ephemeral location of the filesystem.
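A minimal sketch of redirecting the Triton cache to an ephemeral location (the variable is upstream Triton's cache override and the path is an assumption; with the Container Engine, pick a path inside the container rather than a mounted host directory):

```bash
# Sketch: keep Triton's JIT cache off the distributed file system
export TRITON_CACHE_DIR=/tmp/triton-cache   # assumed ephemeral path inside the container
mkdir -p "$TRITON_CACHE_DIR"
```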
!!! note "Collaborating in Git"
For reproducibility, it is recommended to always track the Dockerfile, EDF and an optional virtual environment specification alongside your application code in a Git repository.
+ [](){#ref-ce-pytorch-venv}
### (Optionally) extend container with virtual environment
While production jobs should include as many dependencies as possible in the container image, during development it can be convenient to manage frequently changing packages in a virtual environment built on top of the container image. This can include both dependencies and actively developed packages (that can be installed in editable mode with `pip install -e .`).
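A sketch of this workflow, run from inside the container (the venv location is an assumption):

```bash
# Sketch only: create a virtual environment on top of the Python packages shipped in the image
python -m venv --system-site-packages "$SCRATCH/pyenv-dev"
source "$SCRATCH/pyenv-dev/bin/activate"
pip install -e .                                 # install the actively developed package in editable mode
# pip install <frequently-changing-dependency>   # hypothetical additional dependency
```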
@@ -186,28 +187,57 @@ For further details on execution logic, job monitoring and data management, plea
* The argument `--ddp-bucket-size` controls the level of grouping of many small data-parallel communications into bigger ones, and setting it to a high value can improve throughput (model-dependent, e.g. `10000000000`).
- * If in doubt about communication performance with NCCL at scale, use [nccl-tests](https://github.com/NVIDIA/nccl-tests) with the relevant communication patterns to check if scaling behavior can be reproduced.
+ * If in doubt about communication performance with NCCL at scale, use the [`NCCL_DEBUG`](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-debug) environment variable to validate that the aws-ofi-nccl plugin has been properly initialized and libfabric was recognized (further subsystems can be monitored with [`NCCL_DEBUG_SUBSYS`](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-debug-subsys)). If the issue persists, use [nccl-tests](https://github.com/NVIDIA/nccl-tests) with the relevant communication patterns to check if the scaling behavior can be reproduced and contact CSCS support.
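A sketch of this check (the EDF name, script and grep patterns are assumptions; the exact log wording varies with the NCCL and plugin versions):

```bash
# Sketch: enable NCCL debug output for a short test run and look for evidence
# that the aws-ofi-nccl plugin and libfabric were picked up
export NCCL_DEBUG=INFO
srun --environment=ngc-pytorch python train.py 2>&1 | tee nccl-debug.log
grep -iE "NET/OFI|libfabric" nccl-debug.log
```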
Additionally, consider the **best practice for checkpointing and data management**:
- * Following the advice on [filesystems][ref-storage-fs], write checkpoints (sequential write) to `/capstor/scratch` and place randomly accessed training data (many small random reads) on `/iopsstor/scratch`. Use the [data transfer instructions][ref-data-xfer] to move data to/from `/capstor/store`. Make sure to apply recommended [Lustre settings][ref-guides-storage-lustre] on all directories containing significant amount of data, including those containing container images and managed by other tools (e.g. the HuggingFace cache, see [`HF_HOME`](https://huggingface.co/docs/huggingface_hub/en/package_reference/environment_variables#hfhome) in the [this tutorial][software-ml-llm-inference-tutorial]).
+ * Following the advice on [filesystems][ref-storage-fs], write checkpoints (sequential write) to `/capstor/scratch` and place randomly accessed training data (many small random reads) on `/iopsstor/scratch`. Use the [data transfer instructions][ref-data-xfer] to move data to/from `/capstor/store`. Make sure to apply the recommended [Lustre settings][ref-guides-storage-lustre] on all directories containing a significant amount of data, including those containing container images and managed by other tools (e.g. the HuggingFace cache, see [`HF_HOME`](https://huggingface.co/docs/huggingface_hub/en/package_reference/environment_variables#hfhome) in [this tutorial][software-ml-llm-inference-tutorial]). In case your workload continues to be limited by filesystem performance, contact CSCS support.
+ * Regularly adjust checkpoint writing intervals to the current overhead induced by writing a checkpoint ($T_1$) and the mean time between job failures ($T_2$). As a first-order approximation, use a checkpointing interval of $\sqrt{2 T_1 T_2}$ (derived by [Young](https://doi.org/10.1145/361147.361115) and [Daly](https://doi.org/10.1016/j.future.2004.11.016)); a short worked example follows this list.
- * Regularly adjust checkpoint writing intervals to the overhead induced by writing a checkpoint ($T_1$) and the mean time between job failures ($T_2$). As a first order approximation use a checkpointing interval of $\sqrt{2 T_1 T_2}$ (derived by [Young](https://doi.org/10.1145/361147.361115) and [Daly](https://doi.org/10.1016/j.future.2004.11.016)).
+ * Avoid activities that put excessive load on third party services (such as web scraping or bulk downloads) in line with the [guidelines on Internet Access on Alps][ref-guides-internet-access-ext].
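As a short worked example for the checkpointing rule above (numbers are illustrative only): with a checkpoint writing overhead of $T_1 = 2$ minutes and a mean time between job failures of $T_2 = 8$ hours ($480$ minutes), the suggested interval is $\sqrt{2 \cdot 2 \cdot 480} \approx 44$ minutes.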
Adjust for **cluster availability**:
* Submit your jobs with a Slurm time limit compatible with reservations (such as maintenance windows, cf. `scontrol show res`) so that they can be scheduled.
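For example (a sketch; the time limit and job script name are assumptions):

```bash
# Sketch: check upcoming reservations, then submit with a compatible time limit
scontrol show res                          # list reservations such as maintenance windows
sbatch --time=06:00:00 pretrain.sbatch     # assumed time limit and job script name
```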
+ ??? info "Debugging segmentation faults"
+ Application crashes with segmentation faults can be investigated by inspecting core dump files that contain an image of the process memory at the time of the crash. For this purpose, you can load the core dump file with `cuda-gdb` installed in the container and look at the stack trace with `bt`. Note that in order to generate core dump files the line `ulimit -c 0` must be commented out in the above sbatch script.
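A sketch of this workflow (the executable and core file names are assumptions; the actual core file name depends on the system's core pattern):

```bash
# Sketch: allow core dumps instead of the `ulimit -c 0` line, then inspect one with cuda-gdb
ulimit -c unlimited
cuda-gdb --batch -ex bt python core.244443   # assumed executable and core file name; prints the stack trace
```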
+ ### Known Issues
+ ??? info "Errors hidden by failures in UCX signal handler"
+ Application errors may trigger the UCX signal handler in the NGC container, which has caused secondary failures in the past, shadowing the initial error trace. These secondary failures may be significantly harder to fix than the initial problem.
+ An example is the following trace from the NGC PyTorch 25.01 with Megatron-LM:
+ ```console
+ 640: [nid007306:244443:0:244443] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x455)
+ ...
+ ```
+ In this case, the segmentation fault in the UCX signal handler (`ucs_handle_error`) was due to a broken NVRTC in the container. However, to obtain the trace of the initial error (which was unrelated), it was necessary to disable the UCX signal handler by setting the following environment variable in the sbatch script:
+ ```bash
+ export UCX_HANDLE_ERRORS=none
+ ```
[](){#ref-uenv-pytorch}
## Running PyTorch with a uenv
- The PyTorch software stack was designed with the intention of being able to run [Megatron-LM](https://github.com/NVIDIA/Megatron-LM)-based pre-training workloads out of the box.
+ The PyTorch uenv software stack was designed with the intention of being able to run [Megatron-LM](https://github.com/NVIDIA/Megatron-LM)-based pre-training workloads out of the box.
Thus, it comes with batteries included and does not just provide the bare [PyTorch framework](https://github.com/pytorch/pytorch).
!!! note "uenv"
- [PyTorch][ref-uenv-pytorch] is provided via [uenv][ref-uenv].
+ The [PyTorch uenv][ref-uenv-pytorch] is provided via the [uenv][ref-uenv] tool.
Please have a look at the [uenv documentation][ref-uenv] for more information about uenvs and how to use them.
### Versioning
@@ -520,6 +550,12 @@ $ exit # (6)!
```
It is recommended to apply this workaround if you are constrained by a Python package version installed in the uenv that you need to change for your application.
+ !!! note
+ Keep in mind that
+ * this virtual environment is _specific_ to this particular uenv and won't actually work unless you are using it from inside this uenv - it relies on the resources packaged inside the uenv.
+ * every Slurm job making use of this virtual environment will need to activate it first (_inside_ the `srun` command).
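A sketch of the second point (the uenv image, venv path and script name are assumptions):

```bash
# Sketch: activate the virtual environment inside the srun command of each job step
srun --uenv=pytorch/v2.6.0 --view=default \
    bash -c "source $SCRATCH/venv-uenv/bin/activate && python train.py"
```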
Alternatively, one can use the uenv as an [upstream Spack instance][ref-building-uenv-spack] to add both Python and non-Python packages.
However, this workflow is more involved and intended for advanced Spack users.
@@ -537,36 +573,40 @@ However, this workflow is more involved and intended for advanced Spack users.
1. The `--uenv` option is used to specify the uenv to use for the job.
The `--view=default` option is used to load all the packages provided by the uenv.
- 2. Set `OMP_NUM_THREADS` if you are using OpenMP in your code.
+ 2. In case the application crashes, it may leave behind large core dump files that contain an image of the process memory at the time of the crash. While these can be useful for debugging the cause of a specific crash (e.g. by loading them with `cuda-gdb` and looking at the stack trace with `bt`), they may accumulate over time and occupy significant space on the filesystem. For this reason, it is recommended to disable their creation (unless needed) by adding this line.
+ 3. Set `OMP_NUM_THREADS` if you are using OpenMP in your code.
The number of threads should not be greater than the number of cores per task (`$SLURM_CPUS_PER_TASK`).
The optimal number depends on the workload and should be determined by testing.
Consider, for example, that typical workloads using PyTorch may fork processes, so the number of threads should be around the number of cores per task divided by the number of processes.
- 3. Enable more graceful exception handling, see [PyTorch documentation](https://pytorch.org/docs/stable/torch_nccl_environment_variables.html)
- 4. Set the Triton home to a local path (e.g. `/dev/shm`) to avoid writing to the (distributed) file system.
+ 4. Enable more graceful exception handling, see [PyTorch documentation](https://pytorch.org/docs/stable/torch_nccl_environment_variables.html)
+ 5. Set the Triton home to a local path (e.g. `/dev/shm`) to avoid writing to the (distributed) file system.
This is important for performance, as writing to the Lustre file system can be slow due to the large number of small files and potentially many processes accessing it. Avoid this setting with the container engine, as it may lead to errors related to the mount settings of `/dev/shm` (use a filesystem path inside the container instead).
- 5. Disable GPU support in MPICH, as it [can lead to deadlocks](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/mpi.html#inter-gpu-communication-with-cuda-aware-mpi) when using together with nccl.
- 6. Avoid writing JITed binaries to the (distributed) file system, which could lead to performance issues.
- 7. These variables should always be set for correctness and optimal performance when using NCCL with uenv, see [the detailed explanation][ref-communication-nccl].
- 8. Activate the virtual environment created on top of the uenv (if any).
+ 6. Disable GPU support in MPICH, as it [can lead to deadlocks](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/mpi.html#inter-gpu-communication-with-cuda-aware-mpi) when used together with NCCL.
+ 7. Avoid writing JITed binaries to the (distributed) file system, which could lead to performance issues.
+ 8. These variables should always be set for correctness and optimal performance when using NCCL with uenv, see [the detailed explanation][ref-communication-nccl].
+ 9. Activate the virtual environment created on top of the uenv (if any).
Please follow the guidelines for [python virtual environments with uenv][ref-guides-storage-venv] to enhance scalability and reduce load times.
- 9. The environment variables are used by PyTorch to initialize the distributed backend.
+ 10. The environment variables are used by PyTorch to initialize the distributed backend.
The `MASTER_ADDR`, `MASTER_PORT` variables are used to determine the address and port of the master node.
Additionally, `RANK` and `LOCAL_RANK` identify the position of each rank within the Slurm step and within the node, respectively, while `WORLD_SIZE` gives the total number of ranks.
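A common way to derive these variables from Slurm's own variables is sketched below (the port is an assumption and must simply be consistent across all ranks; `RANK` and `LOCAL_RANK` are per task, so this must be evaluated inside each `srun` task):

```bash
# Sketch: set the PyTorch distributed rendezvous variables from Slurm variables
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500
export WORLD_SIZE=$SLURM_NTASKS
export RANK=$SLURM_PROCID
export LOCAL_RANK=$SLURM_LOCALID
```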
2. The path `/users` is not mounted since it often contains user-specific initialization scripts for the host environment and many frameworks leave temporary data behind that can lead to non-trivial runtime errors when swapping container images. Thus, it is recommended to selectively mount specific subfolders under `${HOME}` if needed.
3. As an alternative, you can use `${PWD}` to use the path the container was started from as the working directory
4. This enables NCCL installed in the container to make effective use of the Slingshot interconnect on Alps by interfacing with the [AWS OFI NCCL plugin][ref-ce-aws-ofi-hook] via libfabric. While not strictly needed for single-node workloads, it is good practice to always keep it enabled.
- 5. This makes NCCL output debug info during initialization, which can be useful to spot communication-related issues in a distributed scenario (see later tutorials). Subsystems with debug log can be configured with `NCCL_DEBUG_SUBSYS`.
- 6.Disable CUDA JIT cache
+ 5. This makes NCCL output debug info during initialization, which can be useful to spot communication-related issues in a distributed scenario (see later tutorials). Subsystems with debug log can be configured with [`NCCL_DEBUG_SUBSYS`](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-debug-subsys).
+ 6. Avoid writing JITed binaries to the (distributed) file system, which could lead to performance issues.
7. Asynchronous error handling when an exception is observed in the NCCL watchdog: the NCCL communicator is aborted and the process is torn down upon error
8. Disable GPU support in MPICH, as it can lead to deadlocks when used together with NCCL