Commit 2f74baf

Expand communication pages (#75)
1 parent d94f4ce commit 2f74baf

File tree

8 files changed

+155
-7
lines changed

.github/CODEOWNERS

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
11
* @bcumming @msimberg @RMeli
22
docs/services/firecrest @jpdorsch @ekouts
3-
docs/software/communication @msimberg
3+
docs/software/communication @Madeeks @msimberg
44
docs/software/devtools/linaro @jgphpc
55
docs/software/prgenv/linalg.md @finkandreas @msimberg
66
docs/software/sciapps/cp2k.md @abussy @RMeli

docs/guides/gb2025.md

Lines changed: 4 additions & 0 deletions
@@ -81,3 +81,7 @@ srun -N1 -n4 -c71 ...
8181

8282
!!! todo
8383
write a guide on which versions to use, environment variables to set, etc.
84+
85+
See [the container engine documentation][ref-ce-aws-ofi-hook] for information on using NCCL in containers.
86+
The [NCCL][ref-communication-nccl] page contains general information on configuring NCCL.
87+
This information is especially important when using uenvs, as the environment variables are not set automatically.
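For example (a sketch, not the full list; values mirror the NCCL page's recommendations), when using a uenv you would export the variables yourself:

```bash
# Not set automatically by uenvs; see the NCCL page for the complete
# set of recommended variables.
export NCCL_NET="AWS Libfabric"
export MPICH_GPU_SUPPORT_ENABLED=0
```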

docs/software/communication/cray-mpich.md

Lines changed: 2 additions & 0 deletions
@@ -58,12 +58,14 @@ See [this page][ref-slurm-gh200] for more information on configuring SLURM to us
5858

5959
Alternatively, if you wish to not use GPU-aware MPI, either unset `MPICH_GPU_SUPPORT_ENABLED` or explicitly set it to `0` in your launch scripts.
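For example, in a launch script:

```bash
# Explicitly disable GPU-aware MPI in Cray MPICH
export MPICH_GPU_SUPPORT_ENABLED=0
# alternatively: unset MPICH_GPU_SUPPORT_ENABLED
```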
6060

61+
[](){#ref-communication-cray-mpich-known-issues}
6162
## Known issues
6263

6364
This section documents known issues related to Cray MPICH on Alps. Resolved issues are also listed for reference.
6465

6566
### Existing Issues
6667

68+
[](){#ref-communication-cray-mpich-cache-monitor-disable}
6769
#### Cray MPICH hangs
6870

6971
Cray MPICH may sometimes hang on larger runs.
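A possible mitigation (an assumption based on the `cache-monitor-disable` anchor above; libfabric's `FI_MR_CACHE_MONITOR` variable controls the memory registration cache monitor) is:

```bash
# Assumption: disable libfabric's memory registration cache monitor
# before launching; valid values include memhooks, userfaultfd, disabled.
export FI_MR_CACHE_MONITOR=disabled
```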

docs/software/communication/libfabric.md

Lines changed: 17 additions & 0 deletions
@@ -4,4 +4,21 @@
44
[Libfabric](https://ofiwg.github.io/libfabric/), or Open Fabrics Interfaces (OFI), is a low level networking library that abstracts away various networking backends.
55
It is used by Cray MPICH, and can be used together with OpenMPI, NCCL, and RCCL to make use of the [Slingshot network on Alps][ref-alps-hsn].
66

7+
## Using libfabric
8+
9+
If you are using a uenv provided by CSCS, such as [prgenv-gnu][ref-uenv-prgenv-gnu], [Cray MPICH][ref-communication-cray-mpich] is linked to libfabric and the high speed network will be used.
10+
No changes are required in applications.
11+
12+
If you are using containers, the system libfabric can be loaded into your container using the [CXI hook provided by the container engine][ref-ce-cxi-hook].
13+
Using the hook is essential to make full use of the Alps network.
14+
15+
## Tuning libfabric
16+
17+
Tuning libfabric (particularly together with [Cray MPICH][ref-communication-cray-mpich], [OpenMPI][ref-communication-openmpi], [NCCL][ref-communication-nccl], and [RCCL][ref-communication-rccl]) depends on many factors, including the application, workload, and system.
18+
For a comprehensive overview of libfabric options for the CXI provider (the provider for the Slingshot network), see the [`fi_cxi` man pages](https://ofiwg.github.io/libfabric/v2.1.0/man/fi_cxi.7.html).
19+
Note that the exact version deployed on Alps may differ, and not all options may be applicable on Alps.
20+
21+
See the [Cray MPICH known issues page][ref-communication-cray-mpich-known-issues] for issues when using Cray MPICH together with libfabric.
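To inspect which libfabric providers are visible in your environment, the `fi_info` utility (shipped with libfabric) can help; this sketch guards against the tool being absent from the PATH:

```bash
# Query the libfabric CXI provider, if the fi_info utility is available.
if command -v fi_info >/dev/null 2>&1; then
    fi_info -p cxi          # show CXI provider capabilities
    FI_INFO_STATUS=found
else
    FI_INFO_STATUS=missing  # libfabric utilities not on PATH
fi
```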
22+
723
!!! todo
24+
More options?

docs/software/communication/nccl.md

Lines changed: 75 additions & 4 deletions
@@ -4,7 +4,78 @@
44
[NCCL](https://developer.nvidia.com/nccl) is an optimized inter-GPU communication library for NVIDIA GPUs.
55
It is commonly used in machine learning frameworks, but traditional scientific applications can also benefit from NCCL.
66

7-
!!! todo
8-
- high level description
9-
- libfabric/aws-ofi-nccl plugin
10-
- configuration options
7+
## Using NCCL
8+
9+
To use the Slingshot network on Alps, the [`aws-ofi-nccl`](https://github.com/aws/aws-ofi-nccl) plugin must be used.
10+
With the container engine, the [AWS OFI NCCL hook][ref-ce-aws-ofi-hook] can be used to load the plugin into the container and configure NCCL to use it.
11+
12+
Most uenvs, like [`prgenv-gnu`][ref-uenv-prgenv-gnu], also contain the NCCL plugin.
13+
When using e.g. the `default` view of `prgenv-gnu`, the `aws-ofi-nccl` plugin will be available in the environment.
14+
Alternatively, loading the `aws-ofi-nccl` module with the `modules` view also makes the plugin available in the environment.
15+
The environment variables described below must be set to ensure that NCCL uses the plugin.
16+
17+
While the container engine sets these automatically when using the NCCL hook, the following environment variables should always be set for correctness and optimal performance when using NCCL:
18+
19+
```bash
20+
export NCCL_NET="AWS Libfabric" # (1)!
21+
export NCCL_NET_GDR_LEVEL=PHB # (2)!
22+
export FI_CXI_DEFAULT_CQ_SIZE=131072 # (3)!
23+
export FI_CXI_DEFAULT_TX_SIZE=32768
24+
export FI_CXI_DISABLE_HOST_REGISTER=1
25+
export FI_CXI_RX_MATCH_MODE=software
26+
export FI_MR_CACHE_MONITOR=userfaultfd
27+
export MPICH_GPU_SUPPORT_ENABLED=0 # (4)!
28+
```
29+
30+
1. This forces NCCL to use the libfabric plugin, enabling full use of the Slingshot network. If the plugin cannot be found, applications will fail to start. With the default value, applications would instead fall back to e.g. TCP, which would be significantly slower than with the plugin. [More information about `NCCL_NET`](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-net).
31+
2. Use GPU Direct RDMA when GPU and NIC are on the same NUMA node. [More information about `NCCL_NET_GDR_LEVEL`](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-net-gdr-level-formerly-nccl-ib-gdr-level).
32+
3. This and the other `FI` (libfabric) environment variables have been found to give the best performance on the Alps network across a wide range of applications. Specific applications may perform better with other values.
33+
4. Disable GPU-aware MPI explicitly, to avoid potential deadlocks between MPI and NCCL.
34+
35+
!!! warning "Using NCCL with uenvs"
36+
The environment variables listed above are not set automatically when using uenvs.
37+
38+
!!! warning "GPU-aware MPI with NCCL"
39+
Using GPU-aware MPI together with NCCL [can easily lead to deadlocks](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/mpi.html#inter-gpu-communication-with-cuda-aware-mpi).
40+
Unless care is taken to ensure that the two methods of communication are not used concurrently, we recommend not using GPU-aware MPI with NCCL.
41+
To disable GPU-aware MPI with Cray MPICH, explicitly set `MPICH_GPU_SUPPORT_ENABLED=0`.
42+
Note that this option may be set to `1` by default on some Alps clusters.
43+
See [the Cray MPICH documentation][ref-communication-cray-mpich] for more details on GPU-aware MPI with Cray MPICH.
44+
45+
!!! warning "`invalid usage` error with `NCCL_NET="AWS Libfabric"`"
46+
If you are getting error messages such as:
47+
```console
48+
nid006352: Test NCCL failure common.cu:958 'invalid usage (run with NCCL_DEBUG=WARN for details)
49+
```
50+
this may be due to the plugin not being found by NCCL.
51+
If this is the case, running the application with the recommended `NCCL_DEBUG=WARN` should print something similar to the following:
52+
```console
53+
nid006352:34157:34217 [1] net.cc:626 NCCL WARN Error: network AWS Libfabric not found.
54+
```
55+
When using uenvs like `prgenv-gnu`, make sure you are either using the `default` view which loads `aws-ofi-nccl` automatically, or, if using the `modules` view, load the `aws-ofi-nccl` module with `module load aws-ofi-nccl`.
56+
If the plugin is found correctly, running the application with `NCCL_DEBUG=INFO` should print:
57+
```console
58+
nid006352:34610:34631 [0] NCCL INFO Using network AWS Libfabric
59+
```
60+
61+
!!! warning "Do not use `NCCL_NET_PLUGIN="ofi"` with uenvs"
62+
NCCL has an alternative way of specifying what plugin to use: `NCCL_NET_PLUGIN`.
63+
When using uenvs, do not set `NCCL_NET_PLUGIN="ofi"` instead of, or in addition to, `NCCL_NET="AWS Libfabric"`.
64+
If you do, your application will fail to start since NCCL will:
65+
66+
1. fail to find the plugin because of the name of the shared library in the uenv, and
67+
2. prefer `NCCL_NET_PLUGIN` over `NCCL_NET`, so it will fail to find the plugin even if `NCCL_NET="AWS Libfabric"` is correctly set.
68+
69+
When both environment variables are set the error message, with `NCCL_DEBUG=WARN`, will look similar to when the plugin isn't available:
70+
```console
71+
nid006365:179857:179897 [1] net.cc:626 NCCL WARN Error: network AWS Libfabric not found.
72+
```
73+
74+
With `NCCL_DEBUG=INFO`, NCCL will print:
75+
```console
76+
nid006365:180142:180163 [0] NCCL INFO NET/Plugin: Could not find: ofi libnccl-net-ofi.so. Using internal network plugin.
77+
...
78+
nid006365:180142:180163 [0] net.cc:626 NCCL WARN Error: network AWS Libfabric not found.
79+
```
80+
81+
If you only set `NCCL_NET_PLUGIN="ofi"`, NCCL may silently fail to load the plugin and fall back to the default internal implementation.
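To confirm at launch time that the plugin is picked up (as described above), a hedged sketch (application name and launch geometry are placeholders):

```bash
# Enable verbose NCCL logging, then check the startup output for the
# "Using network AWS Libfabric" line. ./nccl_app is a placeholder.
export NCCL_DEBUG=INFO
# srun -N 1 --ntasks-per-node=4 ./nccl_app 2>&1 | tee nccl.log
# grep "Using network AWS Libfabric" nccl.log
```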

docs/software/communication/openmpi.md

Lines changed: 48 additions & 1 deletion
@@ -6,5 +6,52 @@ However, [OpenMPI](https://www.open-mpi.org/) can be used as an alternative in s
66

77
To use OpenMPI on Alps, it must be built against [libfabric][ref-communication-libfabric] with support for the [Slingshot 11 network][ref-alps-hsn].
88

9+
## Using OpenMPI
10+
11+
!!! warning
12+
Building and using OpenMPI on Alps is still [work in progress](https://eth-cscs.github.io/cray-network-stack/).
13+
The instructions found on this page may be inaccurate, but are a good starting point for using OpenMPI on Alps.
14+
15+
!!! todo
16+
Deploy experimental uenv.
17+
918
!!! todo
10-
Building OpenMPI for Alps is still work in progress: https://eth-cscs.github.io/cray-network-stack/.
19+
Document OpenMPI uenv next to prgenv-gnu, prgenv-nvfortran, and linalg?
20+
21+
OpenMPI is provided through a [uenv][ref-uenv] similar to [`prgenv-gnu`][ref-uenv-prgenv-gnu].
22+
Once the uenv is loaded, compiling and linking with OpenMPI and libfabric is transparent.
23+
At runtime, some additional options must be set to correctly use the Slingshot network.
24+
25+
First, when launching applications through Slurm, [PMIx](https://pmix.org) must be used.
26+
This is done with the `--mpi` flag of `srun`:
27+
```bash
28+
srun --mpi=pmix ...
29+
```
30+
31+
Additionally, the following environment variables should be set:
32+
```bash
33+
export PMIX_MCA_psec="native" # (1)!
34+
export FI_PROVIDER="cxi" # (2)!
35+
export OMPI_MCA_pml="^ucx" # (3)!
36+
export OMPI_MCA_mtl="ofi" # (4)!
37+
38+
1. Ensures PMIx uses the same security domain as Slurm. Otherwise PMIx will print warnings at startup.
39+
2. Use the CXI (Slingshot) provider.
40+
3. Use anything except [UCX](https://openucx.org/documentation/) for [point-to-point communication](https://docs.open-mpi.org/en/v5.0.x/mca.html#selecting-which-open-mpi-components-are-used-at-run-time). The `^` signals that OpenMPI should exclude all listed components.
41+
4. Use libfabric for the [Matching Transport Layer](https://docs.open-mpi.org/en/v5.0.x/mca.html#frameworks).
42+
43+
!!! info "CXI provider does all communication through the network interface cards (NICs)"
44+
When using the libfabric CXI provider, all communication goes through NICs, including intra-node communication.
45+
This means that intra-node communication cannot make use of shared memory optimizations, and the maximum intra-node bandwidth may be limited.
46+
47+
Libfabric has a new [LINKx](https://ofiwg.github.io/libfabric/v2.1.0/man/fi_lnx.7.html) provider, which allows using different libfabric providers for inter- and intra-node communication.
48+
This provider is not as well tested, but can in theory perform better for intra-node communication, because it can use shared memory.
49+
To use the LINKx provider, set the following, instead of `FI_PROVIDER=cxi`:
50+
51+
```bash
52+
export FI_PROVIDER="lnx" # (1)!
53+
export FI_LNX_PROV_LINKS="shm+cxi" # (2)!
54+
```
55+
56+
1. Use the libfabric LINKx provider, to allow using different libfabric providers for inter- and intra-node communication.
57+
2. Use the shared memory provider for intra-node communication and the CXI (Slingshot) provider for inter-node communication.
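Putting the pieces together, a minimal sketch of the runtime configuration described above, using the default CXI provider (the application name and launch geometry are placeholders):

```bash
# Runtime settings for OpenMPI with libfabric/CXI on Alps
export PMIX_MCA_psec="native"   # match Slurm's security domain
export FI_PROVIDER="cxi"        # use the Slingshot provider
export OMPI_MCA_pml="^ucx"      # exclude UCX for point-to-point
export OMPI_MCA_mtl="ofi"       # use libfabric for the matching layer
# On the cluster, launch with PMIx (./my_app is a placeholder):
# srun --mpi=pmix -N 2 --ntasks-per-node=4 ./my_app
```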

docs/software/communication/rccl.md

Lines changed: 4 additions & 0 deletions
@@ -8,3 +8,7 @@ It provides equivalent functionality to [NCCL][ref-communication-nccl] for AMD G
88
- high level description
99
- libfabric/aws-ofi-rccl plugin
1010
- configuration options
11+
12+
!!! info
13+
RCCL uses many of the same [configuration options as NCCL](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html), with the `NCCL` prefix, not `RCCL`.
14+
Refer to NCCL documentation to tune RCCL.
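For example (a sketch, assuming the `aws-ofi-rccl` plugin is available in the environment), RCCL is steered with `NCCL_`-prefixed variables:

```bash
# RCCL reads NCCL_-prefixed variables; these mirror the NCCL page's
# recommendations for the Slingshot network.
export NCCL_NET="AWS Libfabric"
export NCCL_NET_GDR_LEVEL=PHB
```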

docs/software/container-engine.md

Lines changed: 4 additions & 1 deletion
@@ -437,7 +437,9 @@ If a libfabric library is already present in the container filesystem (for examp
437437
438438
!!! note
439439
Due to the nature of Slingshot and the mechanism implemented by the CXI hook, container applications need to use a communication library which supports libfabric in order to benefit from the hook.
440-
> Libfabric support might have to be defined at compilation time (as is the case for some MPI implementations, like MPICH and OpenMPI) or could be dynamically available at runtime (as is the case with NCCL - see also [this][ref-ce-aws-ofi-hook] section for more details).
440+
441+
!!! note
442+
Libfabric support might have to be defined at compilation time (as is the case for some MPI implementations, like MPICH and OpenMPI) or could be dynamically available at runtime (as is the case with NCCL - see also [this][ref-ce-aws-ofi-hook] section for more details).
441443
442444
The hook is activated by setting the `com.hooks.cxi.enabled` annotation, which can be defined in the EDF, as shown in the following example:
443445
@@ -533,6 +535,7 @@ Container hooks let you customize container behavior to fit system-specific need
533535
### AWS OFI NCCL Hook 
534536

535537
The [AWS OFI NCCL plugin](https://github.com/aws/aws-ofi-nccl) is a software extension that allows the [NCCL](https://developer.nvidia.com/nccl) and [RCCL](https://rocm.docs.amd.com/projects/rccl/en/latest/) libraries to use libfabric as a network provider and, through libfabric, to access the Slingshot high-speed interconnect.
538+
Also see [NCCL][ref-communication-nccl] and [libfabric][ref-communication-libfabric] for more information on using the libraries on Alps.
536539

537540
The Container Engine includes a hook program to inject the AWS OFI NCCL plugin in containers; since the plugin must also be compatible with the GPU programming software stack being used, the `com.hooks.aws_ofi_nccl.variant` annotation is used to specify a plugin variant suitable for a given container image.
538541
At the moment of writing, 4 plugin variants are configured: `cuda11`, `cuda12` (to be used on NVIDIA GPU nodes), `rocm5`, and `rocm6` (to be used on AMD GPU nodes alongside RCCL).
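A hedged EDF sketch selecting a variant (the image name is a placeholder, and the `enabled` key is an assumption mirroring the CXI hook's annotation shown earlier):

```toml
image = "nvcr.io#nvidia/pytorch:24.01-py3"   # placeholder image

[annotations]
com.hooks.aws_ofi_nccl.enabled = "true"      # assumed, by analogy with com.hooks.cxi.enabled
com.hooks.aws_ofi_nccl.variant = "cuda12"    # NVIDIA GPU nodes; use rocm6 alongside RCCL
```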
