Merged

Commits (25 total; changes shown from 9 commits):

- `477e097` Add more links to libfabric section (msimberg, Apr 3, 2025)
- `7a871b3` Add a few environment variables for OpenMPI on Alps (msimberg, Apr 3, 2025)
- `ea89fe0` Expand NCCL and RCCL pages (msimberg, Apr 3, 2025)
- `4b2a984` Add note box in container engine docs (msimberg, Apr 3, 2025)
- `59f5ba2` Add more codeowners to communication pages (msimberg, Apr 3, 2025)
- `8f15929` Update docs/software/communication/nccl.md (msimberg, Apr 3, 2025)
- `259fd4b` Recommend cxi over lnx when using OpenMPI (msimberg, Apr 3, 2025)
- `30901d1` perf variables (boeschf, Apr 3, 2025)
- `64db5bf` Merge pull request #1 from boeschf/expand-communication (msimberg, Apr 3, 2025)
- `c93b4df` Add links to NCCL docs from GB docs (msimberg, Apr 3, 2025)
- `988c24a` Refactor NCCL docs, add uenv notes (msimberg, Apr 3, 2025)
- `79c51c2` Add comma (msimberg, Apr 3, 2025)
- `9ae6744` Fix tyop (msimberg, Apr 3, 2025)
- `4b7ae6b` Add more examples and warnings about aws ofi nccl plugin not loading … (msimberg, Apr 4, 2025)
- `f0b7e1d` Fix annotation numbering in NCCL docs (msimberg, Apr 4, 2025)
- `18aee3f` Add more text about NCCL_NET_PLUGIN (msimberg, Apr 4, 2025)
- `49af1cc` Remove biddisco from communication code owners (msimberg, Apr 4, 2025)
- `b1e6b3a` Update docs/software/communication/libfabric.md (msimberg, Apr 4, 2025)
- `36262c8` Update docs/software/communication/nccl.md (msimberg, Apr 4, 2025)
- `20a8b3c` Update docs/software/communication/nccl.md (msimberg, Apr 4, 2025)
- `2b2ba8c` Update docs/software/communication/openmpi.md (msimberg, Apr 4, 2025)
- `4ea05bc` Update docs/software/communication/openmpi.md (msimberg, Apr 4, 2025)
- `4b9a49c` Update docs/software/communication/openmpi.md (msimberg, Apr 4, 2025)
- `b489566` Merge branch 'main' into expand-communication (msimberg, Apr 7, 2025)
- `c7703c7` Merge branch 'main' into expand-communication (bcumming, Apr 15, 2025)
2 changes: 1 addition & 1 deletion .github/CODEOWNERS
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
* @bcumming @msimberg @RMeli
docs/services/firecrest @jpdorsch @ekouts
docs/software/communication @msimberg
docs/software/communication @biddisco @Madeeks @msimberg
docs/software/prgenv/linalg.md @finkandreas @msimberg
docs/software/sciapps/cp2k.md @abussy @RMeli
2 changes: 2 additions & 0 deletions docs/software/communication/cray-mpich.md
@@ -58,12 +58,14 @@ See [this page][ref-slurm-gh200] for more information on configuring SLURM to us

Alternatively, if you wish to not use GPU-aware MPI, either unset `MPICH_GPU_SUPPORT_ENABLED` or explicitly set it to `0` in your launch scripts.
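For example, in a launch script (a minimal sketch; `./my_app` is a placeholder application binary):

```bash
# Explicitly disable GPU-aware MPI in Cray MPICH
export MPICH_GPU_SUPPORT_ENABLED=0

srun ./my_app
```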

[](){#ref-communication-cray-mpich-known-issues}
## Known issues

This section documents known issues related to Cray MPICH on Alps. Resolved issues are also listed for reference.

### Existing Issues

[](){#ref-communication-cray-mpich-cache-monitor-disable}
#### Cray MPICH hangs

Cray MPICH may sometimes hang on larger runs.
18 changes: 18 additions & 0 deletions docs/software/communication/libfabric.md
@@ -4,4 +4,22 @@
[Libfabric](https://ofiwg.github.io/libfabric/), or Open Fabrics Interfaces (OFI), is a low level networking library that abstracts away various networking backends.
It is used by Cray MPICH, and can be used together with OpenMPI, NCCL, and RCCL to make use of the [Slingshot network on Alps][ref-alps-hsn].

## Using libfabric

If you are using a uenv provided by CSCS, such as [prgenv-gnu][ref-uenv-prgenv-gnu], [Cray MPICH][ref-communication-cray-mpich] is linked to libfabric and the high speed network will be used.
No changes are required in applications.

If you are using containers, the system libfabric can be loaded into your container using the [CXI hook provided by the container engine][ref-ce-cxi-hook].
Using the hook is essential to make full use of the Alps network.

## Tuning libfabric

Tuning libfabric (particularly together with [Cray MPICH][ref-communication-cray-mpich], [OpenMPI][ref-communication-openmpi], [NCCL][ref-communication-nccl], and [RCCL][ref-communication-rccl]) depends on many factors, including the application, workload, and system.
For a comprehensive overview of libfabric options for the CXI provider (the provider for the Slingshot network), see the [`fi_cxi` man pages](https://ofiwg.github.io/libfabric/v2.1.0/man/fi_cxi.7.html).
Note that the exact version deployed on Alps may differ, and not all options may be applicable on Alps.

See the [Cray MPICH known issues page][ref-communication-cray-mpich-known-issues] for issues when using Cray MPICH together with libfabric.
For example, certain applications may hang at scale unless [`FI_MR_CACHE_MONITOR=disabled`][ref-communication-cray-mpich-cache-monitor-disable] is set.
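The workaround can be applied in a launch script before starting the application (a sketch; `./my_app` is a placeholder):

```bash
# Workaround for Cray MPICH hangs at scale (see the Cray MPICH known issues page)
export FI_MR_CACHE_MONITOR=disabled

srun ./my_app
```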

!!! todo
More options?
36 changes: 33 additions & 3 deletions docs/software/communication/nccl.md
@@ -4,7 +4,37 @@
[NCCL](https://developer.nvidia.com/nccl) is an optimized inter-GPU communication library for NVIDIA GPUs.
It is commonly used in machine learning frameworks, but traditional scientific applications can also benefit from NCCL.

## Using NCCL

To use the Slingshot network on Alps, the [`aws-ofi-nccl`](https://github.com/aws/aws-ofi-nccl) plugin must be used.
With the container engine, the [AWS OFI NCCL hook][ref-ce-aws-ofi-hook] can be used to load the plugin into the container and configure NCCL to use it.

The container engine sets this up automatically; in all other cases, the following environment variable should always be set when using NCCL, regardless of application:

```bash
export NCCL_NET="AWS Libfabric"
```

This forces NCCL to use the libfabric plugin, enabling full use of the Slingshot network.
Conversely, if the plugin cannot be found, applications will fail to start instead of silently falling back to e.g. TCP, which would be significantly slower.
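To check which network backend NCCL actually selected, NCCL's standard debug logging can be enabled:

```bash
# Log NCCL initialization details, including which net plugin was loaded
export NCCL_DEBUG=INFO
# Optionally limit logging to the initialization and network subsystems
export NCCL_DEBUG_SUBSYS=INIT,NET
```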

For optimal performance, the following environment variables should also be set (these are set automatically by the container engine):

```bash
export NCCL_NET_GDR_LEVEL=PHB
export FI_CXI_DISABLE_HOST_REGISTER=1
export FI_MR_CACHE_MONITOR=userfaultfd
export FI_CXI_DEFAULT_CQ_SIZE=131072
export FI_CXI_DEFAULT_TX_SIZE=32768
export FI_CXI_RX_MATCH_MODE=software
```

!!! warning "GPU-aware MPI with NCCL"
Using GPU-aware MPI together with NCCL [can easily lead to deadlocks](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/mpi.html#inter-gpu-communication-with-cuda-aware-mpi).
We recommend not using GPU-aware MPI with NCCL unless care is taken to ensure that the two methods of communication are never used concurrently.
To disable GPU-aware MPI with Cray MPICH, explicitly set `MPICH_GPU_SUPPORT_ENABLED=0`.
Note that this option may be set to `1` by default on some Alps clusters.
See [the Cray MPICH documentation][ref-communication-cray-mpich] for more details on GPU-aware MPI with Cray MPICH.

!!! todo
- high level description
- libfabric/aws-ofi-nccl plugin
- configuration options
More options?
50 changes: 49 additions & 1 deletion docs/software/communication/openmpi.md
@@ -6,5 +6,53 @@ However, [OpenMPI](https://www.open-mpi.org/) can be used as an alternative in s

To use OpenMPI on Alps, it must be built against [libfabric][ref-communication-libfabric] with support for the [Slingshot 11 network][ref-alps-hsn].

## Using OpenMPI

!!! warning
Building and using OpenMPI on Alps is still [work in progress](https://eth-cscs.github.io/cray-network-stack/).
The instructions found on this page may be inaccurate, but are a good starting point for using OpenMPI on Alps.

!!! todo
Deploy experimental uenv.
Review comment from msimberg (author):

> @biddisco, @bcumming I think it'd be good if we could deploy the not fully tested uenv from @biddisco as prgenv-gnu-ompi/25.4:alpha1 or something like that to have something stable to pull from jfrog. What do you think?

Reply:

> Will do - we don't need to do this before we deploy these docs

!!! todo
Building OpenMPI for Alps is still work in progress: https://eth-cscs.github.io/cray-network-stack/.
Document OpenMPI uenv next to prgenv-gnu, prgenv-nvfortran, and linalg?

OpenMPI is provided through a [uenv][ref-uenv] similar to [`prgenv-gnu`][ref-uenv-prgenv-gnu].
Once the uenv is loaded, compiling and linking with OpenMPI and libfabric is transparent.
At runtime, some additional options must be set to correctly use the Slingshot network.

First, when launching applications through Slurm, [PMIx](https://pmix.org) must be used.
This is done with the `--mpi` flag of `srun`:
```bash
srun --mpi=pmix ...
```

Additionally, the following environment variables should be set:
```bash
export PMIX_MCA_psec="native" # (1)
export FI_PROVIDER="cxi" # (2)
export OMPI_MCA_pml="^ucx" # (3)
export OMPI_MCA_mtl="ofi" # (4)
```

1. Ensures PMIx uses the same security domain as Slurm. Otherwise PMIx will print warnings at startup.
2. Use the CXI (Slingshot) provider.
3. Use anything except [UCX](https://openucx.org/documentation/) for [point-to-point communication](https://docs.open-mpi.org/en/v5.0.x/mca.html#selecting-which-open-mpi-components-are-used-at-run-time).
4. Use libfabric for the [Matching Transport Layer](https://docs.open-mpi.org/en/v5.0.x/mca.html#frameworks).
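Putting the above together, a minimal launch could look as follows (a sketch; `./my_app` is a placeholder application binary):

```bash
# Environment for OpenMPI with libfabric/CXI on Alps (see annotations above)
export PMIX_MCA_psec="native"
export FI_PROVIDER="cxi"
export OMPI_MCA_pml="^ucx"
export OMPI_MCA_mtl="ofi"

# Launch through Slurm using PMIx
srun --mpi=pmix ./my_app
```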

!!! info "CXI provider does all communication through the network interface cards (NICs)"
When using the libfabric CXI provider, all communication goes through NICs, including intra-node communication.
This means that intra-node communication cannot make use of shared memory optimizations, and the maximum bandwidth may be severely limited.

Libfabric has a new [LINKx](https://ofiwg.github.io/libfabric/v2.1.0/man/fi_lnx.7.html) provider, which allows using different libfabric providers for inter- and intra-node communication.
This provider is not as well tested, but can in theory perform better for intra-node communication, because it can use shared memory.
To use the LINKx provider, set the following, instead of `FI_PROVIDER=cxi`:

```bash
export FI_PROVIDER="lnx" # (1)
export FI_LNX_PROV_LINKS="shm+cxi" # (2)
```

1. Use the libfabric LINKx provider, to allow using different libfabric providers for inter- and intra-node communication.
2. Use the shared memory provider for intra-node communication and the CXI (Slingshot) provider for inter-node communication.
4 changes: 4 additions & 0 deletions docs/software/communication/rccl.md
@@ -8,3 +8,7 @@ It provides equivalent functionality to [NCCL][ref-communication-nccl] for AMD G
- high level description
- libfabric/aws-ofi-rccl plugin
- configuration options

!!! info
RCCL uses many of the same [configuration options as NCCL](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html); the environment variables keep the `NCCL_` prefix, not an `RCCL_` prefix.
Refer to the NCCL documentation when tuning RCCL.
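For example, RCCL reads the NCCL-prefixed debug variables (a sketch; the exact log output depends on the deployment):

```bash
# RCCL reads NCCL_-prefixed variables; enable logging to verify that the
# libfabric plugin is picked up at initialization
export NCCL_DEBUG=INFO
```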
5 changes: 4 additions & 1 deletion docs/software/container-engine.md
@@ -437,7 +437,9 @@ If a libfabric library is already present in the container filesystem (for examp

!!! note
Due to the nature of Slingshot and the mechanism implemented by the CXI hook, container applications need to use a communication library which supports libfabric in order to benefit from usage of the hook.
> Libfabric support might have to be defined at compilation time (as is the case for some MPI implementations, like MPICH and OpenMPI) or could be dynamically available at runtime (as is the case with NCCL - see also [this][ref-ce-aws-ofi-hook] section for more details).

!!! note
Libfabric support might have to be defined at compilation time (as is the case for some MPI implementations, like MPICH and OpenMPI) or could be dynamically available at runtime (as is the case with NCCL - see also [this][ref-ce-aws-ofi-hook] section for more details).

The hook is activated by setting the `com.hooks.cxi.enabled` annotation, which can be defined in the EDF, as shown in the following example:

@@ -533,6 +535,7 @@ Container hooks let you customize container behavior to fit system-specific need
### AWS OFI NCCL Hook 

The [AWS OFI NCCL plugin](https://github.com/aws/aws-ofi-nccl) is a software extension that allows the [NCCL](https://developer.nvidia.com/nccl) and [RCCL](https://rocm.docs.amd.com/projects/rccl/en/latest/) libraries to use libfabric as a network provider and, through libfabric, to access the Slingshot high-speed interconnect.
Also see the [NCCL][ref-communication-nccl] and [libfabric][ref-communication-libfabric] pages for more information on using these libraries on Alps.

The Container Engine includes a hook program to inject the AWS OFI NCCL plugin in containers; since the plugin must also be compatible with the GPU programming software stack being used, the `com.hooks.aws_ofi_nccl.variant` annotation is used to specify a plugin variant suitable for a given container image.
At the time of writing, 4 plugin variants are configured: `cuda11`, `cuda12` (to be used on NVIDIA GPU nodes), `rocm5`, and `rocm6` (to be used on AMD GPU nodes alongside RCCL).