Expand communication pages #75
Changes from all commits
```diff
@@ -1,6 +1,6 @@
 * @bcumming @msimberg @RMeli
 docs/services/firecrest @jpdorsch @ekouts
-docs/software/communication @msimberg
+docs/software/communication @Madeeks @msimberg
 docs/software/devtools/linaro @jgphpc
 docs/software/prgenv/linalg.md @finkandreas @msimberg
 docs/software/sciapps/cp2k.md @abussy @RMeli
```
@@ -4,7 +4,78 @@

[NCCL](https://developer.nvidia.com/nccl) is an optimized inter-GPU communication library for NVIDIA GPUs.
It is commonly used in machine learning frameworks, but traditional scientific applications can also benefit from NCCL.

!!! todo
    - high level description
    - libfabric/aws-ofi-nccl plugin
    - configuration options

## Using NCCL

To use the Slingshot network on Alps, the [`aws-ofi-nccl`](https://github.com/aws/aws-ofi-nccl) plugin must be used.
With the container engine, the [AWS OFI NCCL hook][ref-ce-aws-ofi-hook] can be used to load the plugin into the container and configure NCCL to use it.

Most uenvs, such as [`prgenv-gnu`][ref-uenv-prgenv-gnu], also contain the NCCL plugin.
When using e.g. the `default` view of `prgenv-gnu`, the `aws-ofi-nccl` plugin is available in the environment.
Alternatively, loading the `aws-ofi-nccl` module with the `modules` view also makes the plugin available in the environment.
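For example, a hedged sketch of making the plugin available in a uenv session; the uenv name and views are taken from the text above, but the version tag is only a placeholder:

```bash
# With the default view the plugin is already on the search paths
# (the version tag below is an example, use one available on your system):
uenv start prgenv-gnu/24.11:v1 --view=default

# Or, with the modules view, load the plugin explicitly:
uenv start prgenv-gnu/24.11:v1 --view=modules
module load aws-ofi-nccl
```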

The environment variables described below must be set to ensure that NCCL uses the plugin.
While the container engine sets these automatically when using the NCCL hook, the following environment variables should always be set for correctness and optimal performance when using NCCL:

```bash
export NCCL_NET="AWS Libfabric" # (1)!
export NCCL_NET_GDR_LEVEL=PHB # (2)!
export FI_CXI_DEFAULT_CQ_SIZE=131072 # (3)!
export FI_CXI_DEFAULT_TX_SIZE=32768
export FI_CXI_DISABLE_HOST_REGISTER=1
export FI_CXI_RX_MATCH_MODE=software
export FI_MR_CACHE_MONITOR=userfaultfd
export MPICH_GPU_SUPPORT_ENABLED=0 # (4)!
```

1. This forces NCCL to use the libfabric plugin, enabling full use of the Slingshot network. If the plugin cannot be found, applications will fail to start. With the default value, applications would instead fall back to e.g. TCP, which would be significantly slower than with the plugin. [More information about `NCCL_NET`](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-net).
2. Use GPU Direct RDMA when the GPU and NIC are on the same NUMA node. [More information about `NCCL_NET_GDR_LEVEL`](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-net-gdr-level-formerly-nccl-ib-gdr-level).
3. This and the other `FI` (libfabric) environment variables have been found to give the best performance on the Alps network across a wide range of applications. Specific applications may perform better with other values.
4. Disable GPU-aware MPI explicitly, to avoid potential deadlocks between MPI and NCCL.

> Review discussion on the environment variable block (abridged from the inline thread):
>
> "@boeschf, @teojgo, @fawzi (or anyone else that knows): any comments on […]"
>
> "For reference, the default vars set by the AWS NCCL plugin hook (which reasonably apply only when NCCL is used) are defined here: https://git.cscs.ch/alps-platforms/vservices/vs-enroot/-/blob/main/enroot-variables.tf?ref_type=heads#L314. Individual vclusters might override these values, of course."
>
> "Ah, of course 🤦 makes sense. So definitely not for the NCCL page then. Might still be useful to add to the Cray MPICH page, but I'd like to first understand how universally useful it is."
>
> "Indeed, thanks for the link. That's what I was using as a reference to understand if we need to list all of them on the NCCL page here in the docs. Based on docs […]"

!!! warning "Using NCCL with uenvs"
    The environment variables listed above are not set automatically when using uenvs.
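As an illustration, a minimal Slurm batch script that sets these variables before launching an NCCL-based application might look like the following. This is a sketch only: the node and task counts, the uenv name and version, and the application binary are placeholders, and the `--uenv`/`--view` flags assume the uenv Slurm plugin is available on the system.

```bash
#!/bin/bash
#SBATCH --job-name=nccl-example
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-task=1

# Recommended NCCL/libfabric settings on Alps (see the list above)
export NCCL_NET="AWS Libfabric"
export NCCL_NET_GDR_LEVEL=PHB
export FI_CXI_DEFAULT_CQ_SIZE=131072
export FI_CXI_DEFAULT_TX_SIZE=32768
export FI_CXI_DISABLE_HOST_REGISTER=1
export FI_CXI_RX_MATCH_MODE=software
export FI_MR_CACHE_MONITOR=userfaultfd
export MPICH_GPU_SUPPORT_ENABLED=0

# Launch inside the uenv; image name/version and binary are placeholders
srun --uenv=prgenv-gnu/24.11:v1 --view=default ./my_nccl_app
```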

!!! warning "GPU-aware MPI with NCCL"
    Using GPU-aware MPI together with NCCL [can easily lead to deadlocks](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/mpi.html#inter-gpu-communication-with-cuda-aware-mpi).
    We recommend not using GPU-aware MPI together with NCCL unless care is taken to ensure that the two methods of communication are not used concurrently.
    To disable GPU-aware MPI with Cray MPICH, explicitly set `MPICH_GPU_SUPPORT_ENABLED=0`.
    Note that this option may be set to `1` by default on some Alps clusters.
    See [the Cray MPICH documentation][ref-communication-cray-mpich] for more details on GPU-aware MPI with Cray MPICH.

!!! warning "`invalid usage` error with `NCCL_NET="AWS Libfabric"`"
    If you are getting error messages such as:
    ```console
    nid006352: Test NCCL failure common.cu:958 'invalid usage (run with NCCL_DEBUG=WARN for details)
    ```
    this may be due to the plugin not being found by NCCL.
    If this is the case, running the application with the recommended `NCCL_DEBUG=WARN` should print something similar to the following:
    ```console
    nid006352:34157:34217 [1] net.cc:626 NCCL WARN Error: network AWS Libfabric not found.
    ```
    When using uenvs like `prgenv-gnu`, make sure you are either using the `default` view, which loads `aws-ofi-nccl` automatically, or, if using the `modules` view, load the `aws-ofi-nccl` module with `module load aws-ofi-nccl`.
    If the plugin is found correctly, running the application with `NCCL_DEBUG=INFO` should print:
    ```console
    nid006352:34610:34631 [0] NCCL INFO Using network AWS Libfabric
    ```

!!! warning "Do not use `NCCL_NET_PLUGIN="ofi"` with uenvs"
    NCCL has an alternative way of specifying which plugin to use: `NCCL_NET_PLUGIN`.
    When using uenvs, do not set `NCCL_NET_PLUGIN="ofi"` instead of, or in addition to, `NCCL_NET="AWS Libfabric"`.
    If you do, your application will fail to start, since NCCL will:

    1. fail to find the plugin because of the name of the shared library in the uenv, and
    2. prefer `NCCL_NET_PLUGIN` over `NCCL_NET`, so it will fail to find the plugin even if `NCCL_NET="AWS Libfabric"` is correctly set.

    When both environment variables are set, the error message with `NCCL_DEBUG=WARN` will look similar to when the plugin isn't available:
    ```console
    nid006365:179857:179897 [1] net.cc:626 NCCL WARN Error: network AWS Libfabric not found.
    ```

    With `NCCL_DEBUG=INFO`, NCCL will print:
    ```console
    nid006365:180142:180163 [0] NCCL INFO NET/Plugin: Could not find: ofi libnccl-net-ofi.so. Using internal network plugin.
    ...
    nid006365:180142:180163 [0] net.cc:626 NCCL WARN Error: network AWS Libfabric not found.
    ```

    If you only set `NCCL_NET_PLUGIN="ofi"`, NCCL may silently fail to load the plugin and fall back to its internal, default implementation.
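To confirm which network backend NCCL actually selected, it can help to run a small job with `NCCL_DEBUG=INFO` and check the output. A hedged sketch, where the launch options and application binary are placeholders:

```bash
# Print NCCL's network selection; expect a line containing "Using network AWS Libfabric"
NCCL_DEBUG=INFO srun -N 1 --ntasks-per-node=2 ./my_nccl_app 2>&1 | grep -i "Using network"
```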
@@ -6,5 +6,52 @@ However, [OpenMPI](https://www.open-mpi.org/) can be used as an alternative in s

To use OpenMPI on Alps, it must be built against [libfabric][ref-communication-libfabric] with support for the [Slingshot 11 network][ref-alps-hsn].
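For a from-source build outside of the provided uenv, a rough sketch of a configure invocation might look as follows. The installation paths are placeholders, and the exact configuration needed for working Slingshot support is still being worked out (see the warning below):

```bash
# Sketch only: point OpenMPI at a libfabric installation that includes the CXI provider.
./configure --prefix=$HOME/install/openmpi \
            --with-ofi=/path/to/libfabric \
            --with-pmix=/path/to/pmix \
            --with-slurm
make -j && make install
```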

## Using OpenMPI

!!! warning
    Building and using OpenMPI on Alps is still [work in progress](https://eth-cscs.github.io/cray-network-stack/).
    The instructions found on this page may be inaccurate, but they are a good starting point for using OpenMPI on Alps.

!!! todo
    Deploy experimental uenv.

> Review note on the todo above: "Will do - we don't need to do this before we deploy these docs."

!!! todo
    Building OpenMPI for Alps is still work in progress: https://eth-cscs.github.io/cray-network-stack/.
    Document OpenMPI uenv next to prgenv-gnu, prgenv-nvfortran, and linalg?

OpenMPI is provided through a [uenv][ref-uenv] similar to [`prgenv-gnu`][ref-uenv-prgenv-gnu].
Once the uenv is loaded, compiling and linking with OpenMPI and libfabric is transparent.
At runtime, some additional options must be set to correctly use the Slingshot network.

First, when launching applications through Slurm, [PMIx](https://pmix.org) must be used.
This is done with the `--mpi` flag of `srun`:
```bash
srun --mpi=pmix ...
```
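If you are unsure whether PMIx support is available in your Slurm installation, you can list the supported MPI plugin types:

```bash
# Lists the MPI/PMI launch plugins known to this Slurm installation;
# "pmix" (possibly with a version suffix) should appear if PMIx support is installed.
srun --mpi=list
```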

Additionally, the following environment variables should be set:
```bash
export PMIX_MCA_psec="native" # (1)!
export FI_PROVIDER="cxi" # (2)!
export OMPI_MCA_pml="^ucx" # (3)!
export OMPI_MCA_mtl="ofi" # (4)!
```

1. Ensures PMIx uses the same security domain as Slurm. Otherwise PMIx will print warnings at startup.
2. Use the CXI (Slingshot) provider.
3. Use anything except [UCX](https://openucx.org/documentation/) for [point-to-point communication](https://docs.open-mpi.org/en/v5.0.x/mca.html#selecting-which-open-mpi-components-are-used-at-run-time). The `^` signals that OpenMPI should exclude all listed components.
4. Use libfabric for the [Matching Transport Layer](https://docs.open-mpi.org/en/v5.0.x/mca.html#frameworks).
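Putting the pieces together, a minimal sketch of a job step with OpenMPI on Alps might look like this; the node and task counts and the application binary are placeholders:

```bash
export PMIX_MCA_psec="native"
export FI_PROVIDER="cxi"
export OMPI_MCA_pml="^ucx"
export OMPI_MCA_mtl="ofi"

# Launch with PMIx so OpenMPI picks up the Slurm allocation correctly
srun --mpi=pmix -N 2 --ntasks-per-node=4 ./my_openmpi_app
```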

!!! info "The CXI provider does all communication through the network interface cards (NICs)"
    When using the libfabric CXI provider, all communication goes through the NICs, including intra-node communication.
    This means that intra-node communication cannot make use of shared memory optimizations, and the maximum intra-node bandwidth may be lower than with shared memory.

Libfabric has a new [LINKx](https://ofiwg.github.io/libfabric/v2.1.0/man/fi_lnx.7.html) provider, which allows using different libfabric providers for inter- and intra-node communication.
This provider is not as well tested, but it can in theory perform better for intra-node communication because it can use shared memory.
To use the LINKx provider, set the following instead of `FI_PROVIDER=cxi`:

```bash
export FI_PROVIDER="lnx" # (1)!
export FI_LNX_PROV_LINKS="shm+cxi" # (2)!
```

1. Use the libfabric LINKx provider, to allow using different libfabric providers for inter- and intra-node communication.
2. Use the shared memory provider for intra-node communication and the CXI (Slingshot) provider for inter-node communication.

> Closing review note: "NB @bcumming, I'm linking from the GB docs to the updated NCCL docs now."