Expand communication pages #75
Merged
25 commits
477e097 Add more links to libfabric section (msimberg)
7a871b3 Add a few environment variables for OpenMPI on Alps (msimberg)
ea89fe0 Expand NCCL and RCCL pages (msimberg)
4b2a984 Add note box in container engine docs (msimberg)
59f5ba2 Add more codeowners to communication pages (msimberg)
8f15929 Update docs/software/communication/nccl.md (msimberg)
259fd4b Recommend cxi over lnx when using OpenMPI (msimberg)
30901d1 perf variables (boeschf)
64db5bf Merge pull request #1 from boeschf/expand-communication (msimberg)
c93b4df Add links to NCCL docs from GB docs (msimberg)
988c24a Refactor NCCL docs, add uenv notes (msimberg)
79c51c2 Add comma (msimberg)
9ae6744 Fix tyop (msimberg)
4b7ae6b Add more examples and warnings about aws ofi nccl plugin not loading … (msimberg)
f0b7e1d Fix annotation numbering in NCCL docs (msimberg)
18aee3f Add more text about NCCL_NET_PLUGIN (msimberg)
49af1cc Remove biddisco from communication code owners (msimberg)
b1e6b3a Update docs/software/communication/libfabric.md (msimberg)
36262c8 Update docs/software/communication/nccl.md (msimberg)
20a8b3c Update docs/software/communication/nccl.md (msimberg)
2b2ba8c Update docs/software/communication/openmpi.md (msimberg)
4ea05bc Update docs/software/communication/openmpi.md (msimberg)
4b9a49c Update docs/software/communication/openmpi.md (msimberg)
b489566 Merge branch 'main' into expand-communication (msimberg)
c7703c7 Merge branch 'main' into expand-communication (bcumming)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,5 +1,5 @@ | ||
| * @bcumming @msimberg @RMeli | ||
| docs/services/firecrest @jpdorsch @ekouts | ||
| docs/software/communication @msimberg | ||
| docs/software/communication @biddisco @Madeeks @msimberg | ||
| docs/software/prgenv/linalg.md @finkandreas @msimberg | ||
| docs/software/sciapps/cp2k.md @abussy @RMeli | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -6,5 +6,53 @@ However, [OpenMPI](https://www.open-mpi.org/) can be used as an alternative in s | |
|
|
||
| To use OpenMPI on Alps, it must be built against [libfabric][ref-communication-libfabric] with support for the [Slingshot 11 network][ref-alps-hsn]. | ||
|
|
||
| ## Using OpenMPI | ||
|
|
||
| !!! warning | ||
| Building and using OpenMPI on Alps is still [work in progress](https://eth-cscs.github.io/cray-network-stack/). | ||
| The instructions found on this page may be inaccurate, but are a good starting point for using OpenMPI on Alps. | ||
|
|
||
| !!! todo | ||
| Deploy experimental uenv. | ||
|
Review comment (Member): Will do - we don't need to do this before we deploy these docs
|
|
||
| !!! todo | ||
| Building OpenMPI for Alps is still work in progress: https://eth-cscs.github.io/cray-network-stack/. | ||
| Document OpenMPI uenv next to prgenv-gnu, prgenv-nvfortran, and linalg? | ||
|
|
||
| OpenMPI is provided through a [uenv][ref-uenv] similar to [`prgenv-gnu`][ref-uenv-prgenv-gnu]. | ||
| Once the uenv is loaded, compiling and linking with OpenMPI and libfabric is transparent. | ||
| At runtime, some additional options must be set to correctly use the Slingshot network. | ||
|
|
||
| First, when launching applications through Slurm, [PMIx](https://pmix.github.com) must be used as the launch mechanism. | ||
| This is done with the `--mpi` flag of `srun`: | ||
| ```bash | ||
| srun --mpi=pmix ... | ||
| ``` | ||
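Whether `--mpi=pmix` is accepted depends on how Slurm was built. A minimal sketch of a pre-flight check follows. It assumes that `srun --mpi=list` reports the available MPI plugin types and that `pmix` appears there as a separate word. The helper name `slurm_has_pmix` is hypothetical and not part of the documented setup.

```bash
# Hypothetical helper: succeeds if `srun --mpi=list` mentions a pmix plugin.
# Assumption: the plugin list contains "pmix" as a separate word. The list may
# be printed on stderr, so both output streams are searched.
slurm_has_pmix() {
    srun --mpi=list 2>&1 | grep -qw "pmix"
}

# Example use (left commented out so the snippet has no side effects):
# slurm_has_pmix || echo "This Slurm installation does not list pmix" >&2
```

If the check fails, `srun --mpi=pmix` is unlikely to work and the Slurm configuration should be inspected first.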
|
|
||
| Additionally, the following environment variables should be set: | ||
| ```bash | ||
| export PMIX_MCA_psec="native" # (1) | ||
| export FI_PROVIDER="cxi" # (2) | ||
| export OMPI_MCA_pml="^ucx" # (3) | ||
| export OMPI_MCA_mtl="ofi" # (4) | ||
| ``` | ||
|
|
||
| 1. Ensures PMIx uses the same security domain as Slurm. Otherwise PMIx will print warnings at startup. | ||
| 2. Use the CXI (Slingshot) provider. | ||
| 3. Use anything except [UCX](https://openucx.org/documentation/) for [point-to-point communication](https://docs.open-mpi.org/en/v5.0.x/mca.html#selecting-which-open-mpi-components-are-used-at-run-time). | ||
| 4. Use libfabric for the [Matching Transport Layer](https://docs.open-mpi.org/en/v5.0.x/mca.html#frameworks). | ||
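The launch flag and the four variables above can be gathered in one place, for example at the top of a job script. This is a minimal sketch, not a tested recipe: the function name `setup_openmpi_env` and the application `./my_app` are placeholders, and the values are exactly those listed above.

```bash
#!/bin/bash
# Hypothetical helper collecting the runtime settings described above.
setup_openmpi_env() {
    export PMIX_MCA_psec="native"  # PMIx uses the same security domain as Slurm
    export FI_PROVIDER="cxi"       # CXI (Slingshot) libfabric provider
    export OMPI_MCA_pml="^ucx"     # anything except UCX for point-to-point
    export OMPI_MCA_mtl="ofi"      # libfabric for the Matching Transport Layer
}

setup_openmpi_env

# Launch through PMIx; ./my_app is a placeholder for the real application.
# srun --mpi=pmix ./my_app
```

Keeping the settings in a function makes it easy to source the same file from several job scripts.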
|
|
||
| !!! info "CXI provider does all communication through the network interface cards (NICs)" | ||
| When using the libfabric CXI provider, all communication goes through NICs, including intra-node communication. | ||
| This means that intra-node communication cannot make use of shared memory optimizations and the maximum bandwidth will be severely limited. | ||
|
|
||
| Libfabric has a new [LINKx](https://ofiwg.github.io/libfabric/v2.1.0/man/fi_lnx.7.html) provider, which allows using different libfabric providers for inter- and intra-node communication. | ||
| This provider is not as well tested, but can in theory perform better for intra-node communication, because it can use shared memory. | ||
| To use the LINKx provider, set the following instead of `FI_PROVIDER=cxi`: | ||
|
|
||
| ```bash | ||
| export FI_PROVIDER="lnx" # (1) | ||
| export FI_LNX_PROV_LINKS="shm+cxi" # (2) | ||
| ``` | ||
|
|
||
| 1. Use the libfabric LINKx provider, to allow using different libfabric providers for inter- and intra-node communication. | ||
| 2. Use the shared memory provider for intra-node communication and the CXI (Slingshot) provider for inter-node communication. | ||
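Because LINKx is less well tested, it can be convenient to switch between the two configurations without editing the job script. The sketch below assumes a hypothetical `USE_LNX` switch and helper name `select_libfabric_provider`; neither is part of libfabric or OpenMPI. Only the exported variables come from the text above.

```bash
# Hypothetical helper: pick the libfabric provider setup based on USE_LNX.
select_libfabric_provider() {
    if [ "${USE_LNX:-0}" = "1" ]; then
        # LINKx: shared memory within a node, CXI between nodes.
        export FI_PROVIDER="lnx"
        export FI_LNX_PROV_LINKS="shm+cxi"
    else
        # Default: CXI for all communication.
        export FI_PROVIDER="cxi"
        unset FI_LNX_PROV_LINKS
    fi
}

select_libfabric_provider
```

Setting `USE_LNX=1` before the call selects LINKx. Any other value, or leaving it unset, falls back to the CXI provider.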