Be more clear about CUDA vs. MPI_Init order. #13355

Open · wants to merge 1 commit into main

28 changes: 17 additions & 11 deletions docs/tuning-apps/networking/cuda.rst
@@ -669,17 +669,23 @@ Tuning Guide*, which can be found in the `Cornelis Networks Customer Center
When do I need to select a CUDA device?
---------------------------------------

"mpi-cuda-dev-selection"

OpenMPI requires CUDA resources allocated for internal use. These
are allocated lazily when they are first needed, e.g. CUDA IPC mem handles
are created when a communication routine first requires them during a
transfer. So, the CUDA device needs to be selected before the first MPI
call requiring a CUDA resource. MPI_Init and most communicator related
operations do not create any CUDA resources (guaranteed for MPI_Init,
MPI_Comm_rank, MPI_Comm_size, MPI_Comm_split_type and MPI_Comm_free). It
is thus possible to use those routines to query rank information and use
those to select a GPU, e.g. using
Open MPI requires CUDA resources to be allocated for internal use. When
possible, these resources are allocated lazily when they are first needed;
for example, CUDA IPC memory handles are created when a communication routine
first requires them during a transfer. ``MPI_Init()`` and most
communicator-related operations do not create any CUDA resources (guaranteed
at least for ``MPI_Comm_rank`` and ``MPI_Comm_size`` on ``MPI_COMM_WORLD``).

However, this is not always the case. In certain instances, such as when
using PSM2 or the ``smcuda`` BTL (with the OB1 PML), it is not feasible to
delay the allocation of CUDA resources; consequently, they must be allocated
during ``MPI_Init()``.
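
In that situation the device has to be chosen before ``MPI_Init()`` itself is
called. Below is a minimal sketch of one way to do this, assuming the launcher
exports a local-rank environment variable such as
``OMPI_COMM_WORLD_LOCAL_RANK`` (set by Open MPI's ``mpirun``); it is an
illustration, not the only possible approach.

.. code-block:: c

   #include <stdlib.h>
   #include <mpi.h>
   #include <cuda_runtime.h>

   int main(int argc, char *argv[])
   {
       /* Sketch only: select a GPU *before* MPI_Init() for configurations
        * (e.g. PSM2 or the smcuda BTL) that allocate CUDA resources during
        * initialization.  OMPI_COMM_WORLD_LOCAL_RANK is exported by Open
        * MPI's launcher; fall back to device 0 if it is not set. */
       const char *lrank = getenv("OMPI_COMM_WORLD_LOCAL_RANK");
       int local_rank = (lrank != NULL) ? atoi(lrank) : 0;

       int num_devices = 0;
       cudaGetDeviceCount(&num_devices);
       if (num_devices > 0) {
           cudaSetDevice(local_rank % num_devices);
       }

       MPI_Init(&argc, &argv);

       /* ... CUDA-aware MPI communication ... */

       MPI_Finalize();
       return 0;
   }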

Regardless of the situation, the CUDA device must be selected before the
first MPI call that requires a CUDA resource. When CUDA resources can be
allocated lazily, it is possible to use the aforementioned
communicator-related operations to query rank information and use it to
select a GPU, for example:


.. code-block:: c
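
   /* Illustrative sketch only (a surrounding program with argc/argv in
    * scope is assumed): query rank information with one of the routines
    * listed above, then select a CUDA device before the first MPI call
    * that performs a CUDA-aware transfer. */
   MPI_Init(&argc, &argv);

   int rank = 0;
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);

   int num_devices = 0;
   cudaGetDeviceCount(&num_devices);
   if (num_devices > 0) {
       cudaSetDevice(rank % num_devices);
   }

   /* CUDA-aware MPI communication may start from this point on. */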
