@@ -669,17 +669,23 @@ Tuning Guide*, which can be found in the `Cornelis Networks Customer Center
 When do I need to select a CUDA device?
 ---------------------------------------
 
-"mpi-cuda-dev-selection"
-
-OpenMPI requires CUDA resources allocated for internal use. These
-are allocated lazily when they are first needed, e.g. CUDA IPC mem handles
-are created when a communication routine first requires them during a
-transfer. So, the CUDA device needs to be selected before the first MPI
-call requiring a CUDA resource. MPI_Init and most communicator related
-operations do not create any CUDA resources (guaranteed for MPI_Init,
-MPI_Comm_rank, MPI_Comm_size, MPI_Comm_split_type and MPI_Comm_free). It
-is thus possible to use those routines to query rank information and use
-those to select a GPU, e.g. using
+Open MPI requires CUDA resources to be allocated for internal use. When
+possible, these resources are allocated lazily when they are first needed,
+e.g. CUDA IPC memory handles are created when a communication routine first
+requires them during a transfer. MPI_Init and most communicator-related
+operations do not create any CUDA resources (guaranteed at least for
+MPI_Comm_rank and MPI_Comm_size on ``MPI_COMM_WORLD``).
+
+However, this is not always the case. In certain instances, such as when
+using PSM2 or the ``smcuda`` BTL (with the OB1 PML), it is not feasible to
+delay the allocation of CUDA resources. Consequently, these resources must
+be allocated during ``MPI_Init()``.
+
+Regardless of the situation, the CUDA device must be selected before the
+first MPI call that requires a CUDA resource. When CUDA resources can be
+initialized lazily, the aforementioned communicator-related operations can
+be used to query rank information and select a GPU accordingly.
+
 
 .. code-block:: c
 
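The hunk ends where the FAQ's C example begins, so the example body itself is
not part of this commit. A minimal sketch of the lazy-initialization pattern
the new text describes might look like the following (the round-robin mapping
via ``cudaGetDeviceCount`` is an illustrative assumption, not necessarily the
code in the file):

.. code-block:: c

   #include <mpi.h>
   #include <cuda_runtime.h>

   int main(int argc, char *argv[])
   {
       /* MPI_Init, MPI_Comm_rank and MPI_Comm_size (on MPI_COMM_WORLD) are
        * guaranteed not to create CUDA resources, so with lazy initialization
        * they may safely be called before a device has been selected. */
       MPI_Init(&argc, &argv);

       int rank;
       MPI_Comm_rank(MPI_COMM_WORLD, &rank);

       /* Illustrative round-robin mapping of ranks to visible GPUs; real
        * codes often use the node-local rank instead of the world rank. */
       int num_devices = 0;
       cudaGetDeviceCount(&num_devices);
       if (num_devices > 0) {
           cudaSetDevice(rank % num_devices);
       }

       /* Any later MPI call that touches CUDA resources (e.g. sending from
        * device memory) will now use the selected device. */
       MPI_Finalize();
       return 0;
   }

In the non-lazy cases called out above (PSM2, or the ``smcuda`` BTL with the
OB1 PML), the device has to be selected before ``MPI_Init()`` instead, e.g.
based on the ``OMPI_COMM_WORLD_LOCAL_RANK`` environment variable that Open
MPI's launcher exports.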