
Commit dd6a7a3

Merge pull request #13562 from edgargabriel/topic/docs-update-for-6.0
Pass over the documentation tuning section
2 parents 7c5a405 + 50c2cb9 commit dd6a7a3

File tree

11 files changed: +415 -155 lines changed
Lines changed: 16 additions & 0 deletions
Accelerator support
===================

Open MPI supports a variety of accelerator vendor
ecosystems. This section provides some generic guidance on tuning MPI
applications that use device memory, as well as vendor-specific
options.

.. toctree::
   :maxdepth: 1

   initialize
   memkind
   cuda
   rocm
Lines changed: 39 additions & 0 deletions
Selecting an Accelerator Device before calling MPI_Init
========================================================

A common problem when using accelerators is selecting which GPU
should be used by an MPI process. The decision is often based on the
rank of that process in ``MPI_COMM_WORLD``. The rank of a process
can, however, only be retrieved after the MPI library has been
initialized. On the other hand, the accelerator resources initialized
during ``MPI_Init`` can have some association with the `current`
device, which will be the default device used by a particular
ecosystem if not set to a different value.

To break this circular dependency, applications are encouraged to
use the environment variable ``OMPI_COMM_WORLD_LOCAL_RANK``,
which is set by Open MPI at launch time and can be retrieved before
``MPI_Init``. A code sample using the HIP programming model looks as
follows:

.. code-block:: c

   int num_devices;
   hipGetDeviceCount(&num_devices);
   assert(num_devices > 0);

   /* Select the device based on the node-local rank, which Open MPI
      exports in the environment before MPI_Init is called. */
   char *ompi_local_rank = getenv("OMPI_COMM_WORLD_LOCAL_RANK");
   if (NULL != ompi_local_rank) {
       hipSetDevice(atoi(ompi_local_rank) % num_devices);
   }

   MPI_Init(&argc, &argv);
   ...


.. note:: Open MPI currently assumes that an MPI process is using a
          single accelerator device. Certain software stacks might be
          able to support multiple GPUs per rank.

Lines changed: 64 additions & 0 deletions
Support for Memory-kind Info Objects
====================================

`MPI version 4.1 <https://www.mpi-forum.org/docs/mpi-4.1/mpi41-report.pdf>`_
introduced the notion of memory allocation kinds, which allow an
application to specify which memory types it plans to use, and to
query which memory types are supported by the MPI library in a
portable manner. In addition, the application can place restrictions
on certain objects, for example by creating one communicator for use
with host memory only and another communicator for use with device
memory only. This approach allows the MPI library to perform certain
optimizations, such as bypassing the memory-type check on buffer
pointers. Please refer to the MPI specification as well as the
`Memory Allocation Kinds Side Document
<https://www.mpi-forum.org/docs/sidedocs/mem-alloc10.pdf>`_ for more
details and examples. A sketch of such a query is shown below the
list of supported kinds.

Starting with version 6.0.0, Open MPI supports the following values
for the memory allocation kind info object:

* mpi
* system
* cuda:device
* cuda:host
* cuda:managed
* level_zero:device
* level_zero:host
* level_zero:shared
* rocm:device
* rocm:host
* rocm:managed

.. note:: Support for accelerator memory allocation kind info objects
          will depend on the accelerator support compiled into Open
          MPI.

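A minimal sketch of querying the supported kinds at runtime is shown
below. It assumes that the ``mpi_memory_alloc_kinds`` info key
described in the side document is exposed on the info object attached
to ``MPI_COMM_WORLD``; consult the specification for the complete set
of objects that carry this key.

.. code:: c

   MPI_Info info;
   char kinds[1024];
   int buflen = sizeof(kinds);
   int flag = 0;

   /* Retrieve the info object of MPI_COMM_WORLD and look up the
      comma-separated list of supported memory allocation kinds
      (assumes the library exposes the key on this object). */
   MPI_Comm_get_info (MPI_COMM_WORLD, &info);
   MPI_Info_get_string (info, "mpi_memory_alloc_kinds", &buflen, kinds, &flag);
   if (flag) {
       printf ("Supported memory allocation kinds: %s\n", kinds);
   }
   MPI_Info_free (&info);
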
Passing memory-kind info to mpiexec
===================================

The following example demonstrates how to pass memory allocation kind
information to Open MPI at application launch:

.. code:: sh

   # Specify that the application will use system, MPI, and CUDA device memory
   shell$ mpiexec --memory-allocation-kinds system,mpi,cuda:device -n 64 ./<my_executable>

Asserting usage of memory kind when creating a Communicator
============================================================

The following code snippet demonstrates how to assert that a
communicator will only be used for ROCm device buffers:

.. code:: c

   MPI_Info info_assert;
   MPI_Info_create (&info_assert);
   char assert_key[] = "mpi_assert_memory_alloc_kinds";
   char assert_value[] = "rocm:device";
   MPI_Info_set (info_assert, assert_key, assert_value);

   MPI_Comm comm_dup;
   MPI_Comm_dup_with_info (MPI_COMM_WORLD, info_assert, &comm_dup);
   ...

Lines changed: 269 additions & 0 deletions
ROCm
====

ROCm is the name of the software stack used by AMD GPUs. It includes
the ROCm Runtime (ROCr), the HIP programming model, and numerous
numerical and machine learning libraries tuned for the AMD Instinct
and Radeon accelerators. More information can be found at the
`AMD webpages <https://rocm.docs.amd.com/en/latest/>`_.


Building Open MPI with ROCm support
-----------------------------------

ROCm-aware support means that the MPI library can send and receive
data from AMD GPU device buffers directly. Starting with Open MPI
v6.0.0, ROCm support is available natively within Open MPI for
single-node scenarios, and through UCX or libfabric for multi-node
scenarios.

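In practice this means that a device pointer obtained from,
e.g., ``hipMalloc`` can be passed directly to MPI communication
calls. A minimal sketch is shown below; ``count`` and ``peer`` are
placeholders used only for illustration:

.. code-block:: c

   /* Allocate a buffer in GPU device memory and hand it directly to
      MPI; no staging copy to host memory is required in the
      application code. */
   double *sendbuf;
   hipMalloc ((void **)&sendbuf, count * sizeof(double));
   /* ... fill sendbuf on the device ... */
   MPI_Send (sendbuf, count, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);
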
Compiling Open MPI with ROCm support
------------------------------------

Compiling Open MPI with ROCm support requires setting the
``--with-rocm=<rocm-path>`` option at configure time:

.. code-block:: sh

   # Configure Open MPI with ROCm support
   shell$ cd ompi
   shell$ ./configure --with-rocm=/opt/rocm \
          <other configure params>


/////////////////////////////////////////////////////////////////////////

Checking that Open MPI has been built with ROCm support
-------------------------------------------------------

Verify that Open MPI has been built with ROCm using the
:ref:`ompi_info(1) <man1-ompi_info>` command:

.. code-block:: sh

   # Use ompi_info to verify ROCm support in Open MPI
   shell$ ./ompi_info | grep "MPI extensions"
   MPI extensions: affinity, cuda, ftmpi, rocm

/////////////////////////////////////////////////////////////////////////

Runtime querying of ROCm support in Open MPI
--------------------------------------------

Querying the availability of ROCm support in Open MPI at runtime is
possible through the memory allocation kind info object; see the
:ref:`memkind` page for details.

In addition, starting with Open MPI v5.0.0, :ref:`MPIX_Query_rocm_support(3)
<mpix_query_rocm_support>` is available as an extension to check the
availability of ROCm support in the library. To use the function, the
code needs to include ``mpi-ext.h``. Note that ``mpi-ext.h`` is an
Open MPI specific header file.

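A minimal sketch of calling this extension might look as follows,
treating a non-zero return value as an indication that ROCm support
is available:

.. code-block:: c

   #include <stdio.h>
   #include <mpi.h>
   #include <mpi-ext.h>   /* Open MPI specific header providing the extension */

   ...
   if (MPIX_Query_rocm_support()) {
       printf ("This Open MPI library has ROCm support\n");
   } else {
       printf ("This Open MPI library does not have ROCm support\n");
   }
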
.. _sm-rocm-options-label:

/////////////////////////////////////////////////////////////////////////

Running single node jobs with ROCm support
------------------------------------------

The user has multiple options for running an Open MPI job with GPU
support in a single-node scenario:

* The default shared memory component ``btl/sm`` has support for
  accelerators, but by default it uses a bounce buffer on the CPU for
  data transfers. Hence, while this works, it will not be able to
  take advantage of the high-speed GPU-to-GPU Infinity Fabric
  interconnect (if available).

* To use the high-speed GPU-to-GPU interconnect within a node, the
  user has to enable the accelerator single-copy component
  (``smsc/accelerator``), e.g.:

  .. code-block:: sh

     # Enable the smsc/accelerator component
     shell$ mpirun --mca smsc_accelerator_priority 80 -n 64 ./<my_executable>

* Alternatively, the user can replace the default shared memory
  component ``btl/sm`` with the ``btl/smcuda`` component, which has
  been extended to support ROCm devices. While this approach supports
  communication over a high-speed GPU-to-GPU interconnect, it does not
  support single-copy data transfers for host memory through
  e.g. ``xpmem`` or ``cma``. Hence, the performance of host-memory
  based data transfers might be lower than with the default ``btl/sm``
  component. Example:

  .. code-block:: sh

     # Use btl/smcuda instead of btl/sm for communication
     shell$ mpirun --mca btl smcuda,tcp,self -n 64 ./<my_executable>

/////////////////////////////////////////////////////////////////////////

ROCm support in Open MPI with UCX
---------------------------------

In this configuration, UCX provides the ROCm support, and hence it is
important to ensure that UCX itself is built with ROCm support. Both
inter- and intra-node communication will be handled by UCX.

To see if your UCX library was built with ROCm support, run the
following command:

.. code-block:: sh

   # Check if UCX was built with ROCm support
   shell$ ucx_info -v

   # configured with: --with-rocm=/opt/rocm --enable-mt

If you need to build the UCX library yourself to include ROCm support,
please see the UCX documentation for `building UCX with Open MPI
<https://openucx.readthedocs.io/en/master/running.html#openmpi-with-ucx>`_.

It should look something like:

.. code-block:: sh

   # Configure UCX with ROCm support
   shell$ cd ucx
   shell$ ./configure --prefix=/path/to/ucx-rocm-install \
          --with-rocm=/opt/rocm

   # Configure Open MPI with UCX and ROCm support
   shell$ cd ompi
   shell$ ./configure --with-rocm=/opt/rocm \
          --with-ucx=/path/to/ucx-rocm-install \
          <other configure params>

/////////////////////////////////////////////////////////////////////////

Using ROCm-aware UCX with Open MPI
----------------------------------

If UCX and Open MPI have been configured with ROCm support, specifying
the UCX pml component is sufficient to take advantage of the ROCm
support in the libraries. For example, the command to execute the
``osu_latency`` benchmark from the `OSU benchmarks
<https://mvapich.cse.ohio-state.edu/benchmarks>`_ with ROCm buffers
using Open MPI and UCX ROCm support looks something like this:

.. code-block:: sh

   shell$ mpirun -n 2 --mca pml ucx \
          ./osu_latency D D

.. note:: Some additional configure flags are required to compile the
          OSU benchmarks to support ROCm buffers. Please refer to the
          `UCX ROCm instructions
          <https://github.com/openucx/ucx/wiki/Build-and-run-ROCM-UCX-OpenMPI>`_
          for details.

/////////////////////////////////////////////////////////////////////////

ROCm support in Open MPI with libfabric
---------------------------------------

Some network interconnects are supported through the libfabric
library. Configuring libfabric and Open MPI with ROCm support looks
something like:

.. code-block:: sh

   # Configure libfabric with ROCm support
   shell$ cd libfabric
   shell$ ./configure --prefix=/path/to/ofi-rocm-install \
          --with-rocr=/opt/rocm

   # Configure Open MPI with libfabric and ROCm support
   shell$ cd ompi
   shell$ ./configure --with-rocm=/opt/rocm \
          --with-ofi=/path/to/ofi-rocm-install \
          <other configure params>

/////////////////////////////////////////////////////////////////////////


Using ROCm-aware libfabric with Open MPI
----------------------------------------

There are two mechanisms for using libfabric and Open MPI with ROCm
support:

* Specifying the ``mtl/ofi`` component is sufficient to take advantage
  of the ROCm support in the libraries. In this case, both intra- and
  inter-node communication will be performed by the libfabric library.
  To ensure that the application uses the shared memory provider for
  intra-node communication and the network interconnect specific
  provider for inter-node communication, the user might have to
  request the ``linkX`` provider, e.g.:

  .. code-block:: sh

     # Force using the ofi mtl component
     mpirun --mca pml cm --mca mtl ofi \
            --mca opal_common_ofi_provider_include "shm+cxi:lnx" \
            -n 64 ./<my_executable>

* Alternatively, the user can use the ``btl/ofi`` component, in which
  case the intra-node communication will use the Open MPI shared
  memory mechanisms (see :ref:`Running single node jobs with ROCm
  support <sm-rocm-options-label>`), and libfabric will be used only
  for inter-node communication.

  .. code-block:: sh

     # Use the ofi btl for inter-node and the sm btl
     # for intra-node communication
     mpirun --mca pml ob1 --mca btl ofi,sm,tcp,self \
            --mca smsc_accelerator_priority 80 \
            -n 64 ./<my_executable>

/////////////////////////////////////////////////////////////////////////

Collective component supporting ROCm device memory
--------------------------------------------------

The ``coll/accelerator`` component supports many commonly used
collective operations on ROCm device buffers. The component works by
copying the data into a temporary host buffer, executing the
collective operation on the host buffer, and copying the result back
to the device buffer at completion. This component provides adequate
performance for short to medium data sizes, but performance is often
suboptimal, especially for large reduction operations.

The `UCC <https://github.com/openucx/ucc>`_ based collective component
in Open MPI can be configured and compiled to include ROCm support,
and will typically lead to significantly better performance for large
reductions.

An example of configuring UCC and Open MPI with ROCm support is shown
below:

.. code-block:: sh

   # Configure and compile UCC with ROCm support
   shell$ cd ucc
   shell$ ./configure --with-rocm=/opt/rocm \
          --with-ucx=/path/to/ucx-rocm-install \
          --prefix=/path/to/ucc-rocm-install
   shell$ make -j && make install

   # Configure and compile Open MPI with UCX, UCC, and ROCm support
   shell$ cd ompi
   shell$ ./configure --with-rocm=/opt/rocm \
          --with-ucx=/path/to/ucx-rocm-install \
          --with-ucc=/path/to/ucc-rocm-install

Using the UCC component in an application requires setting some
additional parameters:

.. code-block:: sh

   shell$ mpirun --mca pml ucx --mca osc ucx \
          --mca coll_ucc_enable 1 \
          --mca coll_ucc_priority 100 -np 64 ./my_mpi_app

.. note:: Using the UCC library for collective operations in Open MPI
          requires using the UCX library, and hence cannot be used
          when running, e.g., with libfabric.
