
Commit ee19d95

Pass over the tuning section

Update the MPI I/O and network sections. Create a new directory for accelerator-related content.

Signed-off-by: Edgar Gabriel <[email protected]>
1 parent 7c5a405 commit ee19d95

File tree

11 files changed: +409 -155 lines

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
Accelerator support
===================

Open MPI supports a variety of accelerator vendor ecosystems. This
section provides some generic guidance on tuning MPI applications
that use device memory, as well as vendor-specific options.


.. toctree::
   :maxdepth: 1

   initialize
   memkind
   cuda
   rocm
Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
Selecting an Accelerator Device before calling MPI_Init
========================================================

A common problem when using accelerators arises when selecting which
GPU should be used by an MPI process. The decision is often based on
the rank of that process in ``MPI_COMM_WORLD``. The rank of a process
can, however, only be retrieved after the MPI library has been
initialized. On the other hand, the accelerator resources initialized
during ``MPI_Init`` can have some association with the `current`
device, which will be the default device used by a particular
ecosystem if not set to a different value.

To circumvent this circular problem, applications are encouraged to
make use of the environment variable ``OMPI_COMM_WORLD_LOCAL_RANK``,
which is set by Open MPI at launch time and can be retrieved before
``MPI_Init``. A code sample using the HIP programming model looks as
follows:

.. code-block:: c++

   int num_devices;
   hipGetDeviceCount(&num_devices);
   assert(num_devices > 0);

   // Select the device based on the node-local rank of this process
   char* ompi_local_rank = getenv("OMPI_COMM_WORLD_LOCAL_RANK");
   if (nullptr != ompi_local_rank) {
       hipSetDevice(atoi(ompi_local_rank) % num_devices);
   }

   MPI_Init(&argc, &argv);
   ...

.. note:: Open MPI currently assumes that an MPI process uses a
          single accelerator device. Certain software stacks might be
          able to support multiple GPUs per rank.
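
For comparison, the equivalent device selection with the CUDA runtime
API might look as follows (a sketch under the same assumptions, not
part of the original example):

.. code-block:: c

   int num_devices;
   cudaGetDeviceCount(&num_devices);
   assert(num_devices > 0);

   /* Select the device based on the node-local rank of this process */
   char* ompi_local_rank = getenv("OMPI_COMM_WORLD_LOCAL_RANK");
   if (NULL != ompi_local_rank) {
       cudaSetDevice(atoi(ompi_local_rank) % num_devices);
   }

   MPI_Init(&argc, &argv);
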
Lines changed: 59 additions & 0 deletions
@@ -0,0 +1,59 @@
Support for Memory-kind Info Objects
====================================

`MPI version 4.1 <https://www.mpi-forum.org/docs/mpi-4.1/mpi41-report.pdf>`_ introduced the notion of memory kinds, which allows an
application to specify which memory types it plans to use, and to
query which memory types are supported by the MPI library, in a
portable manner. In addition, the application can place restrictions
on certain objects, such as creating one communicator to be used with
host memory only and another communicator to be used with device
memory only. This approach allows the MPI library to perform certain
optimizations, such as skipping the memory-type check of buffer
pointers. Please refer to the MPI specification as well as the
`Memory Allocation Kinds Side Document <https://www.mpi-forum.org/docs/sidedocs/mem-alloc10.pdf>`_ for more details and examples.

Starting from version 6.0.0, Open MPI supports the following values for the memory allocation kind Info object:

* mpi
* system
* cuda:device
* cuda:host
* cuda:managed
* level_zero:host
* level_zero:device
* level_zero:shared
* rocm:device
* rocm:host
* rocm:managed

.. note:: Support for accelerator memory-kind info objects will depend
          on the accelerator support compiled into Open MPI.

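The kinds provided by the library can also be queried at runtime
through the info object attached to a communicator. A minimal sketch
is shown below; it uses the reserved ``mpi_memory_alloc_kinds`` info
key defined by MPI 4.1, and the fixed-size result buffer is a
simplifying assumption:

.. code-block:: c

   /* Query the memory allocation kinds reported by the library */
   MPI_Info info_used;
   char kinds[256];        /* assumes the value fits into 256 bytes */
   int buflen = sizeof(kinds);
   int flag   = 0;

   MPI_Comm_get_info(MPI_COMM_WORLD, &info_used);
   MPI_Info_get_string(info_used, "mpi_memory_alloc_kinds",
                       &buflen, kinds, &flag);
   if (flag) {
       printf("Supported memory allocation kinds: %s\n", kinds);
   }
   MPI_Info_free(&info_used);
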
Passing memory-kind info to mpiexec
-----------------------------------

The following example demonstrates how to pass memory-allocation kind
information to Open MPI at application launch:

.. code-block:: sh

   # Specify that the application will use system, mpi, and CUDA device memory
   mpiexec --memory-allocation-kinds system,mpi,cuda:device -n 64 ./<my_executable>

Asserting usage of a memory kind when creating a communicator
--------------------------------------------------------------

The following code snippet demonstrates how to assert that a
communicator will only be used with ROCm device buffers:

.. code-block:: c

   MPI_Info info_assert;
   MPI_Info_create(&info_assert);
   char assert_key[]   = "mpi_assert_memory_alloc_kinds";
   char assert_value[] = "rocm:device";
   MPI_Info_set(info_assert, assert_key, assert_value);

   MPI_Comm comm_dup;
   MPI_Comm_dup_with_info(MPI_COMM_WORLD, info_assert, &comm_dup);
   MPI_Info_free(&info_assert);   /* the info object is no longer needed */
Lines changed: 268 additions & 0 deletions
@@ -0,0 +1,268 @@
ROCm
====

ROCm is the name of the software stack used by AMD GPUs. It includes
the ROCm Runtime (ROCr), the HIP programming model, and numerous
numerical and machine learning libraries tuned for AMD Instinct and
Radeon accelerators. More information can be found on the
`AMD ROCm documentation pages <https://rocm.docs.amd.com/en/latest/>`_.

Building Open MPI with ROCm support
-----------------------------------

ROCm-aware support means that the MPI library can send and receive
data from AMD GPU device buffers directly. Starting with Open MPI
v6.0.0, ROCm support is available directly within Open MPI for
single-node scenarios, and through UCX or libfabric for multi-node
scenarios.

Compiling Open MPI with ROCm support
------------------------------------

Compiling Open MPI with ROCm support requires setting the
``--with-rocm=<rocm-path>`` option at configure time:

.. code-block:: sh

   # Configure Open MPI with ROCm support
   shell$ cd ompi
   shell$ ./configure --with-rocm=/opt/rocm \
          <other configure params>


/////////////////////////////////////////////////////////////////////////

Checking that Open MPI has been built with ROCm support
---------------------------------------------------------

Verify that Open MPI has been built with ROCm support using the
:ref:`ompi_info(1) <man1-ompi_info>` command:

.. code-block:: sh

   # Use ompi_info to verify ROCm support in Open MPI
   shell$ ./ompi_info | grep "MPI extensions"
          MPI extensions: affinity, cuda, ftmpi, rocm

/////////////////////////////////////////////////////////////////////////

Runtime querying of ROCm support in Open MPI
--------------------------------------------

Querying the availability of ROCm support in Open MPI at runtime is
possible through the memory-kind info object; see the :ref:`memory-kind`
page for details.

In addition, starting with Open MPI v5.0.0, :ref:`MPIX_Query_rocm_support(3)
<mpix_query_rocm_support>` is available as an extension to check
the availability of ROCm support in the library. To use the
function, the code needs to include ``mpi-ext.h``. Note that
``mpi-ext.h`` is an Open MPI specific header file.

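A minimal sketch of such a runtime check is shown below; it assumes
that the ROCm extension is part of the installation, so that
``mpi-ext.h`` provides ``MPIX_Query_rocm_support``:

.. code-block:: c

   #include <stdio.h>
   #include "mpi.h"
   #include "mpi-ext.h"   /* Open MPI specific header providing the extensions */

   int main(int argc, char *argv[])
   {
       MPI_Init(&argc, &argv);

       /* A non-zero return value indicates that ROCm support is available */
       if (MPIX_Query_rocm_support()) {
           printf("This Open MPI installation has ROCm support\n");
       } else {
           printf("This Open MPI installation has no ROCm support\n");
       }

       MPI_Finalize();
       return 0;
   }
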

.. _sm-rocm-options-label:

/////////////////////////////////////////////////////////////////////////

Running single node jobs with ROCm support
------------------------------------------

The user has multiple options for running an Open MPI job with GPU
support in a single-node scenario:

* The default shared memory component ``btl/sm`` has support for
  accelerators; by default, however, it uses a bounce buffer on the
  CPU for data transfers. Hence, while this works, it will not be
  able to take advantage of a high-speed GPU-to-GPU Infinity Fabric (TM)
  interconnect (if available).

* To use the high-speed GPU-to-GPU interconnect within a node, the
  user has to enable the accelerator single-copy component
  (``smsc/accelerator``), e.g.:

  .. code-block:: sh

     # Enable the smsc/accelerator component
     mpirun --mca smsc_accelerator_priority 80 -n 64 ./<my_executable>

* Alternatively, the user can replace the default shared memory
  component ``btl/sm`` with the ``btl/smcuda`` component, which has
  been extended to support ROCm devices. While this approach supports
  communication over a high-speed GPU-to-GPU interconnect, it does not
  support single-copy data transfers for host memory through
  e.g. ``xpmem`` or ``cma``. Hence, the performance of host-memory
  based data transfers might be lower than with the default ``btl/sm``
  component. Example:

  .. code-block:: sh

     # Use btl/smcuda instead of btl/sm for communication
     mpirun --mca btl smcuda,tcp,self -n 64 ./<my_executable>

/////////////////////////////////////////////////////////////////////////

ROCm support in Open MPI with UCX
---------------------------------

In this configuration, UCX provides the ROCm support, and hence it
is important to ensure that UCX itself is built with ROCm support.
Both inter- and intra-node communication will be executed through UCX.

To see whether your UCX library was built with ROCm support, run the
following command:

.. code-block:: sh

   # Check if UCX was built with ROCm support
   shell$ ucx_info -v

   # configured with: --with-rocm=/opt/rocm --enable-mt

If you need to build the UCX library yourself to include ROCm support,
please see the UCX documentation for `building UCX with Open MPI
<https://openucx.readthedocs.io/en/master/running.html#openmpi-with-ucx>`_.

It should look something like this:

.. code-block:: sh

   # Configure UCX with ROCm support
   shell$ cd ucx
   shell$ ./configure --prefix=/path/to/ucx-rocm-install \
          --with-rocm=/opt/rocm

   # Configure Open MPI with UCX and ROCm support
   shell$ cd ompi
   shell$ ./configure --with-rocm=/opt/rocm \
          --with-ucx=/path/to/ucx-rocm-install \
          <other configure params>

/////////////////////////////////////////////////////////////////////////

Using ROCm-aware UCX with Open MPI
----------------------------------

If UCX and Open MPI have been configured with ROCm support, specifying
the UCX pml component is sufficient to take advantage of the ROCm
support in the libraries. For example, the command to execute the
``osu_latency`` benchmark from the `OSU benchmarks
<https://mvapich.cse.ohio-state.edu/benchmarks>`_ with ROCm buffers
using Open MPI and UCX ROCm support looks something like this:

.. code-block:: sh

   shell$ mpirun -n 2 --mca pml ucx \
          ./osu_latency D D

.. note:: Some additional configure flags are required to compile the
          OSU benchmarks with support for ROCm buffers. Please refer
          to the `UCX ROCm instructions
          <https://github.com/openucx/ucx/wiki/Build-and-run-ROCM-UCX-OpenMPI>`_
          for details.

/////////////////////////////////////////////////////////////////////////

ROCm support in Open MPI with libfabric
---------------------------------------

Some network interconnects are supported through the libfabric
library. Configuring libfabric and Open MPI with ROCm support looks
something like this:

.. code-block:: sh

   # Configure libfabric with ROCm support
   shell$ cd libfabric
   shell$ ./configure --prefix=/path/to/ofi-rocm-install \
          --with-rocr=/opt/rocm

   # Configure Open MPI with libfabric and ROCm support
   shell$ cd ompi
   shell$ ./configure --with-rocm=/opt/rocm \
          --with-ofi=/path/to/ofi-rocm-install \
          <other configure params>

/////////////////////////////////////////////////////////////////////////

Using ROCm-aware libfabric with Open MPI
----------------------------------------

There are two mechanisms for using libfabric and Open MPI with ROCm support.

* Specifying the ``mtl/ofi`` component is sufficient to take advantage
  of the ROCm support in the libraries. In this case, both intra- and
  inter-node communication will be performed by libfabric. In order to
  ensure that the application will make use of the shared memory
  provider for intra-node communication and the network interconnect
  specific provider for inter-node communication, the user might have
  to request using the ``linkX`` provider, e.g.:

  .. code-block:: sh

     # Force using the ofi mtl component
     mpirun --mca pml cm --mca mtl ofi \
            --mca opal_common_ofi_provider_include "shm+cxi:lnx" \
            -n 64 ./<my_executable>

* Alternatively, the user can use the ``btl/ofi`` component, in which
  case the intra-node communication will use the Open MPI shared
  memory mechanisms (see :ref:`sm-rocm-options-label`), and libfabric
  will be used only for inter-node communication.

  .. code-block:: sh

     # Force using the ofi btl component
     mpirun --mca pml ob1 --mca btl ofi,sm,tcp,self \
            --mca smsc_accelerator_priority 80 \
            -n 64 ./<my_executable>

/////////////////////////////////////////////////////////////////////////

Collective components supporting ROCm device memory
-----------------------------------------------------

The ``coll/accelerator`` component supports many commonly used
collective operations on ROCm device buffers. The component works by
copying data into a temporary host buffer, executing the collective
operation on the host buffer, and copying the data back to the device
buffer at completion. This component leads to adequate performance
for short to medium data sizes, but performance is often suboptimal,
especially for large reduction operations.

The `UCC <https://github.com/openucx/ucc>`_ based collective component
in Open MPI can be configured and compiled to include ROCm support,
and will typically lead to significantly better performance for large
reductions.

An example for configuring UCC and Open MPI with ROCm support is shown below:

.. code-block:: sh

   # Configure and compile UCC with ROCm support
   shell$ cd ucc
   shell$ ./configure --with-rocm=/opt/rocm \
          --with-ucx=/path/to/ucx-rocm-install \
          --prefix=/path/to/ucc-rocm-install
   shell$ make -j && make install

   # Configure and compile Open MPI with UCX, UCC, and ROCm support
   shell$ cd ompi
   shell$ ./configure --with-rocm=/opt/rocm \
          --with-ucx=/path/to/ucx-rocm-install \
          --with-ucc=/path/to/ucc-rocm-install

Using the UCC component in an application requires setting some
additional parameters:

.. code-block:: sh

   shell$ mpirun --mca pml ucx --mca osc ucx \
          --mca coll_ucc_enable 1 \
          --mca coll_ucc_priority 100 -np 64 ./my_mpi_app

.. note:: Using the UCC library for collective operations in Open MPI
          requires using the UCX library, and hence cannot be deployed
          e.g. when using libfabric.

docs/tuning-apps/coll-tuned.rst

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@ Tuning Collectives
 
 Open MPI's ``coll`` framework provides a number of components implementing
 collective communication, including: ``han``, ``libnbc``, ``self``, ``ucc`` ``base``,
-``sync``, ``xhc``, ``accelerator``, ``basic``, ``ftagree``, ``inter``, ``portals4``,
+``sync``, ``xhc``, ``accelerator``, ``basic``, ``ftagree``, ``inter``, ``portals4``, ``acoll``,
 and ``tuned``. Some of these components may not be available depending on how
 Open MPI was compiled and what hardware is available on the system. A run-time
 decision based on each component's self reported priority, selects which
