ROCm
====

ROCm is the name of the software stack used by AMD GPUs. It includes
the ROCm Runtime (ROCr), the HIP programming model, and numerous
numerical and machine learning libraries tuned for the AMD Instinct and
Radeon accelerators. More information can be found at the
`AMD ROCm webpages <https://rocm.docs.amd.com/en/latest/>`_.


Building Open MPI with ROCm support
-----------------------------------

ROCm-aware support means that the MPI library can send and receive
data from AMD GPU device buffers directly. Starting with Open MPI
v6.0.0, ROCm support is available directly within Open MPI for
single-node scenarios, and through UCX or libfabric for multi-node
scenarios.
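
As an illustration only (a minimal sketch, not taken from the Open MPI
documentation or test suite), the following program passes a HIP device
allocation directly to ``MPI_Send``/``MPI_Recv``. With a ROCm-aware
build, no explicit staging through host memory is required in the
application code. The sketch assumes a HIP installation and that the
program is linked against the HIP runtime.

.. code-block:: c

   /* Sketch: sending and receiving directly from ROCm device memory.
    * Run with at least two MPI processes. */
   #include <mpi.h>
   #include <hip/hip_runtime.h>

   int main(int argc, char *argv[])
   {
       int rank;
       double *dev_buf;
       const int count = 1024;

       MPI_Init(&argc, &argv);
       MPI_Comm_rank(MPI_COMM_WORLD, &rank);

       /* Allocate the message buffer in GPU device memory. */
       hipMalloc((void **)&dev_buf, count * sizeof(double));
       hipMemset(dev_buf, 0, count * sizeof(double));

       /* The device pointer is passed to MPI directly. */
       if (0 == rank) {
           MPI_Send(dev_buf, count, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
       } else if (1 == rank) {
           MPI_Recv(dev_buf, count, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                    MPI_STATUS_IGNORE);
       }

       hipFree(dev_buf);
       MPI_Finalize();
       return 0;
   }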


Compiling Open MPI with ROCm support
------------------------------------

Compiling Open MPI with ROCm support requires setting the
``--with-rocm=<rocm-path>`` option at configure time:

.. code-block:: sh

   # Configure Open MPI with ROCm support
   shell$ cd ompi
   shell$ ./configure --with-rocm=/opt/rocm \
          <other configure params>


/////////////////////////////////////////////////////////////////////////

Checking that Open MPI has been built with ROCm support
-------------------------------------------------------

Verify that Open MPI has been built with ROCm using the
:ref:`ompi_info(1) <man1-ompi_info>` command:

.. code-block:: sh

   # Use ompi_info to verify ROCm support in Open MPI
   shell$ ./ompi_info | grep "MPI extensions"
          MPI extensions: affinity, cuda, ftmpi, rocm

/////////////////////////////////////////////////////////////////////////

Runtime querying of ROCm support in Open MPI
--------------------------------------------

Querying the availability of ROCm support in Open MPI at runtime is
possible through the memory allocation kind info object; see the
:ref:`memkind` page for details.
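
As a hedged illustration (not taken from the memkind page; the exact
kind strings reported depend on the MPI version and on how Open MPI was
built), the supported memory allocation kinds can be read from the info
object attached to a communicator via the ``mpi_memory_alloc_kinds``
key:

.. code-block:: c

   /* Sketch: print the memory allocation kinds supported by the library.
    * A ROCm-aware build is expected to report a ROCm-related kind. */
   #include <stdio.h>
   #include <mpi.h>

   int main(int argc, char *argv[])
   {
       MPI_Info info;
       char kinds[1024];
       int buflen = sizeof(kinds);
       int flag = 0;

       MPI_Init(&argc, &argv);
       MPI_Comm_get_info(MPI_COMM_WORLD, &info);
       MPI_Info_get_string(info, "mpi_memory_alloc_kinds", &buflen,
                           kinds, &flag);
       if (flag) {
           printf("Supported memory allocation kinds: %s\n", kinds);
       }
       MPI_Info_free(&info);
       MPI_Finalize();
       return 0;
   }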

In addition, starting with Open MPI v5.0.0, :ref:`MPIX_Query_rocm_support(3)
<mpix_query_rocm_support>` is available as an extension to check
the availability of ROCm support in the library. To use the
function, the code needs to include ``mpi-ext.h``. Note that
``mpi-ext.h`` is an Open MPI specific header file.
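
A minimal sketch of such a check is shown below (illustrative only, and
assuming the application is compiled with Open MPI's ``mpicc`` so that
``mpi-ext.h`` is found):

.. code-block:: c

   /* Sketch: runtime query of ROCm support through the Open MPI extension. */
   #include <stdio.h>
   #include <mpi.h>
   #include <mpi-ext.h>   /* Open MPI specific header */

   int main(int argc, char *argv[])
   {
       MPI_Init(&argc, &argv);
       if (MPIX_Query_rocm_support()) {
           printf("This Open MPI installation has ROCm support\n");
       } else {
           printf("This Open MPI installation does not have ROCm support\n");
       }
       MPI_Finalize();
       return 0;
   }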


/////////////////////////////////////////////////////////////////////////

.. _sm-rocm-options-label:

Running single node jobs with ROCm support
------------------------------------------

The user has multiple options for running an Open MPI job with GPU
support in a single node scenario:

* The default shared memory component ``btl/sm`` has support for
  accelerators, but by default it uses a bounce buffer on the CPU for
  data transfers. Hence, while this works, it cannot take advantage of
  the high-speed GPU-to-GPU Infinity Fabric interconnect (if
  available).

* To use the high-speed GPU-to-GPU interconnect within a node, the user has to
  enable the accelerator single-copy component (``smsc/accelerator``), e.g.:

.. code-block:: sh

   # Enable the smsc/accelerator component
   shell$ mpirun --mca smsc_accelerator_priority 80 -n 64 ./<my_executable>

* Alternatively, the user can replace the default shared memory
  component ``btl/sm`` with the ``btl/smcuda`` component, which has
  been extended to support ROCm devices. While this approach supports
  communication over a high-speed GPU-to-GPU interconnect, it does not
  support single-copy data transfers for host memory through,
  e.g., ``xpmem`` or ``cma``. Hence, the performance of host-memory
  based data transfers might be lower than with the default ``btl/sm``
  component. Example:

.. code-block:: sh

   # Use btl/smcuda instead of btl/sm for communication
   shell$ mpirun --mca btl smcuda,tcp,self -n 64 ./<my_executable>

/////////////////////////////////////////////////////////////////////////

ROCm support in Open MPI with UCX
---------------------------------

In this configuration, UCX will provide the ROCm support, and hence it
is important to ensure that UCX itself is built with ROCm support. Both
inter- and intra-node communication will be executed through UCX.

To see if your UCX library was built with ROCm support, run the
following command:

.. code-block:: sh

   # Check if ucx was built with ROCm support
   shell$ ucx_info -v

   # configured with: --with-rocm=/opt/rocm --enable-mt

If you need to build the UCX library yourself to include ROCm support,
please see the UCX documentation on `building UCX with Open MPI
<https://openucx.readthedocs.io/en/master/running.html#openmpi-with-ucx>`_.

The configuration should look something like this:

.. code-block:: sh

   # Configure UCX with ROCm support
   shell$ cd ucx
   shell$ ./configure --prefix=/path/to/ucx-rocm-install \
          --with-rocm=/opt/rocm

   # Configure Open MPI with UCX and ROCm support
   shell$ cd ompi
   shell$ ./configure --with-rocm=/opt/rocm \
          --with-ucx=/path/to/ucx-rocm-install \
          <other configure params>

/////////////////////////////////////////////////////////////////////////

Using ROCm-aware UCX with Open MPI
----------------------------------

If UCX and Open MPI have been configured with ROCm support, specifying
the UCX pml component is sufficient to take advantage of the ROCm
support in the libraries. For example, the command to execute the
``osu_latency`` benchmark from the `OSU benchmarks
<https://mvapich.cse.ohio-state.edu/benchmarks>`_ with ROCm buffers
using Open MPI and UCX ROCm support looks something like this:

.. code-block:: sh

   shell$ mpirun -n 2 --mca pml ucx \
          ./osu_latency D D

.. note:: Some additional configure flags are required to compile the
          OSU benchmarks with support for ROCm buffers. Please refer to
          the `UCX ROCm instructions
          <https://github.com/openucx/ucx/wiki/Build-and-run-ROCM-UCX-OpenMPI>`_
          for details.

/////////////////////////////////////////////////////////////////////////

ROCm support in Open MPI with libfabric
---------------------------------------

Some network interconnects are supported through the libfabric library.
Configuring libfabric and Open MPI with ROCm support looks something like:

.. code-block:: sh

   # Configure libfabric with ROCm support
   shell$ cd libfabric
   shell$ ./configure --prefix=/path/to/ofi-rocm-install \
          --with-rocr=/opt/rocm

   # Configure Open MPI with libfabric and ROCm support
   shell$ cd ompi
   shell$ ./configure --with-rocm=/opt/rocm \
          --with-ofi=/path/to/ofi-rocm-install \
          <other configure params>

/////////////////////////////////////////////////////////////////////////


Using ROCm-aware libfabric with Open MPI
----------------------------------------

There are two mechanisms for using libfabric and Open MPI with ROCm support.

* Specifying the ``mtl/ofi`` component is sufficient to take advantage
  of the ROCm support in the libraries. In this case, both intra- and
  inter-node communication will be performed by the libfabric library.
  In order to ensure that the application uses the shared memory
  provider for intra-node communication and the network-interconnect
  specific provider for inter-node communication, the user might have
  to request the ``linkX`` provider, e.g.:

.. code-block:: sh

   # Force using the ofi mtl component
   mpirun --mca pml cm --mca mtl ofi \
          --mca opal_common_ofi_provider_include "shm+cxi:lnx" \
          -n 64 ./<my_executable>

* Alternatively, the user can use the ``btl/ofi`` component, in which
  case intra-node communication will use the Open MPI shared memory
  mechanisms (see :ref:`sm-rocm-options-label`), and libfabric will be
  used only for inter-node communication.

.. code-block:: sh

   # Use the ofi btl for inter-node and sm btl
   # for intra-node communication
   mpirun --mca pml ob1 --mca btl ofi,sm,tcp,self \
          --mca smsc_accelerator_priority 80 \
          -n 64 ./<my_executable>

/////////////////////////////////////////////////////////////////////////

Collective component supporting ROCm device memory
--------------------------------------------------

The ``coll/accelerator`` component supports many commonly used
collective operations on ROCm device buffers. The component works by
copying data into a temporary host buffer, executing the collective
operation on the host buffer, and copying the result back to the
device buffer at completion. This component provides adequate
performance for short to medium data sizes, but performance is often
suboptimal, especially for large reduction operations.
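
As an illustration (a hedged sketch, not taken from the Open MPI
documentation or test suite, and assuming a HIP installation), a
collective operation on ROCm device buffers looks exactly like its
host-memory counterpart; the staging through host memory happens
inside the library:

.. code-block:: c

   /* Sketch: MPI_Allreduce operating directly on ROCm device buffers. */
   #include <mpi.h>
   #include <hip/hip_runtime.h>

   int main(int argc, char *argv[])
   {
       const int count = 1024;
       double *dev_in, *dev_out;

       MPI_Init(&argc, &argv);

       hipMalloc((void **)&dev_in, count * sizeof(double));
       hipMalloc((void **)&dev_out, count * sizeof(double));
       hipMemset(dev_in, 0, count * sizeof(double));

       /* Both the send and the receive buffer reside in device memory. */
       MPI_Allreduce(dev_in, dev_out, count, MPI_DOUBLE, MPI_SUM,
                     MPI_COMM_WORLD);

       hipFree(dev_in);
       hipFree(dev_out);
       MPI_Finalize();
       return 0;
   }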

The `UCC <https://github.com/openucx/ucc>`_ based collective component
in Open MPI can be configured and compiled to include ROCm support,
and will typically lead to significantly better performance for large
reductions.

An example of configuring UCC and Open MPI with ROCm support is shown below:

.. code-block:: sh

   # Configure and compile UCC with ROCm support
   shell$ cd ucc
   shell$ ./configure --with-rocm=/opt/rocm \
          --with-ucx=/path/to/ucx-rocm-install \
          --prefix=/path/to/ucc-rocm-install
   shell$ make -j && make install

   # Configure and compile Open MPI with UCX, UCC, and ROCm support
   shell$ cd ompi
   shell$ ./configure --with-rocm=/opt/rocm \
          --with-ucx=/path/to/ucx-rocm-install \
          --with-ucc=/path/to/ucc-rocm-install

Using the UCC component in an application requires setting some
additional parameters:

.. code-block:: sh

   shell$ mpirun --mca pml ucx --mca osc ucx \
          --mca coll_ucc_enable 1 \
          --mca coll_ucc_priority 100 -np 64 ./my_mpi_app

.. note:: Using the UCC library for collective operations in Open MPI
          requires using the UCX library, and hence cannot be deployed
          when using, e.g., libfabric.