diff --git a/docs/tuning-apps/networking/cuda.rst b/docs/tuning-apps/accelerators/cuda.rst
similarity index 100%
rename from docs/tuning-apps/networking/cuda.rst
rename to docs/tuning-apps/accelerators/cuda.rst
diff --git a/docs/tuning-apps/accelerators/index.rst b/docs/tuning-apps/accelerators/index.rst
new file mode 100644
index 00000000000..c6b0ecfcd70
--- /dev/null
+++ b/docs/tuning-apps/accelerators/index.rst
@@ -0,0 +1,16 @@
+Accelerator support
+===================
+
+Open MPI supports a variety of different accelerator vendor
+ecosystems. This section provides some generic guidance on tuning MPI
+applications that use device memory, as well as vendor-specific
+options.
+
+
+.. toctree::
+   :maxdepth: 1
+
+   initialize
+   memkind
+   cuda
+   rocm
diff --git a/docs/tuning-apps/accelerators/initialize.rst b/docs/tuning-apps/accelerators/initialize.rst
new file mode 100644
index 00000000000..0bd147f4efb
--- /dev/null
+++ b/docs/tuning-apps/accelerators/initialize.rst
@@ -0,0 +1,39 @@
+Selecting an Accelerator Device before calling MPI_Init
+=======================================================
+
+A common problem when using accelerators arises when selecting which
+GPU should be used by an MPI process. The decision is often based on
+the rank of that process in ``MPI_COMM_WORLD``. However, the rank of a
+process can only be retrieved after the MPI library has been
+initialized. On the other hand, the accelerator resources initialized
+during ``MPI_Init`` can have some associations with the `current`
+device, which is the default device used by a particular ecosystem
+unless set to a different value.
+
+To circumvent this circular dependency, applications are encouraged to
+make use of the environment variable ``OMPI_COMM_WORLD_LOCAL_RANK``,
+which is set by Open MPI at launch time and can be retrieved before
+``MPI_Init``. An example using the HIP programming model looks as
+follows:
+
+.. code-block:: c
+
+    int num_devices;
+    hipGetDeviceCount(&num_devices);
+    assert (num_devices > 0);
+
+    /* Map the node-local rank onto the available devices */
+    char* ompi_local_rank = getenv("OMPI_COMM_WORLD_LOCAL_RANK");
+    if (NULL != ompi_local_rank) {
+        hipSetDevice(atoi(ompi_local_rank) % num_devices);
+    }
+
+    MPI_Init (&argc, &argv);
+    ...
+
+
+.. note:: Open MPI currently assumes that an MPI process uses a
+          single accelerator device. Certain software stacks might be
+          able to support multiple GPUs per rank.
+
+
diff --git a/docs/tuning-apps/accelerators/memkind.rst b/docs/tuning-apps/accelerators/memkind.rst
new file mode 100644
index 00000000000..7567d9f7e82
--- /dev/null
+++ b/docs/tuning-apps/accelerators/memkind.rst
@@ -0,0 +1,64 @@
+Support for Memory-kind Info Objects
+====================================
+
+`MPI version 4.1 `_
+introduced the notion of memory allocation kinds, which allow an
+application to specify what memory types it plans to use, and to query
+what memory types are supported by the MPI library in a portable
+manner. In addition, the application can place restrictions on certain
+objects, such as creating a separate communicator for use with host
+memory and a communicator that will be used with device memory
+only. This approach allows the MPI library to perform certain
+optimizations, such as bypassing memory-type checks on buffer
+pointers. Please refer to the MPI specification as well as the `Memory
+Allocation Kinds Side Document
+`_ for more
+details and examples.
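+
+As an illustration of the query mechanism described above, the
+following sketch reads the reserved ``mpi_memory_alloc_kinds`` info
+key (defined in the side document) from the info object attached to a
+communicator; the value returned by the library describes the memory
+allocation kinds it supports for that communicator:
+
+.. code:: c
+
+   MPI_Info info;
+   char value[256];           /* buffer size chosen arbitrarily */
+   int buflen = sizeof(value);
+   int flag   = 0;
+
+   /* Query the memory allocation kinds supported by the library */
+   MPI_Comm_get_info (MPI_COMM_WORLD, &info);
+   MPI_Info_get_string (info, "mpi_memory_alloc_kinds", &buflen,
+                        value, &flag);
+   if (flag) {
+       printf ("Supported memory allocation kinds: %s\n", value);
+   }
+   MPI_Info_free (&info);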
+
+Open MPI starting from version 6.0.0 supports the following values for the memory allocation kind Info object:
+
+* mpi
+* system
+* cuda:device
+* cuda:host
+* cuda:managed
+* level_zero:device
+* level_zero:host
+* level_zero:shared
+* rocm:device
+* rocm:host
+* rocm:managed
+
+.. note:: Support for accelerator memory allocation kind info objects
+          will depend on the accelerator support compiled into Open
+          MPI.
+
+
+Passing memory-kind info to mpiexec
+===================================
+
+The following example demonstrates how to pass memory allocation kind
+information to Open MPI at application launch:
+
+.. code:: sh
+
+   # Specify that the application will use system, MPI, and CUDA device memory
+   shell$ mpiexec --memory-allocation-kinds system,mpi,cuda:device -n 64 ./
+
+Asserting usage of memory kind when creating a communicator
+============================================================
+
+The following code snippet demonstrates how to assert that a
+communicator will only be used for ROCm device buffers:
+
+.. code:: c
+
+   MPI_Info info_assert;
+   MPI_Info_create (&info_assert);
+   char assert_key[] = "mpi_assert_memory_alloc_kinds";
+   char assert_value[] = "rocm:device";
+   MPI_Info_set (info_assert, assert_key, assert_value);
+
+   MPI_Comm comm_dup;
+   MPI_Comm_dup_with_info (MPI_COMM_WORLD, info_assert, &comm_dup);
+   ...
diff --git a/docs/tuning-apps/accelerators/rocm.rst b/docs/tuning-apps/accelerators/rocm.rst
new file mode 100644
index 00000000000..812734088d3
--- /dev/null
+++ b/docs/tuning-apps/accelerators/rocm.rst
@@ -0,0 +1,269 @@
+ROCm
+====
+
+ROCm is the name of the software stack used by AMD GPUs. It includes
+the ROCm Runtime (ROCr), the HIP programming model, and numerous
+numerical and machine learning libraries tuned for the AMD Instinct
+and Radeon accelerators. More information can be found at the
+following `AMD webpages `_.
+
+
+Building Open MPI with ROCm support
+-----------------------------------
+
+ROCm-aware support means that the MPI library can send and receive
+data from AMD GPU device buffers directly. Starting from Open MPI
+v6.0.0, ROCm support is available directly within Open MPI for
+single-node scenarios, and through UCX or libfabric for multi-node
+scenarios.
+
+
+Compiling Open MPI with ROCm support
+------------------------------------
+
+Compiling Open MPI with ROCm support requires setting the
+``--with-rocm=`` option at configure time:
+
+.. code-block:: sh
+
+   # Configure Open MPI with ROCm support
+   shell$ cd ompi
+   shell$ ./configure --with-rocm=/opt/rocm \
+
+
+
+/////////////////////////////////////////////////////////////////////////
+
+Checking that Open MPI has been built with ROCm support
+-------------------------------------------------------
+
+Verify that Open MPI has been built with ROCm using the
+:ref:`ompi_info(1) ` command:
+
+.. code-block:: sh
+
+   # Use ompi_info to verify ROCm support in Open MPI
+   shell$ ./ompi_info | grep "MPI extensions"
+          MPI extensions: affinity, cuda, ftmpi, rocm
+
+/////////////////////////////////////////////////////////////////////////
+
+Runtime querying of ROCm support in Open MPI
+--------------------------------------------
+
+Querying the availability of ROCm support in Open MPI at runtime is
+possible through the memory allocation kind info object; see the
+:ref:`memkind` page for details.
+
+In addition, starting with Open MPI v5.0.0 :ref:`MPIX_Query_rocm_support(3)
+` is available as an extension to check
+the availability of ROCm support in the library. To use the
+function, the code needs to include ``mpi-ext.h``. Note that
+``mpi-ext.h`` is an Open MPI-specific header file.
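+
+A minimal sketch of such a runtime check is shown below; the
+``OMPI_HAVE_MPI_EXT_ROCM`` guard follows Open MPI's usual
+``OMPI_HAVE_MPI_EXT_<name>`` convention for extensions and should be
+verified against the ``mpi-ext.h`` shipped with your installation:
+
+.. code-block:: c
+
+   #include <stdio.h>
+   #include <mpi.h>
+   #include <mpi-ext.h>   /* Open MPI specific header */
+
+   int rocm_aware = 0;
+   #if defined(OMPI_HAVE_MPI_EXT_ROCM)
+       /* Returns a true (non-zero) value if ROCm support is available */
+       rocm_aware = MPIX_Query_rocm_support();
+   #endif
+   printf ("ROCm-aware support available: %s\n", rocm_aware ? "yes" : "no");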
+
+
+.. _sm-rocm-options-label:
+
+/////////////////////////////////////////////////////////////////////////
+
+Running single node jobs with ROCm support
+------------------------------------------
+
+The user has multiple options for running an Open MPI job with GPU
+support in a single-node scenario:
+
+* The default shared memory component ``btl/sm`` has support for
+  accelerators, but will by default use a bounce buffer on the CPU
+  for data transfers. Hence, while this works, it will not be able to
+  take advantage of the high-speed GPU-to-GPU Infinity Fabric
+  interconnect (if available).
+
+* To use the high-speed GPU-to-GPU interconnect within a node, the
+  user has to enable the accelerator single-copy component
+  (``smsc/accelerator``), e.g.:
+
+.. code-block:: sh
+
+   # Enable the smsc/accelerator component
+   shell$ mpirun --mca smsc_accelerator_priority 80 -n 64 ./
+
+* Alternatively, the user can replace the default shared memory
+  component ``btl/sm`` with the ``btl/smcuda`` component, which has
+  been extended to support ROCm devices. While this approach supports
+  communication over a high-speed GPU-to-GPU interconnect, it does not
+  support single-copy data transfers for host memory through
+  e.g. ``xpmem`` or ``cma``. Hence, the performance of host-memory
+  based data transfers might be lower than with the default ``btl/sm``
+  component. Example:
+
+.. code-block:: sh
+
+   # Use btl/smcuda instead of btl/sm for communication
+   shell$ mpirun --mca btl smcuda,tcp,self -n 64 ./
+
+/////////////////////////////////////////////////////////////////////////
+
+ROCm support in Open MPI with UCX
+---------------------------------
+
+In this configuration, UCX will provide the ROCm support, and hence it
+is important to ensure that UCX itself is built with ROCm support. Both
+inter- and intra-node communication will be executed through UCX.
+
+To see if your UCX library was built with ROCm support, run the
+following command:
+
+.. code-block:: sh
+
+   # Check if ucx was built with ROCm support
+   shell$ ucx_info -v
+
+   # configured with: --with-rocm=/opt/rocm --enable-mt
+
+If you need to build the UCX library yourself to include ROCm support,
+please see the UCX documentation for `building UCX with Open MPI:
+`_
+
+It should look something like:
+
+.. code-block:: sh
+
+   # Configure UCX with ROCm support
+   shell$ cd ucx
+   shell$ ./configure --prefix=/path/to/ucx-rocm-install \
+       --with-rocm=/opt/rocm
+
+   # Configure Open MPI with UCX and ROCm support
+   shell$ cd ompi
+   shell$ ./configure --with-rocm=/opt/rocm \
+       --with-ucx=/path/to/ucx-rocm-install \
+
+
+/////////////////////////////////////////////////////////////////////////
+
+Using ROCm-aware UCX with Open MPI
+----------------------------------
+
+If UCX and Open MPI have been configured with ROCm support, specifying
+the UCX pml component is sufficient to take advantage of the ROCm
+support in the libraries. For example, the command to execute the
+``osu_latency`` benchmark from the `OSU benchmarks
+`_ with ROCm buffers
+using Open MPI and UCX ROCm support is something like this:
+
+.. code-block:: sh
+
+   shell$ mpirun -n 2 --mca pml ucx \
+       ./osu_latency D D
+
+.. note:: Some additional configure flags are required to compile the
+          OSU benchmark to support ROCm buffers. Please refer to the
+          `UCX ROCm instructions
+          `_
+          for details.
+
+/////////////////////////////////////////////////////////////////////////
+
+ROCm support in Open MPI with libfabric
+---------------------------------------
+
+Some network interconnects are supported through the libfabric library.
+Configuring libfabric and Open MPI with ROCm support looks something like:
+
+.. code-block:: sh
+
+   # Configure libfabric with ROCm support
+   shell$ cd libfabric
+   shell$ ./configure --prefix=/path/to/ofi-rocm-install \
+       --with-rocr=/opt/rocm
+
+   # Configure Open MPI with libfabric and ROCm support
+   shell$ cd ompi
+   shell$ ./configure --with-rocm=/opt/rocm \
+       --with-ofi=/path/to/ofi-rocm-install \
+
+
+/////////////////////////////////////////////////////////////////////////
+
+
+Using ROCm-aware libfabric with Open MPI
+----------------------------------------
+
+There are two mechanisms for using libfabric and Open MPI with ROCm
+support:
+
+* Specifying the ``mtl/ofi`` component is sufficient to take advantage
+  of the ROCm support in the libraries. In this case, both intra- and
+  inter-node communication will be performed by the libfabric library.
+  In order to ensure that the application will make use of the shared
+  memory provider for intra-node communication and the
+  interconnect-specific provider for inter-node communication, the
+  user might have to request using the ``linkX`` provider, e.g.:
+
+.. code-block:: sh
+
+   # Force using the ofi mtl component
+   mpirun --mca pml cm --mca mtl ofi \
+       --mca opal_common_ofi_provider_include "shm+cxi:lnx" \
+       -n 64 ./
+
+* Alternatively, the user can use the ``btl/ofi`` component, in which
+  case the intra-node communication will use the Open MPI shared
+  memory mechanisms (see :ref:`sm-rocm-options-label`), and use
+  libfabric only for inter-node scenarios.
+
+.. code-block:: sh
+
+   # Use the ofi btl for inter-node and sm btl
+   # for intra-node communication
+   mpirun --mca pml ob1 --mca btl ofi,sm,tcp,self \
+       --mca smsc_accelerator_priority 80 \
+       -n 64 ./
+
+
+/////////////////////////////////////////////////////////////////////////
+
+Collective component supporting ROCm device memory
+--------------------------------------------------
+
+The ``coll/accelerator`` component supports many commonly used
+collective operations on ROCm device buffers. The component works by
+copying data into a temporary host buffer, executing the collective
+operation on the host buffer, and copying the result back to the
+device buffer at completion. This component will lead to adequate
+performance for short to medium data sizes, but performance is often
+suboptimal, especially for large reduction operations.
+
+The `UCC `_-based collective component
+in Open MPI can be configured and compiled to include ROCm support,
+and will typically lead to significantly better performance for large
+reductions.
+
+An example of configuring UCC and Open MPI with ROCm is shown below:
+
+.. code-block:: sh
+
+   # Configure and compile UCC with ROCm support
+   shell$ cd ucc
+   shell$ ./configure --with-rocm=/opt/rocm \
+       --with-ucx=/path/to/ucx-rocm-install \
+       --prefix=/path/to/ucc-rocm-install
+   shell$ make -j && make install
+
+   # Configure and compile Open MPI with UCX, UCC, and ROCm support
+   shell$ cd ompi
+   shell$ ./configure --with-rocm=/opt/rocm \
+       --with-ucx=/path/to/ucx-rocm-install \
+       --with-ucc=/path/to/ucc-rocm-install
+
+Using the UCC component in an application requires setting some
+additional parameters:
+
+..
code-block:: + + shell$ mpirun --mca pml ucx --mca osc ucx \ + --mca coll_ucc_enable 1 \ + --mca coll_ucc_priority 100 -np 64 ./my_mpi_app + +.. note:: Using the UCC library for collective operations in Open MPI + requires using the UCX library, and hence cannot be deployed + e.g. when using libfabric. diff --git a/docs/tuning-apps/coll-tuned.rst b/docs/tuning-apps/coll-tuned.rst index 1d5549256d8..b71f4d694ef 100644 --- a/docs/tuning-apps/coll-tuned.rst +++ b/docs/tuning-apps/coll-tuned.rst @@ -3,7 +3,7 @@ Tuning Collectives Open MPI's ``coll`` framework provides a number of components implementing collective communication, including: ``han``, ``libnbc``, ``self``, ``ucc`` ``base``, -``sync``, ``xhc``, ``accelerator``, ``basic``, ``ftagree``, ``inter``, ``portals4``, +``sync``, ``xhc``, ``accelerator``, ``basic``, ``ftagree``, ``inter``, ``portals4``, ``acoll``, and ``tuned``. Some of these components may not be available depending on how Open MPI was compiled and what hardware is available on the system. A run-time decision based on each component's self reported priority, selects which diff --git a/docs/tuning-apps/index.rst b/docs/tuning-apps/index.rst index debc86a0e5e..4d77f176e52 100644 --- a/docs/tuning-apps/index.rst +++ b/docs/tuning-apps/index.rst @@ -9,6 +9,7 @@ components that can be tuned to affect behavior at run time. environment-var networking/index + accelerators/index multithreaded dynamic-loading fork-system-popen diff --git a/docs/tuning-apps/mpi-io.rst b/docs/tuning-apps/mpi-io.rst index ddb84d62874..d478536458c 100644 --- a/docs/tuning-apps/mpi-io.rst +++ b/docs/tuning-apps/mpi-io.rst @@ -1,5 +1,5 @@ -Open MPI IO ("OMPIO") -===================== +MPI IO +====== OMPIO is an Open MPI-native implementation of the MPI I/O functions defined in the MPI specification. @@ -23,7 +23,7 @@ OMPIO is fundamentally a component of the ``io`` framework in Open MPI. Upon opening a file, the OMPIO component initializes a number of sub-frameworks and their components, namely: -* ``fs``: responsible for all file management operations +* ``fs``: responsible for all file management operations * ``fbtl``: support for blocking and non-blocking individual I/O operations * ``fcoll``: support for blocking and non-blocking collective I/O @@ -70,8 +70,7 @@ mechanism available in Open MPI to influence a parameter value, e.g.: shell$ mpirun --mca fcoll dynamic -n 64 ./a.out ``fs`` and ``fbtl`` components are typically chosen based on the file -system type utilized (e.g. the ``pvfs2`` component is chosen when the -file is located on an PVFS2/OrangeFS file system, the ``lustre`` +system type utilized (e.g. the ``lustre`` component is chosen for Lustre file systems, etc.). The ``ufs`` ``fs`` component is used if no file system specific component is availabe (e.g. local file systems, NFS, BeefFS, etc.), and the ``posix`` @@ -154,21 +153,11 @@ operation are listed below: Setting stripe size and stripe width on parallel file systems ------------------------------------------------------------- -Many ``fs`` components allow you to manipulate the layout of a new +Some ``fs`` components allow you to manipulate the layout of a new file on a parallel file system. Note, that many file systems only allow changing these setting upon file creation, i.e. modifying these values for an already existing file might not be possible. -#. ``fs_pvfs2_stripe_size``: Sets the number of storage servers for a - new file on a PVFS2/OrangeFS file system. If not set, system default will be - used. 
Note that this parameter can also be set through the - ``stripe_size`` MPI Info value. - -#. ``fs_pvfs2_stripe_width``: Sets the size of an individual block for - a new file on a PVFS2 file system. If not set, system default will - be used. Note that this parameter can also be set through the - ``stripe_width`` MPI Info value. - #. ``fs_lustre_stripe_size``: Sets the number of storage servers for a new file on a Lustre file system. If not set, system default will be used. Note that this parameter can also be set through the @@ -193,6 +182,12 @@ significant influence on the performance of the file I/O operation from device buffers, and can be controlled using the ``io_ompio_pipeline_buffer_size`` MCA parameter. +Furthermore, some collective file I/O components such as +``fcoll/vulcan`` allow the user to influence whether the buffer used +for collective aggregation is located in host or device memory through +the ``io_ompio_use_accelerator_buffers`` MCA parameter. + + .. _label-ompio-individual-sharedfp: Using the ``individual`` ``sharedfp`` component and its limitations diff --git a/docs/tuning-apps/networking/index.rst b/docs/tuning-apps/networking/index.rst index 00aa0f39df5..2be844cb61a 100644 --- a/docs/tuning-apps/networking/index.rst +++ b/docs/tuning-apps/networking/index.rst @@ -24,5 +24,3 @@ build support for that library). shared-memory ib-and-roce iwarp - cuda - rocm diff --git a/docs/tuning-apps/networking/rocm.rst b/docs/tuning-apps/networking/rocm.rst deleted file mode 100644 index 10ee12fe9e2..00000000000 --- a/docs/tuning-apps/networking/rocm.rst +++ /dev/null @@ -1,134 +0,0 @@ -ROCm -==== - -ROCm is the name of the software stack used by AMD GPUs. It includes -the ROCm Runtime (ROCr), the HIP programming model, and numerous -numerical and machine learning libraries tuned for the AMD Instinct -accelerators. More information can be found at the following -`AMD webpages `_ - - -Building Open MPI with ROCm support ------------------------------------ - -ROCm-aware support means that the MPI library can send and receive -data from AMD GPU device buffers directly. As of today, ROCm support -is available through UCX. While other communication transports might -work as well, UCX is the only transport formally supported in Open MPI -|ompi_ver| for ROCm devices. - -Since UCX will be providing the ROCm support, it is important to -ensure that UCX itself is built with ROCm support. - -To see if your UCX library was built with ROCm support, run the -following command: - -.. code-block:: sh - - # Check if ucx was built with ROCm support - shell$ ucx_info -v - - # configured with: --with-rocm=/opt/rocm --without-knem --without-cuda - -If you need to build the UCX library yourself to include ROCm support, -please see the UCX documentation for `building UCX with Open MPI: -`_ - -It should look something like: - -.. code-block:: sh - - # Configure UCX with ROCm support - shell$ cd ucx - shell$ ./configure --prefix=/path/to/ucx-rocm-install \ - --with-rocm=/opt/rocm --without-knem - - # Configure Open MPI with UCX and ROCm support - shell$ cd ompi - shell$ ./configure --with-rocm=/opt/rocm \ - --with-ucx=/path/to/ucx-rocm-install \ - - -///////////////////////////////////////////////////////////////////////// - -Checking that Open MPI has been built with ROCm support -------------------------------------------------------- - -Verify that Open MPI has been built with ROCm using the -:ref:`ompi_info(1) ` command: - -.. 
code-block:: sh - - # Use ompi_info to verify ROCm support in Open MPI - shell$ ./ompi_info | grep "MPI extensions" - MPI extensions: affinity, cuda, ftmpi, rocm - -///////////////////////////////////////////////////////////////////////// - - -Using ROCm-aware UCX with Open MPI --------------------------------------------------------------------------- - -If UCX and Open MPI have been configured with ROCm support, specifying -the UCX pml component is sufficient to take advantage of the ROCm -support in the libraries. For example, the command to execute the -``osu_latency`` benchmark from the `OSU benchmarks -`_ with ROCm buffers -using Open MPI and UCX ROCm support is something like this: - -.. code-block:: - - shell$ mpirun -n 2 --mca pml ucx \ - ./osu_latency D D - -Note: some additional configure flags are required to compile the OSU -benchmark to support ROCm buffers. Please refer to the `UCX ROCm -instructions -`_ -for details. - - -///////////////////////////////////////////////////////////////////////// - -Runtime querying of ROCm support in Open MPI --------------------------------------------- - -Starting with Open MPI v5.0.0 :ref:`MPIX_Query_rocm_support(3) -` is available as an extension to check -the availability of ROCm support in the library. To use the -function, the code needs to include ``mpi-ext.h``. Note that -``mpi-ext.h`` is an Open MPI specific header file. - -///////////////////////////////////////////////////////////////////////// - -Collective component supporting ROCm device memory --------------------------------------------------- - -The `UCC `_ based collective component -in Open MPI can be configured and compiled to include ROCm support. - -An example for configure UCC and Open MPI with ROCm is shown below: - -.. code-block:: - - # Configure and compile UCC with ROCm support - shell$ cd ucc - shell$ ./configure --with-rocm=/opt/rocm \ - --with-ucx=/path/to/ucx-rocm-install \ - --prefix=/path/to/ucc-rocm-install - shell$ make -j && make install - - # Configure and compile Open MPI with UCX, UCC, and ROCm support - shell$ cd ompi - shell$ ./configure --with-rocm=/opt/rocm \ - --with-ucx=/path/to/ucx-rocm-install \ - --with-ucc=/path/to/ucc-rocm-install - -To use the UCC component in an applicatin requires setting some -additional parameters: - -.. code-block:: - - shell$ mpirun --mca pml ucx --mca osc ucx \ - --mca coll_ucc_enable 1 \ - --mca coll_ucc_priority 100 -np 64 ./my_mpi_app diff --git a/docs/tuning-apps/networking/shared-memory.rst b/docs/tuning-apps/networking/shared-memory.rst index 7c40693cd76..0584c554e4f 100644 --- a/docs/tuning-apps/networking/shared-memory.rst +++ b/docs/tuning-apps/networking/shared-memory.rst @@ -13,7 +13,7 @@ can only be used between processes executing on the same node. BTL was named ``vader``. As of Open MPI version 5.0.0, the BTL has been renamed ``sm``. -.. warning:: In Open MPI version 5.0.x, the name ``vader`` is simply +.. warning:: In Open MPI version 6.0.x, the name ``vader`` is simply an alias for the ``sm`` BTL. Similarly, all ``vader_``-prefixed MCA parameters are automatically aliased to their corresponding ``sm_``-prefixed MCA @@ -90,7 +90,7 @@ The ``sm`` BTL supports two modes of shared memory communication: #. **Single copy:** In this mode, the sender or receiver makes a single copy of the message data from the source buffer in one process to the destination buffer in another process. 
Open MPI - supports three flavors of shared memory single-copy transfers: + supports four flavors of shared memory single-copy transfers: * `Linux KNEM `_. This is a standalone Linux kernel module, made specifically for HPC and MPI @@ -118,6 +118,18 @@ The ``sm`` BTL supports two modes of shared memory communication: Open MPI must be built on a Linux system with a recent enough Glibc and kernel version in order to build support for Linux CMA. + * Accelerator IPC mechanism: some accelerator devices support + direct GPU-to-GPU data transfers that can take advantage of + high-speed interconnects between the accelerators. This component + is based on IPC abstractions introduced in the accelerator + framework, which allows the sm btl component to use this + mechanism if requested by the user. For host memory this + component will pass through the operation to another single-copy + component. + + The component is disabled by default. To use this component, the + application has to increase the priority of the component. + Which mechanism is used at run time depends both on how Open MPI was built and how your system is configured. You can check to see which single-copy mechanisms Open MPI was built with via two mechanisms: