
Reduced performance of MPI_Allreduce in OpenMPI-5 compared to OpenMPI-4 #13082

@maxim-masterov

Description

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

v5.0.3 and v4.1.5

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

From sources: https://www.open-mpi.org/software/ompi
Both versions of OpenMPI were built against the following dependency versions:

depends_on("GCC/12.3.0")
depends_on("zlib/1.2.13-GCCcore-12.3.0")
depends_on("hwloc/2.9.1-GCCcore-12.3.0")
depends_on("libevent/2.1.12-GCCcore-12.3.0")
depends_on("UCX/1.14.1-GCCcore-12.3.0")
depends_on("libfabric/1.18.0-GCCcore-12.3.0")
depends_on("UCC/1.2.0-GCCcore-12.3.0")

Please describe the system on which you are running

  • Operating system/version:
    RHEL9.4
  • Computer hardware:
    $ cat /proc/cpuinfo | grep "model name" | tail -n 1
    model name	: AMD EPYC 7H12 64-Core Processor
  • Network type:
    $ lspci | grep -i Mellanox
    01:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
    01:00.1 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
    01:00.2 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
    01:00.3 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
    01:00.4 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
    01:00.5 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
    01:00.6 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
    01:00.7 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
    01:01.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
    21:00.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
    21:00.1 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]

Details of the problem

I've noticed a performance degradation of some collective operations in OpenMPI-5, compared to OpenMPI-4. The test code below executes a simple MPI_Allreduce on 16M doubles:

#include <iostream>
#include <vector>
#include <mpi.h>

#define TABLE_SIZE 16777216

int main(int argc, char **argv) {

  int rank, size;
  // Allocate the 16M-element buffers on the heap: two 128 MB arrays on the
  // stack would overflow the default stack size limit.
  std::vector<double> table(TABLE_SIZE);
  std::vector<double> global_result(TABLE_SIZE);

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank); // Get the rank of the process
  MPI_Comm_size(MPI_COMM_WORLD, &size); // Get the total number of processes

  for (int i = 0; i < TABLE_SIZE; i++) {
    table[i] = rank + i;
  }

  // Time a single large Allreduce: 16M doubles summed across all ranks.
  double start_time = MPI_Wtime();
  MPI_Allreduce(table.data(), global_result.data(), TABLE_SIZE, MPI_DOUBLE,
                MPI_SUM, MPI_COMM_WORLD);
  double end_time = MPI_Wtime();
  double elapsed_time = end_time - start_time;

  // Reduce the per-rank timings to min/max/average across all processes.
  double min_time, max_time, avg_time;
  MPI_Allreduce(&elapsed_time, &min_time, 1, MPI_DOUBLE, MPI_MIN,
                MPI_COMM_WORLD);
  MPI_Allreduce(&elapsed_time, &max_time, 1, MPI_DOUBLE, MPI_MAX,
                MPI_COMM_WORLD);
  MPI_Allreduce(&elapsed_time, &avg_time, 1, MPI_DOUBLE, MPI_SUM,
                MPI_COMM_WORLD);
  avg_time /= size;

  if (rank == 0) {
    std::cout << "Global Reduced Result (sum of all elements across all "
                 "processes):\n";
    std::cout << "Result[0]: " << global_result[0] << std::endl;
    std::cout << "Result[" << TABLE_SIZE - 1
              << "]: " << global_result[TABLE_SIZE - 1] << std::endl;
    std::cout << "MPI_Allreduce (s): " << elapsed_time << std::endl;
    std::cout << "MPI_Allreduce Timing Analysis:" << std::endl;
    std::cout << "  Minimum time: " << min_time << " seconds" << std::endl;
    std::cout << "  Maximum time: " << max_time << " seconds" << std::endl;
    std::cout << "  Average time: " << avg_time << " seconds" << std::endl;
  }

  MPI_Finalize();
  return 0;
}

The code was compiled with only the -O3 flag and executed on 256 processes across two nodes:

$ mpicxx -O3 mpi_allreduce_16M.cpp
...
$ mpirun -n 256 ./a.out

The timing for OpenMPI-5.0.3:

  Minimum time: 3.00434 seconds
  Maximum time: 3.01639 seconds
  Average time: 3.00823 seconds

The timing for OpenMPI-4.1.5:

  Minimum time: 0.816602 seconds
  Maximum time: 0.870789 seconds
  Average time: 0.85539 seconds

Any suggestions on why the performance could be so different? Are there any recommendations on where to look to improve OMPI-5 performance?
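
For reference, one way to narrow this down might be to check which collective component and algorithm each version selects, and to pin the algorithm explicitly for an apples-to-apples comparison. A rough sketch, assuming the tuned component is in use (the verbosity level, component names, and algorithm number below are illustrative, not a recommendation):

# List the tuned Allreduce algorithm choices available in this build
$ ompi_info --param coll tuned --level 9 | grep allreduce

# Print which collective components are selected at run time
$ mpirun -n 256 --mca coll_base_verbose 10 ./a.out

# Pin a specific tuned Allreduce algorithm (e.g., 4 = ring) for comparison
$ mpirun -n 256 --mca coll_tuned_use_dynamic_rules 1 \
         --mca coll_tuned_allreduce_algorithm 4 ./a.out

# Exclude a single component (e.g., han or hcoll) to see whether selection changes
$ mpirun -n 256 --mca coll ^han ./a.out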
