Background information
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
v5.0.3 and v4.1.5
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
From sources: https://www.open-mpi.org/software/ompi
Both versions of OpenMPI were installed with the following dependency versions:
depends_on("GCC/12.3.0")
depends_on("zlib/1.2.13-GCCcore-12.3.0")
depends_on("hwloc/2.9.1-GCCcore-12.3.0")
depends_on("libevent/2.1.12-GCCcore-12.3.0")
depends_on("UCX/1.14.1-GCCcore-12.3.0")
depends_on("libfabric/1.18.0-GCCcore-12.3.0")
depends_on("UCC/1.2.0-GCCcore-12.3.0")Please describe the system on which you are running
- Operating system/version: RHEL 9.4
- Computer hardware:

$ cat /proc/cpuinfo | grep "model name" | tail -n 1
model name : AMD EPYC 7H12 64-Core Processor

- Network type:

$ lspci | grep -i Mellanox
01:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
01:00.1 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
01:00.2 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
01:00.3 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
01:00.4 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
01:00.5 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
01:00.6 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
01:00.7 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
01:01.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
21:00.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
21:00.1 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
Details of the problem
I've noticed a performance degradation of some collective operations in OpenMPI-5, compared to OpenMPI-4. The test code below executes a simple MPI_Allreduce on 16M doubles:
#include <iostream>
#include <mpi.h>
#include <sys/time.h>
#define TABLE_SIZE 16777216
int main(int argc, char **argv) {
  int rank, size;
  // Static storage: two 128 MiB buffers would overflow a default-size stack.
  static double table[TABLE_SIZE];
  static double global_result[TABLE_SIZE];
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank); // Get the rank of the process
  MPI_Comm_size(MPI_COMM_WORLD, &size); // Get the total number of processes
  for (int i = 0; i < TABLE_SIZE; i++) {
    table[i] = rank + i;
  }
  double start_time = MPI_Wtime();
  MPI_Allreduce(table, global_result, TABLE_SIZE, MPI_DOUBLE, MPI_SUM,
                MPI_COMM_WORLD);
  double end_time = MPI_Wtime();
  double elapsed_time = end_time - start_time;
  double min_time, max_time, avg_time;
  MPI_Allreduce(&elapsed_time, &min_time, 1, MPI_DOUBLE, MPI_MIN,
                MPI_COMM_WORLD);
  MPI_Allreduce(&elapsed_time, &max_time, 1, MPI_DOUBLE, MPI_MAX,
                MPI_COMM_WORLD);
  MPI_Allreduce(&elapsed_time, &avg_time, 1, MPI_DOUBLE, MPI_SUM,
                MPI_COMM_WORLD);
  avg_time /= size;
  if (rank == 0) {
    std::cout << "Global Reduced Result (sum of all elements across all "
                 "processes):\n";
    std::cout << "Result[0]: " << global_result[0] << std::endl;
    std::cout << "Result[" << TABLE_SIZE - 1
              << "]: " << global_result[TABLE_SIZE - 1] << std::endl;
    std::cout << "MPI_Allreduce (s): " << (end_time - start_time) << std::endl;
    std::cout << "MPI_Allreduce Timing Analysis:" << std::endl;
    std::cout << "  Minimum time: " << min_time << " seconds" << std::endl;
    std::cout << "  Maximum time: " << max_time << " seconds" << std::endl;
    std::cout << "  Average time: " << avg_time << " seconds" << std::endl;
  }
  MPI_Finalize();
  return 0;
}

The code was compiled with a single -O3 flag and executed on 256 processes across two nodes:
$ mpicxx -O3 mpi_allreduce_16M.cpp
...
$ mpirun -n 256 ./a.out

The timing for OpenMPI-5.0.3:
  Minimum time: 3.00434 seconds
  Maximum time: 3.01639 seconds
  Average time: 3.00823 seconds

Timing for OpenMPI-4.1.5:
  Minimum time: 0.816602 seconds
  Maximum time: 0.870789 seconds
  Average time: 0.85539 seconds

Any suggestions on why the performance could be so different? Are there any recommendations on where to look to improve OMPI-5 performance?
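For reference, below is a variant of the timing section with a warm-up call and several timed iterations (a minimal sketch; NITER and the printed labels are illustrative additions, not part of the original reproducer). Since a single cold MPI_Allreduce can include one-time costs such as lazy connection setup and memory registration, averaging over repeated, barrier-synchronized calls may help separate those from the steady-state collective time.

// Sketch: warm-up + repeated-measurement variant of the benchmark above.
// NITER is illustrative; everything else follows the original reproducer.
#include <iostream>
#include <mpi.h>
#define TABLE_SIZE 16777216
#define NITER 10
int main(int argc, char **argv) {
  int rank, size;
  static double table[TABLE_SIZE];          // static: avoid 128 MiB stack frames
  static double global_result[TABLE_SIZE];
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  for (int i = 0; i < TABLE_SIZE; i++) {
    table[i] = rank + i;
  }
  // Warm-up: the first call may pay one-time connection/registration costs.
  MPI_Allreduce(table, global_result, TABLE_SIZE, MPI_DOUBLE, MPI_SUM,
                MPI_COMM_WORLD);
  // Timed iterations, synchronized so all ranks start together.
  MPI_Barrier(MPI_COMM_WORLD);
  double start_time = MPI_Wtime();
  for (int it = 0; it < NITER; it++) {
    MPI_Allreduce(table, global_result, TABLE_SIZE, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);
  }
  double elapsed_time = (MPI_Wtime() - start_time) / NITER;
  double min_time, max_time, avg_time;
  MPI_Allreduce(&elapsed_time, &min_time, 1, MPI_DOUBLE, MPI_MIN, MPI_COMM_WORLD);
  MPI_Allreduce(&elapsed_time, &max_time, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
  MPI_Allreduce(&elapsed_time, &avg_time, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
  avg_time /= size;
  if (rank == 0) {
    std::cout << "Per-call MPI_Allreduce time over " << NITER << " iterations:\n";
    std::cout << "  Minimum time: " << min_time << " seconds" << std::endl;
    std::cout << "  Maximum time: " << max_time << " seconds" << std::endl;
    std::cout << "  Average time: " << avg_time << " seconds" << std::endl;
  }
  MPI_Finalize();
  return 0;
}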