
Reduced performance of MPI_Allreduce in OpenMPI-5 compared to OpenMPI-4 #13082

@maxim-masterov

Description

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

v5.0.3 and v4.1.5

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

From sources: https://www.open-mpi.org/software/ompi
Both versions of OpenMPI were built against the following dependency versions:

depends_on("GCC/12.3.0")
depends_on("zlib/1.2.13-GCCcore-12.3.0")
depends_on("hwloc/2.9.1-GCCcore-12.3.0")
depends_on("libevent/2.1.12-GCCcore-12.3.0")
depends_on("UCX/1.14.1-GCCcore-12.3.0")
depends_on("libfabric/1.18.0-GCCcore-12.3.0")
depends_on("UCC/1.2.0-GCCcore-12.3.0")

Please describe the system on which you are running

  • Operating system/version:
    RHEL9.4
  • Computer hardware:
    $ cat /proc/cpuinfo | grep "model name" | tail -n 1
    model name	: AMD EPYC 7H12 64-Core Processor
  • Network type:
    $ lspci | grep -i Mellanox
    01:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
    01:00.1 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
    01:00.2 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
    01:00.3 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
    01:00.4 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
    01:00.5 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
    01:00.6 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
    01:00.7 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
    01:01.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
    21:00.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
    21:00.1 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]

Details of the problem

I've noticed a performance degradation of some collective operations in OpenMPI-5, compared to OpenMPI-4. The test code below executes a simple MPI_Allreduce on 16M doubles:

#include <iostream>
#include <vector>
#include <mpi.h>

#define TABLE_SIZE 16777216

int main(int argc, char **argv) {

  int rank, size;
  // Allocate the 16M-element buffers on the heap: two 128 MB arrays on the
  // stack would overflow the default stack size limit.
  std::vector<double> table(TABLE_SIZE);
  std::vector<double> global_result(TABLE_SIZE);

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank); // Get the rank of the process
  MPI_Comm_size(MPI_COMM_WORLD, &size); // Get the total number of processes

  for (int i = 0; i < TABLE_SIZE; i++) {
    table[i] = rank + i;
  }

  // Time a single large Allreduce: 16M doubles summed across all ranks.
  double start_time = MPI_Wtime();
  MPI_Allreduce(table.data(), global_result.data(), TABLE_SIZE, MPI_DOUBLE,
                MPI_SUM, MPI_COMM_WORLD);
  double end_time = MPI_Wtime();
  double elapsed_time = end_time - start_time;

  // Reduce the per-rank timings to min/max/average across all processes.
  double min_time, max_time, avg_time;
  MPI_Allreduce(&elapsed_time, &min_time, 1, MPI_DOUBLE, MPI_MIN,
                MPI_COMM_WORLD);
  MPI_Allreduce(&elapsed_time, &max_time, 1, MPI_DOUBLE, MPI_MAX,
                MPI_COMM_WORLD);
  MPI_Allreduce(&elapsed_time, &avg_time, 1, MPI_DOUBLE, MPI_SUM,
                MPI_COMM_WORLD);
  avg_time /= size;

  if (rank == 0) {
    std::cout << "Global Reduced Result (sum of all elements across all "
                 "processes):\n";
    std::cout << "Result[0]: " << global_result[0] << std::endl;
    std::cout << "Result[" << TABLE_SIZE - 1
              << "]: " << global_result[TABLE_SIZE - 1] << std::endl;
    std::cout << "MPI_Allreduce (s): " << elapsed_time << std::endl;
    std::cout << "MPI_Allreduce Timing Analysis:" << std::endl;
    std::cout << "  Minimum time: " << min_time << " seconds" << std::endl;
    std::cout << "  Maximum time: " << max_time << " seconds" << std::endl;
    std::cout << "  Average time: " << avg_time << " seconds" << std::endl;
  }

  MPI_Finalize();
  return 0;
}

The code was compiled with only the -O3 flag and executed on 256 processes across two nodes:

$ mpicxx -O3 mpi_allreduce_16M.cpp
...
$ mpirun -n 256 ./a.out

The timing for OpenMPI-5.0.3:

  Minimum time: 3.00434 seconds
  Maximum time: 3.01639 seconds
  Average time: 3.00823 seconds

The timing for OpenMPI-4.1.5:

  Minimum time: 0.816602 seconds
  Maximum time: 0.870789 seconds
  Average time: 0.85539 seconds

Any suggestions on why the performance could be so different? Are there any recommendations on where to look to improve OMPI-5 performance?
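
For reference, one way to narrow this down might be to check which collective component and algorithm each version selects, and to pin the algorithm explicitly for an apples-to-apples comparison. A rough sketch, assuming the tuned component is in use (the verbosity level, component names, and algorithm number below are illustrative, not a recommendation):

# List the tuned Allreduce algorithm choices available in this build
$ ompi_info --param coll tuned --level 9 | grep allreduce

# Print which collective components are selected at run time
$ mpirun -n 256 --mca coll_base_verbose 10 ./a.out

# Pin a specific tuned Allreduce algorithm (e.g., 4 = ring) for comparison
$ mpirun -n 256 --mca coll_tuned_use_dynamic_rules 1 \
         --mca coll_tuned_allreduce_algorithm 4 ./a.out

# Exclude a single component (e.g., han or hcoll) to see whether selection changes
$ mpirun -n 256 --mca coll ^han ./a.out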
