
CPU core duplication in OpenMPI 5.0.5 when running multiple tasks #12884

@lemon3242

Description

Hi all,

I have encountered an issue with OpenMPI 5.0.5 where multiple MPI tasks are assigned to the same CPU cores, even though I expect different programs to receive distinct cores. The issue occurs when two separate MPI jobs run concurrently. Here are the details:

Operating System: Ubuntu 22.04
CPU: Dual AMD EPYC 9754

The first program is launched with 64 MPI ranks, and the --report-bindings output shows the expected core assignment. However, when I run a second program while the first is still running, MPI binds the second program to the same cores, resulting in core overlap and suboptimal CPU usage.

First Program Command and Output

$ mpirun --report-bindings -np 64 python ./xxx1.py
[ytgroup-epyc9754:1162290] Rank 0 bound to package[0][core:0]
[ytgroup-epyc9754:1162290] Rank 1 bound to package[0][core:1]
[ytgroup-epyc9754:1162290] Rank 2 bound to package[0][core:2]
[ytgroup-epyc9754:1162290] Rank 3 bound to package[0][core:3]
[ytgroup-epyc9754:1162290] Rank 4 bound to package[0][core:4]
[ytgroup-epyc9754:1162290] Rank 5 bound to package[0][core:5]
[ytgroup-epyc9754:1162290] Rank 6 bound to package[0][core:6]
[ytgroup-epyc9754:1162290] Rank 7 bound to package[0][core:7]
[ytgroup-epyc9754:1162290] Rank 8 bound to package[0][core:8]
[ytgroup-epyc9754:1162290] Rank 9 bound to package[0][core:9]
[ytgroup-epyc9754:1162290] Rank 10 bound to package[0][core:10]
[ytgroup-epyc9754:1162290] Rank 11 bound to package[0][core:11]
...
[ytgroup-epyc9754:1162290] Rank 63 bound to package[0][core:63]

Second Program Command and Output

When I start the second program using the following command:

$ mpirun --report-bindings -np 27 python ./xxx2.py

The second program is assigned to the same cores that the first program is already using, causing core overlap:

[ytgroup-epyc9754:1162897] Rank 0 bound to package[0][core:0]
[ytgroup-epyc9754:1162897] Rank 1 bound to package[0][core:1]
[ytgroup-epyc9754:1162897] Rank 2 bound to package[0][core:2]
[ytgroup-epyc9754:1162897] Rank 3 bound to package[0][core:3]
[ytgroup-epyc9754:1162897] Rank 4 bound to package[0][core:4]
[ytgroup-epyc9754:1162897] Rank 5 bound to package[0][core:5]
[ytgroup-epyc9754:1162897] Rank 6 bound to package[0][core:6]
[ytgroup-epyc9754:1162897] Rank 7 bound to package[0][core:7]
[ytgroup-epyc9754:1162897] Rank 8 bound to package[0][core:8]
[ytgroup-epyc9754:1162897] Rank 9 bound to package[0][core:9]
...
[ytgroup-epyc9754:1162897] Rank 26 bound to package[0][core:26]

Workaround

For now, I have found that passing --bind-to none to OpenMPI 5.0.5 avoids the problem by disabling core binding entirely. However, this is not ideal, because the processes then lose the benefits of core affinity. Is there a more appropriate way to assign distinct cores to different programs without turning binding off completely?
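
For reference, the second job with the workaround applied looks roughly like this (same placeholder script name as above):

$ mpirun --bind-to none -np 27 python ./xxx2.py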

Related Observations

In OpenMPI 4.1.5, I have noticed that hyper-threading is sometimes enabled unexpectedly when running MPI programs. Setting OMP_NUM_THREADS=1 seems to mitigate it, but I am unsure whether this is related to OpenMPI or to OpenBLAS.
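
Concretely, I set the variable in the shell before launching the job, roughly like this (single node, so the local ranks inherit the environment):

$ export OMP_NUM_THREADS=1
$ mpirun -np 64 python ./xxx1.py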

I would greatly appreciate any guidance on how to configure OpenMPI to avoid core duplication between concurrent MPI jobs. Additionally, I am open to suggestions or fixes that could help resolve the hyper-threading issue observed in OpenMPI 4.1.5.

N.
