Help needed with understanding ompi_display_comm output #12323

@tlivolsi

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

OpenMPI 5.0.1

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

OpenMPI was installed via Spack

spack install [email protected]+atomics+gpfs+openshmem+romio fabrics=hcoll,ucx schedulers=tm ^[email protected] ^[email protected] %[email protected] target=zen3

Please describe the system on which you are running

  • Operating system/version: RHEL 8.8
  • Computer hardware: AMD EPYC 7713 64-Core Processor x2
  • Network type: Infiniband

Details of the problem

I am diagnosing performance issues on my cluster, and the output of --mca ompi_display_comm 1 caught my attention. Many off-host connections (key B in the table below) are reported as a generic ucx method with no transport listed, which is not what I expect. I expect all inter-node communication to use the ucx=dc_mlx5;mlx5_0:1 transport method (key C). I enabled UCX debugging, but nothing in the output stands out to me. Any insight would be greatly appreciated!

$ tail -f *97877
Host 0 [i041] ranks 0 - 127
Host 1 [i042] ranks 128 - 255
Host 2 [i048] ranks 256 - 383
Host 3 [i063] ranks 384 - 511
Host 4 [i064] ranks 512 - 639
Host 5 [i067] ranks 640 - 767
Host 6 [i070] ranks 768 - 895
Host 7 [i075] ranks 896 - 1023
Host 8 [i090] ranks 1024 - 1151
Host 9 [i091] ranks 1152 - 1279
Host 10 [i092] ranks 1280 - 1407
Host 11 [i093] ranks 1408 - 1535
Host 12 [i094] ranks 1536 - 1663
Host 13 [i095] ranks 1664 - 1791
Host 14 [i096] ranks 1792 - 1919
Host 15 [i097] ranks 1920 - 2047
Host 16 [i098] ranks 2048 - 2175
Host 17 [i099] ranks 2176 - 2303
Host 18 [i100] ranks 2304 - 2431
Host 19 [i101] ranks 2432 - 2559
Host 20 [i102] ranks 2560 - 2687
Host 21 [i103] ranks 2688 - 2815
Host 22 [i104] ranks 2816 - 2943
Host 23 [i105] ranks 2944 - 3071
Host 24 [i113] ranks 3072 - 3199
Host 25 [i121] ranks 3200 - 3327
Host 26 [i122] ranks 3328 - 3455
Host 27 [i123] ranks 3456 - 3583
Host 28 [i124] ranks 3584 - 3711
Host 29 [i125] ranks 3712 - 3839
Host 30 [i126] ranks 3840 - 3967
Host 31 [i127] ranks 3968 - 4095

 host | 0 1 2 3 4       8       12      16      20      24      28
======|=============================================================
    0 : A C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C
    1 : C A C C C C C C C C C C C C C C C C C B B B B B B B B B B B B B
    2 : C B A C C C C C C C C C C C C C C C C C B B B B B B B B B B B B
    3 : C C C A C C C C C C C C C C C C C C C C C B B B B B B B B B B B
    4 : C B B B A C C C C C C C C C C C C C C C C C B B B B B B B B B B
    5 : C C B B C A C C C C C C C C C C C C C C C C C B B B B B B B B B
    6 : C B C B C B A C C C C C C C C C C C C C C C C C B B B B B B B B
    7 : C B B C B C C A C C C C C C C C C C C C C C C C C B B B B B B B
    8 : C B B B B B B B A C C C C C C C C C C C C C C C C C B B B B B B
    9 : C C B B B B B B C A C C C C C C C C C C C C C C C C C B B B B B
   10 : C B C B B B B B C B A C C C C C C C C C C C C C C C C C B B B B
   11 : C B B C B B B B B C C A C C C C C C C C C C C C C C C C C B B B
   12 : C B B B C B B B C B B B A C C C C C C C C C C C C C C C C C B B
   13 : C B B B B C B B B C B B C A C C C C C C C C C C C C C C C C C B
   14 : C B B B B B C B B B C B C B A C C C C C C C C C C C C C C C C C
   15 : C B B B B B B C B B B C B C C A C C C C C C C C C C C C C C C C
   16 : C C B B B B B B B B B B B B B B A C C C C C C C C C C C C C C C
   17 : C C C B B B B B B B B B B B B B C A C C C C C C C C C C C C C C
   18 : C C C C B B B B B B B B B B B B C B A C C C C C C C C C C C C C
   19 : C C C C C B B B B B B B B B B B B C C A C C C C C C C C C C C C
   20 : C C C C C C B B B B B B B B B B C B B B A C C C C C C C C C C C
   21 : C C C C C C C B B B B B B B B B B C B B C A C C C C C C C C C C
   22 : C C C C C C C C B B B B B B B B B B C B C B A C C C C C C C C C
   23 : C C C C C C C C C B B B B B B B B B B C B C C A C C C C C C C C
   24 : C C C C C C C C C C B B B B B B C B B B B B B B A C C C C C C C
   25 : C C C C C C C C C C C B B B B B B C B B B B B B C A C C C C C C
   26 : C C C C C C C C C C C C B B B B B B C B B B B B C B A C C C C C
   27 : C C C C C C C C C C C C C B B B B B B C B B B B B C C A C C C C
   28 : C C C C C C C C C C C C C C B B B B B B C B B B C B B B A C C C
   29 : C C C C C C C C C C C C C C C B B B B B B C B B B C B B C A C C
   30 : C C C C C C C C C C C C C C C C B B B B B B C B B B C B C B A C
   31 : C C C C C C C C C C C C C C C C C B B B B B B C B B B C B C C A
key: A == ucx=sysv;memory,xpmem;memory,knem;memory
key: B == ucx
key: C == ucx=dc_mlx5;mlx5_0:1

Connection summary: (pml)
  on-host:  all connections are ucx=sysv;memory,xpmem;memory,knem;memory
  off-host: most connections are ucx=dc_mlx5;mlx5_0:1
Exceptions:
  host 1: [13x ucx] [18x ucx=dc_mlx5;mlx5_0:1] [1x ucx=sysv;memory,xpmem;memory,knem;memory]
  host 2: [13x ucx] [18x ucx=dc_mlx5;mlx5_0:1] [1x ucx=sysv;memory,xpmem;memory,knem;memory]
  host 3: [11x ucx] [20x ucx=dc_mlx5;mlx5_0:1] [1x ucx=sysv;memory,xpmem;memory,knem;memory]
  host 4: [13x ucx] [18x ucx=dc_mlx5;mlx5_0:1] [1x ucx=sysv;memory,xpmem;memory,knem;memory]
  host 5: [11x ucx] [20x ucx=dc_mlx5;mlx5_0:1] [1x ucx=sysv;memory,xpmem;memory,knem;memory]
  host 6: [11x ucx] [20x ucx=dc_mlx5;mlx5_0:1] [1x ucx=sysv;memory,xpmem;memory,knem;memory]
  host 7: [10x ucx] [21x ucx=dc_mlx5;mlx5_0:1] [1x ucx=sysv;memory,xpmem;memory,knem;memory]
  host 8: [13x ucx] [18x ucx=dc_mlx5;mlx5_0:1] [1x ucx=sysv;memory,xpmem;memory,knem;memory]
  host 9: [11x ucx] [20x ucx=dc_mlx5;mlx5_0:1] [1x ucx=sysv;memory,xpmem;memory,knem;memory]
  host 10: [11x ucx] [20x ucx=dc_mlx5;mlx5_0:1] [1x ucx=sysv;memory,xpmem;memory,knem;memory]
  host 11: [10x ucx] [21x ucx=dc_mlx5;mlx5_0:1] [1x ucx=sysv;memory,xpmem;memory,knem;memory]
  host 12: [11x ucx] [20x ucx=dc_mlx5;mlx5_0:1] [1x ucx=sysv;memory,xpmem;memory,knem;memory]
  host 13: [10x ucx] [21x ucx=dc_mlx5;mlx5_0:1] [1x ucx=sysv;memory,xpmem;memory,knem;memory]
  host 14: [10x ucx] [21x ucx=dc_mlx5;mlx5_0:1] [1x ucx=sysv;memory,xpmem;memory,knem;memory]
  host 15: [10x ucx] [21x ucx=dc_mlx5;mlx5_0:1] [1x ucx=sysv;memory,xpmem;memory,knem;memory]
  host 16: [14x ucx] [17x ucx=dc_mlx5;mlx5_0:1] [1x ucx=sysv;memory,xpmem;memory,knem;memory]
  host 17: [13x ucx] [18x ucx=dc_mlx5;mlx5_0:1] [1x ucx=sysv;memory,xpmem;memory,knem;memory]
  host 18: [13x ucx] [18x ucx=dc_mlx5;mlx5_0:1] [1x ucx=sysv;memory,xpmem;memory,knem;memory]
  host 19: [12x ucx] [19x ucx=dc_mlx5;mlx5_0:1] [1x ucx=sysv;memory,xpmem;memory,knem;memory]
  host 20: [13x ucx] [18x ucx=dc_mlx5;mlx5_0:1] [1x ucx=sysv;memory,xpmem;memory,knem;memory]
  host 21: [12x ucx] [19x ucx=dc_mlx5;mlx5_0:1] [1x ucx=sysv;memory,xpmem;memory,knem;memory]
  host 22: [12x ucx] [19x ucx=dc_mlx5;mlx5_0:1] [1x ucx=sysv;memory,xpmem;memory,knem;memory]
  host 23: [11x ucx] [20x ucx=dc_mlx5;mlx5_0:1] [1x ucx=sysv;memory,xpmem;memory,knem;memory]
  host 24: [13x ucx] [18x ucx=dc_mlx5;mlx5_0:1] [1x ucx=sysv;memory,xpmem;memory,knem;memory]
  host 25: [12x ucx] [19x ucx=dc_mlx5;mlx5_0:1] [1x ucx=sysv;memory,xpmem;memory,knem;memory]
  host 26: [12x ucx] [19x ucx=dc_mlx5;mlx5_0:1] [1x ucx=sysv;memory,xpmem;memory,knem;memory]
  host 27: [11x ucx] [20x ucx=dc_mlx5;mlx5_0:1] [1x ucx=sysv;memory,xpmem;memory,knem;memory]
  host 28: [12x ucx] [19x ucx=dc_mlx5;mlx5_0:1] [1x ucx=sysv;memory,xpmem;memory,knem;memory]
  host 29: [11x ucx] [20x ucx=dc_mlx5;mlx5_0:1] [1x ucx=sysv;memory,xpmem;memory,knem;memory]
  host 30: [11x ucx] [20x ucx=dc_mlx5;mlx5_0:1] [1x ucx=sysv;memory,xpmem;memory,knem;memory]
  host 31: [10x ucx] [21x ucx=dc_mlx5;mlx5_0:1] [1x ucx=sysv;memory,xpmem;memory,knem;memory]
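
For anyone who wants to cross-check the table against the Exceptions list, here is a minimal Python sketch that tallies the key letters per host row and lists host pairs whose two directions report different keys. It assumes the row layout shown above ("N : A C C ..."). The function names (parse_matrix, tally, asymmetric_pairs) are made up for illustration and are not part of Open MPI or UCX. The sample is just the first four rows and columns of the table.

```python
from collections import Counter


def parse_matrix(text):
    """Return {host_index: [key_letter, ...]} from rows shaped like '  12 : C B B ...'."""
    rows = {}
    for line in text.splitlines():
        head, sep, tail = line.partition(" : ")
        # Only matrix rows have a bare integer before ' : '; header, key and
        # summary lines are skipped.
        if sep and head.strip().isdigit():
            rows[int(head)] = tail.split()
    return rows


def tally(rows):
    """Count how often each key letter appears in every host's row."""
    return {host: Counter(cells) for host, cells in rows.items()}


def asymmetric_pairs(rows):
    """List (i, j) with i < j where row i, column j differs from row j, column i."""
    hosts = sorted(rows)
    return [(i, j) for i in hosts for j in hosts
            if i < j and rows[i][j] != rows[j][i]]


# First four rows/columns of the table above, used as a small sample.
sample = """\
    0 : A C C C
    1 : C A C C
    2 : C B A C
    3 : C C C A
"""

rows = parse_matrix(sample)
print(tally(rows)[2])          # per-letter counts for host 2
print(asymmetric_pairs(rows))  # pairs whose two directions disagree
```

On this sample the only asymmetric pair is (1, 2): host 1 reports key C towards host 2, while host 2 reports key B towards host 1. Running the same functions on the full table should show how widespread that asymmetry is; the sketch says nothing about why it happens.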
