Background information
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
OpenMPI 5.0.1
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
OpenMPI was installed via Spack
spack install [email protected]+atomics+gpfs+openshmem+romio fabrics=hcoll,ucx schedulers=tm ^[email protected] ^[email protected] %[email protected] target=zen3
Please describe the system on which you are running
- Operating system/version: RHEL 8.8
- Computer hardware: AMD EPYC 7713 64-Core Processor x2
- Network type: Infiniband
Details of the problem
I am diagnosing performance issues on my cluster, and the output of --mca ompi_display_comm 1 caught my attention. Many off-host connections are reported with only a generic ucx transport (key B in the output below), with no device or transport detail. I expected all internode communication to use ucx=dc_mlx5;mlx5_0:1 (key C), which is what most off-host connections report. I enabled debugging for UCX, but nothing stands out to me. Any insight would be greatly appreciated!
$ tail -f *97877
Host 0 [i041] ranks 0 - 127
Host 1 [i042] ranks 128 - 255
Host 2 [i048] ranks 256 - 383
Host 3 [i063] ranks 384 - 511
Host 4 [i064] ranks 512 - 639
Host 5 [i067] ranks 640 - 767
Host 6 [i070] ranks 768 - 895
Host 7 [i075] ranks 896 - 1023
Host 8 [i090] ranks 1024 - 1151
Host 9 [i091] ranks 1152 - 1279
Host 10 [i092] ranks 1280 - 1407
Host 11 [i093] ranks 1408 - 1535
Host 12 [i094] ranks 1536 - 1663
Host 13 [i095] ranks 1664 - 1791
Host 14 [i096] ranks 1792 - 1919
Host 15 [i097] ranks 1920 - 2047
Host 16 [i098] ranks 2048 - 2175
Host 17 [i099] ranks 2176 - 2303
Host 18 [i100] ranks 2304 - 2431
Host 19 [i101] ranks 2432 - 2559
Host 20 [i102] ranks 2560 - 2687
Host 21 [i103] ranks 2688 - 2815
Host 22 [i104] ranks 2816 - 2943
Host 23 [i105] ranks 2944 - 3071
Host 24 [i113] ranks 3072 - 3199
Host 25 [i121] ranks 3200 - 3327
Host 26 [i122] ranks 3328 - 3455
Host 27 [i123] ranks 3456 - 3583
Host 28 [i124] ranks 3584 - 3711
Host 29 [i125] ranks 3712 - 3839
Host 30 [i126] ranks 3840 - 3967
Host 31 [i127] ranks 3968 - 4095
host | 0 1 2 3 4 8 12 16 20 24 28
======|=============================================================
0 : A C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C
1 : C A C C C C C C C C C C C C C C C C C B B B B B B B B B B B B B
2 : C B A C C C C C C C C C C C C C C C C C B B B B B B B B B B B B
3 : C C C A C C C C C C C C C C C C C C C C C B B B B B B B B B B B
4 : C B B B A C C C C C C C C C C C C C C C C C B B B B B B B B B B
5 : C C B B C A C C C C C C C C C C C C C C C C C B B B B B B B B B
6 : C B C B C B A C C C C C C C C C C C C C C C C C B B B B B B B B
7 : C B B C B C C A C C C C C C C C C C C C C C C C C B B B B B B B
8 : C B B B B B B B A C C C C C C C C C C C C C C C C C B B B B B B
9 : C C B B B B B B C A C C C C C C C C C C C C C C C C C B B B B B
10 : C B C B B B B B C B A C C C C C C C C C C C C C C C C C B B B B
11 : C B B C B B B B B C C A C C C C C C C C C C C C C C C C C B B B
12 : C B B B C B B B C B B B A C C C C C C C C C C C C C C C C C B B
13 : C B B B B C B B B C B B C A C C C C C C C C C C C C C C C C C B
14 : C B B B B B C B B B C B C B A C C C C C C C C C C C C C C C C C
15 : C B B B B B B C B B B C B C C A C C C C C C C C C C C C C C C C
16 : C C B B B B B B B B B B B B B B A C C C C C C C C C C C C C C C
17 : C C C B B B B B B B B B B B B B C A C C C C C C C C C C C C C C
18 : C C C C B B B B B B B B B B B B C B A C C C C C C C C C C C C C
19 : C C C C C B B B B B B B B B B B B C C A C C C C C C C C C C C C
20 : C C C C C C B B B B B B B B B B C B B B A C C C C C C C C C C C
21 : C C C C C C C B B B B B B B B B B C B B C A C C C C C C C C C C
22 : C C C C C C C C B B B B B B B B B B C B C B A C C C C C C C C C
23 : C C C C C C C C C B B B B B B B B B B C B C C A C C C C C C C C
24 : C C C C C C C C C C B B B B B B C B B B B B B B A C C C C C C C
25 : C C C C C C C C C C C B B B B B B C B B B B B B C A C C C C C C
26 : C C C C C C C C C C C C B B B B B B C B B B B B C B A C C C C C
27 : C C C C C C C C C C C C C B B B B B B C B B B B B C C A C C C C
28 : C C C C C C C C C C C C C C B B B B B B C B B B C B B B A C C C
29 : C C C C C C C C C C C C C C C B B B B B B C B B B C B B C A C C
30 : C C C C C C C C C C C C C C C C B B B B B B C B B B C B C B A C
31 : C C C C C C C C C C C C C C C C C B B B B B B C B B B C B C C A
key: A == ucx=sysv;memory,xpmem;memory,knem;memory
key: B == ucx
key: C == ucx=dc_mlx5;mlx5_0:1
Connection summary: (pml)
on-host: all connections are ucx=sysv;memory,xpmem;memory,knem;memory
off-host: most connections are ucx=dc_mlx5;mlx5_0:1
Exceptions:
host 1: [13x ucx] [18x ucx=dc_mlx5;mlx5_0:1] [1x ucx=sysv;memory,xpmem;memory,knem;memory]
host 2: [13x ucx] [18x ucx=dc_mlx5;mlx5_0:1] [1x ucx=sysv;memory,xpmem;memory,knem;memory]
host 3: [11x ucx] [20x ucx=dc_mlx5;mlx5_0:1] [1x ucx=sysv;memory,xpmem;memory,knem;memory]
host 4: [13x ucx] [18x ucx=dc_mlx5;mlx5_0:1] [1x ucx=sysv;memory,xpmem;memory,knem;memory]
host 5: [11x ucx] [20x ucx=dc_mlx5;mlx5_0:1] [1x ucx=sysv;memory,xpmem;memory,knem;memory]
host 6: [11x ucx] [20x ucx=dc_mlx5;mlx5_0:1] [1x ucx=sysv;memory,xpmem;memory,knem;memory]
host 7: [10x ucx] [21x ucx=dc_mlx5;mlx5_0:1] [1x ucx=sysv;memory,xpmem;memory,knem;memory]
host 8: [13x ucx] [18x ucx=dc_mlx5;mlx5_0:1] [1x ucx=sysv;memory,xpmem;memory,knem;memory]
host 9: [11x ucx] [20x ucx=dc_mlx5;mlx5_0:1] [1x ucx=sysv;memory,xpmem;memory,knem;memory]
host 10: [11x ucx] [20x ucx=dc_mlx5;mlx5_0:1] [1x ucx=sysv;memory,xpmem;memory,knem;memory]
host 11: [10x ucx] [21x ucx=dc_mlx5;mlx5_0:1] [1x ucx=sysv;memory,xpmem;memory,knem;memory]
host 12: [11x ucx] [20x ucx=dc_mlx5;mlx5_0:1] [1x ucx=sysv;memory,xpmem;memory,knem;memory]
host 13: [10x ucx] [21x ucx=dc_mlx5;mlx5_0:1] [1x ucx=sysv;memory,xpmem;memory,knem;memory]
host 14: [10x ucx] [21x ucx=dc_mlx5;mlx5_0:1] [1x ucx=sysv;memory,xpmem;memory,knem;memory]
host 15: [10x ucx] [21x ucx=dc_mlx5;mlx5_0:1] [1x ucx=sysv;memory,xpmem;memory,knem;memory]
host 16: [14x ucx] [17x ucx=dc_mlx5;mlx5_0:1] [1x ucx=sysv;memory,xpmem;memory,knem;memory]
host 17: [13x ucx] [18x ucx=dc_mlx5;mlx5_0:1] [1x ucx=sysv;memory,xpmem;memory,knem;memory]
host 18: [13x ucx] [18x ucx=dc_mlx5;mlx5_0:1] [1x ucx=sysv;memory,xpmem;memory,knem;memory]
host 19: [12x ucx] [19x ucx=dc_mlx5;mlx5_0:1] [1x ucx=sysv;memory,xpmem;memory,knem;memory]
host 20: [13x ucx] [18x ucx=dc_mlx5;mlx5_0:1] [1x ucx=sysv;memory,xpmem;memory,knem;memory]
host 21: [12x ucx] [19x ucx=dc_mlx5;mlx5_0:1] [1x ucx=sysv;memory,xpmem;memory,knem;memory]
host 22: [12x ucx] [19x ucx=dc_mlx5;mlx5_0:1] [1x ucx=sysv;memory,xpmem;memory,knem;memory]
host 23: [11x ucx] [20x ucx=dc_mlx5;mlx5_0:1] [1x ucx=sysv;memory,xpmem;memory,knem;memory]
host 24: [13x ucx] [18x ucx=dc_mlx5;mlx5_0:1] [1x ucx=sysv;memory,xpmem;memory,knem;memory]
host 25: [12x ucx] [19x ucx=dc_mlx5;mlx5_0:1] [1x ucx=sysv;memory,xpmem;memory,knem;memory]
host 26: [12x ucx] [19x ucx=dc_mlx5;mlx5_0:1] [1x ucx=sysv;memory,xpmem;memory,knem;memory]
host 27: [11x ucx] [20x ucx=dc_mlx5;mlx5_0:1] [1x ucx=sysv;memory,xpmem;memory,knem;memory]
host 28: [12x ucx] [19x ucx=dc_mlx5;mlx5_0:1] [1x ucx=sysv;memory,xpmem;memory,knem;memory]
host 29: [11x ucx] [20x ucx=dc_mlx5;mlx5_0:1] [1x ucx=sysv;memory,xpmem;memory,knem;memory]
host 30: [11x ucx] [20x ucx=dc_mlx5;mlx5_0:1] [1x ucx=sysv;memory,xpmem;memory,knem;memory]
host 31: [10x ucx] [21x ucx=dc_mlx5;mlx5_0:1] [1x ucx=sysv;memory,xpmem;memory,knem;memory]
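To make the pattern in the table easier to inspect, here is a minimal Python sketch that parses the matrix rows, tallies the keys per host, and lists host pairs whose two directions disagree (for example, in the table above host 1 reports B toward host 19, while host 19 reports C toward host 1). The function names and the small SAMPLE string are made up for illustration, and the row format is assumed from this paste only, not from any Open MPI specification.

```python
import re
from collections import Counter

# Matches matrix rows such as " 1 : C A C B"; other log lines are ignored.
# Assumption: rows look exactly like the ones in the pasted output.
ROW_RE = re.compile(r"^\s*(\d+)\s*:\s*((?:[A-Z]\s*)+)$")


def parse_matrix(text):
    """Return {host_index: [key_letter, ...]} for every matrix row found."""
    rows = {}
    for line in text.splitlines():
        match = ROW_RE.match(line)
        if match:
            rows[int(match.group(1))] = match.group(2).split()
    return rows


def tally(rows):
    """Count how often each key letter appears in each host's row."""
    return {host: Counter(letters) for host, letters in rows.items()}


def asymmetric_pairs(rows):
    """List (i, j, key_i_to_j, key_j_to_i) where the two directions differ."""
    pairs = []
    hosts = sorted(rows)
    for i in hosts:
        for j in hosts:
            if i >= j:
                continue
            # Skip incomplete rows instead of raising IndexError.
            if j >= len(rows[i]) or i >= len(rows[j]):
                continue
            if rows[i][j] != rows[j][i]:
                pairs.append((i, j, rows[i][j], rows[j][i]))
    return pairs


# Hypothetical 3-host sample in the same layout as the real table.
SAMPLE = """\
0 : A C C
1 : C A B
2 : C C A
"""

if __name__ == "__main__":
    rows = parse_matrix(SAMPLE)
    for host, counts in sorted(tally(rows).items()):
        print(host, dict(counts))
    for i, j, forward, backward in asymmetric_pairs(rows):
        print(f"host {i} -> {j}: {forward}, host {j} -> {i}: {backward}")
```

Replacing SAMPLE with the full pasted output should give per-host tallies that agree with the Exceptions list (for instance 13 B entries for host 1), and the asymmetric pairs show which host pairs report the generic ucx key in only one direction.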