Description
Background information
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
I am testing this on 5.0.3 vs. 5.0.5 (only 5.0.5 has this problem).
I don't have 5.0.4 installed, so I don't know if that is affected, but I am quite confident
that this also occurs for 5.0.4 (since the submodule for prrte is the same for 5.0.4 and 5.0.5).
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
From source. Here is some of the ompi_info -c output:
 Configure command line: 'CC=gcc' 'CXX=g++' 'FC=gfortran'
                          '--prefix=/appl9/gcc/13.3.0-binutils-2.42/openmpi/5.0.3-lsf10-alma92'
                          '--with-lsf=/lsf/10.1'
                          '--with-lsf-libdir=/lsf/10.1/linux3.10-glibc2.17-x86_64/lib'
                          '--without-tm' '--enable-mpi-fortran=all'
                          '--with-hwloc=/appl9/gcc/13.3.0-binutils-2.42/openmpi/5.0.3-lsf10-alma92'
                          '--enable-orterun-prefix-by-default'
                          '--with-ucx=/appl9/gcc/13.3.0-binutils-2.42/openmpi/5.0.3-lsf10-alma92'
                          '--with-ucc=/appl9/gcc/13.3.0-binutils-2.42/openmpi/5.0.3-lsf10-alma92'
                          '--with-knem=/appl9/gcc/13.3.0-binutils-2.42/openmpi/5.0.3-lsf10-alma92'
                          '--without-verbs' 'FCFLAGS=-O3 -march=haswell
                          -mtune=haswell -mavx2 -m64 
                          -Wl,-z,max-page-size=0x1000 -O3
                          -Wa,-mbranches-within-32B-boundaries
                          -falign-functions=32 -falign-loops=32' 'CFLAGS=-O3
                          -march=haswell -mtune=haswell -mavx2 -m64 
                          -Wl,-z,max-page-size=0x1000 -O3
                          -Wa,-mbranches-within-32B-boundaries
                          -falign-functions=32 -falign-loops=32'
                          'CXXFLAGS=-O3 -march=haswell -mtune=haswell -mavx2
                          -m64  -Wl,-z,max-page-size=0x1000 -O3
                          -Wa,-mbranches-within-32B-boundaries
                          -falign-functions=32 -falign-loops=32'
                          '--with-ofi=no'
                          '--with-libevent=/appl9/gcc/13.3.0-binutils-2.42/openmpi/5.0.3-lsf10-alma92'
                          'LDFLAGS=-L/appl9/gcc/13.3.0-binutils-2.42/openmpi/5.0.3-lsf10-alma92/lib
                          -L/lsf/10.1/linux3.10-glibc2.17-x86_64/lib
                          -Wl,-rpath,/appl9/gcc/13.3.0-binutils-2.42/openmpi/5.0.3-lsf10-alma92/lib
                          -Wl,-rpath,/lsf/10.1/linux3.10-glibc2.17-x86_64/lib
                          -lucp  -levent -lhwloc -latomic -llsf -lm -lpthread
                          -lnsl -lrt'
                          '--with-xpmem=/appl9/gcc/13.3.0-binutils-2.42/openmpi/5.0.3-lsf10-alma92'

And env-vars:
            Build CFLAGS: -DNDEBUG -O3 -march=haswell -mtune=haswell -mavx2
                          -m64  -Wl,-z,max-page-size=0x1000 -O3
                          -Wa,-mbranches-within-32B-boundaries
                          -falign-functions=32 -falign-loops=32
                          -finline-functions
           Build FCFLAGS: -O3 -march=haswell -mtune=haswell -mavx2 -m64 
                          -Wl,-z,max-page-size=0x1000 -O3
                          -Wa,-mbranches-within-32B-boundaries
                          -falign-functions=32 -falign-loops=32
           Build LDFLAGS: -L/appl9/gcc/13.3.0-binutils-2.42/openmpi/5.0.3-lsf10-alma92/lib
                          -L/lsf/10.1/linux3.10-glibc2.17-x86_64/lib
                          -Wl,-rpath,/appl9/gcc/13.3.0-binutils-2.42/openmpi/5.0.3-lsf10-alma92/lib
                          -Wl,-rpath,/lsf/10.1/linux3.10-glibc2.17-x86_64/lib
                          -lucp  -levent -lhwloc -latomic -llsf -lm -lpthread
                          -lnsl -lrt
                          -L/appl9/gcc/13.3.0-binutils-2.42/openmpi/5.0.3-lsf10-alma92/lib
                          -L/appl9/gcc/13.3.0-binutils-2.42/openmpi/5.0.3-lsf10-alma92/lib
              Build LIBS: -levent_core -levent_pthreads -lhwloc
                          /tmp/sebo3-gcc-13.3.0-binutils-2.42/openmpi-5.0.3/3rd-party/openpmix/src/libpmix.la

Version numbers are of course different for 5.0.5, otherwise the same.
Please describe the system on which you are running
- Operating system/version: Alma Linux 9.4

  $> cat /proc/version
  Linux version 6.1.106-1.el9.elrepo.x86_64 (mockbuild@83178ea248724ccf8c107949ffbafbc2) (gcc (GCC) 11.4.1 20231218 (Red Hat 11.4.1-3), GNU ld version 2.35.2-43.el9) #1 SMP PREEMPT_DYNAMIC Mon Aug 19 02:01:39 EDT 2024
- Computer hardware: Tested on various hardware, both with and without hardware threads (see below).
- Network type: Not relevant, I think.
Details of the problem
The problem relates to the interaction between LSF and Open MPI.
There are a couple of related issues, shown here.
Bug introduced between 5.0.3 and 5.0.5
I encounter problems running simple programs (hello-world) in a multinode configuration:
$> bsub -n 8 -R "span[ptile=2]" ... < run.bsub
$> cat run.bsub
...
mpirun --report-bindings a.out

This will run on 4 nodes, each using 2 cores.
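For completeness, a minimal reproduction along these lines might look as follows. This is only a sketch: the output option and the a.out hello-world binary are assumptions, not the exact contents of my submission.

$> cat run.bsub
#!/bin/bash
# hello-world compiled with mpicc; --report-bindings prints each rank's binding
mpirun --report-bindings ./a.out

$> bsub -n 8 -R "span[ptile=2]" -o hello.%J.out < run.bsub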
Output from:
- 5.0.3:

  [n-62-28-31:793074] Rank 0 bound to package[1][hwt:14]
  [n-62-28-31:793074] Rank 1 bound to package[1][hwt:15]
  [n-62-28-28:3418906] Rank 2 bound to package[1][hwt:12]
  [n-62-28-28:3418906] Rank 3 bound to package[1][hwt:13]
  [n-62-28-29:1577632] Rank 4 bound to package[1][hwt:12]
  [n-62-28-29:1577632] Rank 5 bound to package[1][hwt:13]
  [n-62-28-30:53375] Rank 6 bound to package[1][hwt:12]
  [n-62-28-30:53375] Rank 7 bound to package[1][hwt:13]

  This looks reasonable, and the LSF affinity file corresponds to this binding.
  Note that these nodes do not have hyper-threading enabled,
  so our guess is that LSF always expresses the affinity as HWTs, which is OK.
  It still obeys the default core binding, which is what our end-users would expect.
- 5.0.5:

  [n-62-28-31:793073:0:793073] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x34)
  ==== backtrace (tid: 793073) ====
   0 0x000000000003e6f0 __GI___sigaction() :0
   1 0x00000000000e94f4 prte_rmaps_rf_lsf_convert_affinity_to_rankfile() rmaps_rank_file.c:0
   2 0x00000000000e8fc1 prte_rmaps_rf_process_lsf_affinity_hostfile() rmaps_rank_file.c:0
   3 0x00000000000e684e prte_rmaps_rf_map() rmaps_rank_file.c:0
   4 0x00000000000da965 prte_rmaps_base_map_job() ???:0
   5 0x0000000000027cf9 event_process_active_single_queue() event.c:0
   6 0x000000000002856f event_base_loop() ???:0
   7 0x000000000040761a main() ???:0
   8 0x0000000000029590 __libc_start_call_main() ???:0
   9 0x0000000000029640 __libc_start_main_alias_2() :0
  10 0x0000000000407b05 _start() ???:0
  =================================
Clearly something went wrong when parsing the affinity hostfile.
The hostfile looks like this (for both 5.0.3 and 5.0.5):
$> cat $LSB_AFFINITY_HOSTFILE
n-62-28-31 16
n-62-28-31 17
n-62-28-28 14
n-62-28-28 15
n-62-28-29 14
n-62-28-29 15
n-62-28-30 14
n-62-28-30 15
(different job, hence different nodes/ranks)
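Judging by the name of the crashing function, prte_rmaps_rf_lsf_convert_affinity_to_rankfile(), each "hostname cpu-list" line above is meant to be turned into Open MPI rankfile syntax. A purely illustrative sketch of that mapping (not prrte's actual code), applied to the hostfile shown above:

# Convert each "hostname cpulist" line into "rank N=hostname slot=cpulist"
$> awk '{ printf "rank %d=%s slot=%s\n", NR-1, $1, $2 }' $LSB_AFFINITY_HOSTFILE
rank 0=n-62-28-31 slot=16
rank 1=n-62-28-31 slot=17
rank 2=n-62-28-28 slot=14
rank 3=n-62-28-28 slot=15
rank 4=n-62-28-29 slot=14
rank 5=n-62-28-29 slot=15
rank 6=n-62-28-30 slot=14
rank 7=n-62-28-30 slot=15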
 
So the above indicates some regression in this handling. I tried to track it down
in prrte, but I am not familiar enough with the logic happening there.
I tracked the prrte submodule hashes of Open MPI 5.0.3 through 5.0.5 to these:
- 5.0.3: 3a70fac9a21700b31c4a9f9958afa207a627f0fa
- 5.0.4: b68a0acb32cfc0d3c19249e5514820555bcf438b
- 5.0.5: b68a0acb32cfc0d3c19249e5514820555bcf438b

So my suspicion is that 5.0.4 also has this.
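For reference, the prrte commit pinned by a given Open MPI release can be checked from a git clone (a sketch; release tarballs ship the code under 3rd-party/ but carry no git metadata):

$> git clone --branch v5.0.5 https://github.com/open-mpi/ompi.git
$> cd ompi
$> git submodule status 3rd-party/prrte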
Now, these things are relatively easily fixed.
I just do:
unset LSB_AFFINITY_HOSTFILE and rely on cgroups. Then I get the correct behaviour.
Correct bindings etc.
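In the job script the workaround simply looks like this (sketch):

# Drop LSF's affinity hostfile so Open MPI falls back to its own mapping
# inside the cgroup that LSF has already set up for the job.
unset LSB_AFFINITY_HOSTFILE
mpirun --report-bindings a.out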
By unsetting it, I also fall back to the default Open MPI binding:
- 5.0.3:

  $> cat $LSB_AFFINITY_HOSTFILE ; unset LSB_AFFINITY_HOSTFILE
  n-62-28-31 18
  n-62-28-31 19
  n-62-28-28 16
  n-62-28-28 17
  n-62-28-29 16
  n-62-28-29 17
  n-62-28-30 16
  n-62-28-30 17

  (ompi binding)
  [n-62-28-31:793075] Rank 0 bound to package[1][core:18]
  [n-62-28-31:793075] Rank 1 bound to package[1][core:19]
  [n-62-28-28:3418905] Rank 2 bound to package[1][core:16]
  [n-62-28-28:3418905] Rank 3 bound to package[1][core:17]
  [n-62-28-29:1577633] Rank 4 bound to package[1][core:16]
  [n-62-28-29:1577633] Rank 5 bound to package[1][core:17]
  [n-62-28-30:53374] Rank 6 bound to package[1][core:16]
  [n-62-28-30:53374] Rank 7 bound to package[1][core:17]
  Note here that it says "core" instead of "hwt".

- 5.0.5:
  $> cat $LSB_AFFINITY_HOSTFILE ; unset LSB_AFFINITY_HOSTFILE
  n-62-28-28 18
  n-62-28-28 19
  n-62-28-29 18
  n-62-28-29 19
  n-62-28-30 18
  n-62-28-30 19
  n-62-28-33 12
  n-62-28-33 13

  (ompi binding)
  [n-62-28-28:3418897] Rank 0 bound to package[1][core:18]
  [n-62-28-28:3418897] Rank 1 bound to package[1][core:19]
  [n-62-28-29:1577625] Rank 2 bound to package[1][core:18]
  [n-62-28-29:1577625] Rank 3 bound to package[1][core:19]
  [n-62-28-33:2083367] Rank 7 bound to package[1][core:13]
  [n-62-28-33:2083367] Rank 6 bound to package[1][core:12]
  [n-62-28-30:53366] Rank 4 bound to package[1][core:18]
  [n-62-28-30:53366] Rank 5 bound to package[1][core:19]
  So the same thing happens here, good!
 
Nodes with HW threads
This is likely related to the above; I just put it here for completeness.
As mentioned above, I can do unset LSB_AFFINITY_HOSTFILE and get correct bindings.
However, this works only when there are no HWTs.
Here is the same thing for nodes with 2 HWTs per core (EPYC Milan, 32 cores per socket, 2 sockets).
Only requesting 4 cores here.
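The submission follows the same pattern as before, roughly as below (the resource string is an assumption, inferred from the 2-ranks-per-node layout in the output that follows):

$> bsub -n 4 -R "span[ptile=2]" ... < run.bsub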
- 5.0.3:

  $> cat $LSB_AFFINITY_HOSTFILE ; unset LSB_AFFINITY_HOSTFILE
  n-62-12-14 4,68
  n-62-12-14 5,69
  n-62-12-15 4,68
  n-62-12-15 5,69

  (ompi binding)
  [n-62-12-14:202682] Rank 0 bound to package[0][core:4]
  [n-62-12-14:202682] Rank 1 bound to package[0][core:5]
  [n-62-12-15:1179019] Rank 2 bound to package[0][core:4]
  [n-62-12-15:1179019] Rank 3 bound to package[0][core:5]
  This looks OK. Still binding to the cgroup cores.

- 5.0.5:

  $> cat $LSB_AFFINITY_HOSTFILE ; unset LSB_AFFINITY_HOSTFILE
  n-62-12-14 6,70
  n-62-12-14 7,71
  n-62-12-15 6,70
  n-62-12-15 7,71

  (ompi binding)
  [n-62-12-14:202680] Rank 0 bound to package[0][core:0]
  [n-62-12-14:202680] Rank 1 bound to package[0][core:1]
  [n-62-12-15:1179020] Rank 2 bound to package[0][core:0]
  [n-62-12-15:1179020] Rank 3 bound to package[0][core:1]
  This looks bad: wrong core binding; it should have been cores 6,7 on both nodes (see the cross-check sketched below).
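A quick way to cross-check the reported binding against what the job's cgroup actually allows is to inspect the affinity mask from within the job on the compute node (a sketch using standard Linux tools; output omitted since it is node-specific):

# Both show the CPUs the current process is actually allowed to use,
# which should match the cgroup LSF set up for the job.
$> grep Cpus_allowed_list /proc/self/status
$> taskset -cp $$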
 
If you need more information, let me know!