5.0.4 and newer -- LSF Affinity hostfile bug #12794

@zerothi

Description

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

I am testing this on 5.0.3 vs. 5.0.5 (only 5.0.5 has this problem).
I don't have 5.0.4 installed, so I can't verify whether it is affected, but I am quite confident
that this also occurs with 5.0.4 (since the prrte submodule is the same for 5.0.4 and 5.0.5).

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

From source. A bit of the ompi_info -c output:

 Configure command line: 'CC=gcc' 'CXX=g++' 'FC=gfortran'
                          '--prefix=/appl9/gcc/13.3.0-binutils-2.42/openmpi/5.0.3-lsf10-alma92'
                          '--with-lsf=/lsf/10.1'
                          '--with-lsf-libdir=/lsf/10.1/linux3.10-glibc2.17-x86_64/lib'
                          '--without-tm' '--enable-mpi-fortran=all'
                          '--with-hwloc=/appl9/gcc/13.3.0-binutils-2.42/openmpi/5.0.3-lsf10-alma92'
                          '--enable-orterun-prefix-by-default'
                          '--with-ucx=/appl9/gcc/13.3.0-binutils-2.42/openmpi/5.0.3-lsf10-alma92'
                          '--with-ucc=/appl9/gcc/13.3.0-binutils-2.42/openmpi/5.0.3-lsf10-alma92'
                          '--with-knem=/appl9/gcc/13.3.0-binutils-2.42/openmpi/5.0.3-lsf10-alma92'
                          '--without-verbs' 'FCFLAGS=-O3 -march=haswell
                          -mtune=haswell -mavx2 -m64 
                          -Wl,-z,max-page-size=0x1000 -O3
                          -Wa,-mbranches-within-32B-boundaries
                          -falign-functions=32 -falign-loops=32' 'CFLAGS=-O3
                          -march=haswell -mtune=haswell -mavx2 -m64 
                          -Wl,-z,max-page-size=0x1000 -O3
                          -Wa,-mbranches-within-32B-boundaries
                          -falign-functions=32 -falign-loops=32'
                          'CXXFLAGS=-O3 -march=haswell -mtune=haswell -mavx2
                          -m64  -Wl,-z,max-page-size=0x1000 -O3
                          -Wa,-mbranches-within-32B-boundaries
                          -falign-functions=32 -falign-loops=32'
                          '--with-ofi=no'
                          '--with-libevent=/appl9/gcc/13.3.0-binutils-2.42/openmpi/5.0.3-lsf10-alma92'
                          'LDFLAGS=-L/appl9/gcc/13.3.0-binutils-2.42/openmpi/5.0.3-lsf10-alma92/lib
                          -L/lsf/10.1/linux3.10-glibc2.17-x86_64/lib
                          -Wl,-rpath,/appl9/gcc/13.3.0-binutils-2.42/openmpi/5.0.3-lsf10-alma92/lib
                          -Wl,-rpath,/lsf/10.1/linux3.10-glibc2.17-x86_64/lib
                          -lucp  -levent -lhwloc -latomic -llsf -lm -lpthread
                          -lnsl -lrt'
                          '--with-xpmem=/appl9/gcc/13.3.0-binutils-2.42/openmpi/5.0.3-lsf10-alma92'

And the resulting build flags:

            Build CFLAGS: -DNDEBUG -O3 -march=haswell -mtune=haswell -mavx2
                          -m64  -Wl,-z,max-page-size=0x1000 -O3
                          -Wa,-mbranches-within-32B-boundaries
                          -falign-functions=32 -falign-loops=32
                          -finline-functions
           Build FCFLAGS: -O3 -march=haswell -mtune=haswell -mavx2 -m64 
                          -Wl,-z,max-page-size=0x1000 -O3
                          -Wa,-mbranches-within-32B-boundaries
                          -falign-functions=32 -falign-loops=32
           Build LDFLAGS: -L/appl9/gcc/13.3.0-binutils-2.42/openmpi/5.0.3-lsf10-alma92/lib
                          -L/lsf/10.1/linux3.10-glibc2.17-x86_64/lib
                          -Wl,-rpath,/appl9/gcc/13.3.0-binutils-2.42/openmpi/5.0.3-lsf10-alma92/lib
                          -Wl,-rpath,/lsf/10.1/linux3.10-glibc2.17-x86_64/lib
                          -lucp  -levent -lhwloc -latomic -llsf -lm -lpthread
                          -lnsl -lrt
                          -L/appl9/gcc/13.3.0-binutils-2.42/openmpi/5.0.3-lsf10-alma92/lib
                          -L/appl9/gcc/13.3.0-binutils-2.42/openmpi/5.0.3-lsf10-alma92/lib
              Build LIBS: -levent_core -levent_pthreads -lhwloc
                          /tmp/sebo3-gcc-13.3.0-binutils-2.42/openmpi-5.0.3/3rd-party/openpmix/src/libpmix.la

The version numbers are of course different for the 5.0.5 build; otherwise it is the same.

Please describe the system on which you are running

  • Operating system/version:

    Alma Linux 9.4

    $> cat /proc/version
    Linux version 6.1.106-1.el9.elrepo.x86_64 (mockbuild@83178ea248724ccf8c107949ffbafbc2) (gcc (GCC) 11.4.1 20231218 (Red Hat 11.4.1-3), GNU ld version 2.35.2-43.el9) #1 SMP PREEMPT_DYNAMIC Mon Aug 19 02:01:39 EDT 2024
  • Computer hardware:

    Tested on various hardware, both with and without hardware threads (see below).

  • Network type:
    Not relevant, I think.


Details of the problem

The problem relates to the interaction between LSF and OpenMPI.

A couple of issues are shown here.

Bug introduced between 5.0.3 and 5.0.5

I encounter problems running simple programs (hello-world) in a multinode configuration:

$> bsub -n 8 -R "span[ptile=2]" ... < run.bsub

$> cat run.bsub
...
mpirun --report-bindings a.out

This will run on 4 nodes, each using 2 cores (span[ptile=2] places 2 of the 8 slots per node, hence 4 nodes).
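
For reference, a minimal run.bsub along these lines could look as follows; a sketch only, since the actual script is truncated above, with the BSUB directives simply mirroring the bsub flags:

    #!/bin/sh
    #BSUB -n 8
    #BSUB -R "span[ptile=2]"
    # LSF exports LSB_AFFINITY_HOSTFILE for the job; mpirun picks it up for binding.
    mpirun --report-bindings a.out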

Output from:

  • 5.0.3:

    [n-62-28-31:793074] Rank 0 bound to package[1][hwt:14]
    [n-62-28-31:793074] Rank 1 bound to package[1][hwt:15]
    [n-62-28-28:3418906] Rank 2 bound to package[1][hwt:12]
    [n-62-28-28:3418906] Rank 3 bound to package[1][hwt:13]
    [n-62-28-29:1577632] Rank 4 bound to package[1][hwt:12]
    [n-62-28-29:1577632] Rank 5 bound to package[1][hwt:13]
    [n-62-28-30:53375] Rank 6 bound to package[1][hwt:12]
    [n-62-28-30:53375] Rank 7 bound to package[1][hwt:13]

    This looks reasonable, and the LSF affinity file corresponds to this binding.

    Note that these nodes do not have hyper-threading enabled,
    so our guess is that LSF always expresses affinity in terms of HWTs, which is OK.
    It still obeys the default core binding, which is what our end-users
    would expect.

  • 5.0.5

    [n-62-28-31:793073:0:793073] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x34)
    ==== backtrace (tid: 793073) ====
     0 0x000000000003e6f0 __GI___sigaction()  :0
     1 0x00000000000e94f4 prte_rmaps_rf_lsf_convert_affinity_to_rankfile()  rmaps_rank_file.c:0
     2 0x00000000000e8fc1 prte_rmaps_rf_process_lsf_affinity_hostfile()  rmaps_rank_file.c:0
     3 0x00000000000e684e prte_rmaps_rf_map()  rmaps_rank_file.c:0
     4 0x00000000000da965 prte_rmaps_base_map_job()  ???:0
     5 0x0000000000027cf9 event_process_active_single_queue()  event.c:0
     6 0x000000000002856f event_base_loop()  ???:0
     7 0x000000000040761a main()  ???:0
     8 0x0000000000029590 __libc_start_call_main()  ???:0
     9 0x0000000000029640 __libc_start_main_alias_2()  :0
    10 0x0000000000407b05 _start()  ???:0
    =================================

    Clearly something went wrong when parsing the affinity hostfile (a manual conversion sketch follows after this list).

    The hostfile looks like this (for both 5.0.3 and 5.0.5):

    $> cat $LSB_AFFINITY_HOSTFILE
    n-62-28-31 16
    n-62-28-31 17
    n-62-28-28 14
    n-62-28-28 15
    n-62-28-29 14
    n-62-28-29 15
    n-62-28-30 14
    n-62-28-30 15

    (different job, hence different nodes/ranks)
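
As a sanity check, the conversion that crashes here can be approximated by hand. A hedged sketch, assuming the two-column "host cpu-list" format above can be fed to the rankfile mapper with the LSF CPU ids used directly as slot indices (the file name myrankfile and that slot interpretation are my assumptions, not a verified equivalent of what prrte computes):

    # Hypothetical: emulate the affinity-hostfile -> rankfile conversion by hand.
    awk '{ printf "rank %d=%s slot=%s\n", NR-1, $1, $2 }' "$LSB_AFFINITY_HOSTFILE" > myrankfile
    mpirun --map-by rankfile:file=myrankfile --report-bindings a.out

If this runs while the built-in conversion segfaults, it would narrow the problem down to the parsing in prte_rmaps_rf_lsf_convert_affinity_to_rankfile.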

So the above indicates some regression in this handling. I tried to backtrack
it in prrte, but I am not skilled enough to follow the logic there.

I tracked the prrte submodule hashes of OpenMPI between 5.0.3 and 5.0.4, and since the
submodule changed there (while being the same for 5.0.4 and 5.0.5), my suspicion is that
5.0.4 also has this.

Now, these things are relatively easy to work around.

I just do:

unset LSB_AFFINITY_HOSTFILE

and rely on cgroups. Then I get the correct behaviour: correct bindings etc.
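
In the job script this amounts to the following, right before the launch (a minimal sketch; placing it in run.bsub is my convention, not a vetted fix):

    # Workaround: ignore LSF's affinity hostfile and let mpirun bind within the cgroup.
    unset LSB_AFFINITY_HOSTFILE
    mpirun --report-bindings a.out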

By unsetting it, I also fall back to the default OpenMPI binding:

  • 5.0.3

    $> cat $LSB_AFFINITY_HOSTFILE ; unset LSB_AFFINITY_HOSTFILE
    n-62-28-31 18
    n-62-28-31 19
    n-62-28-28 16
    n-62-28-28 17
    n-62-28-29 16
    n-62-28-29 17
    n-62-28-30 16
    n-62-28-30 17
    (ompi binding)
    [n-62-28-31:793075] Rank 0 bound to package[1][core:18]
    [n-62-28-31:793075] Rank 1 bound to package[1][core:19]
    [n-62-28-28:3418905] Rank 2 bound to package[1][core:16]
    [n-62-28-28:3418905] Rank 3 bound to package[1][core:17]
    [n-62-28-29:1577633] Rank 4 bound to package[1][core:16]
    [n-62-28-29:1577633] Rank 5 bound to package[1][core:17]
    [n-62-28-30:53374] Rank 6 bound to package[1][core:16]
    [n-62-28-30:53374] Rank 7 bound to package[1][core:17]

    Note here that it says core instead of hwt.

  • 5.0.5

    $> cat $LSB_AFFINITY_HOSTFILE ; unset LSB_AFFINITY_HOSTFILE
    n-62-28-28 18
    n-62-28-28 19
    n-62-28-29 18
    n-62-28-29 19
    n-62-28-30 18
    n-62-28-30 19
    n-62-28-33 12
    n-62-28-33 13
    (ompi binding)
    [n-62-28-28:3418897] Rank 0 bound to package[1][core:18]
    [n-62-28-28:3418897] Rank 1 bound to package[1][core:19]
    [n-62-28-29:1577625] Rank 2 bound to package[1][core:18]
    [n-62-28-29:1577625] Rank 3 bound to package[1][core:19]
    [n-62-28-33:2083367] Rank 7 bound to package[1][core:13]
    [n-62-28-33:2083367] Rank 6 bound to package[1][core:12]
    [n-62-28-30:53366] Rank 4 bound to package[1][core:18]
    [n-62-28-30:53366] Rank 5 bound to package[1][core:19]

    So the same thing happens, good!

Nodes with HW threads

This is likely related to the above; I just put it here for completeness.

As mentioned above, I can do unset LSB_AFFINITY_HOSTFILE and get correct bindings.

However, this works only when there are no HWTs.

Here is the same thing for nodes with 2 HWTs per core (EPYC Milan, 32 cores/socket, 2 sockets).

I am only requesting 4 cores here.
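
On these nodes, each "4,68"-style pair in the affinity file should be the two hardware threads of a single core; one way to check this is via standard Linux sysfs (that 68 is the sibling of core 4 is my inference from the hostfile below):

    $> cat /sys/devices/system/cpu/cpu4/topology/thread_siblings_list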

  • 5.0.3

    $> cat $LSB_AFFINITY_HOSTFILE ; unset LSB_AFFINITY_HOSTFILE
    n-62-12-14 4,68
    n-62-12-14 5,69
    n-62-12-15 4,68
    n-62-12-15 5,69
    (ompi binding)
    [n-62-12-14:202682] Rank 0 bound to package[0][core:4]
    [n-62-12-14:202682] Rank 1 bound to package[0][core:5]
    [n-62-12-15:1179019] Rank 2 bound to package[0][core:4]
    [n-62-12-15:1179019] Rank 3 bound to package[0][core:5]

    This looks OK. Still binding to the cgroup cores.

  • 5.0.5

    $> cat $LSB_AFFINITY_HOSTFILE ; unset LSB_AFFINITY_HOSTFILE
    n-62-12-14 6,70
    n-62-12-14 7,71
    n-62-12-15 6,70
    n-62-12-15 7,71
    (ompi binding)
    [n-62-12-14:202680] Rank 0 bound to package[0][core:0]
    [n-62-12-14:202680] Rank 1 bound to package[0][core:1]
    [n-62-12-15:1179020] Rank 2 bound to package[0][core:0]
    [n-62-12-15:1179020] Rank 3 bound to package[0][core:1]

    This looks bad: wrong core binding; it should have been cores 6 and 7 on both nodes (see the quick check sketched after this list).
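
A quick way to confirm that cores 0 and 1 lie outside what the job was actually given is to compare against the CPUs the cgroup allows, from inside the job (both commands are standard Linux; that they should report roughly 6-7,70-71 here is my assumption based on the hostfile above):

    $> grep Cpus_allowed_list /proc/self/status
    $> taskset -cp $$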

If you need more information, let me know!
