mpirun segfault while binding due to uninitialized variable #13407

@MeganD101

Description

Thank you for taking the time to submit an issue!

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

v5.0.5

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

git clone

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

$ git submodule status
 8ab6d680b90afd6e61766220a8724065a1b554a7 3rd-party/openpmix (v5.0.3)
 b68a0acb32cfc0d3c19249e5514820555bcf438b 3rd-party/prrte (v3.0.6)
 dfff67569fb72dbf8d73a1dcf74d091dad93f71b config/oac (dfff675)

Please describe the system on which you are running

  • Operating system/version:
  • Computer hardware:
  • Network type:
    \_> I don't believe this info is needed for this issue.

Details of the problem

Please describe, in detail, the problem that you are having, including the behavior you expect to see, the actual behavior that you are seeing, steps to reproduce the problem, etc. It is most helpful if you can attach a small program that a developer can use to reproduce your problem.

Note: If you include verbatim output (or a code block), please use a GitHub Markdown code block like below:

shell$ mpirun -n 2 ./hello_world

This occurred while running under an salloc -N2. Here is the topology that was being used:

$ srun -l --exclusive lstopo-no-graphics 
1: Machine (31GB total)
1:   Package L#0
1:     NUMANode L#0 (P#0 31GB)
1:     L3 L#0 (36MB) + L2 L#0 (2048KB) + L1d L#0 (48KB) + L1i L#0 (32KB) + Core L#0
1:       PU L#0 (P#0)
1:       PU L#1 (P#1)
1:   HostBridge
1:     PCIBridge
1:       PCI 01:00.0 (VGA)
1:         CoProc(OpenCL) "opencl0d0"
1:         GPU(Display) ":0.0"
1:     PCI 00:02.0 (Display)
1:     PCIBridge
1:       PCI 02:00.0 (NVMExp)
1:         Block(Disk) "nvme0n1"
1:     PCI 00:17.0 (SATA)
1:     PCIBridge
1:       PCI 04:00.0 (Ethernet)
1:         Net "enp4s0"
0: Machine (31GB total)
0:   Package L#0
0:     NUMANode L#0 (P#0 31GB)
0:     L3 L#0 (36MB) + L2 L#0 (2048KB) + L1d L#0 (48KB) + L1i L#0 (32KB) + Core L#0
0:       PU L#0 (P#0)
0:       PU L#1 (P#1)
0:   HostBridge
0:     PCIBridge
0:       PCI 01:00.0 (VGA)
0:         CoProc(OpenCL) "opencl0d0"
0:         GPU(Display) ":0.0"
0:     PCI 00:02.0 (Display)
0:     PCIBridge
0:       PCI 02:00.0 (NVMExp)
0:         Block(Disk) "nvme0n1"
0:     PCI 00:17.0 (SATA)
0:     PCIBridge
0:       PCI 04:00.0 (Ethernet)
0:         Net "enp4s0"

Here is the mpirun command that segfaults:

$ mpirun --bind-to=hwthread:overload-allowed -np 4 hostname
Segmentation fault (core dumped)

Here is the backtrace of the crash:

Thread 1 "prterun" received signal SIGSEGV, Segmentation fault.
0x00007ffff7d61a2f in bind_generic (jdata=0x5555557a81e0, proc=0x5555557aa640, node=0x55555579e030, obj=0x5555555be660, 
    options=0x7fffffffadb0) at ../../../../../../../source/3rd-party/prrte/src/mca/rmaps/base/rmaps_base_binding.c:142
142         tgtcpus = trg_obj->cpuset;
(gdb) bt
#0  0x00007ffff7d61a2f in bind_generic (jdata=0x5555557a81e0, proc=0x5555557aa640, node=0x55555579e030, obj=0x5555555be660, 
    options=0x7fffffffadb0) at ../../../../../../../source/3rd-party/prrte/src/mca/rmaps/base/rmaps_base_binding.c:142
#1  0x00007ffff7d6276d in prte_rmaps_base_bind_proc (jdata=0x5555557a81e0, proc=0x5555557aa640, node=0x55555579e030, obj=0x5555555be660, 
    options=0x7fffffffadb0) at ../../../../../../../source/3rd-party/prrte/src/mca/rmaps/base/rmaps_base_binding.c:463
#2  0x00007ffff7d5df3e in prte_rmaps_base_setup_proc (jdata=0x5555557a81e0, idx=0, node=0x55555579e030, obj=0x5555555be660, 
    options=0x7fffffffadb0) at ../../../../../../../source/3rd-party/prrte/src/mca/rmaps/base/rmaps_base_support_fns.c:568
#3  0x00007ffff7d6a696 in prte_rmaps_rr_byobj (jdata=0x5555557a81e0, app=0x5555557a9a70, node_list=0x7fffffffa930, num_slots=4, 
    num_procs=4, options=0x7fffffffadb0)
    at ../../../../../../../../source/3rd-party/prrte/src/mca/rmaps/round_robin/rmaps_rr_mappers.c:671
#4  0x00007ffff7d66c09 in prte_rmaps_rr_map (jdata=0x5555557a81e0, options=0x7fffffffadb0)
    at ../../../../../../../../source/3rd-party/prrte/src/mca/rmaps/round_robin/rmaps_rr.c:134
#5  0x00007ffff7d57301 in prte_rmaps_base_map_job (fd=-1, args=4, cbdata=0x55555556b8d0)
    at ../../../../../../../source/3rd-party/prrte/src/mca/rmaps/base/rmaps_base_map_job.c:839
#6  0x00007ffff7f892a8 in ?? () from /lib/x86_64-linux-gnu/libevent_core-2.1.so.7
#7  0x00007ffff7f8afaf in event_base_loop () from /lib/x86_64-linux-gnu/libevent_core-2.1.so.7
#8  0x000055555555ec0c in main (argc=5, argv=0x7fffffffbc28) at ../../../../../../../source/3rd-party/prrte/src/tools/prte/prte.c:1245
(gdb) p trg_obj
$2 = (hwloc_obj_t) 0xd

The segfault occurred because trg_obj was not initialized to NULL. It held a garbage value (0xd in the gdb session above), so the following error check in bind_generic() was skipped and the invalid pointer was dereferenced at line 142:

    if (NULL == trg_obj) {
        /* there aren't any appropriate targets under this object */
        if (PRTE_BINDING_REQUIRED(jdata->map->binding)) {
            pmix_show_help("help-prte-rmaps-base.txt", "rmaps:no-available-cpus", true, node->name);
            return PRTE_ERR_SILENT;
        } else {
            return PRTE_SUCCESS;
        }
    }
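
To illustrate the failure mode outside of PRRTE, here is a minimal, self-contained sketch. It is not the actual bind_generic() code: obj_t, pick_target(), and the BIND_* return codes are hypothetical names, and it assumes the target pointer is only assigned inside a search loop. Under that assumption, the `= NULL` initializer is what makes the "no appropriate target" check reachable when the loop finds nothing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a topology object; this is NOT the hwloc type. */
typedef struct obj {
    int usable;          /* nonzero if this object has CPUs we may bind to */
    struct obj *next;    /* next sibling, or NULL at the end of the list */
} obj_t;

/* Hypothetical return codes for this sketch only. */
enum { BIND_OK = 0, BIND_NO_TARGET = -1 };

/* Store the first usable object of the sibling list in *out.
 * Returns BIND_NO_TARGET (leaving *out untouched) if none is usable. */
static int pick_target(const obj_t *first, const obj_t **out)
{
    /* The fix: start from NULL. Without the initializer, trg would be
     * indeterminate whenever the loop below never assigns it, and the
     * NULL check after the loop could be skipped. */
    const obj_t *trg = NULL;

    assert(out != NULL);

    for (const obj_t *o = first; o != NULL; o = o->next) {
        if (o->usable) {
            trg = o;
            break;
        }
    }

    if (NULL == trg) {
        /* there aren't any appropriate targets under this object */
        return BIND_NO_TARGET;
    }

    *out = trg;
    return BIND_OK;
}
```

With the initializer in place, a list that contains no usable object takes the error path instead of handing back a garbage pointer for the caller to dereference.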
