Description
Thank you for taking the time to submit an issue!
Background information
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
v5.0.5
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
git clone
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.
$ git submodule status
8ab6d680b90afd6e61766220a8724065a1b554a7 3rd-party/openpmix (v5.0.3)
b68a0acb32cfc0d3c19249e5514820555bcf438b 3rd-party/prrte (v3.0.6)
dfff67569fb72dbf8d73a1dcf74d091dad93f71b config/oac (dfff675)
Please describe the system on which you are running
- Operating system/version:
- Computer hardware:
- Network type:
> I don't believe this info is needed for this issue.
Details of the problem
Please describe, in detail, the problem that you are having, including the behavior you expect to see, the actual behavior that you are seeing, steps to reproduce the problem, etc. It is most helpful if you can attach a small program that a developer can use to reproduce your problem.
Note: If you include verbatim output (or a code block), please use a GitHub Markdown code block like below:
shell$ mpirun -n 2 ./hello_world
This occurred while running under an `salloc -N2` allocation. Here is the topology that was being used:
$ srun -l --exclusive lstopo-no-graphics
1: Machine (31GB total)
1: Package L#0
1: NUMANode L#0 (P#0 31GB)
1: L3 L#0 (36MB) + L2 L#0 (2048KB) + L1d L#0 (48KB) + L1i L#0 (32KB) + Core L#0
1: PU L#0 (P#0)
1: PU L#1 (P#1)
1: HostBridge
1: PCIBridge
1: PCI 01:00.0 (VGA)
1: CoProc(OpenCL) "opencl0d0"
1: GPU(Display) ":0.0"
1: PCI 00:02.0 (Display)
1: PCIBridge
1: PCI 02:00.0 (NVMExp)
1: Block(Disk) "nvme0n1"
1: PCI 00:17.0 (SATA)
1: PCIBridge
1: PCI 04:00.0 (Ethernet)
1: Net "enp4s0"
0: Machine (31GB total)
0: Package L#0
0: NUMANode L#0 (P#0 31GB)
0: L3 L#0 (36MB) + L2 L#0 (2048KB) + L1d L#0 (48KB) + L1i L#0 (32KB) + Core L#0
0: PU L#0 (P#0)
0: PU L#1 (P#1)
0: HostBridge
0: PCIBridge
0: PCI 01:00.0 (VGA)
0: CoProc(OpenCL) "opencl0d0"
0: GPU(Display) ":0.0"
0: PCI 00:02.0 (Display)
0: PCIBridge
0: PCI 02:00.0 (NVMExp)
0: Block(Disk) "nvme0n1"
0: PCI 00:17.0 (SATA)
0: PCIBridge
0: PCI 04:00.0 (Ethernet)
0: Net "enp4s0"
Here is the mpirun command that segfaults:
$ mpirun --bind-to=hwthread:overload-allowed -np 4 hostname
Segmentation fault (core dumped)
Here is the backtrace of the crash:
Thread 1 "prterun" received signal SIGSEGV, Segmentation fault.
0x00007ffff7d61a2f in bind_generic (jdata=0x5555557a81e0, proc=0x5555557aa640, node=0x55555579e030, obj=0x5555555be660,
options=0x7fffffffadb0) at ../../../../../../../source/3rd-party/prrte/src/mca/rmaps/base/rmaps_base_binding.c:142
142 tgtcpus = trg_obj->cpuset;
(gdb) bt
#0 0x00007ffff7d61a2f in bind_generic (jdata=0x5555557a81e0, proc=0x5555557aa640, node=0x55555579e030, obj=0x5555555be660,
options=0x7fffffffadb0) at ../../../../../../../source/3rd-party/prrte/src/mca/rmaps/base/rmaps_base_binding.c:142
#1 0x00007ffff7d6276d in prte_rmaps_base_bind_proc (jdata=0x5555557a81e0, proc=0x5555557aa640, node=0x55555579e030, obj=0x5555555be660,
options=0x7fffffffadb0) at ../../../../../../../source/3rd-party/prrte/src/mca/rmaps/base/rmaps_base_binding.c:463
#2 0x00007ffff7d5df3e in prte_rmaps_base_setup_proc (jdata=0x5555557a81e0, idx=0, node=0x55555579e030, obj=0x5555555be660,
options=0x7fffffffadb0) at ../../../../../../../source/3rd-party/prrte/src/mca/rmaps/base/rmaps_base_support_fns.c:568
#3 0x00007ffff7d6a696 in prte_rmaps_rr_byobj (jdata=0x5555557a81e0, app=0x5555557a9a70, node_list=0x7fffffffa930, num_slots=4,
num_procs=4, options=0x7fffffffadb0)
at ../../../../../../../../source/3rd-party/prrte/src/mca/rmaps/round_robin/rmaps_rr_mappers.c:671
#4 0x00007ffff7d66c09 in prte_rmaps_rr_map (jdata=0x5555557a81e0, options=0x7fffffffadb0)
at ../../../../../../../../source/3rd-party/prrte/src/mca/rmaps/round_robin/rmaps_rr.c:134
#5 0x00007ffff7d57301 in prte_rmaps_base_map_job (fd=-1, args=4, cbdata=0x55555556b8d0)
at ../../../../../../../source/3rd-party/prrte/src/mca/rmaps/base/rmaps_base_map_job.c:839
#6 0x00007ffff7f892a8 in ?? () from /lib/x86_64-linux-gnu/libevent_core-2.1.so.7
#7 0x00007ffff7f8afaf in event_base_loop () from /lib/x86_64-linux-gnu/libevent_core-2.1.so.7
#8 0x000055555555ec0c in main (argc=5, argv=0x7fffffffbc28) at ../../../../../../../source/3rd-party/prrte/src/tools/prte/prte.c:1245
(gdb) p trg_obj
$2 = (hwloc_obj_t) 0xd
The segfault occurred because trg_obj was not initialized to NULL; it held a garbage value (0xd above), which prevented the following error-handling case in bind_generic() from being taken:
if (NULL == trg_obj) {
/* there aren't any appropriate targets under this object */
if (PRTE_BINDING_REQUIRED(jdata->map->binding)) {
pmix_show_help("help-prte-rmaps-base.txt", "rmaps:no-available-cpus", true, node->name);
return PRTE_ERR_SILENT;
} else {
return PRTE_SUCCESS;
}
}