Description
I have had occasional segfaults of prte when starting jobs and have debugged them. This is my analysis and how I worked around them (the errors are in the prte bundled with Open MPI 5.0.8).
FIRST PROBLEM:
File 3rd-party/prrte/src/mca/plm/base/plm_base_launch_support.c, function
prte_plm_base_daemon_callback:
The vector prte_node_topologies may be filled with entries t for which t->topo == NULL, e.g.
if only a signature is added to the topology array. This causes a segfault later on when
either prte_hwloc_base_setup_summary() or prte_hwloc_base_filter_cpus()
is called with a NULL pointer. A quick-and-dirty fix is to catch this case in the file
3rd-party/prrte/src/hwloc/hwloc_base_util.c
by adding a NULL check at the top of these two functions, that is:
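    hwloc_cpuset_t prte_hwloc_base_setup_summary(hwloc_topology_t topo)
    {
        /* added: no topology loaded for this entry, return a full bitmap */
        if (topo == NULL) {
            hwloc_cpuset_t bogus = hwloc_bitmap_alloc();
            hwloc_bitmap_fill(bogus);
            return bogus;
        }
        /* ... original function body unchanged ... */
    }

    hwloc_cpuset_t prte_hwloc_base_filter_cpus(hwloc_topology_t topo)
    {
        /* added: same guard as above */
        if (topo == NULL) {
            hwloc_cpuset_t bogus = hwloc_bitmap_alloc();
            hwloc_bitmap_fill(bogus);
            return bogus;
        }
        /* ... original function body unchanged ... */
    }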
(the if clauses are added). I do not care much about the return value, since I do not use
CPU pinning. Note also that the return value of prte_hwloc_base_setup_summary is mostly not used
by its callers anyway (which is itself a memory leak). This fix at least makes it work for me; it
cures a symptom, not the cause.
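To check whether a given run actually hits this case, one could count the signature-only entries before the two functions are called. The following is a minimal sketch on stand-in types, not PRRTE code: topo_entry_t and count_signature_only are hypothetical names, and the void * field merely stands in for hwloc_topology_t.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for prte_topology_t; only these two fields matter here. */
typedef struct {
    const char *sig;  /* topology signature string */
    void *topo;       /* stand-in for hwloc_topology_t; NULL if never loaded */
} topo_entry_t;

/* Count the entries that carry a signature but no loaded topology. */
static size_t count_signature_only(topo_entry_t *const *entries, size_t n)
{
    size_t count = 0;

    assert(entries != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (entries[i] == NULL) {
            continue;               /* empty slot in the array */
        }
        if (entries[i]->topo == NULL) {
            count++;                /* signature only, would crash later */
        }
    }
    return count;
}
```

A nonzero result would confirm that some caller hands a NULL topology to the hwloc helpers, which is the place a proper fix would presumably have to address.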
SECOND PROBLEM:
The vector prte_node_topologies may contain two different entries t which have the same pointer
t->topo. This leads to a segfault in prte_finalize() when the second of these two entries is released.
I had no time to dig into this deeply; an even quicker-and-dirtier fix is to just comment out the
release at the end of
prte_finalize() in file 3rd-party/prrte/src/runtime/prte_finalize.c:
    for (n = 0; n < prte_node_topologies->size; n++) {
        topo = (prte_topology_t *) pmix_pointer_array_get_item(prte_node_topologies, n);
        if (NULL == topo) {
            continue;
        }
        pmix_pointer_array_set_item(prte_node_topologies, n, NULL);
        //PMIX_RELEASE(topo);
    }
    PMIX_RELEASE(prte_node_topologies);
(Note that the line PMIX_RELEASE(topo) is commented out.)
I know the "true" solution looks different, but I am sharing my work in the hope it might be
useful for others.
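A less leaky alternative might be to make sure each shared pointer is destroyed only once. The sketch below illustrates that idea on stand-in types, assuming (as the crash suggests) that releasing an entry also destroys its t->topo. shared_entry_t, drop_aliases and release_entries are hypothetical names, and plain malloc/free stand in for the real object and topology lifecycle.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for prte_topology_t: only the shared pointer matters. */
typedef struct {
    void *topo;  /* stand-in for hwloc_topology_t */
} shared_entry_t;

/* Clear every later alias of a topo pointer so that one owner remains. */
static size_t drop_aliases(shared_entry_t **entries, size_t n)
{
    size_t cleared = 0;

    for (size_t i = 0; i < n; i++) {
        if (entries[i] == NULL || entries[i]->topo == NULL) {
            continue;
        }
        for (size_t j = i + 1; j < n; j++) {
            if (entries[j] != NULL && entries[j]->topo == entries[i]->topo) {
                entries[j]->topo = NULL;  /* entry i keeps ownership */
                cleared++;
            }
        }
    }
    return cleared;
}

/* Free every entry, and every distinct topo once. Returns the number of topo frees. */
static size_t release_entries(shared_entry_t **entries, size_t n)
{
    size_t freed = 0;

    assert(entries != NULL || n == 0);
    drop_aliases(entries, n);
    for (size_t i = 0; i < n; i++) {
        if (entries[i] == NULL) {
            continue;
        }
        if (entries[i]->topo != NULL) {
            free(entries[i]->topo);
            freed++;
        }
        free(entries[i]);
        entries[i] = NULL;
    }
    return freed;
}
```

The pairwise scan is quadratic in the number of entries, which should be acceptable if the array only holds a handful of distinct topologies; whether this matches the ownership rules PRRTE intends is for the maintainers to judge.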
Yours,
C. van Wüllen