OMPI 5.0.8 occasionally segfaults upon start (with analysis and fix) #13357

@dl1ycf

I have had occasional segfaults of prte when starting jobs, and I debugged them. This is my analysis and how I fixed it (the errors are in the prte bundled with openmpi 5.0.8).

FIRST PROBLEM:

file 3rd-party/prrte/src/mca/plm/base/plm_base_launch_support.c, in function
prte_plm_base_daemon_callback:

The vector prte_node_topologies may be filled with entries t for which t->topo == NULL, e.g.
if only a signature is added to the topology array. This causes a segfault later on when
either prte_hwloc_base_setup_summary() or prte_hwloc_base_filter_cpus()
is called with a NULL pointer. A quick-and-dirty fix is to catch this case in the file

3rd-party/prrte/src/hwloc/hwloc_base_util.c

by adding the "catch" at the top of these two functions, that is:


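hwloc_cpuset_t prte_hwloc_base_setup_summary(hwloc_topology_t topo)
{
    /* added: entry has no topology loaded, return a completely filled bitmap */
    if (topo == NULL) {
        hwloc_cpuset_t bogus = hwloc_bitmap_alloc();
        hwloc_bitmap_fill(bogus);
        return bogus;
    }
    /* ... rest of the function unchanged ... */


hwloc_cpuset_t prte_hwloc_base_filter_cpus(hwloc_topology_t topo)
{
    /* added: same guard as above */
    if (topo == NULL) {
        hwloc_cpuset_t bogus = hwloc_bitmap_alloc();
        hwloc_bitmap_fill(bogus);
        return bogus;
    }
    /* ... rest of the function unchanged ... */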

(The if clauses are the added lines.) Note that I do not care much about the return value, since I do not use
CPU pinning. Note also that the return value of prte_hwloc_base_setup_summary is mostly not used
anyway (which is a memory leak). This fix at least makes it work for me, but it cures a symptom, not the
cause.
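
An alternative to guarding inside the two functions would be to skip such entries on the caller side. The following is only a minimal sketch of that idea, not PRRTE code: the sketch_* types and the function sketch_count_usable are hypothetical stand-ins for prte_topology_t, hwloc_topology_t, and a loop over prte_node_topologies.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for hwloc_topology_t (assumption, not the hwloc type). */
typedef struct {
    int ncpus;
} sketch_hw_topology_t;

/* Hypothetical stand-in for prte_topology_t: a signature plus an optional topology. */
typedef struct {
    const char *sig;
    sketch_hw_topology_t *topo; /* NULL if only the signature was stored */
} sketch_topology_t;

/* Count the entries that can safely be passed to code that dereferences
 * t->topo. Empty slots and signature-only entries are skipped. */
static size_t sketch_count_usable(sketch_topology_t *const *entries, size_t n)
{
    size_t usable = 0;

    assert(entries != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        const sketch_topology_t *t = entries[i];
        if (t == NULL || t->topo == NULL) {
            continue; /* nothing to summarize or filter for this entry */
        }
        usable++;
    }
    return usable;
}
```

In a real loop, the `usable++` line would be replaced by the call that needs a non-NULL topology. Whether skipping is correct at every call site in prte is not something this sketch can decide.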

SECOND PROBLEM:

The vector prte_node_topologies may contain two different entries t which have the same pointer
t->topo. This leads to a segfault in prte_finalize() when the second of these two entries is released.
I had no time to dig into this deeply. An even quicker-and-dirtier fix is to just comment out the
release at the end of

prte_finalize() in file 3rd-party/prrte/src/runtime/prte_finalize.c:


for (n = 0; n < prte_node_topologies->size; n++) {
    topo = (prte_topology_t *) pmix_pointer_array_get_item(prte_node_topologies, n);
    if (NULL == topo) {
        continue;
    }
    pmix_pointer_array_set_item(prte_node_topologies, n, NULL);

    //PMIX_RELEASE(topo);
}
PMIX_RELEASE(prte_node_topologies);

(Note the line PMIX_RELEASE(topo) is commented out).
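
Commenting out the release avoids the crash but leaks every entry. A less drastic variant would be to free each shared pointer only once. The following is a minimal sketch under stated assumptions, not the PRRTE implementation: the demo_* types and demo_release_all are hypothetical, plain malloc/free stands in for the PMIX object system, and it assumes the crash is a double free of the shared t->topo.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for hwloc_topology_t (assumption). */
typedef struct {
    int ncpus;
} demo_hw_topology_t;

/* Hypothetical stand-in for prte_topology_t (assumption). */
typedef struct {
    demo_hw_topology_t *topo; /* may be shared by several entries */
} demo_topology_t;

/* Free all entries of the array. A topology pointer shared by several
 * entries is freed exactly once. Returns the number of distinct
 * topologies that were freed. */
static size_t demo_release_all(demo_topology_t **entries, size_t n)
{
    size_t freed = 0;

    assert(entries != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        demo_topology_t *t = entries[i];
        if (t == NULL) {
            continue;
        }
        entries[i] = NULL;
        if (t->topo != NULL) {
            /* Clear aliases in later entries so they do not free it again. */
            for (size_t j = i + 1; j < n; j++) {
                if (entries[j] != NULL && entries[j]->topo == t->topo) {
                    entries[j]->topo = NULL;
                }
            }
            free(t->topo);
            freed++;
        }
        free(t);
    }
    return freed;
}
```

The alias scan is quadratic in the number of entries, which should not matter for the small number of distinct node topologies in a typical job.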

I know the "true" solution looks different, but I am sharing my work in the hope that it might be
useful for others.

Yours,

C. van Wüllen
