MPI Errhandler causes segfault on receipt of signal #1530

@rhc54

While tracking down an apparent "hang" on the 1.10.3 branch, I discovered that the job wasn't actually hanging. Nearly every proc was segfaulting and writing out a core dump, and writing all those core dumps took so long that MTT thought we had hung.

The test is Intel MPI_Errhandler_fatal1_f (a few others show the same symptom; this is just the one I was working with). Everything starts correctly, and then rank=0 calls MPI_Abort. ORTE correctly identifies the situation and issues "kill" signals to the other procs: SIGCONT first, then SIGTERM, and finally SIGKILL.
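For anyone trying to reproduce this outside of a full MPI job, the signal sequence described above can be approximated with plain POSIX calls. The following is only a rough sketch, not ORTE's actual code: the helper names (`sleep_ms`, `escalate_and_reap`, `describe_exit`) and the delay between signals are assumptions made for illustration. It sends SIGCONT, SIGTERM, and SIGKILL in that order and then reports how the child actually died, which is the piece of information that separates "terminated by SIGTERM" from "segfaulted and dumped core".

```c
/* Sketch only: NOT ORTE's implementation. Helper names and the delay
 * between signals are hypothetical, chosen for illustration. */
#define _DEFAULT_SOURCE
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Sleep for roughly `ms` milliseconds. */
static void sleep_ms(unsigned ms)
{
    struct timespec ts;
    ts.tv_sec = ms / 1000;
    ts.tv_nsec = (long)(ms % 1000) * 1000000L;
    nanosleep(&ts, NULL);
}

/* Send SIGCONT, then SIGTERM, then SIGKILL to `pid`, pausing `delay_ms`
 * after each signal and stopping as soon as the child has been reaped.
 * Returns the waitpid() status, or -1 on error. */
static int escalate_and_reap(pid_t pid, unsigned delay_ms)
{
    const int sequence[] = { SIGCONT, SIGTERM, SIGKILL };
    int status = 0;

    for (size_t i = 0; i < sizeof sequence / sizeof sequence[0]; i++) {
        if (kill(pid, sequence[i]) != 0)
            break;                      /* the process may already be gone */
        sleep_ms(delay_ms);
        pid_t r = waitpid(pid, &status, WNOHANG);
        if (r == pid)
            return status;              /* child exited; no need to escalate */
        if (r < 0)
            return -1;
    }
    /* Block until the child is reaped. */
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return status;
}

/* Turn a wait status into a short description of how the child ended. */
static const char *describe_exit(int status)
{
    if (WIFEXITED(status))
        return "exited normally";
    if (WIFSIGNALED(status)) {
        if (WTERMSIG(status) == SIGSEGV)
            return "killed by SIGSEGV";
#ifdef WCOREDUMP                        /* not available on every platform */
        if (WCOREDUMP(status))
            return "killed by signal, core dumped";
#endif
        return "killed by signal";
    }
    return "other";
}
```

A caller would fork a child, pass its pid to `escalate_and_reap`, and feed the result to `describe_exit`. A well-behaved proc should report a plain signal death from SIGTERM; a report of SIGSEGV would match the symptom described here.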

The output looks okay:

$ mpirun -n 16 ./MPI_Errhandler_fatal1_f
 MPITEST_INFO (         0): Starting test MPI_Errhandler_fatal     
 MPITEST_INFO (         0): This test should abort after printing the results
 MPITEST_INFO (         0): message, otherwise a f.a.i.l.u.r.e is noted
MPITEST_results: MPI_Errhandler_fatal all tests PASSED (        16)
[bend001:21445] *** An error occurred in MPI_Send
[bend001:21445] *** reported by process [139981018365953,21474836480]
[bend001:21445] *** on communicator MPI COMMUNICATOR 3 DUP FROM 0
[bend001:21445] *** MPI_ERR_RANK: invalid rank
[bend001:21445] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[bend001:21445] ***    and potentially your MPI job)

However, writing out the core files takes a long time, and the shell prompt doesn't return until the dumps are complete. MTT therefore thinks the job timed out and tries to kill it, and fails because mpirun is already dead.
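One way to check whether core-dump writing really accounts for the delay would be to rerun with core files disabled and see if the prompt returns promptly. A minimal sketch, assuming a POSIX system: the helper name `disable_core_dumps` is hypothetical, and this only hides the dumps; it does nothing about the segfault itself.

```c
/* Sketch only: diagnostic aid, not a fix. `disable_core_dumps` is a
 * hypothetical helper name. */
#include <sys/resource.h>

/* Lower the soft core-file size limit to 0 so a crashing process
 * writes no core file. Returns 0 on success, -1 on failure. */
static int disable_core_dumps(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) != 0)
        return -1;
    rl.rlim_cur = 0;                    /* soft limit only; hard limit is untouched */
    return setrlimit(RLIMIT_CORE, &rl);
}
```

The limit is inherited across fork and exec, so calling this in a small launcher before starting the job should have the same effect as lowering the core limit in the shell.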

I have no idea why this test is segfaulting upon receipt of a signal. Can someone please investigate?

@ggouaillardet @bosilca @hjelmn @jsquyres

Not sure which of you might have time and knowledge to dig this one out.
