In trying to track down an apparent "hang" on the 1.10.3 branch, I discovered that the job wasn't actually hanging. Instead, it was taking a long time to complete because nearly every proc was segfaulting and writing out a core dump - and writing all those core dumps took so long that MTT thought we had hung.
The test is the Intel MPI_Errhandler_fatal1_f test (a few others show the same symptom; this is just the one I was working with). Everything starts correctly, and then rank 0 calls MPI_Abort. ORTE correctly identifies the situation and issues "kill" signals to the other procs: SIGCONT first, then SIGTERM, and finally SIGKILL.
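For reference, the pattern the test exercises looks roughly like the sketch below (my own minimal C approximation, not the actual Intel Fortran source; the abort is triggered by an MPI_Send to an invalid rank under MPI_ERRORS_ARE_FATAL, which is what shows up in the output further down):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, token = 42;
    MPI_Comm dup_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* The test runs on a dup of MPI_COMM_WORLD ("DUP FROM 0" in the output). */
    MPI_Comm_dup(MPI_COMM_WORLD, &dup_comm);

    if (rank == 0) {
        printf("all tests PASSED (%d)\n", size);
        fflush(stdout);
        /* Send to a rank that does not exist. Under the default
         * MPI_ERRORS_ARE_FATAL handler this raises MPI_ERR_RANK and
         * aborts the job; ORTE then signals the remaining procs with
         * SIGCONT, SIGTERM, and finally SIGKILL. */
        MPI_Send(&token, 1, MPI_INT, size + 1, 0, dup_comm);
    }

    /* The other ranks sit here until they are killed. */
    MPI_Barrier(dup_comm);

    MPI_Comm_free(&dup_comm);
    MPI_Finalize();
    return 0;
}
```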
The output from the actual run looks fine:
$ mpirun -n 16 ./MPI_Errhandler_fatal1_f
 MPITEST_INFO (         0): Starting test MPI_Errhandler_fatal     
 MPITEST_INFO (         0): This test should abort after printing the results
 MPITEST_INFO (         0): message, otherwise a f.a.i.l.u.r.e is noted
MPITEST_results: MPI_Errhandler_fatal all tests PASSED (        16)
[bend001:21445] *** An error occurred in MPI_Send
[bend001:21445] *** reported by process [139981018365953,21474836480]
[bend001:21445] *** on communicator MPI COMMUNICATOR 3 DUP FROM 0
[bend001:21445] *** MPI_ERR_RANK: invalid rank
[bend001:21445] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[bend001:21445] ***    and potentially your MPI job)

However, writing out the core files takes a long time, and the shell prompt doesn't return until that dump is complete. MTT therefore thinks the job timed out and tries to kill it - and fails, because mpirun is already dead.
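One way to confirm that core-dump writing (rather than a true hang) accounts for the delay would be to suppress core files for the test procs, either with "ulimit -c 0" in the shell before launching mpirun or by lowering RLIMIT_CORE early in the test. A minimal sketch (hypothetical, not part of MTT or the test suite):

```c
/* Hypothetical check: drop the core-file size limit to zero before doing
 * anything else, so a later SIGSEGV terminates the process immediately
 * instead of spending time writing a core dump. */
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;
    rl.rlim_cur = 0;   /* soft limit: no core file */
    rl.rlim_max = 0;   /* hard limit: cannot be raised again unprivileged */

    if (setrlimit(RLIMIT_CORE, &rl) != 0) {
        perror("setrlimit(RLIMIT_CORE)");
        return 1;
    }

    printf("RLIMIT_CORE set to 0; core dumps suppressed\n");
    return 0;
}
```

If the job then exits promptly while still reporting the segfaults, that would point at core-dump I/O rather than a real hang.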
I have no idea why this test is segfaulting upon receipt of a signal. Can someone please investigate?
@ggouaillardet @bosilca @hjelmn @jsquyres
Not sure which of you might have time and knowledge to dig this one out.