In trying to track down an apparent "hang" on the 1.10.3 branch, I discovered that the job wasn't actually hanging. Instead, it was taking a long time to complete because nearly every proc was segfaulting and writing out a core dump - and writing all those core dumps took so long that MTT thought we had hung.
The test is the Intel MPI_Errhandler_fatal1_f test (a few others show the same symptom; this is just the one I was working with). Everything starts correctly, and then rank 0 calls MPI_Abort. ORTE correctly identifies the situation and issues "kill" signals to the other procs: SIGCONT first, then SIGTERM, and finally SIGKILL.
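For reference, the pattern the test exercises looks roughly like the sketch below (my own minimal C approximation, not the actual Intel Fortran source; the abort is triggered by an MPI_Send to an invalid rank under MPI_ERRORS_ARE_FATAL, which is what shows up in the output further down):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, token = 42;
    MPI_Comm dup_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* The test runs on a dup of MPI_COMM_WORLD ("DUP FROM 0" in the output). */
    MPI_Comm_dup(MPI_COMM_WORLD, &dup_comm);

    if (rank == 0) {
        printf("all tests PASSED (%d)\n", size);
        fflush(stdout);
        /* Send to a rank that does not exist. Under the default
         * MPI_ERRORS_ARE_FATAL handler this raises MPI_ERR_RANK and
         * aborts the job; ORTE then signals the remaining procs with
         * SIGCONT, SIGTERM, and finally SIGKILL. */
        MPI_Send(&token, 1, MPI_INT, size + 1, 0, dup_comm);
    }

    /* The other ranks sit here until they are killed. */
    MPI_Barrier(dup_comm);

    MPI_Comm_free(&dup_comm);
    MPI_Finalize();
    return 0;
}
```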
The output from the actual run looks fine:
$ mpirun -n 16 ./MPI_Errhandler_fatal1_f
 MPITEST_INFO (         0): Starting test MPI_Errhandler_fatal     
 MPITEST_INFO (         0): This test should abort after printing the results
 MPITEST_INFO (         0): message, otherwise a f.a.i.l.u.r.e is noted
MPITEST_results: MPI_Errhandler_fatal all tests PASSED (        16)
[bend001:21445] *** An error occurred in MPI_Send
[bend001:21445] *** reported by process [139981018365953,21474836480]
[bend001:21445] *** on communicator MPI COMMUNICATOR 3 DUP FROM 0
[bend001:21445] *** MPI_ERR_RANK: invalid rank
[bend001:21445] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[bend001:21445] ***    and potentially your MPI job)

However, writing out the core files takes a long time, and the shell prompt doesn't return until that dump is complete. MTT therefore thinks the job timed out and tries to kill it - and fails, because mpirun is already dead.
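One way to confirm that core-dump writing (rather than a true hang) accounts for the delay would be to suppress core files for the test procs, either with "ulimit -c 0" in the shell before launching mpirun or by lowering RLIMIT_CORE early in the test. A minimal sketch (hypothetical, not part of MTT or the test suite):

```c
/* Hypothetical check: drop the core-file size limit to zero before doing
 * anything else, so a later SIGSEGV terminates the process immediately
 * instead of spending time writing a core dump. */
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;
    rl.rlim_cur = 0;   /* soft limit: no core file */
    rl.rlim_max = 0;   /* hard limit: cannot be raised again unprivileged */

    if (setrlimit(RLIMIT_CORE, &rl) != 0) {
        perror("setrlimit(RLIMIT_CORE)");
        return 1;
    }

    printf("RLIMIT_CORE set to 0; core dumps suppressed\n");
    return 0;
}
```

If the job then exits promptly while still reporting the segfaults, that would point at core-dump I/O rather than a real hang.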
I have no idea why this test is segfaulting upon receipt of a signal. Can someone please investigate?
@ggouaillardet @bosilca @hjelmn @jsquyres
Not sure which of you might have time and knowledge to dig this one out.