Commit 2961609

btl/tcp: add show_help message when peer hangs up
We commonly see messages on the users list where a peer has hung up because it has crashed. Instead of having just a BTL_ERROR message, make this a real opal_show_help() message that tells the user that the peer unexpectedly hung up, and they should look into *why* that peer hung up.

Signed-off-by: Jeff Squyres <[email protected]>
1 parent 95c6f6c commit 2961609

2 files changed: +31 -1 lines changed

opal/mca/btl/tcp/btl_tcp_frag.c

Lines changed: 15 additions & 1 deletion
@@ -14,7 +14,7 @@
  * reserved.
  * Copyright (c) 2015 Research Organization for Information Science
  * and Technology (RIST). All rights reserved.
- * Copyright (c) 2015 Cisco Systems, Inc. All rights reserved.
+ * Copyright (c) 2015-2016 Cisco Systems, Inc. All rights reserved.
  * $COPYRIGHT$
  *
  * Additional copyrights may follow
@@ -44,8 +44,12 @@
 
 #include "opal/opal_socket_errno.h"
 #include "opal/mca/btl/base/btl_base_error.h"
+#include "opal/util/show_help.h"
+
 #include "btl_tcp_frag.h"
 #include "btl_tcp_endpoint.h"
+#include "btl_tcp_proc.h"
+
 
 static void mca_btl_tcp_frag_eager_constructor(mca_btl_tcp_frag_t* frag)
 {
@@ -225,6 +229,16 @@ bool mca_btl_tcp_frag_recv(mca_btl_tcp_frag_t* frag, int sd)
                 btl_endpoint->endpoint_state = MCA_BTL_TCP_FAILED;
                 mca_btl_tcp_endpoint_close(btl_endpoint);
                 return false;
+
+            case ECONNRESET:
+                opal_show_help("help-mpi-btl-tcp.txt", "peer hung up",
+                               true, opal_process_info.nodename,
+                               getpid(),
+                               btl_endpoint->endpoint_proc->proc_opal->proc_hostname);
+                btl_endpoint->endpoint_state = MCA_BTL_TCP_FAILED;
+                mca_btl_tcp_endpoint_close(btl_endpoint);
+                return false;
+
             default:
                 BTL_ERROR(("mca_btl_tcp_frag_recv: readv failed: %s (%d)",
                            strerror(opal_socket_errno),
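
The new `case ECONNRESET:` branch separates "the peer reset the connection" from the generic `readv` failure path. As a rough, standalone illustration of that idea (not Open MPI code), the sketch below classifies an errno value into an action. The names `recv_action_t`, `classify_recv_errno`, and `recv_action_name` are hypothetical, and treating `EINTR`/`EAGAIN`/`EWOULDBLOCK` as retryable is an assumption that the hunk above does not show.

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical action categories for a failed readv(); illustration only. */
typedef enum {
    RECV_RETRY,        /* transient condition (assumed): try again later   */
    RECV_PEER_HUNG_UP, /* ECONNRESET: the peer dropped the connection      */
    RECV_FATAL         /* any other errno: report it as a generic failure  */
} recv_action_t;

/* Map an errno value to an action. An if-chain is used instead of a
 * switch because EAGAIN and EWOULDBLOCK may share the same value. */
recv_action_t classify_recv_errno(int err)
{
    if (err == EINTR || err == EAGAIN || err == EWOULDBLOCK) {
        return RECV_RETRY;
    }
    if (err == ECONNRESET) {
        return RECV_PEER_HUNG_UP;
    }
    return RECV_FATAL;
}

/* Human-readable name for an action, e.g. for logging. */
const char *recv_action_name(recv_action_t action)
{
    switch (action) {
    case RECV_RETRY:        return "retry";
    case RECV_PEER_HUNG_UP: return "peer hung up";
    default:                return "fatal";
    }
}
```

A caller could log `recv_action_name(classify_recv_errno(errno))` right after a failed read; in the commit itself, the "peer hung up" case is the one that triggers the dedicated help message before the endpoint is marked failed and closed.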

opal/mca/btl/tcp/help-mpi-btl-tcp.txt

Lines changed: 16 additions & 0 deletions
@@ -74,3 +74,19 @@ Fall back to the normal progress.
   Local host: %s
   Value: %s
   Message: %s
+#
+[peer hung up]
+An MPI communication peer process has unexpectedly disconnected. This
+usually indicates a failure in the peer process (e.g., a crash or
+other unexpected process exit without calling MPI_FINALIZE first).
+
+You should check what happened to this peer process that hung up
+(perhaps there may even be a core file that you can examine); such
+crashes are frequently caused by application bugs or other external
+events.
+
+  Local host: %s
+  Local PID: %d
+  Peer host: %s
+
+This local MPI process will behave unpredictably after this.
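
The `[peer hung up]` topic has three placeholders (`%s`, `%d`, `%s`), and the call in `btl_tcp_frag.c` passes the local node name, the local PID, and the peer hostname in that order. The sketch below shows that mapping with plain `snprintf` standing in for `opal_show_help()`. The function `format_peer_hung_up` and the host names in the usage note are hypothetical, and the sketch makes no claim about how Open MPI's help system formats or deduplicates messages internally.

```c
#include <stdio.h>

/* Hypothetical stand-in for the field block of the "peer hung up" topic.
 * Argument order mirrors the call in the diff: local host, local PID,
 * peer host. Returns the snprintf() result. */
int format_peer_hung_up(char *buf, size_t len,
                        const char *local_host,
                        int local_pid,
                        const char *peer_host)
{
    return snprintf(buf, len,
                    "  Local host: %s\n"
                    "  Local PID: %d\n"
                    "  Peer host: %s\n",
                    local_host, local_pid, peer_host);
}
```

Calling `format_peer_hung_up(buf, sizeof buf, "node01", 4242, "node02")` fills `buf` with the three labelled lines. A mismatch between the placeholder types in the help file and the arguments at the call site would be undefined behavior in a printf-style formatter, so the two need to be kept in sync.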
