Commit 1953e34

btl/tcp: add show_help message when peer hangs up

We commonly see messages on the users list where a peer has hung up because it has crashed. Instead of having just a BTL_ERROR message, make this a real opal_show_help() message that tells the user that the peer unexpectedly hung up, and they should look into *why* that peer hung up.

Signed-off-by: Jeff Squyres <[email protected]>
1 parent 95c6f6c commit 1953e34

2 files changed: 32 additions, 1 deletion

opal/mca/btl/tcp/btl_tcp_frag.c

15 additions, 1 deletion

@@ -14,7 +14,7 @@
  * reserved.
  * Copyright (c) 2015 Research Organization for Information Science
  * and Technology (RIST). All rights reserved.
- * Copyright (c) 2015 Cisco Systems, Inc. All rights reserved.
+ * Copyright (c) 2015-2016 Cisco Systems, Inc. All rights reserved.
  * $COPYRIGHT$
  *
  * Additional copyrights may follow
@@ -44,8 +44,12 @@
 
 #include "opal/opal_socket_errno.h"
 #include "opal/mca/btl/base/btl_base_error.h"
+#include "opal/util/show_help.h"
+
 #include "btl_tcp_frag.h"
 #include "btl_tcp_endpoint.h"
+#include "btl_tcp_proc.h"
+
 
 static void mca_btl_tcp_frag_eager_constructor(mca_btl_tcp_frag_t* frag)
 {
@@ -225,6 +229,16 @@ bool mca_btl_tcp_frag_recv(mca_btl_tcp_frag_t* frag, int sd)
                 btl_endpoint->endpoint_state = MCA_BTL_TCP_FAILED;
                 mca_btl_tcp_endpoint_close(btl_endpoint);
                 return false;
+
+            case ECONNRESET:
+                opal_show_help("help-mpi-btl-tcp.txt", "peer hung up",
+                               true, opal_process_info.nodename,
+                               getpid(),
+                               btl_endpoint->endpoint_proc->proc_opal->proc_hostname);
+                btl_endpoint->endpoint_state = MCA_BTL_TCP_FAILED;
+                mca_btl_tcp_endpoint_close(btl_endpoint);
+                return false;
+
             default:
                 BTL_ERROR(("mca_btl_tcp_frag_recv: readv failed: %s (%d)",
                            strerror(opal_socket_errno),
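
The new `case ECONNRESET:` branch separates "the peer reset the connection" from every other `readv` failure, so the user can be pointed at the peer instead of the local process. The following is a minimal standalone sketch of that classification idea, not the Open MPI code: `recv_outcome_t`, `classify_recv_errno`, and `recv_outcome_advice` are hypothetical names, and treating `EINTR`/`EAGAIN`/`EWOULDBLOCK` as retryable is an assumption made for illustration, since the diff does not show how those cases are handled.

```c
#include <errno.h>

/* Hypothetical outcome codes for this sketch; not Open MPI identifiers. */
typedef enum {
    RECV_RETRY,        /* transient condition, try the read again later */
    RECV_PEER_HUNG_UP, /* peer reset the connection (ECONNRESET) */
    RECV_FAILED        /* any other error */
} recv_outcome_t;

/* Classify the errno value left behind by a failed readv(). */
recv_outcome_t classify_recv_errno(int err)
{
    /* EAGAIN and EWOULDBLOCK may share a value, so avoid a switch here. */
    if (err == EINTR || err == EAGAIN || err == EWOULDBLOCK) {
        return RECV_RETRY;
    }
    if (err == ECONNRESET) {
        return RECV_PEER_HUNG_UP;
    }
    return RECV_FAILED;
}

/* Short, user-facing explanation for each outcome. */
const char *recv_outcome_advice(recv_outcome_t outcome)
{
    switch (outcome) {
    case RECV_RETRY:
        return "transient condition; retry the read";
    case RECV_PEER_HUNG_UP:
        return "peer disconnected unexpectedly; investigate the peer";
    default:
        return "receive failed; close the endpoint";
    }
}
```

A caller would pass `errno` right after the failed read, for example `recv_outcome_advice(classify_recv_errno(errno))`, before any other library call can overwrite it.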

opal/mca/btl/tcp/help-mpi-btl-tcp.txt

17 additions, 0 deletions

@@ -74,3 +74,20 @@ Fall back to the normal progress.
 Local host: %s
 Value: %s
 Message: %s
+#
+[peer hung up]
+An MPI communication peer process has unexpectedly disconnected. This
+usually indicates a failure in the peer process (e.g., a crash or
+otherwise exiting without calling MPI_FINALIZE first).
+
+Although this local MPI process will likely now behave unpredictably
+(it may even hang or crash), the root cause of this problem is the
+failure of the peer -- that is what you need to investigate. For
+example, there may be a core file that you can examine. More
+generally: such peer hangups are frequently caused by application bugs
+or other external events.
+
+Local host: %s
+Local PID: %d
+Peer host: %s
+#
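
The `[peer hung up]` topic ends with three printf-style fields (`%s`, `%d`, `%s`), and the `opal_show_help()` call in `btl_tcp_frag.c` supplies the local node name, `getpid()`, and the peer host name in that same order. The sketch below only illustrates why the order and types of those arguments must match the template. It uses plain `snprintf` rather than Open MPI's help-file machinery, and `PEER_HUNG_UP_FIELDS` and `render_peer_hung_up` are hypothetical names introduced for the example.

```c
#include <stdio.h>

/* Trailing fields of the "peer hung up" topic, as they appear in the diff.
 * Hypothetical macro name; the real text lives in help-mpi-btl-tcp.txt. */
#define PEER_HUNG_UP_FIELDS \
    "Local host: %s\n"      \
    "Local PID: %d\n"       \
    "Peer host: %s\n"

/* Fill the template. Arguments must follow the conversion order:
 * string, int, string. Returns snprintf's result (the number of
 * characters that would be written, excluding the terminator). */
int render_peer_hung_up(char *buf, size_t len,
                        const char *local_host, int local_pid,
                        const char *peer_host)
{
    return snprintf(buf, len, PEER_HUNG_UP_FIELDS,
                    local_host, local_pid, peer_host);
}
```

When the PID comes from `getpid()`, casting the `pid_t` result to `int` keeps the argument consistent with the `%d` conversion on platforms where the two types differ.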
