This repository was archived by the owner on Sep 30, 2022. It is now read-only.

Commit 35f48ba

btl/tcp: add show_help message when peer hangs up
We commonly see messages on the users list where a peer has hung up because it has crashed. Instead of having just a BTL_ERROR message, make this a real opal_show_help() message that tells the user that the peer unexpectedly hung up, and they should look into *why* that peer hung up.

Signed-off-by: Jeff Squyres <[email protected]>

(cherry picked from commit open-mpi/ompi@1953e34)
1 parent d3733dd commit 35f48ba

2 files changed: +35 -3 lines changed

opal/mca/btl/tcp/btl_tcp_frag.c

Lines changed: 18 additions & 3 deletions
@@ -14,6 +14,7 @@
  * reserved.
  * Copyright (c) 2015 Research Organization for Information Science
  * and Technology (RIST). All rights reserved.
+ * Copyright (c) 2015-2016 Cisco Systems, Inc. All rights reserved.
  * $COPYRIGHT$
  *
  * Additional copyrights may follow
@@ -43,8 +44,12 @@
 
 #include "opal/opal_socket_errno.h"
 #include "opal/mca/btl/base/btl_base_error.h"
+#include "opal/util/show_help.h"
+
 #include "btl_tcp_frag.h"
 #include "btl_tcp_endpoint.h"
+#include "btl_tcp_proc.h"
+
 
 static void mca_btl_tcp_frag_eager_constructor(mca_btl_tcp_frag_t* frag)
 {
@@ -222,9 +227,19 @@ bool mca_btl_tcp_frag_recv(mca_btl_tcp_frag_t* frag, int sd)
                     frag->iov_ptr[0].iov_base, (unsigned long) frag->iov_ptr[0].iov_len,
                     strerror(opal_socket_errno), (unsigned long) frag->iov_cnt));
                 btl_endpoint->endpoint_state = MCA_BTL_TCP_FAILED;
-                mca_btl_tcp_endpoint_close(btl_endpoint);
-                return false;
-            default:
+                mca_btl_tcp_endpoint_close(btl_endpoint);
+                return false;
+
+            case ECONNRESET:
+                opal_show_help("help-mpi-btl-tcp.txt", "peer hung up",
+                               true, opal_process_info.nodename,
+                               getpid(),
+                               btl_endpoint->endpoint_proc->proc_opal->proc_hostname);
+                btl_endpoint->endpoint_state = MCA_BTL_TCP_FAILED;
+                mca_btl_tcp_endpoint_close(btl_endpoint);
+                return false;
+
+            default:
                 BTL_ERROR(("mca_btl_tcp_frag_recv: readv failed: %s (%d)",
                            strerror(opal_socket_errno),
                            opal_socket_errno));
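
The new ECONNRESET branch separates "the peer went away" from every other read failure so that the user gets a targeted message instead of a generic error. The following is a minimal standalone sketch of that idea, not the Open MPI code: `recv_outcome_t`, `classify_recv_errno()` and `describe_recv_failure()` are hypothetical names, and the retry branch for EINTR/EAGAIN/EWOULDBLOCK is an assumption added for illustration that does not appear in the hunk above.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical outcome codes for this sketch; not part of Open MPI. */
typedef enum { RECV_RETRY, RECV_PEER_HUNG_UP, RECV_FAILED } recv_outcome_t;

/* Classify the errno left behind by a failed readv()/recv().
 * An if-chain is used instead of a switch because EAGAIN and
 * EWOULDBLOCK may share a value, which would be a duplicate case label. */
recv_outcome_t classify_recv_errno(int err)
{
    if (err == EINTR || err == EAGAIN || err == EWOULDBLOCK) {
        return RECV_RETRY;          /* assumed transient; not from the diff */
    }
    if (err == ECONNRESET) {
        return RECV_PEER_HUNG_UP;   /* the case this commit singles out */
    }
    return RECV_FAILED;             /* everything else: generic error path */
}

/* Build the diagnostic text a caller might print for a given errno.
 * Returns what snprintf() returns. */
int describe_recv_failure(char *buf, size_t len, int err, const char *peer_host)
{
    switch (classify_recv_errno(err)) {
    case RECV_PEER_HUNG_UP:
        return snprintf(buf, len,
                        "peer %s unexpectedly disconnected; "
                        "investigate why the peer exited", peer_host);
    case RECV_RETRY:
        return snprintf(buf, len, "transient condition; retry the read");
    default:
        return snprintf(buf, len, "readv failed: %s (%d)", strerror(err), err);
    }
}
```

A caller would pass the saved errno value right after the failed read, then mark the connection as failed and close it for any outcome other than `RECV_RETRY`, mirroring the state change and close call in the diff.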

opal/mca/btl/tcp/help-mpi-btl-tcp.txt

Lines changed: 17 additions & 0 deletions
@@ -74,3 +74,20 @@ Fall back to the normal progress.
   Local host: %s
   Value: %s
   Message: %s
+#
+[peer hung up]
+An MPI communication peer process has unexpectedly disconnected. This
+usually indicates a failure in the peer process (e.g., a crash or
+otherwise exiting without calling MPI_FINALIZE first).
+
+Although this local MPI process will likely now behave unpredictably
+(it may even hang or crash), the root cause of this problem is the
+failure of the peer -- that is what you need to investigate. For
+example, there may be a core file that you can examine. More
+generally: such peer hangups are frequently caused by application bugs
+or other external events.
+
+  Local host: %s
+  Local PID: %d
+  Peer host: %s
+#
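
The three placeholders in the `[peer hung up]` topic (`%s`, `%d`, `%s`) have to line up, in order and in type, with the three trailing arguments passed in the C file: local node name, PID, peer host name. The sketch below shows that pairing with plain `snprintf()`. It is an illustration only: `format_peer_hung_up()` is a hypothetical helper, the host names in the usage note are made up, and it does not reproduce how `opal_show_help()` loads or renders the help file.

```c
#include <stdio.h>

/* Fill the trailing fields of the help topic. The argument order
 * (local host, local PID, peer host) must match the order of the
 * %s / %d / %s placeholders. Returns what snprintf() returns.
 * Hypothetical helper for illustration; not an Open MPI function. */
int format_peer_hung_up(char *buf, size_t len,
                        const char *local_host, int local_pid,
                        const char *peer_host)
{
    return snprintf(buf, len,
                    "  Local host: %s\n"
                    "  Local PID: %d\n"
                    "  Peer host: %s\n",
                    local_host, local_pid, peer_host);
}
```

For example, calling `format_peer_hung_up(buf, sizeof buf, "node01", 4242, "node02")` fills `buf` with the three lines for those values. When the PID comes from `getpid()`, casting it to `int` before handing it to a `%d` placeholder avoids depending on the width of `pid_t` on a given platform.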
