Commit 527efec

Merge pull request #2050 from jsquyres/pr/btl-tcp-help-messages
Add a show_help message to TCP BTL when peer unexpectedly disconnects
2 parents 894be78 + 1953e34 commit 527efec

2 files changed, 36 additions and 8 deletions

opal/mca/btl/tcp/btl_tcp_frag.c

Lines changed: 15 additions & 1 deletion
@@ -14,7 +14,7 @@
  * reserved.
  * Copyright (c) 2015 Research Organization for Information Science
  * and Technology (RIST). All rights reserved.
- * Copyright (c) 2015 Cisco Systems, Inc. All rights reserved.
+ * Copyright (c) 2015-2016 Cisco Systems, Inc. All rights reserved.
  * $COPYRIGHT$
  *
  * Additional copyrights may follow
@@ -44,8 +44,12 @@

 #include "opal/opal_socket_errno.h"
 #include "opal/mca/btl/base/btl_base_error.h"
+#include "opal/util/show_help.h"
+
 #include "btl_tcp_frag.h"
 #include "btl_tcp_endpoint.h"
+#include "btl_tcp_proc.h"
+

 static void mca_btl_tcp_frag_eager_constructor(mca_btl_tcp_frag_t* frag)
 {
@@ -225,6 +229,16 @@ bool mca_btl_tcp_frag_recv(mca_btl_tcp_frag_t* frag, int sd)
                 btl_endpoint->endpoint_state = MCA_BTL_TCP_FAILED;
                 mca_btl_tcp_endpoint_close(btl_endpoint);
                 return false;
+
+            case ECONNRESET:
+                opal_show_help("help-mpi-btl-tcp.txt", "peer hung up",
+                               true, opal_process_info.nodename,
+                               getpid(),
+                               btl_endpoint->endpoint_proc->proc_opal->proc_hostname);
+                btl_endpoint->endpoint_state = MCA_BTL_TCP_FAILED;
+                mca_btl_tcp_endpoint_close(btl_endpoint);
+                return false;
+
             default:
                 BTL_ERROR(("mca_btl_tcp_frag_recv: readv failed: %s (%d)",
                            strerror(opal_socket_errno),
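
The new `case ECONNRESET:` branch separates "the peer reset the connection" from the generic `default:` error path. The standalone sketch below illustrates that kind of errno dispatch outside of Open MPI. The names `recv_outcome_t`, `classify_recv_errno`, and `recv_outcome_name` are hypothetical and do not appear in the commit, and treating `EINTR`/`EAGAIN`/`EWOULDBLOCK` as retryable is an assumption about typical non-blocking socket code, not something this diff shows.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical outcome categories; these are not Open MPI types. */
typedef enum {
    RECV_RETRY,        /* transient condition: read again later */
    RECV_PEER_HUNG_UP, /* peer reset the connection (ECONNRESET) */
    RECV_FAILED        /* any other error */
} recv_outcome_t;

/* Map the errno from a failed readv() to an outcome.
 * An if-chain is used instead of a switch because EAGAIN and
 * EWOULDBLOCK may share a value, which would duplicate a case label. */
static recv_outcome_t classify_recv_errno(int err)
{
    if (err == EINTR || err == EAGAIN || err == EWOULDBLOCK) {
        return RECV_RETRY;
    }
    if (err == ECONNRESET) {
        return RECV_PEER_HUNG_UP;
    }
    return RECV_FAILED;
}

/* Short label for an outcome, e.g. for logging. */
static const char *recv_outcome_name(recv_outcome_t outcome)
{
    switch (outcome) {
    case RECV_RETRY:        return "retry";
    case RECV_PEER_HUNG_UP: return "peer hung up";
    case RECV_FAILED:       return "failed";
    }
    assert(!"unreachable: unknown recv_outcome_t value");
    return "unknown";
}
```

In this sketch a caller would check the result of `readv()`, pass `errno` to `classify_recv_errno()` on failure, and emit the dedicated help message only for `RECV_PEER_HUNG_UP`, which mirrors how the commit gives `ECONNRESET` its own message instead of the generic `BTL_ERROR` output.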

opal/mca/btl/tcp/help-mpi-btl-tcp.txt

Lines changed: 21 additions & 7 deletions
@@ -1,6 +1,6 @@
 # -*- text -*-
 #
-# Copyright (c) 2009-2014 Cisco Systems, Inc. All rights reserved.
+# Copyright (c) 2009-2016 Cisco Systems, Inc. All rights reserved.
 # Copyright (c) 2015-2016 The University of Tennessee and The University
 # of Tennessee Research Foundation. All rights
 # reserved.
@@ -59,21 +59,35 @@ most common causes when it does occur are:
 * The operating system ran out of file descriptors
 * The operating system ran out of memory

+Your Open MPI job will likely hang (or crash) until the failure
+resason is fixed (e.g., more file descriptors and/or memory becomes
+available), and may eventually timeout / abort.
+
+Local host: %s
+PID: %d
+Errno: %d (%s)
+#
 [unsuported progress thread]
 WARNING: Support for the TCP progress thread has not been compiled in.
 Fall back to the normal progress.

 Local host: %s
 Value: %s
 Message: %s
-
 #
+[peer hung up]
+An MPI communication peer process has unexpectedly disconnected. This
+usually indicates a failure in the peer process (e.g., a crash or
+otherwise exiting without calling MPI_FINALIZE first).

-Your Open MPI job will likely hang until the failure resason is fixed
-(e.g., more file descriptors and/or memory becomes available), and may
-eventually timeout / abort.
+Although this local MPI process will likely now behave unpredictably
+(it may even hang or crash), the root cause of this problem is the
+failure of the peer -- that is what you need to investigate. For
+example, there may be a core file that you can examine. More
+generally: such peer hangups are frequently caused by application bugs
+or other external events.

 Local host: %s
-PID: %d
-Errno: %d (%s)
+Local PID: %d
+Peer host: %s
 #
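
The `[peer hung up]` topic ends with three printf-style placeholders (`%s`, `%d`, `%s`), and the `opal_show_help()` call in `btl_tcp_frag.c` passes its trailing arguments in the same order: node name, `getpid()`, peer host name. The sketch below illustrates that positional pairing with plain `vsnprintf`. `render_help` and `PEER_HUNG_UP_FIELDS` are hypothetical stand-ins and do not reproduce how Open MPI's show_help machinery actually works; the field text is taken from the diff with single spaces as shown here.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Trailing fields of the "peer hung up" topic, as shown in the diff. */
static const char *PEER_HUNG_UP_FIELDS =
    "Local host: %s\n"
    "Local PID: %d\n"
    "Peer host: %s\n";

/* Hypothetical renderer: fills a printf-style template from varargs.
 * Returns 0 on success, -1 if formatting failed or was truncated. */
static int render_help(char *buf, size_t len, const char *tmpl, ...)
{
    va_list ap;
    int n;

    assert(buf != NULL && tmpl != NULL && len > 0);

    va_start(ap, tmpl);
    n = vsnprintf(buf, len, tmpl, ap);
    va_end(ap);

    return (n >= 0 && (size_t)n < len) ? 0 : -1;
}
```

A call such as `render_help(buf, sizeof(buf), PEER_HUNG_UP_FIELDS, "node01", 4242, "node02")` fills the fields in template order. Because the pairing is purely positional, swapping the PID and a host name at the call site would compile but pass a mismatched type to `%s` or `%d`, which is why the argument order in the C file has to track the order of placeholders in the help file.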
