Commit 72373c6

plm/rsh: Fix signal handling for rsh launcher
* Similar to the other launchers (i.e., slurm, alps) we need to put the
  children in a separate process group to prevent SIGINT (from a CTRL-C)
  from being delivered to the whole process group and prematurely killing
  the rsh/ssh connections to the remote daemons.

Signed-off-by: Joshua Hursey <[email protected]>
(cherry picked from commit 843fcca)
Signed-off-by: Joshua Hursey <[email protected]>
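
The approach described in the message can be sketched in isolation. This is a minimal illustration and not the Open MPI code: `spawn_in_own_group` and `stop_child` are hypothetical helper names, and the sketch assumes a POSIX system where `setpgid` is available.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of Open MPI): fork, move the child into
 * its own process group, then exec argv in the child.
 * Returns the child's pid, or -1 if fork() failed.
 */
pid_t spawn_in_own_group(char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        /* Child: become the leader of a new process group before exec. */
        if (setpgid(0, 0) != 0) {
            _exit(127);
        }
        execvp(argv[0], argv);
        _exit(127); /* only reached if exec failed */
    }
    /*
     * Parent: repeat the call so the group is set no matter which process
     * runs first. This can fail with EACCES if the child has already
     * exec'd; by then the child has made its own setpgid() call, so the
     * error is ignored here.
     */
    (void)setpgid(pid, pid);
    return pid;
}

/*
 * Hypothetical helper: terminate and reap a child started above.
 * Returns the wait status, or -1 if waitpid() failed.
 */
int stop_child(pid_t pid)
{
    int status = 0;
    kill(pid, SIGKILL);
    if (waitpid(pid, &status, 0) != pid) {
        return -1;
    }
    return status;
}
```

After `spawn_in_own_group` returns, `getpgid(pid)` should equal `pid` and differ from the caller's `getpgrp()`, so a terminal-generated SIGINT aimed at the caller's foreground process group no longer reaches the child. Calling `setpgid` on both sides of the `fork` avoids depending on which process is scheduled first.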
1 parent 3555e02 commit 72373c6

1 file changed: +37 -1 lines changed
orte/mca/plm/rsh/plm_rsh_module.c

Lines changed: 37 additions & 1 deletion
@@ -13,7 +13,7 @@
  * Copyright (c) 2007-2012 Los Alamos National Security, LLC. All rights
  * reserved.
  * Copyright (c) 2008-2009 Sun Microsystems, Inc. All rights reserved.
- * Copyright (c) 2011 IBM Corporation. All rights reserved.
+ * Copyright (c) 2011-2017 IBM Corporation. All rights reserved.
  * Copyright (c) 2014-2015 Intel Corporation. All rights reserved.
  * Copyright (c) 2015 Research Organization for Information Science
  * and Technology (RIST). All rights reserved.
@@ -963,9 +963,45 @@ static void process_launch_list(int fd, short args, void *cbdata)
 
         /* child */
         if (pid == 0) {
+            /*
+             * When the user presses CTRL-C, SIGINT is sent to the whole process
+             * group which terminates the rsh/ssh command. This can cause the
+             * remote daemon to crash with a SIGPIPE when it tried to print out
+             * status information. This has two concequences:
+             * 1) The remote node is not cleaned up as it should. The local
+             *    processes will notice that the orted failed and cleanup their
+             *    part of the session directory, but the job level part will
+             *    remain littered.
+             * 2) Any debugging information we expected to see from the orted
+             *    during shutdown is lost.
+             *
+             * The solution here is to put the child processes in a separate
+             * process group from the HNP. So when the user presses CTRL-C
+             * then only the HNP receives the signal, and not the rsh/ssh
+             * child processes.
+             */
+#if HAVE_SETPGID
+            if( 0 != setpgid(0, 0) ) {
+                opal_output(0, "plm:rsh: Error: setpgid(0,0) failed in child with errno=%s(%d)\n",
+                            strerror(errno), errno);
+                exit(-1);
+            }
+#endif
+
             /* do the ssh launch - this will exit if it fails */
             ssh_child(caddy->argc, caddy->argv);
         } else { /* father */
+            // Put the child in a separate progress group
+            // - see comment in child section.
+#if HAVE_SETPGID
+            if( 0 != setpgid(pid, pid) ) {
+                opal_output(0, "plm:rsh: Warning: setpgid(%ld,%ld) failed in parent with errno=%s(%d)\n",
+                            (long)pid, (long)pid, strerror(errno), errno);
+                // Ignore this error since the child is off and running.
+                // We still need to track it.
+            }
+#endif
+
             /* indicate this daemon has been launched */
             caddy->daemon->state = ORTE_PROC_STATE_RUNNING;
             /* record the pid of the ssh fork */
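
The SIGPIPE failure mentioned in the child-section comment arises when a process writes to a pipe or socket whose other end has gone away. The sketch below reproduces that condition with a plain local pipe, which is only an analogy for the orted's ssh connection. `errno_after_write_to_closed_pipe` is a hypothetical name, and SIGPIPE is temporarily ignored so that the failure can be observed as `EPIPE` instead of terminating the process.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <unistd.h>

/*
 * Hypothetical helper: write into a pipe whose read end is already closed.
 * Returns the errno reported by write(), 0 if the write unexpectedly
 * succeeded, or -1 if the setup itself failed.
 */
int errno_after_write_to_closed_pipe(void)
{
    int fds[2];
    struct sigaction ignore, saved;
    int err = 0;

    if (pipe(fds) != 0) {
        return -1;
    }

    /* With the default disposition, SIGPIPE would kill this process. */
    ignore.sa_handler = SIG_IGN;
    sigemptyset(&ignore.sa_mask);
    ignore.sa_flags = 0;
    if (sigaction(SIGPIPE, &ignore, &saved) != 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    /* Simulate the peer (here: the reader) disappearing. */
    close(fds[0]);

    if (write(fds[1], "status\n", 7) < 0) {
        err = errno;
    }

    close(fds[1]);
    sigaction(SIGPIPE, &saved, NULL); /* restore the previous disposition */
    return err;
}
```

A process that leaves SIGPIPE at its default disposition is terminated by the same write, which matches the crash the comment describes when the remote daemon prints status after its rsh/ssh connection has been killed.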