Description
Background information
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
v4.1.6
Details of the problem
I am using mpirun on a cluster with several nodes in it, with ssh/rsh as the expected transport for job starts.
I can't seem to get mpirun to respect either the --launch-agent or the --mca plm rsh flag when it thinks that the target host is the local machine, even when it effectively is not because of namespace isolation. This logic appears to be hardcoded based on hostname matching.
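To illustrate the kind of name-based check that would produce this behavior, here is a minimal sketch. It is not Open MPI's actual code: `short_name` and `is_local_by_name` are hypothetical helpers, and the idea that locality is decided by comparing short hostnames is only an assumption based on what I observe.

```shell
#!/bin/sh
# Illustrative only: NOT taken from Open MPI. Assumes a host is treated as
# "local" purely because its name matches the local hostname.
# Works on hostnames, not IP addresses.

# short_name HOST: strip any domain suffix (host1.example.com -> host1).
short_name() {
    printf '%s\n' "${1%%.*}"
}

# is_local_by_name HOST_LIST [LOCAL_NAME]
# Returns 0 if any entry of the comma-separated HOST_LIST matches LOCAL_NAME
# (default: output of `hostname`) by short name, 1 otherwise.
is_local_by_name() {
    list=$1
    me=$(short_name "${2:-$(hostname)}")
    old_ifs=$IFS
    IFS=,
    for h in $list; do
        if [ "$(short_name "$h")" = "$me" ]; then
            IFS=$old_ifs
            return 0
        fi
    done
    IFS=$old_ifs
    return 1
}
```

With a check like this, `is_local_by_name 'host1,host2,host3'` succeeds on host1 and fails on host4, which matches the difference in behavior between those two launch hosts. A name match says nothing about whether the launcher and the launched daemon share a mount namespace.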
The command I am trying to make work, with the actual hostnames and commands replaced:
host1 $ mpirun --mca plm rsh --mca plm_rsh_assume_same_shell false --noprefix --host 'host1,host2,host3' --npernode 1 --launch-agent 'enroot start ${image_sqfs} orted' $command
<mpirun>
<!snip>
<stderr>--------------------------------------------------------------------------
</stderr>
<stderr>mpirun was unable to find the specified executable file, and therefore
</stderr>
<stderr>did not launch the job. This error was first reported for process
</stderr>
<stderr>rank 0; it may have occurred for other processes as well.
</stderr>
<stderr>
</stderr>
<stderr>NOTE: A common cause for this error is misspelling a mpirun command
</stderr>
<stderr> line parameter option (remember that mpirun interprets the first
</stderr>
<stderr> unrecognized command line token as the executable).
</stderr>
<stderr>
</stderr>
<stderr>Node: $host1
</stderr>
<stderr>Executable: $command
</stderr>
<stderr>--------------------------------------------------------------------------
</stderr>
<!snip>
</mpirun>
$command is an executable that the process running mpirun has no access to; it resides in ${image_sqfs}. This is intended. What is not intended is for mpirun to ignore the --launch-agent and --mca plm rsh options, which would run orted in the correct namespace with the required executables.
Running mpirun on host4 instead, which is not part of the --host argument list, works without issue.
Setting --mca plm_rsh_no_tree_spawn true changes the behavior and causes mpirun to hang and never attempt to launch any process at all.
I am effectively using enroot to launch orted in a mount namespace that contains the executables that I want mpirun to run. This works as expected unless the parent mpirun call is also on one of the hosts specified in --host, at which point mpirun appears to attempt to fork/exec the executable directly. That fails because mpirun does not have access to the files that orted does.
In my opinion, the expected behavior here should be for the launch agent to always be used for job execution if specified, regardless of whether mpirun (incorrectly) thinks it is on localhost or not. Even --mca plm rsh appears to be completely ignored in this case, as Open MPI does not actually attempt to use ssh/rsh.
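As a quick way to confirm which side of the namespace boundary a process is on, a check along these lines can be run both in the shell that starts mpirun and inside the container. This is a minimal sketch; `exe_visible` is a hypothetical helper, not part of Open MPI or enroot.

```shell
#!/bin/sh
# Hypothetical helper: reports whether the calling process can see an
# executable. Not part of Open MPI or enroot.

# exe_visible NAME_OR_PATH
# Returns 0 if NAME_OR_PATH is an executable regular file (when it contains
# a slash) or resolves as a command in the current environment (otherwise).
# Note: for bare names, shell builtins and functions also count as found.
exe_visible() {
    case $1 in
        */*) [ -f "$1" ] && [ -x "$1" ] ;;
        *)   command -v "$1" >/dev/null 2>&1 ;;
    esac
}
```

In my setup I would expect `exe_visible "$command"` to fail in the shell that invokes mpirun on host1 and to succeed when run inside the environment started by enroot, which is consistent with the "unable to find the specified executable file" error coming from a direct local fork/exec.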
This appears to have come up a few times before, in #10151 and #7120.