Description
Background information
I am trying to launch an MPMD job using a rankfile that uses relative indexing (+n0 for the first hostname, +n1 for the second, etc.) on a system that uses the SLURM job scheduler.
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
OpenMPI v4.1.4.
sbatch --version reports 23.02.7
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Not certain (the system administrators installed it), but it looks like it was compiled from source.
Please describe the system on which you are running
- Operating system/version: RHEL8
- Computer hardware: Each node has 2 x Intel Broadwell E5-2695 v4 (18 cores), 128 GB RAM
- Network type: Intel OmniPath
Details of the problem
When I try to launch an MPMD program using mpiexec -rf rankfile.txt to specify the layout of the ranks, I get an error if:
- The rankfile uses relative indexing for the hostnames (+n0, +n1, etc.)
  - Replacing the relative indexing with actual hostnames makes the problem go away
- The second application is on a different host than the first
- There are enough ranks
  - The attached example fills up 2 nodes: the first node with 36 ranks of the first application, the second node with 36 ranks of the second application. Putting 1 rank of application 1 on the first node and 1 rank of application 2 on the second node does not produce an error
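For readers without the attachment, the failing layout can be sketched roughly as follows. This is an illustration only, not the contents of the attached reproducer: the helper `make_rankfile`, the one-core-per-rank `slot=` values, and the executable names `./app1` and `./app2` are all assumptions made for the example.

```shell
#!/bin/sh
# Hypothetical helper (not from the attached reproducer): print a rankfile
# that places ranks_per_node consecutive ranks on each relatively indexed
# host (+n0, +n1, ...), assigning one logical core per rank.
make_rankfile() {
    nodes=$1
    ranks_per_node=$2
    rank=0
    node=0
    while [ "$node" -lt "$nodes" ]; do
        slot=0
        while [ "$slot" -lt "$ranks_per_node" ]; do
            echo "rank ${rank}=+n${node} slot=${slot}"
            rank=$((rank + 1))
            slot=$((slot + 1))
        done
        node=$((node + 1))
    done
}

# Two nodes with 36 ranks each: ranks 0-35 on +n0, ranks 36-71 on +n1.
make_rankfile 2 36 > rankfile.txt

# MPMD launch of the kind described above (./app1 and ./app2 are
# placeholder executable names, left commented out here):
# mpiexec -rf rankfile.txt -np 36 ./app1 : -np 36 ./app2
```

With this layout, every rank of the second application refers to +n1, which matches the conditions listed above.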
The error I get is:
--------------------------------------------------------------------------
Rankfile claimed host +n1 by index that is bigger than number of allocated hosts.
--------------------------------------------------------------------------
[ec534:05640] [[7352,0],0] ORTE_ERROR_LOG: Bad parameter in file rmaps_rank_file.c at line 271
[ec534:05640] [[7352,0],0] ORTE_ERROR_LOG: Bad parameter in file base/rmaps_base_map_job.c at line 402
I did allocate 2 nodes, so +n1 should have been a valid hostname.
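As a sanity check on the rankfile side, the condition named in the error message can be tested independently of Open MPI. The following is a minimal sketch under stated assumptions: `max_relative_index` and `check_rankfile` are hypothetical helpers written for this illustration, they assume one `rank N=+nM ...` entry per line, and they do not reproduce Open MPI's actual check in rmaps_rank_file.c.

```shell
#!/bin/sh
# Hypothetical helper: print the largest M among "+nM" host entries
# in the rankfile given as $1 (prints nothing if there are none).
max_relative_index() {
    sed -n 's/^rank[[:space:]]*[0-9]*[[:space:]]*=[[:space:]]*+n\([0-9]*\).*/\1/p' "$1" \
        | sort -n | tail -n 1
}

# Hypothetical helper: compare the largest relative index in rankfile $1
# against the number of allocated hosts $2. Relative indices start at 0,
# so +nM is valid only when M < number of hosts.
check_rankfile() {
    file=$1
    nhosts=$2
    max=$(max_relative_index "$file")
    if [ -n "$max" ] && [ "$max" -ge "$nhosts" ]; then
        echo "index +n${max} exceeds ${nhosts} allocated hosts"
        return 1
    fi
    echo "ok"
}

# Inside a SLURM job, the host count could be taken from the allocation:
#   nhosts=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | wc -l)
#   check_rankfile rankfile.txt "$nhosts"
```

For a 2-node allocation and a rankfile whose highest entry is +n1, this check passes, which is why the error is unexpected.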
A minimal reproducer is attached. To run it, do sbatch ./batch_reduced.sh.
As noted in batch_reduced.sh, running 72 ranks of a single application works fine, so this is somehow related to running 2 applications with 36 ranks each.
mpirun_error.tar.gz