Rankfile and MPMD Issue #12446

@JaredCrean2

Background information

I am trying to launch an MPMD job using a rankfile with relative host indexing (+n0 for the first hostname, +n1 for the second, etc.) on a system that uses the SLURM job scheduler.

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

Open MPI v4.1.4.
sbatch --version reports 23.02.7.

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Not certain (the system administrators installed it), but it looks like it was compiled from source.

Please describe the system on which you are running

  • Operating system/version: RHEL8
  • Computer hardware: Each node has 2 x Intel Broadwell E5-2695 v4 (18 cores), 128 GB RAM
  • Network type: Intel OmniPath

Details of the problem

When I try to launch an MPMD program using mpiexec -rf rankfile.txt to specify the layout of the ranks, I get an error if:

  • the rankfile uses relative indexing for the hostnames (+n0, +n1 etc.)
    • Replacing the relative indexing with actual hostnames makes the problem go away
  • the second application is on a different host than the first
  • there are enough ranks
    • The attached example fills up 2 nodes: the first node with 36 ranks of the first application, the second node with 36 ranks of the second application. Putting 1 rank of application 1 on the first node and 1 rank of application 2 on the second node does not produce an error.
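
For readers without the attachment, the layout described above can be sketched as follows. This is an illustrative reconstruction, not the contents of the attached rankfile: the slot numbering (one rank per core, cores 0 to 35), the executable names ./app1 and ./app2, and the helper names build_rankfile and build_command are all assumptions. The rankfile lines follow the "rank N=host slot=S" form from the mpirun man page, and the command uses mpirun's colon-separated MPMD syntax.

```python
def build_rankfile(ranks_per_node=36, num_nodes=2):
    """Return rankfile text that fills nodes one after another using +nK host names."""
    lines = []
    for rank in range(ranks_per_node * num_nodes):
        node = rank // ranks_per_node  # 0 -> first application, 1 -> second application
        slot = rank % ranks_per_node   # assumed: one rank per core, cores numbered from 0
        lines.append(f"rank {rank}=+n{node} slot={slot}")
    return "\n".join(lines) + "\n"


def build_command(rankfile="rankfile.txt", ranks_per_app=36):
    """Return the MPMD launch command as an argument list."""
    # ./app1 and ./app2 are hypothetical executable names.
    return [
        "mpiexec", "-rf", rankfile,
        "-np", str(ranks_per_app), "./app1",
        ":",
        "-np", str(ranks_per_app), "./app2",
    ]


if __name__ == "__main__":
    text = build_rankfile()
    # Show the boundary between the two nodes.
    print("\n".join(text.splitlines()[34:38]))
    print(" ".join(build_command()))
```

With the defaults, ranks 0 to 35 map to +n0 and ranks 36 to 71 map to +n1, which matches the failing case; calling build_rankfile(ranks_per_node=1) gives the 1+1 layout that does not produce the error.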

The error I get is:

--------------------------------------------------------------------------
Rankfile claimed host +n1 by index that is bigger than number of allocated hosts.
--------------------------------------------------------------------------
[ec534:05640] [[7352,0],0] ORTE_ERROR_LOG: Bad parameter in file rmaps_rank_file.c at line 271
[ec534:05640] [[7352,0],0] ORTE_ERROR_LOG: Bad parameter in file base/rmaps_base_map_job.c at line 402

I did allocate 2 nodes, so +n1 should have been a valid hostname.
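
The expected behaviour can be stated as a small sketch: +nK should resolve to the K-th host of the allocation and fail only when K is not smaller than the number of allocated hosts. This is not the code in rmaps_rank_file.c; the function name resolve_relative_host and the second hostname ec535 are hypothetical (only ec534 appears in the log above).

```python
import re


def resolve_relative_host(spec, allocated_hosts):
    """Map a rankfile host entry to a hostname from the allocation."""
    match = re.fullmatch(r"\+n(\d+)", spec)
    if match is None:
        return spec  # not a relative index: treat it as a literal hostname
    index = int(match.group(1))
    if index >= len(allocated_hosts):
        raise ValueError(
            f"host {spec} is beyond the {len(allocated_hosts)} allocated host(s)"
        )
    return allocated_hosts[index]


if __name__ == "__main__":
    hosts = ["ec534", "ec535"]  # ec535 is a made-up name for the second node
    print(resolve_relative_host("+n1", hosts))
```

Under this reading, a two-node allocation makes +n1 valid, and the reported message is what a lookup would produce if the host list it consulted had fewer than two entries.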

A minimal reproducer is attached. To run it, do sbatch ./batch_reduced.sh.

As noted in batch_reduced.sh, running 72 ranks of a single application works fine, so this is somehow related to running 2 applications with 36 ranks each.
mpirun_error.tar.gz
