Inconsistent oversubscribe behavior #1709

@abouteiller

I am testing 1.10.3rc3 on our local cluster (CentOS 6.7). The map-by behavior is generally consistent, except for cores+1 deployments:

With 120 cores total, requesting 122 processes gives the expected error message about oversubscribing being a bad idea. However, requesting 121 processes does not give an error, which I believe is incorrect.

See the reproducer below.

/opt/ompi-1.10.3rc3/bin/mpirun -hostfile /opt/etc/nd.machinefile.ompi -np 122 --display-allocation   -map-by node hostname 

======================   ALLOCATED NODES   ======================
    nd01: slots=20 max_slots=0 slots_inuse=0 state=UNKNOWN
    nd02: slots=20 max_slots=0 slots_inuse=0 state=UNKNOWN
    nd03: slots=20 max_slots=0 slots_inuse=0 state=UNKNOWN
    nd04: slots=20 max_slots=0 slots_inuse=0 state=UNKNOWN
    nd05: slots=20 max_slots=0 slots_inuse=0 state=UNKNOWN
    nd06: slots=20 max_slots=0 slots_inuse=0 state=UNKNOWN
=================================================================
--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to:     NONE:IF-SUPPORTED
   Node:        nd02
   #processes:  11
   #cpus:       10

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--------------------------------------------------------------------------

12:10:12@dancer:~/ompi/imb/4.1/src                                                                                                                                       Wed May. 25; 21 users, load5,15: 2.71,2.07
$ /opt/ompi-1.10.3rc3/bin/mpirun -hostfile /opt/etc/nd.machinefile.ompi -np 121 --display-allocation   -map-by node hostname 

======================   ALLOCATED NODES   ======================
    nd01: slots=20 max_slots=0 slots_inuse=0 state=UNKNOWN
    nd02: slots=20 max_slots=0 slots_inuse=0 state=UNKNOWN
    nd03: slots=20 max_slots=0 slots_inuse=0 state=UNKNOWN
    nd04: slots=20 max_slots=0 slots_inuse=0 state=UNKNOWN
    nd05: slots=20 max_slots=0 slots_inuse=0 state=UNKNOWN
    nd06: slots=20 max_slots=0 slots_inuse=0 state=UNKNOWN
=================================================================
nd01
nd01
nd01
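
To make the expectation concrete, here is a rough sketch of the arithmetic, assuming `-map-by node` simply places ranks round-robin across the six 20-slot nodes shown in the allocation. This is a simplified model, not Open MPI's actual mapper, and the helper names `map_by_node` and `oversubscribed_nodes` are made up for illustration. It only counts slots and does not model the binding check that produced the `#cpus: 10` message above.

```python
def map_by_node(num_nodes, nprocs):
    """Place ranks round-robin across nodes; return the rank count per node."""
    counts = [0] * num_nodes
    for rank in range(nprocs):
        counts[rank % num_nodes] += 1
    return counts


def oversubscribed_nodes(counts, slots_per_node):
    """Return the indices of nodes holding more ranks than they have slots."""
    return [i for i, c in enumerate(counts) if c > slots_per_node]


if __name__ == "__main__":
    NUM_NODES = 6        # nd01..nd06 in the allocation
    SLOTS_PER_NODE = 20  # slots=20 on every node

    for nprocs in (120, 121, 122):
        counts = map_by_node(NUM_NODES, nprocs)
        over = oversubscribed_nodes(counts, SLOTS_PER_NODE)
        print(nprocs, counts, "oversubscribed node indices:", over)
```

Under this model, 121 processes already puts 21 ranks on the first node, so both the 121 and 122 cases exceed the slot count on at least one node, and I would expect the same treatment for both.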
