Testing on 1.10.3rc3, I found a major performance issue:
The SCIF BTL has too high an exclusivity and inappropriately replaces vader/sm as the shared-memory transport, causing very poor performance.
The deployment is 1 rank/core on a 20-core-per-node IB56G cluster. In that setup, Reduce for msglen >= 256k and Reduce_scatter for msglen >= 2k perform extremely poorly. Notably, Allreduce performs much better than Reduce (!) for the same message size.
Reduce on a single node with multiple cores is very bad, while Reduce over pure IB (multi-node, 1 rank per node) is fine, so the issue is a bad interaction between the collective and the shared-memory transport.
Digging further, I found that the SCIF BTL has a very high exclusivity and takes over from vader/sm as the shared-memory transport; it is the culprit for the very poor observed performance. The problem is resolved by forcing -mca btl openib,vader,sm,self.
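For reference, a few equivalent ways to apply the workaround (a sketch; these follow standard Open MPI MCA conventions, adjust paths for your install):

# On the mpirun command line:
mpirun --mca btl openib,vader,sm,self -hostfile /opt/etc/arc.machinefile.ompi -np 180 -map-by slot ./IMB-MPI1

# Or persistently via the environment:
export OMPI_MCA_btl=openib,vader,sm,self

# Or in $HOME/.openmpi/mca-params.conf:
btl = openib,vader,sm,self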
#----------------------------------------------------------------
# Benchmarking Reduce 
# #processes = 16 
# ( 164 additional processes waiting in MPI_Barrier)
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
         8192         1000        24.84        24.90        24.87
        16384         1000        46.44        46.54        46.50
        32768         1000       108.62       108.73       108.70
        65536          640       164.82       165.05       165.00
       131072          320       306.90       307.72       307.57
       262144           41    165738.61    165866.98    165809.52
       524288           18    699844.72    724503.50    712673.69
#----------------------------------------------------------------
# Benchmarking Reduce_scatter 
# #processes = 2 
# ( 178 additional processes waiting in MPI_Barrier)
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.05         0.06         0.05
            4         1000         0.40         0.44         0.42
            8         1000         1.03         1.03         1.03
...
         2048         1000         2.43         2.43         2.43
         4096         1000         3.16         3.16         3.16
         8192         1000      2240.33      2240.38      2240.35
/opt/ompi-1.10.3rc3/bin/mpirun -hostfile /opt/etc/arc.machinefile.ompi -np 180  --display-allocation   -map-by slot $PWD/IMB-MPI1  
======================   ALLOCATED NODES   ======================
    arc00: slots=20 max_slots=0 slots_inuse=0 state=UNKNOWN
    arc01: slots=20 max_slots=0 slots_inuse=0 state=UNKNOWN
    arc02: slots=20 max_slots=0 slots_inuse=0 state=UNKNOWN
    arc03: slots=20 max_slots=0 slots_inuse=0 state=UNKNOWN
    arc04: slots=20 max_slots=0 slots_inuse=0 state=UNKNOWN
    arc05: slots=20 max_slots=0 slots_inuse=0 state=UNKNOWN
    arc06: slots=20 max_slots=0 slots_inuse=0 state=UNKNOWN
    arc07: slots=20 max_slots=0 slots_inuse=0 state=UNKNOWN
    arc08: slots=20 max_slots=0 slots_inuse=0 state=UNKNOWN
=================================================================
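To confirm which BTL is winning the shared-memory path, the component parameters (including exclusivity) can be inspected with ompi_info; a sketch, assuming the SCIF BTL registers the standard btl_<name>_exclusivity parameter and that the component is built on the system:

# Show the SCIF BTL's MCA parameters, including its exclusivity:
ompi_info --param btl scif --level 9

# Compare against the shared-memory BTLs it is displacing:
ompi_info --param btl vader --level 9
ompi_info --param btl sm --level 9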