WeeklyTelcon_20160209
Geoff Paulsen edited this page Feb 9, 2016 · 10 revisions
    - Dialup Info: (Do not post to public mailing list or public wiki)
 
- Jeff Squyres
 - Geoff Paulsen
 - Brad Benton
 - Edgar Gabriel
 - Howard Pritchard
 - Joshua Ladd
 - Nathan Hjelm
 - Nysal Jan
 - Ralph
 - Ryan Grant
 - Sylvain Jeaugey
 - Todd Kordenbrock
 - Yohann Burette
 
- Milestones: https://github.com/open-mpi/ompi-release/milestones/v1.10.3 - Targeting April, unless there is a need.
- Nathan will look at the 0-byte send issue.
 - dev list of SLURM issues already fixed in 1.10.2
 - verbs usNIC not built by default - waiting for review by Howard.
 - Fortran 08 - Jeff will take a look today.
 - SLES 12 - was a race condition in fork/exec before SIGCHLD detection.  Fixed.
- Long-running jobs (Linpack) are still having SIGCHLD issues.
 
 
 
- Wiki: https://github.com/open-mpi/ompi/wiki/Releasev20
 - Blocker Issues: https://github.com/open-mpi/ompi/issues?utf8=%E2%9C%93&q=is%3Aopen+milestone%3Av2.0.0+label%3Ablocker
- Issue 1215 https://github.com/open-mpi/ompi/pull/1335: grpcomm errors
- Ralph is unable to replicate. It was not seen on Trinity or elsewhere at scale. The location of the problem has been found, but it is still unclear why the solution isn't working. Currently in the Ralph-and-Jeff-are-iterating phase.
 
 - https://github.com/open-mpi/ompi/issues/1252: bad perf caused by openib
- Only fails if openib finds valid procs. Shows up as soon as you call ibv_poll_cq on the 2nd socket. Still around 3 ms for openib intra-node.
 - Specific to Mellanox MOFED 3.0 verbs.
 - Mellanox has seen a far-socket cost of around 100 ns, not 7 ms!
 
 
 - PR 927 - needs a Ralph review
- (the X / test failure was due to GitHub being down -- it's a false failure)
 
 - Issue 1299 - Nathan: hang in osc pt2pt.
 - Milestones: https://github.com/open-mpi/ompi-release/milestones/v2.0.0
 - Mellanox would like new hcoll entry points to go into 2.0.
 - Issue with add_procs on big-endian machines.  For now, want the minimum change needed to get 2.0 out.
- Easiest solution: on 32-bit and big-endian platforms, don't turn on dynamic add_procs.
 
 
- RFC to set the add_procs_cutoff to 32. PR1340
 - --host vs. --hostfile behavior PR1344
- how many procs to run
 - Jeff would like this to be consistent with how oversubscription works, but with no -np it runs 1 proc.
 - Two issues: how many slots, and how many processes.
 - Change behavior so that if the user doesn't specify -np but DOES specify --host, we'll get 1 slot (and one process).
 - Keep hostfile behavior the same as today.
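
The proposed --host default can be written down as a tiny decision function. This is only an illustrative sketch in Python, not Open MPI code: `host_defaults` and its parameters are hypothetical names, and it models just the one case decided above (--host given, -np omitted). Every other combination, including hostfile handling, is deliberately left undecided because the notes only say it stays as it is today.

```python
from typing import Optional, Sequence, Tuple


def host_defaults(np: Optional[int],
                  hosts: Sequence[str]) -> Optional[Tuple[int, int]]:
    """Return (slots, processes) for the case decided in the meeting.

    Hypothetical helper: returns None whenever the notes do not state
    a new behavior (for example when -np is given, or no --host is used).
    """
    if hosts and np is None:
        # --host without -np: 1 slot and one process.
        return (1, 1)
    # Anything else keeps today's behavior, which is not modeled here.
    return None


if __name__ == "__main__":
    print(host_defaults(None, ["node1", "node2"]))
    print(host_defaults(4, ["node1", "node2"]))
```

Keeping slots and processes as two separate values mirrors the "two issues" split above: the slot count describes what the host offers, the process count describes what gets launched.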
 
 
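
The big-endian workaround and the add_procs_cutoff RFC from the 2.0 discussion above interact, so a rough sketch may help. It rests on assumptions that the notes do not confirm: that "32bit and BIG-ENDIAN" means either property disables dynamic add_procs, and that the cutoff means jobs larger than the cutoff add procs dynamically while smaller jobs add them all up front. The function name and parameters are hypothetical; this is not the Open MPI implementation.

```python
import struct
import sys

ADD_PROCS_CUTOFF = 32  # value proposed in the RFC


def use_dynamic_add_procs(world_size: int,
                          cutoff: int = ADD_PROCS_CUTOFF,
                          big_endian: bool = sys.byteorder == "big",
                          is_32bit: bool = struct.calcsize("P") == 4) -> bool:
    """Decide whether dynamic add_procs would be enabled (sketch only)."""
    # Assumption: either a 32-bit or a big-endian platform turns it off.
    if big_endian or is_32bit:
        return False
    # Assumption: only jobs larger than the cutoff use dynamic add_procs.
    return world_size > cutoff


if __name__ == "__main__":
    for size in (16, 32, 64):
        print(size, use_dynamic_add_procs(size))
```

The platform flags default to what the running interpreter reports, and can be overridden to check how another platform would be treated.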
- LANL
 - Houston
 - HLRS
 - IBM
 
- LANL, Houston, HLRS, IBM
 - Cisco, ORNL, UTK, NVIDIA
 - Mellanox, Sandia, Intel