
WeeklyTelcon_20160126

Geoff Paulsen edited this page Jan 26, 2016 · 12 revisions

Open MPI Weekly Telcon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees

  • Geoff Paulsen
  • Jeff Squyres
  • Brad Benton
  • Edgar Gabriel
  • Geoffroy Vallee
  • Joshua Ladd
  • Nathan Hjelm
  • Ralph Castain
  • Ryan Grant
  • Sylvain Jeaugey
  • Todd Kordenbrock

Agenda

Review 1.10

  • Milestones: https://github.com/open-mpi/ompi-release/milestones/v1.10.3
  • 1.10.2 went out the door.
  • Already have a bug (Gilles); Ralph fixed it.
  • Another bug: broken Fortran F08 bindings, which Jeff saw late last night.
    • If it's broken, how did it pass testing? Jeff needs a day or two to dig into it.
  • Need to verify that library versions are still correct? Jeff took care of it.
  • MPI_Abort investigation (Ralph)? Periodically, MPI_Abort combined with MTT runs into an issue. Perl is suspect; Ralph will look into Ruby or another language.
  • 1.10 C Strided mutex lock issue. (Nathan)?
  • High CPU utilization on async progress thread (Ralph)? Ralph fixed it. One-off for 1.10, not in master. In 1.10.2.

Review 2.0.x

  • Wiki: https://github.com/open-mpi/ompi/wiki/Releasev20
  • Blocker Issues: https://github.com/open-mpi/ompi/issues?utf8=%E2%9C%93&q=is%3Aopen+milestone%3Av2.0.0+label%3Ablocker
    1. Issue 1252 - Nathan's progression decay function progress? Looking at files today.
      • udcm, openib_error_handler - opal_outputs would be sufficient.
    2. Issue 1215 - Group Comm Errors thing (Ralph) - Deal with race condition in ORTE collectives.
      • Launch goes down the tree. Modex goes across the tree.
      • So it is possible to receive a modex message before you receive the launch message.
  • Milestones: https://github.com/open-mpi/ompi-release/milestones/v2.0.0
  • Group Comms weren't working for Comms of powers of 2. (Nathan)? Fixed.
  • ROMIO default for OMPI on Lustre (only), PR 896?
  • 894, 890, 900, 901 - Jeff and Howard are good with these. Jeff?
  • Issue 1292 - Asked Ralph if this is the right way to fix this. (Ralph)
  • Issue 1177 - large message writev, fixed but not merged to master - Test working everywhere but OS X / BSD (George).
    • OS X / BSD limits large message total size to 32K?
    • Not going to fix for 2.0.0
    • Someone can write code to handle OS X / BSD.
  • Issue 1299 - hang (Nathan)?
  • 2.0.0 does not compile on Solaris due to statfs(). Now that we have moved to OMPIO, we're hitting the problem.
    • Edgar is working on it. Solaris has a different number of args and a different return code.
  • Issue 1301 - check max CQ size before creating CQ. (Josh)?
  • HWThreads - Ralph? Talk to Mike about use case?
  • Travis Status on 2.0?
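
The ordering problem noted under Issue 1215 (a modex message arriving before the launch message) can be illustrated with a small sketch. This is not the ORTE code and not necessarily the fix chosen for that issue; the `Daemon` class and its method names are hypothetical. It shows one common way to tolerate such a race: hold early messages until the launch message has been seen.

```python
from collections import deque


class Daemon:
    """Hypothetical daemon that may see a modex message before its launch message."""

    def __init__(self):
        self.launched = False
        self.pending = deque()  # modex messages that arrived too early
        self.processed = []     # modex messages handled, in arrival order

    def on_launch(self):
        # The launch message arrived: drain anything that was held back.
        self.launched = True
        while self.pending:
            self._handle_modex(self.pending.popleft())

    def on_modex(self, msg):
        if not self.launched:
            # Arrived ahead of the launch message: hold it instead of failing.
            self.pending.append(msg)
        else:
            self._handle_modex(msg)

    def _handle_modex(self, msg):
        self.processed.append(msg)


d = Daemon()
d.on_modex("peer-1")  # early: buffered
d.on_launch()         # backlog drained here
d.on_modex("peer-2")  # normal path
print(d.processed)
```

The early message is processed only once the launch message arrives, and arrival order is preserved.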

Review Master?

  • BTL flags = 305 perf got horrible? Edgar?
  • OMPIO not finding PVFS2

MTT status:

  • Cisco was showing timeouts.

Status Updates:

  • Cisco
  • ORNL
  • UTK
  • NVIDIA

Status Update Rotation

  1. Cisco, ORNL, UTK, NVIDIA
  2. Mellanox, Sandia, Intel
  3. LANL, Houston, HLRS, IBM

Back to 2016 WeeklyTelcon-2016
