
WeeklyTelcon_20160112


Open MPI Weekly Telcon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees

  • Brad Benton
  • Edgar Gabriel
  • Geoffroy Vallee
  • George
  • Howard
  • Josh Hursey
  • Nathan Hjelm
  • Ralph
  • Ryan Grant
  • Sylvain Jeaugey
  • Todd Kordenbrock

Minutes

Review 1.10

  • Milestones: https://github.com/open-mpi/ompi-release/milestones/v1.10.2
    • mpirun hangs ONLY on SLES 12, with a minimum of 40 procs/node, at the very end of mpirun. Only seen in certain cases; not sure what's going on.
    • Is mpirun not exiting because the ORTED is not exiting? Nathan saw this on 2.0.
    • Wait for Paul Hargrove.
    • No objections to Ralph shipping 1.10.2.

Review 2.0.x

  • Wiki: https://github.com/open-mpi/ompi/wiki/Releasev20

  • Blocker Issues: https://github.com/open-mpi/ompi/issues?utf8=%E2%9C%93&q=is%3Aopen+milestone%3Av2.0.0+label%3Ablocker

    • Group comms weren't working for communicators whose size is a power of 2. Nathan found a massive memory issue.
    • 1252 openib - sounds like we have a localized fix in the works (Nathan).
      • Nathan's been delayed until later this week. Could get done by the middle of next week.
      • The openib BTL specifically could be made to only progress if there is a send/recv message posted.
        • _______ progress also progresses connections.
        • uGNI progress - could check for datagrams only periodically (only a 200ns hit).
      • Prefer to stick with Nathan's original decay function without modifying openib.
    • 1225 - TotalView debugger problem + PMIx.
      • SLURM users use srun, which doesn't have this issue.
      • DDT does NOT have this issue either. Don't know why it's different. Attach FIFO.
        • mpirun waits on a pipe for the debugger to write a 1 on that pipe (see the sketch after this list).
        • Don't see how that CAN work.
        • Nathan's been using attach rather than mpirun --debug. Attach happens after launch, so it doesn't go through this step. Nathan thinks it's not so critical since attach works.
      • Anything will work, as long as you're attaching to a running job.
      • Barring a breakthrough with PMIx notify in the next week, we'll do an RC2 and just carefully document what works/doesn't as far as debuggers.
      • Should disable mpirun --debug on the 2.0 branch and print an error that says it's broken.
      • No longer a blocker for 2.0.0 due to schedule. Still want to fix this for the next release.
    • No new features, except for:
      • Howard will review.
      • Review group comm.
      • Don't know if we'll bother with the pls filesystem.
    • Howard will follow up with Mellanox on the progress of UCX using the Modex stuff.
    • OMPIO + Lustre is slow on the 2.0.0 (and master) branches.
  • Milestones: https://github.com/open-mpi/ompi-release/milestones/v2.0.0
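
For background on the attach FIFO mentioned under item 1225 above: a minimal, hypothetical sketch in plain C of a launcher blocking on a FIFO until the debugger writes a '1'. The function name and error handling are assumptions for illustration; this is not Open MPI's actual debugger-attach code.

```c
/* Illustrative sketch only: a launcher blocking on a FIFO until a
 * debugger signals readiness by writing '1'.  Not Open MPI code. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int wait_for_debugger(const char *fifo_path)
{
    char c = 0;

    /* Create the rendezvous FIFO; ignore the error if it already exists. */
    if (mkfifo(fifo_path, 0600) != 0 && errno != EEXIST) {
        perror("mkfifo");
        return -1;
    }

    /* Opening for read blocks until the debugger opens the FIFO for write. */
    int fd = open(fifo_path, O_RDONLY);
    if (fd < 0) {
        perror("open");
        return -1;
    }

    /* Block until the debugger writes a single '1' byte. */
    if (read(fd, &c, 1) != 1 || c != '1') {
        fprintf(stderr, "unexpected debugger handshake\n");
        close(fd);
        return -1;
    }

    close(fd);
    return 0;  /* debugger signaled readiness; continue launch */
}
```

This also illustrates why attaching to an already-running job sidesteps the problem: the attach path never blocks on this handshake.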

Review Master?

MTT status:

  • Bunch of failures on the master branch. No chance to look at them yet.

  • Failures seen on Cisco and the Ivy cluster.

  • Nathan's seeing a "resource deadlock avoided" error in MPI_Waitall. Some TCP BTL issue - looks like something going on down there. Should be fairly easy to test. Cisco TCP one-sided stuff.

    • Nathan will see if he can figure this out. Haven't changed one-sided pt2pt recently, so surprised. Maybe process locks are on by default? Need to work this out. Just changed the locks from being conditional to being unconditional (see the locking sketch after this list).
  • Edgar found some Lustre issues. OMPI master has bad MPI-IO performance on Lustre. It looked reasonable on master earlier, but now performance is poor. Not completely sure when the performance regressed.

    • For Lustre itself, could switch back to ROMIO as the default.
    • GPFS and others will look good, but Lustre is bad. Can't have OMPIO as the default on Lustre.
    • Problem exists on both the 2.0.0 and master branches.
  • Comm_Info is ready for a pull request; it should go to 2.1 (since the mpool changes were pushed to 2.1).

  • PR 1118 - the mpool rewrite should be ready to go, but want George to look at it and make comments. Probably one of the first 2.1 requests after it goes into master.

  • PR 1296 - PMIx - spreading changes from PMIx across non-PMIx infrastructure. Is that okay?

    • This is just making changes in glue code that is OMPI specific.
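
For background on the conditional vs. unconditional locking mentioned in the MPI_Waitall item above: a minimal, generic pthreads sketch contrasting the two patterns. The function and variable names are assumptions for illustration; this is not the actual Open MPI one-sided code.

```c
/* Illustrative sketch of conditional vs. unconditional locking with
 * pthreads; not Open MPI code. */
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int queue_depth;   /* hypothetical shared state */

/* Conditional: skip the critical section if another caller holds the lock. */
void progress_conditional(void)
{
    if (pthread_mutex_trylock(&lock) == 0) {
        queue_depth++;               /* do a little work */
        pthread_mutex_unlock(&lock);
    }
    /* else: someone else is already making progress; return immediately */
}

/* Unconditional: always wait for the lock, so every caller makes progress,
 * but callers can now block on (and deadlock against) each other. */
void progress_unconditional(void)
{
    pthread_mutex_lock(&lock);
    queue_depth++;
    pthread_mutex_unlock(&lock);
}
```

With unconditional locking, a thread that re-enters while already holding an error-checking mutex gets EDEADLK, whose strerror text is "resource deadlock avoided" - one possible way the message seen in MPI_Waitall could arise.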

Status Updates:

  • Mellanox
  • Sandia
  • Intel

Status Update Rotation

  1. Mellanox, Sandia, Intel
  2. LANL, Houston, HLRS, IBM
  3. Cisco, ORNL, UTK, NVIDIA

Back to 2016 WeeklyTelcon-2016
