WeeklyTelcon_20160119
Geoff Paulsen edited this page Jan 19, 2016 · 5 revisions
- Dialup Info: (Do not post to public mailing list or public wiki)
- Geoff Paulsen
- Jeff Squyres
- Brad Benton
- Edgar Gabriel
- Howard
- Joshua Hursey
- Joshua Ladd
- Nathan Hjelm
- Ralph
- Sylvain Jeaugey
- Todd Kordenbrock
- Milestones: https://github.com/open-mpi/ompi-release/milestones/v1.10.2
- Need to verify that library versions are still correct.
- Cisco Weekend MTT tests didn't look good.
- Build failure also.
- usNIC unable to connect. Maybe a cluster issue.
- Autogen --force didn't bring to 1.10, should remove from Cisco MTT.
- Ralph will try to replicate the MPI_Abort failure (the abort test itself).
- 1.10 C strided mutex lock issue: one failure, on a specific build config. Nathan would not be surprised if it is a bug.
- The --enable-memchecker build could be affecting timing. Nathan will take a look; should be simple.
- Jeff will look at MTT things after call.
- High CPU utilization on Async progress thread. Ralph will take a look. From -GE.
- Once all of these issues are resolved or addressed, 1.10.2 can ship.
- Wiki: https://github.com/open-mpi/ompi/wiki/Releasev20
- Blocker Issues: https://github.com/open-mpi/ompi/issues?utf8=%E2%9C%93&q=is%3Aopen+milestone%3Av2.0.0+label%3Ablocker
- Nathan's progression decay function progress?
- Did Mellanox's UCX Modex stuff get merged in?
- Milestones: https://github.com/open-mpi/ompi-release/milestones/v2.0.0
- Last week we discussed OMPI-IO + Lustre being slow on the 2.0.0 (and master) branches, and making ROMIO the default for OMPI on Lustre (only).
- Last week discussed Group Comms weren't working for Comms of powers of 2. Nathan found massive memory issue.
- Pull Requests - Several that Jeff, Ralph, or Howard need to review.
- PR 896 - not going to help us avoid the Lustre issue. Reduce priority of Lustre below ROMIO.
- Edgar Tested on Cray.
- PRs 894, 890, 900, 901 - Jeff and Howard are fine with these. Jeff will merge them.
- Travis is now being run on 2.0 branch.
- Edgar's PR into master (tries to work around the Lustre issue by switching over to ROMIO).
- Not sure if issues he's seeing on Cray or on his cluster. Could be related, but need to get cluster running again.
- Wanted to see if any warnings from jenkins.
- But running that portion of the code on Edgar's cluster hits many issues.
- BTL flags = 305 perf got horrible (used to get better).
- Did something else change in configure? Hitting one issue after another, independent of OMPIO.
- OMPIO is not finding PFS2 correctly during configure. Jeff can use screen share with Edgar.
- Issues only show up with 96 procs, which makes debugging more difficult.
- LANL
- Houston
- HLRS
- IBM
- LANL, Houston, HLRS, IBM
- Cisco, ORNL, UTK, NVIDIA
- Mellanox, Sandia, Intel