WeeklyTelcon_20160112
- Dialup Info: (Do not post to public mailing list or public wiki)
- Brad Benton
- Edgar Gabriel
- Geoffroy Vallee
- George
- Howard
- Josh Hursey
- Nathan Hjelm
- Ralph
- Ryan Grant
- Sylvain Jeaugey
- Todd Kordenbrock
- Milestones: https://github.com/open-mpi/ompi-release/milestones/v1.10.2
- mpirun hangs ONLY on SLES 12, with a minimum of 40 procs/node, at the very end of mpirun. Only seeing it in certain cases; not sure what's going on.
- Is mpirun not exiting because ORTED is not exiting? Nathan saw this on 2.0.
- Wait for Paul Hargrove.
- No objections to Ralph shipping 1.10.2.
- Blocker Issues: https://github.com/open-mpi/ompi/issues?utf8=%E2%9C%93&q=is%3Aopen+milestone%3Av2.0.0+label%3Ablocker
- Group comms weren't working for comms with power-of-2 sizes. Nathan found a massive memory issue.
- 1252 (openib): sounds like we have a localized fix in the works (Nathan).
- Nathan has been delayed until later this week. Could get it done by the middle of next week.
- An openib-BTL-specific fix could make it progress only when there is a send/recv message posted.
- _______ progress also progresses connections.
- ugni progress: could check for datagrams only every so often (only a 200 ns hit).
- Prefer to stick with Nathan's original decay function without modifying openib (a generic sketch of the throttling idea follows this item).
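For context, the following is a generic, self-contained sketch of the throttling idea discussed above: poll the expensive path (connections/datagrams) only occasionally while idle, and on every call when work is pending. All names, the interval, and the logic are illustrative assumptions, not Nathan's actual decay function or the openib/ugni code.

```c
/* Generic sketch of throttling an expensive check inside a progress loop
 * (the "decay" idea noted above).  All names, values, and logic here are
 * illustrative assumptions -- this is not the actual openib/ugni or
 * opal_progress code. */
#include <stdio.h>

#define IDLE_POLL_INTERVAL 64         /* illustrative value */

static unsigned int idle_skip = 0;
static int pending_messages   = 0;    /* stands in for posted sends/recvs */

/* Hypothetical stand-in for the costly work: progressing connections and
 * polling for datagrams. */
static int progress_connections_and_datagrams(void)
{
    return 0;                         /* nothing completed in this stub */
}

/* Called on every progress iteration.  When requests are pending, always
 * do the expensive poll; when idle, only do it every Nth call so the
 * common idle path stays cheap (the small per-call hit noted above). */
static int progress_once(void)
{
    if (pending_messages > 0 || ++idle_skip >= IDLE_POLL_INTERVAL) {
        idle_skip = 0;
        return progress_connections_and_datagrams();
    }
    return 0;
}

int main(void)
{
    for (int i = 0; i < 1000; ++i) {
        progress_once();
    }
    printf("done\n");
    return 0;
}
```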
- 1225 - TotalView debugger problem + PMIx.
- SLURM users launch with srun, which doesn't have this issue.
- DDT does NOT have this issue either; don't know why it's different. Attach FIFO.
- mpirun waits on a pipe for the debugger to write a 1 on that pipe (a minimal sketch of this handshake follows this item).
- Don't see how that CAN work.
- Nathan has been using attach rather than mpirun --debug. Attach happens after launch, so it doesn't go through this step. Nathan thinks this is not so critical since attach works.
- Anything will work, as long as you're attaching to a running job.
- Barring a breakthrough with PMIx notify in the next week, we'll do an RC2 and just carefully document what works/doesn't as far as debuggers.
- Should disable mpirun --debug on the 2.0 branch and print an error saying it's broken.
- No longer a blocker for 2.0.0 due to schedule. Still want to fix this for next release.
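For reference, here is a minimal, self-contained sketch of the kind of attach-FIFO handshake described above: the launcher creates a FIFO and blocks until the debugger writes a 1 into it. The path, names, and error handling are illustrative assumptions; this is not the actual orterun/MPIR code.

```c
/* Minimal sketch of the handshake described above: the launcher creates a
 * FIFO and blocks until the debugger writes a 1 into it.  Illustrative
 * only; not the actual orterun/MPIR implementation. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

static int wait_for_debugger_release(const char *fifo_path)
{
    unsigned char byte = 0;
    int fd;

    if (mkfifo(fifo_path, 0600) != 0) {
        perror("mkfifo");
        return -1;
    }
    /* Opening the read end blocks until a writer (the debugger) opens
     * the other end of the FIFO. */
    fd = open(fifo_path, O_RDONLY);
    if (fd < 0) {
        perror("open");
        unlink(fifo_path);
        return -1;
    }
    /* Block until the debugger writes a single byte; the protocol in the
     * notes expects the value 1 before the job is released. */
    if (read(fd, &byte, 1) != 1 || byte != 1) {
        close(fd);
        unlink(fifo_path);
        return -1;
    }
    close(fd);
    unlink(fifo_path);
    return 0;
}

int main(void)
{
    /* Hypothetical FIFO path for the sketch. */
    if (wait_for_debugger_release("/tmp/example_attach_fifo") == 0) {
        printf("debugger released the job; continuing launch\n");
    }
    return 0;
}
```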
- No new features (except for
- Howard will review
- Review group comm.
- Don't know if we'll bother with the pls filesystem.
- Howard will follow up with Mellanox on the progress of UCX using the modex stuff.
- OMPIO + Lustre is slow on the 2.0.0 (and master) branches.
- Milestones: https://github.com/open-mpi/ompi-release/milestones/v2.0.0
- Bunch of failures on the master branch. No chance to look at them yet.
- Cisco and Ivy cluster.
- Nathan is seeing a "resource deadlock avoided" error in MPI_Waitall. Some TCP BTL issue; looks like something is going on down there. Should be fairly easy to test (Cisco TCP one-sided stuff).
- Nathan will see if he can figure this out. The one-sided pt2pt code hasn't changed recently, so this is surprising. Maybe proc locks are on by default? Need to work this out. The locks were just changed from conditional to unconditional (an illustration of the error follows this item).
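One common way a "Resource deadlock avoided" (EDEADLK) error surfaces is when an error-checking mutex is locked unconditionally by a thread that already holds it. The standalone program below only illustrates that symptom; it is an assumption about the likely mechanism, not the OMPI one-sided/TCP code.

```c
/* Standalone illustration (not OMPI code) of how a "Resource deadlock
 * avoided" (EDEADLK) error can surface: an error-checking pthread mutex
 * reports EDEADLK when the owning thread locks it a second time
 * unconditionally instead of checking or trylocking first. */
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    pthread_mutex_t lock;
    pthread_mutexattr_t attr;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    pthread_mutex_init(&lock, &attr);

    pthread_mutex_lock(&lock);          /* first (legitimate) acquisition */
    int rc = pthread_mutex_lock(&lock); /* unconditional re-lock by the same thread */

    /* On Linux this prints "Resource deadlock avoided" (rc == EDEADLK). */
    printf("second lock: %s\n", strerror(rc));

    pthread_mutex_unlock(&lock);
    pthread_mutex_destroy(&lock);
    pthread_mutexattr_destroy(&attr);
    return 0;
}
```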
- Edgar found some Lustre issues. OMPI master has bad MPI-IO performance on Lustre. It looked reasonable on master before, but now performance is poor; not completely sure when the performance changed.
- For Lustre itself, we could switch back to ROMIO as the default.
- GPFS and others look good, but Lustre is bad. Can't have OMPIO as the default on Lustre (a minimal comparison sketch follows this item).
- Problem for both the 2.0.0 and master branches.
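As a reference point, a minimal MPI-IO write test like the sketch below can be run twice, once with OMPIO and once with ROMIO selected through Open MPI's io MCA framework (e.g. `mpirun --mca io <component>`; exact component names vary by release), to compare Lustre performance. The file path, transfer size, and structure are illustrative assumptions, not Edgar's actual benchmark.

```c
/* Minimal MPI-IO collective write test (a sketch, not the benchmark used
 * in the meeting).  Run it once per io component and compare the timings.
 * The file path and chunk size are illustrative. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define CHUNK (4 * 1024 * 1024)   /* 4 MiB per rank, illustrative size */

int main(int argc, char **argv)
{
    MPI_File fh;
    int rank;
    char *buf;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    buf = malloc(CHUNK);
    for (int i = 0; i < CHUNK; i++) buf[i] = (char)i;

    MPI_File_open(MPI_COMM_WORLD, "/path/on/lustre/testfile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    t0 = MPI_Wtime();
    /* Each rank writes its own contiguous chunk at a disjoint offset. */
    MPI_File_write_at_all(fh, (MPI_Offset)rank * CHUNK, buf, CHUNK,
                          MPI_CHAR, MPI_STATUS_IGNORE);
    t1 = MPI_Wtime();

    MPI_File_close(&fh);
    if (rank == 0) printf("write time: %f s\n", t1 - t0);

    free(buf);
    MPI_Finalize();
    return 0;
}
```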
- Comm_Info is ready for a pull request; it should go to 2.1 (since the mpool changes were pushed to 2.1).
- PR 1118 - the mpool rewrite should be ready to go, but we want George to look at it and make comments. Probably one of the first 2.1 requests after it goes into master.
- PR 1296 - PMIx: spreading changes from PMIx across non-PMIx infrastructure. Is that okay?
- This is just making changes in glue code that is OMPI-specific.
- Mellanox
- Sandia
- Intel
- Mellanox, Sandia, Intel
- LANL, Houston, HLRS, IBM
- Cisco, ORNL, UTK, NVIDIA