WeeklyTelcon_20220524
Jeff Squyres edited this page May 24, 2022 · 3 revisions
    - Dialup Info: (Do not post to public mailing list or public wiki)
- Jeff Squyres (Cisco)
- Austen Lauria (IBM)
- Brian Barrett (AWS)
- David Bernholdt (ORNL)
- Josh Fisher (Cornelis Networks)
- Thomas Naughton (ORNL)
- Todd Kordenbrock (Sandia)
- William Zhang (AWS)
- Hessam Mirsadeghi (UCX/NVIDIA)
- Tommy Janjusic (NVIDIA)
- George Bosilca (UTK)
- Edgar Gabriel (AMD)
- Howard Pritchard (LANL)
- Joseph Schuchart (UTK)
- Josh Hursey (IBM)
- Matthew Dosanjh (Sandia)
- Geoffrey Paulsen (IBM)
- Brendan Cunningham (Cornelis Networks)
- Akshay Venkatesh (NVIDIA)
- Artem Polyakov (NVIDIA)
- Aurelien Bouteiller (UTK)
- Brandon Yates (Intel)
- Christoph Niethammer (HLRS)
- Harumi Kuno (HPE)
- Joshua Ladd (NVIDIA)
- Marisa Roman (Cornelis Networks)
- Mark Allen (IBM)
- Michael Heinz (Cornelis Networks)
- Nathan Hjelm (Google)
- Sam Gutierrez (LLNL)
- Xin Zhao (NVIDIA)
- 4.1.4
- Will do today!
- We'll do 4.1.5 whenever it is relevant
 
- Open question about COMM_SPLIT_TYPE from user
- George is investigating. Could be a PMIx issue...?
- Need to investigate main/v5.0 after that.
 
- Will roll RC next week (after long weekend)
- Build fixes and bug fixes have come in
- 2 OMPI blockers:
- ADAPT and HAN priorities
- Setting the priorities is easy
- The Bosilca paper shows really good results
- EFA/ARM shows a slight improvement on short messages and a slight regression on large messages. This could be an EFA issue, but it could also be a main/5.0 issue introduced since the paper was written. Can someone who isn't EFA re-run the tests and ensure we don't have a regression? Meaning: we have one data point that doesn't look good, but it's also not an entirely trustworthy data point. We need more data.
- George: we were busy last week, sorry. :-(
- Would even be good for others to run, too.
- Point: tuned is pretty good when processes are mapped well. ADAPT runs well all the time. Maybe try to run EFA with "poorly mapped" processes...?
- IBM said that they would re-run with UCX. I'll run on a "handful" of nodes.
- AWS ran on 16 or 32 nodes.
- Joseph also volunteered to re-run.
- NVIDIA: we'll run the tests, too.
- The ask is to run OSU or IMB collective benchmarks.
- David B. volunteered Tom Naughton.
- Josh F. from Cornelis will run as well.
- https://github.com/open-mpi/ompi/issues/10347 is the issue.
 
- mpirun external dependencies
- progress is being made, slowly.
 
 
- Main blockers are PRTE and PMIx issues.
- Need to fix PRTE 2.1.x blockers before there will be a PRTE release. These are not OMPI blockers, but they need to be fixed before a PRTE 2.1 release.
- Link to these issues: https://github.com/openpmix/prrte/issues?q=is%3Aissue+is%3Aopen+label%3A%22Target+2.1%22
 
- To be clear: there are PRTE 2.0.x releases.
- We previously thought we would be able to use these for OMPI v5.0.x.
- This has unfortunately turned out to not be the case -- there are new PRTE v2.1.x features that we really need for OMPI v5.0.x.
- We really need people from the community to help fix the PRTE 2.1.x issues: without a PRTE 2.1.x release, we cannot release OMPI v5.0.x.
 
 
Bottom line: we need resources to help with PRTE 2.1 release.
Per last week's discussion, we have now decided what the minimum versions are for OMPI v5.0.x:
- PMIx 4.0
- PRTE 2.1
There is a pending PR to change our configury.
NOTE: We should not set any public release to a PMIx / PRTE version that does not exist.
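
For illustration only, the minimum-version rule above can be sketched as a small check. This is not the actual Open MPI configury (the pending PR is not shown here); `MIN_VERSIONS`, `parse_version`, and `meets_minimum` are hypothetical names, and the only values taken from these notes are the two minimums.

```python
# Hypothetical sketch of a minimum-version gate; not Open MPI's configure logic.
# The two minimums come from the notes above; everything else is an assumption.
MIN_VERSIONS = {"PMIx": (4, 0), "PRTE": (2, 1)}


def parse_version(text):
    """Return the leading integer of each dot-separated component as a tuple."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first component with no leading digits
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(package, found):
    """True if the version string `found` satisfies the minimum for `package`."""
    minimum = MIN_VERSIONS[package]
    version = parse_version(found)
    # Pad short versions such as "4" with zeros so they compare as "4.0".
    version = version + (0,) * (len(minimum) - len(version))
    return version[:len(minimum)] >= minimum


if __name__ == "__main__":
    for package, found in [("PMIx", "4.0.0"), ("PRTE", "2.0.2")]:
        status = "ok" if meets_minimum(package, found) else "too old"
        print(f"{package} {found}: {status}")
```

Only the major and minor components are compared, so a release candidate suffix on a later component does not affect the result in this sketch.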
- Old issue that has re-surfaced: Intercomm communicators (when using more than 1 node) are hanging on main/v5.0.x. Josh Hursey thinks it might involve PMIx_Connect.
- These two issues seem to be dups of the same core issue:
- No one has looked into this.
- Howard points out that RHC looked at this a while ago, and wrote up a suggestion: https://github.com/open-mpi/ompi/issues/10110.
- Corresponding PRTE issue: https://github.com/openpmix/prrte/issues/964
- This will also depend on which PMIx version we're using.
- There may also be a PRTE dependency...? Unknown. Need to triage.
 
- This is a regression for OMPI.
- Josh hasn't had a chance to triage this yet. Hopes to triage it soon.
 
- Lisandro has hit a segv with partitioned sending.
- https://github.com/open-mpi/ompi/issues/10390
- No response from Matthew this past week.
- Todd will reach out to Matthew internally.
 
- New: ULFM issue: https://github.com/open-mpi/ompi/issues/10389
- Still investigating.
 
- Howard and Geoff not here -- nothing new to discuss.
- Did not get to discuss this. See notes from last meeting.
- Did not get to discuss this. See notes from last meeting.
- Did not get to discuss this. See notes from last meeting.