WeeklyTelcon_20200908
- Dialup Info: (Do not post to public mailing list or public wiki)
- NOT-YET-UPDATED
- No driver for 4.0.6 right now
- Adapt bug
- Doesn't handle non-commutative MPI_Ops, so they're trying to fall back, but having some issues here.
- Joseph Schuchart has been working on the fallback, and will have a PR soon.
- MTT found these issues.
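Why the fallback matters can be sketched in a few lines of plain C. This is not Adapt code: `mat2`, `reduce_in_order` and `reduce_in_arrival_order` are hypothetical names, and 2x2 matrix multiplication stands in for a user-defined MPI_Op created with `commute = 0`. An algorithm that combines contributions in arrival order is only correct when the operation commutes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical 2x2 integer matrix. The matrix product is associative
 * but not commutative, like a user-defined MPI_Op with commute = 0. */
typedef struct { long m[2][2]; } mat2;

static mat2 mat2_mul(mat2 a, mat2 b)
{
    mat2 r;
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++)
            r.m[i][j] = a.m[i][0] * b.m[0][j] + a.m[i][1] * b.m[1][j];
    return r;
}

static int mat2_equal(mat2 a, mat2 b)
{
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++)
            if (a.m[i][j] != b.m[i][j])
                return 0;
    return 1;
}

/* Rank-ordered reduction: v[0] * v[1] * ... * v[n-1]. */
static mat2 reduce_in_order(const mat2 *v, size_t n)
{
    assert(n > 0);
    mat2 acc = v[0];
    for (size_t i = 1; i < n; i++)
        acc = mat2_mul(acc, v[i]);
    return acc;
}

/* Reduction that combines contributions in the order listed in perm,
 * as an algorithm that assumes commutativity is free to do. */
static mat2 reduce_in_arrival_order(const mat2 *v, const size_t *perm, size_t n)
{
    assert(n > 0);
    mat2 acc = v[perm[0]];
    for (size_t i = 1; i < n; i++)
        acc = mat2_mul(acc, v[perm[i]]);
    return acc;
}
```

With three contributions, swapping the first two in `perm` already changes the result, which is the case a rank-ordered fallback has to cover.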
 
- Blocking on HAN issues.
- HAN has lazy initialization.
- On first call need to create sub-communicators.
- Better approach will be to use an INFO key to prevent HAN from creating subcommunicators on user's subcommunicators.
- This would be in OMPI communicator creation, not on coll_base
- George is working this.
 
 
- Giles created PR 7762 - Missing some MPI_Status
- PR's been around for a few months.
- Mark Allen working on this.
- Do we want to take this to v4.1 after it hits master? Not a blocker.
 
- 
Been in a holding pattern - Josh Ladd is ready and willing for RM work, has just been busy with nVidia/Mellanox transition.
 
- 
Schedule: PMIx v4.0 Standard is in good shape. - libpmix in September
- PRRTE in October
 
- 
ULFM review - What are our Internal ABI guarantees?
- Example: in ULFM pull request changes sizeof(ompi_proc_t)
 
- size changes if ULFM is configured in or not.
- ompi_proc_t is used by SHMEM and they're using the extension space, so WE can't use that.
- ompi_proc_t is something that leaks into MPI API space... :(
- Brian will look at this.
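The ABI concern can be illustrated with a small sketch. The two structs below are hypothetical stand-ins, not the real `ompi_proc_t` layout: they model how the same type looks to code built without and with an optional feature. Shared fields keep their offsets, but the size differs, so any code that indexes an array or embeds the struct using the other build's size ends up at the wrong address.

```c
#include <assert.h>
#include <stddef.h>

/* Layout as seen by code built WITHOUT the optional feature (hypothetical). */
typedef struct {
    void *super;
    int   flags;
    void *endpoints[4];
} demo_proc_plain_t;

/* Layout as seen by code built WITH the optional feature (hypothetical). */
typedef struct {
    void *super;
    int   flags;
    void *endpoints[4];
    int   is_active;      /* extra field that exists only in this build */
} demo_proc_ft_t;

/* Address of element `index` when the caller believes each element
 * occupies elem_size bytes. */
static const void *demo_element_at(const void *array, size_t elem_size, size_t index)
{
    assert(array != NULL && elem_size > 0);
    return (const char *)array + elem_size * index;
}
```

A caller compiled against `demo_proc_plain_t` that walks an array of `demo_proc_ft_t` lands in the middle of an element from index 1 onward.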
 
 
- 
Questions about users doing their own PMIx implementation. - Is OMPI v5.0 going to #if 0 all of the PMIx APIs not needed by MPI?
- Consensus
 
- If they implement their own pmix, they want to implement the bare minimum.
- OMPI v5 will require PMIx v3
- We should point out that we already have an existing way to interface with older PMIx, and they should use that.
- Want to support OMPI v5 in FLUX is the issue.
 
- Joseph Schuchart opened PR7972 Info value - get in before v5.0 branches.
- Aurelien - ULFM PR is ready for review. Joseph agreed to review.
- 8007 is waiting for review (George?)
- Been doing some work from Tools side.
- A lot of new work needed to stabilize it.
- Not too many bug reports lately, but maybe some more as use picks up.
- Some ULFM and scale testing.
- Open MPI master submodule update is manual process.
- Release candidate of document for PMIx v4.0 standard.
- Bulk of standard changes was pushed yesterday.
- Which version of PMIx should Open MPI master track?
- End goal would be to track PMIx releases.
- First week of October is target for Open PMIx v4.0 release
 
HWLOC initialization thing. (Issue #7937)
- trivial to fix in master.
- Once Brian gets his configure stuff in.
- May need someone else to finish.
- Should be able to call PMIx Init, and ____ init, don't need opal init at beginning of MPI_Init.
- This won't work going back into releases.
- buried in mca system.
- need
 
- What to do about fixing release branches.
- Can't give local topology without ___
- Don't run it at scale.
- The portable way to get it, is hwloc.
revert libevent https://github.com/open-mpi/ompi/pull/7940
- Summary: We committed some code
- Race condition we always win (because it happens at finalize and haven't cared), but now in ULFM (and possibly Sessions)
- We switched the configury logic so we always prefer external libevent (above a certain level of external libevent).
- Most OSes are above that level, so almost always prefer external libevent.
- If we get the fix into our internal libevent,
- Concern is that unless we or users explicitly request internal libevent, we'll almost never get this fix.
 
- One solution would be
 
- Can't think of another solution.
- Packagers don't like to use our internal component
- Only thing we can think of is if you want ULFM, you can't use external libevent.
 
- Progress of getting PR accepted upstream?
- Yes, prepared an upstream libevent PR.
- They want a non-open-mpi reproducer.
- Have ideas on how to create this reproducer, but not sure if it's very easy.
- Original code writer added some protection, but has since retired.  This PR removes this protection.
- Actually "we" added this race condition protection in libevent.  It delays removal of file descriptor until too late.
- The fix validates the FD before handling. Sounds right to all.
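The idea of validating a file descriptor before handling it can be sketched as follows. This is not the libevent patch discussed above: `demo_fd_is_open`, `demo_dispatch_if_valid` and `demo_count_handler` are hypothetical names, and the check relies only on the POSIX behavior that `fcntl(fd, F_GETFD)` fails with `EBADF` on a closed descriptor.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Returns 1 if fd currently refers to an open descriptor, 0 otherwise. */
static int demo_fd_is_open(int fd)
{
    if (fd < 0)
        return 0;
    return fcntl(fd, F_GETFD) != -1 || errno != EBADF;
}

typedef void (*demo_handler_fn)(int fd, void *ctx);

/* Example handler: counts how many times it was invoked. */
static void demo_count_handler(int fd, void *ctx)
{
    (void)fd;
    ++*(int *)ctx;
}

/* Run the handler only if the descriptor is still open.
 * Returns 1 if the handler ran, 0 if the event was dropped. */
static int demo_dispatch_if_valid(int fd, demo_handler_fn handler, void *ctx)
{
    assert(handler != NULL);
    if (!demo_fd_is_open(fd))
        return 0;
    handler(fd, ctx);
    return 1;
}
```

A check like this cannot detect a descriptor number that was closed and then reused by another `open`, so it narrows the window rather than replacing correct removal ordering.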
 
 
 
- Not started yet. Creating
- May be a way to code around this on ULFM, but not really sure, because things get into a bad state, and only way might be to ruin our performance.
 
- If we protect this with configure (when building ULFM and have to use internal libevent).
- It means we move to submodules for libevent, we'd have to "mirror" libevent ourselves
 
- Only master / v5.0
- If we have TCP it could happen, but we disable errors in Finalize so don't hit this issue.
 
- libevent patch to this OLD internal libevent 2022
- It's possible that the problem goes away in newer libevent. But updating libevent was a major hassle.
- George check if code is gone or has been modified in libevent.
- Code is still there in latest libevent (so still need fix).
 
- updating libevent would be a much better solution.
 
- If upgrading to new libevent is answer.
- Jeff will send out Once a year, make sure those who have commit access should
- Have not reviewed yet:
- Amazon, Fujitsu, Google, HPE, Los Alamos, nVidia/Mellanox, IBM
 
- Need to update the spreadsheet saying "looked at".
 
- August 10th, 11th, Monday and Tuesday that week.
- List of Topics to discuss, and presenters.
- On the wiki, start filling in.
 
- Need to figure out snacks.
- Dialup Info: (Do not post to public mailing list or public wiki)
- Call in user - Thomas
- Jeff Squyres (Cisco)
- Artem Polyakov (nVidia/Mellanox)
- Aurelien Bouteiller (UTK)
- Austen Lauria (IBM)
- Barrett, Brian (AWS)
- Christoph Niethammer (HLRS)
- Edgar Gabriel (UH)
- Geoffrey Paulsen (IBM)
- George Bosilca (UTK)
- Howard Pritchard (LANL)
- Joseph Schuchart
- Josh Hursey (IBM)
- Joshua Ladd (nVidia/Mellanox)
- Matthew Dosanjh (Sandia)
- Noah Evans (Sandia)
- Ralph Castain (Intel)
- Naughton III, Thomas (ORNL)
- Todd Kordenbrock (Sandia)
- Tomislav Janjusic
- William Zhang (AWS)
- Akshay Venkatesh (NVIDIA)
- Brandon Yates (Intel)
- Charles Shereda (LLNL)
- David Bernhold (ORNL)
- Erik Zeiske
- Geoffroy Vallee (ARM)
- Harumi Kuno (HPE)
- Mark Allen (IBM)
- Matias Cabral (Intel)
- Michael Heinz (Intel)
- Nathan Hjelm (Google)
- Scott Breyer (Sandia?)
- Shintaro Iwasaki
- Xin Zhao (nVidia/Mellanox)
- mohan (AWS)
- Obtaining cache line size from hwloc topo info.
- trivial to fix in master.
- Once Brian gets his configure stuff in.
- May need someone else to finish.
- Should be able to call PMIx Init, and ____ init, don't need opal init at beginning of MPI_Init.
- This won't work going back into releases.
- buried in mca system.
- need
 
- What to do about fixing release branches.
- Can't give local topology without ___
- Don't run it at scale.
- The portable way to get it, is hwloc.
 
revert libevent https://github.com/open-mpi/ompi/pull/7940
- 
Summary: We committed some code - Race condition we always win (because it happens at finalize and haven't cared), but now in ULFM (and possibly Sessions)
- Aurelien Bouteiller posted a nice summary of the situation and we discussed mitigation
- Doesn't really affect Linux, just macOS
 
- Would like a user visible message if we know we can run, rather than crash.
 
- 
George isn't here today. - Picking it up too late
- PMIX may or may not have this ordering issue.
- PMIx doesn't depend on hwloc (and not using that)
 
 
- 
Should we upgrade our internal libevent to latest 2.1.12? - Reasons for or against?
- Maybe hold off until we get configure code to change it to a submodule.
 
- 
If we make libevent a submodule pointer, then we wouldn't be able to fix problems even if we have bigger problems than this. - For OMPI v5.0, the earliest version of libevent we're going to support out of the box is 2.0.21 (RHEL7)
- Issue 7666
 
- Logic: if the version installed on the system is older, we'll use our bundled copy.
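The selection rule can be written down as a small sketch. In Open MPI the decision is made by configure, not by C code, so everything below (`demo_version_t`, `demo_pick_libevent`, the enum) is a hypothetical illustration of the rule only. The 2.0.21 floor is the value quoted in these notes.

```c
#include <assert.h>

typedef struct { int major, minor, patch; } demo_version_t;

/* Negative, zero, or positive when a is older than, equal to, or newer than b. */
static int demo_version_cmp(demo_version_t a, demo_version_t b)
{
    if (a.major != b.major) return a.major - b.major;
    if (a.minor != b.minor) return a.minor - b.minor;
    return a.patch - b.patch;
}

typedef enum { DEMO_USE_EXTERNAL, DEMO_USE_BUNDLED } demo_choice_t;

/* Prefer the system libevent unless it is missing or older than the floor. */
static demo_choice_t demo_pick_libevent(int have_external, demo_version_t found)
{
    const demo_version_t min_supported = { 2, 0, 21 };

    if (!have_external)
        return DEMO_USE_BUNDLED;
    assert(found.major >= 0 && found.minor >= 0 && found.patch >= 0);
    return demo_version_cmp(found, min_supported) < 0 ? DEMO_USE_BUNDLED
                                                      : DEMO_USE_EXTERNAL;
}
```

Under this rule a system with libevent 2.0.20 gets the bundled copy, while 2.0.21 or 2.1.12 selects the external one.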
 
- 
There is a hypothetical risk that we can't ship patches, and we - MAC configury work
 
- 
ULFM configury work is independent of libevent configury work. 
- 
Do we still merge 7940 to revert it since submodule will replace it completely? - Might be nice for git
 
- AWS backend uses verbs interface in OFI.
- If OFI BTL is there, it initializes first.
- If EFA device is there, initialize OFI BTL before openib BTL won't cause issues.
- If EFA device isn't there, then openib BTL
- But this means mucking around with base initialization code.
 
- Calling ibv_fork_safe() by default.
 
- 
August 10th, 11th, Monday and Tuesday that week. 
- 
List of Topics to discuss, and presenters. - On the wiki, start filling in.
 
- 
Many companies are not allowing face-to-face travel until 2021 due to COVID19. - Instead let's do a series of virtual face-to-face meetings?
 
- 
Yes this summer to discuss for v5.0 - Maybe we can do it by topic?
- Maybe not 4 or 8 hour things.
 
- 
Different topics on different days. 
- 
Do a doodle poll of least-worst days in late July/August. - August 10th-14th - 3 hour block of time 8-11 Pacific time.
- Jeff will do another doodle for days of the week (vote for 2)
 
- 
Start a list of topics. 
- Sessions is now in.
- Partitioned communication voted in.
- OpalTSDCreate - takes a thread storage local key that would be tracked locally in opal.
- But when we go to delete, it's not being deleted.
- But want flexibility to destroy on our own or explicitly
- George thinks the mode we have today, since tracking all keys to be released by main thread.
- George thinks Artem's approach is the correct approach.
 
- Would have to change the way that keys are USED, and different components are using it in a different way.
- Something similar should be done in different places.
- If you do it just for UCX, then others can see how you did it and check for their code.
- So we think current PR is good, but it leaves old API and new API.
- But it might be better to remove OLD way and make broken components do SOMETHING to update their code.
- Should be easy for components to add explicit cleanup calls
 
- Master branch only.
- Opened a new pull request yesterday that addresses the problem as discussed last week.
- Tracking of TLS in common code.
- Have a low level thread specific keys (very simple based on thread implementation)
- Tracked key, probably what you want to use if you want to ensure all TLS is accounted and released at destruction of key.
- Tommy changed all of the places in OMPI where those keys are used. Just use tracked key instead of regular key.
- Changed set_specific and get_specific to just set and get.
- Please review and give suggestions.
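A minimal sketch of the tracked-key idea, assuming a pthread-based thread component. This is not the API from the pull request: `demo_tracked_key_t` and the `demo_tracked_*` functions are hypothetical, and the fixed-size table is a simplification. Every value stored through the key is recorded, so deleting the key can release all of them even for threads that never exit.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define DEMO_MAX_TRACKED 16

/* Hypothetical tracked key: a plain pthread key plus a record of
 * every value stored through it. */
typedef struct {
    pthread_key_t   key;
    void          (*destructor)(void *);
    void           *values[DEMO_MAX_TRACKED];
    int             count;
    pthread_mutex_t lock;
} demo_tracked_key_t;

static int demo_tracked_key_create(demo_tracked_key_t *tk, void (*destructor)(void *))
{
    assert(tk != NULL);
    tk->destructor = destructor;
    tk->count = 0;
    pthread_mutex_init(&tk->lock, NULL);
    /* No pthread-level destructor: cleanup happens at key deletion. */
    return pthread_key_create(&tk->key, NULL);
}

/* Store a value for the calling thread and remember it for later release. */
static int demo_tracked_set(demo_tracked_key_t *tk, void *value)
{
    int rc = -1;
    pthread_mutex_lock(&tk->lock);
    if (tk->count < DEMO_MAX_TRACKED) {
        rc = pthread_setspecific(tk->key, value);
        if (rc == 0)
            tk->values[tk->count++] = value;
    }
    pthread_mutex_unlock(&tk->lock);
    return rc;
}

static void *demo_tracked_get(demo_tracked_key_t *tk)
{
    return pthread_getspecific(tk->key);
}

/* Release every tracked value, then delete the key.
 * Returns the number of values released. */
static int demo_tracked_key_delete(demo_tracked_key_t *tk)
{
    int released = 0;
    pthread_mutex_lock(&tk->lock);
    for (int i = 0; i < tk->count; i++) {
        if (tk->destructor != NULL)
            tk->destructor(tk->values[i]);
        released++;
    }
    tk->count = 0;
    pthread_mutex_unlock(&tk->lock);
    pthread_key_delete(tk->key);
    pthread_mutex_destroy(&tk->lock);
    return released;
}
```

The trade-off is the one raised in the discussion: components must call the delete function explicitly instead of relying on thread exit to run destructors.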
 
- Does it even make sense to do TLS in OPAL at all?
- May indicate that we have an abstraction wrong somewhere.
- If MPI depends on this in OPAL, then it depends on them in PMIx and other layers?
- Not sure if there is a problem, but at a high level, sounds problematic.
 
- Baking in pthread assumptions in general is not a good idea.
- That's what this PR does is abstract pthread semantics.
 
- May be some confusion, no problem with porting this API anywhere.
- Issue raised before is that if you're relying on a certain type of thread in MPI layer.
- But we don't, because there's a framework.
- But Application is linked against PMIx and libevent and to use other threading models is dangerous.
- To make this work, you have to make changes to event polling, etc.
 
 
- Not saying we shouldn't take these patches, these make things better.
- But we do have a problem that other thread components aren't going to "just work", because PMIx and libevent, which use pthreads, conflict with other threading models.
- argobots actually uses pthreads, not sure about qthreads.
- Working on a way to configure libevent to make this combo work.
 
 
- Last week:
- George needs some input on PR
- We don't need _atomic_ in most cases, we just need volatile
- patch linked to the issue PR7914
- We're not breaking things, we just get a lot of valid complaints from the Intel compiler.
- STDOUT of make is ~16 MB due to all the Intel compiler warnings without this fix
 
- There is a PR pending
- Since Open-MPI is a registered non-profit.
- If we log volunteer time we can
- Software in the Public Interest (Parent non-profit)
 
- A week or two
Blockers All Open Blockers
Review v4.0.x Milestones v4.0.5
- Blocked on a PR from George Issue 7937
- 7968 is marked as a blocker, but this is more of a UCX issue, than OMPI issue.
Review v4.1.x Milestones v4.1.0
- 
A couple of pending issues: - OFI issue Amazon is working on.
 
- 
Need Review on PR7991 
- 
Need some more cycles on HAN and Adapt in master, before pull it into v4.1 - AWS will run tests before next week.
- Waiting on George's patch to HAN and Adapt.
 
- 
Schedule: Want to release end-of-July - A minimum of a week, need changes from George on collective components
- Posted a v4.1.0 rc1 to go through mechanisms to ensure we can release.
 
- 
A number of PRs for v4.1 have not yet gone into master. 
- 
PRs against v4.1.x need reviews (and need corresponding PRs to go into master) - A UCX init PR out for 4 weeks, still need a review
 
- 
Release Engineers: Brian (AWS) Jeff Squyres (Cisco) 
- 
The fact that we removed pt2pt in OSC, is causing One-sided. - Nathan agreed to take a look.
 
- 
George found an SM BTL issue at Init on master. Jeff filed Issue 7937 - Summary: Cacheline size is set very late after modex, everything that uses cacheline before modex.
- This is a correctness issue (not optimization) - George on today's call
- At one point we switched to only pulling in topology when we had to, that probably introduced this issue.
- Affects all the way back to v2.x
 
 
- Backing file for shared memory is allocated by one proc, and used by many. Creates before MODEX (uses 128), but later when users try to use, they use a different 128 cacheline.
- Looking at the code, we do this other places as well, but not as dramatic.
- May need an alternative for master (after Brian finishes configury, to bring hwloc out of frameworks).
- How do we fix this?
- Can we just get the cacheline size before we get the rest of topology information? Brice said no.
- Only solution we can see is creating an opal function to do this.
- Would work, but might be slightly less optimal. If every rank does this in parallel, does this cause issues?
- George can look for it, but can't do it before end of week.
- Probably easiest to look in dstore in PMIx that's pretty clean (mentioned in ticket)
 
 
- Who can do this work?
- Showing itself in CUDA issue.
- Tomislav Janjusic (nVidia) will ask some of his colleagues.
 
 
- Because we align some structs based on that, but
- It would be associated with getting the topology (but not retrieved until after the modex)
- Only cuda btl calls the function directly, everyone else extracts from PMIx.
- What we ought to do: there is no harm in getting topology earlier, we just need to ensure PMIx is initialized.
- On v4.1, we don't get the topology before someone requests it much later.
- Must also affect v4.0.x
 
 
- George put a fix into master, but a better change, loading it as soon as PMIx is initialized, would be much better.
- Con is that if we're not in a PMIx environment to share this pointer, then every process will go do this discovery, even if they don't need it later.
- Problem is that the process that creates the backing file, creates it very early.
 
 
- Someone should review all the branches to see whether we get the topology before anyone uses the cacheline size.
- George saw it in SM BTL structures. Deadlock.
- This isn't tested by our CI infrastructure.
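Why a cacheline-size mismatch is a correctness issue rather than a tuning issue can be sketched like this. The layout below is hypothetical (a header followed by slots, each padded to the cacheline size) and is not the SM BTL's actual structure; the sizes used in the usage note are illustrative only. If the process that creates the backing file and the processes that attach to it compute offsets with different cacheline values, they disagree on where each slot lives.

```c
#include <assert.h>
#include <stddef.h>

/* Round n up to a multiple of align, which must be a power of two. */
static size_t demo_align_up(size_t n, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (n + align - 1) & ~(align - 1);
}

/* Byte offset of slot `index` in a shared segment laid out as a header
 * followed by slots, each padded to the cacheline size the caller
 * believes is in effect. */
static size_t demo_slot_offset(size_t header_bytes, size_t slot_bytes,
                               size_t index, size_t cacheline)
{
    return demo_align_up(header_bytes, cacheline)
         + index * demo_align_up(slot_bytes, cacheline);
}
```

For example, with a 40-byte header and 72-byte slots, a creator assuming 64-byte lines places slot 1 at offset 192, while a reader assuming 128-byte lines looks for it at offset 256.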
 
- 
Still want: - George's Collectives
- George is still working on master version of coll
- Next thing he's working on today.
 
- Will probably need to do something to CI to enable these for testing.
- CI not really executing
- IBM will do some testing of this.
- Will need some docs on how users select this.
 
- Tunings for tuned coll
- Nothing to discuss today.
- https://github.com/open-mpi/ompi/pull/7952
 
- AVX
- Went in this morning.
 
- UCX PRs awaiting review.
 
- 
Past: We've come to consensus for a v4.1.0 release - Need include/exclude selection, worried about consistent selection.
- A lot of PRs outstanding, but can't merge until
- Patch for OFI stuff messed up v4.1.x branch.
- Howard has a fix PR, Jeff is looking at.
 
- Howard changed new OFI BTL parameters to be consistent with MTL
- Not breaking ABI or backwards compatibility.
- v4.1.x branch, branched from v4.0.4 tag.
- NOT touching runtime!!!
- Not going to be pulling in a new PMIx version.
 
- 
All MTT is online on v4.1.x branch 
- 
Not compiling under SLURM EFA test. (OFI BTL issue) 
Review v5.0.0 Milestones v5.0.0
- 
No update this week other than master discussion. 
- 
Need to put OSC pt2pt - OSC RDMA requires a single BTL that can contact every single process.
- This didn't use to be the case. (Comment in the code)
 
 
- 
We can't use the OSC pt2pt. - It is not thread safe. Doesn't conform to MPI4 standard. Not safe.
- This is just a testing fallacy. Could add tests to show this, but we'd still be in the same boat.
- Either product A or B is broken and we need to fix it.
 
- 
RDMA Onesided should fall back to "my atomics" because TCP will never have rdma atomics. - The idea was to put the atomics into the BTL base, which could do all of the one-sided atomics under the covers.
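The kind of software fallback described here can be sketched with C11 atomics. This is only the local half of the idea and not Open MPI's BTL code: the `demo_*` names are hypothetical, and a real implementation would also have to get the request to the target process (for instance over active messages) before applying it. The sketch shows fetch-and-add built from compare-and-swap, and a compare-and-swap that reports the previous value.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Fetch-and-add emulated with a compare-and-swap loop.
 * Returns the value that was stored before the addition. */
static int64_t demo_fetch_add_via_cas(_Atomic int64_t *target, int64_t operand)
{
    assert(target != NULL);
    int64_t old = atomic_load(target);
    while (!atomic_compare_exchange_weak(target, &old, old + operand)) {
        /* On failure, old has been refreshed with the current value; retry. */
    }
    return old;
}

/* Compare-and-swap that returns the previous value, whether or not
 * the swap took place. */
static int64_t demo_compare_and_swap(_Atomic int64_t *target,
                                     int64_t expected, int64_t desired)
{
    assert(target != NULL);
    /* On failure, expected is overwritten with the current value. */
    atomic_compare_exchange_strong(target, &expected, desired);
    return expected;
}
```

Placing helpers like these in one shared location is what would let transports without hardware atomics, such as TCP, still serve one-sided atomic operations.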
 
- 
Jeff will close the PR, and 
- 
Jeff will Nathan will fetching, get, compare and swap. 
- 
Two new PRs for MPI4.0 Error handling - new PRs from Aurelien Bouteiller. 
- 
Does UCX support iWarp? - Does libFabric support iWarp via verbs provider?
- https://github.com/openucx/ucx/issues/2507 suggests it doesn't.
- Brian thinks that libFabric
- OFI can support iWarp, just need to specify the provider in the include list.
- This person who's asking is a partner not a customer
 
- 
PMIX - Working on PMIx v4.0.0 which is what Open MPI v5.0 will use.
- Sessions needs something from PMIx v4
- ULFM - not sure if it needs PMIx, think it needs PRRTE changes.
- PPN scaling issue - simple algorithmic issue in this function
- PMIX talked about it. Artem might know someone who might be interested in working on it.
- Algorithm behind one of the interfaces doesn't scale well.
- Not a regression. Above ~ 4K nodes, becomes quadratic.
 
 
- 
PRRTE - Nothing's happening there.
 
- Mostly discussed above.
- We now have a new publicly visible test repo, for new tests
- Haven't tried to do two checkouts (of both public and private test repos) in one MTT run yet.
- Should probably update instructions on how to setup mtt
- Can add new PR based tests if we want. We'll need to add new infrastructure.
 
- George and Jeff will help plan and come to community.
- Done / Submitted.
- Probably won't hear back until Sept.
 
- Probably after super computing.
- scale-testing, PRs have to opt-into it.
Review Master Master Pull Requests
- Dialup Info: (Do not post to public mailing list or public wiki)
- Call in user - Thomas
- Jeff Squyres (Cisco)
- Artem Polyakov (nVidia/Mellanox)
- Aurelien Bouteiller (UTK)
- Austen Lauria (IBM)
- Barrett, Brian (AWS)
- Brendan Cunningham (Intel)
- Christoph Niethammer (HLRS)
- Edgar Gabriel (UH)
- Geoffrey Paulsen (IBM)
- George Bosilca (UTK)
- Howard Pritchard (LANL)
- Joseph Schuchart
- Josh Hursey (IBM)
- Joshua Ladd (nVidia/Mellanox)
- Matthew Dosanjh (Sandia)
- Noah Evans (Sandia)
- Ralph Castain (Intel)
- Naughton III, Thomas (ORNL)
- Todd Kordenbrock (Sandia)
- Tomislav Janjusic
- William Zhang (AWS)
- Akshay Venkatesh (NVIDIA)
- Brandon Yates (Intel)
- Charles Shereda (LLNL)
- David Bernhold (ORNL)
- Erik Zeiske
- Geoffroy Vallee (ARM)
- Harumi Kuno (HPE)
- Mark Allen (IBM)
- Matias Cabral (Intel)
- Michael Heinz (Intel)
- Nathan Hjelm (Google)
- Scott Breyer (Sandia?)
- Shintaro Iwasaki
- Xin Zhao (nVidia/Mellanox)
- mohan (AWS)
- Jeff will send out Once a year, make sure those who have commit access should
- Have not reviewed yet:
- Amazon, Bull, Google, Los Alamos, nVidia/Mellanox
 
- Need to update the spreadsheet saying "looked at".
 
- August 10th, 11th, Monday and Tuesday that week.
- Put stuff on the agenda wiki (URL HERE)
- List of Topics to discuss, and presenters.
- On the wiki, start filling in.
 
- George and Jeff will help plan and come to community.
- Done / Submitted.
 
- May not have Super Computing conference at ALL this year.
- Many other projects are doing a virtual state of the union type meeting to try to cover what they'd usually do in a Birds of a feather meeting.
- Then this works pretty well, and do this a couple of times a year.
- Not constrained to Super Computing
- Almost certain that it will be virtual
- Not sure the cost.
- Ralph and Jeff have been doing ABCs of Open MPI - SO many people.  Done 2 of 3 sessions (each went 1.5 hours, lots of questions)
- Slides and YouTube videos are on the website, and we will send a link to the user list.
- Part 3 is August 5th
 
- Also want an in-depth walkthrough of PMIx initialization / wireup
 
- Schizo SLURM binding detection - Might not need a solution on v4.0.x
- PRs have gone into v4.0.x and v4.1.x
- Since Open-MPI is a registered non-profit.
- If we log volunteer time we can
- Software in the Public Interest (Parent non-profit)
 
- A week or two
Blockers All Open Blockers
Review v4.0.x Milestones v4.0.5
- Discussing CUDA init in UCX PML PR 7898
- Looks like a bugfix, so should be okay to put into a release branch.
- Is there a better place to initialize the CUDA hooks?
- If we request a BTL or PML to be loaded, if configured with cuda
- CUDA library is loaded by BTL that requires it.
- Some questions about possibly making it more generic for all PMLs that use CUDA.
- Don't want to load CUDA if only using TCP or shared memory
 
- We'll take this PR once it passes CI and is reviewed.
 
- v4.0.5 schedule: End of July
- Will create RC1 today after PR7898 goes in.
- Two potential drivers for a quick v4.0.5 turn-around.
- OSC RDMA Bug - May drive a v4.0.5 release.
- Program Aborts on detach.
 
 
Review v4.1.x Milestones v4.1.0
- 
Schedule: Want to release end-of-July - A minimum of a week, need changes from George on collective components
 
- 
Posted a v4.1.0 rc1 to go through mechanisms to ensure we can release. 
- 
Release Engineers: Brian (AWS) Jeff Squyres (Cisco) 
- 
Jeff is reviewing Collective components - Yoseph also reviewing.
 
- 
George found an SM BTL issue at Init on master. Jeff filed Issue 7937 - Summary: Cacheline size is set very late after modex, everything that uses cacheline before modex.
- This is a correctness issue (not optimization) - George on today's call
- At one point we switched to only pulling in topology when we had to, that probably introduced this issue.
- Affects all the way back to v2.x
 
 
- At one point we switched to only pulling in topology when we had to, that probably introduced this issue.
- Backing file for shared memory is allocated by one proc, and used by many. Creates before MODEX (uses 128), but later when users try to use, they use a different 128 cacheline.
- Looking at the code, we do this other places as well, but not as dramatic.
- May need an alternative for master (after Brian finishes configury, to bring hwloc out of frameworks).
- How do we fix this?
- Can we just get the cacheline size before we get the rest of topology information? Brice said no.
- Only solution we can see is creating an opal function to do this.
- Would work, but might be slightly less optimal. If every rank does this in parallel, does this cause issues?
- George can look for it, but can't do it before end of week.
- Probably easiest to look in dstore in PMIx that's pretty clean (mentioned in ticket)
 
 
- Who can do this work?
- Showing itself in CUDA issue.
- Tomislav Janjusic (nVidia) will ask some of his colleges.
 
 
- Showing itself in CUDA issue.
- Because we align some structs based on that, but
- It would be associated with getting the topology (but not retreived until after the modex)
- Only cuda btl calls the function directly, everyone else extracts from PMIx.
- What we ought to do, no harm in getting topology earlier, just need to ensure PMIx is intialized.
- On v4.1, we don't get the topology before someone requests it much later.
- Must also affect v4.0.x
 
 
- George put a fix into master, but making a better change to load it as soon as PMIx is intialized, would be much better.
- Con is that if we're not in a PMIx environment to share this pointer, then every process will go do this discovery, even if they don't need it later.
- Problem is that the process that creates the backing file, creates it very early.
 
 
- Someone should review all the branches to Look to see if we got topology before someone uses the cacheline size.
- George saw it in SM BTL structures. Deadlock.
- This isn't tested by our CI infrastructure.
 
- 
Still want: - George's Collectives
- George is still working on master version of coll
- Next thing he's working on today.
 
- Will probably need to do something to CI to enable these for testing.
- CI not really executing
- IBM will do some testing of this.
- Will need some docs on how users select this.
 
- Tunings for tuned coll
- Nothing to discuss today.
- https://github.com/open-mpi/ompi/pull/7952
 
- AVX
- Went in this morning.
 
- UCX PRs awaiting review.
 
- 
Past: We've come to consensus for a v4.1.0 release - Need include/exclude selection, worried about consistent selection.
- A lot of PRs outstanding, but can't merge until
- Patch for OFI stuff messed up v4.1.x branch.
- Howard has a fix PR, Jeff is looking at it.
 
- Howard changed new OFI BTL parameters to be consistent with MTL
- Not breaking ABI or backwards compatibility.
- v4.1.x branch, branched from v4.0.4 tag.
- NOT touching runtime!!!
- Not going to be pulling in a new PMIx version.
 
- 
All MTT is online on v4.1.x branch 
- 
Not compiling under SLURM EFA test. (OFI BTL issue) 
Review v5.0.0 Milestones v5.0.0
- 
No update this week other than master discussion. 
- 
Need to put OSC pt2pt - OSC RDMA requires a single BTL that can contact every single process.
- This didn't use to be the case. (Comment in the code)
 
 
- 
We can't use the OSC pt2pt. - It is not thread safe. Doesn't conform to MPI4 standard. Not safe.
- This is just a testing fallacy. Could add tests to show this, but we'd still be in the same boat.
- Either product A or B is broken and we need to fix it.
 
- 
RDMA Onesided should fall back to "my atomics" because TCP will never have rdma atomics. - The idea was to put the atomics into the BTL base, which could do all of the one-sided atomics under the covers.
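A rough sketch of the fallback idea, assuming the side that owns the memory applies the operation on behalf of the requester when the transport has no native RDMA atomics. The `emu_*` names and the single global lock are hypothetical; this is not the BTL base implementation.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* One lock serializes all emulated atomics on this process (assumption:
 * simplicity over scalability). */
static pthread_mutex_t emu_lock = PTHREAD_MUTEX_INITIALIZER;

/* Emulated fetch-and-add: returns the value held before the addition. */
static int64_t emu_fetch_add(int64_t *target, int64_t operand)
{
    int64_t old;
    assert(target != NULL);
    pthread_mutex_lock(&emu_lock);
    old = *target;
    *target = old + operand;
    pthread_mutex_unlock(&emu_lock);
    return old;
}

/* Emulated compare-and-swap: stores `value` only if *target == compare,
 * and always returns the value that was found. */
static int64_t emu_compare_swap(int64_t *target, int64_t compare, int64_t value)
{
    int64_t old;
    assert(target != NULL);
    pthread_mutex_lock(&emu_lock);
    old = *target;
    if (old == compare) {
        *target = value;
    }
    pthread_mutex_unlock(&emu_lock);
    return old;
}
```

In a real transport such as TCP, the request would arrive as a message and the owner would run one of these handlers and send the old value back; only the handler logic is shown here.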
 
- 
Jeff will close the PR, and 
- 
Jeff will Nathan will fetching, get, compare and swap. 
- 
Two new PRs for MPI4.0 Error handling - new PRs from Aurelien Bouteiller. 
- 
Does UCX support iWarp? - Does libFabric support iWarp via verbs provider?
- https://github.com/openucx/ucx/issues/2507 suggests it doesn't.
- Brian thinks that libFabric
- OFI can support iWarp, just need to specify the provider in the include list.
- The person asking is a partner, not a customer.
 
- 
PMIX - Working on PMIx v4.0.0 which is what Open MPI v5.0 will use.
- Sessions needs something from PMIx v4
- ULFM - not sure if it needs PMIx, think it needs PRRTE changes.
- PPN scaling issue - simple algorithmic issue in this function
- PMIX talked about it. Artem might know someone who might be interested in working on it.
- Algorithm behind one of the interfaces doesn't scale well.
- Not a regression. Above ~ 4K nodes, becomes quadratic.
 
 
- 
PRRTE - Nothing's happening there.
 
- Mostly discussed above.
- scale-testing, PRs have to opt-into it.
Review Master Master Pull Requests
- Many companies are not allowing a face to face travel until 2021 due to COVID19.
- Instead, let's do a series of virtual face-to-face meetings?
 
- Yes this summer to discuss for v5.0
- Maybe we can do it by topic?
- Maybe not 4 or 8 hour things.
 
- Different topics on different days.
- Do a doodle poll of least-worse days in late July/August.
- August 10th-14th - 3 hour block of time 8-11 Pacific time.
- Jeff will do another doodle for days of the week (vote for 2)
 
- Start a list of topics.
- Sessions is now in.
- Partitioned communication voted in.
- OpalTSDCreate - takes a thread-local storage key that would be tracked locally in OPAL.
- But when we go to delete, it's not being deleted.
- But want flexibility to destroy on our own or explicitly
- George thinks the mode we have today, since tracking all keys to be released by main thread.
- George thinks Artem's approach is the correct approach.
 
- Would have to change the way that keys are USED, and different components are using it in a different way.
- Something similar should be done in different places.
- If you do it just for UCX, then others can see how you did it and check for their code.
- So we think current PR is good, but it leaves old API and new API.
- But it might be better to remove OLD way and make broken components do SOMETHING to update their code.
- Should be easy for components to add explicit cleanup calls
 
- Master branch only.
- Opened a new pull request yesterday that addresses the problem as discussed last week.
- Tracking of TLS in common code.
- There are low-level thread-specific keys (very simple, based on the thread implementation).
- Tracked key, probably what you want to use if you want to ensure all TLS is accounted for and released at destruction of the key.
- Tommy changed all of the places in OMPI where those keys are used. Just use tracked key instead of regular key.
- Changed set_specific and get_specific to just set and get.
- Please review and give suggestions.
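A minimal sketch of the tracked-key idea described above, assuming pthreads underneath: every value stored under the key is recorded so that all of them can be released when the key is destroyed. The `tracked_key_*` names are hypothetical and do not reflect the API in the PR; the sketch also assumes each value is set only once.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

typedef void (*tls_destructor_fn)(void *);

typedef struct tracked_value {
    void *value;
    struct tracked_value *next;
} tracked_value_t;

typedef struct {
    pthread_key_t key;
    tls_destructor_fn destructor;   /* applied to every tracked value */
    pthread_mutex_t lock;           /* protects the `values` list */
    tracked_value_t *values;        /* all values set by any thread */
} tracked_key_t;

static int tracked_key_create(tracked_key_t *tk, tls_destructor_fn destructor)
{
    assert(tk != NULL);
    tk->destructor = destructor;
    tk->values = NULL;
    if (pthread_mutex_init(&tk->lock, NULL) != 0) {
        return -1;
    }
    /* No pthread-level destructor: cleanup happens in tracked_key_destroy. */
    if (pthread_key_create(&tk->key, NULL) != 0) {
        pthread_mutex_destroy(&tk->lock);
        return -1;
    }
    return 0;
}

static int tracked_key_set(tracked_key_t *tk, void *value)
{
    tracked_value_t *node = malloc(sizeof(*node));
    if (node == NULL) {
        return -1;
    }
    if (pthread_setspecific(tk->key, value) != 0) {
        free(node);
        return -1;
    }
    node->value = value;
    pthread_mutex_lock(&tk->lock);
    node->next = tk->values;
    tk->values = node;
    pthread_mutex_unlock(&tk->lock);
    return 0;
}

static void *tracked_key_get(tracked_key_t *tk)
{
    return pthread_getspecific(tk->key);
}

/* Releases every tracked value, then the key. Returns the number released. */
static int tracked_key_destroy(tracked_key_t *tk)
{
    int released = 0;
    pthread_mutex_lock(&tk->lock);
    while (tk->values != NULL) {
        tracked_value_t *node = tk->values;
        tk->values = node->next;
        if (tk->destructor != NULL) {
            tk->destructor(node->value);
        }
        free(node);
        released++;
    }
    pthread_mutex_unlock(&tk->lock);
    pthread_key_delete(tk->key);
    pthread_mutex_destroy(&tk->lock);
    return released;
}
```

The point of the tracking list is that `pthread_key_delete` by itself does not run destructors, so values set by other threads would otherwise leak when the key goes away.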
 
- Does it even make sense to do TLS in OPAL at all?
- May indicate that we have an abstraction wrong somewhere.
- If MPI depends on this in OPAL, then it depends on them in PMIx and other layers?
- Not sure if there is a problem, but at a high level, sounds problematic.
 
- Baking in pthread assumptions in general is not a good idea.
- What this PR does is abstract pthread semantics.
 
- May be some confusion, no problem with porting this API anywhere.
- Issue raised before is that if you're relying on a certain type of thread in MPI layer.
- But we don't, because there's a framework.
- But Application is linked against PMIx and libevent and to use other threading models is dangerous.
- To make this work, you have to make changes to event polling, etc.
 
 
- Not saying we shouldn't take these patches, these make things better.
- But we do have a problem that other thread components aren't going to "just work", because PMIx and libevent, which use pthreads, conflict with other threading models.
- argobots actually uses pthreads, not sure about qthreads.
- Working on a way to configure libevent to make this combo work.
 
 
- Last week:
- George needs some input on PR
- We don't need _atomic_ in most cases, just need volatile
- patch linked to the issue PR7914
- We're not breaking things, we just get a lot of valid complaints from the Intel compiler.
- STDOUT of make is ~16 MB due to all the Intel compiler warnings without this fix
 
 
- There is a PR pending
- Schizo SLURM binding detection - Might not need a solution on v4.0.x
- PRs have gone into v4.0.x and v4.1.x
- Since Open-MPI is a registered non-profit.
- If we log volunteer time we can
- Software in the Public Interest (Parent non-profit)
 
- A week or two
Blockers All Open Blockers
Review v4.0.x Milestones v4.0.5
- Discussing CUDA init in UCX PML PR 7898
- Looks like a bugfix, so should be okay to put into a release branch.
- Is there a better place to initialize the CUDA hooks?
- If we request a BTL or PML to be loaded, if configured with cuda
- CUDA library is loaded by BTL that requires it.
- Some questions about possibly making it more generic for all PMLs that use CUDA.
- Don't want to load CUDA if only using TCP or shared memory.
 
- We'll take this PR once it passes CI and is reviewed.
 
- v4.0.5 schedule: End of July
- Will create RC1 today after PR7898 goes in.
- Two potential drivers for a quick v4.0.5 turn-around.
- OSC RDMA Bug - May drive a v4.0.5 release.
- Program Aborts on detach.
 
 
Review v4.1.x Milestones v4.1.0
- 
Schedule: Want to release end-of-July - A minimum of a week, need changes from George on collective components
 
- 
Posted a v4.1.0 rc1 to go through mechanisms to ensure we can release. 
- 
Release Engineers: Brian (AWS) Jeff Squyres (Cisco) 
- 
Jeff is reviewing Collective components - Yoseph also reviewing.
 
- 
George found an SM BTL issue at Init on master. Jeff filed Issue 7937 - Summary: Cacheline size is set very late after modex, everything that uses cacheline before modex.
- This is a correctness issue (not optimization) - George on today's call
- At one point we switched to only pulling in topology when we had to, that probably introduced this issue.
- Affects all the way back to v2.x
 
 
- George and Jeff will help plan and come to community.
- Done / Submitted.
 
- May not have the Supercomputing conference at ALL this year.
- Many other projects are doing a virtual state-of-the-union type meeting to try to cover what they'd usually do in a Birds of a Feather meeting.
- Then this works pretty well, and do this a couple of times a year.
- Not constrained to Super Computing
- Almost certain that it will be virtual
- Not sure the cost.
- Ralph and Jeff have been doing ABCs of Open MPI - SO many people.  Done 2 of 3 sessions (each went 1.5 hours, lots of questions)
- Slides and YouTube videos are on the website, and a link will be sent to the users list.
- Part 3 is August 5th
 
- Also want an in-depth walkthrough of PMIx initialization / wireup
 
- scale-testing, PRs have to opt-into it.