Jeff Squyres edited this page Jul 12, 2017 · 121 revisions

July 2017 Open MPI Developer's Meeting

Logistics:

9am US Central time July 11 - noon US Central time July 13, 2017

Location:

Cisco, Chicago (pretty much directly next to O'Hare airport, google maps link), 9501 Technology Blvd, West Office Center, Rosemont, Illinois 60018

We are in the "Midway" conference room, which is outside Cisco reception.

Meaning: you don't have to check in with reception or get a badge.

Just take the first hallway off to your left and Midway is clearly marked immediately on the left.

Attendees

There are no registration fees to attend this meeting.

Please add your name to the wiki list below if you are coming to the meeting:

  1. Ralph Castain (Intel)
  2. Jeff Squyres (Cisco)
  3. Brice Goglin (Inria)
  4. Brian Barrett (AWS) [only Tuesday and Wednesday]
  5. Mohan Gandhi (AWS)
  6. Shinji Sumimoto (Fujitsu)
  7. Takahiro Kawashima (Fujitsu)
  8. Nathan Hjelm (LANL)
  9. Howard Pritchard (LANL)
  10. George Bosilca (UTK) [at least partially] (I hope we get the good half)
  11. Edgar Gabriel (UH)
  12. Artem Polyakov (Mellanox)
  13. Matthew Dosanjh (SNL)
  14. Geoff Paulsen (IBM)
  15. Geoffroy Vallee (ORNL)
  16. If you sign up after this point, be sure to let Jeff Squyres know so that he can get you a guest badge and wifi access!

Attending Remotely:

  1. Josh Hursey - IBM (Available from 8:30am-5pm Central) (Added a ☎️ icon next to the items I'd like to call in for, if possible)
  2. David Bernholdt - ORNL (around other commitments)

Topics to discuss

  • UCX packaging in OMPI sources (Mellanox)

    • Want this in OMPI v4.0
    • Configuration prerequisites
      • When we turn it on (check available fabrics; TCP should be available soon, at which point UCX can always be on)
    • How new versions are updated
    • Placement inside the sources: needs to be available for both MPI and SHMEM layers.
    • INITIAL Got some push-back about adding more embedded packages. Will revisit tomorrow.
    • Motivation:
      • We see issues on the mailing list related to bad user experience with OMPI on Mellanox fabrics for both performance and stability.
      • Definitely need UCX for OSHMEM
    • Goal: improve the out-of-the-box (OOB) experience on IB stacks:
      • by auto detecting UCX when available (as is done with SLURM autodetection).
      • by using the internal version when it is not available (for IB networks).
  • ☎️ What is the plan for 4.0 and beyond regarding embedding of:

    • hwloc v2 and v1
      • Easy way to disable hwloc internals such as NVML from OMPI's configure?
      • How to deal with hwloc 2.0 ABI break (2 components?)
    • libevent v2.1 and v2.0
    • pmix 2.0, 3.0, and 1.x
    • One suggestion: should we make the external components higher priority than the embedded components? This might naturally start deprecating / phasing out the embedded versions.
    • (Artem) List of features for 4.0
    • (Artem) What PMIx version is planned.
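The priority suggestion above can be sketched as follows. This is a minimal illustration, not Open MPI's actual MCA selection code: the `Component` class, the component names `embedded` and `external`, and the priority values 50 and 90 are all hypothetical. The point is only that if the external component outranks the embedded one, the embedded copy becomes a fallback that is used when no external installation is found.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str        # hypothetical component name
    priority: int    # higher wins; values here are illustrative only
    available: bool  # e.g., whether an external library was found


def select(components):
    """Return the highest-priority component that is actually available."""
    usable = [c for c in components if c.available]
    if not usable:
        raise RuntimeError("no usable component")
    return max(usable, key=lambda c: c.priority)


def candidates(external_found):
    # The embedded copy is always usable; the external one only if it was found.
    return [
        Component("embedded", priority=50, available=True),
        Component("external", priority=90, available=external_found),
    ]


chosen_with_external = select(candidates(external_found=True)).name
chosen_without_external = select(candidates(external_found=False)).name
```

With these assumed priorities, a system that has an external install never touches the embedded copy, which is what would let the embedded versions be phased out gradually.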
  • Move the entire Open MPI web site behind a CDN?

    • If so, we can remove the mirrors program
  • Investigate shared location for OMPI organization secrets/keys/passwords (e.g., LastPass? 1Password? ...?)

  • ☎️ How to better track PRs across multiple release branches?

    • E.g., ensure it has already been merged to master
    • E.g., ensure that we merge at vX only when it has been merged at all desired versions < vX
    • One possibility: should we always make an issue, and put a tag on it for each version that a given PR is merged against?
    • Can this be automated via bot somehow?
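As a starting point for the bot question, the two ordering rules above can be written as a small check. This is a sketch under assumptions: it presumes release branch names of the form `vMAJOR.MINOR.x`, and the function names `branch_key` and `missing_before` are made up for illustration. How a bot would learn which branches a change is destined for (issue tags, labels, or something else) is left open.

```python
def branch_key(branch):
    """Turn 'v2.1.x' into (2, 1) so that branches sort by version."""
    parts = branch.lstrip("v").split(".")
    return tuple(int(p) for p in parts if p != "x")


def missing_before(target, desired, merged):
    """List the branches a change must reach before it may merge into `target`.

    desired: release branches the change is destined for
    merged:  branches (including 'master') where it has already been merged
    """
    lower = sorted(
        (b for b in desired if branch_key(b) < branch_key(target)),
        key=branch_key,
    )
    required = ["master"] + lower
    return [b for b in required if b not in merged]


desired = {"v2.0.x", "v2.1.x", "v3.0.x"}
merged = {"master", "v2.0.x"}

blockers_for_v3 = missing_before("v3.0.x", desired, merged)
blockers_for_v21 = missing_before("v2.1.x", desired, merged)
```

An empty result means the merge is allowed; a non-empty one is the list a bot could post as a comment on the PR.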
  • ☎️ Proposal for OMPI signed-off-by policy:

    1. Do not grandfather old commits
    2. If you cherry pick someone else's commit, you need to sign off
  • ☎️ Threading model

  • ☎️ Rankfile mapper: Ralph can no longer maintain it. Who will become the maintainer? (IBM volunteered)

  • ☎️ Issue/old PR roundup, esp. for the v2.0.x and v2.1.x releases.

  • ☎️ Signal forwarding

    • Came up on the user list again, this time wanting a way to signal only child procs that call MPI_Init (and not any intermediate procs such as shell scripts)
    • Ralph added an MCA param to either hit only direct children, or all descendants of those children - but not exactly what the user requested
  • ☎️ Multithreaded one-sided: it's buggy. Just fix bugs, or refactor?

  • Strict C99 stuff (e.g., pointer to constant)

    • Per Paul Hargrove's discovery; adapted in PR https://github.com/open-mpi/ompi/pull/3813
    • Note: there's non-C99 elsewhere in OMPI (i.e., if you enable "strict C99", OPAL fails to compile in at least a few places)
    • Do we really care about strict C99?
  • Automate reduction of symbol name pollution?

  • SPI: Any updates / action items?

    • (This is an open question)
  • Other pending PRs that require any discussion...?

    • ...
  • ☎️ CI:

    • What can we do about the fragility of the Jenkins infrastructure?
      • It seems like one or more of the CI systems is broken every week due to lost connections or changed protocols, thereby blocking all commits.
    • Other random CI updates
  • What do we do about Pathscale compiler support?

Thursday

  • ☎️ MAYBE THURSDAY/GEORGE Fujitsu Status

  • ☎️ THURSDAY/GEORGE Plans for v4.0.x (recall: new datatype stuff on master is backwards incompatible with v3.0.x -- https://github.com/open-mpi/ompi/pull/3441)

    • Remove MPI symbols removed in MPI-3.0
      • Can we do this in a way that defaults to a compiler error (showing the exact file / line number of the removed symbol)?
      • Is there any value in providing a non-default way to turn this into a warning, so that customers can make progress without changing their code? Many of these changes are straightforward; is this even worth the effort?
    • Any other binary-incompatible changes we want to make for v4.0.x (ASAP)?
  • ☎️ THURSDAY/GEORGE Remove CR from master before we branch for v4.0.x

  • THURSDAY/SO GEORGE CAN BE HERE: Shall we link components against their native main library - e.g., ORTE components to libopen-rte?

  • ☎️ THURSDAY/GEORGE PMIx working group meetings

    • Network
    • Tiered Storage
    • OpenMP/MPI coordination
    • Language bindings as apps begin using PMIx? (Ralph volunteers to do Fortran!)
  • THURSDAY/GEORGE Old issue about BTL progress functions: https://github.com/open-mpi/ompi/issues/1695

Done

  • ☎️ [George & Nathan] IMB Unidir_Get with Vader issue - https://github.com/open-mpi/ompi/issues/3821

    • RESOLVED:
      • @hjelmn to look at this in the immediate future
      • This is a blocker for v3.0.0
      • May also necessitate a release in v2.0.x and v2.1.x -- need to investigate further
  • ☎️ Should we forward all OMPI_ env vars from mpirun environments to started process environments?

    • If so, should we also for ORTE_ and OPAL_ env vars?
    • Or should we only forward OMPI_MCA_ env vars?
      • NOTE: current master forwards all OMPI_ env vars
    • Should we make a non-OMPI_MCA_ prefix that we also forward, but something less than all of OMPI_? (E.g., OMPI_FORWARD_, or something better)
    • What about non-OMPI MCA params (e.g., PMIX_MCA)?
      • Just envars, or do we add a registration function for cmd line support (e.g., -pmca foo x)?
    • RESOLVED:
      • Yes, we want to forward non-OMPI_MCA env vars.
      • Ralph:
        • Will make a PR that will enable components to register what env vars they want forwarded. At max, we will support a single * for a wildcard (not full regexps) -- e.g. PSM2_* -- for forwarding all names that match.
        • Will probably be something like: a component that wants to register for this stuff will write something to a text file somewhere (e.g., write PSM2_* to a text file somewhere) that ORTE/PMIX/whatever will see later and do the forward. This makes it possible for orterun to forward whatever env vars it needs to, without having to open all their corresponding components (e.g., orterun doesn't know anything about PSM2 components, but can still forward PSM2 env vars).
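The single-wildcard matching described above could look roughly like this. It is only a sketch of the idea, not the planned ORTE/PMIx implementation: the registry format (one pattern per line, `#` for comments) and the function names `matches`, `load_patterns`, and `select_forwarded` are assumptions made for illustration, and the environment values are example data.

```python
def matches(pattern, name):
    """Match `name` against a pattern holding at most one '*' (no regexps)."""
    if pattern.count("*") > 1:
        raise ValueError(f"at most one '*' is supported: {pattern!r}")
    if "*" not in pattern:
        return pattern == name
    prefix, suffix = pattern.split("*")
    # The length check stops prefix and suffix from overlapping in `name`.
    return (len(name) >= len(prefix) + len(suffix)
            and name.startswith(prefix)
            and name.endswith(suffix))


def load_patterns(text):
    """Parse a hypothetical registry: one pattern per line, '#' starts a comment."""
    lines = (line.split("#", 1)[0].strip() for line in text.splitlines())
    return [line for line in lines if line]


def select_forwarded(environ, patterns):
    """Return the env vars whose names match any registered pattern."""
    return {name: value for name, value in environ.items()
            if any(matches(p, name) for p in patterns)}


registry = "# written by components at install time (assumed)\nPSM2_*\nOMPI_MCA_*\n"
env = {
    "PSM2_DEVICES": "self,shm",
    "OMPI_MCA_btl": "tcp",
    "HOME": "/home/user",
}
forwarded = select_forwarded(env, load_patterns(registry))
```

Because the launcher only reads patterns from the registry, it never has to load the component that registered them, which matches the orterun/PSM2 example in the notes.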
  • MPI_File backing file location

  • ☎️ Release branch status:

    • v1.10
    • v2.0.x
    • v2.x (i.e., v2.1.x)
    • v3.0.x
    • RESOLVED:
      • Talked through all of these -- basically the normal content of a Tuesday webex.
  • Release processes / Brian

    • RESOLVED:
      • Coming soon: make nightly and release tarballs exactly the same
      • AUTHORS: we should automate these updates. Brian will work on this.
        • Should we keep the orgs in there? It's somewhat of a pain. And it's also a bit of a relic -- from before we did the "signed off by" stuff.
        • Should we remove it from git and just auto-generate the file during make dist? Yes, this seems like a good idea.
      • NEWS: this is a problem. Want to change this to only top-level / broad-strokes of features. Do not include individual bug fixes -- there will be a line in there saying "Here's the URL where all the Github fixed issues and PRs can be found for this release".
        • Big change: RMs will not assemble NEWS. If a dev wants an item in NEWS, they need to PR it.
      • Commit messages: we need to get better about "Reported by helpful user" in commit messages. If we're not going to cite people in NEWS any more, then we want to make sure to cite them in commit messages.
  • ☎️ Can we get a NEWS decoration to commit messages on branches so that we know what to put in NEWS?

    • RESOLVED:
      • This is now moot, per above.
  • ☎️ Revisit this old discussion: should we continue cherry-picking from master to release branches?

    • The Git Way is usually to merge from master to release branches
      • (Artem) A few comments: my impression is that the Git way is the reverse (https://www.atlassian.com/git/tutorials/comparing-workflows#gitflow-workflow). It assumes the following types of branches:
        • develop (persistent, where all new features go),
        • master (persistent, where all the releases are, each marked with a tag)
        • feature (temporary, branched from develop, merged back: for temporary work on a new feature)
        • hotfix (temporary, branched from master, merged back: to fix post-release bugs). (!) Once the hotfix is merged to master, master is merged back to develop, not vice versa, to keep develop consistent with master.
        • release (temporary, branched from develop, merged to master: to harden before the next release)
      • Currently: a) our master = develop; b) we don't have a master equivalent; c) we keep release branches, which forces us to cherry-pick, and we sometimes have problems with lost commits.
      • This is not to say that we should follow this. One disadvantage I already see: it is not easy to support old releases, because release branches are eliminated once they have stabilized. Just something to keep in mind.
    • This puts more emphasis on master to be more stable. But maybe with all of our new CI, master is more stable these days...?
    • There are pros, cons, and differences: e.g., things wouldn't go on master unless we intend to merge them to release branches.
    • RESOLVED:
      • Main proposal from Brian:
        • Shorten time between branch and release, merge from master->release branch during that time (instead of cherry pick), and then cherry pick after release.
        • There is some discussion still needed about exactly when we want to stop merging and start cherry picking, because what about new features that come to master that aren't destined for that release?
        • Brian will be posting a proposal about this
  • ☎️ CI:

    • Release process updates
      • Where should Open MPI downloads be:
        • OMPI web site (probably not)
        • S3
        • Github
    • RESOLVED:
      • Leave the plans in place for all downloads going to S3 (not Github)
  • ☎️ We now have options for merging PRs:

    • Continue the way we do now (merge at current head)
    • Rebase and merge (i.e., much more of linear history)
    • Rebase and squash
    • RESOLVED:
      • On master: ...
        • Brian thinks rebase and merge is good
        • Howard thinks merge @HEAD is good (i.e., what we do today)
      • On release branches: continue merging @HEAD
  • (Artem) UCX/OSC component status update (ready for PR)

    • RESOLVED:
      • Seems like a no-brainer: a vendor wants to commit a component that supports their hardware. Go for it.
      • This will bring up the network selection discussions again, though. We'll need to figure those out.
  • Howard's bug scrub / issue roundup

    • RESOLVED:
      • move some issues out of 2.0.4 milestone
      • update README to reflect that PGI/OS-X and aarch64 are not supported
  • What to do about the Pathscale compiler support?

    • RESOLVED:
      • Jeff to file a PR stating that Pathscale is no longer supported after OMPI v3.0.x.