Skip to content

Conversation

@sam6258
Copy link
Contributor

@sam6258 sam6258 commented Jan 24, 2018

No description provided.

@jjhursey jjhursey self-requested a review January 24, 2018 17:33
@jjhursey jjhursey self-assigned this Jan 24, 2018
@jjhursey jjhursey added the bug label Jan 24, 2018
@jjhursey jjhursey added this to the v3.0.2 milestone Jan 24, 2018
Copy link
Member

@jjhursey jjhursey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My only concern with this PR is of my own making. The tryreverse mechanism in the nidmap is only activated when we detect the multiple digit set case. I would like to keep that mechanism in this bug fix as it both fixes the bug and provides efficient packing for that case.
I'm interested in what @ggouaillardet thinks on that part of this PR.

@ggouaillardet
Copy link
Contributor

let's KISS :-)

if I understand correctly, we search the digits forward, and if we detect a second set, we go reverse.
am I correct so far ?

if yes, then since

  • forward and reverse should lead to the same regex when there is one set of digits
  • you think reverse is generally a better fit than forward when there are multiple set of digits
    why don't we always try the reverse method, and have less code to maintain ?

iirc, @rhc54 suggested backporting the new regx framework into the v3.0.x branch might be acceptable.
if approved by the release manager, that would be even easier to maintain.

if not, and given the fact that a regex is expanded the same way regardless digits were searched forward or reverse,
an other option would be to generate both regex (we can do that only if a second set of digits is found) and keep the shorter one.

Copy link
Contributor

@ggouaillardet ggouaillardet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • reference to commit ids from the master branch are missing (git cherry-pick -x can do that automatically)
  • i do not think these should be signed-off-by once again (unless backported)

ggouaillardet and others added 2 commits January 25, 2018 09:21
'-' is not an alpha character nor a digit, but it is a valid hostname
character and should be handled as an alpha character, otherwise, nodes
such as node-001 do not get "compressed" in the regex.

Refs open-mpi#4621

Signed-off-by: Gilles Gouaillardet <[email protected]>
(cherry picked from commit f3e2a31)
If we detect that someone has given us an incorrect node name, provide a helpful message telling them as it is almost certainly a typo.

Signed-off-by: Ralph Castain <[email protected]>
(cherry picked from commit 3269f2d)
@jjhursey
Copy link
Member

@ggouaillardet I thought we couldn't backport the regx framework to the v3.0.x series since it's already had one official release. That's why I asked Scott prepare this special version. However, if we can, then I agree that would be the better option. @hppritcha @bwbarrett what do you think?

The reverse method should handle all of the cases we need it to (obviously we will want to test this out to confirm). But @rhc54 's concern was that it's hard to prove that we won't break other, currently support hostname formats. That's why you all went down the road of creating a framework on master instead of just adopting the reverse method.

For simple hostnames (e.g., node123) the reverse and forward method will emit the same prefix and regex.
For more complex hostnames (e.g., a4r12n12) the forward method will produce a different regex than reverse (prefix [a] for forward, a4r12n for reverse). However, the reverse should be more compact if the last set of digits change more than the others (e.g., a4r12n12, a4r12n13, a4r12n14) where as the forward method won't be able to compress that example at all (as you noted here).

For the v3.0.x series we want the bug to be fixed (so we can launch >64 nodes with rsh launcher), and have an option for the user to get a better compression of the more complex hostname format (for master we recommend that users set -mca regx reverse). I think this PR achieves our goal while still trying to not break the simple hostname format processing - just in case.

@ggouaillardet
Copy link
Contributor

@jjhursey

here are the words of @rhc54 i misinterpreted

There is no problem moving a new framework into v3.1, and I would be leery of moving even this PR into the v3.0 series.

At first, let's agree we should fix the bug in the v3.0 series
(iirc, it impacts jobs with 2 nodes or more)

To me, the rationale for a regex framework is there is not a single component that fits all.
On IBM clusters, the reverse component is likely the best fit, and i can easily name my nodes in such a way that fwd would be the best fit.

I do not have a strong opinion on how to move forward for the v3.0 branch

  1. only the (fixed) fwd algorithm (pro: this is the default component and hence likely to be the most tested; cons: suboptimal on IBM cluster, SpectrumMPI will likely diverge)
  2. only the reverse algorithm (pro: only one algo to maintain and best for IBM; cons: likely less tested)
  3. this PR (pro: common case is the tested fwd algo, optimal on IBM clusters; cons: more code to maintain)
  4. try both algo and choose the shortest regex (pros: optimal, quite easy to maintain (a bunch of copy/paste from master), optimal on all clusters; cons: a bit more code to maintain strictly speaking)

As far as i am concerned, i like 4) best, and feels less comfortable with 3). But once again, this is not a strong opinion.
More important, that should not prevent us from fixing the bug asap (e.g. merge 3 commits out of 4) and give us some time to think the best move.

so any tree spawn operation properly gets the number of children underneath us.

This commit is a tiny subset of open-mpi/ompi@347ca41
that should have been back-ported into the v3.0.x branch

Fixes open-mpi#4578

Thanks to Carlos Eduardo de Andrade for reporting.

Signed-off-by: Gilles Gouaillardet <[email protected]>
(cherry picked from commit e445d47)
@rhc54
Copy link
Contributor

rhc54 commented Jan 26, 2018

Yeah, this is a tough one. I remain concerned about the change to the nidmap algorithm - as we pointed out on the original PR, it is easy to come up with patterns that would have worked for the "forward" approach but don't work for the "reverse" one. This makes me nervous that some portion of the community will see new problems emerge.

I therefore agree that it would be better to separate out ced40df from the other three commits as they are clear bug fixes and involve no debate.

Going with the regx framework has its own issues. I fear it can get confusing to users if 3.1 requires an MCA param to resolve the nidmap problem, but 3.0 doesn't - it breaks the compatibility promise we made. Likewise, introducing a change in 3.0.1 does the same thing, but could be considered a "bug fix" requirement.

I think a discussion with the RMs and general community is therefore required for the other commit, and probably for the commit of the new regx framework into v3.1 in #4747 as well due to the compatibility promise question.

@jjhursey
Copy link
Member

If we take out ced40df then we will need a custom patch to fix the original bug - hostnames with multiple sets of digits produce an invalid regex pattern preventing tree based launching with the rsh launcher. The other commits here do not fix that specific bug.

In master the bug was fixed at the same time the regx framework was created, so we cannot cherry-pick it. So we will need to port it over. Not a problem.

I suggest that we remove ced40df from this PR, can then port over the minimal fix from the regx/forward component so that the v3.0.x series will work on machines with valid hostnames containing multiple sets of digits.

This means that it produces a correct, but suboptimal pattern for hostnames with multiple sets of digits. We can do that backport.

As an aside:
@ggouaillardet For master it might be interesting add a new component to the regx framework that implements what you have described in (4). That seems like an interesting approach.

@ggouaillardet
Copy link
Contributor

@jjhursey let's do that !
so 31ae3ea has to be back ported.
at first glance, the code change is the same, and only the filename differ.

For master, I can simply create yet an other component with is the best of fwd and reverse.
That is possible since both components interpret the regex the same way.

Or I can do that a la routed

  • when creating the regex
    • try all the components, and choose the shortest regex
    • prepend the regex with the component name
  • when interpreting the regex
    • open the component that was used to create the regex
    • interpret the regex with that component

The latter approach is more generic, but i'd rather do it asap so it lands the v3.1.0 release
(otherwise we won't be able to use mpirun and libopen-rte from slightly different versions,
and iirc, we explained at SC'17 that we support that as long as PMIx does)

@rhc54 any advice ?

@rhc54
Copy link
Contributor

rhc54 commented Feb 1, 2018

What may save us here is that ORTE uses the PMIx regex function to generate the regex given to the application procs. This means that the clients don't see the regex built by mpirun - only other daemons do, and we already require that they be of the same version. So we should be okay from that perspective regardless of the option you use since the daemons are not allowed to mix libopen-rte versions, and the ORTE-generated regex doesn't crossover between daemons and application procs.

However, since the PMIx regex generator is the same as the ORTE "forward" algorithm, it means that this regex will also get the "wrong" answer in the IBM case...which means that clients are going to compute incorrect host locations for their peers - not good. Guess that means we need to port the regx framework to PMIx as well, and it raises questions about the embedded PMIx code in the 3.0 and 3.1 releases.

@jjhursey
Copy link
Member

jjhursey commented Feb 1, 2018

@sam6258 Can you remove commit ced40df and backport 31ae3ea (then retest of course)?

@rhc54 Yeah we need to fix the PMIx regex as well. We at least need to fix the bug in the forward approach, and provide an opportunity for an optimized implementation (not saying reverse is optimal, just better than forward for these types of hostnames). The framework technique allows for optimization opportunities.

@rhc54
Copy link
Contributor

rhc54 commented Feb 1, 2018

FWIW: PMIx already has a regx framework, so all that should be required there is to create a suitable component.

ggouaillardet added a commit to ggouaillardet/pmix that referenced this pull request Feb 1, 2018
@ggouaillardet
Copy link
Contributor

@rhc54 can you please help me correctly interpret slide 74 (Cross-version mpirun interoperability) that was presented at SC'17 Open MPI BoF ?

  • Host OS is running mpirun from Open MPI v3.1.x
  • containers are running Open MPI v2.1.x

which orted version is running (and hence installed) on the containers ?

  1. the very same version than mpirun (e.g. 3.1.x) ?
  2. the one and only version that is installed and used by the application (e.g. v2.1.x) ?
  3. something else (e.g. more relaxed rules) ?

I never really thought about it, and my first impression was the second option.
Now I have some second thoughts ...

@rhc54
Copy link
Contributor

rhc54 commented Feb 1, 2018

I think you'll find this discussed a bit here. Bottom line: all three scenarios exist in the wild and need to be supported. The guarantee given by OMPI is that any combination of v3.x will work.

So the point here is that we have a problem - changing the nidmap mid-series means that containers (and static builds) using older v3.0 releases will fail when presented with a regex generated by the "reverse" (or your modified solution) algorithm. This breaks our compatibility promise.

However, in the case of the framework approach, one could also argue that the only time the "reverse" component is activated is when the user explicitly requests it. In such cases, one would expect they have done so because they know the default mechanism will generate an error. Thus, we could argue that compatibility is preserved as the containerized older version of OMPI wouldn't have worked anyway on that system.

I'm not sure what would happen if a regex from your alternative approach is given to an older version of nidmap for processing. I suspect it will fail - yes? If so, I don't believe we can use it in the v3.x release series.

@ggouaillardet
Copy link
Contributor

Thanks for the explanation !

currently, the regx frameworks have callbacks to create and parse the regex.
In the case of fwd vs reverse, they use the same callbacks, except for creating the regex, so I do not think there is any issue here.

Generally speaking, then yes, I agree there is a problem.
I was thinking of prefixing the regex with the name of the module that was used to create it, but that would break compatibility.

A quite different approach is to have only one callback in the regx framework to create the regex
(e.g. no matter how it was created, there is only way to parse it, so any version can do it).
That would limit the options to achieve the shortest regex, but ensure compatibility between versions.

An intermediate approach would be to add a compatibility option that would disqualify regx components that have a different method of parsing the regex.

makes sense ?

@rhc54
Copy link
Contributor

rhc54 commented Feb 1, 2018

In PMIx, we do indeed prefix the regx with the name of the module that generated it, as you say, and we should do that going forward in OMPI. Since the regx framework isn't in prior versions, and there is no compatibility guarantee between major releases of OMPI, there is no problem in going that route.

The issue really is just with the OMPI v3.1 vs v3.0 release series. I don't believe we can make any changes to the regex code in those two without violating the promise.

@sam6258
Copy link
Contributor Author

sam6258 commented Feb 1, 2018

I removed the commit that created the reverse method ced40df and backported the simple fix to the forward method from commit a056fde

Copy link
Contributor

@rhc54 rhc54 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be okay, and won't affect backward compatibility

jjhursey pushed a commit to jjhursey/openpmix that referenced this pull request Feb 1, 2018
Refs. open-mpi/ompi#4689
Refs. open-mpi/ompi#4748

Signed-off-by: Gilles Gouaillardet <[email protected]>
(cherry picked from commit 98073d7)
Signed-off-by: Joshua Hursey <[email protected]>
@ggouaillardet
Copy link
Contributor

@rhc54 back to the previously mentionned SC'17 slide, do we plan to support mpirun from v3.1.x interoperating with orted v2.x ? if yes, then there is some work to do since I already found two issues with oob/tcp. If not, do we even plan to support mpirun and orted from v3.1.x working with MPI app from v2.x ?

regx/fwd and regx/reverse are currently interoperable in a sense that only the nidmap_create callback differs (e.g. all other callbacks points to the same functions in regx/base) so i would stop worrying about it.

we cannot prefix the regex in the middle of the series by default (otherwise we would break interoperability). That being said, we can have future components prefix it, and it will be up to the user
to blacklist them when interoperability requires it.

@ggouaillardet
Copy link
Contributor

@sam6258 the commit you backported is still titled regx/fwd. Feel free to change that since there is no such framework/component in the v3.0.x branch. That is not a blocker so I will approve this PR too.

@rhc54
Copy link
Contributor

rhc54 commented Feb 2, 2018

We need to separate out the PMIx promise vs that of OMPI itself. I don't recall the precise slide, but the official OMPI promise is what was given on the wiki page - we only support "cross-version" support within a major release series (e.g., OMPI v3.1 with OMPI v3.0). We specifically exclude operations across major release series (e.g., between OMPI v3.x and OMPI v2.x).

PMIx goes further as we have to support cross-subsystem integration - but that is purely a PMIx promise and doesn't impose anything on OMPI.

@ggouaillardet
Copy link
Contributor

ggouaillardet commented Feb 2, 2018

That is the infamous slide. From your previous reply, I now understand it was incorrect.
(e.g. we do not support mpirun from v3.1.x with orted or libmpi from v2.x, regardless these series use interoperable PMIx)

mpirun

@rhc54
Copy link
Contributor

rhc54 commented Feb 2, 2018

Yeah, it's a little complicated. Even though PMIx might interoperate, the problem is that there is no guarantee that the information being communicated is the same. IIRC, what we say is that we guarantee interoperation if the major is the same, and it might work if the major is different - but we cannot guarantee it.

I could be confused myself, so maybe @bwbarrett or @jsquyres should chime in. All I can say for certain is that the PMIx communication channel will work in that slide - I just can't guarantee the contents of that communication will meet all expectations/requirements.

@ggouaillardet
Copy link
Contributor

yep, we need to define what we want to achieve and where this should be discussed
(new issue ? wiki ? mailing list ?)

Starting at v3.0.0, oob/tcp expect the exact same version (e.g. major + minor + release + greek) in mca_oob_tcp_peer_recv_connect_ack() https://github.com/open-mpi/ompi/blob/master/orte/mca/oob/tcp/oob_tcp_connection.c#L889-L902

3.0.0 is already out and imho, we should do that before 3.0.1 and 3.1.0 are released.

master has version 4.0.0.a1, should it interoperate with v3 releases ? if yes, should its version be 3.999.0a1 or something else with major 3 ?

last but not least, i think we should have MTT and/or CI test for interoperability, since it is so easy to break.

@rhc54
Copy link
Contributor

rhc54 commented Feb 2, 2018

Up to the community as to where to discuss it - we had a long discussion about this not so long ago, and I thought it had been captured (but I don't recall exactly where).

As for the oob: only daemon-to-daemon communication uses the OOB, and we require the daemons to all be the same OMPI version. So I'm not sure why you feel the OOB is an issue? I do see that the apps open the oob/rml frameworks in master - IIRC, that was left solely so that @anandhis could test her rml/ofi code. Probably should be removed now.

@ggouaillardet
Copy link
Contributor

OK, let me see if I finally get it straight ...

in https://github.com/open-mpi/ompi/wiki/Container-Versioning

  • Scenario 2: Containers for compute nodes

In this scenario, orte and libmpi are both in the same container, so are likely to be from the same release / build. However, mpirun is run from outside the container and may be (likely is?) from a different release.

  • Version Testing
  1. MPI application built against revision B, mpirun from version A, orte and libmpi from revision B.

what does orte exactly means here ?

  • libopen-rte used by the app only ? (which implies mpirun and orted have the exact same version, and there are hence up to two Open MPI installations in the container, one for orted and its dependencies, and an other for the MPI app and its dependencies)
  • libopen-rte and orted ? (which implies mpirun and orted might have different versions)

@rhc54
Copy link
Contributor

rhc54 commented Feb 2, 2018

You probably need to move this off of this PR - the discussion is way beyond this poor old PR.

I honestly think you are over-thinking this. The promise was solely about the compatibility of the app to the launch infrastructure (mpirun + daemons). If you put an orted inside the container and try to use it with an mpirun outside the container, then we only guarantee compatibility if the two are within the same OMPI major release series. Period. And yes - that means the executable and any dynamically linked libraries - we don't distinguish between them.

IMO, anybody invoking more than one OMPI version in a container is doing something very, very wrong.

Refs. open-mpi#4689

Signed-off-by: Gilles Gouaillardet <[email protected]>
(back-ported from commit a056fde)
Signed-off-by: Scott Miller <[email protected]>
@jjhursey
Copy link
Member

jjhursey commented Feb 2, 2018

Yeah this PR is ready to go now.

If there is a question of if we should move the regx framework into the v3.1.x series then we should move the discussion to PR #4747. I would like to get PR #4747 merged soon (or some version with the bug fix in it) so I can do some scale testing with the Open MPI v3.1.x HEAD.

@jjhursey
Copy link
Member

jjhursey commented Feb 2, 2018

@bwbarrett @hppritcha This PR is good to go.

@bwbarrett bwbarrett merged commit 72baf57 into open-mpi:v3.0.x Feb 12, 2018
@sam6258 sam6258 deleted the regx_patch_3.0 branch March 22, 2018 19:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants