nidmap reverse prefix patch #4748

sam6258 · 2018-01-24T17:29:44Z

No description provided.

jjhursey

My only concern with this PR is of my own making. The tryreverse mechanism in the nidmap is only activated when we detect the multiple digit set case. I would like to keep that mechanism in this bug fix as it both fixes the bug and provides efficient packing for that case.
I'm interested in what @ggouaillardet thinks on that part of this PR.

ggouaillardet · 2018-01-25T00:28:42Z

let's KISS :-)

if I understand correctly, we search the digits forward, and if we detect a second set, we go reverse.
am I correct so far ?

if yes, then since

forward and reverse should lead to the same regex when there is one set of digits
you think reverse is generally a better fit than forward when there are multiple set of digits
why don't we always try the reverse method, and have less code to maintain ?

iirc, @rhc54 suggested backporting the new regx framework into the v3.0.x branch might be acceptable.
if approved by the release manager, that would be even easier to maintain.

if not, and given the fact that a regex is expanded the same way regardless digits were searched forward or reverse,
an other option would be to generate both regex (we can do that only if a second set of digits is found) and keep the shorter one.

ggouaillardet

reference to commit ids from the master branch are missing (git cherry-pick -x can do that automatically)
i do not think these should be signed-off-by once again (unless backported)

'-' is not an alpha character nor a digit, but it is a valid hostname character and should be handled as an alpha character, otherwise, nodes such as node-001 do not get "compressed" in the regex. Refs open-mpi#4621 Signed-off-by: Gilles Gouaillardet <[email protected]> (cherry picked from commit f3e2a31)

If we detect that someone has given us an incorrect node name, provide a helpful message telling them as it is almost certainly a typo. Signed-off-by: Ralph Castain <[email protected]> (cherry picked from commit 3269f2d)

jjhursey · 2018-01-25T17:30:45Z

@ggouaillardet I thought we couldn't backport the regx framework to the v3.0.x series since it's already had one official release. That's why I asked Scott prepare this special version. However, if we can, then I agree that would be the better option. @hppritcha @bwbarrett what do you think?

The reverse method should handle all of the cases we need it to (obviously we will want to test this out to confirm). But @rhc54 's concern was that it's hard to prove that we won't break other, currently support hostname formats. That's why you all went down the road of creating a framework on master instead of just adopting the reverse method.

For simple hostnames (e.g., node123) the reverse and forward method will emit the same prefix and regex.
For more complex hostnames (e.g., a4r12n12) the forward method will produce a different regex than reverse (prefix [a] for forward, a4r12n for reverse). However, the reverse should be more compact if the last set of digits change more than the others (e.g., a4r12n12, a4r12n13, a4r12n14) where as the forward method won't be able to compress that example at all (as you noted here).

For the v3.0.x series we want the bug to be fixed (so we can launch >64 nodes with rsh launcher), and have an option for the user to get a better compression of the more complex hostname format (for master we recommend that users set -mca regx reverse). I think this PR achieves our goal while still trying to not break the simple hostname format processing - just in case.

ggouaillardet · 2018-01-26T07:08:23Z

@jjhursey

here are the words of @rhc54 i misinterpreted

There is no problem moving a new framework into v3.1, and I would be leery of moving even this PR into the v3.0 series.

At first, let's agree we should fix the bug in the v3.0 series
(iirc, it impacts jobs with 2 nodes or more)

To me, the rationale for a regex framework is there is not a single component that fits all.
On IBM clusters, the reverse component is likely the best fit, and i can easily name my nodes in such a way that fwd would be the best fit.

I do not have a strong opinion on how to move forward for the v3.0 branch

only the (fixed) fwd algorithm (pro: this is the default component and hence likely to be the most tested; cons: suboptimal on IBM cluster, SpectrumMPI will likely diverge)
only the reverse algorithm (pro: only one algo to maintain and best for IBM; cons: likely less tested)
this PR (pro: common case is the tested fwd algo, optimal on IBM clusters; cons: more code to maintain)
try both algo and choose the shortest regex (pros: optimal, quite easy to maintain (a bunch of copy/paste from master), optimal on all clusters; cons: a bit more code to maintain strictly speaking)

As far as i am concerned, i like 4) best, and feels less comfortable with 3). But once again, this is not a strong opinion.
More important, that should not prevent us from fixing the bug asap (e.g. merge 3 commits out of 4) and give us some time to think the best move.

so any tree spawn operation properly gets the number of children underneath us. This commit is a tiny subset of open-mpi/ompi@347ca41 that should have been back-ported into the v3.0.x branch Fixes open-mpi#4578 Thanks to Carlos Eduardo de Andrade for reporting. Signed-off-by: Gilles Gouaillardet <[email protected]> (cherry picked from commit e445d47)

rhc54 · 2018-01-26T14:14:21Z

Yeah, this is a tough one. I remain concerned about the change to the nidmap algorithm - as we pointed out on the original PR, it is easy to come up with patterns that would have worked for the "forward" approach but don't work for the "reverse" one. This makes me nervous that some portion of the community will see new problems emerge.

I therefore agree that it would be better to separate out ced40df from the other three commits as they are clear bug fixes and involve no debate.

Going with the regx framework has its own issues. I fear it can get confusing to users if 3.1 requires an MCA param to resolve the nidmap problem, but 3.0 doesn't - it breaks the compatibility promise we made. Likewise, introducing a change in 3.0.1 does the same thing, but could be considered a "bug fix" requirement.

I think a discussion with the RMs and general community is therefore required for the other commit, and probably for the commit of the new regx framework into v3.1 in #4747 as well due to the compatibility promise question.

jjhursey · 2018-01-31T19:42:48Z

If we take out ced40df then we will need a custom patch to fix the original bug - hostnames with multiple sets of digits produce an invalid regex pattern preventing tree based launching with the rsh launcher. The other commits here do not fix that specific bug.

In master the bug was fixed at the same time the regx framework was created, so we cannot cherry-pick it. So we will need to port it over. Not a problem.

I suggest that we remove ced40df from this PR, can then port over the minimal fix from the regx/forward component so that the v3.0.x series will work on machines with valid hostnames containing multiple sets of digits.

This means that it produces a correct, but suboptimal pattern for hostnames with multiple sets of digits. We can do that backport.

As an aside:
@ggouaillardet For master it might be interesting add a new component to the regx framework that implements what you have described in (4). That seems like an interesting approach.

ggouaillardet · 2018-02-01T01:13:18Z

@jjhursey let's do that !
so 31ae3ea has to be back ported.
at first glance, the code change is the same, and only the filename differ.

For master, I can simply create yet an other component with is the best of fwd and reverse.
That is possible since both components interpret the regex the same way.

Or I can do that a la routed

when creating the regex
- try all the components, and choose the shortest regex
- prepend the regex with the component name
when interpreting the regex
- open the component that was used to create the regex
- interpret the regex with that component

The latter approach is more generic, but i'd rather do it asap so it lands the v3.1.0 release
(otherwise we won't be able to use mpirun and libopen-rte from slightly different versions,
and iirc, we explained at SC'17 that we support that as long as PMIx does)

@rhc54 any advice ?

rhc54 · 2018-02-01T02:37:04Z

What may save us here is that ORTE uses the PMIx regex function to generate the regex given to the application procs. This means that the clients don't see the regex built by mpirun - only other daemons do, and we already require that they be of the same version. So we should be okay from that perspective regardless of the option you use since the daemons are not allowed to mix libopen-rte versions, and the ORTE-generated regex doesn't crossover between daemons and application procs.

However, since the PMIx regex generator is the same as the ORTE "forward" algorithm, it means that this regex will also get the "wrong" answer in the IBM case...which means that clients are going to compute incorrect host locations for their peers - not good. Guess that means we need to port the regx framework to PMIx as well, and it raises questions about the embedded PMIx code in the 3.0 and 3.1 releases.

jjhursey · 2018-02-01T03:15:32Z

@sam6258 Can you remove commit ced40df and backport 31ae3ea (then retest of course)?

@rhc54 Yeah we need to fix the PMIx regex as well. We at least need to fix the bug in the forward approach, and provide an opportunity for an optimized implementation (not saying reverse is optimal, just better than forward for these types of hostnames). The framework technique allows for optimization opportunities.

rhc54 · 2018-02-01T05:45:41Z

FWIW: PMIx already has a regx framework, so all that should be required there is to create a suitable component.

Refs. open-mpi/ompi#4689 Refs. open-mpi/ompi#4748 Signed-off-by: Gilles Gouaillardet <[email protected]>

ggouaillardet · 2018-02-01T13:31:29Z

@rhc54 can you please help me correctly interpret slide 74 (Cross-version mpirun interoperability) that was presented at SC'17 Open MPI BoF ?

Host OS is running mpirun from Open MPI v3.1.x
containers are running Open MPI v2.1.x

which orted version is running (and hence installed) on the containers ?

the very same version than mpirun (e.g. 3.1.x) ?
the one and only version that is installed and used by the application (e.g. v2.1.x) ?
something else (e.g. more relaxed rules) ?

I never really thought about it, and my first impression was the second option.
Now I have some second thoughts ...

rhc54 · 2018-02-01T14:07:02Z

I think you'll find this discussed a bit here. Bottom line: all three scenarios exist in the wild and need to be supported. The guarantee given by OMPI is that any combination of v3.x will work.

So the point here is that we have a problem - changing the nidmap mid-series means that containers (and static builds) using older v3.0 releases will fail when presented with a regex generated by the "reverse" (or your modified solution) algorithm. This breaks our compatibility promise.

However, in the case of the framework approach, one could also argue that the only time the "reverse" component is activated is when the user explicitly requests it. In such cases, one would expect they have done so because they know the default mechanism will generate an error. Thus, we could argue that compatibility is preserved as the containerized older version of OMPI wouldn't have worked anyway on that system.

I'm not sure what would happen if a regex from your alternative approach is given to an older version of nidmap for processing. I suspect it will fail - yes? If so, I don't believe we can use it in the v3.x release series.

ggouaillardet · 2018-02-01T14:22:43Z

Thanks for the explanation !

currently, the regx frameworks have callbacks to create and parse the regex.
In the case of fwd vs reverse, they use the same callbacks, except for creating the regex, so I do not think there is any issue here.

Generally speaking, then yes, I agree there is a problem.
I was thinking of prefixing the regex with the name of the module that was used to create it, but that would break compatibility.

A quite different approach is to have only one callback in the regx framework to create the regex
(e.g. no matter how it was created, there is only way to parse it, so any version can do it).
That would limit the options to achieve the shortest regex, but ensure compatibility between versions.

An intermediate approach would be to add a compatibility option that would disqualify regx components that have a different method of parsing the regex.

makes sense ?

rhc54 · 2018-02-01T14:31:03Z

In PMIx, we do indeed prefix the regx with the name of the module that generated it, as you say, and we should do that going forward in OMPI. Since the regx framework isn't in prior versions, and there is no compatibility guarantee between major releases of OMPI, there is no problem in going that route.

The issue really is just with the OMPI v3.1 vs v3.0 release series. I don't believe we can make any changes to the regex code in those two without violating the promise.

sam6258 · 2018-02-01T15:18:12Z

I removed the commit that created the reverse method ced40df and backported the simple fix to the forward method from commit a056fde

rhc54

Should be okay, and won't affect backward compatibility

Refs. open-mpi/ompi#4689 Refs. open-mpi/ompi#4748 Signed-off-by: Gilles Gouaillardet <[email protected]> (cherry picked from commit 98073d7) Signed-off-by: Joshua Hursey <[email protected]>

ggouaillardet · 2018-02-02T02:50:34Z

@rhc54 back to the previously mentionned SC'17 slide, do we plan to support mpirun from v3.1.x interoperating with orted v2.x ? if yes, then there is some work to do since I already found two issues with oob/tcp. If not, do we even plan to support mpirun and orted from v3.1.x working with MPI app from v2.x ?

regx/fwd and regx/reverse are currently interoperable in a sense that only the nidmap_create callback differs (e.g. all other callbacks points to the same functions in regx/base) so i would stop worrying about it.

we cannot prefix the regex in the middle of the series by default (otherwise we would break interoperability). That being said, we can have future components prefix it, and it will be up to the user
to blacklist them when interoperability requires it.

ggouaillardet · 2018-02-02T02:55:06Z

@sam6258 the commit you backported is still titled regx/fwd. Feel free to change that since there is no such framework/component in the v3.0.x branch. That is not a blocker so I will approve this PR too.

rhc54 · 2018-02-02T03:10:41Z

We need to separate out the PMIx promise vs that of OMPI itself. I don't recall the precise slide, but the official OMPI promise is what was given on the wiki page - we only support "cross-version" support within a major release series (e.g., OMPI v3.1 with OMPI v3.0). We specifically exclude operations across major release series (e.g., between OMPI v3.x and OMPI v2.x).

PMIx goes further as we have to support cross-subsystem integration - but that is purely a PMIx promise and doesn't impose anything on OMPI.

ggouaillardet · 2018-02-02T03:39:45Z

That is the infamous slide. From your previous reply, I now understand it was incorrect.
(e.g. we do not support mpirun from v3.1.x with orted or libmpi from v2.x, regardless these series use interoperable PMIx)

rhc54 · 2018-02-02T04:22:01Z

Yeah, it's a little complicated. Even though PMIx might interoperate, the problem is that there is no guarantee that the information being communicated is the same. IIRC, what we say is that we guarantee interoperation if the major is the same, and it might work if the major is different - but we cannot guarantee it.

I could be confused myself, so maybe @bwbarrett or @jsquyres should chime in. All I can say for certain is that the PMIx communication channel will work in that slide - I just can't guarantee the contents of that communication will meet all expectations/requirements.

ggouaillardet · 2018-02-02T05:19:27Z

yep, we need to define what we want to achieve and where this should be discussed
(new issue ? wiki ? mailing list ?)

Starting at v3.0.0, oob/tcp expect the exact same version (e.g. major + minor + release + greek) in mca_oob_tcp_peer_recv_connect_ack() https://github.com/open-mpi/ompi/blob/master/orte/mca/oob/tcp/oob_tcp_connection.c#L889-L902

3.0.0 is already out and imho, we should do that before 3.0.1 and 3.1.0 are released.

master has version 4.0.0.a1, should it interoperate with v3 releases ? if yes, should its version be 3.999.0a1 or something else with major 3 ?

last but not least, i think we should have MTT and/or CI test for interoperability, since it is so easy to break.

rhc54 · 2018-02-02T09:21:55Z

Up to the community as to where to discuss it - we had a long discussion about this not so long ago, and I thought it had been captured (but I don't recall exactly where).

As for the oob: only daemon-to-daemon communication uses the OOB, and we require the daemons to all be the same OMPI version. So I'm not sure why you feel the OOB is an issue? I do see that the apps open the oob/rml frameworks in master - IIRC, that was left solely so that @anandhis could test her rml/ofi code. Probably should be removed now.

ggouaillardet · 2018-02-02T09:40:47Z

OK, let me see if I finally get it straight ...

in https://github.com/open-mpi/ompi/wiki/Container-Versioning

Scenario 2: Containers for compute nodes

In this scenario, orte and libmpi are both in the same container, so are likely to be from the same release / build. However, mpirun is run from outside the container and may be (likely is?) from a different release.

Version Testing

MPI application built against revision B, mpirun from version A, orte and libmpi from revision B.

what does orte exactly means here ?

libopen-rte used by the app only ? (which implies mpirun and orted have the exact same version, and there are hence up to two Open MPI installations in the container, one for orted and its dependencies, and an other for the MPI app and its dependencies)
libopen-rte and orted ? (which implies mpirun and orted might have different versions)

rhc54 · 2018-02-02T09:57:34Z

You probably need to move this off of this PR - the discussion is way beyond this poor old PR.

I honestly think you are over-thinking this. The promise was solely about the compatibility of the app to the launch infrastructure (mpirun + daemons). If you put an orted inside the container and try to use it with an mpirun outside the container, then we only guarantee compatibility if the two are within the same OMPI major release series. Period. And yes - that means the executable and any dynamically linked libraries - we don't distinguish between them.

IMO, anybody invoking more than one OMPI version in a container is doing something very, very wrong.

Refs. open-mpi#4689 Signed-off-by: Gilles Gouaillardet <[email protected]> (back-ported from commit a056fde) Signed-off-by: Scott Miller <[email protected]>

jjhursey · 2018-02-02T14:39:32Z

Yeah this PR is ready to go now.

If there is a question of if we should move the regx framework into the v3.1.x series then we should move the discussion to PR #4747. I would like to get PR #4747 merged soon (or some version with the bug fix in it) so I can do some scale testing with the Open MPI v3.1.x HEAD.

jjhursey · 2018-02-02T15:54:56Z

@bwbarrett @hppritcha This PR is good to go.

jjhursey self-requested a review January 24, 2018 17:33

jjhursey self-assigned this Jan 24, 2018

jjhursey added the bug label Jan 24, 2018

jjhursey added this to the v3.0.2 milestone Jan 24, 2018

jjhursey requested a review from ggouaillardet January 24, 2018 18:06

jjhursey approved these changes Jan 24, 2018

View reviewed changes

ggouaillardet requested changes Jan 25, 2018

View reviewed changes

ggouaillardet and others added 2 commits January 25, 2018 09:21

Detect/warn of illegal node names

ad4df4f

If we detect that someone has given us an incorrect node name, provide a helpful message telling them as it is almost certainly a typo. Signed-off-by: Ralph Castain <[email protected]> (cherry picked from commit 3269f2d)

sam6258 force-pushed the regx_patch_3.0 branch from e1dbdcb to 4897df8 Compare January 25, 2018 14:48

bwbarrett added the Target: v3.0.x label Jan 30, 2018

ggouaillardet added a commit to ggouaillardet/pmix that referenced this pull request Feb 1, 2018

preg/native: correctly handle node names with multiple set of digits

98073d7

Refs. open-mpi/ompi#4689 Refs. open-mpi/ompi#4748 Signed-off-by: Gilles Gouaillardet <[email protected]>

sam6258 force-pushed the regx_patch_3.0 branch from 4897df8 to 2c6431b Compare February 1, 2018 15:16

rhc54 approved these changes Feb 1, 2018

View reviewed changes

jjhursey mentioned this pull request Feb 1, 2018

preg/native: correctly handle node names with multiple set of digits openpmix/openpmix#658

Merged

ggouaillardet approved these changes Feb 2, 2018

View reviewed changes

orte/nidmap: correctly handle node names with multiple set of digits

28d49dc

Refs. open-mpi#4689 Signed-off-by: Gilles Gouaillardet <[email protected]> (back-ported from commit a056fde) Signed-off-by: Scott Miller <[email protected]>

sam6258 force-pushed the regx_patch_3.0 branch from 2c6431b to 28d49dc Compare February 2, 2018 13:19

bwbarrett merged commit 72baf57 into open-mpi:v3.0.x Feb 12, 2018

sam6258 deleted the regx_patch_3.0 branch March 22, 2018 19:09

nidmap reverse prefix patch #4748

nidmap reverse prefix patch #4748

Uh oh!

Conversation

sam6258 commented Jan 24, 2018

Uh oh!

jjhursey left a comment

Choose a reason for hiding this comment

Uh oh!

ggouaillardet commented Jan 25, 2018

Uh oh!

ggouaillardet left a comment

Choose a reason for hiding this comment

Uh oh!

jjhursey commented Jan 25, 2018

Uh oh!

ggouaillardet commented Jan 26, 2018

Uh oh!

rhc54 commented Jan 26, 2018

Uh oh!

jjhursey commented Jan 31, 2018

Uh oh!

ggouaillardet commented Feb 1, 2018

Uh oh!

rhc54 commented Feb 1, 2018

Uh oh!

jjhursey commented Feb 1, 2018

Uh oh!

rhc54 commented Feb 1, 2018

Uh oh!

ggouaillardet commented Feb 1, 2018

Uh oh!

rhc54 commented Feb 1, 2018

Uh oh!

ggouaillardet commented Feb 1, 2018

Uh oh!

rhc54 commented Feb 1, 2018

Uh oh!

sam6258 commented Feb 1, 2018

Uh oh!

rhc54 left a comment

Choose a reason for hiding this comment

Uh oh!

ggouaillardet commented Feb 2, 2018

Uh oh!

ggouaillardet commented Feb 2, 2018

Uh oh!

rhc54 commented Feb 2, 2018

Uh oh!

ggouaillardet commented Feb 2, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rhc54 commented Feb 2, 2018

Uh oh!

ggouaillardet commented Feb 2, 2018

Uh oh!

rhc54 commented Feb 2, 2018

Uh oh!

ggouaillardet commented Feb 2, 2018

Uh oh!

rhc54 commented Feb 2, 2018

Uh oh!

jjhursey commented Feb 2, 2018

Uh oh!

jjhursey commented Feb 2, 2018

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

ggouaillardet commented Feb 2, 2018 •

edited

Loading