This repository was archived by the owner on Sep 30, 2022. It is now read-only.


@hjelmn
Member

@hjelmn hjelmn commented May 19, 2016

This should never have gone into 2.0.0. The ofi provider's priority should
never be higher than ob1 if verbs or sockets is the provider.

This reverts commit 1b5637d.

Fixes open-mpi/ompi#1676

@hjelmn
Member Author

hjelmn commented May 19, 2016

@yburette This should not have gone into 2.0.0.

@hjelmn
Member Author

hjelmn commented May 19, 2016

:bot:assign: @jsquyres
:bot:label:bug
:bot:label:blocker
:bot:milestone:v2.0.0

@jsquyres
Member

@yburette @hppritcha @matcabral Do you guys know/remember why the priority was set so high? Shouldn't the OFI MTL set its priority high only if certain libfabric providers are being used?

@hjelmn
Member Author

hjelmn commented May 19, 2016

It was set to 30 on master along with psm2, psm, portals4, and mxm. The problem was we were supposed to leave it at 10 on 2.0.0. The commit that restored the master priority of ofi to be higher than ob1 was included in a cm priority update and wasn't supposed to be.

It is probably a good thing we hit this problem now because it shows we should drop the priority on master until the priority can be based on available providers (we never want to choose sockets over ob1 for example).

@mellanox-github

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/1675/ for details.

@hppritcha
Member

I defer to @yburette . All that I can say though is that compared to other things, OFI MTL has been rock solid. We could make priority based on the underlying provider rather than a fixed parameter. Even there, the verbs provider in OFI is maturing really rapidly.

@hjelmn
Member Author

hjelmn commented May 19, 2016

@hppritcha Ideally that's what we want, but we will need ofi/verbs + cm vs ob1/verbs performance numbers before letting ofi win.

@jsquyres
Member

@hppritcha I don't think that the verbs support in libfabric is mature enough yet. Additionally, there's the issue of shared memory support in the MTL (which is nominally handled by psm/psm2). Does that even exist yet for the verbs libfabric provider?

I thought we had some kind of include / exclude MCA parameter for libfabric providers in the OFI MTL. Do those not exist?

@hppritcha
Member

@jsquyres Yes, there is an include/exclude MCA parameter for providers.
I'm fine with reverting the commit though. We don't need to wait for @yburette .

@jsquyres
Member

@hppritcha What's the default "include" setting for the providers -- should it default to something like psm,psm2? I.e., if the provider being used is in that list, the priority can go up to a higher value, but if the provider is not in that list, OFI should disqualify itself.
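
The policy proposed in the comment above can be sketched roughly as follows. This is a minimal illustration in Python, not Open MPI's actual C code: the function name `ofi_mtl_priority` and the value `HIGH_PRIORITY = 30` are assumptions made for the example, and only the `psm,psm2` default comes from the comment.

```python
from typing import Optional

# Default include list suggested in the comment above.
DEFAULT_INCLUDE = "psm,psm2"
# Illustrative number only; the real value is whatever the MTL is configured with.
HIGH_PRIORITY = 30


def ofi_mtl_priority(provider: str, include: str = DEFAULT_INCLUDE) -> Optional[int]:
    """Return a high priority if `provider` is on the include list.

    Returns None when the provider is not listed, meaning the OFI MTL
    would disqualify itself and leave the choice to other components.
    """
    allowed = {name.strip() for name in include.split(",") if name.strip()}
    return HIGH_PRIORITY if provider in allowed else None
```

Under this sketch, `ofi_mtl_priority("psm2")` yields the high priority, while `ofi_mtl_priority("sockets")` and `ofi_mtl_priority("verbs")` yield `None`, so OFI would never be picked over ob1 on those providers.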

@yburette
Member

@jsquyres @hjelmn @hppritcha Sorry guys, just saw this discussion.

I thought that the original idea was to have the higher priority if, and only if, the OFI provider was in the include list. Let me take a closer look.

@yburette
Member

"(we never want to choose sockets over ob1 for example)."
@hjelmn I agree, and I thought that was taken care of by having the sockets provider in the exclude list.
I must have missed something: what is the difference between v2.x and master w.r.t. the way the components are queried and selected?

@yburette
Member

@hjelmn Another question: which OFI provider ends up being selected?
I can see how the exclude list should be updated with some of the new OFI providers...
Could you please check that using --mca mtl_ofi_provider_exclude with the providers you don't want (e.g. sockets, mxm, verbs) solves the issue?

@hjelmn
Member Author

hjelmn commented May 19, 2016

For the user (iWARP), verbs is selected (it doesn't even appear to have a matching implementation and should not win vs ob1 anyway). If verbs is disabled, then sockets wins (ACK!).

@yburette
Member

@hjelmn What about when you disable both?
--mca mtl_ofi_provider_exclude "sockets,mxm,verbs"

@jsquyres
Member

How about changing from a default of excluding providers that we know we don't want to a default of including the providers that we know that we do want?

It feels like the risk is lower for including known good providers.

@hppritcha
Member

I'd go with including known good providers. That way we don't need to keep track of the multitude of "layered" providers that seem to be appearing...

@jsquyres
Member

@hjelmn @matcabral @yburette I therefore think that this PR should be closed and/or changed to set the "includes" as the default. What should be the value: psm,psm2,gni?

@yburette
Member

@jsquyres @hppritcha I see your point, and I agree that now that we have the "layered" providers, it's going to become increasingly difficult to maintain this exclude list.

I think the change should be fairly trivial -- i.e. include = "psm,psm2", exclude = NULL.

@hppritcha Do you want me to include the gni provider by default as well?
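
To make the include-versus-exclude trade-off concrete, here is a small Python sketch of how the two lists could filter the providers that libfabric reports. The helper `filter_providers` and the rule that a non-empty include list takes precedence over the exclude list are assumptions for illustration; the actual handling in the OFI MTL may differ.

```python
from typing import Iterable, List, Optional, Set


def _parse(csv: Optional[str]) -> Set[str]:
    """Split a comma-separated MCA-style value into a set of names."""
    if not csv:
        return set()
    return {name.strip() for name in csv.split(",") if name.strip()}


def filter_providers(
    available: Iterable[str],
    include: Optional[str] = None,
    exclude: Optional[str] = None,
) -> List[str]:
    """Keep providers allowed by the include list, else drop excluded ones.

    Assumption: when an include list is given, the exclude list is ignored.
    """
    included = _parse(include)
    excluded = _parse(exclude)
    if included:
        return [p for p in available if p in included]
    return [p for p in available if p not in excluded]
```

With the old default (`exclude="sockets,mxm"`), a host offering `verbs` and `sockets` still leaves `verbs` eligible, which matches the iWARP report earlier in the thread. With `include="psm,psm2"` the same host leaves nothing eligible, and any new layered provider is rejected without anyone having to update a list.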

@hjelmn
Member Author

hjelmn commented May 19, 2016

Whether or not to include gni is @hppritcha 's call. Since it is not on any Cray platform by default it is probably ok to include, but I would like to see performance numbers vs ob1 before doing that. If ob1 loses in a performance comparison then that would be an indication that ob1 needs additional work.

@hppritcha
Member

yes please include gni.

@yburette
Member

@hjelmn @hppritcha @jsquyres
I have just created a new PR on master open-mpi/ompi#1680.
I can create a new PR for v2.x either now or once it's merged onto master, whichever you prefer.

@jsquyres
Member

@hjelmn @yburette Ok, I think we're in a good state on master. What exactly do you want on v2.x? Right now (before this PR is merged), here's the current state:

  1. PSM2 MTL priority is 40
  2. PSM MTL priority is 30
  3. OFI MTL priority is 25
  4. OFI MTL excludes sockets,mxm providers

I believe that what we want is to leave all the priorities alone (i.e., ignore/close this PR), and get a v2.x PR of open-mpi/ompi@2f0cde7 (i.e., adjust the include/excludes for the OFI MTL).

Right?
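
As a rough illustration of why the priorities listed above can be left alone, the following Python sketch picks the highest-priority MTL among those that did not disqualify themselves. The priority numbers are the v2.x values from the list; the `select_mtl` helper is hypothetical and greatly simplifies Open MPI's real component selection.

```python
from typing import Iterable, Optional

# v2.x MTL priorities as listed in the comment above.
MTL_PRIORITY = {"psm2": 40, "psm": 30, "ofi": 25}


def select_mtl(usable: Iterable[str]) -> Optional[str]:
    """Return the usable MTL with the highest priority, or None if none is usable."""
    candidates = [name for name in usable if name in MTL_PRIORITY]
    if not candidates:
        return None
    return max(candidates, key=lambda name: MTL_PRIORITY[name])
```

If the OFI MTL removes itself from `usable` whenever its provider is not on the include list, then `select_mtl` returns `None` on a verbs-only or sockets-only host, and the fixed priority of 25 no longer matters there.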

@yburette
Member

@hjelmn @jsquyres Unless I'm missing something, I think pulling the changes from master should fix this issue. Do you want me to go ahead and create a PR for v2.x?

@jsquyres
Member

@yburette Yes, that would be great. Thanks!

@jsquyres jsquyres closed this May 23, 2016