Revert "openpmix: switch to v5.0 branch" #12408
Conversation
Errrr....there isn't an
Took a gander, and noted that the branch is already pointing at PMIx v5.0.2rc1 - which means that version also passed all these tests. So I suspect this concern is likely to be the same issues as over in the
@rhc54 I did a bisect on pmix and found this commit: openpmix/openpmix@6163f21. It exposed the mpi4py failure. Is this a bug in pmix or ompi?
A data point that helps is to set PMIX_MCA_gds=hash.
Thanks @rhc54. I exported PMIX_MCA_gds=hash. Interestingly, I didn't observe the same failure, but a segfault. I turned on gds logging. Thoughts?
Something in your setup isn't correct - you cannot be attempting to fetch from the
I did
I suspect that is too late - need to do it prior to doing
Got it -
@samuelkgutierrez We may be hitting a race condition here with respect to the dictionary update when we get multiple nspaces exchanging their modex info. We may need to dig into it a little deeper - probably defer that to a later release, as we discussed before? Do you have time and/or want to look at it a bit now?
Note that this doesn't seem to be a problem in a different PR (#12411), though that is hitting a different issue. So it may not be fully deterministic.
I won't have time to look deeply into this anytime soon, so disabling the shared-memory component seems like a prudent thing to do. |
Does this problem still occur with head of master in OpenPMIx? I ask because @rhc54 fixed some shared-memory modex bugs last week. |
@rhc54 maybe the cross-version compatibility support is subtly broken? |
I don't think so - this isn't cross-version. Everyone is using the same version here. The problem appears when doing a comm_spawn and then creating the intercommunicator, though I'm really puzzled here as the two complaining procs are in the same nspace. Only thing that might involve PMIx (and I'm not convinced this is our problem yet) is that they formed an intercommunicator, which involves an exchange of modex info between the two jobs. Both nspaces define a PML key (which is a non-reserved key) that is included in the exchange - and it is the first non-reserved key in each case. So it might be that the dictionary update is changing the index in job 1 when job 2's data is brought into the picture, and that is somehow causing a problem. What I don't understand is why the two procs in job 1 are checking the PML key at this point - they already did so during job 1's MPI_Init, which obviously passed else there would be no intercommunicator being created. So why are they checking it again? Don't know, which is why I don't commit to this being a PMIx issue just yet.
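For context on the modex exchange described above, here is a rough C sketch of the PMIx client calls involved when a non-reserved key is published by one job and later fetched for a peer in another nspace. It is only an illustration of the mechanism under discussion: the key name `example.pml`, the value, and the nspace string are assumptions, not the identifiers Open MPI actually uses.

```c
/* Hedged sketch of a non-reserved key traveling through the modex.
 * Key name, value, and peer nspace below are illustrative only. */
#include <stdio.h>
#include <string.h>
#include <pmix.h>

int main(void)
{
    pmix_proc_t myproc, peer;
    pmix_value_t val, *result = NULL;

    if (PMIX_SUCCESS != PMIx_Init(&myproc, NULL, 0)) {
        return 1;
    }

    /* Publish a non-reserved key into the modex for remote peers.
     * Non-reserved keys are assigned dictionary indices at runtime,
     * which is where the suspected index shift could come into play. */
    PMIX_VALUE_CONSTRUCT(&val);
    val.type = PMIX_STRING;
    val.data.string = strdup("ob1");             /* illustrative value */
    PMIx_Put(PMIX_REMOTE, "example.pml", &val);  /* hypothetical key name */
    PMIX_VALUE_DESTRUCT(&val);
    PMIx_Commit();

    /* Fetch the same key for a peer in a *different* nspace, as happens
     * when the intercommunicator is built after a comm_spawn.
     * "child-nspace" is a placeholder for the spawned job's nspace. */
    PMIX_LOAD_PROCID(&peer, "child-nspace", 0);
    if (PMIX_SUCCESS == PMIx_Get(&peer, "example.pml", NULL, 0, &result)) {
        printf("peer pml: %s\n", result->data.string);
        PMIX_VALUE_RELEASE(result);
    }

    PMIx_Finalize(NULL, 0);
    return 0;
}
```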
I think I'm going to put this into the "worry about it in the future" category for now. I have no idea what that test is doing, or when that error message appears (as I said, it makes no sense to show up in intercomm create). So I think this is something that can wait for higher priorities to be completed. If someone wants to provide more info, we can reevaluate things.
@rhc54 Could you clarify your decision about PMIx 5.0.2? Is the plan to revert the commit?
You mean the PMIx commit?? Absolutely not - we know that (or something like it) is needed! I don't have sufficient evidence to conclude that it is the root cause of any problem. I'm saying I am going to put this issue report in the "worry about it in the future" category - i.e., at a low priority - pending any new info. I'm buried in several other things right now, and trying to chase down a black-box test result isn't high on the list.
Just to be clear, since I may not have put that as well as I should: what I'm saying is that I'm swamped right now and for the next few weeks at least, so my time is highly limited. Pending someone else taking this on, there isn't a whole lot I can do. I don't really trust git bisect that much to correctly identify a problem - it has its uses, but it isn't all that precise as it has no idea how to identify a root cause. In this case, it is very hard to do much without someone telling us exactly what this test is doing, and how PMIx is involved. Just saying "this test fails" conveys no real information. This is especially suspicious since we see similar failures on ... It just can't be me, at least for a while.
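For anyone who does pick this up later, a hypothetical minimal C reproducer of the pattern under discussion (comm_spawn followed by creation of an intercommunicator spanning the two jobs) might look like the sketch below. This is not the actual mpi4py test; the self-spawn of argv[0] and the child count are arbitrary choices made for illustration.

```c
/* Hypothetical minimal reproducer of the spawn + intercommunicator
 * pattern discussed above -- NOT the actual mpi4py test. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm intercomm, merged;
    int high;

    MPI_Init(&argc, &argv);

    MPI_Comm_get_parent(&intercomm);
    if (MPI_COMM_NULL == intercomm) {
        /* Parent side: spawn two children (count is arbitrary)
         * running this same binary. */
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 2, MPI_INFO_NULL, 0,
                       MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);
        high = 0;
    } else {
        high = 1;  /* Child side: the parent intercommunicator was returned. */
    }

    /* Merging the intercommunicator is the step that exercises the
     * cross-job modex exchange mentioned in the discussion above. */
    MPI_Intercomm_merge(intercomm, high, &merged);
    MPI_Barrier(merged);

    MPI_Comm_free(&merged);
    MPI_Comm_disconnect(&intercomm);
    MPI_Finalize();
    return 0;
}
```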
This reverts commit 34bd015.
Force-pushed from 242d23d to 2fe9b5a.
Okay...but that means you are once again kicking all the PMIx/PRRTE fixes down the road. Doesn't matter to me, but that might annoy your users to keep having their concerns unaddressed for release after release. FWIW: in my opinion, nothing we've seen so far justifies pointing the finger at PMIx. Yes, it is possible it is a PMIx issue - but you are seeing very similar issues in
Run mpi4py test