
Conversation

@wenduwan
Contributor

This reverts commit 34bd015.

Run the mpi4py test suite.
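
For reference, the suite was run along these lines (the runner script, process count, and paths are illustrative assumptions, not the exact invocation):

# assumes this branch of Open MPI is already built and installed in PATH
git clone https://github.com/mpi4py/mpi4py.git
cd mpi4py
python -m pip install .                  # build mpi4py against the installed Open MPI
mpiexec -n 2 python test/main.py         # run the full mpi4py test suite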

@wenduwan wenduwan self-assigned this Mar 14, 2024
@github-actions github-actions bot added this to the v5.0.3 milestone Mar 14, 2024
@rhc54
Contributor

rhc54 commented Mar 14, 2024

Errrr....there isn't an mpi4py test over here, @wenduwan, so I'm not sure what you were hoping to test with this PR. If you look at the main branch, you'll see that some of the errors are being exposed by PMIx/PRRTE updates as we fix other bugs, but not necessarily caused by those updates (e.g., the nextcid issue). I'm not sure what the problem is in the OMPI v5 branch, but it may be that the intercomm problems being seen in main are involved.

@rhc54
Contributor

rhc54 commented Mar 15, 2024

Took a gander, and noted that the branch is already pointing at PMIx v5.0.2rc1 - which means that version also passed all these tests. So I suspect this is likely the same set of issues as over in the main branch - it could be that you're getting reports because tests are being turned "on", or because we are uncovering other bugs (like we are seeing in the other branch).

@wenduwan
Contributor Author

@rhc54 I did a bisect on pmix and found this commit: openpmix/openpmix@6163f21
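
Roughly, the bisect went like this (the good/bad refs and the rebuild step are illustrative, not the exact commands I used):

git clone https://github.com/openpmix/openpmix.git && cd openpmix
git bisect start
git bisect bad master        # a revision where the mpi4py test fails
git bisect good v4.2.8       # a revision known to pass
# at each step: rebuild/reinstall PMIx under Open MPI, rerun the mpi4py test,
# then mark the revision with "git bisect good" or "git bisect bad"
# until git reports the first bad commit (6163f21 here)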

That commit exposed the following mpi4py failure:

[ip-172-31-16-140:1527241] [[9971,1],1] selected pml ob1, but peer [[9971,1],0] on unknown selected pml ��
[ip-172-31-16-140:1527241] OPAL ERROR: Unreachable in file communicator/comm.c at line 2385
[ip-172-31-16-140:1527241] 0: Error in ompi_get_rprocs
ERROR

======================================================================
ERROR: setUpClass (test_ulfm.TestULFMInter)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ubuntu/mpi4py/test/test_ulfm.py", line 197, in setUpClass
    INTERCOMM = MPI.Intracomm.Create_intercomm(
  File "src/mpi4py/MPI/Comm.pyx", line 2336, in mpi4py.MPI.Intracomm.Create_intercomm
    with nogil: CHKERR( MPI_Intercomm_create(
mpi4py.MPI.Exception: MPI_ERR_INTERN: internal error

----------------------------------------------------------------------
Ran 1667 tests in 33.733s

FAILED (errors=1, skipped=77)
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
  Proc: [[9971,1],1]
  Errorcode: 1

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.

Is this a bug in pmix or ompi?

@rhc54
Contributor

rhc54 commented Mar 15, 2024

A data point that helps is to set PMIX_MCA_gds=hash in the environment and re-run the test. I can think of one mechanism that might break it in PMIx, but this envar would turn that off.

@wenduwan
Contributor Author

Thanks @rhc54

I exported PMIX_MCA_gds=hash. Interestingly, I didn't observe the same failure, but a segfault instead. I turned on gds logging:

[ip-172-31-16-140:1782386] [server/pmix_server.c:3647] GDS FETCH KV WITH hash
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] pmix:gds:hash fetch NULL for proc [prterun-ip-172-31-16-140-1782386@2,WILDCARD] on scope UNDEFINED
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] HASH:FETCH table internal id WILDCARD key NULL
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] FETCH NULL LOOKING AT PMIX_SERVER_NSPACE
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] FETCH NULL LOOKING AT PMIX_SERVER_RANK
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] FETCH NULL LOOKING AT PMIX_JOBID
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] FETCH NULL LOOKING AT PMIX_NPROC_OFFSET
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] FETCH NULL LOOKING AT PMIX_NUM_NODES
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] FETCH NULL LOOKING AT PMIX_JOB_SIZE
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] FETCH NULL LOOKING AT PMIX_JOB_NUM_APPS
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] FETCH NULL LOOKING AT PMIX_MAX_PROCS
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] FETCH NULL LOOKING AT PMIX_MAPBY
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] FETCH NULL LOOKING AT PMIX_RANKBY
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] FETCH NULL LOOKING AT PMIX_BINDTO
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] FETCH NULL LOOKING AT PMIX_NSDIR
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] FETCH NULL LOOKING AT PMIX_HWLOC_XML_V2
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] FETCH NULL LOOKING AT PMIX_LOCAL_TOPO
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] FETCH NULL LOOKING AT PMIX_HWLOC_XML_V1
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] FETCH NULL LOOKING AT PMIX_HOSTNAME
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] FETCH NULL LOOKING AT PMIX_NODE_LIST
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] FETCH NULL LOOKING AT OMPI_APP_SIZES
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] FETCH NULL LOOKING AT OMPI_FIRST_RANKS
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] FETCH NULL LOOKING AT PMIX_BFROPS_MODULE
[ip-172-31-16-140:1782386] FETCHING SESSION INFO
[ip-172-31-16-140:1782386] FETCHING NODE INFO
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] gds:hash:fetch_nodearray adding key pmix.alias
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] gds:hash:fetch_nodearray adding key pmix.node.size
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] gds:hash:fetch_nodearray adding key pmix.pmem
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] gds:hash:fetch_nodearray adding key pmix.lprocs
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] gds:hash:fetch_nodearray adding key pmix.lpeers
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] gds:hash:fetch_nodearray adding key pmix.lldr
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] gds:hash:fetch_nodearray adding key pmix.local.size
[ip-172-31-16-140:1782386] FETCHING APP INFO WITH 2 APPS
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] HASH:FETCH table internal id 0 key NULL
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] FETCH NULL LOOKING AT PMIX_CPUSET
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] FETCH NULL LOOKING AT PMIX_LOCALITY_STRING
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] FETCH NULL LOOKING AT PMIX_DEVICE_DISTANCES
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] FETCH NULL LOOKING AT PMIX_PROCDIR
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] FETCH NULL LOOKING AT PMIX_GLOBAL_RANK
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] FETCH NULL LOOKING AT PMIX_PARENT_ID
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] FETCH NULL LOOKING AT PMIX_APPNUM
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] FETCH NULL LOOKING AT PMIX_APP_RANK
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] FETCH NULL LOOKING AT PMIX_LOCAL_RANK
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] FETCH NULL LOOKING AT PMIX_NODE_RANK
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] FETCH NULL LOOKING AT PMIX_NODEID
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] FETCH NULL LOOKING AT PMIX_REINCARNATION
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] FETCH NULL LOOKING AT PMIX_HOSTNAME
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] FETCH NULL LOOKING AT PMIX_PROC_PID
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] HASH:FETCH table internal id 1 key NULL
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] FETCH NULL LOOKING AT PMIX_CPUSET
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] FETCH NULL LOOKING AT PMIX_LOCALITY_STRING
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] FETCH NULL LOOKING AT PMIX_DEVICE_DISTANCES
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] FETCH NULL LOOKING AT PMIX_PROCDIR
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] FETCH NULL LOOKING AT PMIX_GLOBAL_RANK
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] FETCH NULL LOOKING AT PMIX_PARENT_ID
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] FETCH NULL LOOKING AT PMIX_APPNUM
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] FETCH NULL LOOKING AT PMIX_APP_RANK
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] FETCH NULL LOOKING AT PMIX_LOCAL_RANK
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] FETCH NULL LOOKING AT PMIX_NODE_RANK
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] FETCH NULL LOOKING AT PMIX_NODEID
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] FETCH NULL LOOKING AT PMIX_REINCARNATION
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] FETCH NULL LOOKING AT PMIX_HOSTNAME
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] FETCH NULL LOOKING AT PMIX_PROC_PID
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] HASH:FETCH table internal id 2 key NULL
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] FETCH NULL LOOKING AT PMIX_CPUSET
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] FETCH NULL LOOKING AT PMIX_LOCALITY_STRING
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] FETCH NULL LOOKING AT PMIX_DEVICE_DISTANCES
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] FETCH NULL LOOKING AT PMIX_PROCDIR
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] FETCH NULL LOOKING AT PMIX_GLOBAL_RANK
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] FETCH NULL LOOKING AT PMIX_PARENT_ID
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] FETCH NULL LOOKING AT PMIX_APPNUM
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] FETCH NULL LOOKING AT PMIX_APP_RANK
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] FETCH NULL LOOKING AT PMIX_LOCAL_RANK
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] FETCH NULL LOOKING AT PMIX_NODE_RANK
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] FETCH NULL LOOKING AT PMIX_NODEID
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] FETCH NULL LOOKING AT PMIX_REINCARNATION
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] FETCH NULL LOOKING AT PMIX_HOSTNAME
[ip-172-31-16-140:1782386] [prterun-ip-172-31-16-140-1782386@0,0] FETCH NULL LOOKING AT PMIX_PROC_PID
[ip-172-31-16-140:1782386] [server/pmix_server.c:3647] GDS FETCH KV WITH shmem2
[ip-172-31-16-140:1782386] gds:shmem2:HERE AT pmix_gds_shmem2_fetch,544
[ip-172-31-16-140:1782386] gds:shmem2:pmix_gds_shmem2_fetch:[prterun-ip-172-31-16-140-1782386@0,0] key=NULL for proc=[prterun-ip-172-31-16-140-1782386@1,WILDCARD] on scope=UNDEFINED
Segmentation fault (core dumped)

Thoughts?

@rhc54
Contributor

rhc54 commented Mar 15, 2024

Something in your setup isn't correct - with that envar set, the shmem2 component wouldn't be active, so you shouldn't be hitting a fetch from it. Is the envar in the environment prior to running mpirun?

@wenduwan
Contributor Author

I did mpiexec -x PMIX_MCA_gds=hash

@rhc54
Contributor

rhc54 commented Mar 15, 2024

I suspect that is too late - you need to set it prior to invoking mpiexec.

@wenduwan
Contributor Author

Got it - PMIX_MCA_gds=hash mpiexec works without error
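
For anyone following along, the difference between the two invocations, as I understand it (the test command itself is illustrative):

mpiexec -x PMIX_MCA_gds=hash -n 2 python test/main.py
# -x only forwards the variable to the launched application processes;
# the PMIx server inside mpiexec/prterun has already selected its gds
# component (shmem2) by then

PMIX_MCA_gds=hash mpiexec -n 2 python test/main.py
# putting it in mpiexec's own environment means the server side also
# falls back to the hash component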

@rhc54
Contributor

rhc54 commented Mar 15, 2024

@samuelkgutierrez We may be hitting a race condition here with respect to the dictionary update when we get multiple nspaces exchanging their modex info. We may need to dig into it a little deeper - probably defer that to a later release, as we discussed before? Do you have time and/or want to look at it a bit now?

@rhc54
Contributor

rhc54 commented Mar 15, 2024

Note that this doesn't seem to be a problem in a different PR (#12411), though that is hitting a different issue. So it may not be fully deterministic.

@samuelkgutierrez
Member

> @samuelkgutierrez We may be hitting a race condition here with respect to the dictionary update when we get multiple nspaces exchanging their modex info. We may need to dig into it a little deeper - probably defer that to a later release, as we discussed before? Do you have time and/or want to look at it a bit now?

I won't have time to look deeply into this anytime soon, so disabling the shared-memory component seems like a prudent thing to do.

@samuelkgutierrez
Member

Does this problem still occur with the head of master in OpenPMIx? I ask because @rhc54 fixed some shared-memory modex bugs last week.

@samuelkgutierrez
Member

> @rhc54 I did a bisect on pmix and found this commit openpmix/openpmix@6163f21
>
> It exposed the mpi4py failure

@rhc54 maybe the cross-version compatibility support is subtly broken?

@rhc54
Contributor

rhc54 commented Mar 15, 2024

> @rhc54 maybe the cross-version compatibility support is subtly broken?

I don't think so - this isn't cross-version. Everyone is using the same version here.

The problem appears when doing a comm_spawn and then creating the intercommunicator, though I'm really puzzled here as the two complaining procs are in the same nspace. The only thing that might involve PMIx (and I'm not convinced this is our problem yet) is that they formed an intercommunicator, which involves an exchange of modex info between the two jobs.

Both nspaces define a PML key (which is a non-reserved key) that is included in the exchange - and it is the first non-reserved key in each case. So it might be that the dictionary update is changing the index in job 1 when job 2's data is brought into the picture, and that is somehow causing a problem.

What I don't understand is why the two procs in job 1 are checking the PML key at this point - they already did so during job 1's MPI_Init, which obviously passed, or else there would be no intercommunicator being created. So why are they checking it again?

I don't know, which is why I won't commit to this being a PMIx issue just yet.

@rhc54
Contributor

rhc54 commented Mar 15, 2024

I think I'm going to put this into the "worry about it in the future" category for now. I have no idea what that test is doing, or when that error message appears (as I said, it makes no sense for it to show up in intercomm create). So I think this is something that can wait for higher priorities to be completed. If someone wants to provide more info, we can reevaluate things.

@wenduwan
Contributor Author

@rhc54 Could you clarify your decision about pmix 5.0.2? Is the plan to revert the commit?

@rhc54
Contributor

rhc54 commented Mar 15, 2024

You mean the PMIx commit?? Absolutely not - we know that (or something like it) is needed! I don't have sufficient evidence to conclude that it is the root cause of any problem. I'm saying I am going to put this issue report in the "worry about it in the future" category - i.e., at a low priority - pending any new info. I'm buried in several other things right now, and trying to chase down a black-box test result isn't high on the list.

@rhc54
Contributor

rhc54 commented Mar 15, 2024

Just to be clear, since I may not have put that as well as I should: I'm swamped right now, and will be for the next few weeks at least, so my time is highly limited. Pending someone else taking this on, there isn't a whole lot I can do. I don't really trust git bisect that much to correctly identify a problem - it has its uses, but it isn't all that precise, as it has no idea how to identify a root cause.

In this case, it is very hard to do much without someone telling us exactly what this test is doing, and how PMIx is involved. Just saying "this test fails" conveys no real information. This is especially suspicious since we see similar failures on main, but they aren't clearly related to PMIx. So if someone wants to do some investigating and provide a clear definition of the test and the failure, then someone might be more willing/able to take a look at it.

Just can't be me, at least for a while.

@wenduwan wenduwan force-pushed the v5.0.x_test_revert_pmix branch from 242d23d to 2fe9b5a on March 18, 2024 at 14:27
@wenduwan wenduwan marked this pull request as ready for review March 18, 2024 15:49
@wenduwan
Contributor Author

@rhc54 Understood. For the next ompi release we can keep pmix pinned to 4.2.8 to allow more time to properly diagnose and fix the pmix issue.
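
In practice the pin is just the submodule pointer - roughly like this (submodule path and tag name per my understanding of the v5.0.x tree):

cd ompi                                     # checkout of the v5.0.x branch
git submodule update --init 3rd-party/openpmix
cd 3rd-party/openpmix
git fetch --tags && git checkout v4.2.8     # point the submodule at the 4.2.8 tag
cd ../..
git add 3rd-party/openpmix
git commit -m "Pin openpmix submodule to v4.2.8"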

@janjust Now that the mpi4py test passes with pmix pinned, shall we merge this PR and go ahead with 5.0.3rc1 this week?

@rhc54
Contributor

rhc54 commented Mar 18, 2024

Okay...but that means you are once again kicking all the PMIx/PRRTE fixes down the road. It doesn't matter to me, but it might annoy your users to keep having their concerns go unaddressed release after release.

FWIW: in my opinion, nothing we've seen so far justifies pointing the finger at PMIx. Yes, it is possible it is a PMIx issue - but you are seeing very similar issues in main, and we have concluded those are not PMIx related.

@wenduwan wenduwan closed this Mar 20, 2024
@wenduwan wenduwan deleted the v5.0.x_test_revert_pmix branch March 26, 2024 14:08