@rhc54
Contributor

@rhc54 rhc54 commented Jun 20, 2018

Provide complete coverage of PMIx data types in the OPAL transition layer, printing an OPAL_ERROR_LOG where we don't support one so we can see what is missing in the MTT tests. I've been unable to reproduce the MTT failures locally.

Signed-off-by: Ralph Castain [email protected]

@rhc54 rhc54 added the bug label Jun 20, 2018
@rhc54 rhc54 added this to the v3.1.1 milestone Jun 20, 2018
@rhc54 rhc54 self-assigned this Jun 20, 2018
@rhc54 rhc54 requested a review from jjhursey June 20, 2018 16:05
@rhc54
Contributor Author

rhc54 commented Jun 20, 2018

@bwbarrett I'm trying to debug MTT failures that I cannot reproduce locally. This should provide some debug.

@jsquyres @PeterGottesman Can you perhaps give this branch a whirl on your machines? Specifically, the dynamic/no-disconnect test is segfaulting and I have no idea why.

@PeterGottesman
Contributor

I'm not seeing any segfault. Is it failing with specific parameters or configure options?

@rhc54
Contributor Author

rhc54 commented Jun 20, 2018

I'm not able to reproduce either - the failures are showing in your MTT report. See
https://mtt.open-mpi.org/index.php?do_redir=2639

@PeterGottesman
Contributor

I am unable to reproduce with either this branch or the original build that segfaulted. I have attached the stack trace from the MTT run, and will update if I am able to reproduce.

(gdb) bt
#0  0x00002aaaab5946f3 in pmix2x_value_unload () from /home/mpiteam/scratches/community/2018-06-19cron/cZkT/installs/beoN/install/lib/libopen-pal.so.40
#1  0x00002aaaab59365a in pmix2x_event_hdlr () from /home/mpiteam/scratches/community/2018-06-19cron/cZkT/installs/beoN/install/lib/libopen-pal.so.40
#2  0x00002aaaab86b10b in pmix_invoke_local_event_hdlr () from /home/mpiteam/scratches/community/2018-06-19cron/cZkT/installs/beoN/install/lib/libpmix.so.2
#3  0x00002aaaab8722b3 in check_cached_events () from /home/mpiteam/scratches/community/2018-06-19cron/cZkT/installs/beoN/install/lib/libpmix.so.2
#4  0x00002aaaab87057e in regevents_cbfunc () from /home/mpiteam/scratches/community/2018-06-19cron/cZkT/installs/beoN/install/lib/libpmix.so.2
#5  0x00002aaaab950e47 in pmix_ptl_base_process_msg () from /home/mpiteam/scratches/community/2018-06-19cron/cZkT/installs/beoN/install/lib/libpmix.so.2
#6  0x00002aaaab539edc in event_process_active_single_queue () from /home/mpiteam/scratches/community/2018-06-19cron/cZkT/installs/beoN/install/lib/libopen-pal.so.40
#7  0x00002aaaab53a150 in event_process_active () from /home/mpiteam/scratches/community/2018-06-19cron/cZkT/installs/beoN/install/lib/libopen-pal.so.40
#8  0x00002aaaab53a7a3 in opal_libevent2022_event_base_loop () from /home/mpiteam/scratches/community/2018-06-19cron/cZkT/installs/beoN/install/lib/libopen-pal.so.40
#9  0x00002aaaab8c80b3 in progress_engine () from /home/mpiteam/scratches/community/2018-06-19cron/cZkT/installs/beoN/install/lib/libpmix.so.2
#10 0x00000037b42079d1 in start_thread () from /lib64/libpthread.so.0
#11 0x00000037b3ee8b6d in clone () from /lib64/libc.so.6
#12 0x0000000000000000 in ?? ()

@rhc54
Contributor Author

rhc54 commented Jun 20, 2018

FWIW: note that the segfault occurred while handling a PMIx event, which implies that something went wrong in the job. Did you see any reported failure of the job itself? We might need to test against something that fails and therefore generates an event.

@PeterGottesman
Contributor

Yep, I am getting the following error:

--------------------------------------------------------------------------
All nodes which are allocated for this job are already filled.
--------------------------------------------------------------------------
Parent sent: level 3 (pid:21681)
level = 4
Parent sent: level 2 (pid:21672)
level = 3
Parent sent: level 2 (pid:21680)
level = 3
[mpi005:21678] *** An error occurred in MPI_Comm_spawn
[mpi005:21678] *** reported by process [1155334150,0]
[mpi005:21678] *** on communicator MPI_COMM_SELF
[mpi005:21678] *** MPI_ERR_SPAWN: could not spawn processes
[mpi005:21678] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[mpi005:21678] ***    and potentially your MPI job)
[savbu-usnic-a:05595] 7 more processes have sent help message help-orte-rmaps-base.txt / orte-rmaps-base:all-available-resources-used

I am still unable to duplicate the segfault, however.

@rhc54
Contributor Author

rhc54 commented Jun 26, 2018

@PeterGottesman Were you running this PR? Or the official 3.1.x branch? Without this PR applied, I can't help you debug it.

Signed-off-by: Ralph Castain <[email protected]>
@PeterGottesman
Contributor

The run in that issue is from 3.1.x, not this PR. I was still unable to reproduce the segfault, although I will attempt again when I get to the office this morning.

@bwbarrett
Member

I forgot to follow up on this PR during today's call. Is this a patch that you want in a release, or just something we want to put in the release branch, debug the original problem, and revert?

@rhc54
Contributor Author

rhc54 commented Jun 26, 2018

There shouldn't be a problem in having this in the release itself - all it does is ensure we (a) correctly handle all the supported data types and (b) generate a more meaningful error when we hit one that we don't support. If at the end of the day we prefer to revert it, I have no heartburn with that decision as we shouldn't be using those data types anyway - this just tries to help expose the source of the error.
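
To illustrate points (a) and (b), here is a minimal C sketch of the pattern: a conversion routine that handles each supported type explicitly and reports a located error for anything else. It is not the code in this PR. All `demo_` names and type tags are hypothetical stand-ins, and `fprintf` to stderr stands in for OPAL_ERROR_LOG.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical type tags; these are NOT the real PMIx or OPAL constants. */
typedef enum {
    DEMO_TYPE_INT32   = 1,
    DEMO_TYPE_UINT64  = 2,
    DEMO_TYPE_STRING  = 3,
    DEMO_TYPE_POINTER = 4   /* deliberately unsupported in this sketch */
} demo_type_t;

/* Hypothetical tagged value, loosely modeled on a typed key-value payload. */
typedef struct {
    demo_type_t type;
    union {
        int32_t     int32;
        uint64_t    uint64;
        const char *string;
        void       *ptr;
    } data;
} demo_value_t;

#define DEMO_SUCCESS            0
#define DEMO_ERR_NOT_SUPPORTED (-1)

/* Copy src into dst. Every supported type gets its own case; anything
 * else falls through to default, which logs the file, line, and
 * offending type tag instead of failing silently. */
static int demo_value_unload(demo_value_t *dst, const demo_value_t *src)
{
    assert(dst != NULL && src != NULL);
    dst->type = src->type;
    switch (src->type) {
    case DEMO_TYPE_INT32:
        dst->data.int32 = src->data.int32;
        return DEMO_SUCCESS;
    case DEMO_TYPE_UINT64:
        dst->data.uint64 = src->data.uint64;
        return DEMO_SUCCESS;
    case DEMO_TYPE_STRING:
        /* Shallow copy for brevity; real code would likely duplicate it. */
        dst->data.string = src->data.string;
        return DEMO_SUCCESS;
    default:
        fprintf(stderr, "%s:%d: unsupported data type %d\n",
                __FILE__, __LINE__, (int)src->type);
        return DEMO_ERR_NOT_SUPPORTED;
    }
}
```

With this shape, an unexpected type in a test run shows up as a log line naming the type, which is the kind of information a bare segfault does not give.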

@bwbarrett
Member

@jsquyres @PeterGottesman I'm happy to merge this into v3.1.x (now that v3.1.1 has been tagged and is on its way to release), but someone needs to do a review.

@bwbarrett bwbarrett modified the milestones: v3.1.1, v3.1.2 Jun 29, 2018
Member

@jsquyres jsquyres left a comment

The giant list of naked integer<-->pmix_state translations is kinda scary. Ralph tells me that these correspond to ORTE values, and therefore we can't use those named enums down here in OPAL. In the future, someone could construct a translation table, but it's probably not worth it for these debugging-specific commits (that aren't even on master).
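
For reference, a translation table of the kind mentioned above might look roughly like the following C sketch. Everything here is hypothetical: the `demo_` names, the state enum, and the integer codes are invented for illustration and do not match any real ORTE or PMIx values.

```c
#include <stddef.h>

/* Hypothetical named states on the "named enum" side of the mapping. */
typedef enum {
    DEMO_STATE_UNDEF = 0,
    DEMO_STATE_RUNNING,
    DEMO_STATE_TERMINATED
} demo_state_t;

/* One row per pairing of a raw integer code with a named state. */
typedef struct {
    int          code;
    demo_state_t state;
} demo_state_map_t;

/* The integer codes below are invented; a real table would list the
 * actual values from the layer that owns them. */
static const demo_state_map_t demo_state_table[] = {
    {  4, DEMO_STATE_RUNNING    },
    { 20, DEMO_STATE_TERMINATED },
};

#define DEMO_STATE_TABLE_LEN \
    (sizeof(demo_state_table) / sizeof(demo_state_table[0]))

/* Raw integer -> named state; unknown codes map to DEMO_STATE_UNDEF. */
static demo_state_t demo_state_from_int(int code)
{
    for (size_t i = 0; i < DEMO_STATE_TABLE_LEN; i++) {
        if (demo_state_table[i].code == code) {
            return demo_state_table[i].state;
        }
    }
    return DEMO_STATE_UNDEF;
}

/* Named state -> raw integer; returns -1 when the state has no entry. */
static int demo_state_to_int(demo_state_t state)
{
    for (size_t i = 0; i < DEMO_STATE_TABLE_LEN; i++) {
        if (demo_state_table[i].state == state) {
            return demo_state_table[i].code;
        }
    }
    return -1;
}
```

Keeping both directions driven by one table means each integer appears exactly once, so the two translations cannot drift apart the way two hand-written switch statements can.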

@bwbarrett bwbarrett merged commit 43a8e77 into open-mpi:v3.1.x Jul 2, 2018
@rhc54 rhc54 deleted the cmr31/probe branch September 21, 2018 13:12