Try to debug the MTT failures #5309
Provide complete coverage of PMIx data types in the opal transition layer, printing an OPAL_ERROR_LOG where we don't support one so we can see what is missing in the MTT tests. I've been unable to reproduce them locally. Signed-off-by: Ralph Castain <[email protected]>
|
@bwbarrett I'm trying to debug MTT failures that I cannot reproduce locally. This should provide some debug. @jsquyres @PeterGottesman Can you perhaps give this branch a whirl on your machines? Specifically, the dynamic/no-disconnect test is segfaulting and I have no idea why |
|
I'm not seeing any segfault, is it failing with specific parameters/configures? |
|
I'm not able to reproduce either - the failures are showing in your MTT report. See |
|
I am unable to reproduce with both this branch and the original build that segfaulted. I have attached the stack trace from the MTT run, and will update if I am able to reproduce |
|
FWIW: note that the segfault occurred while handling a PMIx event, which implies that something went wrong in the job. Did you see any reported failure of the job itself? We might need to test against something that fails and therefore generates an event. |
|
Yep, I am getting the following error: I am still unable to duplicate the segfault however. |
|
@PeterGottesman Were you running this PR? Or the official 3.1.x branch? Minus this PR, I can't help you debug it. |
Signed-off-by: Ralph Castain <[email protected]>
|
The run in that issue is from 3.1.x, not this PR. I was still unable to reproduce the segfault, although I will attempt again when I get to the office this morning. |
|
I forgot to follow up on this PR during today's call. Is this a patch that you want in a release, or just something we want to put in the release branch, debug the original problem, and revert? |
|
There shouldn't be a problem in having this in the release itself - all it does is ensure we (a) correctly handle all the supported data types and (b) generate a more meaningful error when we hit one that we don't support. If at the end of the day we prefer to revert it, I have no heartburn with that decision as we shouldn't be using those data types anyway - this just tries to help expose the source of the error. |
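For readers unfamiliar with the pattern described above, here is a minimal sketch of a type translation that covers every known type and reports the ones it cannot handle. All names here (`src_type_t`, `dst_type_t`, `convert_type`, the error codes) are hypothetical stand-ins, not the real PMIx or OPAL identifiers, and `fprintf` stands in for the error-logging macro:

```c
#include <stdio.h>

/* Hypothetical stand-ins for the two type systems being bridged.
 * These are NOT the real PMIx or OPAL enums. */
typedef enum {
    SRC_INT32,
    SRC_STRING,
    SRC_BOOL,
    SRC_BLOB            /* a type the transition layer does not support */
} src_type_t;

typedef enum {
    DST_INT32,
    DST_STRING,
    DST_BOOL
} dst_type_t;

#define CONVERT_OK                 0
#define CONVERT_ERR_NOT_SUPPORTED -1

/* Translate a source type into the destination type system.
 * Every supported type gets an explicit case; anything else is
 * reported with file/line context instead of failing silently. */
int convert_type(src_type_t in, dst_type_t *out)
{
    switch (in) {
    case SRC_INT32:
        *out = DST_INT32;
        return CONVERT_OK;
    case SRC_STRING:
        *out = DST_STRING;
        return CONVERT_OK;
    case SRC_BOOL:
        *out = DST_BOOL;
        return CONVERT_OK;
    default:
        /* Stand-in for an error-log macro: name the offending type
         * so a failing test run shows exactly what is missing. */
        fprintf(stderr, "%s:%d: unsupported source type %d\n",
                __FILE__, __LINE__, (int)in);
        return CONVERT_ERR_NOT_SUPPORTED;
    }
}
```

The point of the `default` branch is that an unhandled type shows up in the test log with its numeric value, rather than surfacing later as an unrelated crash.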
|
@jsquyres @PeterGottesman I'm happy to merge this into v3.1.x (now that v3.1.1 has been tagged and is on its way to release), but someone needs to do a review. |
The giant list of naked integer<-->pmix_state translations is kinda scary. Ralph tells me that these correspond to ORTE values, and therefore we can't use those named enums down here in OPAL. In the future, someone could construct a translation table, but it's probably not worth it for these debugging-specific commits (that aren't even on master).
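As a rough illustration of what such a translation table could look like, here is a sketch that keeps the integer-to-state pairs in one array and derives both lookup directions from it. The state names, the integer values, and the function names are all invented for illustration; they are not the actual ORTE or PMIx values:

```c
#include <stddef.h>

/* Hypothetical lower-layer state enum (NOT the real pmix state type). */
typedef enum {
    LOCAL_STATE_UNDEF,
    LOCAL_STATE_RUNNING,
    LOCAL_STATE_TERMINATED
} local_state_t;

/* One row per pairing: the upper layer's bare integer value and the
 * matching local state. The integers below are made up. */
typedef struct {
    int           upper_value;
    local_state_t state;
} state_map_t;

static const state_map_t state_table[] = {
    { 0,  LOCAL_STATE_UNDEF      },
    { 4,  LOCAL_STATE_RUNNING    },
    { 20, LOCAL_STATE_TERMINATED },
};

static const size_t state_table_len =
    sizeof(state_table) / sizeof(state_table[0]);

/* Upper-layer integer -> local state. Returns 0 on success, -1 if unknown. */
int upper_to_local(int value, local_state_t *out)
{
    for (size_t i = 0; i < state_table_len; i++) {
        if (state_table[i].upper_value == value) {
            *out = state_table[i].state;
            return 0;
        }
    }
    return -1;
}

/* Local state -> upper-layer integer. Returns 0 on success, -1 if unknown. */
int local_to_upper(local_state_t state, int *out)
{
    for (size_t i = 0; i < state_table_len; i++) {
        if (state_table[i].state == state) {
            *out = state_table[i].upper_value;
            return 0;
        }
    }
    return -1;
}
```

With a single table, the two directions cannot drift out of sync the way two hand-written switch statements can, at the cost of a linear scan per lookup.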