@rhc54
Contributor

@rhc54 rhc54 commented Jun 7, 2016

Enable the PMIx event notification capability and use that for all error notifications, including debugger release. This capability requires use of PMIx 2.0 or above as the features are not available with earlier PMIx releases. When OMPI master is built against an earlier external version, it will fall back to the prior behavior, i.e., the debugger will be released via RML and all notifications will go strictly to the default error handler.

Add PMIx 2.0

Remove PMIx 1.1.4

Cleanup copying of component

@rhc54
Contributor Author

rhc54 commented Jun 8, 2016

@ggouaillardet @bosilca I could use some help here, folks. Everything seems to be working just fine with the event notification code, with the exception of MPI_Abort. The application procs that don't call "abort" are segfaulting, but I cannot get a complete core. I've tried a variety of tricks, but nothing I do seems to help get a core, and so I am having trouble identifying the source of the failure.

I'd appreciate any insight you can provide on where the fault is occurring.

@ggouaillardet
Contributor

@rhc54 i will give it a try
did you configure with --disable-dlopen ?
on one hand, that could hide the bug, but on the other hand, you will not invoke callbacks in unmapped libs.

@rhc54
Contributor Author

rhc54 commented Jun 8, 2016

@ggouaillardet Just gave it a try with --disable-dlopen, and as you suspected it hides the bug - no failures.

@ggouaillardet
Contributor

i get a segfault in mpirun

in opal/mca/pmix/pmix2x/pmix/src/usock/usock_sendrecv.c:107

        PMIX_RELEASE(peer->info);
        /* reduce the number of local procs */
        --peer->info->nptr->server->nlocalprocs;

if configure'd with --enable-debug and peer->info was released by PMIX_RELEASE, it has been set to NULL and hence cannot be dereferenced on the next line.

i'll make a quick patch for that and see how things evolve ...
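
For reference, a minimal sketch of the ordering problem described above, assuming the fix is simply to update the counter before dropping the reference. The `demo_*` types and the `DEMO_RELEASE` macro are hypothetical stand-ins for the PMIx structures and for `PMIX_RELEASE`; they are not the real PMIx definitions, and the actual patch may differ.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the PMIx peer/server bookkeeping. */
typedef struct { int nlocalprocs; } demo_server_t;
typedef struct { demo_server_t *server; } demo_nspace_t;
typedef struct { int refcount; demo_nspace_t *nptr; } demo_info_t;
typedef struct { demo_info_t *info; } demo_peer_t;

/* Mimics a release macro that frees the object when the last
 * reference is dropped and then sets the caller's pointer to NULL. */
#define DEMO_RELEASE(obj)                \
    do {                                 \
        if (0 == --(obj)->refcount) {    \
            free(obj);                   \
            (obj) = NULL;                \
        }                                \
    } while (0)

/* Update the counter while peer->info is still valid, then release. */
static void demo_peer_lost(demo_peer_t *peer)
{
    assert(NULL != peer && NULL != peer->info);
    /* reduce the number of local procs */
    --peer->info->nptr->server->nlocalprocs;
    DEMO_RELEASE(peer->info);
}
```

With the two statements in the original order, the decrement would go through a pointer that the release had already set to NULL.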

@rhc54
Contributor Author

rhc54 commented Jun 8, 2016

@ggouaillardet I think what is happening is that an event is still defined and firing after the component holding the callback function is unloaded. I thought I had some ideas on where it might be, but they proved incorrect. Still searching, so holler if you gain any insights

@ggouaillardet
Contributor

holler holler holler

after the trivial patch to make mpirun happy, i might get some core, but not limited to the task(s) that do not invoke MPI_Abort()

i do not know how the MPI tasks are supposed to be killed (suicide in PMIx ? external kill by orted ?) but what happens is the tasks do receive SIGCONT SIGTERM and even SIGKILL from orted.

and guess who is trapping SIGTERM ?
the PSM signal handler, that was not unregistered when libinfinipath.so was unmapped ...

as a workaround, you can either export IPATH_NO_BACKTRACE=1 or mpirun --mca pml ob1

this PSM thing is getting really annoying, should we move to a more radical approach such as
do not use libinfinipath.so unless

  • --disable-dlopen is used
    or
  • --enable-legacy-infinipath is specified
    or
  • libinfinipath.so fixed this issue

@jsquyres any thoughts ? (a consistent behavior might be desired in libfabric too)

@jsquyres
Member

jsquyres commented Jun 9, 2016

@ggouaillardet Hmm. I thought the SIGTERM issue was fixed by PSM...? @matcabral Was it fixed in libpsm[2] itself, and not in OMPI? E.g., does @ggouaillardet have an older version of the PSM[2] library?

@ggouaillardet
Contributor

i quickly checked that, and the fix is only available from OFED 3.18-2rc1

i run on an up-to-date CentOS 7 box, and this fix might not even land before RHEL8 ...

i'd rather put it this way
should one expect Open MPI to run out of the box on a major and up-to-date distro ?

@jsquyres
Member

jsquyres commented Jun 9, 2016

That's a fair point. @matcabral @yburette Is there a way to work around this in Open MPI if the user hasn't upgraded to OFED >= v3.18-2rc1?

@rhc54
Contributor Author

rhc54 commented Jun 9, 2016

@ggouaillardet Wow - we definitely need some way of solving this more generally. I've wasted days of my time chasing this ghost again, and it keeps biting us. Putting this out there in the wild? We'll have users going nuts chasing false segfaults.

@ggouaillardet
Contributor

I will inform RedHat tomorrow, they might be willing to backport the fix.

@ggouaillardet
Contributor

one more thing ...
when SIGTERM is received, it invokes the PSM signal handler, which is unmapped, so this generates a SIGSEGV which invokes one more time the PSM signal handler that is still unmapped.
bottom line, the generated core file is truncated.
is this (truncated core file) a bug ?

@rhc54
Contributor Author

rhc54 commented Jun 9, 2016

@ggouaillardet I don't think so - I think the core gets truncated because the lib causing the segfault is no longer loaded in memory.

One possible workaround: if I could detect that we have the older ofed, and/or that psm was considered and declined, then I could skip the sigterm and go directly to sigkill, thus avoiding the problem. Anyone have an idea on whether or not that is possible?

@rhc54
Contributor Author

rhc54 commented Jun 9, 2016

FWIW: at least on my system, the --mca pml ob1 approach doesn't resolve the problem. I had to export the IPATH_NO_BACKTRACE envar, and then the problem completely disappears.

@jsquyres
Member

jsquyres commented Jun 9, 2016

@rhc54 It should be possible for the PSM/PSM2 MTLs to detect the older version; it would be great if they could react accordingly (e.g., putenv something that tells their library not to intercept SIGTERM...?).

@ggouaillardet
Contributor

in your environment, libinfinipath.so might be pulled by libfabric.
the only added value of the PSM signal handler is to dump the stack trace into a text file.
there is no cleanup of any PSM related resources.

for the time being, the easiest option would be to
putenv("IPATH_NO_BACKTRACE=1");
in MPI_Init
of course, we can try to be a bit smarter
(only if not --disable-dlopen and no libfabric and with busted libinfinipath.so and if the environment variable is not already defined)

any thoughts ?

@rhc54
Contributor Author

rhc54 commented Jun 9, 2016

I'll bug the folks over here - I think @jsquyres proposal makes the most sense, and hopefully is doable without great pain.

@rhc54
Contributor Author

rhc54 commented Jun 9, 2016

Ah, just remembered - @matcabral is on vacation this week. Will bring this to his attention when he returns.

Thanks guys!

@ggouaillardet
Contributor

keep in mind the PSM signal handlers are set in the library constructor.
that means the test (is the lib busted) must be performed at configure time
(i.e., at run time, it is already too late)
an ugly hack would be to flag the component as not to be closed
(i.e., never invoke dlclose() on it). also, libfabric should do something similar
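
The timing constraint above can be illustrated with a small sketch. It assumes a GCC/Clang-style `__attribute__((constructor))` and uses a hypothetical variable name, `DEMO_NO_BACKTRACE`, in place of `IPATH_NO_BACKTRACE`. Here the constructor runs at program start rather than at `dlopen()` time, so it only models the ordering, not the PSM library itself.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>

/* Hypothetical opt-out variable, standing in for IPATH_NO_BACKTRACE. */
#define DEMO_OPTOUT "DEMO_NO_BACKTRACE"

/* 1 if the constructor found the opt-out variable, 0 if it did not,
 * -1 if the constructor never ran. */
static int demo_ctor_saw_optout = -1;

/* Runs when the containing object is loaded, before any caller can
 * act (GCC/Clang extension). A library would decide here whether to
 * install its signal handlers. */
__attribute__((constructor))
static void demo_lib_ctor(void)
{
    demo_ctor_saw_optout = (NULL != getenv(DEMO_OPTOUT));
}

/* Report what the constructor observed at load time. */
static int demo_optout_seen_at_load(void)
{
    assert(-1 != demo_ctor_saw_optout);
    return demo_ctor_saw_optout;
}

/* Setting the variable now changes the environment, but not the
 * decision the constructor has already taken. */
static int demo_set_optout_late(void)
{
    return setenv(DEMO_OPTOUT, "1", 0);
}
```

Calling `demo_set_optout_late()` after load leaves the value returned by `demo_optout_seen_at_load()` unchanged, which is why the variable has to be in the environment before the library is loaded.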

@rhc54
Contributor Author

rhc54 commented Jun 9, 2016

Crumb - that's right. Doing a putenv in the component won't solve the problem, will it? Sounds like it has to be in the environ prior to opening the component so they don't register that errhandler in the constructor. Which means that the configure logic has to pick up the situation, but the putenv has to go in MPI_Init (or somewhere before any libfabric-based component is opened). Sigh.

@ggouaillardet
Contributor

yep, the component is linked with the PSM library, so the environment variable must be set before the component is dlopen'ed

@ggouaillardet
Contributor

on second thought ...

the signal handler is set in the constructor, i.e. when the component is dlopen'ed

strictly speaking, the component can overwrite the signal handler (but not with a function defined in the component) if the PSM lib is busted.

@rhc54
Contributor Author

rhc54 commented Jun 9, 2016

good point - but that would mean coordinating between the various components that dlopen either libfabric or the psm/psm2 libraries directly since we cannot know which one(s) might be touched (e.g., user could ignore some via mca param). Could become a rather invasive solution, so I imagine just detecting it at configure and pushing a global envar is the least-painful solution

@matcabral
Contributor

matcabral commented Jun 9, 2016

Hey guys, I'm catching up with this thread. @rhc54 I was on vacation the first days of the week (traveling far away south), now I'm just in a different time zone, GMT -3.

So, back to the signals hijacking. For libinfinipath (PSM gen1) the issue was solved in the library itself and the patch for OMPI should check the version when opening and reject the old one. I don't have the versions here, but will open a bug to solve this.
I need to check for PSM2, but I think this is not an issue.

@rhc54
Contributor Author

rhc54 commented Jun 9, 2016

@matcabral Thanks for checking in! The problem is that we cannot let the component even be loaded, and so checking the version and rejecting the old one is too late. What we need to do is detect the old version during configure, and then set that envar during MPI_Init so that the library does the right thing when loaded (even if it subsequently declines and unloads itself).

@jsquyres
Member

jsquyres commented Jun 9, 2016

A few points for this discussion...

  1. Little known fact is that the MCA subsystem does allow you to specify some meta data to components (in a separate text file) that is read and processed before dlopen'ing the .so file. It might be possible to add something there along the lines of "set this env var before dlopen'ing the .so file"...? (DISCLAIMER: I haven't checked the specific OMPI/MCA code for this functionality in a loooooong time)
  2. Yes, the PSM/PSM2 components could probably know that a) they are using an old PSM/PSM2 library and therefore are in the "bad" case, and b) in such a case, simply overwrite (or delete) the "bad" signal handler (if it is detected -- e.g., obtain the signal handler, compare to see if it's the "bad" value, and if so, replace/delete it) during their component open/init.
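
The "obtain, compare, replace" step in point 2 could look roughly like the sketch below. It uses only POSIX `sigaction()`. The function and handler names are hypothetical, and how a component would learn the address of the library's handler is left open here, as it is in the discussion.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

typedef void (*demo_sighandler_t)(int);

/* Hypothetical stand-in for a handler that a library installed. */
static void demo_stale_handler(int signum) { (void)signum; }

/* If `suspect` is still the installed handler for `signum`, restore
 * the default disposition. Returns 1 if it was reset, 0 if some
 * other handler was installed, -1 on error. */
static int demo_reset_if_installed(int signum, demo_sighandler_t suspect)
{
    struct sigaction current;
    struct sigaction dfl;

    assert(NULL != suspect);

    /* Query only: a NULL `act` leaves the disposition unchanged. */
    if (0 != sigaction(signum, NULL, &current)) {
        return -1;
    }
    /* SA_SIGINFO handlers live in sa_sigaction, so they never match. */
    if ((current.sa_flags & SA_SIGINFO) || current.sa_handler != suspect) {
        return 0;
    }

    memset(&dfl, 0, sizeof(dfl));
    dfl.sa_handler = SIG_DFL;
    sigemptyset(&dfl.sa_mask);
    return (0 == sigaction(signum, &dfl, NULL)) ? 1 : -1;
}
```

The comparison only looks at the handler's address and never calls it, so it stays safe even if the code behind that address has already been unmapped.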

@ggouaillardet
Contributor

FYI
i opened https://bugzilla.redhat.com/show_bug.cgi?id=1344529
the issue was fixed in upstream psm at intel/psm@225bbc9

@ggouaillardet
Contributor

fwiw, here is a simple patch that sets IPATH_NO_BACKTRACE=1 if not already set and PSM is busted

diff --git a/config/ompi_check_psm.m4 b/config/ompi_check_psm.m4
index 44a5834..e63923b 100644
--- a/config/ompi_check_psm.m4
+++ b/config/ompi_check_psm.m4
@@ -12,7 +12,7 @@ dnl Copyright (c) 2004-2006 The Regents of the University of California.
 dnl                         All rights reserved.
 dnl Copyright (c) 2006      QLogic Corp. All rights reserved.
 dnl Copyright (c) 2009-2016 Cisco Systems, Inc.  All rights reserved.
-dnl Copyright (c) 2015      Research Organization for Information Science
+dnl Copyright (c) 2015-2016 Research Organization for Information Science
 dnl                         and Technology (RIST). All rights reserved.
 dnl Copyright (c) 2016      Los Alamos National Security, LLC. All rights
 dnl                         reserved.
@@ -44,6 +44,7 @@ AC_DEFUN([OMPI_CHECK_PSM],[
    ompi_check_psm_$1_save_CPPFLAGS="$CPPFLAGS"
    ompi_check_psm_$1_save_LDFLAGS="$LDFLAGS"
    ompi_check_psm_$1_save_LIBS="$LIBS"
+   ompi_check_psm_$1_busted=0

    AS_IF([test "$with_psm" != "no"],
               [AS_IF([test ! -z "$with_psm" && test "$with_psm" != "yes"],
@@ -77,9 +78,24 @@ AC_DEFUN([OMPI_CHECK_PSM],[
                [AC_MSG_WARN([glob.h not found.  Can not build component.])
                ompi_check_psm_happy="no"])])

+       AS_IF([test "$ompi_check_psm_happy" = "yes"],
+              [AC_COMPILE_IFELSE([AC_LANG_SOURCE([
+#include <psm.h>
+
+#if PSM_VERNO < 0x0110
+#error busted PSM library
+#endif
+])],
+                                 [],
+                                 [ompi_check_psm_$1_busted=1])])
+
    OPAL_SUMMARY_ADD([[Transports]],[[Intel TrueScale (PSM)]],[$1],[$ompi_check_psm_happy])
     fi

+    AC_DEFINE_UNQUOTED([OMPI_PSM_BUSTED],
+                       [$ompi_check_psm_$1_busted],
+                       [Whether libinfinipath.so fails to unset its signal handler in the destructor])
+
     AS_IF([test "$ompi_check_psm_happy" = "yes"],
           [$1_LDFLAGS="[$]$1_LDFLAGS $ompi_check_psm_LDFLAGS"
       $1_CPPFLAGS="[$]$1_CPPFLAGS $ompi_check_psm_CPPFLAGS"
diff --git a/ompi/runtime/ompi_mpi_init.c b/ompi/runtime/ompi_mpi_init.c
index 5616992..6b3b3ca 100644
--- a/ompi/runtime/ompi_mpi_init.c
+++ b/ompi/runtime/ompi_mpi_init.c
@@ -489,6 +489,12 @@ int ompi_mpi_init(int argc, char **argv, int requested, int *provided)
         putenv(av);
     }

+#if OMPI_PSM_BUSTED
+    if (NULL == getenv("IPATH_NO_BACKTRACE")) {
+        putenv("IPATH_NO_BACKTRACE=1");
+    }
+#endif
+
     /* open the rte framework */
     if (OMPI_SUCCESS != (ret = mca_base_framework_open(&ompi_rte_base_framework, 0))) {
         error = "ompi_rte_base_open() failed";

@matcabral
Contributor

@rhc54 how many other cases in the past have required analyzing signal handlers inside OMPI? it seems to me that this is an isolated case that is trying to work around a bug in an old version of a specific library. Unless there are other uses, it could just be enough to add a few lines with clear comments on why they are there.

@jsquyres
Member

@rhc54 Oy, I forgot about libfabric. Good point. Hmm. Perhaps we need a little infrastructure to register signal handlers that should definitely be removed after dlclose? E.g., MTL PSM (and OFI MTL?) can register the PSM signal handlers. When the MCA finally dlcloses those MTLs, it can ensure that those signal handlers are not set (i.e., if they are set, reset them to SIG_DFL).

@jsquyres
Member

@matcabral I'm unaware of any other instances. But the "infrastructure" I'm referring to could be quite small / easy. Perhaps something like:

// I'm typing this off the top of my head -- types are made up
void mca_register_handlers_to_clear(mca_component_t *component, sig_handler_fn_t signal_handler);

These 2 pieces of info should be good enough for the MCA subsystem to check and see if those signal handlers are still present upon dlclose.
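
A slightly fuller sketch of that registration idea, under stated assumptions: every `demo_*` name is hypothetical, the component is treated as an opaque pointer, and a signal number is recorded alongside the handler (the prototype above does not carry one). It is not existing MCA code.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

#define DEMO_MAX_ENTRIES 16

typedef void (*demo_sig_fn_t)(int);

typedef struct {
    const void   *component;  /* opaque owner, e.g. a component pointer */
    int           signum;
    demo_sig_fn_t handler;
} demo_entry_t;

static demo_entry_t demo_entries[DEMO_MAX_ENTRIES];
static int demo_num_entries = 0;

/* Hypothetical handler used to exercise the registry. */
static void demo_handler(int signum) { (void)signum; }

/* Record that `component` may leave `handler` installed for `signum`. */
static int demo_register_handler_to_clear(const void *component,
                                          int signum, demo_sig_fn_t handler)
{
    assert(NULL != component && NULL != handler);
    if (demo_num_entries >= DEMO_MAX_ENTRIES) {
        return -1;
    }
    demo_entries[demo_num_entries].component = component;
    demo_entries[demo_num_entries].signum = signum;
    demo_entries[demo_num_entries].handler = handler;
    ++demo_num_entries;
    return 0;
}

/* Meant to run around dlclose() of `component`: every recorded handler
 * that is still installed is reset to SIG_DFL. Returns the number reset. */
static int demo_clear_handlers(const void *component)
{
    int cleared = 0;
    for (int i = 0; i < demo_num_entries; ++i) {
        struct sigaction cur;
        if (demo_entries[i].component != component) continue;
        if (0 != sigaction(demo_entries[i].signum, NULL, &cur)) continue;
        if ((cur.sa_flags & SA_SIGINFO) ||
            cur.sa_handler != demo_entries[i].handler) continue;
        memset(&cur, 0, sizeof(cur));
        cur.sa_handler = SIG_DFL;
        sigemptyset(&cur.sa_mask);
        if (0 == sigaction(demo_entries[i].signum, &cur, NULL)) ++cleared;
    }
    return cleared;
}
```

Handlers that the application replaced in the meantime are left alone, since only an exact address match is reset.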

...actually, I think I've lost track in the thread here: are we talking about ensuring that those signal handlers are gone upon dlclose, or ensuring that they're gone during MTL PSM and OFI component open?

@rhc54
Contributor Author

rhc54 commented Jun 13, 2016

I do think we need to avoid overreacting and creating too much work here. However, my point was just that adding protection specifically in the PSM/MTL component isn't sufficient. At least on my machine, the problem is coming in thru libfabric, so we need a solution that covers all impacted components.

@rhc54
Contributor Author

rhc54 commented Jun 13, 2016

Just to help @jsquyres back on track - we are dealing with the problem where PSM hijacks the signal handlers upon dlopen, and doesn't deregister them when closed, thus leaving them pointing to invalid memory. The only time this surfaces is when someone hits those procs with a SIGTERM (or one of the other hijacked signals). It would never be seen during "normal" execution.

@jsquyres
Member

@rhc54 Right, but where did we land: did we want to always / unconditionally un-register the PSM handlers (even during component init)? Or are we solely concerned with making sure they're deregistered when the library is unloaded?

@rhc54
Contributor Author

rhc54 commented Jun 13, 2016

@jsquyres Hmmm...that's a good point. Technically, we avoid the segfault with the latter. However, leaving those handlers registered when the PSM library remains loaded means that the user will get unexpected behavior - i.e., instead of their handler being called, the PSM handler will execute. I gather the PSM handler creates a file and stuffs a backtrace into it, which means we litter the filesystem with files that the user (a) doesn't know about and (b) we cannot clean up.

So I'd vote that we always deregister them, but I'm not hard on that opinion

@matcabral
Contributor

IMHO, the problem with major severity is the one to address: leaving the handlers pointing to dummy addresses after dlclose in the "old" psm lib, which was addressed in the newer lib version.
The fact that PSM lib registers custom signal handlers is still present in newer version, and could be called "intended behavior", and can be disabled with the environment variable. I would accept that this behavior should probably be better documented. So, I would title this one as a documentation issue.

@yburette
Member

(Only reading this today.)

As far as the OFI MTL is concerned, I think that we can catch this at the libfabric level. The PSM provider would make sure to de-register the handlers upon dlclose. Am I missing something or would this be enough?

@jsquyres
Member

@matcabral After thinking about your response a bit, I have to disagree. I just realized that your PSM library can even affect my usNIC BTL (because it uses libfabric). If this only affected your setups, I would agree that Intel as the owning vendor/organization can do whatever you want. But I definitely do not want usNIC customers to have to adhere to the PSM library putting stack traces in Intel-specific places on the filesystem. Specifically: I want usNIC customers to see the Open MPI default behavior. Please make that possible without requiring my customers to have to set Intel-specific environment variables.

@jsquyres
Member

@yburette I'm sorry, but that is not enough. It still forces PSM-specific behavior when the usNIC BTL is used (because libfabric has not yet been dlclosed).

@ggouaillardet
Contributor

@jsquyres
i might be reading a bit too much between the lines here ...
if you said that setting signal handlers in the library constructor was a bad idea in the first place, since it might be tricky to restore them correctly when the library is unloaded, then i could agree with that.

what if we (OpenMPI) take the problem the other way around
PSM signal handlers have no added value for us (they simply dump the stack into a file, and OpenMPI already does that, but to stdout/stderr ...) so we could/should simply

if (NULL == getenv("IPATH_NO_BACKTRACE")) {
   putenv("IPATH_NO_BACKTRACE=1");
}

in ompi_mpi_init and JNI_OnLoad

that would do the trick for both mtl/psm and libfabric

@rhc54
Contributor Author

rhc54 commented Jun 14, 2016

that's the solution we were considering (Jeff and I discussed on the phone), but I'm talking to folks internally here to ensure we have both a short and long-term answer

@jsquyres
Member

@ggouaillardet Yeah, per @rhc54's comment, I think it's going to come down to this. It's a horrid abstraction break (i.e., putting vendor/transport-specific code in the core), but we might be out of options here. 😧

(I forgot about JNI_OnLoad -- good tip; thanks)

I'll file a PR in the immediate future for setting both IPATH_NO_BACKTRACE=1 and HFI_NO_BACKTRACE=1 if they are not already set. I.e., users can override this behavior by explicitly setting these env vars to 0.
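
A minimal sketch of the "set unless already set" behavior described above. It uses POSIX `setenv()` with `overwrite == 0` in place of the `getenv()`/`putenv()` pair shown earlier in the thread. The `demo_*` function names are hypothetical, and this is not the code from the eventual PR.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>

/* Set `name` to "1" unless the user already set it.
 * Returns 0 on success, -1 on failure. */
static int demo_default_env(const char *name)
{
    assert(NULL != name);
    /* overwrite == 0: an existing value, including "0", is left alone */
    return setenv(name, "1", 0);
}

/* Would need to run before any component that links against
 * PSM/PSM2 (directly or through libfabric) is opened. */
static int demo_disable_psm_backtrace(void)
{
    if (0 != demo_default_env("IPATH_NO_BACKTRACE")) {
        return -1;
    }
    if (0 != demo_default_env("HFI_NO_BACKTRACE")) {
        return -1;
    }
    return 0;
}
```

A user who exports either variable before launching keeps that value, because `setenv()` with a zero `overwrite` argument does not touch names that already exist.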

jsquyres added a commit to jsquyres/ompi that referenced this pull request Jun 14, 2016
Per discussion on open-mpi#1767 (and some
subsequent phone calls and off-issue email discussions), the PSM and
PSM2 libraries are hijacking signal handlers by default.
Specifically: unless the environment variables `IPATH_NO_BACKTRACE=1`
(for PSM / Intel TrueScale) and `HFI_NO_BACKTRACE=1` (for PSM2 / Intel
OmniPath) are set, the library constructors for these two libraries
will hijack various signal handlers for the purpose of invoking their
own error reporting mechanisms.

This may be a bit *surprising*, but is not a *problem*, per se.  The
real problem is that older versions of at least the PSM library do not
unregister these signal handlers upon being unloaded from memory.
Hence, a segv can actually result in a double segv (i.e., the original
segv and then another segv when the now-non-existent signal handler is
invoked).

This is further compounded by the fact that the PSM / PSM2 libraries
can be loaded by the OFI MTL and the usNIC BTL (because they are
loaded by libfabric), even when there is no Intel networking hardware
present.  Having the PSM libraries behave this way when no Intel
hardware is present is clearly undesirable (and is likely to be fixed
in future releases of the PSM/PSM2 libraries).

Finally, this signal hijacking subverts Open MPI's own signal
reporting mechanism, which may be a bit surprising for some users
(particularly those who do not have Intel TrueScale/OmniPath
hardware).  As such, we disable it by default so that Open MPI's own
error-reporting mechanisms are used.

This commit will set the following two environment variables to
disable the signal hijacking from the PSM/PSM2 libraries (if they are
not already set):

* IPATH_NO_BACKTRACE=1
* HFI_NO_BACKTRACE=1

If the user has set these variables before invoking Open MPI, we will
not override their values (i.e., their preferences will be honored).

Signed-off-by: Jeff Squyres <[email protected]>
@jsquyres
Member

@matcabral @rhc54 @yburette @ggouaillardet I just filed a master PR for this: #1781

Please review ASAP; I'd like to merge this to master and PR over to v1.10.3 and v2.0.0 so that we can make v2.0.0rc3 today. Thanks!

jsquyres added a commit to jsquyres/ompi that referenced this pull request Jun 14, 2016
Per discussion on open-mpi#1767 (and some
subsequent phone calls and off-issue email discussions), the PSM
library is hijacking signal handlers by default.  Specifically: unless
the environment variable `IPATH_NO_BACKTRACE=1` (for PSM / Intel
TrueScale) is set, the library constructor for this library will
hijack various signal handlers for the purpose of invoking its own
error reporting mechanisms.

This may be a bit *surprising*, but is not a *problem*, per se.  The
real problem is that older versions of at least the PSM library do not
unregister these signal handlers upon being unloaded from memory.
Hence, a segv can actually result in a double segv (i.e., the original
segv and then another segv when the now-non-existent signal handler is
invoked).

This PSM signal hijacking subverts Open MPI's own signal reporting
mechanism, which may be a bit surprising for some users (particularly
those who do not have Intel TrueScale).  As such, we disable it by
default so that Open MPI's own error-reporting mechanisms are used.

Additionally, there is a typo in the library destructor for the PSM2
library that may cause problems in the unloading of its signal
handlers.  This problem can be avoided by setting `HFI_NO_BACKTRACE=1`
(for PSM2 / Intel OmniPath).

This is further compounded by the fact that the PSM / PSM2 libraries
can be loaded by the OFI MTL and the usNIC BTL (because they are
loaded by libfabric), even when there is no Intel networking hardware
present.  Having the PSM/PSM2 libraries behave this way when no Intel
hardware is present is clearly undesirable (and is likely to be fixed
in future releases of the PSM/PSM2 libraries).

This commit sets the following two environment variables to disable
this behavior from the PSM/PSM2 libraries (if they are not already
set):

* IPATH_NO_BACKTRACE=1
* HFI_NO_BACKTRACE=1

If the user has set these variables before invoking Open MPI, we will
not override their values (i.e., their preferences will be honored).

Signed-off-by: Jeff Squyres <[email protected]>
jsquyres added a commit to jsquyres/ompi-release that referenced this pull request Jun 14, 2016
(cherry picked from commit open-mpi/ompi@5071602)
…ror notifications, including debugger release. This capability requires use of PMIx 2.0 or above as the features are not available with earlier PMIx releases. When OMPI master is built against an earlier external version, it will fallback to the prior behavior - i.e., debugger will be released via RML and all notifications will go strictly to the default error handler.

Add PMIx 2.0

Remove PMIx 1.1.4

Cleanup copying of component

Add missing file

Touchup a typo in the Makefile.am

Update the pmix ext114 component

Minor cleanups and resync to master

Update to latest PMIx 2.x

Update to the PMIx event notification branch latest changes
@rhc54
Contributor Author

rhc54 commented Jun 16, 2016

bot:retest

@rhc54
Contributor Author

rhc54 commented Jun 16, 2016

Just a heads-up: I am rerunning the tests on this in preparation for commit. So anybody who has concerns - please speak up now.

@rhc54 rhc54 merged commit 702a982 into open-mpi:master Jun 16, 2016
@rhc54 rhc54 deleted the topic/pmix2 branch June 16, 2016 22:27
bosilca pushed a commit to bosilca/ompi that referenced this pull request Oct 3, 2016