v3.1.x: Add -g compilation flag for all files that are present in the stack when attaching with MPIR #6357

James-A-Clark · 2019-02-05T09:59:22Z

This PR adds the -g compilation flag for all files that are present in the stack from the MPI_Init() call. This is so that when a debugger attaches using MPIR, it can step out of this stack back into main. This cannot be done with certain aggressive optimisations and missing debug information.

This issue appeared when OpenMPI 3.1.2 was built with GCC 7.3.0 on Power8 with the following configuration line:
./configure --enable-mpirun-prefix-by-default

The stack that we get when attaching to the user process is this:

#0  pthread_cond_wait@@GLIBC_2.17 () from /lib64/libpthread.so.0
#1  pmix2x_client_init (...) at pmix2x_client.c:151
#2  rte_init () at ess_pmi_module.c:130
#3  orte_init (...) at runtime/orte_init.c:271
#4  ompi_mpi_init (...) at runtime/ompi_mpi_init.c:513
#5  PMPI_Init (...) at pinit.c:66
#6  main (...) at hello.c:22

And without this patch, GDB can't unwind because of missing unwind information and an optimisation that puts a branch in the function preamble of the orte_init() function. Looking like this:

#0  pthread_cond_wait@@GLIBC_2.17 () from /lib64/libpthread.so.0
#1  pmix2x_client_init () from /software/mpi/openmpi-3.1.2_gnu-7.3.0/lib/openmpi/mca_pmix_pmix2x.so
#2  rte_init () from /software/mpi/openmpi-3.1.2_gnu-7.3.0/lib/openmpi/mca_ess_pmi.so
#3  orte_init () from /software/mpi/openmpi-3.1.2_gnu-7.3.0/lib/libopen-rte.so.40
Backtrace stopped: frame did not save the PC

The user impact is that after attaching using MPIR, they can't get back to their call to MPI_Init(). This can be fixed minimally by compiling with -fasynchronous-unwind-tables, but the existing pattern of using -g for MPIR files (which implies unwind tables) was copied from the file orte/orted/Makefile.am and applied to other files in this stack.

This issue is occurring on 3.1.x and up, but I'm not sure about the branching model for PRs. Would I also have to submit a PR for future version branches? Or do changes to 3.1.x get automatically pulled in?

Thanks

ompiteam-bot · 2019-02-05T09:59:24Z

Can one of the admins verify this patch?

rhc54 · 2019-02-05T13:23:50Z

What is the performance impact of forcing that flag on all (even non-debug) builds?

jsquyres · 2019-02-05T15:10:22Z

I'm still a little uncertain for the need for this patch. Don't debuggers offer other / different ways to get out of MPI_INIT, even if the symbols aren't known?

E.g., set a breakpoint out in main and "continue" until execution gets there.

I ask for a few reasons:

Having users step through code in the deep, deep magic of MPI_INIT is... a little frightening.
The exact stack of code that is executed in MPI_INIT is highly system-dependent. Different components can/will be selected on different machines and different environments. Adding -g to this particular stack where you ran into the issue may change a) over time, and b) on different machines.

jsquyres · 2019-02-05T15:10:28Z

ok to test

James-A-Clark · 2019-02-05T17:14:35Z

@rhc54 the -g flag on its own shouldn't impact runtime. The binary size will increase slightly for these compilation units, but those affected sections of the ELF will only be loaded by debuggers, not during normal execution.

Optimisation flags also seem to be appended to the compile line after the debugger flags. So even if $(DEBUGGER_CFLAGS) ever included an -O0 it would be overwritten by a following -O option. The full compile line for orted_submit.c looks like this, with an -O3 following the -g:

/opt/gnu/7.3.0/bin/gcc -DHAVE_CONFIG_H -I. -I../opal/include -I../ompi/include -I../oshmem/include -I../opal/mca/hwloc/hwloc1117/hwloc/include/private/autogen -I../opal/mca/hwloc/hwloc1117/hwloc/include/hwloc/autogen -I../ompi/mpiext/cuda/c -I.. -I../orte/include -I/home/jclark/ompi/opal/mca/event/libevent2022/libevent -I/home/jclark/ompi/opal/mca/event/libevent2022/libevent/include -I/home/jclark/ompi/opal/mca/hwloc/hwloc1117/hwloc/include -DNDEBUG -frecord-gcc-switches -Wall -Wundef -Wno-long-long -Wsign-compare -Wmissing-prototypes -Wstrict-prototypes -Wcomment -pedantic -Werror-implicit-function-declaration -fno-strict-aliasing -pthread -g -O3 -DNDEBUG -Wall -Wundef -Wno-long-long -Wsign-compare -Wmissing-prototypes -Wstrict-prototypes -Wcomment -pedantic -Werror-implicit-function-declaration -finline-functions -fno-strict-aliasing -pthread -MT orted/liborted_mpir_la-orted_submit.lo -MD -MP -MF orted/.deps/liborted_mpir_la-orted_submit.Tpo -c orted/orted_submit.c  -fPIC -DPIC -o orted/.libs/liborted_mpir_la-orted_submit.o

@jsquyres This isn't about symbols exactly, to fix this bug we only really need async unwind information which is also used for exceptions. I chose to use the -g from $(DEBUGGER_CFLAGS) instead of -fasynchronous-unwind-tables because I wasn't sure of an easy way of making the latter portable and it was already done for other MPIR files in the build system.

The reason that we can't set a breakpoint in main and run to it is because, as you can see from the stack, GDB doesn't know where in main we started from. The user would normally have to parse which line in main this stack starts from, and then set a breakpoint on the line after. But with this bug, GDB is unable to show this info and stops unwinding at orte_init().

Having users step through code in the deep, deep magic of MPI_INIT is... a little frightening.

The exact commands that you would issue to GDB when using MPIR to attach would be:

frame 0
finish

Which would tell GDB to step-out up the stack until frame 0 was reached and bring you back to user code. The user wouldn't have to do any stepping in MPI internals per say.

The exact stack of code that is executed in MPI_INIT is highly system-dependent. Different components can/will be selected on different machines and different environments. Adding -g to this particular stack where you ran into the issue may change a) over time, and b) on different machines.
That's a good point. Initially I only added the flag to the one frame that GDB was having a problem unwinding out of. But we thought that we should be more thorough and add it to everything we saw in the stack in case a compiler change in the future optimises the link register saving out of other functions.

We're only worried about the stack that leads up to the MPIR holding point, and not everything under MPI_INIT. Do you think that stack would still be platform/environment dependent?

James-A-Clark · 2019-02-06T16:33:34Z

Tagging @shamisp

James-A-Clark · 2019-02-11T10:32:02Z

@rhc54 Is there a way that we can test this patch to make sure that it doesn't impact performance?

The -g flag was only added to a handful of files with a few functions in them each, which mostly seem related to startup and not runtime features. Hopefully this minimises the impact.

rhc54 · 2019-02-11T17:27:47Z

I don't have a way to do it myself, but I also agree with you that I can't see how this would impact application performance. I only raised the question in case someone out there in the MPI team sees something we don't 😄

jsquyres · 2019-02-13T16:30:41Z

What about a different approach: test in configure to see if -fasynchronous-unwind-tables works.

I assume we wouldn't want to add that flag everywhere, but we could sprinkle it around in the Right Places (potentially in, or as a supplement to $(DEBUGGER_CFLAGS)...? I'd have to cache back in exactly what/how DEBUGGER_CFLAGS is used...).

Specifically: configure can check if that flag works. If it is, add it to something that gets used in relevant Makefile.ams. If not, the value that gets used in relevant Makefile.ams is empty (or just doesn't contain that flag).

James-A-Clark · 2019-02-15T15:30:06Z

I've updated the PR to add a configure check for the compiler version and to add only the async unwind flag. I think there are still a few open questions:

Where do we want to put the flags? I've assumed everything in the orte/mca/ess folder and added it to the toplevel makefile. But some of the subfolders overwrite CFLAGS, so I had to add it again manually to pmi/Makefile. Do we need to do this to any of the other subfolders?
For the IBM compiler, although we're not seeing the issue currently, there also doesn't seem to be an equivalent flag. So I've added -g. The alternative would be no flag.

rhc54 · 2019-03-06T16:54:50Z

@jsquyres Can we move this forward?

jsquyres

So sorry for the gigantic delay on replying to this. ☹️

config/orte_setup_debugger_flags.m4

orte/mca/ess/Makefile.am

orte/mca/ess/pmi/Makefile.am

James-A-Clark · 2019-03-13T14:42:18Z

Thanks for the feedback. I've made the changes to detect which flags work, rather than doing it in compiler name. And also done the other CPPFLAGS -> CFLAGS fixes.

I didn't manage to do the flag detection with nested calls to _ORTE_SETUP_DEBUGGER_FLAGS_TRY_CFLAGS, that doesn't seem to be an option. But I did manage to do it with an extra "foundFlags" variable.

config/orte_setup_debugger_flags.m4

jsquyres

When all is said and done:

This PR needs to be squashed down to a single commit
The single commit needs to be applied to master before it can be applied to any release branches (i.e., another PR)
- Our normal way of doing things is to first commit fixes to master and then cherry-pick them to the relevant release branches
The commit on this PR needs to be marked as a cherry-pick from master
- I.e., with the git commit -x ... line in the commit message, (cherry-picked from xxx...). This helps us track to make sure that relevant commits got over from master to release branches.
Equivalent PRs for the v3.0.x and v4.0.x branches should also be opened.

rhc54 · 2019-03-27T16:13:34Z

ok to test

jsquyres · 2019-03-28T22:17:01Z

Ok, #6527 merged (the master version of this PR). Could you refresh this PR as a cherry-pick from that PR? And then file corresponding PRs for v3.0.x and v4.0.x?

Thank you!

James-A-Clark · 2019-04-01T13:49:10Z

Thanks Jeff, I made the new PRs but ended up making a new one for 3.1.x because it seemed cleaner.

PRs:
#6554
#6553
#6551

James-A-Clark force-pushed the v3.1.x branch 2 times, most recently from 20b9acf to e55bd3e Compare February 15, 2019 15:24

jsquyres added the Target: v3.1.x label Feb 25, 2019

jsquyres added this to the v3.1.4 milestone Feb 25, 2019

jsquyres added the ⚠️ WIP-DNM! label Feb 25, 2019

jsquyres requested changes Mar 7, 2019

View reviewed changes

config/orte_setup_debugger_flags.m4 Outdated Show resolved Hide resolved

orte/mca/ess/Makefile.am Outdated Show resolved Hide resolved

orte/mca/ess/pmi/Makefile.am Outdated Show resolved Hide resolved

James-A-Clark force-pushed the v3.1.x branch from c005af6 to d754387 Compare March 13, 2019 14:42

jsquyres reviewed Mar 13, 2019

View reviewed changes

config/orte_setup_debugger_flags.m4 Outdated Show resolved Hide resolved

config/orte_setup_debugger_flags.m4 Outdated Show resolved Hide resolved

config/orte_setup_debugger_flags.m4 Outdated Show resolved Hide resolved

jsquyres force-pushed the v3.1.x branch from 6837e3c to bb856e7 Compare March 14, 2019 12:06

jsquyres requested changes Mar 14, 2019

View reviewed changes

jsquyres changed the title ~~Add -g compilation flag for all files that are present in the stack when attaching with MPIR~~ v3.1.x: Add -g compilation flag for all files that are present in the stack when attaching with MPIR Mar 14, 2019

James-A-Clark mentioned this pull request Mar 27, 2019

Add compilation flag to allow unwinding through files that are present in the stack when attaching with MPIR #6527

Merged

James-A-Clark closed this Apr 1, 2019

James-A-Clark force-pushed the v3.1.x branch from bb856e7 to 16dbc83 Compare April 1, 2019 08:49

James-A-Clark deleted the v3.1.x branch April 1, 2019 08:51

Uh oh!

v3.1.x: Add -g compilation flag for all files that are present in the stack when attaching with MPIR #6357

v3.1.x: Add -g compilation flag for all files that are present in the stack when attaching with MPIR #6357

Uh oh!

Conversation

James-A-Clark commented Feb 5, 2019

Uh oh!

ompiteam-bot commented Feb 5, 2019

Uh oh!

rhc54 commented Feb 5, 2019

Uh oh!

jsquyres commented Feb 5, 2019

Uh oh!

jsquyres commented Feb 5, 2019

Uh oh!

James-A-Clark commented Feb 5, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

James-A-Clark commented Feb 6, 2019

Uh oh!

James-A-Clark commented Feb 11, 2019

Uh oh!

rhc54 commented Feb 11, 2019

Uh oh!

jsquyres commented Feb 13, 2019

Uh oh!

James-A-Clark commented Feb 15, 2019

Uh oh!

rhc54 commented Mar 6, 2019

Uh oh!

jsquyres left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

James-A-Clark commented Mar 13, 2019

Uh oh!

Uh oh!

Uh oh!

Uh oh!

jsquyres left a comment

Choose a reason for hiding this comment

Uh oh!

rhc54 commented Mar 27, 2019

Uh oh!

jsquyres commented Mar 28, 2019

Uh oh!

James-A-Clark commented Apr 1, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

James-A-Clark commented Feb 5, 2019 •

edited

Loading