GDB overlay support for RISC-V #9

Open
dmi391 wants to merge 1 commit into sifive:master from dmi391:gdb-riscv-overlay
dmi391 commented Sep 14, 2022

1. Fixed overlay support for RISC-V

With this fix the GDB client supports overlay debugging in auto mode.

For GDB to support overlay debugging, the pointer `gdbarch->overlay_update` must be initialized with the function `simple_overlay_update(struct obj_section *osect)`. In `/gdb/riscv-tdep.c`, `set_gdbarch_overlay_update(gdbarch, simple_overlay_update)` should be called at the end of `riscv_gdbarch_init(...)`, in the same way as in `/gdb/m32r-tdep.c`.

Without this fix, the GDB client cannot update the overlay table `_ovly_table` from target RAM, and overlay debugging does not work:

    (gdb) overlay list
    No sections are mapped.
    (gdb) overlay load
    This target does not know how to read its overlay state.

With this fix, the GDB client supports overlay debugging in auto mode (it updates the overlay table `_ovly_table` from target RAM):

    (gdb) set verbose on
    (gdb) overlay auto
    Automatic overlay debugging enabled.
    ...
    (gdb) overlay list
    Section .ovly1, loaded at 0x10080444 - 0x100805d0, mapped at 0x10000060 - 0x100001ec
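
The effect of the missing hook can be illustrated with a small stand-alone model. This is only a sketch and not GDB source: `toy_gdbarch`, `toy_simple_overlay_update`, `toy_overlay_load`, and `toy_riscv_gdbarch_init` are hypothetical stand-ins for `gdbarch`, `simple_overlay_update`, the `overlay load` command, and `riscv_gdbarch_init`. Only the error message text is taken from the session above.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-in for GDB's per-architecture object.  Only the one
// hook discussed above is modelled.
struct toy_gdbarch
{
  // Null until the architecture's init function installs a handler.
  void (*overlay_update) (int *mapped_count) = nullptr;
};

// Hypothetical stand-in for simple_overlay_update: pretend that one
// overlay section was found mapped in the target's _ovly_table.
static void
toy_simple_overlay_update (int *mapped_count)
{
  *mapped_count = 1;
}

// Models "overlay load": fails if the architecture never registered a hook.
static int
toy_overlay_load (const toy_gdbarch &arch)
{
  if (arch.overlay_update == nullptr)
    throw std::runtime_error
      ("This target does not know how to read its overlay state.");

  int mapped_count = 0;
  arch.overlay_update (&mapped_count);
  return mapped_count;
}

// Models the one-line fix at the end of the architecture init function.
static void
toy_riscv_gdbarch_init (toy_gdbarch &arch)
{
  arch.overlay_update = toy_simple_overlay_update;
}
```

In this model, calling `toy_overlay_load` on a freshly constructed `toy_gdbarch` throws, and the same call succeeds once `toy_riscv_gdbarch_init` has installed the hook.
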

2. Fixed output of the overlay GDB commands

The GDB commands `overlay auto`, `overlay manual`, and `overlay off` print their messages without a trailing newline. In `/gdb/symfile.c`, the `printf_filtered(_("..."))` calls in `overlay_auto_command(...)`, `overlay_manual_command(...)`, and `overlay_off_command(...)` need a '\n' at the end of the string. Otherwise the message from these commands runs into the output of the next GDB command. These messages are displayed only with `set verbose on`.

Without this fix (incorrect, the messages run together):

    (gdb) set verbose on
    (gdb) overlay auto
    <nothing>
    (gdb) overlay list
    Automatic overlay debugging enabled.No sections are mapped.

With this fix ('\n' added, correct):

    (gdb) set verbose on
    (gdb) overlay auto
    Automatic overlay debugging enabled.
    (gdb) overlay list
    No sections are mapped.
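
The run-together output can be reproduced with a small stand-alone model. This is a sketch, not the code in `symfile.c`: `toy_overlay_auto`, `toy_overlay_list`, and `toy_session` are hypothetical, and a `std::ostringstream` stands in for GDB's output stream. It models only the concatenation of the two messages, not the delay before the first message appears.

```cpp
#include <sstream>
#include <string>

// Hypothetical stand-in for a GDB command that prints a status message
// into a shared output stream.  `terminate_line` models the '\n' fix.
static void
toy_overlay_auto (std::ostream &out, bool terminate_line)
{
  out << "Automatic overlay debugging enabled.";
  if (terminate_line)
    out << '\n';
}

// Hypothetical stand-in for "overlay list" with no mapped sections.
static void
toy_overlay_list (std::ostream &out)
{
  out << "No sections are mapped.\n";
}

// Runs both commands back to back and returns everything that was printed.
static std::string
toy_session (bool terminate_line)
{
  std::ostringstream out;
  toy_overlay_auto (out, terminate_line);
  toy_overlay_list (out);
  return out.str ();
}
```

`toy_session (false)` yields the two messages on one line, as in the incorrect session above, while `toy_session (true)` puts each message on its own line.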

My overlay demo project for RISC-V: dmi391/overlay_demo
My custom build of the GDB client with these two fixes (it works correctly): release gdb-riscv-ovly

kito-cheng pushed a commit that referenced this pull request Sep 25, 2023
While working on a later patch, which changes gdb.base/foll-vfork.exp,
I noticed that sometimes I would hit this assert:

  x86_linux_update_debug_registers: Assertion `lwp_is_stopped (lwp)' failed.

I eventually tracked it down to a combination of schedule-multiple
mode being on, target-non-stop being off, follow-fork-mode being set
to child, and some bad timing.  The failing case is pretty simple, a
single threaded application performs a vfork, the child process then
execs some other application while the parent process (once the vfork
child has completed its exec) just exits.  As best I understand
things, here's what happens when things go wrong:

  1. The parent process performs a vfork, GDB sees the VFORKED event
  and creates an inferior and thread for the vfork child,

  2. GDB resumes the vfork child process.  As schedule-multiple is on
  and target-non-stop is off, this is translated into a request to
  start all processes (see user_visible_resume_ptid),

  3. In the linux-nat layer we spot that one of the threads we are
  about to start is a vfork parent, and so don't start that
  thread (see resume_lwp), the vfork child thread is resumed,

  4. GDB waits for the next event, eventually entering
  linux_nat_target::wait, which in turn calls linux_nat_wait_1,

  5. In linux_nat_wait_1 we eventually call
  resume_stopped_resumed_lwps, this should restart threads that have
  stopped but don't actually have anything interesting to report.

  6. Unfortunately, resume_stopped_resumed_lwps doesn't check for
  vfork parents like resume_lwp does, so at this point the vfork
  parent is resumed.  This feels like the start of the bug, and this
  is where I'm proposing to fix things, but, resuming the vfork parent
  isn't the worst thing in the world because....

  7. As the vfork child is still alive the kernel holds the vfork
  parent stopped,

  8. Eventually the child performs its exec and GDB is sent an EXECD
  event.  However, because the parent is resumed, as soon as the child
  performs its exec the vfork parent also sends a VFORK_DONE event to
  GDB,

  9. Depending on timing both of these events might seem to arrive in
  GDB at the same time.  Normally GDB expects to see the EXECD or
  EXITED/SIGNALED event from the vfork child before getting the
  VFORK_DONE in the parent.  We know this because it is as a result of
  the EXECD/EXITED/SIGNALED that GDB detaches from the parent (see
  handle_vfork_child_exec_or_exit for details).  Further the comment
  in target/waitstatus.h on TARGET_WAITKIND_VFORK_DONE indicates that
  when we remain attached to the child (not the parent) we should not
  expect to see a VFORK_DONE,

  10. If both events arrive at the same time then GDB will randomly
  choose one event to handle first, in some cases this will be the
  VFORK_DONE.  As described above, upon seeing a VFORK_DONE GDB
  expects that (a) the vfork child has finished, however, in this case
  this is not completely true, the child has finished, but GDB has not
  processed the event associated with the completion yet, and (b) upon
  seeing a VFORK_DONE GDB assumes we are remaining attached to the
  parent, and so resumes the parent process,

  11. GDB now handles the EXECD event.  In our case we are detaching
  from the parent, so GDB calls target_detach (see
  handle_vfork_child_exec_or_exit),

  12. While this has been going on the vfork parent is executing, and
  might even exit,

  13. In linux_nat_target::detach the first thing we do is stop all
  threads in the process we're detaching from, the result of the stop
  request will be cached on the lwp_info object,

  14. In our case the vfork parent has exited though, so when GDB
  waits for the thread, instead of a stop due to signal, we instead
  get a thread exited status,

  15. Later in the detach process we try to resume the threads just
  prior to making the ptrace call to actually detach (see
  detach_one_lwp), as part of the process to resume a thread we try to
  touch some registers within the thread, and before doing this GDB
  asserts that the thread is stopped,

  16. An exited thread is not classified as stopped, and so the assert
  triggers!

So there are two bugs I see here.  The first, and most critical one here
is in step #6.  I think that resume_stopped_resumed_lwps should not
resume a vfork parent, just like resume_lwp doesn't resume a vfork
parent.
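
The proposed check can be sketched with a small stand-alone model.  This
is not the linux-nat code: `toy_lwp` and its fields are hypothetical
simplifications of GDB's per-thread state, and
`toy_resume_stopped_resumed_lwps` only mirrors the name of the function
discussed above.

```cpp
#include <vector>

// Hypothetical, heavily simplified stand-in for GDB's per-LWP state.
struct toy_lwp
{
  int pid;
  bool stopped;          // Currently stopped from GDB's point of view.
  bool resume_requested; // GDB's last request for this thread was "resume".
  bool is_vfork_parent;  // Waiting for its vfork child to exec or exit.
};

// Models the proposed behaviour: restart threads that are stopped although
// they were asked to run, but leave vfork parents alone.  Returns the pids
// that were resumed.
static std::vector<int>
toy_resume_stopped_resumed_lwps (std::vector<toy_lwp> &lwps)
{
  std::vector<int> resumed;
  for (toy_lwp &lwp : lwps)
    {
      if (!lwp.stopped || !lwp.resume_requested)
        continue;
      if (lwp.is_vfork_parent)
        continue; // The check this path was missing.
      lwp.stopped = false;
      resumed.push_back (lwp.pid);
    }
  return resumed;
}
```

With a stopped vfork parent and a stopped ordinary thread in the list,
only the ordinary thread is resumed, so the parent stays stopped until
the child's exec or exit has been handled.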

With this change in place the vfork parent will remain stopped, and
GDB will only see the EXECD/EXITED/SIGNALED event.  The
problems in #9 and #10 are therefore skipped and we arrive at #11,
handling the EXECD event.  As the parent is still stopped #12 doesn't
apply, and in #13 when we try to stop the process we will see that it
is already stopped, there's no risk of the vfork parent exiting before
we get to this point.  And finally, in #15 we are safe to poke the
process registers because it will not have exited by this point.

However, I did mention two bugs.

The second bug I've not yet managed to actually trigger, but I'm
convinced it must exist: if we forget vforks for a moment, in step #13
above, when linux_nat_target::detach is called, we first try to stop
all threads in the process GDB is detaching from.  If we imagine a
multi-threaded inferior with many threads, and GDB running in non-stop
mode, then, if the user tries to detach there is a chance that a thread
could exit just as linux_nat_target::detach is entered, in which case
we should be able to trigger the same assert.

But, like I said, I've not (yet) managed to trigger this second bug,
and even if I could, the fix would not belong in this commit, so I'm
pointing this out just for completeness.

There's no test included in this commit.  In a couple of commits' time
I will expand gdb.base/foll-vfork.exp which is when this bug would be
exposed.  Unfortunately there are at least two other bugs (separate
from the ones discussed above) that need fixing first, these will be
fixed in the next commits before the gdb.base/foll-vfork.exp test is
expanded.

If you do want to reproduce this failure then you will certainly
need to run the gdb.base/foll-vfork.exp test in a loop as the failures
are all very timing sensitive.  I've found that running multiple
copies in parallel makes the failure more likely to appear, I usually
run ~6 copies in parallel and expect to see a failure within
10 minutes.
kito-cheng pushed a commit that referenced this pull request Dec 15, 2025
I decided to try to build and test gdb on Windows.

I found a page on the wiki [1] suggesting three ways of building gdb:
- MinGW,
- MinGW on Cygwin, and
- Cygwin.

I picked Cygwin, because I've used it before (though not recently).

I managed to install Cygwin and sufficient packages to build gdb and start the
testsuite.

However, testsuite progress ground to a halt at gdb.base/branch-to-self.exp.
[ AFAICT, similar problems reported here [2]. ]

I managed to reproduce this hang by running just the test-case.

I attempted to kill the hanging processes by:
- first killing the inferior process, using the cygwin "kill -9" command, and
- then killing the gdb process, likewise.

But the gdb process remained, and I had to point-and-click my way through task
manager to actually kill the gdb process.

I investigated this by attaching to the hanging gdb process.  Looking at the
main thread, I saw it was stopped in a call to WaitForSingleObject, with
the dwMilliseconds parameter set to INFINITE.

The backtrace in more detail:
...
(gdb) bt
 #0  0x00007fff196fc044 in ntdll!ZwWaitForSingleObject () from
     /cygdrive/c/windows/SYSTEM32/ntdll.dll
 #1  0x00007fff16bbcdcf in WaitForSingleObjectEx () from
     /cygdrive/c/windows/System32/KERNELBASE.dll
 #2  0x0000000100998065 in wait_for_single (handle=0x1b8, howlong=4294967295) at
     gdb/windows-nat.c:435
 #3  0x0000000100999aa7 in
     windows_nat_target::do_synchronously(gdb::function_view<bool ()>)
       (this=this@entry=0xa001c6fe0, func=...) at gdb/windows-nat.c:487
 #4  0x000000010099a7fb in windows_nat_target::wait_for_debug_event_main_thread
     (event=<optimized out>, this=0xa001c6fe0)
     at gdb/../gdbsupport/function-view.h:296
 #5  windows_nat_target::kill (this=0xa001c6fe0) at gdb/windows-nat.c:2917
 #6  0x00000001008f2f86 in target_kill () at gdb/target.c:901
 #7  0x000000010091fc46 in kill_or_detach (from_tty=0, inf=0xa000577d0)
     at gdb/top.c:1658
 #8  quit_force (exit_arg=<optimized out>, from_tty=from_tty@entry=0)
     at gdb/top.c:1759
 #9  0x00000001004f9ea8 in quit_command (args=args@entry=0x0,
     from_tty=from_tty@entry=0) at gdb/cli/cli-cmds.c:483
 #10 0x000000010091c6d0 in quit_cover () at gdb/top.c:295
 #11 0x00000001005e3d8a in async_disconnect (arg=<optimized out>)
     at gdb/event-top.c:1496
 #12 0x0000000100499c45 in invoke_async_signal_handlers ()
     at gdb/async-event.c:233
 #13 0x0000000100eb23d6 in gdb_do_one_event (mstimeout=mstimeout@entry=-1)
     at gdbsupport/event-loop.cc:198
 #14 0x00000001006df94a in interp::do_one_event (mstimeout=-1,
     this=<optimized out>) at gdb/interps.h:87
 #15 start_event_loop () at gdb/main.c:402
 #16 captured_command_loop () at gdb/main.c:466
 #17 0x00000001006e2865 in captured_main (data=0x7ffffcba0) at gdb/main.c:1346
 #18 gdb_main (args=args@entry=0x7ffffcc10) at gdb/main.c:1365
 #19 0x0000000100f98c70 in main (argc=10, argv=0xa000129f0) at gdb/gdb.c:38
...

In the docs [3], I read that using an INFINITE argument to WaitForSingleObject
might cause a system deadlock.

This prompted me to try this simple change in wait_for_single:
...
   while (true)
     {
-      DWORD r = WaitForSingleObject (handle, howlong);
+      DWORD r = WaitForSingleObject (handle,
+                                     howlong == INFINITE ? 100 : howlong);
+      if (howlong == INFINITE && r == WAIT_TIMEOUT)
+        continue;
...
with the timeout of 0.1 second estimated to be:
- small enough for gdb to feel reactive, and
- big enough not to consume too many CPU cycles with looping.
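
The same idea, replacing one unbounded wait with a loop of bounded waits,
can be sketched portably.  This is an analogue of the change above, not the
windows-nat.c code: it uses std::condition_variable instead of a Win32
handle, and `toy_event`, `toy_set_event`, and `toy_wait_for_single` are
hypothetical names.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical event object standing in for the Win32 handle.
struct toy_event
{
  std::mutex lock;
  std::condition_variable cond;
  bool signaled = false;
};

// Marks the event as signaled and wakes any waiter.
static void
toy_set_event (toy_event &ev)
{
  {
    std::lock_guard<std::mutex> guard (ev.lock);
    ev.signaled = true;
  }
  ev.cond.notify_all ();
}

// Waits until the event is signaled without ever blocking indefinitely:
// each wait is capped at `slice`, and a timeout simply loops again.
// Returns the number of slices that timed out before the event was seen.
static int
toy_wait_for_single (toy_event &ev, std::chrono::milliseconds slice)
{
  int timeouts = 0;
  std::unique_lock<std::mutex> guard (ev.lock);
  while (!ev.cond.wait_for (guard, slice, [&ev] { return ev.signaled; }))
    ++timeouts;
  return timeouts;
}
```

From the caller's point of view the wait still lasts until the event is
signaled; the difference is that the thread wakes up once per slice, which
is the trade-off the two bullet points above describe.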

And indeed, the test-case, while still failing, now finishes in ~50 seconds.

While there may be an underlying bug that triggers this behaviour, the failure
mode is so severe that I consider it a bug in itself.

Fix this by avoiding calling WaitForSingleObject with INFINITE argument.

Tested on x86_64-cygwin, by running the testsuite past the test-case.

Approved-By: Pedro Alves <pedro@palves.net>

PR tdep/32894
Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=32894

[1] https://sourceware.org/gdb/wiki/BuildingOnWindows
[2] https://sourceware.org/pipermail/gdb-patches/2025-May/217949.html
[3] https://learn.microsoft.com/en-us/windows/win32/api/synchapi/nf-synchapi-waitforsingleobject
kito-cheng pushed a commit that referenced this pull request Jan 23, 2026
For background, see this thread:

  https://inbox.sourceware.org/gdb-patches/20250612144607.27507-1-tdevries@suse.de

Tom describes the issue clearly in the above thread, here's what he
said:

  Once in a while, when running test-case gdb.base/bp-cmds-continue-ctrl-c.exp,
  I run into:
  ...
  Breakpoint 2, foo () at bp-cmds-continue-ctrl-c.c:23^M
  23        usleep (100);^M
  ^CFAIL: $exp: run: stop with control-c (unexpected) (timeout)
  FAIL: $exp: run: stop with control-c
  ...

  This is PR python/32167, observed both on x86_64-linux and powerpc64le-linux.

  This is not a timeout due to accidental slowness, gdb actually hangs.

  The backtrace at the hang is (on cfarm120 running AlmaLinux 9.6):
  ...
  (gdb) bt
   #0  0x00007fffbca9dd94 in __lll_lock_wait () from
       /lib64/glibc-hwcaps/power10/libc.so.6
   #1  0x00007fffbcaa6ddc in pthread_mutex_lock@@GLIBC_2.17 () from
       /lib64/glibc-hwcaps/power10/libc.so.6
   #2  0x000000001067aee8 in __gthread_mutex_lock ()
       at /usr/include/c++/11/ppc64le-redhat-linux/bits/gthr-default.h:749
   #3  0x000000001067afc8 in __gthread_recursive_mutex_lock ()
       at /usr/include/c++/11/ppc64le-redhat-linux/bits/gthr-default.h:811
   #4  0x000000001067b0d4 in std::recursive_mutex::lock ()
       at /usr/include/c++/11/mutex:108
   #5  0x000000001067b380 in std::lock_guard<std::recursive_mutex>::lock_guard ()
       at /usr/include/c++/11/bits/std_mutex.h:229
   #6  0x0000000010679d3c in set_quit_flag () at gdb/extension.c:865
   #7  0x000000001066b6dc in handle_sigint () at gdb/event-top.c:1264
   #8  0x00000000109e3b3c in handler_wrapper () at gdb/posix-hdep.c:70
   #9  <signal handler called>
   #10 0x00007fffbcaa6d14 in pthread_mutex_lock@@GLIBC_2.17 () from
       /lib64/glibc-hwcaps/power10/libc.so.6
   #11 0x000000001067aee8 in __gthread_mutex_lock ()
       at /usr/include/c++/11/ppc64le-redhat-linux/bits/gthr-default.h:749
   #12 0x000000001067afc8 in __gthread_recursive_mutex_lock ()
       at /usr/include/c++/11/ppc64le-redhat-linux/bits/gthr-default.h:811
   #13 0x000000001067b0d4 in std::recursive_mutex::lock ()
       at /usr/include/c++/11/mutex:108
   #14 0x000000001067b380 in std::lock_guard<std::recursive_mutex>::lock_guard ()
       at /usr/include/c++/11/bits/std_mutex.h:229
   #15 0x00000000106799cc in set_active_ext_lang ()
       at gdb/extension.c:775
   #16 0x0000000010b287ac in gdbpy_enter::gdbpy_enter ()
       at gdb/python/python.c:232
   #17 0x0000000010a8e3f8 in bpfinishpy_handle_stop ()
       at gdb/python/py-finishbreakpoint.c:414
       at gdb/python/py-finishbreakpoint.c:414
  ...

  What happens here is the following:
  - the gdbpy_enter constructor attempts to set the current extension language
    to python using set_active_ext_lang
  - set_active_ext_lang attempts to lock ext_lang_mutex
  - while doing so, it is interrupted by sigint_wrapper (the SIGINT handler),
    handling a SIGINT
  - sigint_wrapper calls handle_sigint, which calls set_quit_flag, which also
    tries to lock ext_lang_mutex
  - since std::recursive_mutex::lock is not async-signal-safe, things go wrong,
    resulting in a hang.

  The hang bisects to commit 8bb8f83 ("Fix gdb.interrupt race"), which
  introduced the lock, making PR python/32167 a regression since gdb 15.1.

  Commit 8bb8f83 fixes PR dap/31263, a race reported by ThreadSanitizer:
  ...
  WARNING: ThreadSanitizer: data race (pid=615372)

    Read of size 1 at 0x00000328064c by thread T19:
      #0 set_active_ext_lang(extension_language_defn const*) gdb/extension.c:755
      #1 scoped_disable_cooperative_sigint_handling::scoped_disable_cooperative_sigint_handling()
         gdb/extension.c:697
      #2 gdbpy_interrupt gdb/python/python.c:1106
      #3 cfunction_vectorcall_NOARGS <null>

    Previous write of size 1 at 0x00000328064c by main thread:
      #0 scoped_disable_cooperative_sigint_handling::scoped_disable_cooperative_sigint_handling()
         gdb/extension.c:704
      #1 fetch_inferior_event() gdb/infrun.c:4591
      ...

    Location is global 'cooperative_sigint_handling_disabled' of size 1 at 0x00000328064c

    ...

  SUMMARY: ThreadSanitizer: data race gdb/extension.c:755 in \
    set_active_ext_lang(extension_language_defn const*)
  ...

  The problem here is that gdb.interrupt is called from a worker thread, and its
  implementation, gdbpy_interrupt races with the main thread on some variable.

The fix presented here is based on the fix that Tom proposed, but
fills in the missing Mingw support.

The problem is basically split into two: hosts that support Unix-like
signals, and Mingw, which doesn't support signals.

For signal supporting hosts, I've adopted the approach that Tom
suggests, gdbpy_interrupt uses kill() to send SIGINT to the GDB
process.  This is then handled in the main thread as if the user had
pressed Ctrl+C.  For these hosts no locking is required, so the
existing lock is removed.  However, everywhere the lock currently
exists I've added an assert:

    gdb_assert (is_main_thread ());

If this assert ever triggers then we're setting or reading the quit
flag on a worker thread, this will be a problem without the mutex.

For Mingw, the current mutex is retained.  This is fine as there are
no signals, so no chance of the mutex acquisition being interrupted by
a signal, and so, deadlock shouldn't be an issue.

To manage the complexity of when we need an assert, and when we need
the mutex, I've created 'struct ext_lang_guard', which can be used as
a RAII object.  This object either performs the assertion check, or
acquires the mutex, depending on the host.
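
A guard of this shape can be sketched as follows.  This is an illustration
of the idea only, not the code from the commit: the `toy_` names are
hypothetical, and the host choice is modelled with a plain constant
(`toy_host_has_signals`) because the commit message does not say how GDB
makes that decision.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Assumption for this sketch: true models a host with Unix-like signals,
// false models a host without them.
constexpr bool toy_host_has_signals = true;

// Captured during static initialization, which runs on the main thread.
static const std::thread::id toy_main_thread_id = std::this_thread::get_id ();

static bool
toy_is_main_thread ()
{
  return std::this_thread::get_id () == toy_main_thread_id;
}

static std::recursive_mutex toy_ext_lang_mutex;

// RAII guard: on signal-capable hosts it only checks that the caller is
// the main thread; otherwise it holds the mutex for its lifetime.
struct toy_ext_lang_guard
{
  toy_ext_lang_guard ()
  {
    if (toy_host_has_signals)
      {
        // No lock: correctness relies on only the main thread getting here.
        assert (toy_is_main_thread ());
      }
    else
      toy_ext_lang_mutex.lock ();
  }

  ~toy_ext_lang_guard ()
  {
    if (!toy_host_has_signals)
      toy_ext_lang_mutex.unlock ();
  }

  toy_ext_lang_guard (const toy_ext_lang_guard &) = delete;
  toy_ext_lang_guard &operator= (const toy_ext_lang_guard &) = delete;
};

static bool toy_quit_flag = false;

// Example user of the guard, loosely modelled on setting the quit flag.
static void
toy_set_quit_flag ()
{
  toy_ext_lang_guard guard;
  toy_quit_flag = true;
}
```

Call sites look the same on both kinds of host; only the guard's
constructor and destructor differ.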

Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=32167
Co-Authored-By: Tom de Vries <tdevries@suse.de>
Approved-By: Tom Tromey <tom@tromey.com>
kito-cheng pushed a commit that referenced this pull request Jan 23, 2026
While reviewing and testing another patch I set a breakpoint on a
GNU ifunc function, then restarted the inferior, and this assert
triggered:

  ../../src/gdb/breakpoint.c:14747: internal-error: breakpoint_free_objfile: Assertion `loc->symtab == nullptr' failed.

The backtrace at the time of the assert is:

  #6  0x00000000005ffee0 in breakpoint_free_objfile (objfile=0x4064b30) at ../../src/gdb/breakpoint.c:14747
  #7  0x0000000000c33ff2 in objfile::~objfile (this=0x4064b30, __in_chrg=<optimized out>) at ../../src/gdb/objfiles.c:478
  #8  0x0000000000c38da6 in std::default_delete<objfile>::operator() (this=0x7ffc1a49d538, __ptr=0x4064b30) at /usr/include/c++/9/bits/unique_ptr.h:81
  #9  0x0000000000c3782a in std::unique_ptr<objfile, std::default_delete<objfile> >::~unique_ptr (this=0x7ffc1a49d538, __in_chrg=<optimized out>) at /usr/include/c++/9/bits/unique_ptr.h:292
  #10 0x0000000000caf1bd in owning_intrusive_list<objfile, intrusive_base_node<objfile> >::erase (this=0x3790d68, i=...) at ../../src/gdb/../gdbsupport/owning_intrusive_list.h:111
  #11 0x0000000000cacd0c in program_space::remove_objfile (this=0x3790c80, objfile=0x4064b30) at ../../src/gdb/progspace.c:192
  #12 0x0000000000c33e1c in objfile::unlink (this=0x4064b30) at ../../src/gdb/objfiles.c:408
  #13 0x0000000000c34fb9 in objfile_purge_solibs (pspace=0x3790c80) at ../../src/gdb/objfiles.c:729
  #14 0x0000000000edf6f7 in no_shared_libraries (pspace=0x3790c80) at ../../src/gdb/solib.c:1359
  #15 0x0000000000fb3f6c in target_pre_inferior () at ../../src/gdb/target.c:2466
  #16 0x0000000000a724d7 in run_command_1 (args=0x0, from_tty=0, run_how=RUN_NORMAL) at ../../src/gdb/infcmd.c:390
  #17 0x0000000000a72a97 in run_command (args=0x0, from_tty=0) at ../../src/gdb/infcmd.c:514
  #18 0x00000000006bbb3d in do_simple_func (args=0x0, from_tty=0, c=0x39124b0) at ../../src/gdb/cli/cli-decode.c:95
  #19 0x00000000006c1021 in cmd_func (cmd=0x39124b0, args=0x0, from_tty=0) at ../../src/gdb/cli/cli-decode.c:2827

The function breakpoint_free_objfile is being called when an objfile
representing a shared library is being unloaded ahead of the inferior
being restarted, the function is trying to remove references to
anything that could itself reference the objfile that is being
deleted.

The assert is making the claim that, for a bp_location, which has a
single address, the objfile of the symtab associated with the location
will be the same as the objfile associated with the section of the
location.

This seems reasonable to me now, as it did when I added the assert in
commit:

  commit 5066f36
  Date:   Mon Nov 11 21:45:17 2024 +0000

      gdb: do better in breakpoint_free_objfile

The bp_location::section is maintained, according to the comments in
breakpoint.h, to aid overlay debugging (is that even used any more?),
and looking at the code, this does appear to be the case.

The problem in the above case arises when we are dealing with an ifunc
function.  What happens is that we end up with a section from one
objfile, but a symtab from a different objfile.

This problem originates from minsym_found (in linespec.c).  The user
asked for 'break gnu_ifunc' where 'gnu_ifunc' is an ifunc function.
What this means is that gnu_ifunc is actually a resolver function that
returns the address of the actual function to use.

In this particular test case, the resolver function is in a shared
library, and the actual function to use is in the main executable.

So, when GDB looks for 'gnu_ifunc' it finds the minimal_symbol with
that name, and spots that this has type mst_text_gnu_ifunc.  GDB then
uses this to figure out the actual address of the function that will
be run.

GDB then creates the symtab_and_line using the _real_ address and the
symtab in which that address lies, in our case this will all be
related to the main executable objfile.

But, finally, in minsym_found, GDB fills in the symtab_and_line's
section field, and this is done using the section containing the
original minimal_symbol, which is from the shared library objfile.

The minimal symbol and section are then used to initialise the
bp_location object, and this is how we end up in, what I think, is an
unexpected state.

So what to do about this?

The symtab_and_line::msymbol field is _only_ set within minsym_found,
and is then _only_ used to initialise the bp_location::msymbol field.

The bp_location::msymbol field is _only_ used in the function
set_breakpoint_location_function, and we only really care about the
msymbol type, we check to see if it's an ifunc symbol or not.  This
allows us to set the name of the function correctly.

The bp_location::section is used, as far as I can tell, extensively
for overlay handling.  It would seem to me, that this section should
be the section containing the actual breakpoint address.  If the
question we're asking is, is this breakpoint mapped in or not?  Then
surely we need to ask about the section holding the breakpoint's
address, and not the section holding some other code (e.g. the
resolver function).  In fact, in a memory constrained environment,
you'd expect the resolver functions to get mapped out pretty early on,
while the actual functions might still be mapped in.

Finally, symtab_and_line::section.  This is mostly set using calls to
find_pc_overlay.  The minsym_found function is one of the few places
where we do things differently.  In the places where the section is
used, it is (almost?) always used in conjunction with the
symtab_and_line::pc to lookup information, e.g. calls to
block_for_pc_sect, or find_pc_sect_containing_function.  In all these
cases, it appears to me that the assumption is that the section will
be the section that contains the address.

So, where does this leave us?

I think what we need to do is update minsym_found to just use
find_pc_overlay, which is how the symtab_and_line::section is set in
most other cases.  What this actually means in practice is that the
section field will be set to NULL (see find_pc_overlay in symfile.c).
But given that this is how the section is computed in most other
cases, I don't see why it should be especially problematic for this
case.  In reality, I think this just means that the section is
calculated via a call to find_pc_section when it's needed, as an
example, see lookup_minimal_symbol_by_pc_section (minsyms.c).

I do wonder if we should be doing better when creating the
symtab_and_line, and insist that the section be calculated correctly
at that point, but I really don't want to open that can of worms right
now, so I think just changing minsym_found to "do it just like
everyone else" should be good enough.

I've extended the existing ifunc test to expose this issue, the
updated test fails without this patch, and passes with.

Approved-By: Simon Marchi <simon.marchi@efficios.com>
kito-cheng pushed a commit that referenced this pull request Jan 23, 2026
A bug was reported to Red Hat where GDB was crashing with an assertion
failure, the assertion message is:

  ../../gdb/regcache.c:432: internal-error: get_thread_regcache: Assertion `thread->state != THREAD_EXITED' failed.

The backtrace for the crash is:

  #5  0x000055a21da8a880 in internal_vproblem(internal_problem *, const char *, int, const char *, typedef __va_list_tag __va_list_tag *) (problem=problem@entry=0x55a21e289060 <internal_error_problem>, file=<optimized out>, line=<optimized out>, fmt=<optimized out>, ap=ap@entry=0x7ffec7576be0) at ../../gdb/utils.c:477
  #6  0x000055a21da8aadf in internal_verror (file=<optimized out>, line=<optimized out>, fmt=<optimized out>, ap=ap@entry=0x7ffec7576be0) at ../../gdb/utils.c:503
  #7  0x000055a21dcbd055 in internal_error_loc (file=file@entry=0x55a21dd33b71 "../../gdb/regcache.c", line=line@entry=432, fmt=<optimized out>) at ../../gdbsupport/errors.cc:57
  #8  0x000055a21d8baaa9 in get_thread_regcache (thread=thread@entry=0x55a258de3a50) at ../../gdb/regcache.c:432
  #9  0x000055a21d74fa18 in print_signal_received_reason (uiout=0x55a258b649b0, siggnal=GDB_SIGNAL_TRAP) at ../../gdb/infrun.c:9287
  #10 0x000055a21d7daad9 in mi_interp::on_signal_received (this=0x55a258af5f60, siggnal=GDB_SIGNAL_TRAP) at ../../gdb/mi/mi-interp.c:372
  #11 0x000055a21d76ef99 in interps_notify<void (interp::*)(gdb_signal), gdb_signal&> (method=&virtual table offset 88, this adjustment 974682) at ../../gdb/interps.c:369
  #12 0x000055a21d76e58f in interps_notify_signal_received (sig=<optimized out>, sig@entry=GDB_SIGNAL_TRAP) at ../../gdb/interps.c:378
  #13 0x000055a21d75074d in notify_signal_received (sig=GDB_SIGNAL_TRAP) at ../../gdb/infrun.c:6818
  #14 0x000055a21d755af0 in normal_stop () at ../../gdb/gdbthread.h:432
  #15 0x000055a21d768331 in fetch_inferior_event () at ../../gdb/infrun.c:4753

The user is using a build of GDB with 32-bit ARM support included, and
they gave the following description for what they were doing at the
time of the crash:

  Suspended the execution of the firmware in Eclipse.  The gdb was
  connected to JLinkGDBServer with activated FreeRTOS awareness JLink
  plugin.

So they are remote debugging with a non-gdbserver target.

Looking in normal_stop() we see this code:

  /* As we're presenting a stop, and potentially removing breakpoints,
     update the thread list so we can tell whether there are threads
     running on the target.  With target remote, for example, we can
     only learn about new threads when we explicitly update the thread
     list.  Do this before notifying the interpreters about signal
     stops, end of stepping ranges, etc., so that the "new thread"
     output is emitted before e.g., "Program received signal FOO",
     instead of after.  */
  update_thread_list ();

  if (last.kind () == TARGET_WAITKIND_STOPPED && stopped_by_random_signal)
    notify_signal_received (inferior_thread ()->stop_signal ());

Which accounts for the transition from frame #14 to frame #13.  But it
is the update_thread_list() call which interests me.  This call asks
the target (remote target in this case) for the current thread list,
and then marks threads exited based on the answer.

And so, if a (badly behaved) target (incorrectly) removes a thread
from the thread list, then the update_thread_list() call will mark the
impacted thread as exited, even if GDB is currently handling a signal
stop event for that target.

My guess for what's going on here then is this:

  1. Thread receives a signal.
  2. Remote target sends GDB a stop with signal packet.
  3. Remote decides that the thread is going away soon, and marks the
     thread as exited.
  4. GDB asks for the thread list.
  5. Remote sends back the thread list, which doesn't include the
     event thread, as the remote thinks this thread has exited.
  6. GDB marks the thread as exited, and then proceeds to try and
     print the signal stop event for the event thread.
  7. Printing the signal stop requires reading registers, which
     requires a regcache.  We can only get a regcache for a non-exited
     thread, and so GDB raises an assertion.

Using the gdbreplay test framework I was able to reproduce this
failure using gdbserver.  I create an inferior with two threads, the
main thread sends a signal to the second thread, GDB sees the signal
arrive and prints this information for the user.

Having captured the trace of this activity, I then find the thread
list reply in the log file, and modify it to remove the second thread.

Now, when I replay the modified log file I see the same assertion
complaining about an attempt to get a regcache for an exited thread.

I'm not entirely sure the best way to fix this.  Clearly the problem
here is a bad remote target.  But, replies from a remote target
should (in my opinion) not be considered trusted, as a consequence, we
should not be asserting based on data coming from a remote.  Instead,
we should be giving warnings or errors and have GDB handle the bad
data as best it can.

This is the second attempt to fix this issue; my first patch can be
seen here:

  https://inbox.sourceware.org/gdb-patches/062e438c8677e2ab28fac6183d2ea6d444cb9121.1747567717.git.aburgess@redhat.com

In the first patch I was checking in normal_stop, immediately after
the call to update_thread_list, to see if the current thread was now
marked as exited.  However CI testing showed an issue with this
approach; I was already checking for many different TARGET_WAITKIND_*
kinds where the "is the current thread exited" question didn't make
sense, and it turns out that the list of kinds in my first attempt was
already insufficient.

Rather than just adding to the list, in this revised patch
I'm proposing to move the "is this thread exited" check inside the
block which handles signal stop events.

Right now, the only part of normal_stop which I know relies on the
current thread not being exited is the call to notify_signal_received,
so before calling notify_signal_received I check to see if the current
thread is now exited.  If it is then I print a warning to indicate
that the thread has unexpectedly exited and that the current
command (continue/step/etc) has been cancelled, I then change the
current event type to TARGET_WAITKIND_SPURIOUS.
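
To make the control flow above concrete, here is a minimal,
self-contained C++ sketch of the check.  It is only a model: the names
toy_thread, stop_event, wait_kind and present_signal_stop are
hypothetical stand-ins, not GDB's real types or functions, and the
"signal received" message is invented for the example.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins for GDB's real types; only the control flow
// described in the text is modelled here.
enum class wait_kind { stopped, spurious };

struct toy_thread
{
  std::string id;
  bool exited;
};

struct stop_event
{
  wait_kind kind;
  toy_thread *thread;
};

// Return the messages that would be presented to the user for EV.
// If the event thread was marked exited by the thread list update,
// warn and downgrade the event instead of touching the thread.
std::vector<std::string>
present_signal_stop (stop_event &ev)
{
  std::vector<std::string> out;

  if (ev.kind == wait_kind::stopped)
    {
      if (ev.thread->exited)
        {
          out.push_back ("warning: Thread " + ev.thread->id
                         + " unexpectedly exited after non-exit event");
          ev.kind = wait_kind::spurious;
        }
      else
        out.push_back ("signal received in thread " + ev.thread->id);
    }

  return out;
}
```

The point of the sketch is that the exited check sits right next to
the one consumer that needs a live thread, so no list of event kinds
has to be maintained.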

GDB's output now looks like this in all-stop mode:

  (gdb) continue
  Continuing.
  [New Thread 3483690.3483693]
  [Thread 3483690.3483693 exited]
  warning: Thread 3483690.3483693 unexpectedly exited after non-exit event
  [Switching to Thread 3483690.3483693]
  (gdb)

The non-stop output is identical, except we don't switch thread (stop
events never trigger a thread switch in non-stop mode).

The included test makes use of the gdbreplay framework, and tests in
all-stop and non-stop modes.  I would like to do more extensive
testing of GDB's state after receiving the unexpected thread list,
but due to using gdbreplay for testing, this is quite hard.  Many
commands, especially those looking at thread state, are likely to
trigger additional packets being sent to the remote, which causes
gdbreplay to bail out as the new packet doesn't match the original
recorded state.  However, I really don't think it is a good idea to
change gdbserver in order to "fake" this error case, so for now, using
gdbreplay is the best idea I have.

Bug: https://bugzilla.redhat.com/show_bug.cgi?id=2366461
kito-cheng pushed a commit that referenced this pull request Jan 23, 2026
If an expression is evaluated with 'EVAL_AVOID_SIDE_EFFECTS', we're
essentially interested in compatibility of the operands.  If there is an
operand of reference type, this would give us a memory value that would
cause a failure if GDB attempts to access the contents.

GDB fails to evaluate binary expressions for the following example:

  struct
  {
    int &get () { return x; };

    int x = 1;
  } v_struct;

The GDB output is:

  (gdb) print v_struct.get () == 1 && v_struct.get () == 2
  Cannot access memory at address 0x0
  (gdb) print v_struct.get () == 1 || v_struct.get () == 2
  Cannot access memory at address 0x0

Likewise, GDB fails to resolve the type for some expressions:

  (gdb) ptype v_struct.get ()
  type = int &
  (gdb) ptype v_struct.get () == 1
  Cannot access memory at address 0x0
  (gdb) ptype v_struct.get () + 1
  Cannot access memory at address 0x0
  (gdb) ptype v_struct.get () && 1
  Cannot access memory at address 0x0
  (gdb) ptype v_struct.get () || 1
  Cannot access memory at address 0x0
  (gdb) ptype !v_struct.get ()
  Cannot access memory at address 0x0
  (gdb) ptype v_struct.get () ? 2 : 3
  Cannot access memory at address 0x0
  (gdb) ptype v_struct.get () | 1
  Cannot access memory at address 0x0

Expression evaluation uses helper functions such as 'value_equal',
'value_logical_not', etc.  These helper functions do not take a 'noside'
argument and if one of their value arguments was created from a function
call that returns a reference type when noside == EVAL_AVOID_SIDE_EFFECTS,
GDB attempts to read from an invalid memory location.  Consider the
following call stack of the 'ptype v_struct.get () + 1' command at the time
the memory error is raised:

  #0  memory_error (err=TARGET_XFER_E_IO, memaddr=0) at gdb/corefile.c:114
  #1  read_value_memory (val=.., bit_offset=0, stack=false, memaddr=0,
      buffer=.. "", length=4) at gdb/valops.c:1075
  #2  value::fetch_lazy_memory (this=..) at gdb/value.c:3996
  #3  value::fetch_lazy (this=..) at gdb/value.c:4135
  #4  value::contents_writeable (this=..) at gdb/value.c:1329
  #5  value::contents (this=..) at gdb/value.c:1319
  #6  value_as_mpz (val=..) at gdb/value.c:2685
  #7  scalar_binop (arg1=.., arg2=.., op=BINOP_ADD) at gdb/valarith.c:1240
  #8  value_binop (arg1=.., arg2=.., op=BINOP_ADD) at gdb/valarith.c:1489
  #9  eval_op_add (expect_type=0x0, exp=.., noside=EVAL_AVOID_SIDE_EFFECTS,
      arg1=.., arg2=..) at gdb/eval.c:1333
  #10 expr::add_operation::evaluate (this=.., expect_type=0x0, exp=..,
      noside=EVAL_AVOID_SIDE_EFFECTS) at gdb/expop.h:1209
  #11 expression::evaluate (this=.., expect_type=0x0,
      noside=EVAL_AVOID_SIDE_EFFECTS) at gdb/eval.c:110
  #12 expression::evaluate_type (this=..) at gdb/expression.h:242

'add_operation::evaluate' calls the helper 'eval_op_add' which attempts
to read from the unresolved memory location.  Convert to the target type
to avoid such problems.  The patch is implemented in 'expop.h' for the
following reasons:

  * Support templated classes without explicit helpers, e.g.,
    'binop_operation' and 'comparison_operation'.
  * Stripping references in 'binop_promote' requires additional
    refactoring beyond this patch as we would need to carry on the
    'noside' parameter.
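
The idea of converting to the target type can be illustrated with a
small self-contained C++ model.  This is a sketch under assumptions,
not the code in 'expop.h': toy_type, toy_value, noside_kind,
strip_reference and toy_add are hypothetical names, and replacing the
operand with a readable zero value of the target type is my assumption
about what "convert to the target type" amounts to when only the
result type matters.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical model types, not GDB's.
enum class noside_kind { normal, avoid_side_effects };

struct toy_type
{
  bool is_reference;
  const toy_type *target;   // Referenced type, or nullptr.
};

struct toy_value
{
  const toy_type *type;
  bool readable;            // False models a lazy value at address 0x0.
  int contents;
};

// Helper without a 'noside' argument: it always reads the contents.
int
read_contents (const toy_value &v)
{
  if (!v.readable)
    throw std::runtime_error ("Cannot access memory at address 0x0");
  return v.contents;
}

// When side effects are avoided, swap a reference-typed operand for a
// readable placeholder of the target type before calling any helper.
toy_value
strip_reference (const toy_value &v, noside_kind noside)
{
  if (noside == noside_kind::avoid_side_effects && v.type->is_reference)
    return toy_value { v.type->target, true, 0 };
  return v;
}

toy_value
toy_add (const toy_value &a, const toy_value &b, noside_kind noside)
{
  toy_value lhs = strip_reference (a, noside);
  toy_value rhs = strip_reference (b, noside);
  return toy_value { lhs.type, true,
                     read_contents (lhs) + read_contents (rhs) };
}
```

In this model the helper never sees the unreadable reference value, so
the type of the result can be computed without a memory access.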

The above failures are resolved with the patch:

  (gdb) print v_struct.get () == 1 && v_struct.get () == 2
  $1 = false
  (gdb) print v_struct.get () == 1 || v_struct.get () == 2
  $2 = true
  (gdb) ptype v_struct.get ()
  type = int &
  (gdb) ptype v_struct.get () == 1
  type = bool
  (gdb) ptype v_struct.get () + 1
  type = int
  (gdb) ptype v_struct.get () && 1
  type = bool
  (gdb) ptype v_struct.get () || 1
  type = bool
  (gdb) ptype !v_struct.get ()
  type = bool
  (gdb) ptype v_struct.get () ? 2 : 3
  type = int
  (gdb) ptype v_struct.get () | 1
  type = int

Co-Authored-By: Tankut Baris Aktemur <tankut.baris.aktemur@intel.com>
Approved-By: Tom Tromey <tom@tromey.com>
kito-cheng pushed a commit that referenced this pull request Jan 23, 2026
PR gdb/33512 reports an assertion failure in test-case
gdb.ada/access_to_packed_array.exp on i386-linux:
...
(gdb) maint print symbols
gdb/frame.c:3400: internal-error: reinflate: \
  Assertion `m_cached_level >= -1' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
Quit this debugging session? (y or n) FAIL: $exp: \
  maint print symbols (GDB internal error)
...

I haven't been able to reproduce the failure by running the test-case on
x86_64-linux with target board unix/-m32, but I'm able to reproduce on
x86_64-linux by using the exec attached to the PR:
...
$ cat gdb.in
file foo
maint expand-symtabs
maint print symbols
$ gdb -q -batch -ex "set trace-commands on" -x gdb.in
   ...
         c_to: array (gdb/frame.c:3395: internal-error: reinflate: \
	                Assertion `m_cached_level >= -1' failed.
...

The problem happens when trying to print variable c_to:
...
 <4><f227>: Abbrev Number: 3 (DW_TAG_variable)
    <f228>   DW_AT_name        : c_to
    <f230>   DW_AT_type        : <0xf214>
...
with type:
...
 <4><f214>: Abbrev Number: 7 (DW_TAG_array_type)
    <f215>   DW_AT_type        : <0x9f39>
 <5><f21d>: Abbrev Number: 12 (DW_TAG_subrange_type)
    <f21e>   DW_AT_type        : <0x9d6c>
    <f222>   DW_AT_upper_bound : <0xf209>
...
with upper bound:
...
 <4><f209>: Abbrev Number: 89 (DW_TAG_variable)
    <f20a>   DW_AT_name        : system__os_lib__copy_file__copy_to__TTc_toSP1___U
    <f20e>   DW_AT_type        : <0x9d6c>
    <f212>   DW_AT_artificial  : 1
    <f212>   DW_AT_location    : 1 byte block: 57       (DW_OP_reg7 (edi))
...

The backtrace at the point of the assertion failure is:
...
 (gdb) bt
 #0  __pthread_kill_implementation (threadid=<optimized out>,
     signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
 #1  0x00007ffff62a8e7f in __pthread_kill_internal (signo=6,
     threadid=<optimized out>) at pthread_kill.c:78
 #2  0x00007ffff6257842 in __GI_raise (sig=sig@entry=6)
     at ../sysdeps/posix/raise.c:26
 #3  0x00007ffff623f5cf in __GI_abort () at abort.c:79
 #4  0x00000000010e7ac6 in dump_core () at gdb/utils.c:223
 #5  0x00000000010e81b8 in internal_vproblem(internal_problem *, const char *, int, const char *, typedef __va_list_tag __va_list_tag *) (
     problem=0x2ceb0c0 <internal_error_problem>,
     file=0x1ad5a90 "gdb/frame.c", line=3395,
     fmt=0x1ad5a08 "%s: Assertion `%s' failed.", ap=0x7fffffffc3c0)
     at gdb/utils.c:475
 #6  0x00000000010e82ac in internal_verror (
     file=0x1ad5a90 "gdb/frame.c", line=3395,
     fmt=0x1ad5a08 "%s: Assertion `%s' failed.", ap=0x7fffffffc3c0)
     at gdb/utils.c:501
 #7  0x00000000019be79f in internal_error_loc (
     file=0x1ad5a90 "gdb/frame.c", line=3395,
     fmt=0x1ad5a08 "%s: Assertion `%s' failed.")
     at gdbsupport/errors.cc:57
 #8  0x00000000009b5c16 in frame_info_ptr::reinflate (this=0x7fffffffc878)
     at gdb/frame.c:3395
 #9  0x00000000009b66f9 in frame_info_ptr::operator-> (this=0x7fffffffc878)
     at gdb/frame.h:290
 #10 0x00000000009b4bd5 in get_frame_arch (this_frame=...)
     at gdb/frame.c:3075
 #11 0x000000000081dd89 in dwarf_expr_context::fetch_result (
     this=0x7fffffffc810, type=0x410d600, subobj_type=0x410d600,
     subobj_offset=0, as_lval=true)
     at gdb/dwarf2/expr.c:1006
 #12 0x000000000081e2ef in dwarf_expr_context::evaluate (this=0x7fffffffc810,
     addr=0x7ffff459ce6b "W\aF\003", len=1, as_lval=true,
     per_cu=0x7fffd00053f0, frame=..., addr_info=0x7fffffffcc30, type=0x0,
     subobj_type=0x0, subobj_offset=0)
     at gdb/dwarf2/expr.c:1136
 #13 0x0000000000877c14 in dwarf2_locexpr_baton_eval (dlbaton=0x3e99c18,
     frame=..., addr_stack=0x7fffffffcc30, valp=0x7fffffffcab0,
     push_values=..., is_reference=0x7fffffffc9b0)
     at gdb/dwarf2/loc.c:1604
 #14 0x0000000000877f71 in dwarf2_evaluate_property (prop=0x3e99ce0,
     initial_frame=..., addr_stack=0x7fffffffcc30, value=0x7fffffffcab0,
     push_values=...) at gdb/dwarf2/loc.c:1668
 #15 0x00000000009def76 in resolve_dynamic_range (dyn_range_type=0x3e99c50,
     addr_stack=0x7fffffffcc30, frame=..., rank=0, resolve_p=true)
     at gdb/gdbtypes.c:2198
 #16 0x00000000009e0ded in resolve_dynamic_type_internal (type=0x3e99c50,
     addr_stack=0x7fffffffcc30, frame=..., top_level=true)
     at gdb/gdbtypes.c:2934
 #17 0x00000000009e1079 in resolve_dynamic_type (type=0x3e99c50, valaddr=...,
     addr=0, in_frame=0x0) at gdb/gdbtypes.c:2989
 #18 0x0000000000488ebc in ada_discrete_type_low_bound (type=0x3e99c50)
     at gdb/ada-lang.c:710
 #19 0x00000000004eb734 in print_range (type=0x3e99c50, stream=0x30157b0,
     bounds_preferred_p=0) at gdb/ada-typeprint.c:156
 #20 0x00000000004ebffe in print_array_type (type=0x3e99d10, stream=0x30157b0,
     show=1, level=9, flags=0x1bdcf20 <type_print_raw_options>)
     at gdb/ada-typeprint.c:381
 #21 0x00000000004eda3c in ada_print_type (type0=0x3e99d10,
     varstring=0x401f710 "c_to", stream=0x30157b0, show=1, level=9,
     flags=0x1bdcf20 <type_print_raw_options>)
     at gdb/ada-typeprint.c:1015
 #22 0x00000000004b4627 in ada_language::print_type (
     this=0x2f949b0 <ada_language_defn>, type=0x3e99d10,
     varstring=0x401f710 "c_to", stream=0x30157b0, show=1, level=9,
     flags=0x1bdcf20 <type_print_raw_options>)
     at gdb/ada-lang.c:13681
 #23 0x0000000000f74646 in print_symbol (gdbarch=0x3256270, symbol=0x3e99db0,
     depth=9, outfile=0x30157b0) at gdb/symmisc.c:545
 #24 0x0000000000f737e6 in dump_symtab_1 (symtab=0x3ddd7e0, outfile=0x30157b0)
     at gdb/symmisc.c:313
 #25 0x0000000000f73a69 in dump_symtab (symtab=0x3ddd7e0, outfile=0x30157b0)
     at gdb/symmisc.c:370
 #26 0x0000000000f7420f in maintenance_print_symbols (args=0x0, from_tty=0)
     at gdb/symmisc.c:481
 #27 0x00000000006c7fde in do_simple_func (args=0x0, from_tty=0, c=0x321e270)
     at gdb/cli/cli-decode.c:94
 #28 0x00000000006ce65a in cmd_func (cmd=0x321e270, args=0x0, from_tty=0)
     at gdb/cli/cli-decode.c:2826
 #29 0x0000000001005b78 in execute_command (p=0x3f48fe3 "", from_tty=0)
     at gdb/top.c:564
 #30 0x0000000000966095 in command_handler (
     command=0x3f48fd0 "maint print symbols")
     at gdb/event-top.c:613
 #31 0x0000000001005141 in read_command_file (stream=0x3011a40)
     at gdb/top.c:333
 #32 0x00000000006e2a64 in script_from_file (stream=0x3011a40,
     file=0x7fffffffe21f "gdb.in")
     at gdb/cli/cli-script.c:1705
 #33 0x00000000006bb88c in source_script_from_stream (stream=0x3011a40,
     file=0x7fffffffe21f "gdb.in", file_to_open=0x7fffffffd760 "gdb.in")
     at gdb/cli/cli-cmds.c:706
 #34 0x00000000006bba12 in source_script_with_search (
     file=0x7fffffffe21f "gdb.in", from_tty=0, search_path=0)
     at gdb/cli/cli-cmds.c:751
 #35 0x00000000006bbab2 in source_script (file=0x7fffffffe21f "gdb.in",
     from_tty=0) at gdb/cli/cli-cmds.c:760
 #36 0x0000000000b835cb in catch_command_errors (
     command=0x6bba7e <source_script(char const*, int)>,
     arg=0x7fffffffe21f "gdb.in", from_tty=0, do_bp_actions=false)
     at gdb/main.c:510
 #37 0x0000000000b83803 in execute_cmdargs (cmdarg_vec=0x7fffffffd980,
     file_type=CMDARG_FILE, cmd_type=CMDARG_COMMAND, ret=0x7fffffffd8c8)
     at gdb/main.c:606
 #38 0x0000000000b84d79 in captured_main_1 (context=0x7fffffffdb90)
     at gdb/main.c:1349
 #39 0x0000000000b84fe4 in captured_main (context=0x7fffffffdb90)
     at gdb/main.c:1372
 #40 0x0000000000b85092 in gdb_main (args=0x7fffffffdb90)
     at gdb/main.c:1401
 #41 0x000000000041a382 in main (argc=9, argv=0x7fffffffdcc8)
     at gdb/gdb.c:38
 (gdb)
...

The immediate problem is in dwarf_expr_context::fetch_result where we're
calling get_frame_arch:
...
      switch (this->m_location)
	{
	case DWARF_VALUE_REGISTER:
	  {
	    gdbarch *f_arch = get_frame_arch (this->m_frame);
...
with a null frame:
...
(gdb) p this->m_frame.is_null ()
$1 = true
(gdb)
...

Fix this using ensure_have_frame in dwarf_expr_context::execute_stack_op for
DW_OP_reg<n> and DW_OP_regx, getting us instead:
...
         c_to: array (<>) of character; computed at runtime
...
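
For illustration only, the shape of this fix can be modelled with the
following self-contained C++ sketch.  The types toy_frame and
toy_expr_context, the method read_register_op, and the error text are
hypothetical; only the idea of checking for a frame before a register
operation, and raising an error rather than hitting an assertion
later, is taken from the description above.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical model of a frame with a few registers.
struct toy_frame
{
  int regs[8];
};

struct toy_expr_context
{
  const toy_frame *frame;   // nullptr when evaluating without a frame.

  // Raise a normal, catchable error if there is no frame.  The message
  // text here is invented for the sketch.
  void ensure_have_frame (const char *op_name) const
  {
    if (frame == nullptr)
      throw std::runtime_error (std::string (op_name)
                                + " requires a frame");
  }

  // Model of handling a register operation: check first, then read.
  int read_register_op (int regnum) const
  {
    ensure_have_frame ("DW_OP_reg");
    return frame->regs[regnum];
  }
};
```

A caller that evaluates a dynamic property without a frame can then
catch the error and report the bound as unknown, instead of GDB
aborting.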

Tested on x86_64-linux.

Approved-By: Tom Tromey <tom@tromey.com>

Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=33512
kito-cheng pushed a commit that referenced this pull request Jan 23, 2026
On ppc64le-linux (AlmaLinux 9.6) with python 3.9 and test-case
gdb.python/py-failed-init.exp I run into:
...
builtin_spawn $gdb -nw -nx -q -iex set height 0 -iex set width 0 \
  -data-directory $build/gdb/data-directory -iex set interactive-mode on^M
Python path configuration:^M
  PYTHONHOME = 'foo'^M
  PYTHONPATH = (not set)^M
  program name = '/usr/bin/python'^M
  isolated = 0^M
  environment = 1^M
  user site = 1^M
  import site = 1^M
  sys._base_executable = '/usr/bin/python'^M
  sys.base_prefix = 'foo'^M
  sys.base_exec_prefix = 'foo'^M
  sys.platlibdir = 'lib64'^M
  sys.executable = '/usr/bin/python'^M
  sys.prefix = 'foo'^M
  sys.exec_prefix = 'foo'^M
  sys.path = [^M
    'foo/lib64/python39.zip',^M
    'foo/lib64/python3.9',^M
    'foo/lib64/python3.9/lib-dynload',^M
  ]^M
Fatal Python error: init_fs_encoding: failed to get the Python codec of the \
  filesystem encoding^M
Python runtime state: core initialized^M
ModuleNotFoundError: No module named 'encodings'^M
^M
Current thread 0x00007fffabe18480 (most recent call first):^M
<no Python frame>^M
ERROR: (eof) GDB never initialized.
Couldn't send python print (1) to GDB.
UNRESOLVED: gdb.python/py-failed-init.exp: gdb-command<python print (1)>
Couldn't send quit to GDB.
UNRESOLVED: gdb.python/py-failed-init.exp: quit
...

The test-case expects gdb to present a prompt, but instead gdb calls exit
with this backtrace:
...
(gdb) bt
 #0  0x00007ffff6e4bfbc in exit () from /lib64/glibc-hwcaps/power10/libc.so.6
 #1  0x00007ffff7873fc4 in fatal_error.lto_priv () from /lib64/libpython3.9.so.1.0
 #2  0x00007ffff78aae60 in Py_ExitStatusException () from /lib64/libpython3.9.so.1.0
 #3  0x00007ffff78c0e58 in Py_InitializeEx () from /lib64/libpython3.9.so.1.0
 #4  0x0000000010b6cab4 in py_initialize_catch_abort () at gdb/python/python.c:2456
 #5  0x0000000010b6cfac in py_initialize () at gdb/python/python.c:2540
 #6  0x0000000010b6d104 in do_start_initialization () at gdb/python/python.c:2595
 #7  0x0000000010b6eaac in gdbpy_initialize (extlang=0x11b7baf0 <extension_language_python>)
     at gdb/python/python.c:2968
 #8  0x000000001069d508 in ext_lang_initialization () at gdb/extension.c:319
 #9  0x00000000108f9280 in captured_main_1 (context=0x7fffffffe870)
     at gdb/main.c:1100
 #10 0x00000000108fa3cc in captured_main (context=0x7fffffffe870)
     at gdb/main.c:1372
 #11 0x00000000108fa4d8 in gdb_main (args=0x7fffffffe870) at gdb/main.c:1401
 #12 0x000000001001d1d8 in main (argc=3, argv=0x7fffffffece8) at gdb/gdb.c:38
...

This may be a python issue [1].

The problem doesn't happen if we use the PyConfig approach instead of the
py_initialize_catch_abort approach.

Fix this by using the PyConfig approach starting with 3.9 (previously starting
with 3.10, to avoid the Py_SetProgramName deprecation in 3.11).

It's possible that we have the same problem and need the same fix for 3.8, but
I don't have a setup to check that.  Add a todo in a comment.

Tested on ppc64le-linux.

Approved-By: Tom Tromey <tom@tromey.com>

[1] python/cpython#107827
kito-cheng pushed a commit that referenced this pull request Jan 23, 2026
New in v2: make the test try with indexes by default

This patch fixes a crash caused by GDB trying to read from a section not
read in.  The bug happens in these specific circumstances:

 - reading a type unit from .dwo
 - that type unit has a stub in the main file
 - there is a GDB index (.gdb_index) present

This crash is the cause of the following test failure, with the
cc-with-gdb-index target board:

    $ make check TESTS="gdb.dwarf2/fission-reread.exp" RUNTESTFLAGS="--target_board=cc-with-gdb-index"
    Running /home/smarchi/src/binutils-gdb/gdb/testsuite/gdb.dwarf2/fission-reread.exp ...
    ERROR: GDB process no longer exists

Or, manually:

    $ ./gdb -nx -q --data-directory=data-directory /home/smarchi/build/binutils-gdb/gdb/testsuite/outputs/gdb.dwarf2/fission-reread/fission-reread -ex "p 1"
    Reading symbols from /home/smarchi/build/binutils-gdb/gdb/testsuite/outputs/gdb.dwarf2/fission-reread/fission-reread...

    Fatal signal: Segmentation fault

For this last one, you need to interrupt the test (e.g. add a return)
before the test deletes the .dwo file.

The backtrace at the moment of the crash is:

    #0  0x0000555566968f7f in bfd_getl32 (p=0x0) at /home/simark/src/binutils-gdb/bfd/libbfd.c:846
    #1  0x00005555642e561d in read_initial_length (abfd=0x7d1ff1eb0e40, buf=0x0, bytes_read=0x7bfff0962fa0, handle_nonstd=true) at /home/simark/src/binutils-gdb/gdb/dwarf2/leb.c:92
    #2  0x00005555647ca9ea in read_unit_head (header=0x7d0ff1e068b0, info_ptr=0x0, section=0x7c3ff1dea7d0, section_kind=ruh_kind::COMPILE) at /home/simark/src/binutils-gdb/gdb/dwarf2/unit-head.c:44
    #3  0x000055556452e37e in dwarf2_per_cu::get_header (this=0x7d0ff1e06880) at /home/simark/src/binutils-gdb/gdb/dwarf2/read.c:18531
    #4  0x000055556452e574 in dwarf2_per_cu::addr_size (this=0x7d0ff1e06880) at /home/simark/src/binutils-gdb/gdb/dwarf2/read.c:18544
    #5  0x000055556406af91 in dwarf2_cu::addr_type (this=0x7d7ff1e20880) at /home/simark/src/binutils-gdb/gdb/dwarf2/cu.c:124
    #6  0x0000555564534e48 in set_die_type (die=0x7e0ff1f23dd0, type=0x7e0ff1f027f0, cu=0x7d7ff1e20880, skip_data_location=false) at /home/simark/src/binutils-gdb/gdb/dwarf2/read.c:19020
    #7  0x00005555644dcc7b in read_structure_type (die=0x7e0ff1f23dd0, cu=0x7d7ff1e20880) at /home/simark/src/binutils-gdb/gdb/dwarf2/read.c:11239
    #8  0x000055556451c834 in read_type_die_1 (die=0x7e0ff1f23dd0, cu=0x7d7ff1e20880) at /home/simark/src/binutils-gdb/gdb/dwarf2/read.c:16878
    #9  0x000055556451c5e0 in read_type_die (die=0x7e0ff1f23dd0, cu=0x7d7ff1e20880) at /home/simark/src/binutils-gdb/gdb/dwarf2/read.c:16861
    #10 0x0000555564526f3a in get_signatured_type (die=0x7e0ff1f0ffb0, signature=10386129560629316377, cu=0x7d7ff1e0f480) at /home/simark/src/binutils-gdb/gdb/dwarf2/read.c:17998
    #11 0x000055556451c23b in lookup_die_type (die=0x7e0ff1f0ffb0, attr=0x7e0ff1f10008, cu=0x7d7ff1e0f480) at /home/simark/src/binutils-gdb/gdb/dwarf2/read.c:16808
    #12 0x000055556451b2e9 in die_type (die=0x7e0ff1f0ffb0, cu=0x7d7ff1e0f480) at /home/simark/src/binutils-gdb/gdb/dwarf2/read.c:16684
    #13 0x000055556451457f in new_symbol (die=0x7e0ff1f0ffb0, type=0x0, cu=0x7d7ff1e0f480, space=0x0) at /home/simark/src/binutils-gdb/gdb/dwarf2/read.c:16089
    #14 0x00005555644c52a4 in read_variable (die=0x7e0ff1f0ffb0, cu=0x7d7ff1e0f480) at /home/simark/src/binutils-gdb/gdb/dwarf2/read.c:9119
    #15 0x0000555564494072 in process_die (die=0x7e0ff1f0ffb0, cu=0x7d7ff1e0f480) at /home/simark/src/binutils-gdb/gdb/dwarf2/read.c:5197
    #16 0x000055556449c88e in read_file_scope (die=0x7e0ff1f0fdd0, cu=0x7d7ff1e0f480) at /home/simark/src/binutils-gdb/gdb/dwarf2/read.c:6125
    #17 0x0000555564493671 in process_die (die=0x7e0ff1f0fdd0, cu=0x7d7ff1e0f480) at /home/simark/src/binutils-gdb/gdb/dwarf2/read.c:5098
    #18 0x00005555644912f5 in process_full_comp_unit (cu=0x7d7ff1e0f480) at /home/simark/src/binutils-gdb/gdb/dwarf2/read.c:4851
    #19 0x0000555564485e18 in process_queue (per_objfile=0x7d6ff1e71100) at /home/simark/src/binutils-gdb/gdb/dwarf2/read.c:4161
    #20 0x000055556446391d in dw2_do_instantiate_symtab (per_cu=0x7ceff1de42d0, per_objfile=0x7d6ff1e71100, skip_partial=true) at /home/simark/src/binutils-gdb/gdb/dwarf2/read.c:1650
    #21 0x0000555564463b3c in dw2_instantiate_symtab (per_cu=0x7ceff1de42d0, per_objfile=0x7d6ff1e71100, skip_partial=true) at /home/simark/src/binutils-gdb/gdb/dwarf2/read.c:1671
    #22 0x00005555644687fd in dwarf2_base_index_functions::expand_all_symtabs (this=0x7c1ff1e04990, objfile=0x7d5ff1e46080) at /home/simark/src/binutils-gdb/gdb/dwarf2/read.c:1990
    #23 0x0000555564381050 in cooked_index_functions::expand_all_symtabs (this=0x7c1ff1e04990, objfile=0x7d5ff1e46080) at /home/simark/src/binutils-gdb/gdb/dwarf2/cooked-index.h:237
    #24 0x0000555565df5b0d in objfile::expand_all_symtabs (this=0x7d5ff1e46080) at /home/simark/src/binutils-gdb/gdb/symfile-debug.c:372
    #25 0x0000555565eafc4a in maintenance_expand_symtabs (args=0x0, from_tty=1) at /home/simark/src/binutils-gdb/gdb/symmisc.c:914

The main file contains a stub (skeleton) for a compilation unit and a
stub for a type unit.  The .dwo file contains a compilation unit and a
type unit matching those stubs.  When doing the initial scan of the main
file, the DWARF reader parses the CU/TU list from the GDB index
(.gdb_index), and thus creates a signatured_type object based on that.
The section field of this signatured_type points to the .debug_types
section in the main file, the one containing the stub.  And because GDB
trusts the GDB index, it never needs to look at that .debug_types
section in the main file.  That section remains not read in.

When expanding the compilation unit, GDB encounters a type unit
reference (by signature) corresponding to the type in the type unit.  We
get in lookup_dwo_signatured_type, trying to see if there is a type unit
matching that signature in the current .dwo file.  We proceed to read
and expand that type unit, until we eventually get to a
dwarf2_cu::addr_type() call, which does:

     int addr_size = this->per_cu->addr_size ();

dwarf2_per_cu::addr_size() tries to read the header from the section
pointed to by dwarf2_per_cu::section which, if you recall, is the
.debug_types section in the main file that was never read in.  That
causes the segfault.

All this was working fine before these patches of mine, that tried to do
some cleanups:

    a47e229 ("gdb/dwarf: pass section offset to dwarf2_per_cu_data constructor")
    c44ab62 ("gdb/dwarf: pass section to dwarf2_per_cu_data constructor")
    39ee8c9 ("gdb/dwarf: pass unit length to dwarf2_per_cu_data constructor")

Before these patches, the fill_in_sig_entry_from_dwo_entry function
(called from lookup_dwo_signatured_type, among others) would overwrite
some dwarf2_per_cu fields (including the section) to point to the .dwo,
rather than represent what's in the main file.  Therefore, the header
would have been read from the unit in the .dwo file, and things would
have been fine.

When doing these changes, I mistakenly assumed that the section written
by fill_in_sig_entry_from_dwo_entry was the same as the section already
there, which is why I removed the statements overwriting the section
field (and the two others).  In my defense, here's the comment on
dwarf2_per_cu::section:

    /* The section this CU/TU lives in.
       If the DIE refers to a DWO file, this is always the original die,
       not the DWO file.  */
    struct dwarf2_section_info *section = nullptr;

I would prefer to not reintroduce the behavior of overwriting the
section info in dwarf2_per_cu, because:

 1. I find it confusing; I like the invariant that dwarf2_per_cu::section
    points to the stub, and dwarf2_cu::section points to where we
    actually read the debug info from.
 2. The dwarf2_per_bfd::all_units vector is nowadays sorted by (section,
    section offset).  If we change the section and section offset of a
    dwarf2_per_cu, then we can no longer do binary searches in it, we
    would have to re-sort the vector (not a big deal, but still adds to
    the confusion).
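
To illustrate the second point, here is a small self-contained C++
sketch of a vector kept sorted by (section, section offset) and
searched with a binary search.  The names toy_unit, unit_less and
find_unit are hypothetical, and representing a section by an integer
index is an assumption made for the example; this is not the actual
all_units code.

```cpp
#include <algorithm>
#include <cassert>
#include <tuple>
#include <vector>

// Hypothetical stand-in for a unit; the section is modelled as an index.
struct toy_unit
{
  int section;
  unsigned long offset;
};

// Ordering used to keep the vector sorted: (section, offset).
static bool
unit_less (const toy_unit &a, const toy_unit &b)
{
  return std::tie (a.section, a.offset) < std::tie (b.section, b.offset);
}

// Binary search for a unit.  Only valid while UNITS is sorted by
// unit_less; changing a unit's section or offset in place breaks that.
const toy_unit *
find_unit (const std::vector<toy_unit> &units, int section,
           unsigned long offset)
{
  toy_unit key { section, offset };
  auto it = std::lower_bound (units.begin (), units.end (), key,
                              unit_less);
  if (it != units.end () && it->section == section
      && it->offset == offset)
    return &*it;
  return nullptr;
}
```

If an element's key fields are overwritten after the vector is built,
the vector has to be sorted again before find_unit can be trusted,
which is the extra step I would like to avoid.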

One possible fix would be to make sure that the section is read in when
reading the header, probably in dwarf2_per_cu::get_header.  An approach
like that was proposed by Andrew initially, here:

  https://inbox.sourceware.org/gdb-patches/60ba2b019930fd6164f8e6ab6cb2e396c32c6ac2.1759486109.git.aburgess@redhat.com/

It would work, but there is a more straightforward fix for this
particular problem, I believe.  In dwarf2_cu, we have access to the
header read from the unit we're actually reading the DWARF from.  In the
DWO case, that is the header read from the .dwo file.  We can get the
address size from there instead of going through the dwarf2_per_cu
object.  This is what this patch does.
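
A rough model of the difference between the two paths, as a
self-contained C++ sketch.  All names here (toy_section, toy_header,
toy_per_cu, toy_cu) are hypothetical, the header layout is reduced to
a single byte, and the crashing path is modelled with an exception
rather than a real null dereference.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

struct toy_section
{
  const std::uint8_t *buffer;   // nullptr until the section is read in.
};

struct toy_header
{
  std::uint8_t addr_size;
};

struct toy_per_cu
{
  const toy_section *section;   // Section of the stub in the main file.

  // Parses the header from the stub's section.  If that section was
  // never read in, the real code would read through a null pointer;
  // the sketch throws instead.
  std::uint8_t addr_size () const
  {
    if (section->buffer == nullptr)
      throw std::runtime_error ("section not read in");
    return section->buffer[0];
  }
};

struct toy_cu
{
  const toy_per_cu *per_cu;
  toy_header header;            // Already read from the unit in the .dwo.

  // Use the header we actually read the DWARF from, without going
  // through per_cu and its possibly not-read-in section.
  std::uint8_t addr_size () const
  {
    return header.addr_size;
  }
};
```

With this arrangement the stub's section never needs to be touched
when reading a type unit from the .dwo file.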

However, there are other cases where we get the address (or offset) size
from the dwarf2_per_cu in the DWARF expression evaluator (expr.c,
loc.c), that could cause a similar crash. The next patch handles these
cases.

Modify the gdb.dwarf2/fission-reread.exp test so that it tries running
with an index even with the standard board (that part was originally
written by Andrew).

Finally, just to put things in context, having a stub in the main file
for a type unit is obsolete.  It happened in the gcc 4.x days, until
this commit:

    commit 4dd7c3b285daf030da0ff9c0d5e2f79b24943d1e
    Author: Cary Coutant <ccoutant@google.com>
    Date:   Fri Aug 8 20:33:26 2014 +0000

        Remove skeleton type units that were being produced with -gsplit-dwarf.

In DWARF 5, split type units don't have stubs, only split compilation
units do.

Change-Id: Icc5014276c75bf3126ccb43a4424e96ca1a51f06
Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=33307
Co-Authored-By: Andrew Burgess <aburgess@redhat.com>
Approved-By: Andrew Burgess <aburgess@redhat.com>
kito-cheng pushed a commit that referenced this pull request Jan 23, 2026
New in v2:

 - make the test try with indexes by default
 - using uint8_t instead of unsigned char

In some specific circumstances, it is possible for GDB to read a type
unit from a .dwo file without ever reading in the section of the stub in
the main file.  In that case, calling any of these methods:

  - dwarf2_per_cu::addr_size()
  - dwarf2_per_cu::offset_size()
  - dwarf2_per_cu::ref_addr_size()

will cause a crash, because they will try to read the unit header from
the not-read-in section buffer.  See the previous patch for more
details.

The remaining calls to these methods are in the loc.c and expr.c
files.  That is, in the location and expression machinery.  It is
possible to set things up to cause them to trigger a crash, as shown by
the new test, when running it with the cc-with-gdb-index board:

    $ make check TESTS="gdb.dwarf2/fission-type-unit-locexpr.exp" RUNTESTFLAGS="--target_board=cc-with-gdb-index"
    Running /home/simark/src/binutils-gdb/gdb/testsuite/gdb.dwarf2/fission-type-unit-locexpr.exp ...
    ERROR: GDB process no longer exists

The backtrace at the moment of the crash is:

    #0  0x0000555566968b1f in bfd_getl32 (p=0x78) at /home/simark/src/binutils-gdb/bfd/libbfd.c:846
    #1  0x00005555642e51b7 in read_initial_length (abfd=0x7d1ff1eb0e40, buf=0x78 <error: Cannot access memory at address 0x78>, bytes_read=0x7bfff09daca0, handle_nonstd=true)
        at /home/simark/src/binutils-gdb/gdb/dwarf2/leb.c:92
    #2  0x00005555647ca584 in read_unit_head (header=0x7d0ff1e06c70, info_ptr=0x78 <error: Cannot access memory at address 0x78>, section=0x7c3ff1dea7d0, section_kind=ruh_kind::COMPILE)
        at /home/simark/src/binutils-gdb/gdb/dwarf2/unit-head.c:44
    #3  0x000055556452df18 in dwarf2_per_cu::get_header (this=0x7d0ff1e06c40) at /home/simark/src/binutils-gdb/gdb/dwarf2/read.c:18531
    #4  0x000055556452e10e in dwarf2_per_cu::addr_size (this=0x7d0ff1e06c40) at /home/simark/src/binutils-gdb/gdb/dwarf2/read.c:18544
    #5  0x0000555564314ac3 in dwarf2_locexpr_baton_eval (dlbaton=0x7bfff0c9a508, frame=..., addr_stack=0x7bfff0b59150, valp=0x7bfff0c9a430, push_values=..., is_reference=0x7bfff0d33030)
        at /home/simark/src/binutils-gdb/gdb/dwarf2/loc.c:1593
    #6  0x0000555564315bd2 in dwarf2_evaluate_property (prop=0x7bfff0c9a450, initial_frame=..., addr_stack=0x7bfff0b59150, value=0x7bfff0c9a430, push_values=...) at /home/simark/src/binutils-gdb/gdb/dwarf2/loc.c:1668
    #7  0x0000555564a14ee1 in resolve_dynamic_field (field=..., addr_stack=0x7bfff0b59150, frame=...) at /home/simark/src/binutils-gdb/gdb/gdbtypes.c:2758
    #8  0x0000555564a15e24 in resolve_dynamic_struct (type=0x7e0ff1f02550, addr_stack=0x7bfff0b59150, frame=...) at /home/simark/src/binutils-gdb/gdb/gdbtypes.c:2839
    #9  0x0000555564a17061 in resolve_dynamic_type_internal (type=0x7e0ff1f02550, addr_stack=0x7bfff0b59150, frame=..., top_level=true) at /home/simark/src/binutils-gdb/gdb/gdbtypes.c:2972
    #10 0x0000555564a17899 in resolve_dynamic_type (type=0x7e0ff1f02550, valaddr=..., addr=0x4010, in_frame=0x7bfff0d32e60) at /home/simark/src/binutils-gdb/gdb/gdbtypes.c:3019
    #11 0x000055556675fb34 in value_from_contents_and_address (type=0x7e0ff1f02550, valaddr=0x0, address=0x4010, frame=...) at /home/simark/src/binutils-gdb/gdb/value.c:3674
    #12 0x00005555666ce911 in get_value_at (type=0x7e0ff1f02550, addr=0x4010, frame=..., lazy=1) at /home/simark/src/binutils-gdb/gdb/valops.c:992
    #13 0x00005555666ceb89 in value_at_lazy (type=0x7e0ff1f02550, addr=0x4010, frame=...) at /home/simark/src/binutils-gdb/gdb/valops.c:1039
    #14 0x000055556491909f in language_defn::read_var_value (this=0x5555725fce40 <minimal_language_defn>, var=0x7e0ff1f02500, var_block=0x7e0ff1f025d0, frame_param=...)
        at /home/simark/src/binutils-gdb/gdb/findvar.c:504
    #15 0x000055556491961b in read_var_value (var=0x7e0ff1f02500, var_block=0x7e0ff1f025d0, frame=...) at /home/simark/src/binutils-gdb/gdb/findvar.c:518
    #16 0x00005555666d1861 in value_of_variable (var=0x7e0ff1f02500, b=0x7e0ff1f025d0) at /home/simark/src/binutils-gdb/gdb/valops.c:1384
    #17 0x00005555647f7099 in evaluate_var_value (noside=EVAL_NORMAL, blk=0x7e0ff1f025d0, var=0x7e0ff1f02500) at /home/simark/src/binutils-gdb/gdb/eval.c:533
    #18 0x00005555647f740c in expr::var_value_operation::evaluate (this=0x7c2ff1e3b690, expect_type=0x0, exp=0x7c2ff1e3aa00, noside=EVAL_NORMAL) at /home/simark/src/binutils-gdb/gdb/eval.c:559
    #19 0x00005555647f3347 in expression::evaluate (this=0x7c2ff1e3aa00, expect_type=0x0, noside=EVAL_NORMAL) at /home/simark/src/binutils-gdb/gdb/eval.c:109
    #20 0x000055556543ac2f in process_print_command_args (args=0x7fffffffe728 "global_var", print_opts=0x7bfff0be4a30, voidprint=true) at /home/simark/src/binutils-gdb/gdb/printcmd.c:1328
    #21 0x000055556543ae65 in print_command_1 (args=0x7fffffffe728 "global_var", voidprint=1) at /home/simark/src/binutils-gdb/gdb/printcmd.c:1341
    #22 0x000055556543b707 in print_command (exp=0x7fffffffe728 "global_var", from_tty=1) at /home/simark/src/binutils-gdb/gdb/printcmd.c:1408

The problem to solve is: in order to evaluate a location expression, we
need to know some information (the various sizes) found in the unit
header.  In that context, it's not possible to get it from
dwarf2_cu::header, like the previous patch did: at the time the
expression is evaluated, the corresponding dwarf2_cu might have been
freed.  We don't want to re-build a dwarf2_cu just for that, it would be
very inefficient.  We could force reading in the dwarf2_per_cu::section
section (in the main file), but we never needed to read that section
before, so it would be better to avoid reading it unnecessarily.

My initial attempt was to store this information in baton objects
(dwarf2_locexpr_baton & co), so that it can be retrieved when the time
comes to evaluate the expressions.  However, it quickly became obvious
that storing it there would be redundant and wasteful.

I instead opted to store this information directly inside dwarf2_per_cu,
making it easily available when evaluating expressions.  These fields
initially have the value 0, and are set by cutu_reader whenever the
unit is parsed.  The various getters (dwarf2_per_cu::addr_size & al) now
just return these fields.
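
As a rough illustration of the approach described above, here is a minimal sketch in which the sizes live in the per-CU object, start at 0, and are filled in by whoever parses the unit.  The names per_cu_sketch, unit_header_sizes, set_sizes and sizes_known are hypothetical stand-ins, not GDB's actual declarations.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the sizes found in a DWARF unit header.
struct unit_header_sizes
{
  std::uint8_t addr_size;
  std::uint8_t offset_size;
};

// Simplified stand-in for dwarf2_per_cu (assumption: the real class
// has many more members and may name these fields differently).
class per_cu_sketch
{
public:
  // Called by the reader once it has parsed the unit header.
  void set_sizes (const unit_header_sizes &sizes)
  {
    m_addr_size = sizes.addr_size;
    m_offset_size = sizes.offset_size;
  }

  // The getters only return the cached fields; they never read a
  // section or rebuild a header.
  std::uint8_t addr_size () const { return m_addr_size; }
  std::uint8_t offset_size () const { return m_offset_size; }

  // Hypothetical helper: true once the unit has been parsed.
  bool sizes_known () const { return m_addr_size != 0; }

private:
  // Both fields start at 0, meaning "not parsed yet".
  std::uint8_t m_addr_size = 0;
  std::uint8_t m_offset_size = 0;
};
```

With this shape, evaluating a location expression only needs the per-CU object, even after the corresponding dwarf2_cu has been freed.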

Doing so allows removing anything related to reading the header from
dwarf2_per_cu, which I think is a nice simplification.  This means that
nothing ever needs to read the header from just a dwarf2_per_cu.

It also happens to shrink the dwarf2_per_cu object size a bit, going
from:

    (top-gdb) p sizeof(dwarf2_per_cu)
    $1 = 176

to

    (top-gdb) p sizeof(dwarf2_per_cu)
    $1 = 120

I placed the new fields at this strange location in dwarf2_per_cu
because there happened to be a nice 3-byte hole there (on Linux amd64
at least).
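
To illustrate why placement matters, here is a small sketch with made-up structs (not the real dwarf2_per_cu layout).  The field names other than addr_size are assumptions.  On common ABIs, one-byte fields placed in existing padding do not grow the struct, while the same fields appended at the end can.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical layout before the change: on typical ABIs the compiler
// inserts padding after `flag` so that `q` is pointer-aligned.
struct before_sketch
{
  void *p;
  std::uint8_t flag;
  void *q;
};

// Three one-byte fields placed in the padding that follows `flag`.
struct in_hole_sketch
{
  void *p;
  std::uint8_t flag;
  std::uint8_t addr_size;
  std::uint8_t offset_size;    // assumed field name
  std::uint8_t ref_addr_size;  // assumed field name
  void *q;
};

// The same three fields appended after the last pointer instead.
struct at_end_sketch
{
  void *p;
  std::uint8_t flag;
  void *q;
  std::uint8_t addr_size;
  std::uint8_t offset_size;
  std::uint8_t ref_addr_size;
};
```

The exact sizes depend on the platform, so the only safe expectation is that the in-hole layout is no larger than the appended one.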

The new test set things up as described previously.  Note that the crash
only occurs if using the cc-with-gdb-index board.

Change-Id: I50807a1bbb605f0f92606a9e61c026e3376a4fcf
Approved-By: Andrew Burgess <aburgess@redhat.com>
kito-cheng pushed a commit that referenced this pull request Jan 23, 2026
If breakpoint commands contain detach or kill, then gdb tries to access
freed memory:

(gdb) b main
Breakpoint 1 at 0x111d: file main.c, line 21.
(gdb) commands
Type commands for breakpoint(s) 1, one per line.
End with a line saying just "end".
>detach
>end
(gdb) run
Starting program: /home/src/lappy/binutils-gdb.git/gdb/testsuite/gdb.base/main
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/../lib/libthread_db.so.1".

main () at main.c:21
21        return 0;
[Inferior 1 (process 241852) detached]
=================================================================
==241817==ERROR: AddressSanitizer: heap-use-after-free on address 0x7b7a3de0b760 at pc 0x55fcb92613fe bp 0x7ffec2d524f0 sp 0x7ffec2d524e0
READ of size 8 at 0x7b7a3de0b760 thread T0
    #0 0x55fcb92613fd in bpstat_do_actions_1 ../../gdb/breakpoint.c:4898
    #1 0x55fcb92617da in bpstat_do_actions() ../../gdb/breakpoint.c:5012
    #2 0x55fcba3180e7 in inferior_event_handler(inferior_event_type) ../../gdb/inf-loop.c:71
    #3 0x55fcba3ba1e1 in fetch_inferior_event() ../../gdb/infrun.c:4769

0x7b7a3de0b760 is located 0 bytes inside of 56-byte region [0x7b7a3de0b760,0x7b7a3de0b798)
freed by thread T0 here:
    #0 0x7f1a43522a2d in operator delete(void*, unsigned long) /usr/src/debug/gcc/gcc/libsanitizer/asan/asan_new_delete.cpp:155
    #1 0x55fcb925d5cd in bpstat_clear(bpstat**) ../../gdb/breakpoint.c:4646
    #2 0x55fcbb69ea6a in clear_thread_inferior_resources ../../gdb/thread.c:185
    #3 0x55fcbb69f4cb in set_thread_exited(thread_info*, std::optional<unsigned long>, bool) ../../gdb/thread.c:244
    #4 0x55fcba368d64 in operator() ../../gdb/inferior.c:269
    #5 0x55fcba375e2b in clear_and_dispose<inferior::clear_thread_list()::<lambda(thread_info*)> > ../../gdb/../gdbsupport/intrusive_list.h:529
    #6 0x55fcba368f19 in inferior::clear_thread_list() ../../gdb/inferior.c:265
    #7 0x55fcba3694ba in exit_inferior(inferior*) ../../gdb/inferior.c:322
    #8 0x55fcba369e35 in detach_inferior(inferior*) ../../gdb/inferior.c:358
    #9 0x55fcba319d9f in inf_ptrace_target::detach_success(inferior*) ../../gdb/inf-ptrace.c:214
    #10 0x55fcba56a2f6 in linux_nat_target::detach(inferior*, int) ../../gdb/linux-nat.c:1582
    #11 0x55fcba62121c in thread_db_target::detach(inferior*, int) ../../gdb/linux-thread-db.c:1381
    #12 0x55fcbb5ca49e in target_detach(inferior*, int) ../../gdb/target.c:2557
    #13 0x55fcba356ba4 in detach_command(char const*, int) ../../gdb/infcmd.c:2894
    #14 0x55fcb9597eea in do_simple_func ../../gdb/cli/cli-decode.c:94
    #15 0x55fcb95b10b5 in cmd_func(cmd_list_element*, char const*, int) ../../gdb/cli/cli-decode.c:2831
    #16 0x55fcbb6f5282 in execute_command(char const*, int) ../../gdb/top.c:563
    #17 0x55fcb95eedb9 in execute_control_command_1 ../../gdb/cli/cli-script.c:526
    #18 0x55fcb95f04dd in execute_control_command(command_line*, int) ../../gdb/cli/cli-script.c:702
    #19 0x55fcb9261175 in bpstat_do_actions_1 ../../gdb/breakpoint.c:4940
    #20 0x55fcb92617da in bpstat_do_actions() ../../gdb/breakpoint.c:5012
    #21 0x55fcba3180e7 in inferior_event_handler(inferior_event_type) ../../gdb/inf-loop.c:71
    #22 0x55fcba3ba1e1 in fetch_inferior_event() ../../gdb/infrun.c:4769

previously allocated by thread T0 here:
    #0 0x7f1a435218cd in operator new(unsigned long) /usr/src/debug/gcc/gcc/libsanitizer/asan/asan_new_delete.cpp:86
    #1 0x55fcb927061f in build_bpstat_chain(address_space const*, unsigned long, target_waitstatus const&) ../../gdb/breakpoint.c:5880
    #2 0x55fcba3d63b6 in handle_signal_stop ../../gdb/infrun.c:7083
    #3 0x55fcba3d01c7 in handle_inferior_event ../../gdb/infrun.c:6574
    #4 0x55fcba3b9918 in fetch_inferior_event() ../../gdb/infrun.c:4713

This checks after executing commands of each breakpoint if the bpstat
was deleted already, and stops any further processing immediately.
Now the result looks like this:

(gdb) b main
Breakpoint 1 at 0x111d: file main.c, line 21.
(gdb) commands
Type commands for breakpoint(s) 1, one per line.
End with a line saying just "end".
>detach
>end
(gdb) run
Starting program: /home/src/lappy/binutils-gdb.git/gdb/testsuite/gdb.base/main
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/../lib/libthread_db.so.1".

main () at main.c:21
21        return 0;
[Inferior 1 (process 242940) detached]
(gdb)
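
The check described above (stop processing once the commands have deleted the bpstat) can be sketched as follows.  This is only an illustration under stated assumptions: stop_record, thread_sketch, the generation counter and run_stop_commands are hypothetical, and the real code in breakpoint.c may detect the deletion differently.

```cpp
#include <cassert>
#include <functional>
#include <memory>

// Hypothetical stand-in for one bpstat in the chain.
struct stop_record
{
  std::function<void ()> commands;
  std::unique_ptr<stop_record> next;
};

// Hypothetical stand-in for the thread that owns the chain.
struct thread_sketch
{
  std::unique_ptr<stop_record> stop_chain;
  unsigned generation = 0;

  // What detach or kill effectively does: free the whole chain.
  void clear ()
  {
    stop_chain.reset ();
    ++generation;
  }
};

// Run the commands of each record and return how many were run.
int
run_stop_commands (thread_sketch &thr)
{
  int executed = 0;
  stop_record *rec = thr.stop_chain.get ();
  while (rec != nullptr)
    {
      const unsigned before = thr.generation;

      // Copy the callable first: running it may destroy *rec.
      std::function<void ()> cmds = rec->commands;
      if (cmds)
        cmds ();
      ++executed;

      // If the chain was freed while the commands ran, rec is now
      // dangling, so stop without touching it again.
      if (thr.generation != before)
        break;

      rec = rec->next.get ();
    }
  return executed;
}
```

The important part is that nothing reads through rec after the commands have run until the ownership check has passed.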

Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=14354
Approved-By: Andrew Burgess <aburgess@redhat.com>
kito-cheng pushed a commit that referenced this pull request Jan 23, 2026
This patch adds a new test that checks for a bug that was, if not
fixed, then at least, worked around, by commit:

  commit a736ff7
  Date:   Sat Sep 27 22:29:24 2025 -0600

      Clean up iterate_over_symtabs

The bug was reported against Fedora GDB which, at the time the bug was
reported, was based on GDB 16, and so didn't include the above
commit.  The bug report can be found here:

  https://bugzilla.redhat.com/show_bug.cgi?id=2403580

To summarise the bug report: a user is inspecting an application
backtrace.  The original bug report was from a core file, but the same
issue will trigger for a live inferior.  It's the inspection of the
stack frames which is important.  The user moves up the stack with the
'up' command and eventually finds an interesting frame.  They use
'list' to view the source code at the current location, this works and
displays lines 6461 to 6470 from the source file '../glib/gmain.c'.
The user then does 'list 6450' to try and display some earlier lines
from the same source file, at which point GDB gives the message:

   warning: 6445  ../glib/gmain.c: No such file or directory

So GDB initially manages to find the source file, but for the very
next command, GDB now claims that the source file doesn't exist.

As I said, commit a736ff7 appears to fix this issue, but it
wasn't clear to me (from the commit message) if this commit was
intended to fix any bugs, or if the bug was being hidden by this
commit.  I've spent some time trying to understand what's going on,
and have come up with this test case.

I think there might still be an issue in GDB, but I do think that the
above commit really is making it so that the issue (if it is an issue)
doesn't occur in that particular situation any more, so I think we can
consider the above commit a fix, and testing for this bug is worth
while to ensure it doesn't get reintroduced.

In order to trigger this bug we need these high level requirements:

  1. Multiple shared libraries compiled from the same source tree.  In
     this case it was glib, but the test in this commit uses a much
     smaller library.

  2. Common DWARF must be pulled from the libraries using the 'dwz'
     tool.

  3. Debuginfod must be in use for at least downloading the source
     code.  In the original bug, and in the test presented here,
     debuginfod is used for fetching both the debug info, and the
     source code for the library.

There are some additional specific requirements for the DWARF in order
to trigger the bug, but to make discussing this easier, let's look at
the structure of the test presented here.  When discussing the source
files I'll drop the solib-with-dwz- prefix, e.g. when I mention
'foo.c' I really mean 'solib-with-dwz-foo.c'.

There are three shared libraries built for this test, libbar.so,
libfoo.so, and libfoo-2.so.  The source file bar.c is used to create
libbar.so, and foo.c is used to create libfoo.so and libfoo-2.so.

The main test executable is built from main.c, and links against
libbar.so and libfoo.so.  libfoo-2.so is not used by the main
executable, and just exists to trigger some desired behaviour from the
dwz tool.

The debug information for each shared library is extracted into a
corresponding .debug file, and the dwz tool is used to extract common
debug from the three .debug files into a file called 'common.dwz'.

Given all this then, in order to trigger the bug, the following
additional requirements must be met:

  4. libbar.so must NOT make use of foo.c.  In this test libbar.so is
     built from bar.c (and some headers) only.

  5. A reference to foo.c must be placed into common.dwz.  This is why
     libfoo-2.so exists, as this library is almost identical to
     libfoo.so, there is lots of shared DWARF between libfoo.so and
     libfoo-2.so which can be moved into common.dwz, this shared DWARF
     includes references to foo.c, so an entry for foo.c is added to
     the file table list in common.dwz.

  6. There must be a DWARF construct within libbar.so.debug that
     references common.dwz, and which causes GDB to parse the line
     table from within common.dwz.  For more details on this, see
     below.

  7. We need libbar.so to appear before libfoo.so in GDB's
     compunit_symtab lists.  This means that GDB will scan the symtabs
     for libbar.so before checking the symtabs of libfoo.so.  I
     achieve this by mentioning libbar.so first when building the
     executable, but this is definitely the most fragile part of the
     test.

To satisfy requirement (6) the inline function 'add_some_int' is added
to the test.  This function appears in both libbar.so and libfoo.so,
this means that the DW_TAG_subprogram representing the abstract
instance tree will be moved into common.dwz.  However, as this is an
inline function, the DW_TAG_inlined_subroutine DIEs for each concrete
instance, will be left in libbar.so.debug and libfoo.so.debug, with a
DW_AT_abstract_origin that points into common.dwz.

When GDB parses libbar.so.debug it finds the DW_TAG_inlined_subroutine
and begins processing it.  It sees the DW_AT_abstract_origin and so
jumps into common.dwz to read the DIEs that define the inline
function.  Here is the DWARF from libbar.so.debug for the inlined
instance:

 <2><91>: Abbrev Number: 3 (DW_TAG_inlined_subroutine)
    <92>   DW_AT_abstract_origin: <alt 0x1b>
    <96>   DW_AT_low_pc      : 0x1121
    <9e>   DW_AT_high_pc     : 31
    <9f>   DW_AT_call_file   : 1
    <a0>   DW_AT_call_line   : 26
    <a1>   DW_AT_call_column : 15
 <3><a2>: Abbrev Number: 5 (DW_TAG_formal_parameter)
    <a3>   DW_AT_abstract_origin: <alt 0x2c>
    <a7>   DW_AT_location    : 2 byte block: 91 68      (DW_OP_fbreg: -24)
 <3><aa>: Abbrev Number: 5 (DW_TAG_formal_parameter)
    <ab>   DW_AT_abstract_origin: <alt 0x25>
    <af>   DW_AT_location    : 2 byte block: 91 6c      (DW_OP_fbreg: -20)

And here's the DWARF from common.dwz for the abstract instance tree:

 <1><1b>: Abbrev Number: 7 (DW_TAG_subprogram)
    <1c>   DW_AT_name        : (indirect string, offset: 0x18a): add_some_int
    <20>   DW_AT_decl_file   : 1
    <21>   DW_AT_decl_line   : 24
    <22>   DW_AT_decl_column : 1
    <23>   DW_AT_prototyped  : 1
    <23>   DW_AT_type        : <0x14>
    <24>   DW_AT_inline      : 3        (declared as inline and inlined)
 <2><25>: Abbrev Number: 8 (DW_TAG_formal_parameter)
    <26>   DW_AT_name        : a
    <28>   DW_AT_decl_file   : 1
    <29>   DW_AT_decl_line   : 24
    <2a>   DW_AT_decl_column : 19
    <2b>   DW_AT_type        : <0x14>
 <2><2c>: Abbrev Number: 8 (DW_TAG_formal_parameter)
    <2d>   DW_AT_name        : b
    <2f>   DW_AT_decl_file   : 1
    <30>   DW_AT_decl_line   : 24
    <31>   DW_AT_decl_column : 26
    <32>   DW_AT_type        : <0x14>

While processing the common.dwz DIEs GDB sees the DW_AT_decl_file
attributes, and this triggers a read of the file table within
common.dwz, which creates symtabs for any files mentioned, if the
symtabs don't already exist.

But, and this is the important bit, when doing this, GDB is creating a
compunit_symtab for libbar.so.debug, so any symtabs created will be
attached to the libbar.so.debug objfile.

Remember requirement (5), the file list in common.dwz mentions
'foo.c', so even though libbar.so doesn't use 'foo.c' we end up with a
symtab for 'foo.c' created within the compunit_symtab for
libbar.so.debug!

I don't think this is ideal.  This wastes memory and time; we have
more symtabs to search through even if, as I'll discuss below, we
usually end up ignoring these symtabs.

The exact path that triggers this weird symtab creation starts with a
call to 'new_symbol' (dwarf2/read.c) for the DW_TAG_formal_parameter
DIEs in the abstract instance tree.  These include DW_AT_decl_file, which
is read in 'new_symbol'.  In 'new_symbol' GDB spots that the
line_header has not yet been read in, so handle_DW_AT_stmt_list is
called which reads the file/line table and then calls
'dwarf_decode_lines' (line_program.c), which then creates symtabs for
all the files mentioned.

This symtab creation issue still exists today in GDB, though I've not
been able to find any real issues that this is causing after commit
a736ff7 fixed the issue I'm discussing here.

So, having tricked GDB into creating a misplaced symtab, what problem
did this cause prior to commit a736ff7?

To answer this, we need to take a diversion to understand how a
command like 'list 6450' works.  The two interesting functions are
create_sals_line_offset and decode_digits_list_mode, which is called
from the former.  The create_sals_line_offset is called indirectly
from list_command via the initial call to decode_line_1.

In create_sals_line_offset, if the incoming linespec doesn't specify a
specific symtab, then GDB uses the name of the default symtab to
lookup every symtab with a matching name, this is done with the line:

  ls->file_symtabs
    = collect_symtabs_from_filename (self->default_symtab->filename (),
                                     self->search_pspace);

In our case, when the default symtab is 'foo.c', this is going to
return multiple symtabs, these will include the correct 'foo.c' symtab
from libfoo.so, but will also include the misplaced 'foo.c' symtab
from libbar.so.  This is where the ordering is important.  As list
will only ever list one file, at a later point in this process we're
going to toss out everything except the first result.  So, to trigger
the bug, it is critical that the FIRST result returned here be the
misplaced 'foo.c' symtab from libbar.so.  In the test I try to ensure
this by mentioning libbar.so before libfoo.so when building the
executable, which currently means we get back the misplaced symtab
first, but this could change in the future and wouldn't necessarily
mean that the problem has gone away.

Having got the symtab list GDB then calls decode_digits_list_mode
which iterates over the symtabs and converts them into symtab_and_line
objects, at the heart of which is a call to find_line_symtab, which
checks if a given symtab has a line table entry for the desired line.
If it does then the symtab is returned.  If it doesn't then GDB looks
for another symtab with the same name that does have a line table
entry.  If no suitably named symtab has an exact match, then the
symtab with the closest line above the required line is returned.  If
no symtab has a matching line table entry then find_line_symtab
returns NULL.
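
The selection rules just described can be sketched like this.  It is a simplified model, not the code in symtab.c: symtab_sketch and find_line_symtab_sketch are hypothetical, and the line table is reduced to a list of line numbers.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, heavily simplified symtab.
struct symtab_sketch
{
  std::string filename;
  std::string objfile;
  std::vector<int> lines;  // Lines that have a line table entry.
};

const symtab_sketch *
find_line_symtab_sketch (const symtab_sketch &start,
                         const std::vector<symtab_sketch> &all, int line)
{
  // 1. An exact match in the starting symtab wins.
  for (int l : start.lines)
    if (l == line)
      return &start;

  // 2. Otherwise look at every symtab with the same name.  An exact
  //    match wins; failing that, remember the closest line above.
  const symtab_sketch *best = nullptr;
  int best_line = 0;
  for (const symtab_sketch &s : all)
    {
      if (s.filename != start.filename)
        continue;
      for (int l : s.lines)
        {
          if (l == line)
            return &s;
          if (l > line && (best == nullptr || l < best_line))
            {
              best = &s;
              best_line = l;
            }
        }
    }

  // 3. nullptr when no same-named symtab has an entry at or above LINE.
  return best;
}
```

When the result is null, the caller (decode_digits_list_mode in the text) keeps using the symtab it started with, which is what lets a misplaced symtab survive.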

Remember, the misplaced symtab was only created as a side effect of
trying to attach the DW_TAG_formal_parameter symbol to a symtab.
The actual line table for libbar.so (in libbar.so.debug) has no line
table entries for 'foo.c'.  What this means is that the line table for
'foo.c' attached to libbar.so.debug is empty.  So normally what
happens is that find_line_symtab will instead find the symtab for
'foo.c' in libfoo.so.debug, which does have a suitable line table
entry, and will switch GDB back to that symtab, effectively avoiding
the problem.  However, that is not what happens in the bug case.  In
the bug case find_line_symtab returns NULL, which means that
decode_digits_list_mode just uses the original symtab, in this case
the symtab for 'foo.c' from libbar.so.debug.

In the original bug, the code is compiled with -O2, and this
optimisation has left the line table covering the problem file pretty
sparse.  In fact, there are no line table entries for any line after
the line that the user is trying to list.  This is why
find_line_symtab doesn't find a better alternative symtab, and instead
just returns NULL.

In the test I've replicated this by having a comment at the end of the
source file, and asking GDB to list a line within this comment.  The
result is that there are no line table entries for that line in any
'foo.c' symtab, and so find_line_symtab returns NULL.

After decode_digits_list_mode sees the NULL from find_line_symtab, it
just uses the initial symtab.

After this we eventually return back to list_command (cli/cli-cmds.c)
with a list of symtab_and_line objects.  The first entry in this list
is for the symtab 'foo.c' from libbar.so.  In list_command we call
filter_sals which throws away everything but the first entry as all
the symtabs have the same filename (and are in the same program
space).

Using the symtab we build an absolute path to the source file.

Now, if the source is installed locally, GDB performs no additional
checks; we found a symtab, the symtab gave us a source filename, if
the source file exists on disk, then the required lines are listed for
the user.

But if the source file doesn't exist on disk, then we are going to ask
debuginfod for the source file.  To do that we use two pieces of
information; the absolute path to the source file, which we have; and
the build-id of an objfile, this is the objfile that owns the symtab
we are trying to get the source for.  In this case libbar.so.  And so
we send the build-id and filename to debuginfod.

Now debuginfod isn't going to just serve any file to anyone, that
would be a security issue for the server.  Instead, debuginfod scans
the DWARF and builds up its own model of which objfiles use which
source files, and for a given build-id, debuginfod will only serve
back files that the objfile matching that build-id actually uses.
So, in this case, when we ask for 'foo.c' from libbar.so, debuginfod
correctly realises that 'foo.c' is not part of libbar.so, and refuses
to send the file back.
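
The server-side behaviour described above can be modelled with a small sketch.  This is not debuginfod's implementation: source_index_sketch, its methods, and the build-ids and paths used with it are all hypothetical.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical model of "which build-id uses which source files".
class source_index_sketch
{
public:
  // Record that the objfile with BUILD_ID uses the source file PATH.
  void record (const std::string &build_id, const std::string &path)
  {
    m_sources[build_id].insert (path);
  }

  // A file is only served when it was recorded for that build-id.
  bool would_serve (const std::string &build_id,
                    const std::string &path) const
  {
    auto it = m_sources.find (build_id);
    return it != m_sources.end () && it->second.count (path) != 0;
  }

private:
  std::map<std::string, std::set<std::string>> m_sources;
};
```

In this model, asking for 'foo.c' with the build-id of libbar.so fails even though the same path is served for libfoo.so.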

And this is how the original bug occurred.

So, why does commit a736ff7 fix this problem?  The answer is in
iterate_over_symtabs, which is used by collect_symtabs_from_filename
to find the matching symtabs.

Prior to this commit, iterate_over_symtabs had two phases, first a
call to iterate_over_some_symtabs which walks over compunit_symtabs
that already exist looking for matches, during this phase only the
symtab filenames are considered.  The second phase uses
objfile::map_symtabs_matching_filename to look through the objfiles
and expand new symtabs that match the required name.  In our case, by
the time iterate_over_symtabs is called, all of the interesting
symtabs have already been expanded, so we only perform the filename
check in iterate_over_some_symtabs, this passes, and so 'foo.c' from
libbar.so is considered a suitable symtab.

After commit a736ff7 the initial call to
iterate_over_some_symtabs has been removed from iterate_over_symtabs,
and only the objfile::map_symtabs_matching_filename call remains.
This ends up in cooked_index_functions::search (dwarf2/read.c) to
search for matching symtabs.

The first thing cooked_index_functions::search does is set up a vector
of CUs to skip by calling dw_search_file_matcher, this then calls
dw2_get_file_names to get the file and line table for a CU, this
function in turn creates a cutu_reader object, passing true for the
'skip_partial' argument to its constructor.

As our 'foo.c' symtab was created from within the dwz extracted DWARF,
then it is associated with the DW_TAG_partial_unit that held the
DW_TAG_subprogram DIEs that were being processed when the misplaced
symtab was originally created; this is a partial unit.  As this is a
partial unit, and the skip_partial flag was passed true, the
cutu_reader::is_dummy function will return true.

Back in dw2_get_file_names, if cutu_reader::is_dummy is true then
dw2_get_file_names_reader is never called, and the file names are
never read.  This means that back in dw_search_file_matcher, the file
data, returned from dw2_get_file_names is NULL, and so this CU is
marked to be skipped.  Which is exactly what we want, this misplaced
symtab, which was created for a partial unit and associated with
libbar.so, is skipped and never considered as a possible match.
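
The skip logic walked through above can be sketched as follows.  The types and function names are hypothetical simplifications of dw_search_file_matcher and dw2_get_file_names, and skipping units whose file names do not match is an assumption about the matcher, not something stated here.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Hypothetical, simplified unit.
struct unit_sketch
{
  bool is_partial;
  std::vector<std::string> file_names;
};

// With skip_partial set, a partial unit is treated as a dummy and
// yields no file data at all.
std::optional<std::vector<std::string>>
get_file_names_sketch (const unit_sketch &unit, bool skip_partial)
{
  if (skip_partial && unit.is_partial)
    return std::nullopt;
  return unit.file_names;
}

// Build the per-unit "skip" vector for a search by file name.
std::vector<bool>
units_to_skip_sketch (const std::vector<unit_sketch> &units,
                      const std::string &wanted)
{
  std::vector<bool> skip;
  for (const unit_sketch &unit : units)
    {
      auto names = get_file_names_sketch (unit, true);
      bool matches = false;
      if (names.has_value ())
        for (const std::string &name : *names)
          if (name == wanted)
            matches = true;

      // No file data (partial unit) or no matching name: skip it.
      skip.push_back (!matches);
    }
  return skip;
}
```

A partial unit that mentions 'foo.c' is skipped here, while a full unit mentioning the same file is kept.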

There is a remaining problem, which is marked in the test with an
xfail.  That is, when the test does the 'list LINENO', GDB still tries
to download the source for 'foo.c' from libbar.so.  The reason for
this is that, while it is true that the initial
collect_symtabs_from_filename call no longer returns 'foo.c' from
libbar.so, when decode_digits_list_mode calls find_line_symtab for the
correct 'foo.c' from libfoo.so, it is still the case that there is no
exact match for LINENO in that symtab's line table.

As a result, GDB looks through all the other symtabs for 'foo.c' to
see if any are a better match.  Checking if another symtab is a
possible better match requires a full comparison of the symtab's source
file name, which in this case triggers an attempt to download the
source file from debuginfod.  Here's the backtrace at the time of the
rogue source download request, which appears as an xfail in the test
presented here:

  #0  debuginfod_source_query (build_id=..., build_id_len=..., srcpath=..., destname=...) at ../../src/gdb/debuginfod-support.c:332
  #1  0x0000000000f0bb3b in open_source_file (s=...) at ../../src/gdb/source.c:1152
  #2  0x0000000000f0be42 in symtab_to_fullname (s=...) at ../../src/gdb/source.c:1214
  #3  0x0000000000f6dc40 in find_line_symtab (sym_tab=..., line=..., index=...) at ../../src/gdb/symtab.c:3314
  #4  0x0000000000aea319 in decode_digits_list_mode (self=..., ls=..., line=...) at ../../src/gdb/linespec.c:3939
  #5  0x0000000000ae4684 in create_sals_line_offset (self=..., ls=...) at ../../src/gdb/linespec.c:2039
  #6  0x0000000000ae557f in convert_linespec_to_sals (state=..., ls=...) at ../../src/gdb/linespec.c:2289
  #7  0x0000000000ae6546 in parse_linespec (parser=..., arg=..., match_type=...) at ../../src/gdb/linespec.c:2647
  #8  0x0000000000ae7605 in location_spec_to_sals (parser=..., locspec=...) at ../../src/gdb/linespec.c:3045
  #9  0x0000000000ae7c7f in decode_line_1 (locspec=..., flags=..., search_pspace=..., default_symtab=..., default_line=...) at ../../src.dev-m/gdb/linespec.c:3167

I think that this might not be what we really want to do here.  After
downloading the source file we'll end up with a filename within the
debuginfod download cache, which will be different for each
objfile (the cache partitions downloads based on build-id).  So if two
symtabs originate from the same source file, but are in two different
objfiles, then, when the source is on disk, the filenames for these
symtabs will be identical, and the symtabs will be considered
equivalent by find_line_symtab.  But when debuginfod is downloading
the source, the source paths will be different, and find_line_symtab
will consider the symtabs different.  This doesn't seem right to me.
But I'm going to leave worrying about that for another day.
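
The mismatch described above can be shown with a tiny sketch.  The cache layout used here (one directory per build-id with a flattened file name) is purely hypothetical, as are the function names; the real debuginfod client cache may be organised differently.

```cpp
#include <cassert>
#include <string>

// Hypothetical cache layout: CACHE_ROOT/BUILD_ID/flattened-source-path.
std::string
cached_source_path_sketch (const std::string &cache_root,
                           const std::string &build_id,
                           const std::string &source_path)
{
  std::string flat = source_path;
  for (char &c : flat)
    if (c == '/')
      c = '_';
  return cache_root + "/" + build_id + "/" + flat;
}

// Stand-in for the full file name comparison between two symtabs.
bool
same_source_file_sketch (const std::string &a, const std::string &b)
{
  return a == b;
}
```

With the source on disk both symtabs resolve to the same path and compare equal; once each objfile's copy lands under its own build-id directory, the same comparison fails.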

Given this last bug, I am of the opinion that the misplaced symtab is
likely a bug, though after commit a736ff7, the only issue I can
find is the extra debuginfod download request, which isn't huge.  But
still, maybe just reducing the number of symtabs would be worth it?

But this patch isn't about fixing any bugs, it's about adding a test
case for an issue that was a problem, but isn't any longer.

Approved-By: Tom Tromey <tom@tromey.com>