@jasonmolenda
Collaborator

The Mach-O load commands include LC_SYMTAB / struct symtab_command, which records the offsets of the symbol table (nlist records) and the string table for this binary. In a Mach-O binary on disk, these are file offsets. If a Mach-O binary is loaded in memory with all segments consecutive, `symoff` and `stroff` are the offsets from the virtual address of the TEXT segment (aka the Mach-O header) to the virtual addresses of the start of these tables.

However, if a Mach-O binary is part of the shared cache, its segments are separated and have different slide values. The LINKEDIT segment can then be more than 4GB away from the TEXT segment in the virtual address space, so these 32-bit offsets cannot express the distance from the TEXT segment to these tables.

Create separate uint64_t variables to track the offset to the symbol table and string table, instead of reusing the 32-bit ones in the symtab_command structure.

rdar://140432279
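
To make the overflow concrete, here is a minimal sketch of what happens when a TEXT-to-LINKEDIT distance is stored in a 32-bit field. The helper names and the addresses used with them are hypothetical and are not taken from LLDB; only the arithmetic is the point.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not LLDB code): distance from the TEXT segment to a
// table inside LINKEDIT, computed and kept in 64 bits.
uint64_t OffsetFromText(uint64_t text_vmaddr, uint64_t table_vmaddr) {
  assert(table_vmaddr >= text_vmaddr && "table is expected to follow TEXT");
  return table_vmaddr - text_vmaddr;
}

// Returns true when the offset survives a round trip through a uint32_t
// field such as symtab_command::symoff or symtab_command::stroff.
bool FitsIn32Bits(uint64_t offset) {
  return static_cast<uint64_t>(static_cast<uint32_t>(offset)) == offset;
}
```

With a made-up layout where LINKEDIT sits 4GB + 0x4000 above TEXT, `FitsIn32Bits` returns false, and storing the offset in a `uint32_t` would silently keep only the low 32 bits (0x4000).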

@llvmbot
Member

llvmbot commented Nov 27, 2024

@llvm/pr-subscribers-lldb

Author: Jason Molenda (jasonmolenda)


Full diff: https://github.com/llvm/llvm-project/pull/117832.diff

1 Files Affected:

  • (modified) lldb/source/Plugins/ObjectFile/Mach-O/ObjectFileMachO.cpp (+20-6)
diff --git a/lldb/source/Plugins/ObjectFile/Mach-O/ObjectFileMachO.cpp b/lldb/source/Plugins/ObjectFile/Mach-O/ObjectFileMachO.cpp
index 079fd905037d45..5f047d84d53e73 100644
--- a/lldb/source/Plugins/ObjectFile/Mach-O/ObjectFileMachO.cpp
+++ b/lldb/source/Plugins/ObjectFile/Mach-O/ObjectFileMachO.cpp
@@ -2244,6 +2244,18 @@ void ObjectFileMachO::ParseSymtab(Symtab &symtab) {
   // code.
   typedef AddressDataArray<lldb::addr_t, bool, 100> FunctionStarts;
 
+  // The virtual address offset from TEXT to the symbol/string tables
+  // in the LINKEDIT section.  The LC_SYMTAB symtab_command `symoff` and
+  // `stroff` are uint32_t's that give the file offset in the binary.
+  // If the binary is laid down in memory with all segments consecutive,
+  // then these are the offsets from the mach-o header aka TEXT segment
+  // to the tables' virtual addresses.
+  // But if the binary is loaded in virtual address space with different
+  // slides for the segments (e.g. a shared cache), the LINKEDIT may be
+  // more than 4GB away from TEXT, and a 32-bit offset is not sufficient.
+  offset_t symbol_table_offset_from_TEXT = 0;
+  offset_t string_table_offset_from_TEXT = 0;
+
   // Record the address of every function/data that we add to the symtab.
   // We add symbols to the table in the order of most information (nlist
   // records) to least (function starts), and avoid duplicating symbols
@@ -2282,6 +2294,8 @@ void ObjectFileMachO::ParseSymtab(Symtab &symtab) {
       if (m_data.GetU32(&offset, &symtab_load_command.symoff, 4) ==
           nullptr) // fill in symoff, nsyms, stroff, strsize fields
         return;
+      string_table_offset_from_TEXT = symtab_load_command.stroff;
+      symbol_table_offset_from_TEXT = symtab_load_command.symoff;
       break;
 
     case LC_DYLD_INFO:
@@ -2403,9 +2417,9 @@ void ObjectFileMachO::ParseSymtab(Symtab &symtab) {
 
       const addr_t linkedit_file_offset = linkedit_section_sp->GetFileOffset();
       const addr_t symoff_addr = linkedit_load_addr +
-                                 symtab_load_command.symoff -
+                                 symbol_table_offset_from_TEXT -
                                  linkedit_file_offset;
-      strtab_addr = linkedit_load_addr + symtab_load_command.stroff -
+      strtab_addr = linkedit_load_addr + string_table_offset_from_TEXT -
                     linkedit_file_offset;
 
       // Always load dyld - the dynamic linker - from memory if we didn't
@@ -2473,17 +2487,17 @@ void ObjectFileMachO::ParseSymtab(Symtab &symtab) {
       lldb::addr_t linkedit_offset = linkedit_section_sp->GetFileOffset();
       lldb::offset_t linkedit_slide =
           linkedit_offset - m_linkedit_original_offset;
-      symtab_load_command.symoff += linkedit_slide;
-      symtab_load_command.stroff += linkedit_slide;
+      symbol_table_offset_from_TEXT += linkedit_slide;
+      string_table_offset_from_TEXT += linkedit_slide;
       dyld_info.export_off += linkedit_slide;
       dysymtab.indirectsymoff += linkedit_slide;
       function_starts_load_command.dataoff += linkedit_slide;
       exports_trie_load_command.dataoff += linkedit_slide;
     }
 
-    nlist_data.SetData(m_data, symtab_load_command.symoff,
+    nlist_data.SetData(m_data, symbol_table_offset_from_TEXT,
                        nlist_data_byte_size);
-    strtab_data.SetData(m_data, symtab_load_command.stroff,
+    strtab_data.SetData(m_data, string_table_offset_from_TEXT,
                         strtab_data_byte_size);
 
     // We shouldn't have exports data from both the LC_DYLD_INFO command
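
The in-memory branch of the hunk above rebases a table offset onto the loaded LINKEDIT segment. A minimal sketch of that arithmetic follows; the function name is hypothetical, and the parameters only mirror the roles of `linkedit_load_addr`, the table offset, and `linkedit_file_offset` in the diff.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the address arithmetic in the diff: the table's
// position inside LINKEDIT (table_offset - linkedit_file_offset) is added to
// the address where LINKEDIT is actually loaded.
uint64_t TableLoadAddress(uint64_t linkedit_load_addr, uint64_t table_offset,
                          uint64_t linkedit_file_offset) {
  assert(table_offset >= linkedit_file_offset &&
         "table is expected to live inside LINKEDIT");
  return linkedit_load_addr + (table_offset - linkedit_file_offset);
}
```

Keeping every operand 64 bits wide means the result stays correct even when the offset has been adjusted by a slide larger than 4GB.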

@jasonmolenda
Collaborator Author

I have two criticisms of the patch as-is. We read the LC_SYMTAB load command into symtab_load_command, which has six fields, the first two being the usual cmd and cmdsize. We use all of the next four: symoff, nsyms, stroff, strsize. I'm taking the uses of two of these fields out of symtab_load_command and putting them in 64-bit locals. But I think the smarter move is probably to read the load command into a temporary object and then create four local variables with the resized fields as needed, or to have a locally defined struct with the resized fields.

The second criticism is that I don't have a good way to test it. I have a request to have ProcessMachCore treat an LC_SEGMENT that has a virtual address and size, but no file size, as an all-zeroes segment. With that, it would be possible to create a Mach-O corefile that is larger than 4GB in size but only uses a couple hundred KB on disk (and doesn't fill the CI filesystems), and then we'd have to hand-write a Mach-O file with a LINKEDIT 4GB away from the TEXT segment. A couple of the pieces needed to do all of this don't exist right now, though.

@Michael137
Member

Is changing struct symtab_command to have the right-sized fields a no-go?

@jasonmolenda
Collaborator Author

> Is changing struct symtab_command to have the right-sized fields a no-go?

Yes, this is one possibility, and maybe the best. The current structure reflects the in-binary layout, so it can be read (with endian fixing) in one go.

@jasonmolenda
Collaborator Author

Updated the patch to have a local symtab_command structure in ObjectFileMachO with larger offset fields, and to read the load command fields into this structure. This simplifies the patch; the existing slide calculations can remain unmodified now that 64-bit offsets are being used.
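
As a rough illustration of that approach (not the code that was merged), a widened mirror of `symtab_command` could be filled field by field from the raw load command bytes. The struct and function names below are hypothetical, the field order follows the on-disk `symtab_command` layout (cmd, cmdsize, symoff, nsyms, stroff, strsize), and the sketch assumes the bytes are already in host byte order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical widened mirror of symtab_command: same fields, but the two
// offsets are 64-bit so they can later hold large virtual address offsets.
struct WideSymtabCommand {
  uint32_t cmd = 0;
  uint32_t cmdsize = 0;
  uint64_t symoff = 0; // widened from uint32_t
  uint32_t nsyms = 0;
  uint64_t stroff = 0; // widened from uint32_t
  uint32_t strsize = 0;
};

// Reads one host-endian uint32_t at byte position `pos` and advances `pos`.
static uint32_t ReadU32(const uint8_t *bytes, size_t &pos) {
  uint32_t value;
  std::memcpy(&value, bytes + pos, sizeof(value));
  pos += sizeof(value);
  return value;
}

// Parses the six 32-bit on-disk fields into the widened structure.
WideSymtabCommand ParseSymtabCommand(const uint8_t *bytes) {
  assert(bytes != nullptr);
  size_t pos = 0;
  WideSymtabCommand command;
  command.cmd = ReadU32(bytes, pos);
  command.cmdsize = ReadU32(bytes, pos);
  command.symoff = ReadU32(bytes, pos);
  command.nsyms = ReadU32(bytes, pos);
  command.stroff = ReadU32(bytes, pos);
  command.strsize = ReadU32(bytes, pos);
  return command;
}
```

Because `symoff` and `stroff` are 64-bit in the local structure, a later `command.symoff += linkedit_slide` style adjustment can exceed 4GB without truncation, which is why the slide calculations need no further changes.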

Member

@JDevlieghere JDevlieghere left a comment


I like @Michael137's suggestion and the name makes it obvious what's going on. LGTM.

@jasonmolenda jasonmolenda merged commit 448ac7d into llvm:main Nov 28, 2024
7 checks passed
@jasonmolenda jasonmolenda deleted the use-offset_t-for-symbol-table-and-string-table-offsets branch November 28, 2024 18:32
jasonmolenda added a commit to jasonmolenda/llvm-project that referenced this pull request Nov 28, 2024
(cherry picked from commit 448ac7d)
JDevlieghere added a commit to swiftlang/llvm-project that referenced this pull request Dec 2, 2024
…-vm-offset-shared-cache-segments-61

[lldb][Mach-O] Handle shared cache binaries correctly (llvm#117832)
jasonmolenda added a commit to swiftlang/llvm-project that referenced this pull request Dec 2, 2024
…-vm-offset-shared-cache-segments

[lldb][Mach-O] Handle shared cache binaries correctly (llvm#117832)
adrian-prantl pushed a commit to adrian-prantl/llvm-project that referenced this pull request Dec 5, 2024
(cherry picked from commit 448ac7d)
jasonmolenda added a commit to jasonmolenda/llvm-project that referenced this pull request Sep 19, 2025
The Mach-O file format has several load commands which specify the
location of data in the file in UInt32 offsets.  lldb uses these
same structures to track the offsets of the binary in virtual address
space when it is running.  Normally a binary is loaded in memory
contiguously, so this is fine, but on Darwin systems there is a
"system shared cache" where all system libraries are combined into
one region of memory and pre-linked.  The shared cache has the TEXT
segments for every binary loaded contiguously, then the DATA segments,
and finally a shared common LINKEDIT segment for all binaries.  The
virtual address offset from the TEXT segment for a library to the
LINKEDIT may exceed 4GB of virtual address space depending on the
structure of the shared cache, so this use of a UInt32 offset will
not work.

There was an initial instance of this issue that I fixed last November
in llvm#117832 where I
fixed this issue for the LC_SYMTAB / `symtab_command` structure.  But
we have the same issue now with three additional structures:
`linkedit_data_command`, `dyld_info_command`, and `dysymtab_command`.
For all of these we can see the pattern of `dyld_info.export_off +=
linkedit_slide` applied to the offset fields in ObjectFileMachO.

This defines local structures that mirror the Mach-O structures,
except that it uses UInt64 offset fields so we can reuse the same
field for a large virtual address offset at runtime.  I defined
ctor's from the genuine structures, as well as operator= methods
so the structures can be read from the Mach-O binary into the
standard object, then copied into our local expanded versions of
them.  These structures are ABI in Mach-O and cannot change their
layout.

The alternative is to create local variables alongside these Mach-O
load command objects for the offsets that we care about, adjust
those by the correct VA offsets, and only use those local variables
instead of the fields in the objects.  I took the approach of the
local enhanced structure in November and I think it is the cleaner
approach.

rdar://160384968
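
The constructor and `operator=` pattern described in this commit message can be sketched as follows for `linkedit_data_command`. Both struct names are hypothetical: `RawLinkeditDataCommand` is a local stand-in for the ABI-fixed Mach-O structure (cmd, cmdsize, dataoff, datasize) so the example compiles without Darwin headers, and `ApplySlide` is an illustrative helper rather than an LLDB function.

```cpp
#include <cassert>
#include <cstdint>

// Local stand-in for the on-disk linkedit_data_command layout, which is ABI
// and cannot change.
struct RawLinkeditDataCommand {
  uint32_t cmd;
  uint32_t cmdsize;
  uint32_t dataoff;
  uint32_t datasize;
};

// Hypothetical widened version: identical fields except for a 64-bit dataoff.
struct WideLinkeditDataCommand {
  uint32_t cmd = 0;
  uint32_t cmdsize = 0;
  uint64_t dataoff = 0;
  uint32_t datasize = 0;

  WideLinkeditDataCommand() = default;

  // Converting constructor from the raw structure.
  WideLinkeditDataCommand(const RawLinkeditDataCommand &raw) { *this = raw; }

  // Assignment from the raw structure, widening dataoff.
  WideLinkeditDataCommand &operator=(const RawLinkeditDataCommand &raw) {
    cmd = raw.cmd;
    cmdsize = raw.cmdsize;
    dataoff = raw.dataoff;
    datasize = raw.datasize;
    return *this;
  }
};

// Illustrative slide adjustment applied to the widened offset.
void ApplySlide(WideLinkeditDataCommand &command, uint64_t slide) {
  assert(command.dataoff <= UINT64_MAX - slide && "offset would wrap");
  command.dataoff += slide;
}
```

The raw structure can still be read from the binary in one go, and the widened copy is the one that carries the adjusted offset afterwards.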
jasonmolenda added a commit that referenced this pull request Sep 20, 2025
llvm-sync bot pushed a commit to arm/arm-toolchain that referenced this pull request Sep 20, 2025
jasonmolenda added a commit to jasonmolenda/llvm-project that referenced this pull request Sep 22, 2025
(cherry picked from commit 3e57a0d)