@jasonmolenda
Collaborator

The Mach-O file format has several load commands that specify the location of data in the file using UInt32 offsets. lldb uses these same structures to track the offsets of the binary in virtual address space when it is running. Normally a binary is loaded into memory contiguously, so this is fine, but Darwin systems have a "system shared cache" where all system libraries are combined into one region of memory and pre-linked. The shared cache lays out the TEXT segments for every binary contiguously, then the DATA segments, and finally a common LINKEDIT segment shared by all binaries. The virtual address offset from a library's TEXT segment to the LINKEDIT segment may exceed 4GB depending on the structure of the shared cache, so a UInt32 offset cannot represent it.
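To make the failure mode concrete, here is a minimal sketch of what happens when a slide is added to an offset that is stored back into 32 bits. The helper names `SlideInto32` and `SlideInto64` are hypothetical and do not exist in lldb; the values are illustrative, not taken from a real shared cache.

```cpp
#include <cstdint>

// Hypothetical helper (not from lldb): applies a slide to an offset that is
// stored back into a 32-bit field, as a uint32_t load-command field would be.
constexpr uint32_t SlideInto32(uint32_t file_offset, uint64_t slide) {
  // The sum is computed in 64 bits, then truncated to the low 32 bits.
  return static_cast<uint32_t>(file_offset + slide);
}

// The same adjustment stored into a 64-bit field keeps the full value.
constexpr uint64_t SlideInto64(uint32_t file_offset, uint64_t slide) {
  return static_cast<uint64_t>(file_offset) + slide;
}

// A slide of exactly 4GB disappears entirely from the 32-bit result.
static_assert(SlideInto32(0x1000, 0x100000000ULL) == 0x1000, "truncated");
static_assert(SlideInto64(0x1000, 0x100000000ULL) == 0x100001000ULL, "kept");
```

The truncated value still looks like a plausible offset, which is why this kind of bug tends to show up as bad data reads rather than an obvious error.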

I fixed an initial instance of this issue last November in #117832, for the LC_SYMTAB / symtab_command structure. We now have the same issue with three additional structures: linkedit_data_command, dyld_info_command, and dysymtab_command. For all of these, ObjectFileMachO applies the pattern dyld_info.export_off += linkedit_slide to the offset fields.

This PR defines local structures that mirror the Mach-O structures, except that they use UInt64 offset fields so the same field can hold a large virtual address offset at runtime. I defined constructors from the genuine structures, as well as operator= methods, so the structures can be read from the Mach-O binary into the standard object and then copied into our local expanded versions. The genuine structures are ABI in Mach-O and cannot change their layout.
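The widening pattern described above can be sketched in isolation roughly as follows. `RawLinkeditDataCommand` is a stand-in for `llvm::MachO::linkedit_data_command` so the example compiles without LLVM headers, and `WideLinkeditDataCommand` and `LoadAndSlide` are hypothetical names chosen for illustration. The members are marked `constexpr` (which assumes C++14 or later) only so the behavior can be checked at compile time; the structures in the diff are not `constexpr`.

```cpp
#include <cstdint>

// Stand-in for llvm::MachO::linkedit_data_command; the field names follow
// the Mach-O structure, and the layout is fixed by the file format.
struct RawLinkeditDataCommand {
  uint32_t cmd;
  uint32_t cmdsize;
  uint32_t dataoff;
  uint32_t datasize;
};

// Local mirror: same fields, but the offset field is 64 bits wide.
struct WideLinkeditDataCommand {
  constexpr WideLinkeditDataCommand() = default;
  constexpr WideLinkeditDataCommand(const RawLinkeditDataCommand &in)
      : cmd(in.cmd), cmdsize(in.cmdsize), dataoff(in.dataoff),
        datasize(in.datasize) {}
  constexpr WideLinkeditDataCommand &
  operator=(const RawLinkeditDataCommand &in) {
    cmd = in.cmd;
    cmdsize = in.cmdsize;
    dataoff = in.dataoff; // widened from 32 to 64 bits here
    datasize = in.datasize;
    return *this;
  }
  uint32_t cmd = 0;
  uint32_t cmdsize = 0;
  uint64_t dataoff = 0;
  uint32_t datasize = 0;
};

// Hypothetical helper: copy the on-disk command into the wide structure,
// then apply the slide to the 64-bit offset field.
constexpr WideLinkeditDataCommand
LoadAndSlide(const RawLinkeditDataCommand &raw, uint64_t linkedit_slide) {
  WideLinkeditDataCommand wide{};
  wide = raw;                     // widening copy via operator=
  wide.dataoff += linkedit_slide; // no truncation in a 64-bit field
  return wide;
}
```

Because the wide structure keeps the original field names, code that already does `cmd.dataoff += linkedit_slide` can stay as written once the variable's type changes.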

The alternative is to create local variables alongside these Mach-O load command objects for the offsets that we care about, adjust those by the correct VA offsets, and use only those local variables instead of the fields in the objects. I took the local enhanced structure approach in November, and I think it is the cleaner one.
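For comparison, the alternative with side variables might look something like the sketch below. `RawSymtabCommand` is a stand-in for `llvm::MachO::symtab_command`, and `SlidSymtabOffsets` and `ComputeSlidOffsets` are hypothetical names; nothing here is taken from the lldb sources.

```cpp
#include <cstdint>

// Stand-in for llvm::MachO::symtab_command, with its 32-bit offset fields.
struct RawSymtabCommand {
  uint32_t cmd;
  uint32_t cmdsize;
  uint32_t symoff;
  uint32_t nsyms;
  uint32_t stroff;
  uint32_t strsize;
};

// Hypothetical side values that live next to the untouched load command.
struct SlidSymtabOffsets {
  uint64_t symoff;
  uint64_t stroff;
};

// Compute the slid offsets without modifying the load command itself.
// Each sum is evaluated in 64 bits because linkedit_slide is a uint64_t.
constexpr SlidSymtabOffsets ComputeSlidOffsets(const RawSymtabCommand &raw,
                                               uint64_t linkedit_slide) {
  return SlidSymtabOffsets{raw.symoff + linkedit_slide,
                           raw.stroff + linkedit_slide};
}
```

The cost of this approach is that every later use must remember to read the side values and never `raw.symoff` or `raw.stroff`, which still hold the unslid file offsets.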

rdar://160384968

@llvmbot
Member

llvmbot commented Sep 19, 2025

@llvm/pr-subscribers-lldb

Author: Jason Molenda (jasonmolenda)

Changes


Full diff: https://github.com/llvm/llvm-project/pull/159849.diff

2 Files Affected:

  • (modified) lldb/source/Plugins/ObjectFile/Mach-O/ObjectFileMachO.cpp (+33-43)
  • (modified) lldb/source/Plugins/ObjectFile/Mach-O/ObjectFileMachO.h (+137)
diff --git a/lldb/source/Plugins/ObjectFile/Mach-O/ObjectFileMachO.cpp b/lldb/source/Plugins/ObjectFile/Mach-O/ObjectFileMachO.cpp
index 924e34053d411..fada1fda2b4bc 100644
--- a/lldb/source/Plugins/ObjectFile/Mach-O/ObjectFileMachO.cpp
+++ b/lldb/source/Plugins/ObjectFile/Mach-O/ObjectFileMachO.cpp
@@ -2156,10 +2156,10 @@ void ObjectFileMachO::ParseSymtab(Symtab &symtab) {
   LLDB_LOG(log, "Parsing symbol table for {0}", file_name);
   Progress progress("Parsing symbol table", file_name);
 
-  llvm::MachO::linkedit_data_command function_starts_load_command = {0, 0, 0, 0};
-  llvm::MachO::linkedit_data_command exports_trie_load_command = {0, 0, 0, 0};
-  llvm::MachO::dyld_info_command dyld_info = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
-  llvm::MachO::dysymtab_command dysymtab = m_dysymtab;
+  LinkeditDataCommandLargeOffsets function_starts_load_command;
+  LinkeditDataCommandLargeOffsets exports_trie_load_command;
+  DyldInfoCommandLargeOffsets dyld_info;
+  DysymtabCommandLargeOffsets dysymtab(m_dysymtab);
   SymtabCommandLargeOffsets symtab_load_command;
   // The data element of type bool indicates that this entry is thumb
   // code.
@@ -2196,32 +2196,24 @@ void ObjectFileMachO::ParseSymtab(Symtab &symtab) {
       break;
     // Watch for the symbol table load command
     switch (lc.cmd) {
-    case LC_SYMTAB:
-      // struct symtab_command {
-      //   uint32_t        cmd;            /* LC_SYMTAB */
-      //   uint32_t        cmdsize;        /* sizeof(struct symtab_command) */
-      //   uint32_t        symoff;         /* symbol table offset */
-      //   uint32_t        nsyms;          /* number of symbol table entries */
-      //   uint32_t        stroff;         /* string table offset */
-      //   uint32_t        strsize;        /* string table size in bytes */
-      // };
-      symtab_load_command.cmd = lc.cmd;
-      symtab_load_command.cmdsize = lc.cmdsize;
-      symtab_load_command.symoff = m_data.GetU32(&offset);
-      symtab_load_command.nsyms = m_data.GetU32(&offset);
-      symtab_load_command.stroff = m_data.GetU32(&offset);
-      symtab_load_command.strsize = m_data.GetU32(&offset);
-      break;
+    case LC_SYMTAB: {
+      llvm::MachO::symtab_command lc_obj;
+      if (m_data.GetU32(&offset, &lc_obj.symoff, 4)) {
+        lc_obj.cmd = lc.cmd;
+        lc_obj.cmdsize = lc.cmdsize;
+        symtab_load_command = lc_obj;
+      }
+    } break;
 
     case LC_DYLD_INFO:
-    case LC_DYLD_INFO_ONLY:
-      if (m_data.GetU32(&offset, &dyld_info.rebase_off, 10)) {
-        dyld_info.cmd = lc.cmd;
-        dyld_info.cmdsize = lc.cmdsize;
-      } else {
-        memset(&dyld_info, 0, sizeof(dyld_info));
+    case LC_DYLD_INFO_ONLY: {
+      llvm::MachO::dyld_info_command lc_obj;
+      if (m_data.GetU32(&offset, &lc_obj.rebase_off, 10)) {
+        lc_obj.cmd = lc.cmd;
+        lc_obj.cmdsize = lc.cmdsize;
+        dyld_info = lc_obj;
       }
-      break;
+    } break;
 
     case LC_LOAD_DYLIB:
     case LC_LOAD_WEAK_DYLIB:
@@ -2245,22 +2237,20 @@ void ObjectFileMachO::ParseSymtab(Symtab &symtab) {
       }
     } break;
 
-    case LC_DYLD_EXPORTS_TRIE:
-      exports_trie_load_command.cmd = lc.cmd;
-      exports_trie_load_command.cmdsize = lc.cmdsize;
-      if (m_data.GetU32(&offset, &exports_trie_load_command.dataoff, 2) ==
-          nullptr) // fill in offset and size fields
-        memset(&exports_trie_load_command, 0,
-               sizeof(exports_trie_load_command));
-      break;
-    case LC_FUNCTION_STARTS:
-      function_starts_load_command.cmd = lc.cmd;
-      function_starts_load_command.cmdsize = lc.cmdsize;
-      if (m_data.GetU32(&offset, &function_starts_load_command.dataoff, 2) ==
-          nullptr) // fill in data offset and size fields
-        memset(&function_starts_load_command, 0,
-               sizeof(function_starts_load_command));
-      break;
+    case LC_DYLD_EXPORTS_TRIE: {
+      llvm::MachO::linkedit_data_command lc_obj;
+      lc_obj.cmd = lc.cmd;
+      lc_obj.cmdsize = lc.cmdsize;
+      if (m_data.GetU32(&offset, &lc_obj.dataoff, 2))
+        exports_trie_load_command = lc_obj;
+    } break;
+    case LC_FUNCTION_STARTS: {
+      llvm::MachO::linkedit_data_command lc_obj;
+      lc_obj.cmd = lc.cmd;
+      lc_obj.cmdsize = lc.cmdsize;
+      if (m_data.GetU32(&offset, &lc_obj.dataoff, 2))
+        function_starts_load_command = lc_obj;
+    } break;
 
     case LC_UUID: {
       const uint8_t *uuid_bytes = m_data.PeekData(offset, 16);
diff --git a/lldb/source/Plugins/ObjectFile/Mach-O/ObjectFileMachO.h b/lldb/source/Plugins/ObjectFile/Mach-O/ObjectFileMachO.h
index 7e3a6754dd0b8..44daaa2240688 100644
--- a/lldb/source/Plugins/ObjectFile/Mach-O/ObjectFileMachO.h
+++ b/lldb/source/Plugins/ObjectFile/Mach-O/ObjectFileMachO.h
@@ -263,6 +263,18 @@ class ObjectFileMachO : public lldb_private::ObjectFile {
   // in virtual address layout from the start of the TEXT segment, and
   // that span may be larger than 4GB.
   struct SymtabCommandLargeOffsets {
+    SymtabCommandLargeOffsets() {}
+    SymtabCommandLargeOffsets(const llvm::MachO::symtab_command &in)
+        : cmd(in.cmd), cmdsize(in.cmdsize), symoff(in.symoff), nsyms(in.nsyms),
+          stroff(in.stroff), strsize(in.strsize) {}
+    void operator=(const llvm::MachO::symtab_command &in) {
+      cmd = in.cmd;
+      cmdsize = in.cmdsize;
+      symoff = in.symoff;
+      nsyms = in.nsyms;
+      stroff = in.stroff;
+      strsize = in.strsize;
+    }
     uint32_t cmd = 0;          /* LC_SYMTAB */
     uint32_t cmdsize = 0;      /* sizeof(struct symtab_command) */
     lldb::offset_t symoff = 0; /* symbol table offset */
@@ -271,6 +283,131 @@ class ObjectFileMachO : public lldb_private::ObjectFile {
     uint32_t strsize = 0;      /* string table size in bytes */
   };
 
+  // The LC_DYLD_INFO's dyld_info_command has 32-bit file offsets
+  // that we will use as virtual address offsets, and may need to span
+  // more than 4GB in virtual memory.
+  struct DyldInfoCommandLargeOffsets {
+    DyldInfoCommandLargeOffsets() {}
+    DyldInfoCommandLargeOffsets(const llvm::MachO::dyld_info_command &in)
+        : cmd(in.cmd), cmdsize(in.cmdsize), rebase_off(in.rebase_off),
+          rebase_size(in.rebase_size), bind_off(in.bind_off),
+          bind_size(in.bind_size), weak_bind_off(in.weak_bind_off),
+          weak_bind_size(in.weak_bind_size), lazy_bind_off(in.lazy_bind_off),
+          lazy_bind_size(in.lazy_bind_size), export_off(in.export_off),
+          export_size(in.export_size) {}
+
+    void operator=(const llvm::MachO::dyld_info_command &in) {
+      cmd = in.cmd;
+      cmdsize = in.cmdsize;
+      rebase_off = in.rebase_off;
+      rebase_size = in.rebase_size;
+      bind_off = in.bind_off;
+      bind_size = in.bind_size;
+      weak_bind_off = in.weak_bind_off;
+      weak_bind_size = in.weak_bind_size;
+      lazy_bind_off = in.lazy_bind_off;
+      lazy_bind_size = in.lazy_bind_size;
+      export_off = in.export_off;
+      export_size = in.export_size;
+    };
+
+    uint32_t cmd = 0;                 /* LC_DYLD_INFO or LC_DYLD_INFO_ONLY */
+    uint32_t cmdsize = 0;             /* sizeof(struct dyld_info_command) */
+    lldb::offset_t rebase_off = 0;    /* file offset to rebase info  */
+    uint32_t rebase_size = 0;         /* size of rebase info   */
+    lldb::offset_t bind_off = 0;      /* file offset to binding info   */
+    uint32_t bind_size = 0;           /* size of binding info  */
+    lldb::offset_t weak_bind_off = 0; /* file offset to weak binding info   */
+    uint32_t weak_bind_size = 0;      /* size of weak binding info  */
+    lldb::offset_t lazy_bind_off = 0; /* file offset to lazy binding info */
+    uint32_t lazy_bind_size = 0;      /* size of lazy binding info */
+    lldb::offset_t export_off = 0;    /* file offset to export info */
+    uint32_t export_size = 0;         /* size of export info */
+  };
+
+  // The LC_DYSYMTAB's dysymtab_command has 32-bit file offsets
+  // that we will use as virtual address offsets, and may need to span
+  // more than 4GB in virtual memory.
+  struct DysymtabCommandLargeOffsets {
+    DysymtabCommandLargeOffsets() {}
+    DysymtabCommandLargeOffsets(const llvm::MachO::dysymtab_command &in)
+        : cmd(in.cmd), cmdsize(in.cmdsize), ilocalsym(in.ilocalsym),
+          nlocalsym(in.nlocalsym), iextdefsym(in.iextdefsym),
+          nextdefsym(in.nextdefsym), iundefsym(in.iundefsym),
+          nundefsym(in.nundefsym), tocoff(in.tocoff), ntoc(in.ntoc),
+          modtaboff(in.modtaboff), nmodtab(in.nmodtab),
+          extrefsymoff(in.extrefsymoff), nextrefsyms(in.nextrefsyms),
+          indirectsymoff(in.indirectsymoff), nindirectsyms(in.nindirectsyms),
+          extreloff(in.extreloff), nextrel(in.nextrel), locreloff(in.locreloff),
+          nlocrel(in.nlocrel) {}
+
+    void operator=(const llvm::MachO::dysymtab_command &in) {
+      cmd = in.cmd;
+      cmdsize = in.cmdsize;
+      ilocalsym = in.ilocalsym;
+      nlocalsym = in.nlocalsym;
+      iextdefsym = in.iextdefsym;
+      nextdefsym = in.nextdefsym;
+      iundefsym = in.iundefsym;
+      nundefsym = in.nundefsym;
+      tocoff = in.tocoff;
+      ntoc = in.ntoc;
+      modtaboff = in.modtaboff;
+      nmodtab = in.nmodtab;
+      extrefsymoff = in.extrefsymoff;
+      nextrefsyms = in.nextrefsyms;
+      indirectsymoff = in.indirectsymoff;
+      nindirectsyms = in.nindirectsyms;
+      extreloff = in.extreloff;
+      nextrel = in.nextrel;
+      locreloff = in.locreloff;
+      nlocrel = in.nlocrel;
+    };
+
+    uint32_t cmd = 0;             /* LC_DYSYMTAB */
+    uint32_t cmdsize = 0;         /* sizeof(struct dysymtab_command) */
+    uint32_t ilocalsym = 0;       /* index to local symbols */
+    uint32_t nlocalsym = 0;       /* number of local symbols */
+    uint32_t iextdefsym = 0;      /* index to externally defined symbols */
+    uint32_t nextdefsym = 0;      /* number of externally defined symbols */
+    uint32_t iundefsym = 0;       /* index to undefined symbols */
+    uint32_t nundefsym = 0;       /* number of undefined symbols */
+    lldb::offset_t tocoff = 0;    /* file offset to table of contents */
+    uint32_t ntoc = 0;            /* number of entries in table of contents */
+    lldb::offset_t modtaboff = 0; /* file offset to module table */
+    uint32_t nmodtab = 0;         /* number of module table entries */
+    lldb::offset_t extrefsymoff = 0; /* offset to referenced symbol table */
+    uint32_t nextrefsyms = 0; /* number of referenced symbol table entries */
+    lldb::offset_t indirectsymoff =
+        0;                        /* file offset to the indirect symbol table */
+    uint32_t nindirectsyms = 0;   /* number of indirect symbol table entries */
+    lldb::offset_t extreloff = 0; /* offset to external relocation entries */
+    uint32_t nextrel = 0;         /* number of external relocation entries */
+    lldb::offset_t locreloff = 0; /* offset to local relocation entries */
+    uint32_t nlocrel = 0;         /* number of local relocation entries */
+  };
+
+  // The linkedit_data_command is used in several load commands including
+  // LC_FUNCTION_STARTS and LC_DYLD_EXPORTS_TRIE.  It has a 32-bit file offset
+  // that may need to span more than 4GB in real virtual addresses.
+  struct LinkeditDataCommandLargeOffsets {
+    LinkeditDataCommandLargeOffsets() {}
+    LinkeditDataCommandLargeOffsets(
+        const llvm::MachO::linkedit_data_command &in)
+        : cmd(in.cmd), cmdsize(in.cmdsize), dataoff(in.dataoff),
+          datasize(in.datasize) {}
+    void operator=(const llvm::MachO::linkedit_data_command &in) {
+      cmd = in.cmd;
+      cmdsize = in.cmdsize;
+      dataoff = in.dataoff;
+      datasize = in.datasize;
+    }
+    uint32_t cmd = 0;     /* LC_FUNCTION_STARTS, LC_DYLD_EXPORTS_TRIE, etc */
+    uint32_t cmdsize = 0; /* sizeof(struct linkedit_data_command) */
+    lldb::offset_t dataoff = 0; /* file offset of data in __LINKEDIT segment */
+    uint32_t datasize = 0;      /* file size of data in __LINKEDIT segment  */
+  };
+
   /// Get the list of binary images that were present in the process
   /// when the corefile was produced.
   /// \return

Member

@JDevlieghere JDevlieghere left a comment

LGTM modulo Doxygen comments.

Jonas' suggested doxygen comments.

Co-authored-by: Jonas Devlieghere <[email protected]>
@jasonmolenda jasonmolenda merged commit 3e57a0d into llvm:main Sep 20, 2025
9 checks passed
@jasonmolenda jasonmolenda deleted the allow-for-larger-macho-offset-values branch September 20, 2025 02:53
jasonmolenda added a commit to jasonmolenda/llvm-project that referenced this pull request Sep 22, 2025
(cherry picked from commit 3e57a0d)
jasonmolenda added a commit to swiftlang/llvm-project that referenced this pull request Sep 22, 2025
…et-fields

[lldb][MachO] Local structs for larger VA offsets (llvm#159849)