
Conversation


@aganea aganea commented Nov 23, 2025

This PR reduces runtime performance outliers by asking the OS to prefetch memory-mapped input files in advance, as early as possible. I have implemented the Linux side as well; however, I have only tested this on Windows 11 version 24H2, with an active security stack enabled. The machine is an AMD Threadripper PRO 3975WX (32c/64t) with 128 GB of RAM and a Samsung 990 PRO SSD.

I have used an Unreal Engine-based game to profile the link times. Here's a quick summary of the input data:

                                    Summary
--------------------------------------------------------------------------------
               4,169 Input OBJ files (expanded from all cmd-line inputs)
      26,325,429,114 Size of all consumed OBJ files (non-lazy), in bytes
                   9 PDB type server dependencies
                   0 Precomp OBJ dependencies
         350,516,212 Input debug type records
      18,146,407,324 Size of all input debug type records, in bytes
          15,709,427 Merged TPI records
           4,747,187 Merged IPI records
              56,408 Output PDB strings
          23,410,278 Global symbol records
          45,482,231 Module symbol records
           1,584,608 Public symbol records

In normal conditions, meaning all the pages are already in RAM, this PR has no noticeable effect:

>hyperfine "before\lld-link.exe @Game.exe.rsp" "with_pr\lld-link.exe @Game.exe.rsp"
Benchmark 1: before\lld-link.exe @Game.exe.rsp
  Time (mean ± σ):     29.689 s ±  0.550 s    [User: 259.873 s, System: 37.936 s]
  Range (min … max):   29.026 s … 30.880 s    10 runs

Benchmark 2: with_pr\lld-link.exe @Game.exe.rsp
  Time (mean ± σ):     29.594 s ±  0.342 s    [User: 261.434 s, System: 62.259 s]
  Range (min … max):   29.209 s … 30.171 s    10 runs

Summary
  with_pr\lld-link.exe @Game.exe.rsp ran
    1.00 ± 0.02 times faster than before\lld-link.exe @Game.exe.rsp

However, in production conditions we're typically working with the Unreal Engine Editor, with external DCC tools like Maya and Houdini; we have several instances of Visual Studio open, VS Code with rust-analyzer, etc. All this means that between code change iterations, most of the input OBJ files may already have been evicted from the Windows RAM cache. Consequently, in the following test I've simulated the worst-case condition by evicting all data from RAM with RAMMap64 (i.e. RAMMap64.exe -E[wsmt0], with a 5-second sleep at the end to ensure the System thread actually has time to evict the pages):

>hyperfine -p cleanup.bat "before\lld-link.exe @Game.exe.rsp" "with_pr\lld-link.exe @Game.exe.rsp"
Benchmark 1: before\lld-link.exe @Game.exe.rsp
  Time (mean ± σ):     48.124 s ±  1.770 s    [User: 269.031 s, System: 41.769 s]
  Range (min … max):   46.023 s … 50.388 s    10 runs

Benchmark 2: with_pr\lld-link.exe @Game.exe.rsp
  Time (mean ± σ):     34.192 s ±  0.478 s    [User: 263.620 s, System: 40.991 s]
  Range (min … max):   33.550 s … 34.916 s    10 runs

Summary
  with_pr\lld-link.exe @Game.exe.rsp ran
    1.41 ± 0.06 times faster than before\lld-link.exe @Game.exe.rsp

This is similar to the work done in LLD/MachO in #157917.
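For reference, the core technique on the POSIX side can be sketched as a standalone example. This is a hypothetical illustration, not the PR's actual code; `adviseWillNeed` and `mapAndPrefetch` are made-up names:

```cpp
// Hypothetical sketch of prefetching a memory-mapped input file on POSIX.
// adviseWillNeed() mirrors what willNeedIfMmap() does in this PR: a
// best-effort hint that the pages will be read soon.
#include <cstddef>
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hint that [Addr, Addr+Len) will be needed soon; failure is harmless
// since this is purely a read optimization.
bool adviseWillNeed(void *Addr, size_t Len) {
#if defined(POSIX_MADV_WILLNEED)
  return ::posix_madvise(Addr, Len, POSIX_MADV_WILLNEED) == 0;
#else
  return ::madvise(Addr, Len, MADV_WILLNEED) == 0;
#endif
}

// Map a file read-only and issue the prefetch hint. The kernel can then
// read pages in the background while the caller parses other inputs.
char *mapAndPrefetch(const char *Path, size_t &LenOut) {
  int FD = ::open(Path, O_RDONLY);
  if (FD < 0)
    return nullptr;
  struct stat St;
  if (::fstat(FD, &St) != 0 || St.st_size == 0) {
    ::close(FD);
    return nullptr;
  }
  LenOut = static_cast<size_t>(St.st_size);
  void *M = ::mmap(nullptr, LenOut, PROT_READ, MAP_PRIVATE, FD, 0);
  ::close(FD); // The mapping keeps the file contents reachable.
  if (M == MAP_FAILED)
    return nullptr;
  adviseWillNeed(M, LenOut); // Best-effort; ignore the result.
  return static_cast<char *>(M);
}
```

The lld-link change in this PR is in the same spirit: the hint is issued right after each input is mapped on the worker thread, so the readahead overlaps with driver-side work.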


llvmbot commented Nov 23, 2025

@llvm/pr-subscribers-platform-windows
@llvm/pr-subscribers-lld-coff
@llvm/pr-subscribers-llvm-support

@llvm/pr-subscribers-lld

Author: Alexandre Ganea (aganea)



Full diff: https://github.com/llvm/llvm-project/pull/169224.diff

6 Files Affected:

  • (modified) lld/COFF/Driver.cpp (+7-1)
  • (modified) llvm/include/llvm/Support/FileSystem.h (+2)
  • (modified) llvm/include/llvm/Support/MemoryBuffer.h (+5)
  • (modified) llvm/lib/Support/MemoryBuffer.cpp (+1)
  • (modified) llvm/lib/Support/Unix/Path.inc (+13)
  • (modified) llvm/lib/Support/Windows/Path.inc (+26)
diff --git a/lld/COFF/Driver.cpp b/lld/COFF/Driver.cpp
index 0e528de9c3652..e6e7194a684c1 100644
--- a/lld/COFF/Driver.cpp
+++ b/lld/COFF/Driver.cpp
@@ -164,6 +164,8 @@ static std::future<MBErrPair> createFutureForFile(std::string path) {
                                          /*RequiresNullTerminator=*/false);
     if (!mbOrErr)
       return MBErrPair{nullptr, mbOrErr.getError()};
+    // Prefetch memory pages in the background as we will need them soon enough.
+    (*mbOrErr)->willNeedIfMmap();
     return MBErrPair{std::move(*mbOrErr), std::error_code()};
   });
 }
@@ -337,8 +339,12 @@ void LinkerDriver::enqueuePath(StringRef path, bool wholeArchive, bool lazy) {
         auto retryMb = MemoryBuffer::getFile(*retryPath, /*IsText=*/false,
                                              /*RequiresNullTerminator=*/false);
         ec = retryMb.getError();
-        if (!ec)
+        if (!ec) {
           mb = std::move(*retryMb);
+          // Prefetch memory pages in the background as we will need them soon
+          // enough.
+          mb->willNeedIfMmap();
+        }
       } else {
         // We've already handled this file.
         return;
diff --git a/llvm/include/llvm/Support/FileSystem.h b/llvm/include/llvm/Support/FileSystem.h
index cf2a8104ac813..547d732dc3053 100644
--- a/llvm/include/llvm/Support/FileSystem.h
+++ b/llvm/include/llvm/Support/FileSystem.h
@@ -1310,6 +1310,7 @@ class mapped_file_region {
 
   LLVM_ABI void unmapImpl();
   LLVM_ABI void dontNeedImpl();
+  LLVM_ABI void willNeedImpl();
 
   LLVM_ABI std::error_code init(sys::fs::file_t FD, uint64_t Offset,
                                 mapmode Mode);
@@ -1341,6 +1342,7 @@ class mapped_file_region {
     copyFrom(mapped_file_region());
   }
   void dontNeed() { dontNeedImpl(); }
+  void willNeed() { willNeedImpl(); }
 
   LLVM_ABI size_t size() const;
   LLVM_ABI char *data() const;
diff --git a/llvm/include/llvm/Support/MemoryBuffer.h b/llvm/include/llvm/Support/MemoryBuffer.h
index f092c67265a31..9b0f1f0f7d34e 100644
--- a/llvm/include/llvm/Support/MemoryBuffer.h
+++ b/llvm/include/llvm/Support/MemoryBuffer.h
@@ -83,6 +83,11 @@ class LLVM_ABI MemoryBuffer {
   /// function should not be called on a writable buffer.
   virtual void dontNeedIfMmap() {}
 
+  /// Mark the buffer as to-be-used in the near future. This may trigger OS
+  /// prefetching from the storage device into memory, if possible.
+  /// This should be used purely as a read optimization.
+  virtual void willNeedIfMmap() {}
+
   /// Open the specified file as a MemoryBuffer, returning a new MemoryBuffer
   /// if successful, otherwise returning null.
   ///
diff --git a/llvm/lib/Support/MemoryBuffer.cpp b/llvm/lib/Support/MemoryBuffer.cpp
index 23b9f8c5790d2..3ef4cd9086b95 100644
--- a/llvm/lib/Support/MemoryBuffer.cpp
+++ b/llvm/lib/Support/MemoryBuffer.cpp
@@ -243,6 +243,7 @@ class MemoryBufferMMapFile : public MB {
   }
 
   void dontNeedIfMmap() override { MFR.dontNeed(); }
+  void willNeedIfMmap() override { MFR.willNeed(); }
 };
 } // namespace
 
diff --git a/llvm/lib/Support/Unix/Path.inc b/llvm/lib/Support/Unix/Path.inc
index 0d991ead72416..0937adcd1bc18 100644
--- a/llvm/lib/Support/Unix/Path.inc
+++ b/llvm/lib/Support/Unix/Path.inc
@@ -900,6 +900,19 @@ void mapped_file_region::dontNeedImpl() {
 #endif
 }
 
+void mapped_file_region::willNeedImpl() {
+  assert(Mode == mapped_file_region::readonly);
+  if (!Mapping)
+    return;
+#if defined(__MVS__) || defined(_AIX)
+  // If we don't have madvise, or it isn't beneficial, treat this as a no-op.
+#elif defined(POSIX_MADV_WILLNEED)
+  ::posix_madvise(Mapping, Size, POSIX_MADV_WILLNEED);
+#else
+  ::madvise(Mapping, Size, MADV_WILLNEED);
+#endif
+}
+
 int mapped_file_region::alignment() { return Process::getPageSizeEstimate(); }
 
 std::error_code detail::directory_iterator_construct(detail::DirIterState &it,
diff --git a/llvm/lib/Support/Windows/Path.inc b/llvm/lib/Support/Windows/Path.inc
index be007b7abdb51..8be4105ba6bd8 100644
--- a/llvm/lib/Support/Windows/Path.inc
+++ b/llvm/lib/Support/Windows/Path.inc
@@ -1023,6 +1023,32 @@ void mapped_file_region::unmapImpl() {
 
 void mapped_file_region::dontNeedImpl() {}
 
+void mapped_file_region::willNeedImpl() {
+#if (_WIN32_WINNT < _WIN32_WINNT_WIN8)
+  typedef struct _WIN32_MEMORY_RANGE_ENTRY {
+    PVOID VirtualAddress;
+    SIZE_T NumberOfBytes;
+  } WIN32_MEMORY_RANGE_ENTRY, *PWIN32_MEMORY_RANGE_ENTRY;
+#endif
+
+  HMODULE kernelM = llvm::sys::windows::loadSystemModuleSecure(L"kernel32.dll");
+  if (kernelM) {
+    // PrefetchVirtualMemory is only available on Windows 8 and later. Since we
+    // still support compilation on Windows 7, we load the function dynamically.
+    typedef BOOL(WINAPI * PrefetchVirtualMemory_t)(
+        HANDLE hProcess, ULONG_PTR NumberOfEntries,
+        _In_reads_(NumberOfEntries) PWIN32_MEMORY_RANGE_ENTRY VirtualAddresses,
+        ULONG Flags);
+    static const auto pfnPrefetchVirtualMemory =
+        (PrefetchVirtualMemory_t)::GetProcAddress(kernelM,
+                                                  "PrefetchVirtualMemory");
+    if (pfnPrefetchVirtualMemory) {
+      WIN32_MEMORY_RANGE_ENTRY Range{Mapping, Size};
+      pfnPrefetchVirtualMemory(::GetCurrentProcess(), 1, &Range, 0);
+    }
+  }
+}
+
 std::error_code mapped_file_region::sync() const {
   if (!::FlushViewOfFile(Mapping, Size))
     return mapWindowsError(GetLastError());

void mapped_file_region::dontNeedImpl() {}

void mapped_file_region::willNeedImpl() {
#if (_WIN32_WINNT < _WIN32_WINNT_WIN8)
Member


Do we have many of these cases? I wonder if we should sink this into LLVM with a check for the type definition.

Collaborator


Since we're dlopen'ing and rolling our own function prototype anyway below, maybe we should just define this struct unconditionally (with a name that doesn't clash with the windows header)?

Member Author


As suggested by @zmodem.

@compnerd No, it seems we don't have many cases like this.

} WIN32_MEMORY_RANGE_ENTRY, *PWIN32_MEMORY_RANGE_ENTRY;
#endif

HMODULE kernelM = llvm::sys::windows::loadSystemModuleSecure(L"kernel32.dll");
Member


This is where the version is likely more useful. In the case of the minimum deployment version being >7, we should prefer static linking IMO. But, I know that is more challenging and potentially not as valuable to invest in, so more of a passing comment.

Member Author

@aganea aganea Nov 28, 2025


I agree, and I was considering that. However, right now (as you're certainly aware) we're undef'ing _WIN32_WINNT and re-def'ing it to 0x0601 in WindowsSupport.h, so we wouldn't take the static-linking codepath anyway.

Collaborator

@zmodem zmodem left a comment


Those are nice numbers!

I wonder what this does in more low-RAM situations though. Will it increase swapping, or is the OS smart enough to hold back a little on the prefetching then?

We've had reports (https://crbug.com/428641952) that lld is using too much RAM when linking PDBs, and this sounds like it could potentially make that worse.

(Related to that bug, I've been wondering if we could do more of the opposite, and call mapped_file_region::dontNeed() when we're done writing out an object file's code and debug symbols to the final outputs.)



aganea commented Nov 28, 2025

I wonder what this does in more low-RAM situations though. Will it increase swapping, or is the OS smart enough to hold back a little on the prefetching then?

The docs state that it only brings pages from the storage device into physical memory: "[prefetch] is treated as a strong hint by the system and is subject to usual physical memory constraints where it can completely or partially fail under low-memory conditions". The pages aren't even mapped into the working set; that only happens later, when the process actually tries to access them, causing a soft page fault (as opposed to a hard fault, where the page is not in physical memory and has to be brought back from storage).
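As a side note, on Linux/BSD one can actually observe whether the pages of a mapping are in physical memory via mincore(2). This is a hypothetical helper for experimentation only; the PR itself does not query residency:

```cpp
// Count how many pages of a mapping are currently resident in physical
// memory, using the Linux/BSD mincore(2) call. Illustrative only.
#include <cstddef>
#include <sys/mman.h>
#include <unistd.h>
#include <vector>

// Returns the number of resident pages in [Addr, Addr+Len), or -1 on error.
// Addr must be page-aligned (mmap results always are).
long countResidentPages(void *Addr, size_t Len) {
  long PageSize = ::sysconf(_SC_PAGESIZE);
  size_t Pages = (Len + PageSize - 1) / PageSize;
  std::vector<unsigned char> Vec(Pages);
  if (::mincore(Addr, Len, Vec.data()) != 0)
    return -1;
  long Resident = 0;
  for (unsigned char B : Vec)
    Resident += (B & 1); // Low bit set => page is in core.
  return Resident;
}
```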

We've had reports (https://crbug.com/428641952) that lld is using too much RAM when linking PDBs, and this sounds like it could potentially make that worse.

This is quite an interesting and challenging problem. However, as stated above, I don't think this PR would make things worse: under memory pressure the prefetch hint is simply ignored. I compared before/after this PR in low-memory conditions and, surprisingly, this PR improves the link times:

Summary
  with_pr\lld-link.exe @Game.exe.rsp ran
    1.10 ± 0.09 times faster than before\lld-link.exe @Game.exe.rsp

Perhaps I can ask on that ticket if anyone on the Edge team could test this PR.

(Related to that bug, I've been wondering if we could do more of the opposite, and call mapped_file_region::dontNeed() when we're done writing out an object file's code and debug symbols to the final outputs.)

Yeah, I saw your comment on the ticket. This is quite challenging, since in many places in LLD/COFF we don't really copy data out of the input mmap'ed buffers, but rather take references into them. Until the PDB has been fully written to the output mmap buffer, it is hard to know what can be unloaded. This would require a dependency assessment of the code to see whether such a thing is feasible, and at first glance it wouldn't be simple. However, we could do it for bitcode modules, once a ThinLTO thread is done with a specific input file. I'm not sure if that's the case Edge is hitting or not.

Another opportunity for solving the issue raised in the ticket would be to write a memory tracking system in LLVM, and dump the allocation list into a .json file at specific points during LLD execution, similar to --time-trace but for memory. That would provide hints as to which data structures to optimize and which buffers could be freed when no longer needed.
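A toy sketch of what such a tracker could look like (the class name, the categories, and the JSON shape are all invented for illustration; this is not proposed LLVM code):

```cpp
// Minimal per-category byte accounting with a JSON dump at checkpoints,
// loosely in the spirit of --time-trace but for memory. Hypothetical.
#include <cstdio>
#include <cstdlib>
#include <map>
#include <string>

class MemoryTracker {
  std::map<std::string, size_t> BytesByCategory;

public:
  // Allocate Size bytes attributed to Category.
  void *allocate(const std::string &Category, size_t Size) {
    BytesByCategory[Category] += Size;
    return std::malloc(Size);
  }

  // Free Ptr and subtract Size from Category's running total.
  void deallocate(const std::string &Category, void *Ptr, size_t Size) {
    BytesByCategory[Category] -= Size;
    std::free(Ptr);
  }

  size_t bytes(const std::string &Category) const {
    auto It = BytesByCategory.find(Category);
    return It == BytesByCategory.end() ? 0 : It->second;
  }

  // Emit one JSON object per checkpoint (e.g. after type merging, before
  // PDB commit), so memory peaks can be attributed to a linker phase.
  void dumpJson(std::FILE *Out, const char *Checkpoint) const {
    std::fprintf(Out, "{\"checkpoint\":\"%s\"", Checkpoint);
    for (const auto &KV : BytesByCategory)
      std::fprintf(Out, ",\"%s\":%zu", KV.first.c_str(), KV.second);
    std::fprintf(Out, "}\n");
  }
};
```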


aganea commented Nov 28, 2025

Also FWIW a summary of the memory usage for the above example: ~13 GB of RAM are allocated by LLD; ~24 GB are memory-mapped inputs; ~3.5 GB are the memory-mapped outputs (PDB and EXE). Overall the working set peaks at about 40.5 GB. For the same thing, link.exe peaks at 57 GB.
