Skip to content

Conversation

@shiltian
Copy link
Contributor

ForceImportAll mostly works, but it uses available_externally linkage. For
targets that don't support object linking, like AMDGPU, this approach doesn't
work.

With this PR, all imported functions use "default" linkage. I'm not entirely
sure if that's the best approach, but for AMDGPU it's not really a problem since
we internalize everything at a later stage.

…l functions

`ForceImportAll` mostly works, but it uses `available_externally` linkage. For
targets that don't support object linking, like AMDGPU, this approach doesn't
work.

With this PR, all imported functions use "default" linkage. I'm not entirely
sure if that's the best approach, but for AMDGPU it's not really a problem since
we internalize everything at a later stage.
@llvmbot llvmbot added LTO Link time optimization (regular/full LTO or ThinLTO) llvm:transforms labels Mar 29, 2025
Copy link
Contributor Author

This stack of pull requests is managed by Graphite. Learn more about stacking.

@llvmbot
Copy link
Member

llvmbot commented Mar 29, 2025

@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-lto

Author: Shilei Tian (shiltian)

Changes

ForceImportAll mostly works, but it uses available_externally linkage. For
targets that don't support object linking, like AMDGPU, this approach doesn't
work.

With this PR, all imported functions use "default" linkage. I'm not entirely
sure if that's the best approach, but for AMDGPU it's not really a problem since
we internalize everything at a later stage.


Full diff: https://github.com/llvm/llvm-project/pull/133588.diff

5 Files Affected:

  • (modified) llvm/include/llvm/Transforms/Utils/FunctionImportUtils.h (+3)
  • (modified) llvm/lib/Transforms/IPO/FunctionImport.cpp (+8-10)
  • (modified) llvm/lib/Transforms/Utils/FunctionImportUtils.cpp (+12-4)
  • (modified) llvm/test/Transforms/FunctionImport/adjustable_threshold.ll (+5-5)
  • (modified) llvm/test/Transforms/FunctionImport/noinline.ll (+1-1)
diff --git a/llvm/include/llvm/Transforms/Utils/FunctionImportUtils.h b/llvm/include/llvm/Transforms/Utils/FunctionImportUtils.h
index 6d83b615d5f13..2be0a20c8afa8 100644
--- a/llvm/include/llvm/Transforms/Utils/FunctionImportUtils.h
+++ b/llvm/include/llvm/Transforms/Utils/FunctionImportUtils.h
@@ -16,10 +16,13 @@
 
 #include "llvm/ADT/SetVector.h"
 #include "llvm/IR/ModuleSummaryIndex.h"
+#include "llvm/Support/CommandLine.h"
 
 namespace llvm {
 class Module;
 
+extern cl::opt<bool> ForceImportAll;
+
 /// Class to handle necessary GlobalValue changes required by ThinLTO
 /// function importing, including linkage changes and any necessary renaming.
 class FunctionImportGlobalProcessing {
diff --git a/llvm/lib/Transforms/IPO/FunctionImport.cpp b/llvm/lib/Transforms/IPO/FunctionImport.cpp
index f0daf1a558316..41b2b53959224 100644
--- a/llvm/lib/Transforms/IPO/FunctionImport.cpp
+++ b/llvm/lib/Transforms/IPO/FunctionImport.cpp
@@ -79,10 +79,6 @@ static cl::opt<int> ImportCutoff(
     "import-cutoff", cl::init(-1), cl::Hidden, cl::value_desc("N"),
     cl::desc("Only import first N functions if N>=0 (default -1)"));
 
-static cl::opt<bool>
-    ForceImportAll("force-import-all", cl::init(false), cl::Hidden,
-                   cl::desc("Import functions with noinline attribute"));
-
 static cl::opt<float>
     ImportInstrFactor("import-instr-evolution-factor", cl::init(0.7),
                       cl::Hidden, cl::value_desc("x"),
@@ -1663,7 +1659,8 @@ void llvm::thinLTOFinalizeInModule(Module &TheModule,
     // interposable property and possibly get inlined. Simply drop
     // the definition in that case.
     if (GlobalValue::isAvailableExternallyLinkage(NewLinkage) &&
-        GlobalValue::isInterposableLinkage(GV.getLinkage())) {
+        GlobalValue::isInterposableLinkage(GV.getLinkage()) &&
+        !ForceImportAll) {
       if (!convertToDeclaration(GV))
         // FIXME: Change this to collect replaced GVs and later erase
         // them from the parent module once thinLTOResolvePrevailingGUID is
@@ -1680,11 +1677,12 @@ void llvm::thinLTOFinalizeInModule(Module &TheModule,
         assert(GV.canBeOmittedFromSymbolTable());
         GV.setVisibility(GlobalValue::HiddenVisibility);
       }
-
-      LLVM_DEBUG(dbgs() << "ODR fixing up linkage for `" << GV.getName()
-                        << "` from " << GV.getLinkage() << " to " << NewLinkage
-                        << "\n");
-      GV.setLinkage(NewLinkage);
+      if (!ForceImportAll) {
+        LLVM_DEBUG(dbgs() << "ODR fixing up linkage for `" << GV.getName()
+                          << "` from " << GV.getLinkage() << " to "
+                          << NewLinkage << "\n");
+        GV.setLinkage(NewLinkage);
+      }
     }
     // Remove declarations from comdats, including available_externally
     // as this is a declaration for the linker, and will be dropped eventually.
diff --git a/llvm/lib/Transforms/Utils/FunctionImportUtils.cpp b/llvm/lib/Transforms/Utils/FunctionImportUtils.cpp
index ae1af943bc11c..c93fa6f7fbfd8 100644
--- a/llvm/lib/Transforms/Utils/FunctionImportUtils.cpp
+++ b/llvm/lib/Transforms/Utils/FunctionImportUtils.cpp
@@ -12,7 +12,7 @@
 //===----------------------------------------------------------------------===//
 
 #include "llvm/Transforms/Utils/FunctionImportUtils.h"
-#include "llvm/Support/CommandLine.h"
+
 using namespace llvm;
 
 /// Uses the "source_filename" instead of a Module hash ID for the suffix of
@@ -24,6 +24,12 @@ static cl::opt<bool> UseSourceFilenameForPromotedLocals(
              "This requires that the source filename has a unique name / "
              "path to avoid name collisions."));
 
+namespace llvm {
+cl::opt<bool>
+    ForceImportAll("force-import-all", cl::init(false), cl::Hidden,
+                   cl::desc("Import functions with noinline attribute"));
+}
+
 /// Checks if we should import SGV as a definition, otherwise import as a
 /// declaration.
 bool FunctionImportGlobalProcessing::doImportAsDefinition(
@@ -147,7 +153,8 @@ FunctionImportGlobalProcessing::getLinkage(const GlobalValue *SGV,
     // and/or optimization, but are turned into declarations later
     // during the EliminateAvailableExternally pass.
     if (doImportAsDefinition(SGV) && !isa<GlobalAlias>(SGV))
-      return GlobalValue::AvailableExternallyLinkage;
+      return ForceImportAll ? GlobalValue::ExternalLinkage
+                            : GlobalValue::AvailableExternallyLinkage;
     // An imported external declaration stays external.
     return SGV->getLinkage();
 
@@ -175,7 +182,7 @@ FunctionImportGlobalProcessing::getLinkage(const GlobalValue *SGV,
     // equivalent, so the issue described above for weak_any does not exist,
     // and the definition can be imported. It can be treated similarly
     // to an imported externally visible global value.
-    if (doImportAsDefinition(SGV) && !isa<GlobalAlias>(SGV))
+    if (doImportAsDefinition(SGV) && !isa<GlobalAlias>(SGV) && !ForceImportAll)
       return GlobalValue::AvailableExternallyLinkage;
     else
       return GlobalValue::ExternalLinkage;
@@ -193,7 +200,8 @@ FunctionImportGlobalProcessing::getLinkage(const GlobalValue *SGV,
     // If we are promoting the local to global scope, it is handled
     // similarly to a normal externally visible global.
     if (DoPromote) {
-      if (doImportAsDefinition(SGV) && !isa<GlobalAlias>(SGV))
+      if (doImportAsDefinition(SGV) && !isa<GlobalAlias>(SGV) &&
+          !ForceImportAll)
         return GlobalValue::AvailableExternallyLinkage;
       else
         return GlobalValue::ExternalLinkage;
diff --git a/llvm/test/Transforms/FunctionImport/adjustable_threshold.ll b/llvm/test/Transforms/FunctionImport/adjustable_threshold.ll
index f60758c3d3bf4..5a4a275f449d1 100644
--- a/llvm/test/Transforms/FunctionImport/adjustable_threshold.ll
+++ b/llvm/test/Transforms/FunctionImport/adjustable_threshold.ll
@@ -15,11 +15,11 @@
 ; RUN: opt -passes=function-import -summary-file %t3.thinlto.bc %t.bc \
 ; RUN:  -import-instr-limit=1 -force-import-all -S \
 ; RUN:  | FileCheck %s --check-prefix=IMPORTALL
-; IMPORTALL-DAG: define available_externally void @globalfunc1()
-; IMPORTALL-DAG: define available_externally void @trampoline()
-; IMPORTALL-DAG: define available_externally void @largefunction()
-; IMPORTALL-DAG: define available_externally hidden void @staticfunc2.llvm.0()
-; IMPORTALL-DAG: define available_externally void @globalfunc2()
+; IMPORTALL-DAG: define void @globalfunc1()
+; IMPORTALL-DAG: define void @trampoline()
+; IMPORTALL-DAG: define void @largefunction()
+; IMPORTALL-DAG: define hidden void @staticfunc2.llvm.0()
+; IMPORTALL-DAG: define void @globalfunc2()
 
 declare void @globalfunc1()
 declare void @globalfunc2()
diff --git a/llvm/test/Transforms/FunctionImport/noinline.ll b/llvm/test/Transforms/FunctionImport/noinline.ll
index 8687f1b3b5f7c..eef4d6b5c25c8 100644
--- a/llvm/test/Transforms/FunctionImport/noinline.ll
+++ b/llvm/test/Transforms/FunctionImport/noinline.ll
@@ -19,5 +19,5 @@ entry:
 }
 
 ; NOIMPORT: declare void @foo(ptr)
-; IMPORT: define available_externally void @foo
+; IMPORT: define void @foo
 declare void @foo(ptr) #1

@MaskRay
Copy link
Member

MaskRay commented Mar 29, 2025

https://reviews.llvm.org/D99683 added -force-import-all for AMDGPU to import external functions.

AMDGPU target does not support codegen of object files containing call of external functions, therefore the LLVM module passed to AMDGPU backend needs to contain definitions of all the callees.

Why is there such a limitation and would it be better to lift the limitation?

@teresajohnson
Copy link
Contributor

Looks like this flag was added several years ago for AMDGPU (https://reviews.llvm.org/D99683). Do you know why it was sufficient then but not now?

While this was originally added for AMDGPU I'm concerned about changing the linkage type as other use cases of this option may have appeared, and this is inconsistent with the linkage type normally used for ThinLTO importing, and would result in linker multiply defined errors in the usual case. If you aren't able to link objects, why not just use full LTO, which is what this essentially becomes?

@shiltian
Copy link
Contributor Author

Why is there such a limitation?

From what I understand, seeing all device code is necessary for resource usage analysis, which is required to compute occupancy. Occupancy determines the maximum number of threads a thread block can have. At runtime, when a kernel is launched, the thread block size must not exceed the limit based on that occupancy.

Would it be better to lift the limitation?

Ideally, yes, but in practice, probably not. Technically, we could support external function calls by conservatively assuming the callee uses all possible resources. But that would severely degrade performance. Since GPUs are primarily used for performance workloads, this kind of support may not be practically useful.

@shiltian
Copy link
Contributor Author

Looks like this flag was added several years ago for AMDGPU (https://reviews.llvm.org/D99683). Do you know why it was sufficient then but not now?

I'm not sure if ThinLTO actually worked on AMDGPU back then. At least, it doesn't seem to be working at the moment. I'd assume @yxsamliu might have more insight on this.

While this was originally added for AMDGPU I'm concerned about changing the linkage type as other use cases of this option may have appeared, and this is inconsistent with the linkage type normally used for ThinLTO importing, and would result in linker multiply defined errors in the usual case.

Hmm, that's a valid concern, but as far as I can tell, unless someone explicitly sets this flag, it's only enabled for AMDGPU. I'm not sure if there are any other targets that support global variable object linking but not function object linking.

If you aren't able to link objects, why not just use full LTO, which is what this essentially becomes?

We have full LTO enabled by default in GPU RDC mode, but it's extremely expensive. If the entire linked module ends up as a single SCC, then there's really no benefit to using ThinLTO. However, if there are multiple SCCs, ThinLTO can be quite beneficial. There are also other related efforts, like module splitting, but some benchmark results contributed by the community suggest that ThinLTO can still outperform full LTO by up to 9X.

@arsenm
Copy link
Contributor

arsenm commented Mar 30, 2025

Why is there such a limitation and would it be better to lift the limitation?

There are 2 issues. The primary issue is reporting register and other resource usage. We need some kind of module summary in the linked module to avoid assuming an unacceptable worst case scenario in the presence of any external calls.

A second issue is we don't have a way to codegen LDS which is referenced in external modules, but this is more of a niche usage that doesn't appear that often.

@yxsamliu
Copy link
Collaborator

Looks like this flag was added several years ago for AMDGPU (https://reviews.llvm.org/D99683). Do you know why it was sufficient then but not now?

I'm not sure if ThinLTO actually worked on AMDGPU back then. At least, it doesn't seem to be working at the moment. I'd assume @yxsamliu might have more insight on this.

It worked for some HIP apps.

I am still not clear about what issue the linkage available_externally causes. Can you give an example? Thanks.

@shiltian
Copy link
Contributor Author

Functions with available_externally linkage are never emitted into the object file. They're eliminated later in the pipeline, which eventually leads to an undefined symbol error.

@yxsamliu
Copy link
Collaborator

Functions with available_externally linkage are never emitted into the object file. They're eliminated later in the pipeline, which eventually leads to an undefined symbol error.

what if the same callee is called by more than one callers in two different object files? If global linkage is used, will that cause duplicate symbols in the final executable? Can we use linkonce_odr linkage instead so that duplicate symbols are allowed or merged?

@shiltian
Copy link
Contributor Author

shiltian commented Apr 1, 2025

linkonce_odr doesn't always work. For example, let's say we have a function foo, and it is imported into two modules A and B. Due to the different code in A and B, foo is optimized in different ways such that the two foo are no longer interchangeable.

With this being said, I think we need to mark all imported symbols as "internal" to avoid duplicate symbols.

After a second thought, I think the case mentioned above should not happen, if the optimization is right. For AMDGPU, we have internalization so even though if they are marked as linkonce_odr, they should be internalized first and then the optimization above could happen.

However, to make this practically useful for GPU, we might want to mark all imported symbols as internal. We really want to see all the call sites, but I think that might be beyond the general scope of ThinLTO.

@yxsamliu
Copy link
Collaborator

yxsamliu commented Apr 1, 2025

linkonce_odr doesn't always work. For example, let's say we have a function foo, and it is imported into two modules A and B. Due to the different code in A and B, foo is optimized in different ways such that the two foo are no longer interchangeable.

With this being said, I think we need to mark all imported symbols as "internal" to avoid duplicate symbols.

After a second thought, I think the case mentioned above should not happen, if the optimization is right. For AMDGPU, we have internalization so even though if they are marked as linkonce_odr, they should be internalized first and then the optimization above could happen.

However, to make this practically useful for GPU, we might want to mark all imported symbols as internal. We really want to see all the call sites, but I think that might be beyond the general scope of ThinLTO.

Does this patch change the linkage of both functions and variables? Duplicating variables may not work.

@shiltian
Copy link
Contributor Author

shiltian commented Apr 5, 2025

I'm gonna use something else to fix the issue. Close this one for now.

@shiltian shiltian closed this Apr 5, 2025
@shiltian shiltian deleted the users/shiltian/thin-lto-import-all-functions branch April 5, 2025 02:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

llvm:transforms LTO Link time optimization (regular/full LTO or ThinLTO)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants