[OpenMP] Do not emit default thread limits of 128 #87558
Conversation
Summary: We currently have logic that sets values like `maxntidx` in the PTX to 128 when the user does not specify a limit, because that is what OpenMP defaults to. This is not strictly correct in the OpenMP case, because the user can increase the limit with environment variables. Additionally, it violates the CUDA linking ABI: if we link code that comes from OpenMP with a `cubin` from another project, the stricter thread limit of `128` will sometimes cause the kernel to fail to link against a function compiled for a generic target of `1024`. This was observed when doing `cubin` linking against some `libc` functions.
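For context, this default shows up as an entry in the module's `!nvvm.annotations` metadata, which is what the modified test checks for. A minimal sketch of the annotation the OpenMP path previously emitted (the kernel symbol here is illustrative, not from the PR):

```llvm
; Illustrative only: the kernel symbol is made up, but the annotation shape
; matches what the test's !"maxntidx", i32 128 CHECK line was matching.
!nvvm.annotations = !{!0}
!0 = !{ptr @__omp_offloading_example_kernel, !"maxntidx", i32 128}
```

The NVPTX backend lowers this annotation to a `.maxntid` directive in the emitted PTX, which is the per-kernel limit that then has to be consistent across `cubin` inputs at device link time.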
@llvm/pr-subscribers-clang @llvm/pr-subscribers-flang-openmp

Author: Joseph Huber (jhuber6)

Full diff: https://github.com/llvm/llvm-project/pull/87558.diff

3 Files Affected:
diff --git a/clang/test/OpenMP/thread_limit_nvptx.c b/clang/test/OpenMP/thread_limit_nvptx.c
index 2132e1aa7834a0..8444e77662d2f2 100644
--- a/clang/test/OpenMP/thread_limit_nvptx.c
+++ b/clang/test/OpenMP/thread_limit_nvptx.c
@@ -7,7 +7,7 @@
#define HEADER
void foo(int N) {
-// CHECK: l11, !"maxntidx", i32 128}
+// CHECK-NOT: l11, !"maxntidx", i32 128}
#pragma omp target teams distribute parallel for simd
for (int i = 0; i < N; ++i)
;
diff --git a/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h b/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h
index c9ee0c25194c23..cc02b90cd8d210 100644
--- a/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h
+++ b/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h
@@ -2107,7 +2107,7 @@ class OpenMPIRBuilder {
static std::pair<int32_t, int32_t> readTeamBoundsForKernel(const Triple &T,
Function &Kernel);
static void writeTeamsForKernel(const Triple &T, Function &Kernel, int32_t LB,
- int32_t UB);
+ int32_t UB, int32_t NTid);
///}
private:
diff --git a/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp b/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
index 16507a69ea8502..5791275a292012 100644
--- a/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
+++ b/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
@@ -4554,16 +4554,16 @@ OpenMPIRBuilder::createTargetInit(const LocationDescription &Loc, bool IsSPMD,
// Manifest the launch configuration in the metadata matching the kernel
// environment.
if (MinTeamsVal > 1 || MaxTeamsVal > 0)
- writeTeamsForKernel(T, *Kernel, MinTeamsVal, MaxTeamsVal);
+ writeTeamsForKernel(T, *Kernel, MinTeamsVal, MaxTeamsVal, MaxThreadsVal);
+
+ if (MaxThreadsVal > 0)
+ writeThreadBoundsForKernel(T, *Kernel, MinThreadsVal, MaxThreadsVal);
// For max values, < 0 means unset, == 0 means set but unknown.
if (MaxThreadsVal < 0)
MaxThreadsVal = std::max(
int32_t(getGridValue(T, Kernel).GV_Default_WG_Size), MinThreadsVal);
- if (MaxThreadsVal > 0)
- writeThreadBoundsForKernel(T, *Kernel, MinThreadsVal, MaxThreadsVal);
-
Constant *MinThreads = ConstantInt::getSigned(Int32, MinThreadsVal);
Constant *MaxThreads = ConstantInt::getSigned(Int32, MaxThreadsVal);
Constant *MinTeams = ConstantInt::getSigned(Int32, MinTeamsVal);
@@ -4785,11 +4785,14 @@ OpenMPIRBuilder::readTeamBoundsForKernel(const Triple &, Function &Kernel) {
}
void OpenMPIRBuilder::writeTeamsForKernel(const Triple &T, Function &Kernel,
- int32_t LB, int32_t UB) {
+ int32_t LB, int32_t UB,
+ int32_t NTid) {
if (T.isNVPTX()) {
if (UB > 0)
updateNVPTXMetadata(Kernel, "maxclusterrank", UB, true);
- updateNVPTXMetadata(Kernel, "minctasm", LB, false);
+ // The 'minctasm' attribute is ignored without 'maxntid' also being set.
+ if (NTid > 0)
+ updateNVPTXMetadata(Kernel, "minctasm", LB, false);
}
Kernel.addFnAttr("omp_target_num_teams", std::to_string(LB));
}
On NVIDIA architectures, yes.
Up to the standard we implement, this is not true. All that said, I see the issue. However, making our "default" slower to support cubin linking is not something I am very excited about. Honestly, I don't think we should support cubin linking at all (for libc), and if so, it should come with warnings telling users they are on their own. This is, after all, no different than CUDA, where you need to match the launch bounds across your dependences.
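The launch-bounds analogy in the comment above can be sketched in CUDA terms (all names here are illustrative, not from the PR): a strict bound on one side of a device link behaves like mismatched `__launch_bounds__` between separately compiled translation units.

```cuda
// helper.cu -- compiled separately (e.g. with -dc); no launch bounds,
// so it is compiled against the generic 1024-thread-per-block limit.
__device__ int helper(int x) { return x * 2; }

// kernel.cu -- carries a strict bound, analogous to the "maxntid 128"
// annotation the PR stops emitting by default. Device-linking this
// against helper() can fail for the register-budget reasons the PR
// summary describes.
extern __device__ int helper(int x);

__global__ void __launch_bounds__(128) kernel(int *out) {
  out[threadIdx.x] = helper(threadIdx.x);
}
```

As with the PR's `cubin` case, the fix on the CUDA side is to compile both translation units against compatible bounds, which is exactly the matching the comment says users must do themselves.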
Summary:
We have logic that currently sets values like `maxntidx` in the PTX to be 128 if it was not specified by the user, because this is what OpenMP defaults to. This is not strictly correct in the OpenMP case because the user can increase this with environment variables. Additionally, this violates the CUDA linking ABI. So if we link code that comes from OpenMP with a `cubin` from another project, the more strict thread limit of `128` will result in the kernel sometimes failing to link with a function compiled against a generic target of `1024`, due to an inconsistency in the number of registers required. I.e., a kernel with lower register requirements cannot link against a function with higher ones. This was observed when doing `cubin` linking against some `libc` functions.