Skip to content

Conversation

@jhuber6
Copy link
Contributor

@jhuber6 jhuber6 commented Apr 3, 2024

Summary:
We have logic that currently sets values like maxntidx in the PTX to
be 128 if it was not specified by the user because this is what OpenMP
defaults to. This is not strictly correct in the OpenMP case because the
user can increase this with environment variables. Additionally, this
violates the CUDA linking ABI. So if we link code that comes from OpenMP
with a cubin from another project the more strict thread limit of
128 will result in the kernel sometimes failing to link with a
function compiled against a generic target of 1024 due to an inconsistency
in the number of registers required. I.e. a kernel with lower register requirements
cannot link against a function with higher .This was observed when doing
cubin linking against some libc functions.

Summary:
We have logic that currently sets values like `maxntidx` in the PTX to
be 128 if it was not specified by the user because this is what OpenMP
defaults to. This is not strictly correct in the OpenMP case because the
user can increase this with environment variables. Additionally, this
violates the CUDA linking ABI. So if we link code that comes from OpenMP
with a `cubin` from another project the more strict thread limit of
`128` will result in the kernel sometimes failing to link with a
function compiled against a generic target of `1024`. This was observed
when doing `cubin` linking against some `libc` functions.
@llvmbot llvmbot added clang Clang issues not falling into any other category flang:openmp clang:openmp OpenMP related changes to Clang labels Apr 3, 2024
@llvmbot
Copy link
Member

llvmbot commented Apr 3, 2024

@llvm/pr-subscribers-clang

@llvm/pr-subscribers-flang-openmp

Author: Joseph Huber (jhuber6)

Changes

Summary:
We have logic that currently sets values like maxntidx in the PTX to
be 128 if it was not specified by the user because this is what OpenMP
defaults to. This is not strictly correct in the OpenMP case because the
user can increase this with environment variables. Additionally, this
violates the CUDA linking ABI. So if we link code that comes from OpenMP
with a cubin from another project the more strict thread limit of
128 will result in the kernel sometimes failing to link with a
function compiled against a generic target of 1024. This was observed
when doing cubin linking against some libc functions.


Full diff: https://github.com/llvm/llvm-project/pull/87558.diff

3 Files Affected:

  • (modified) clang/test/OpenMP/thread_limit_nvptx.c (+1-1)
  • (modified) llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h (+1-1)
  • (modified) llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp (+9-6)
diff --git a/clang/test/OpenMP/thread_limit_nvptx.c b/clang/test/OpenMP/thread_limit_nvptx.c
index 2132e1aa7834a0..8444e77662d2f2 100644
--- a/clang/test/OpenMP/thread_limit_nvptx.c
+++ b/clang/test/OpenMP/thread_limit_nvptx.c
@@ -7,7 +7,7 @@
 #define HEADER
 
 void foo(int N) {
-// CHECK: l11, !"maxntidx", i32 128}
+// CHECK-NOT: l11, !"maxntidx", i32 128}
 #pragma omp target teams distribute parallel for simd
   for (int i = 0; i < N; ++i)
     ;
diff --git a/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h b/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h
index c9ee0c25194c23..cc02b90cd8d210 100644
--- a/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h
+++ b/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h
@@ -2107,7 +2107,7 @@ class OpenMPIRBuilder {
   static std::pair<int32_t, int32_t> readTeamBoundsForKernel(const Triple &T,
                                                              Function &Kernel);
   static void writeTeamsForKernel(const Triple &T, Function &Kernel, int32_t LB,
-                                  int32_t UB);
+                                  int32_t UB, int32_t NTid);
   ///}
 
 private:
diff --git a/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp b/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
index 16507a69ea8502..5791275a292012 100644
--- a/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
+++ b/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
@@ -4554,16 +4554,16 @@ OpenMPIRBuilder::createTargetInit(const LocationDescription &Loc, bool IsSPMD,
   // Manifest the launch configuration in the metadata matching the kernel
   // environment.
   if (MinTeamsVal > 1 || MaxTeamsVal > 0)
-    writeTeamsForKernel(T, *Kernel, MinTeamsVal, MaxTeamsVal);
+    writeTeamsForKernel(T, *Kernel, MinTeamsVal, MaxTeamsVal, MaxThreadsVal);
+
+  if (MaxThreadsVal > 0)
+    writeThreadBoundsForKernel(T, *Kernel, MinThreadsVal, MaxThreadsVal);
 
   // For max values, < 0 means unset, == 0 means set but unknown.
   if (MaxThreadsVal < 0)
     MaxThreadsVal = std::max(
         int32_t(getGridValue(T, Kernel).GV_Default_WG_Size), MinThreadsVal);
 
-  if (MaxThreadsVal > 0)
-    writeThreadBoundsForKernel(T, *Kernel, MinThreadsVal, MaxThreadsVal);
-
   Constant *MinThreads = ConstantInt::getSigned(Int32, MinThreadsVal);
   Constant *MaxThreads = ConstantInt::getSigned(Int32, MaxThreadsVal);
   Constant *MinTeams = ConstantInt::getSigned(Int32, MinTeamsVal);
@@ -4785,11 +4785,14 @@ OpenMPIRBuilder::readTeamBoundsForKernel(const Triple &, Function &Kernel) {
 }
 
 void OpenMPIRBuilder::writeTeamsForKernel(const Triple &T, Function &Kernel,
-                                          int32_t LB, int32_t UB) {
+                                          int32_t LB, int32_t UB,
+                                          int32_t NTid) {
   if (T.isNVPTX()) {
     if (UB > 0)
       updateNVPTXMetadata(Kernel, "maxclusterrank", UB, true);
-    updateNVPTXMetadata(Kernel, "minctasm", LB, false);
+    // The 'minctasm' attribute is ignored without 'maxntid' also being set.
+    if (NTid > 0)
+      updateNVPTXMetadata(Kernel, "minctasm", LB, false);
   }
   Kernel.addFnAttr("omp_target_num_teams", std::to_string(LB));
 }

@jdoerfert
Copy link
Member

We have logic that currently sets values like maxntidx in the PTX to
be 128 if it was not specified by the user because this is what OpenMP
defaults to.

On NVIDIA architectures, yes.

This is not strictly correct in the OpenMP case because the
user can increase this with environment variables.

Up to the standard we implement, this is not true.
The thread limit ICV describes an upper bound, not a lower bound or range.
The new standard allows a strict thread bound via the clause, however.
That said, when a clause is present we will not set an "arbitrary" default value.

All that said, I see the issue. However, making our "default" slower to support cubin linking is not something I am very excited about. Honestly, I don't think we should support cubin linking at all (for libc), and if so it should come with warnings telling users they are on their own. This is, after all, not different than CUDA where you need to match the launch bounds across your dependences.

@jhuber6 jhuber6 closed this Apr 25, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

clang:openmp OpenMP related changes to Clang clang Clang issues not falling into any other category flang:openmp

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants