[OpenMP] Do not emit default thread limits of 128 #87558

jhuber6 · 2024-04-03T21:03:25Z

Summary:
We have logic that currently sets values like maxntidx in the PTX to
be 128 if it was not specified by the user because this is what OpenMP
defaults to. This is not strictly correct in the OpenMP case because the
user can increase this with environment variables. Additionally, this
violates the CUDA linking ABI. So if we link code that comes from OpenMP
with a cubin from another project the more strict thread limit of
128 will result in the kernel sometimes failing to link with a
function compiled against a generic target of 1024 due to an inconsistency
in the number of registers required. I.e. a kernel with lower register requirements
cannot link against a function with higher .This was observed when doing
cubin linking against some libc functions.

Summary: We have logic that currently sets values like `maxntidx` in the PTX to be 128 if it was not specified by the user because this is what OpenMP defaults to. This is not strictly correct in the OpenMP case because the user can increase this with environment variables. Additionally, this violates the CUDA linking ABI. So if we link code that comes from OpenMP with a `cubin` from another project the more strict thread limit of `128` will result in the kernel sometimes failing to link with a function compiled against a generic target of `1024`. This was observed when doing `cubin` linking against some `libc` functions.

llvmbot · 2024-04-03T21:03:59Z

@llvm/pr-subscribers-clang

@llvm/pr-subscribers-flang-openmp

Author: Joseph Huber (jhuber6)

Changes

Summary:
We have logic that currently sets values like maxntidx in the PTX to
be 128 if it was not specified by the user because this is what OpenMP
defaults to. This is not strictly correct in the OpenMP case because the
user can increase this with environment variables. Additionally, this
violates the CUDA linking ABI. So if we link code that comes from OpenMP
with a cubin from another project the more strict thread limit of
128 will result in the kernel sometimes failing to link with a
function compiled against a generic target of 1024. This was observed
when doing cubin linking against some libc functions.

Full diff: https://github.com/llvm/llvm-project/pull/87558.diff

3 Files Affected:

(modified) clang/test/OpenMP/thread_limit_nvptx.c (+1-1)
(modified) llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h (+1-1)
(modified) llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp (+9-6)

diff --git a/clang/test/OpenMP/thread_limit_nvptx.c b/clang/test/OpenMP/thread_limit_nvptx.c
index 2132e1aa7834a0..8444e77662d2f2 100644
--- a/clang/test/OpenMP/thread_limit_nvptx.c
+++ b/clang/test/OpenMP/thread_limit_nvptx.c
@@ -7,7 +7,7 @@
 #define HEADER
 
 void foo(int N) {
-// CHECK: l11, !"maxntidx", i32 128}
+// CHECK-NOT: l11, !"maxntidx", i32 128}
 #pragma omp target teams distribute parallel for simd
   for (int i = 0; i < N; ++i)
     ;
diff --git a/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h b/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h
index c9ee0c25194c23..cc02b90cd8d210 100644
--- a/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h
+++ b/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h
@@ -2107,7 +2107,7 @@ class OpenMPIRBuilder {
   static std::pair<int32_t, int32_t> readTeamBoundsForKernel(const Triple &T,
                                                              Function &Kernel);
   static void writeTeamsForKernel(const Triple &T, Function &Kernel, int32_t LB,
-                                  int32_t UB);
+                                  int32_t UB, int32_t NTid);
   ///}
 
 private:
diff --git a/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp b/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
index 16507a69ea8502..5791275a292012 100644
--- a/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
+++ b/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
@@ -4554,16 +4554,16 @@ OpenMPIRBuilder::createTargetInit(const LocationDescription &Loc, bool IsSPMD,
   // Manifest the launch configuration in the metadata matching the kernel
   // environment.
   if (MinTeamsVal > 1 || MaxTeamsVal > 0)
-    writeTeamsForKernel(T, *Kernel, MinTeamsVal, MaxTeamsVal);
+    writeTeamsForKernel(T, *Kernel, MinTeamsVal, MaxTeamsVal, MaxThreadsVal);
+
+  if (MaxThreadsVal > 0)
+    writeThreadBoundsForKernel(T, *Kernel, MinThreadsVal, MaxThreadsVal);
 
   // For max values, < 0 means unset, == 0 means set but unknown.
   if (MaxThreadsVal < 0)
     MaxThreadsVal = std::max(
         int32_t(getGridValue(T, Kernel).GV_Default_WG_Size), MinThreadsVal);
 
-  if (MaxThreadsVal > 0)
-    writeThreadBoundsForKernel(T, *Kernel, MinThreadsVal, MaxThreadsVal);
-
   Constant *MinThreads = ConstantInt::getSigned(Int32, MinThreadsVal);
   Constant *MaxThreads = ConstantInt::getSigned(Int32, MaxThreadsVal);
   Constant *MinTeams = ConstantInt::getSigned(Int32, MinTeamsVal);
@@ -4785,11 +4785,14 @@ OpenMPIRBuilder::readTeamBoundsForKernel(const Triple &, Function &Kernel) {
 }
 
 void OpenMPIRBuilder::writeTeamsForKernel(const Triple &T, Function &Kernel,
-                                          int32_t LB, int32_t UB) {
+                                          int32_t LB, int32_t UB,
+                                          int32_t NTid) {
   if (T.isNVPTX()) {
     if (UB > 0)
       updateNVPTXMetadata(Kernel, "maxclusterrank", UB, true);
-    updateNVPTXMetadata(Kernel, "minctasm", LB, false);
+    // The 'minctasm' attribute is ignored without 'maxntid' also being set.
+    if (NTid > 0)
+      updateNVPTXMetadata(Kernel, "minctasm", LB, false);
   }
   Kernel.addFnAttr("omp_target_num_teams", std::to_string(LB));
 }

jdoerfert · 2024-04-04T18:36:16Z

We have logic that currently sets values like maxntidx in the PTX to
be 128 if it was not specified by the user because this is what OpenMP
defaults to.

On NVIDIA architectures, yes.

This is not strictly correct in the OpenMP case because the
user can increase this with environment variables.

Up to the standard we implement, this is not true.
The thread limit ICV describes an upper bound, not a lower bound or range.
The new standard allows a strict thread bound via the clause, however.
That said, when a clause is present we will not set an "arbitrary" default value.

All that said, I see the issue. However, making our "default" slower to support cubin linking is not something I am very excited about. Honestly, I don't think we should support cubin linking at all (for libc), and if so it should come with warnings telling users they are on their own. This is, after all, not different than CUDA where you need to match the launch bounds across your dependences.

jhuber6 requested review from Artem-B, Meinersbur, jdoerfert, shiltian and yxsamliu April 3, 2024 21:03

llvmbot added clang Clang issues not falling into any other category flang:openmp clang:openmp OpenMP related changes to Clang labels Apr 3, 2024

jhuber6 closed this Apr 25, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[OpenMP] Do not emit default thread limits of 128 #87558

[OpenMP] Do not emit default thread limits of 128 #87558

Uh oh!

jhuber6 commented Apr 3, 2024 •

edited

Loading

Uh oh!

llvmbot commented Apr 3, 2024 •

edited

Loading

Uh oh!

jdoerfert commented Apr 4, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

[OpenMP] Do not emit default thread limits of 128 #87558

[OpenMP] Do not emit default thread limits of 128 #87558

Uh oh!

Conversation

jhuber6 commented Apr 3, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

llvmbot commented Apr 3, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jdoerfert commented Apr 4, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

jhuber6 commented Apr 3, 2024 •

edited

Loading

llvmbot commented Apr 3, 2024 •

edited

Loading