clang/AMDGPU: Set amdgpu-max-num-workgroups to disable Y/Z by default #119009
Conversation
Only OpenCL supports 2D and 3D dispatches; for the other languages, the Y and Z dimensions are always 1. Some of the generated OpenMP functions don't seem to get the correct attributes. The kernels do, but the callable __omp_offloading functions are missing them for some reason.
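To illustrate the intended behavior, here is a minimal HIP sketch (kernel names are hypothetical; the explicit form uses Clang's existing amdgpu_max_num_work_groups attribute):

// With this change, a HIP kernel with no explicit attribute should receive the
// default "amdgpu-max-num-workgroups"="4294967295,1,1", since HIP dispatches
// are treated as one-dimensional here.
__global__ void plain_kernel() {}

// An explicit attribute still takes precedence over the default, yielding
// "amdgpu-max-num-workgroups"="16,8,4".
__attribute__((amdgpu_max_num_work_groups(16, 8, 4)))
__global__ void bounded_kernel() {}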
This stack of pull requests is managed by Graphite. Learn more about stacking.
@llvm/pr-subscribers-clang-codegen @llvm/pr-subscribers-backend-amdgpu

Author: Matt Arsenault (arsenm)

Changes: Only OpenCL supports 2D and 3D dispatches; for the other languages, the Y and Z dimensions are always 1.

Full diff: https://github.com/llvm/llvm-project/pull/119009.diff

5 Files Affected:
diff --git a/clang/include/clang/Basic/LangOptions.h b/clang/include/clang/Basic/LangOptions.h
index 949c8f5d448bcf..d5532eec0a683e 100644
--- a/clang/include/clang/Basic/LangOptions.h
+++ b/clang/include/clang/Basic/LangOptions.h
@@ -706,6 +706,10 @@ class LangOptions : public LangOptionsBase {
return OpenCL || CUDA;
}
+ /// Return true if the dispatch size for an offload language only uses one
+ /// dimension.
+ bool gridSizeIsOneDimension() const { return CUDA || HIP || OpenMP; }
+
/// Return the OpenCL C or C++ version as a VersionTuple.
VersionTuple getOpenCLVersionTuple() const;
diff --git a/clang/lib/CodeGen/Targets/AMDGPU.cpp b/clang/lib/CodeGen/Targets/AMDGPU.cpp
index 56ad0503a11ab2..904e03b3cc7182 100644
--- a/clang/lib/CodeGen/Targets/AMDGPU.cpp
+++ b/clang/lib/CodeGen/Targets/AMDGPU.cpp
@@ -377,29 +377,6 @@ void AMDGPUTargetCodeGenInfo::setFunctionDeclAttributes(
if (NumVGPR != 0)
F->addFnAttr("amdgpu-num-vgpr", llvm::utostr(NumVGPR));
}
-
- if (const auto *Attr = FD->getAttr<AMDGPUMaxNumWorkGroupsAttr>()) {
- uint32_t X = Attr->getMaxNumWorkGroupsX()
- ->EvaluateKnownConstInt(M.getContext())
- .getExtValue();
- // Y and Z dimensions default to 1 if not specified
- uint32_t Y = Attr->getMaxNumWorkGroupsY()
- ? Attr->getMaxNumWorkGroupsY()
- ->EvaluateKnownConstInt(M.getContext())
- .getExtValue()
- : 1;
- uint32_t Z = Attr->getMaxNumWorkGroupsZ()
- ? Attr->getMaxNumWorkGroupsZ()
- ->EvaluateKnownConstInt(M.getContext())
- .getExtValue()
- : 1;
-
- llvm::SmallString<32> AttrVal;
- llvm::raw_svector_ostream OS(AttrVal);
- OS << X << ',' << Y << ',' << Z;
-
- F->addFnAttr("amdgpu-max-num-workgroups", AttrVal.str());
- }
}
/// Emits control constants used to change per-architecture behaviour in the
@@ -450,9 +427,40 @@ void AMDGPUTargetCodeGenInfo::setTargetAttributes(
if (!F)
return;
+ // TODO: Use AttrBuilder
const FunctionDecl *FD = dyn_cast_or_null<FunctionDecl>(D);
- if (FD)
+ const AMDGPUMaxNumWorkGroupsAttr *MaxNumWorkGroupsAttr = nullptr;
+ if (FD) {
setFunctionDeclAttributes(FD, F, M);
+ MaxNumWorkGroupsAttr = FD->getAttr<AMDGPUMaxNumWorkGroupsAttr>();
+ }
+
+ if (MaxNumWorkGroupsAttr) {
+ const auto *Attr = MaxNumWorkGroupsAttr;
+ uint32_t X = Attr->getMaxNumWorkGroupsX()
+ ->EvaluateKnownConstInt(M.getContext())
+ .getExtValue();
+ // Y and Z dimensions default to 1 if not specified
+ uint32_t Y = Attr->getMaxNumWorkGroupsY()
+ ? Attr->getMaxNumWorkGroupsY()
+ ->EvaluateKnownConstInt(M.getContext())
+ .getExtValue()
+ : 1;
+ uint32_t Z = Attr->getMaxNumWorkGroupsZ()
+ ? Attr->getMaxNumWorkGroupsZ()
+ ->EvaluateKnownConstInt(M.getContext())
+ .getExtValue()
+ : 1;
+
+ llvm::SmallString<32> AttrVal;
+ llvm::raw_svector_ostream OS(AttrVal);
+ OS << X << ',' << Y << ',' << Z;
+
+ F->addFnAttr("amdgpu-max-num-workgroups", AttrVal.str());
+ } else if (M.getLangOpts().gridSizeIsOneDimension()) {
+ // If the language only has 1D dispatches, disable Y/Z by default.
+ F->addFnAttr("amdgpu-max-num-workgroups", "4294967295,1,1");
+ }
if (!getABIInfo().getCodeGenOpts().EmitIEEENaNCompliantInsts)
F->addFnAttr("amdgpu-ieee", "false");
diff --git a/clang/test/CodeGenHIP/default-attributes.hip b/clang/test/CodeGenHIP/default-attributes.hip
index 1b53ebec9b5821..1a2cc42828c2f6 100644
--- a/clang/test/CodeGenHIP/default-attributes.hip
+++ b/clang/test/CodeGenHIP/default-attributes.hip
@@ -34,9 +34,9 @@ __global__ void kernel() {
extern_func();
}
//.
-// OPTNONE: attributes #[[ATTR0]] = { convergent mustprogress noinline nounwind optnone "no-trapping-math"="true" "stack-protector-buffer-size"="8" }
+// OPTNONE: attributes #[[ATTR0]] = { convergent mustprogress noinline nounwind optnone "amdgpu-max-num-workgroups"="4294967295,1,1" "no-trapping-math"="true" "stack-protector-buffer-size"="8" }
// OPTNONE: attributes #[[ATTR1:[0-9]+]] = { convergent nounwind "no-trapping-math"="true" "stack-protector-buffer-size"="8" }
-// OPTNONE: attributes #[[ATTR2]] = { convergent mustprogress noinline norecurse nounwind optnone "amdgpu-flat-work-group-size"="1,1024" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "uniform-work-group-size"="true" }
+// OPTNONE: attributes #[[ATTR2]] = { convergent mustprogress noinline norecurse nounwind optnone "amdgpu-flat-work-group-size"="1,1024" "amdgpu-max-num-workgroups"="4294967295,1,1" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "uniform-work-group-size"="true" }
// OPTNONE: attributes #[[ATTR3]] = { convergent nounwind }
//.
// OPTNONE: [[META0:![0-9]+]] = !{i32 1, !"amdhsa_code_object_version", i32 500}
diff --git a/clang/test/OpenMP/amdgcn-attributes.cpp b/clang/test/OpenMP/amdgcn-attributes.cpp
index 2c9e16a4f5098e..270cc225d05da2 100644
--- a/clang/test/OpenMP/amdgcn-attributes.cpp
+++ b/clang/test/OpenMP/amdgcn-attributes.cpp
@@ -31,10 +31,10 @@ int callable(int x) {
return x + 1;
}
-// DEFAULT: attributes #0 = { convergent mustprogress noinline norecurse nounwind optnone "amdgpu-flat-work-group-size"="1,42" "kernel" "no-trapping-math"="true" "omp_target_thread_limit"="42" "stack-protector-buffer-size"="8" "uniform-work-group-size"="true" }
-// CPU: attributes #0 = { convergent mustprogress noinline norecurse nounwind optnone "amdgpu-flat-work-group-size"="1,42" "kernel" "no-trapping-math"="true" "omp_target_thread_limit"="42" "stack-protector-buffer-size"="8" "target-cpu"="gfx900" "target-features"="+16-bit-insts,+ci-insts,+dpp,+gfx8-insts,+gfx9-insts,+s-memrealtime,+s-memtime-inst,+wavefrontsize64" "uniform-work-group-size"="true" }
-// NOIEEE: attributes #0 = { convergent mustprogress noinline norecurse nounwind optnone "amdgpu-flat-work-group-size"="1,42" "amdgpu-ieee"="false" "kernel" "no-nans-fp-math"="true" "no-trapping-math"="true" "omp_target_thread_limit"="42" "stack-protector-buffer-size"="8" "uniform-work-group-size"="true" }
+// DEFAULT: attributes #0 = { convergent mustprogress noinline norecurse nounwind optnone "amdgpu-flat-work-group-size"="1,42" "amdgpu-max-num-workgroups"="4294967295,1,1" "kernel" "no-trapping-math"="true" "omp_target_thread_limit"="42" "stack-protector-buffer-size"="8" "uniform-work-group-size"="true" }
+// CPU: attributes #0 = { convergent mustprogress noinline norecurse nounwind optnone "amdgpu-flat-work-group-size"="1,42" "amdgpu-max-num-workgroups"="4294967295,1,1" "kernel" "no-trapping-math"="true" "omp_target_thread_limit"="42" "stack-protector-buffer-size"="8" "target-cpu"="gfx900" "target-features"="+16-bit-insts,+ci-insts,+dpp,+gfx8-insts,+gfx9-insts,+s-memrealtime,+s-memtime-inst,+wavefrontsize64" "uniform-work-group-size"="true" }
+// NOIEEE: attributes #0 = { convergent mustprogress noinline norecurse nounwind optnone "amdgpu-flat-work-group-size"="1,42" "amdgpu-ieee"="false" "amdgpu-max-num-workgroups"="4294967295,1,1" "kernel" "no-nans-fp-math"="true" "no-trapping-math"="true" "omp_target_thread_limit"="42" "stack-protector-buffer-size"="8" "uniform-work-group-size"="true" }
-// DEFAULT: attributes #2 = { convergent mustprogress noinline nounwind optnone "no-trapping-math"="true" "stack-protector-buffer-size"="8" }
-// CPU: attributes #2 = { convergent mustprogress noinline nounwind optnone "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="gfx900" "target-features"="+16-bit-insts,+ci-insts,+dpp,+gfx8-insts,+gfx9-insts,+s-memrealtime,+s-memtime-inst,+wavefrontsize64" }
-// NOIEEE: attributes #2 = { convergent mustprogress noinline nounwind optnone "amdgpu-ieee"="false" "no-nans-fp-math"="true" "no-trapping-math"="true" "stack-protector-buffer-size"="8" }
+// DEFAULT: attributes #2 = { convergent mustprogress noinline nounwind optnone "amdgpu-max-num-workgroups"="4294967295,1,1" "no-trapping-math"="true" "stack-protector-buffer-size"="8" }
+// CPU: attributes #2 = { convergent mustprogress noinline nounwind optnone "amdgpu-max-num-workgroups"="4294967295,1,1" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="gfx900" "target-features"="+16-bit-insts,+ci-insts,+dpp,+gfx8-insts,+gfx9-insts,+s-memrealtime,+s-memtime-inst,+wavefrontsize64" }
+// NOIEEE: attributes #2 = { convergent mustprogress noinline nounwind optnone "amdgpu-ieee"="false" "amdgpu-max-num-workgroups"="4294967295,1,1" "no-nans-fp-math"="true" "no-trapping-math"="true" "stack-protector-buffer-size"="8" }
diff --git a/clang/test/OpenMP/amdgcn_target_global_constructor.cpp b/clang/test/OpenMP/amdgcn_target_global_constructor.cpp
index 9f1e68d4ea0fec..ffde39479761c4 100644
--- a/clang/test/OpenMP/amdgcn_target_global_constructor.cpp
+++ b/clang/test/OpenMP/amdgcn_target_global_constructor.cpp
@@ -98,7 +98,7 @@ S A;
//
//.
// CHECK: attributes #[[ATTR0]] = { convergent noinline nounwind "no-trapping-math"="true" "stack-protector-buffer-size"="8" }
-// CHECK: attributes #[[ATTR1]] = { convergent mustprogress noinline nounwind optnone "no-trapping-math"="true" "stack-protector-buffer-size"="8" }
+// CHECK: attributes #[[ATTR1]] = { convergent mustprogress noinline nounwind optnone "amdgpu-max-num-workgroups"="4294967295,1,1" "no-trapping-math"="true" "stack-protector-buffer-size"="8" }
// CHECK: attributes #[[ATTR2:[0-9]+]] = { convergent "no-trapping-math"="true" "stack-protector-buffer-size"="8" }
// CHECK: attributes #[[ATTR3]] = { convergent }
// CHECK: attributes #[[ATTR4]] = { convergent nounwind }
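A note on the magic number in the new default: 4294967295 is UINT32_MAX, so the attribute string leaves the X dimension effectively unbounded while clamping Y and Z to 1. A minimal standalone check of that reading:

#include <cstdint>
#include <limits>

// The "4294967295" in "amdgpu-max-num-workgroups"="4294967295,1,1" is
// UINT32_MAX, i.e. "no limit" on X, with Y and Z capped at 1.
static_assert(4294967295u == std::numeric_limits<std::uint32_t>::max(),
              "default X bound is UINT32_MAX");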
/// Return true if the dispatch size for an offload language only uses one
/// dimension.
bool gridSizeIsOneDimension() const { return CUDA || HIP || OpenMP; }
CUDA and HIP don't support 3D grids? Did I misunderstand?
Yes
The OpenMP GPU CodeGen only calls it on the target entry point.
It should be calling it on every single function.
So this is wrong. It was correct for OpenMP. Is there a language version we can check for where 3D grids were introduced?
Even OpenMP is discussing introducing multi-dimensional support, though in LLVM I already implemented it as an extension. :-)
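For context on the 3D-grid question above: CUDA and HIP launch syntax does accept a three-dimensional grid, so dispatches with Y or Z greater than 1 are expressible at the source level. A minimal HIP sketch with a hypothetical kernel:

#include <hip/hip_runtime.h>

__global__ void my_kernel() {}  // hypothetical empty kernel

int main() {
  dim3 grid(4, 2, 2);   // gridDim.y and gridDim.z are both > 1
  dim3 block(64, 1, 1);
  hipLaunchKernelGGL(my_kernel, grid, block, /*sharedMemBytes=*/0, /*stream=*/0);
  (void)hipDeviceSynchronize();
  return 0;
}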
