
Conversation

@AlexVlx (Contributor) commented Nov 19, 2024

HIPAMD relies on the amdgpu_flat_work_group_size attribute to implement key functionality such as the __launch_bounds__ __global__ function annotation. This attribute is neither available in nor directly translatable to SPIR-V, so as things stand AMDGCN flavoured SPIR-V suffers from information loss.

This patch addresses that limitation by converting the unsupported attribute into the max_work_group_size attribute, which maps to MaxWorkgroupSizeINTEL and is available in / handled by SPIR-V. When reverse translating from SPIR-V to AMDGCN LLVM IR we invert the mapping and restore the original AMDGPU attribute.
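As a rough sketch of the forward mapping described above (the helper name is invented for illustration; the actual patch emits the metadata triple directly in `SPIR.cpp`): only the maximum of the flat work-group-size range survives, carried in the first component of a 3-D triple.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Illustrative only: model of how amdgpu_flat_work_group_size(Min, Max)
// is folded into the 3-D max_work_group_size triple. The minimum is
// dropped; the maximum lands in the X component, with Y and Z set to 1.
std::array<uint32_t, 3> toMaxWorkGroupSize(uint32_t /*Min*/, uint32_t Max) {
  return {Max, 1u, 1u};
}
```

For `amdgpu_flat_work_group_size(32, 64)` this yields `{64, 1, 1}`, matching the `!{i32 64, i32 1, i32 1}` metadata checked in the test updates.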

@llvmbot llvmbot added clang Clang issues not falling into any other category backend:AMDGPU clang:codegen IR generation bugs: mangling, exceptions, etc. labels Nov 19, 2024
@llvmbot (Member) commented Nov 19, 2024

@llvm/pr-subscribers-backend-amdgpu
@llvm/pr-subscribers-clang-codegen

@llvm/pr-subscribers-clang

Author: Alex Voicu (AlexVlx)

Changes

HIPAMD relies on the amdgpu_flat_work_group_size attribute to implement key functionality such as the __launch_bounds__ __global__ function annotation. This attribute is neither available in nor directly translatable to SPIR-V, so as things stand AMDGCN flavoured SPIR-V suffers from information loss.

This patch addresses that limitation by converting the unsupported attribute into the reqd_work_group_size attribute, which is available in and handled by SPIR-V.


Full diff: https://github.com/llvm/llvm-project/pull/116820.diff

2 Files Affected:

  • (modified) clang/lib/CodeGen/Targets/SPIR.cpp (+34)
  • (modified) clang/test/CodeGenCUDA/amdgpu-kernel-attrs.cu (+7)
diff --git a/clang/lib/CodeGen/Targets/SPIR.cpp b/clang/lib/CodeGen/Targets/SPIR.cpp
index a48fe9d5f1ee9c..c35d91b1f49af2 100644
--- a/clang/lib/CodeGen/Targets/SPIR.cpp
+++ b/clang/lib/CodeGen/Targets/SPIR.cpp
@@ -64,6 +64,8 @@ class SPIRVTargetCodeGenInfo : public CommonSPIRTargetCodeGenInfo {
   void setCUDAKernelCallingConvention(const FunctionType *&FT) const override;
   LangAS getGlobalVarAddressSpace(CodeGenModule &CGM,
                                   const VarDecl *D) const override;
+  void setTargetAttributes(const Decl *D, llvm::GlobalValue *GV,
+                           CodeGen::CodeGenModule &M) const override;
   llvm::SyncScope::ID getLLVMSyncScopeID(const LangOptions &LangOpts,
                                          SyncScope Scope,
                                          llvm::AtomicOrdering Ordering,
@@ -245,6 +247,38 @@ SPIRVTargetCodeGenInfo::getGlobalVarAddressSpace(CodeGenModule &CGM,
   return DefaultGlobalAS;
 }
 
+void SPIRVTargetCodeGenInfo::setTargetAttributes(
+    const Decl *D, llvm::GlobalValue *GV, CodeGen::CodeGenModule &M) const {
+  if (!M.getLangOpts().HIP ||
+      M.getTarget().getTriple().getVendor() != llvm::Triple::AMD)
+    return;
+  if (GV->isDeclaration())
+    return;
+
+  auto F = dyn_cast<llvm::Function>(GV);
+  if (!F)
+    return;
+
+  auto FD = dyn_cast_or_null<FunctionDecl>(D);
+  if (!FD)
+    return;
+  if (!FD->hasAttr<CUDAGlobalAttr>())
+    return;
+
+  unsigned N = M.getLangOpts().GPUMaxThreadsPerBlock;
+  if (auto FlatWGS = FD->getAttr<AMDGPUFlatWorkGroupSizeAttr>())
+    N = FlatWGS->getMax()->EvaluateKnownConstInt(M.getContext()).getExtValue();
+
+  auto Int32Ty = llvm::IntegerType::getInt32Ty(M.getLLVMContext());
+  llvm::Metadata *AttrMDArgs[] = {
+      llvm::ConstantAsMetadata::get(llvm::ConstantInt::get(Int32Ty, N)),
+      llvm::ConstantAsMetadata::get(llvm::ConstantInt::get(Int32Ty, 1)),
+      llvm::ConstantAsMetadata::get(llvm::ConstantInt::get(Int32Ty, 1))};
+
+  F->setMetadata("reqd_work_group_size",
+                 llvm::MDNode::get(M.getLLVMContext(), AttrMDArgs));
+}
+
 llvm::SyncScope::ID
 SPIRVTargetCodeGenInfo::getLLVMSyncScopeID(const LangOptions &, SyncScope Scope,
                                            llvm::AtomicOrdering,
diff --git a/clang/test/CodeGenCUDA/amdgpu-kernel-attrs.cu b/clang/test/CodeGenCUDA/amdgpu-kernel-attrs.cu
index 11a133fd1351d2..3d01ac40259254 100644
--- a/clang/test/CodeGenCUDA/amdgpu-kernel-attrs.cu
+++ b/clang/test/CodeGenCUDA/amdgpu-kernel-attrs.cu
@@ -4,6 +4,9 @@
 // RUN: %clang_cc1 -triple amdgcn-amd-amdhsa --gpu-max-threads-per-block=1024 \
 // RUN:     -fcuda-is-device -emit-llvm -o - -x hip %s \
 // RUN:     | FileCheck -check-prefixes=CHECK,MAX1024 %s
+// RUN: %clang_cc1 -triple spirv64-amd-amdhsa --gpu-max-threads-per-block=1024 \
+// RUN:     -fcuda-is-device -emit-llvm -o - -x hip %s \
+// RUN:     | FileCheck -check-prefixes=CHECK-SPIRV,MAX1024-SPIRV %s
 // RUN: %clang_cc1 -triple nvptx \
 // RUN:     -fcuda-is-device -emit-llvm -o - %s | FileCheck %s \
 // RUN:     -check-prefix=NAMD
@@ -21,12 +24,14 @@
 
 __global__ void flat_work_group_size_default() {
 // CHECK: define{{.*}} amdgpu_kernel void @_Z28flat_work_group_size_defaultv() [[FLAT_WORK_GROUP_SIZE_DEFAULT:#[0-9]+]]
+// CHECK-SPIRV: define{{.*}} spir_kernel void @_Z28flat_work_group_size_defaultv(){{.*}} !reqd_work_group_size [[REQD_WORK_GROUP_SIZE_DEFAULT:![0-9]+]]
 // NOUB: define{{.*}} void @_Z28flat_work_group_size_defaultv() [[NOUB:#[0-9]+]]
 }
 
 __attribute__((amdgpu_flat_work_group_size(32, 64))) // expected-no-diagnostics
 __global__ void flat_work_group_size_32_64() {
 // CHECK: define{{.*}} amdgpu_kernel void @_Z26flat_work_group_size_32_64v() [[FLAT_WORK_GROUP_SIZE_32_64:#[0-9]+]]
+// CHECK-SPIRV: define{{.*}} spir_kernel void @_Z26flat_work_group_size_32_64v(){{.*}} !reqd_work_group_size [[REQD_WORK_GROUP_SIZE_64:![0-9]+]]
 }
 __attribute__((amdgpu_waves_per_eu(2))) // expected-no-diagnostics
 __global__ void waves_per_eu_2() {
@@ -82,7 +87,9 @@ template __global__ void template_32_4_a_max_num_work_groups<2>();
 
 // DEFAULT-DAG: attributes [[FLAT_WORK_GROUP_SIZE_DEFAULT]] = {{.*}}"amdgpu-flat-work-group-size"="1,1024"{{.*}}"uniform-work-group-size"="true"
 // MAX1024-DAG: attributes [[FLAT_WORK_GROUP_SIZE_DEFAULT]] = {{.*}}"amdgpu-flat-work-group-size"="1,1024"
+// MAX1024-SPIRV-DAG: [[REQD_WORK_GROUP_SIZE_DEFAULT]] = !{i32 1024, i32 1, i32 1}
 // CHECK-DAG: attributes [[FLAT_WORK_GROUP_SIZE_32_64]] = {{.*}}"amdgpu-flat-work-group-size"="32,64"
+// CHECK-SPIRV-DAG: [[REQD_WORK_GROUP_SIZE_64]] = !{i32 64, i32 1, i32 1}
 // CHECK-DAG: attributes [[WAVES_PER_EU_2]] = {{.*}}"amdgpu-waves-per-eu"="2"
 // CHECK-DAG: attributes [[NUM_SGPR_32]] = {{.*}}"amdgpu-num-sgpr"="32"
 // CHECK-DAG: attributes [[NUM_VGPR_64]] = {{.*}}"amdgpu-num-vgpr"="64"

@AlexVlx AlexVlx requested a review from yxsamliu November 19, 2024 15:06
@AlexVlx AlexVlx added SPIR-V SPIR-V language support and removed clang Clang issues not falling into any other category backend:AMDGPU labels Nov 19, 2024
@yxsamliu (Collaborator) commented:

reqd_work_group_size corresponds to the OpenCL reqd_work_group_size attribute and sets an exact block size. amdgpu-flat-work-group-size sets a (min, max) range for the block size.

HIP launch bounds sets a block size range (1, bound). It cannot be represented by reqd_work_group_size.
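The semantic gap can be sketched as two launch-validity predicates (illustrative names, not LLVM API):

```cpp
#include <cassert>
#include <cstdint>

// reqd_work_group_size semantics: the launch block size must match exactly.
bool validUnderReqd(uint32_t Launch, uint32_t Reqd) {
  return Launch == Reqd;
}

// amdgpu-flat-work-group-size semantics: any size in [Min, Max] is valid.
bool validUnderFlatRange(uint32_t Launch, uint32_t Min, uint32_t Max) {
  return Min <= Launch && Launch <= Max;
}
```

Under these definitions, launching a launch_bounds(1024) kernel with 256 threads is valid under the range semantics but would be rejected under the exact-match semantics.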

@llvmbot llvmbot added clang Clang issues not falling into any other category backend:AMDGPU labels Nov 19, 2024
@AlexVlx (Contributor, Author) commented Nov 19, 2024

> reqd_work_group_size is for OpenCL reqd_work_group_size attribute and it sets exact block size. amdgpu-flat-work-group-size sets a (min, max) range for block size.
>
> HIP launch bounds sets a block size range (1, bound). It cannot be represented by reqd_work_group_size.

This is not quite correct. CUDA defines __launch_bounds__ as carrying only the maximum, not a range. I implemented it in HIP, and a range of [1, bound] is equivalent to just taking the maximum; the lower bound is spurious, and I only put it in place because I probably misread / misinterpreted the attribute's syntax. TL;DR: HIP __launch_bounds__ should match CUDA __launch_bounds__, and those only take extrema, not ranges, AFAICS, so this is fine.

@yxsamliu (Collaborator) commented:

>> reqd_work_group_size is for OpenCL reqd_work_group_size attribute and it sets exact block size. amdgpu-flat-work-group-size sets a (min, max) range for block size.
>> HIP launch bounds sets a block size range (1, bound). It cannot be represented by reqd_work_group_size.
>
> This is not quite correct. CUDA defines __launch_bounds__ as only carrying the maximum, not a range. I implemented it in HIP, and a range of [1, bound] is equivalent to just taking the maximum, the lower bound is spurious / I only put it in place because I probably misread the syntax for the attribute / misinterpreted it. TL;DR HIP __launch_bounds__ should match CUDA __launch_bounds__, and those only take extrema, not ranges AFAICS, so this is fine.

For example, if you use reqd_work_group_size to represent launch_bounds(1024) and then launch the kernel with block size 256, the launch will fail, since reqd_work_group_size means the kernel can only be launched with block size 1024. I don't think that matches what launch_bounds(1024) intends: it intends to allow the kernel to be launched with any block size between 1 and 1024.

@AlexVlx (Contributor, Author) commented Nov 19, 2024

>>> reqd_work_group_size is for OpenCL reqd_work_group_size attribute and it sets exact block size. amdgpu-flat-work-group-size sets a (min, max) range for block size.
>>> HIP launch bounds sets a block size range (1, bound). It cannot be represented by reqd_work_group_size.
>>
>> This is not quite correct. CUDA defines __launch_bounds__ as only carrying the maximum, not a range. I implemented it in HIP, and a range of [1, bound] is equivalent to just taking the maximum, the lower bound is spurious / I only put it in place because I probably misread the syntax for the attribute / misinterpreted it. TL;DR HIP __launch_bounds__ should match CUDA __launch_bounds__, and those only take extrema, not ranges AFAICS, so this is fine.
>
> For example, if you use reqd_work_group_size to represent launch_bounds(1024), then launch the kernel with block size 256, it will fail since reqd_work_group_size means the kernel can only be launched with block size 1024. I don't think that matches what launch_bounds(1024) intends to be. It intends to allow the kernel to be launched with block size between 1 and 1024.

Oh, apologies, I probably should have clarified that we're only going to see this in SPIR-V; as part of run-time finalisation/JIT it gets translated back into the original amdgpu attribute. An existing attribute is chosen mostly for the convenience of carrying the maximum and composing with existing tools; otherwise I'd have had to side-channel it. I agree that it is not a direct match, but sadly there is no direct match, as work_group_size_hint is too weak. If you're strongly opposed to this I can refactor it to pass the information via a side-channel.

@arsenm (Contributor) left a comment:

This is just wrong. These are not equivalent or translatable into each other.

The flat work-group size is 1-dimensional only and specifies a range of permissible values; reqd_work_group_size is an exact match for all 3 dimensions. The backend can already directly consume reqd_work_group_size.

@yxsamliu (Collaborator) commented:

>>>> reqd_work_group_size is for OpenCL reqd_work_group_size attribute and it sets exact block size. amdgpu-flat-work-group-size sets a (min, max) range for block size.
>>>> HIP launch bounds sets a block size range (1, bound). It cannot be represented by reqd_work_group_size.
>>>
>>> This is not quite correct. CUDA defines __launch_bounds__ as only carrying the maximum, not a range. I implemented it in HIP, and a range of [1, bound] is equivalent to just taking the maximum, the lower bound is spurious / I only put it in place because I probably misread the syntax for the attribute / misinterpreted it. TL;DR HIP __launch_bounds__ should match CUDA __launch_bounds__, and those only take extrema, not ranges AFAICS, so this is fine.
>>
>> For example, if you use reqd_work_group_size to represent launch_bounds(1024), then launch the kernel with block size 256, it will fail since reqd_work_group_size means the kernel can only be launched with block size 1024. I don't think that matches what launch_bounds(1024) intends to be. It intends to allow the kernel to be launched with block size between 1 and 1024.
>
> Oh, apologies, I probably should have clarified that we're only going to see this in SPIR-V, as part of run-time finalisation/JIT it gets translated back into the original amdgpu attribute; it's mostly for the convenience of carrying the maximum / composing with existing tools that an existing attribute is chosen, otherwise I'd have had to side-channel it. I agree that it is not a direct match, but sadly there is no direct match, as work_group_size_hint is too weak. If you're strongly opposed to this I can re-factor this to pass via side-channel.

I see. Basically we have redefined the semantics of reqd_work_group_size for HIP-generated SPIR-V. Do we have a way to differentiate OpenCL-generated and HIP-generated SPIR-V? They will need to be translated differently with respect to reqd_work_group_size.

@arsenm (Contributor) commented Nov 19, 2024

> I see. Basically we redefined the semantic of reqd_work_group_size for HIP-generated SPIRV. Do we have a way to differentiate OpenCL-generated and HIP-generated SPIRV? They will be translated differently about reqd_work_group_size

There shouldn't be one. The representation needs to be consistent and language / producer / consumer independent.

@AlexVlx AlexVlx closed this Nov 19, 2024
@AlexVlx AlexVlx reopened this Dec 18, 2024
llvm::ConstantAsMetadata::get(llvm::ConstantInt::get(Int32Ty, 1)),
llvm::ConstantAsMetadata::get(llvm::ConstantInt::get(Int32Ty, 1))};

F->setMetadata("max_work_group_size",
Contributor:

Why metadata? I know the OpenCL stuff uses metadata, but I think that's because it predates arbitrary string attributes. This is also "setTargetAttributes".

Contributor Author:

Because the Translator and other tools interact with / expect it as metadata.

Comment on lines +252 to +254
if (!M.getLangOpts().HIP ||
M.getTarget().getTriple().getVendor() != llvm::Triple::AMD)
return;
Contributor:

Remove the vendor check. The language check is also suspect; this is interpretation of a target attribute.

Contributor Author:

It is not that suspect in context: at the moment this is only for AMDGCN flavoured SPIR-V, which we only support in HIP, and the mapping from flat to dim X only makes sense there.

Comment on lines +258 to +260
auto F = dyn_cast<llvm::Function>(GV);
if (!F)
return;
Contributor:

Can this fail if the FunctionDecl below fails?

Comment on lines +272 to +274
// We encode the maximum flat WG size in the first component of the 3D
// max_work_group_size attribute, which will get reverse translated into the
// original AMDGPU attribute when targeting AMDGPU.
Contributor:

I'm still confused about why this is supposed to be "OK"

Contributor Author:

I think this is OK in HIP because the language (only) defines __launch_bounds__, which is 1D, and we implement it with the AMDGPU attribute. At the same time, the SPIR-V attribute cannot be produced via other defined means in HIP (there's no Clang __attribute__ for it, for example, so the user couldn't have written some N-dimensional max_work_group_size themselves), so its presence in AMDGCN flavoured SPIR-V fairly unambiguously originates from here.

In general we will eventually replace this with processing for all AMDGPU attributes, but that has some challenges, in that it'd be more infectious in the translator (or the BE, and any eventual SPIR-V consumers, if they were to materialise). Conversely, we cannot just drop the original attribute on the floor, as correctness depends on it. Hence the PR.
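A minimal sketch of the reverse translation being discussed, under the assumption that only the X component of the triple carries information (the helper name is invented; the real logic lives in the SPIR-V translator):

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <string>

// Illustrative only: rebuild the "amdgpu-flat-work-group-size" attribute
// value from the 3-D max_work_group_size triple. The original minimum was
// not preserved, so a conservative lower bound of 1 is used.
std::string recoverFlatWorkGroupSize(const std::array<uint32_t, 3> &WGS) {
  return "1," + std::to_string(WGS[0]);
}
```

The [1, max] range this produces matches the maximum-only semantics of __launch_bounds__ described above.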

@arsenm (Contributor) commented Dec 19, 2024:

That reminds me: the way launch_bounds is implemented is also problematic: #91468

Launch bounds should not be implemented in terms of the flat-work-group-size attribute in a header; clang should be interpreting it directly. So yes, there is a clang attribute for it. It's just that HIP has a hackier implementation of it than the CUDA attribute, which we just ignore.

Contributor Author:

We are talking past each other. I am saying that the SPIR-V attribute cannot be generated via Clang, i.e. you cannot write __attribute__((foo)) in your source and obtain max_work_group_size metadata at the moment. Furthermore, from the implementation of Clang's __launch_bounds__:

// An AST node is created for this attribute, but is not used by other parts
// of the compiler. However, this node needs to exist in the AST because
// non-LLVM backends may be relying on the attribute's presence.

So this is a glorified annotation; we'd still have to decide how to lower it into IR, which would likely end up atop the flat work-group size, unless we choose to add yet another attribute. We also use the flat work-group size implicitly to control / implement --gpu-max-threads-per-block, which is important for correctness and is in a fairly similar place to __launch_bounds__ (it's always 1D, doesn't have a minimum, etc.). It's also not handled by this patch, so I'll have to add it :)

That being said, the idea in #91468 is sound, but it will require a bit of work; I think we'd still have to choose a way to pass the info through SPIR-V (which is what this PR tries to do).
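The attribute-versus-default selection that the patch performs (see the GPUMaxThreadsPerBlock fallback in the diff) can be sketched as follows (illustrative name; the patch inlines this logic):

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Illustrative only: pick the kernel's maximum flat work-group size.
// Use the attribute's maximum when present, otherwise fall back to the
// value implied by --gpu-max-threads-per-block.
uint32_t selectMaxFlatSize(std::optional<uint32_t> AttrMax,
                           uint32_t GPUMaxThreadsPerBlock) {
  return AttrMax.value_or(GPUMaxThreadsPerBlock);
}
```

This mirrors why the default case in the test checks `!{i32 1024, i32 1, i32 1}` while the annotated kernel checks `!{i32 64, i32 1, i32 1}`.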

@yxsamliu (Collaborator) left a comment:

LGTM

@AlexVlx AlexVlx changed the title [clang][CodeGen][SPIRV] Translate amdgpu_flat_work_group_size into reqd_work_group_size. [clang][CodeGen][SPIRV] Translate amdgpu_flat_work_group_size into max_work_group_size. Jan 6, 2025
@AlexVlx AlexVlx merged commit 66acb26 into llvm:main Jan 7, 2025
8 checks passed
@AlexVlx AlexVlx deleted the handle_flat_work_group_size_amdgcnspirv branch January 7, 2025 10:44
searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Jan 8, 2025
…`max_work_group_size`. (llvm#116820)

HIPAMD relies on the `amdgpu_flat_work_group_size` attribute to
implement key functionality such as the `__launch_bounds__` `__global__`
function annotation. This attribute is not available / directly
translatable to SPIR-V, hence as it is AMDGCN flavoured SPIR-V suffers
from information loss.

This patch addresses that limitation by converting the unsupported
attribute into the `max_work_group_size` attribute which maps to
[`MaxWorkgroupSizeINTEL`](https://github.com/KhronosGroup/SPIRV-Registry/blob/main/extensions/INTEL/SPV_INTEL_kernel_attributes.asciidoc),
which is available in / handled by SPIR-V. When reverse translating from
SPIR-V to AMDGCN LLVMIR we invert the map and add the original AMDGPU
attribute.

Change-Id: I5d95cd17d7169a61dc26fb410a838263e4497374