Skip to content

Conversation

@shiltian
Copy link
Contributor

@shiltian shiltian commented Feb 5, 2025

This PR updates the logic for merging two flat workgroup size pairs. The current
implementation, which uses clampStateAndIndicateChange to compute a union, is
not ideal. If we think from ensure proper resource allocation perspective, for
instance, if one pair specifies a minimum of 32 flat workgroup size, and another
specifies a minimum of 64, we should guarantee that 64 flat workgroup size can
be supported, as failing to do so could result in excessive resource allocation.
A similar principle applies to the upper bound. Thus, the PR uses the following
approach for merging two pairs,

lo_a,up_a and lo_b,up_b: max(lo_a, lo_b), max(up_a, up_b).

This ensures that resource allocation adheres to the stricter constraints from
both inputs.

This PR only updates the merge for AAAMDFlatWorkGroupSize. AAAMDWavesPerEU
will be rewritten in a follow-up.

…oup size

This PR updates the logic for merging two flat workgroup size pairs. The current
implementation, which uses `clampStateAndIndicateChange` to compute a union, is
not ideal. If we think from ensure proper resource allocation perspective, for
instance, if one pair specifies a minimum of 32 flat workgroup size, and another
specifies a minimum of 64, we should guarantee that 64 flat workgroup size can
be supported, as failing to do so could result in excessive resource allocation.
A similar principle applies to the upper bound. Thus, the PR uses the following
approach for merging two pairs,

`lo_a,up_a and lo_b,up_b: max(lo_a, lo_b), max(up_a, up_b)`.

This ensures that resource allocation adheres to the stricter constraints from
both inputs.

This PR only updates the merge for `AAAMDFlatWorkGroupSize`. `AAAMDWavesPerEU`
will be rewritten in a follow-up.
Copy link
Contributor Author

shiltian commented Feb 5, 2025

This stack of pull requests is managed by Graphite. Learn more about stacking.

@llvmbot
Copy link
Member

llvmbot commented Feb 5, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Shilei Tian (shiltian)

Changes

This PR updates the logic for merging two flat workgroup size pairs. The current
implementation, which uses clampStateAndIndicateChange to compute a union, is
not ideal. If we think from ensure proper resource allocation perspective, for
instance, if one pair specifies a minimum of 32 flat workgroup size, and another
specifies a minimum of 64, we should guarantee that 64 flat workgroup size can
be supported, as failing to do so could result in excessive resource allocation.
A similar principle applies to the upper bound. Thus, the PR uses the following
approach for merging two pairs,

lo_a,up_a and lo_b,up_b: max(lo_a, lo_b), max(up_a, up_b).

This ensures that resource allocation adheres to the stricter constraints from
both inputs.

This PR only updates the merge for AAAMDFlatWorkGroupSize. AAAMDWavesPerEU
will be rewritten in a follow-up.


Full diff: https://github.com/llvm/llvm-project/pull/125889.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp (+29-6)
  • (modified) llvm/test/CodeGen/AMDGPU/propagate-flat-work-group-size.ll (+6-6)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp b/llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp
index 546db318c17d53..16be7d50127083 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp
@@ -799,7 +799,31 @@ AAAMDAttributes &AAAMDAttributes::createForPosition(const IRPosition &IRP,
   llvm_unreachable("AAAMDAttributes is only valid for function position");
 }
 
-/// Base class to derive different size ranges.
+/// Merge two range states and indicate changes. The range can apply to
+/// flat-workgroup-size and waves-per-eu attributes. The growth/merge of these
+/// two attributes is neither a union nor an intersection. Instead, it follows
+/// the rule below:
+///
+/// Lower = max(S.Lower, R.Lower)
+/// Upper = max(S.Upper, R.Upper)
+ChangeStatus mergeRangeStateAndIndicateChange(IntegerRangeState &S,
+                                              const IntegerRangeState &R) {
+  auto Assumed = S.getAssumed();
+  unsigned Min = R.getAssumed().getLower().getZExtValue();
+  unsigned Max = R.getAssumed().getUpper().getZExtValue();
+  Min = std::max(Min, static_cast<unsigned>(Assumed.getLower().getZExtValue()));
+  Max = std::max(Max, static_cast<unsigned>(Assumed.getUpper().getZExtValue()));
+  ConstantRange Range(APInt(32, Min), APInt(32, Max));
+  IntegerRangeState RangeState(Range);
+  S = RangeState;
+  return S == Assumed ? ChangeStatus::UNCHANGED : ChangeStatus::CHANGED;
+}
+
+/// Base class for deriving different size ranges. This size range is
+/// specifically intended for flat-workgroup-size and waves-per-eu attributes.
+/// Therefore, updating the attribute should use
+/// \p mergeRangeStateAndIndicateChange. Given this, the attribute may not be
+/// applicable to other potential size ranges in the future.
 struct AAAMDSizeRangeAttribute
     : public StateWrapper<IntegerRangeState, AbstractAttribute, uint32_t> {
   using Base = StateWrapper<IntegerRangeState, AbstractAttribute, uint32_t>;
@@ -826,8 +850,8 @@ struct AAAMDSizeRangeAttribute
       if (!CallerInfo || !CallerInfo->isValidState())
         return false;
 
-      Change |=
-          clampStateAndIndicateChange(this->getState(), CallerInfo->getState());
+      Change |= mergeRangeStateAndIndicateChange(this->getState(),
+                                                 CallerInfo->getState());
 
       return true;
     };
@@ -903,15 +927,14 @@ struct AAAMDFlatWorkGroupSize : public AAAMDSizeRangeAttribute {
       }
     }
 
-    // We don't want to directly clamp the state if it's the max range because
+    // We don't want to directly merge the state if it's the max range because
     // that is basically the worst state.
     if (Range == MaxRange)
       return;
 
     auto [Min, Max] = Range;
     ConstantRange CR(APInt(32, Min), APInt(32, Max + 1));
-    IntegerRangeState IRS(CR);
-    clampStateAndIndicateChange(this->getState(), IRS);
+    this->getState() = IntegerRangeState(CR);
 
     if (HasAttr || AMDGPU::isEntryFunctionCC(F->getCallingConv()))
       indicateOptimisticFixpoint();
diff --git a/llvm/test/CodeGen/AMDGPU/propagate-flat-work-group-size.ll b/llvm/test/CodeGen/AMDGPU/propagate-flat-work-group-size.ll
index 1afd31c6d45e7d..0a3121f9de1676 100644
--- a/llvm/test/CodeGen/AMDGPU/propagate-flat-work-group-size.ll
+++ b/llvm/test/CodeGen/AMDGPU/propagate-flat-work-group-size.ll
@@ -86,7 +86,7 @@ define internal void @flat_group_512_1024() #4 {
 
 define amdgpu_kernel void @kernel_128_512() #5 {
 ; CHECK-LABEL: define {{[^@]+}}@kernel_128_512
-; CHECK-SAME: () #[[ATTR2]] {
+; CHECK-SAME: () #[[ATTR6:[0-9]+]] {
 ; CHECK-NEXT:    call void @default_to_128_512()
 ; CHECK-NEXT:    call void @flat_group_64_64()
 ; CHECK-NEXT:    ret void
@@ -98,7 +98,7 @@ define amdgpu_kernel void @kernel_128_512() #5 {
 
 define amdgpu_kernel void @kernel_512_512() #6 {
 ; CHECK-LABEL: define {{[^@]+}}@kernel_512_512
-; CHECK-SAME: () #[[ATTR6:[0-9]+]] {
+; CHECK-SAME: () #[[ATTR2]] {
 ; CHECK-NEXT:    call void @default_to_128_512()
 ; CHECK-NEXT:    call void @flat_group_512_1024()
 ; CHECK-NEXT:    ret void
@@ -111,7 +111,7 @@ define amdgpu_kernel void @kernel_512_512() #6 {
 ; Called from kernels with 128,256 and 64,128 => 64,256
 define internal void @default_to_64_256() {
 ; CHECK-LABEL: define {{[^@]+}}@default_to_64_256
-; CHECK-SAME: () #[[ATTR7:[0-9]+]] {
+; CHECK-SAME: () #[[ATTR4]] {
 ; CHECK-NEXT:    ret void
 ;
   ret void
@@ -153,7 +153,7 @@ define internal void @merge_cycle_1() #3 {
 
 define amdgpu_kernel void @kernel_64_256() #7 {
 ; CHECK-LABEL: define {{[^@]+}}@kernel_64_256
-; CHECK-SAME: () #[[ATTR7]] {
+; CHECK-SAME: () #[[ATTR7:[0-9]+]] {
 ; CHECK-NEXT:    call void @merge_cycle_0()
 ; CHECK-NEXT:    call void @default_captured_address()
 ; CHECK-NEXT:    call void @externally_visible_default()
@@ -205,11 +205,11 @@ attributes #7 = { "amdgpu-flat-work-group-size"="64,256" }
 ;.
 ; CHECK: attributes #[[ATTR0]] = { "amdgpu-flat-work-group-size"="1,256" "amdgpu-no-agpr" "amdgpu-no-completion-action" "amdgpu-no-default-queue" "amdgpu-no-dispatch-id" "amdgpu-no-dispatch-ptr" "amdgpu-no-flat-scratch-init" "amdgpu-no-heap-ptr" "amdgpu-no-hostcall-ptr" "amdgpu-no-implicitarg-ptr" "amdgpu-no-lds-kernel-id" "amdgpu-no-multigrid-sync-arg" "amdgpu-no-queue-ptr" "amdgpu-no-workgroup-id-x" "amdgpu-no-workgroup-id-y" "amdgpu-no-workgroup-id-z" "amdgpu-no-workitem-id-x" "amdgpu-no-workitem-id-y" "amdgpu-no-workitem-id-z" "uniform-work-group-size"="false" }
 ; CHECK: attributes #[[ATTR1]] = { "amdgpu-flat-work-group-size"="64,128" "amdgpu-no-agpr" "amdgpu-no-completion-action" "amdgpu-no-default-queue" "amdgpu-no-dispatch-id" "amdgpu-no-dispatch-ptr" "amdgpu-no-flat-scratch-init" "amdgpu-no-heap-ptr" "amdgpu-no-hostcall-ptr" "amdgpu-no-implicitarg-ptr" "amdgpu-no-lds-kernel-id" "amdgpu-no-multigrid-sync-arg" "amdgpu-no-queue-ptr" "amdgpu-no-workgroup-id-x" "amdgpu-no-workgroup-id-y" "amdgpu-no-workgroup-id-z" "amdgpu-no-workitem-id-x" "amdgpu-no-workitem-id-y" "amdgpu-no-workitem-id-z" "uniform-work-group-size"="false" }
-; CHECK: attributes #[[ATTR2]] = { "amdgpu-flat-work-group-size"="128,512" "amdgpu-no-agpr" "amdgpu-no-completion-action" "amdgpu-no-default-queue" "amdgpu-no-dispatch-id" "amdgpu-no-dispatch-ptr" "amdgpu-no-flat-scratch-init" "amdgpu-no-heap-ptr" "amdgpu-no-hostcall-ptr" "amdgpu-no-implicitarg-ptr" "amdgpu-no-lds-kernel-id" "amdgpu-no-multigrid-sync-arg" "amdgpu-no-queue-ptr" "amdgpu-no-workgroup-id-x" "amdgpu-no-workgroup-id-y" "amdgpu-no-workgroup-id-z" "amdgpu-no-workitem-id-x" "amdgpu-no-workitem-id-y" "amdgpu-no-workitem-id-z" "amdgpu-waves-per-eu"="2,10" "uniform-work-group-size"="false" }
+; CHECK: attributes #[[ATTR2]] = { "amdgpu-flat-work-group-size"="512,512" "amdgpu-no-agpr" "amdgpu-no-completion-action" "amdgpu-no-default-queue" "amdgpu-no-dispatch-id" "amdgpu-no-dispatch-ptr" "amdgpu-no-flat-scratch-init" "amdgpu-no-heap-ptr" "amdgpu-no-hostcall-ptr" "amdgpu-no-implicitarg-ptr" "amdgpu-no-lds-kernel-id" "amdgpu-no-multigrid-sync-arg" "amdgpu-no-queue-ptr" "amdgpu-no-workgroup-id-x" "amdgpu-no-workgroup-id-y" "amdgpu-no-workgroup-id-z" "amdgpu-no-workitem-id-x" "amdgpu-no-workitem-id-y" "amdgpu-no-workitem-id-z" "amdgpu-waves-per-eu"="2,10" "uniform-work-group-size"="false" }
 ; CHECK: attributes #[[ATTR3]] = { "amdgpu-flat-work-group-size"="64,64" "amdgpu-no-agpr" "amdgpu-no-completion-action" "amdgpu-no-default-queue" "amdgpu-no-dispatch-id" "amdgpu-no-dispatch-ptr" "amdgpu-no-flat-scratch-init" "amdgpu-no-heap-ptr" "amdgpu-no-hostcall-ptr" "amdgpu-no-implicitarg-ptr" "amdgpu-no-lds-kernel-id" "amdgpu-no-multigrid-sync-arg" "amdgpu-no-queue-ptr" "amdgpu-no-workgroup-id-x" "amdgpu-no-workgroup-id-y" "amdgpu-no-workgroup-id-z" "amdgpu-no-workitem-id-x" "amdgpu-no-workitem-id-y" "amdgpu-no-workitem-id-z" "uniform-work-group-size"="false" }
 ; CHECK: attributes #[[ATTR4]] = { "amdgpu-flat-work-group-size"="128,256" "amdgpu-no-agpr" "amdgpu-no-completion-action" "amdgpu-no-default-queue" "amdgpu-no-dispatch-id" "amdgpu-no-dispatch-ptr" "amdgpu-no-flat-scratch-init" "amdgpu-no-heap-ptr" "amdgpu-no-hostcall-ptr" "amdgpu-no-implicitarg-ptr" "amdgpu-no-lds-kernel-id" "amdgpu-no-multigrid-sync-arg" "amdgpu-no-queue-ptr" "amdgpu-no-workgroup-id-x" "amdgpu-no-workgroup-id-y" "amdgpu-no-workgroup-id-z" "amdgpu-no-workitem-id-x" "amdgpu-no-workitem-id-y" "amdgpu-no-workitem-id-z" "uniform-work-group-size"="false" }
 ; CHECK: attributes #[[ATTR5]] = { "amdgpu-flat-work-group-size"="512,1024" "amdgpu-no-agpr" "amdgpu-no-completion-action" "amdgpu-no-default-queue" "amdgpu-no-dispatch-id" "amdgpu-no-dispatch-ptr" "amdgpu-no-flat-scratch-init" "amdgpu-no-heap-ptr" "amdgpu-no-hostcall-ptr" "amdgpu-no-implicitarg-ptr" "amdgpu-no-lds-kernel-id" "amdgpu-no-multigrid-sync-arg" "amdgpu-no-queue-ptr" "amdgpu-no-workgroup-id-x" "amdgpu-no-workgroup-id-y" "amdgpu-no-workgroup-id-z" "amdgpu-no-workitem-id-x" "amdgpu-no-workitem-id-y" "amdgpu-no-workitem-id-z" "amdgpu-waves-per-eu"="4,10" "uniform-work-group-size"="false" }
-; CHECK: attributes #[[ATTR6]] = { "amdgpu-flat-work-group-size"="512,512" "amdgpu-no-agpr" "amdgpu-no-completion-action" "amdgpu-no-default-queue" "amdgpu-no-dispatch-id" "amdgpu-no-dispatch-ptr" "amdgpu-no-flat-scratch-init" "amdgpu-no-heap-ptr" "amdgpu-no-hostcall-ptr" "amdgpu-no-implicitarg-ptr" "amdgpu-no-lds-kernel-id" "amdgpu-no-multigrid-sync-arg" "amdgpu-no-queue-ptr" "amdgpu-no-workgroup-id-x" "amdgpu-no-workgroup-id-y" "amdgpu-no-workgroup-id-z" "amdgpu-no-workitem-id-x" "amdgpu-no-workitem-id-y" "amdgpu-no-workitem-id-z" "amdgpu-waves-per-eu"="2,10" "uniform-work-group-size"="false" }
+; CHECK: attributes #[[ATTR6]] = { "amdgpu-flat-work-group-size"="128,512" "amdgpu-no-agpr" "amdgpu-no-completion-action" "amdgpu-no-default-queue" "amdgpu-no-dispatch-id" "amdgpu-no-dispatch-ptr" "amdgpu-no-flat-scratch-init" "amdgpu-no-heap-ptr" "amdgpu-no-hostcall-ptr" "amdgpu-no-implicitarg-ptr" "amdgpu-no-lds-kernel-id" "amdgpu-no-multigrid-sync-arg" "amdgpu-no-queue-ptr" "amdgpu-no-workgroup-id-x" "amdgpu-no-workgroup-id-y" "amdgpu-no-workgroup-id-z" "amdgpu-no-workitem-id-x" "amdgpu-no-workitem-id-y" "amdgpu-no-workitem-id-z" "amdgpu-waves-per-eu"="2,10" "uniform-work-group-size"="false" }
 ; CHECK: attributes #[[ATTR7]] = { "amdgpu-flat-work-group-size"="64,256" "amdgpu-no-agpr" "amdgpu-no-completion-action" "amdgpu-no-default-queue" "amdgpu-no-dispatch-id" "amdgpu-no-dispatch-ptr" "amdgpu-no-flat-scratch-init" "amdgpu-no-heap-ptr" "amdgpu-no-hostcall-ptr" "amdgpu-no-implicitarg-ptr" "amdgpu-no-lds-kernel-id" "amdgpu-no-multigrid-sync-arg" "amdgpu-no-queue-ptr" "amdgpu-no-workgroup-id-x" "amdgpu-no-workgroup-id-y" "amdgpu-no-workgroup-id-z" "amdgpu-no-workitem-id-x" "amdgpu-no-workitem-id-y" "amdgpu-no-workitem-id-z" "uniform-work-group-size"="false" }
 ; CHECK: attributes #[[ATTR8]] = { "amdgpu-no-agpr" "amdgpu-no-completion-action" "amdgpu-no-default-queue" "amdgpu-no-dispatch-id" "amdgpu-no-dispatch-ptr" "amdgpu-no-flat-scratch-init" "amdgpu-no-heap-ptr" "amdgpu-no-hostcall-ptr" "amdgpu-no-implicitarg-ptr" "amdgpu-no-lds-kernel-id" "amdgpu-no-multigrid-sync-arg" "amdgpu-no-queue-ptr" "amdgpu-no-workgroup-id-x" "amdgpu-no-workgroup-id-y" "amdgpu-no-workgroup-id-z" "amdgpu-no-workitem-id-x" "amdgpu-no-workitem-id-y" "amdgpu-no-workitem-id-z" "uniform-work-group-size"="false" }
 ;.

ret void
}

; Called from kernels with 128,256 and 64,128 => 64,256
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should also update the comments

@arsenm
Copy link
Contributor

arsenm commented Feb 6, 2025

This does not sound correct. You can't just start executing functions with a higher minimum group size than the callers. This is not just about resourcing, but will break known bits optimizations on the work item IDs

@shiltian
Copy link
Contributor Author

shiltian commented Feb 6, 2025

If we have a function foo that is called by two callers, bar and baz, where bar has (32,256) and baz has (128,512), what should foo have?

This is not just about resourcing, but will break known bits optimizations on the work item IDs

Can you explain more?

@arsenm
Copy link
Contributor

arsenm commented Feb 6, 2025

If we have a function foo that is called by two callers, bar and baz, where bar has (32,256) and baz has (128,512), what should foo have?

32, 512. If you really wanted to increase the minimum range, you would have to modify the callers values too (which definitely isn't legal for the maximum, I'm less sure on the minimums)

@shiltian shiltian closed this Feb 6, 2025
@shiltian shiltian deleted the users/shiltian/new-range-for-flat-wgrp-size-and-waves-per-eu branch February 6, 2025 04:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants