[AMDGPU][Attributor] Fix the merge of two range states of flat workgroup size #125889

shiltian · 2025-02-05T17:36:01Z

This PR updates the logic for merging two flat workgroup size pairs. The current
implementation, which uses clampStateAndIndicateChange to compute a union, is
not ideal. If we think from ensure proper resource allocation perspective, for
instance, if one pair specifies a minimum of 32 flat workgroup size, and another
specifies a minimum of 64, we should guarantee that 64 flat workgroup size can
be supported, as failing to do so could result in excessive resource allocation.
A similar principle applies to the upper bound. Thus, the PR uses the following
approach for merging two pairs,

lo_a,up_a and lo_b,up_b: max(lo_a, lo_b), max(up_a, up_b).

This ensures that resource allocation adheres to the stricter constraints from
both inputs.

This PR only updates the merge for AAAMDFlatWorkGroupSize. AAAMDWavesPerEU
will be rewritten in a follow-up.

…oup size This PR updates the logic for merging two flat workgroup size pairs. The current implementation, which uses `clampStateAndIndicateChange` to compute a union, is not ideal. If we think from ensure proper resource allocation perspective, for instance, if one pair specifies a minimum of 32 flat workgroup size, and another specifies a minimum of 64, we should guarantee that 64 flat workgroup size can be supported, as failing to do so could result in excessive resource allocation. A similar principle applies to the upper bound. Thus, the PR uses the following approach for merging two pairs, `lo_a,up_a and lo_b,up_b: max(lo_a, lo_b), max(up_a, up_b)`. This ensures that resource allocation adheres to the stricter constraints from both inputs. This PR only updates the merge for `AAAMDFlatWorkGroupSize`. `AAAMDWavesPerEU` will be rewritten in a follow-up.

shiltian · 2025-02-05T17:36:25Z

This stack of pull requests is managed by Graphite. Learn more about stacking.

llvmbot · 2025-02-05T17:39:05Z

@llvm/pr-subscribers-backend-amdgpu

Author: Shilei Tian (shiltian)

Changes

This PR updates the logic for merging two flat workgroup size pairs. The current
implementation, which uses clampStateAndIndicateChange to compute a union, is
not ideal. If we think from ensure proper resource allocation perspective, for
instance, if one pair specifies a minimum of 32 flat workgroup size, and another
specifies a minimum of 64, we should guarantee that 64 flat workgroup size can
be supported, as failing to do so could result in excessive resource allocation.
A similar principle applies to the upper bound. Thus, the PR uses the following
approach for merging two pairs,

lo_a,up_a and lo_b,up_b: max(lo_a, lo_b), max(up_a, up_b).

This ensures that resource allocation adheres to the stricter constraints from
both inputs.

This PR only updates the merge for AAAMDFlatWorkGroupSize. AAAMDWavesPerEU
will be rewritten in a follow-up.

Full diff: https://github.com/llvm/llvm-project/pull/125889.diff

2 Files Affected:

(modified) llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp (+29-6)
(modified) llvm/test/CodeGen/AMDGPU/propagate-flat-work-group-size.ll (+6-6)

diff --git a/llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp b/llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp
index 546db318c17d53..16be7d50127083 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp
@@ -799,7 +799,31 @@ AAAMDAttributes &AAAMDAttributes::createForPosition(const IRPosition &IRP,
   llvm_unreachable("AAAMDAttributes is only valid for function position");
 }
 
-/// Base class to derive different size ranges.
+/// Merge two range states and indicate changes. The range can apply to
+/// flat-workgroup-size and waves-per-eu attributes. The growth/merge of these
+/// two attributes is neither a union nor an intersection. Instead, it follows
+/// the rule below:
+///
+/// Lower = max(S.Lower, R.Lower)
+/// Upper = max(S.Upper, R.Upper)
+ChangeStatus mergeRangeStateAndIndicateChange(IntegerRangeState &S,
+                                              const IntegerRangeState &R) {
+  auto Assumed = S.getAssumed();
+  unsigned Min = R.getAssumed().getLower().getZExtValue();
+  unsigned Max = R.getAssumed().getUpper().getZExtValue();
+  Min = std::max(Min, static_cast<unsigned>(Assumed.getLower().getZExtValue()));
+  Max = std::max(Max, static_cast<unsigned>(Assumed.getUpper().getZExtValue()));
+  ConstantRange Range(APInt(32, Min), APInt(32, Max));
+  IntegerRangeState RangeState(Range);
+  S = RangeState;
+  return S == Assumed ? ChangeStatus::UNCHANGED : ChangeStatus::CHANGED;
+}
+
+/// Base class for deriving different size ranges. This size range is
+/// specifically intended for flat-workgroup-size and waves-per-eu attributes.
+/// Therefore, updating the attribute should use
+/// \p mergeRangeStateAndIndicateChange. Given this, the attribute may not be
+/// applicable to other potential size ranges in the future.
 struct AAAMDSizeRangeAttribute
     : public StateWrapper<IntegerRangeState, AbstractAttribute, uint32_t> {
   using Base = StateWrapper<IntegerRangeState, AbstractAttribute, uint32_t>;
@@ -826,8 +850,8 @@ struct AAAMDSizeRangeAttribute
       if (!CallerInfo || !CallerInfo->isValidState())
         return false;
 
-      Change |=
-          clampStateAndIndicateChange(this->getState(), CallerInfo->getState());
+      Change |= mergeRangeStateAndIndicateChange(this->getState(),
+                                                 CallerInfo->getState());
 
       return true;
     };
@@ -903,15 +927,14 @@ struct AAAMDFlatWorkGroupSize : public AAAMDSizeRangeAttribute {
       }
     }
 
-    // We don't want to directly clamp the state if it's the max range because
+    // We don't want to directly merge the state if it's the max range because
     // that is basically the worst state.
     if (Range == MaxRange)
       return;
 
     auto [Min, Max] = Range;
     ConstantRange CR(APInt(32, Min), APInt(32, Max + 1));
-    IntegerRangeState IRS(CR);
-    clampStateAndIndicateChange(this->getState(), IRS);
+    this->getState() = IntegerRangeState(CR);
 
     if (HasAttr || AMDGPU::isEntryFunctionCC(F->getCallingConv()))
       indicateOptimisticFixpoint();
diff --git a/llvm/test/CodeGen/AMDGPU/propagate-flat-work-group-size.ll b/llvm/test/CodeGen/AMDGPU/propagate-flat-work-group-size.ll
index 1afd31c6d45e7d..0a3121f9de1676 100644
--- a/llvm/test/CodeGen/AMDGPU/propagate-flat-work-group-size.ll
+++ b/llvm/test/CodeGen/AMDGPU/propagate-flat-work-group-size.ll
@@ -86,7 +86,7 @@ define internal void @flat_group_512_1024() #4 {
 
 define amdgpu_kernel void @kernel_128_512() #5 {
 ; CHECK-LABEL: define {{[^@]+}}@kernel_128_512
-; CHECK-SAME: () #[[ATTR2]] {
+; CHECK-SAME: () #[[ATTR6:[0-9]+]] {
 ; CHECK-NEXT:    call void @default_to_128_512()
 ; CHECK-NEXT:    call void @flat_group_64_64()
 ; CHECK-NEXT:    ret void
@@ -98,7 +98,7 @@ define amdgpu_kernel void @kernel_128_512() #5 {
 
 define amdgpu_kernel void @kernel_512_512() #6 {
 ; CHECK-LABEL: define {{[^@]+}}@kernel_512_512
-; CHECK-SAME: () #[[ATTR6:[0-9]+]] {
+; CHECK-SAME: () #[[ATTR2]] {
 ; CHECK-NEXT:    call void @default_to_128_512()
 ; CHECK-NEXT:    call void @flat_group_512_1024()
 ; CHECK-NEXT:    ret void
@@ -111,7 +111,7 @@ define amdgpu_kernel void @kernel_512_512() #6 {
 ; Called from kernels with 128,256 and 64,128 => 64,256
 define internal void @default_to_64_256() {
 ; CHECK-LABEL: define {{[^@]+}}@default_to_64_256
-; CHECK-SAME: () #[[ATTR7:[0-9]+]] {
+; CHECK-SAME: () #[[ATTR4]] {
 ; CHECK-NEXT:    ret void
 ;
   ret void
@@ -153,7 +153,7 @@ define internal void @merge_cycle_1() #3 {
 
 define amdgpu_kernel void @kernel_64_256() #7 {
 ; CHECK-LABEL: define {{[^@]+}}@kernel_64_256
-; CHECK-SAME: () #[[ATTR7]] {
+; CHECK-SAME: () #[[ATTR7:[0-9]+]] {
 ; CHECK-NEXT:    call void @merge_cycle_0()
 ; CHECK-NEXT:    call void @default_captured_address()
 ; CHECK-NEXT:    call void @externally_visible_default()
@@ -205,11 +205,11 @@ attributes #7 = { "amdgpu-flat-work-group-size"="64,256" }
 ;.
 ; CHECK: attributes #[[ATTR0]] = { "amdgpu-flat-work-group-size"="1,256" "amdgpu-no-agpr" "amdgpu-no-completion-action" "amdgpu-no-default-queue" "amdgpu-no-dispatch-id" "amdgpu-no-dispatch-ptr" "amdgpu-no-flat-scratch-init" "amdgpu-no-heap-ptr" "amdgpu-no-hostcall-ptr" "amdgpu-no-implicitarg-ptr" "amdgpu-no-lds-kernel-id" "amdgpu-no-multigrid-sync-arg" "amdgpu-no-queue-ptr" "amdgpu-no-workgroup-id-x" "amdgpu-no-workgroup-id-y" "amdgpu-no-workgroup-id-z" "amdgpu-no-workitem-id-x" "amdgpu-no-workitem-id-y" "amdgpu-no-workitem-id-z" "uniform-work-group-size"="false" }
 ; CHECK: attributes #[[ATTR1]] = { "amdgpu-flat-work-group-size"="64,128" "amdgpu-no-agpr" "amdgpu-no-completion-action" "amdgpu-no-default-queue" "amdgpu-no-dispatch-id" "amdgpu-no-dispatch-ptr" "amdgpu-no-flat-scratch-init" "amdgpu-no-heap-ptr" "amdgpu-no-hostcall-ptr" "amdgpu-no-implicitarg-ptr" "amdgpu-no-lds-kernel-id" "amdgpu-no-multigrid-sync-arg" "amdgpu-no-queue-ptr" "amdgpu-no-workgroup-id-x" "amdgpu-no-workgroup-id-y" "amdgpu-no-workgroup-id-z" "amdgpu-no-workitem-id-x" "amdgpu-no-workitem-id-y" "amdgpu-no-workitem-id-z" "uniform-work-group-size"="false" }
-; CHECK: attributes #[[ATTR2]] = { "amdgpu-flat-work-group-size"="128,512" "amdgpu-no-agpr" "amdgpu-no-completion-action" "amdgpu-no-default-queue" "amdgpu-no-dispatch-id" "amdgpu-no-dispatch-ptr" "amdgpu-no-flat-scratch-init" "amdgpu-no-heap-ptr" "amdgpu-no-hostcall-ptr" "amdgpu-no-implicitarg-ptr" "amdgpu-no-lds-kernel-id" "amdgpu-no-multigrid-sync-arg" "amdgpu-no-queue-ptr" "amdgpu-no-workgroup-id-x" "amdgpu-no-workgroup-id-y" "amdgpu-no-workgroup-id-z" "amdgpu-no-workitem-id-x" "amdgpu-no-workitem-id-y" "amdgpu-no-workitem-id-z" "amdgpu-waves-per-eu"="2,10" "uniform-work-group-size"="false" }
+; CHECK: attributes #[[ATTR2]] = { "amdgpu-flat-work-group-size"="512,512" "amdgpu-no-agpr" "amdgpu-no-completion-action" "amdgpu-no-default-queue" "amdgpu-no-dispatch-id" "amdgpu-no-dispatch-ptr" "amdgpu-no-flat-scratch-init" "amdgpu-no-heap-ptr" "amdgpu-no-hostcall-ptr" "amdgpu-no-implicitarg-ptr" "amdgpu-no-lds-kernel-id" "amdgpu-no-multigrid-sync-arg" "amdgpu-no-queue-ptr" "amdgpu-no-workgroup-id-x" "amdgpu-no-workgroup-id-y" "amdgpu-no-workgroup-id-z" "amdgpu-no-workitem-id-x" "amdgpu-no-workitem-id-y" "amdgpu-no-workitem-id-z" "amdgpu-waves-per-eu"="2,10" "uniform-work-group-size"="false" }
 ; CHECK: attributes #[[ATTR3]] = { "amdgpu-flat-work-group-size"="64,64" "amdgpu-no-agpr" "amdgpu-no-completion-action" "amdgpu-no-default-queue" "amdgpu-no-dispatch-id" "amdgpu-no-dispatch-ptr" "amdgpu-no-flat-scratch-init" "amdgpu-no-heap-ptr" "amdgpu-no-hostcall-ptr" "amdgpu-no-implicitarg-ptr" "amdgpu-no-lds-kernel-id" "amdgpu-no-multigrid-sync-arg" "amdgpu-no-queue-ptr" "amdgpu-no-workgroup-id-x" "amdgpu-no-workgroup-id-y" "amdgpu-no-workgroup-id-z" "amdgpu-no-workitem-id-x" "amdgpu-no-workitem-id-y" "amdgpu-no-workitem-id-z" "uniform-work-group-size"="false" }
 ; CHECK: attributes #[[ATTR4]] = { "amdgpu-flat-work-group-size"="128,256" "amdgpu-no-agpr" "amdgpu-no-completion-action" "amdgpu-no-default-queue" "amdgpu-no-dispatch-id" "amdgpu-no-dispatch-ptr" "amdgpu-no-flat-scratch-init" "amdgpu-no-heap-ptr" "amdgpu-no-hostcall-ptr" "amdgpu-no-implicitarg-ptr" "amdgpu-no-lds-kernel-id" "amdgpu-no-multigrid-sync-arg" "amdgpu-no-queue-ptr" "amdgpu-no-workgroup-id-x" "amdgpu-no-workgroup-id-y" "amdgpu-no-workgroup-id-z" "amdgpu-no-workitem-id-x" "amdgpu-no-workitem-id-y" "amdgpu-no-workitem-id-z" "uniform-work-group-size"="false" }
 ; CHECK: attributes #[[ATTR5]] = { "amdgpu-flat-work-group-size"="512,1024" "amdgpu-no-agpr" "amdgpu-no-completion-action" "amdgpu-no-default-queue" "amdgpu-no-dispatch-id" "amdgpu-no-dispatch-ptr" "amdgpu-no-flat-scratch-init" "amdgpu-no-heap-ptr" "amdgpu-no-hostcall-ptr" "amdgpu-no-implicitarg-ptr" "amdgpu-no-lds-kernel-id" "amdgpu-no-multigrid-sync-arg" "amdgpu-no-queue-ptr" "amdgpu-no-workgroup-id-x" "amdgpu-no-workgroup-id-y" "amdgpu-no-workgroup-id-z" "amdgpu-no-workitem-id-x" "amdgpu-no-workitem-id-y" "amdgpu-no-workitem-id-z" "amdgpu-waves-per-eu"="4,10" "uniform-work-group-size"="false" }
-; CHECK: attributes #[[ATTR6]] = { "amdgpu-flat-work-group-size"="512,512" "amdgpu-no-agpr" "amdgpu-no-completion-action" "amdgpu-no-default-queue" "amdgpu-no-dispatch-id" "amdgpu-no-dispatch-ptr" "amdgpu-no-flat-scratch-init" "amdgpu-no-heap-ptr" "amdgpu-no-hostcall-ptr" "amdgpu-no-implicitarg-ptr" "amdgpu-no-lds-kernel-id" "amdgpu-no-multigrid-sync-arg" "amdgpu-no-queue-ptr" "amdgpu-no-workgroup-id-x" "amdgpu-no-workgroup-id-y" "amdgpu-no-workgroup-id-z" "amdgpu-no-workitem-id-x" "amdgpu-no-workitem-id-y" "amdgpu-no-workitem-id-z" "amdgpu-waves-per-eu"="2,10" "uniform-work-group-size"="false" }
+; CHECK: attributes #[[ATTR6]] = { "amdgpu-flat-work-group-size"="128,512" "amdgpu-no-agpr" "amdgpu-no-completion-action" "amdgpu-no-default-queue" "amdgpu-no-dispatch-id" "amdgpu-no-dispatch-ptr" "amdgpu-no-flat-scratch-init" "amdgpu-no-heap-ptr" "amdgpu-no-hostcall-ptr" "amdgpu-no-implicitarg-ptr" "amdgpu-no-lds-kernel-id" "amdgpu-no-multigrid-sync-arg" "amdgpu-no-queue-ptr" "amdgpu-no-workgroup-id-x" "amdgpu-no-workgroup-id-y" "amdgpu-no-workgroup-id-z" "amdgpu-no-workitem-id-x" "amdgpu-no-workitem-id-y" "amdgpu-no-workitem-id-z" "amdgpu-waves-per-eu"="2,10" "uniform-work-group-size"="false" }
 ; CHECK: attributes #[[ATTR7]] = { "amdgpu-flat-work-group-size"="64,256" "amdgpu-no-agpr" "amdgpu-no-completion-action" "amdgpu-no-default-queue" "amdgpu-no-dispatch-id" "amdgpu-no-dispatch-ptr" "amdgpu-no-flat-scratch-init" "amdgpu-no-heap-ptr" "amdgpu-no-hostcall-ptr" "amdgpu-no-implicitarg-ptr" "amdgpu-no-lds-kernel-id" "amdgpu-no-multigrid-sync-arg" "amdgpu-no-queue-ptr" "amdgpu-no-workgroup-id-x" "amdgpu-no-workgroup-id-y" "amdgpu-no-workgroup-id-z" "amdgpu-no-workitem-id-x" "amdgpu-no-workitem-id-y" "amdgpu-no-workitem-id-z" "uniform-work-group-size"="false" }
 ; CHECK: attributes #[[ATTR8]] = { "amdgpu-no-agpr" "amdgpu-no-completion-action" "amdgpu-no-default-queue" "amdgpu-no-dispatch-id" "amdgpu-no-dispatch-ptr" "amdgpu-no-flat-scratch-init" "amdgpu-no-heap-ptr" "amdgpu-no-hostcall-ptr" "amdgpu-no-implicitarg-ptr" "amdgpu-no-lds-kernel-id" "amdgpu-no-multigrid-sync-arg" "amdgpu-no-queue-ptr" "amdgpu-no-workgroup-id-x" "amdgpu-no-workgroup-id-y" "amdgpu-no-workgroup-id-z" "amdgpu-no-workitem-id-x" "amdgpu-no-workitem-id-y" "amdgpu-no-workitem-id-z" "uniform-work-group-size"="false" }
 ;.

shiltian · 2025-02-05T17:39:54Z

llvm/test/CodeGen/AMDGPU/propagate-flat-work-group-size.ll

  ret void
 }

 ; Called from kernels with 128,256 and 64,128 => 64,256


I should also update the comments

arsenm · 2025-02-06T03:34:45Z

This does not sound correct. You can't just start executing functions with a higher minimum group size than the callers. This is not just about resourcing, but will break known bits optimizations on the work item IDs

shiltian · 2025-02-06T03:41:16Z

If we have a function foo that is called by two callers, bar and baz, where bar has (32,256) and baz has (128,512), what should foo have?

This is not just about resourcing, but will break known bits optimizations on the work item IDs

Can you explain more?

arsenm · 2025-02-06T03:53:06Z

If we have a function foo that is called by two callers, bar and baz, where bar has (32,256) and baz has (128,512), what should foo have?

32, 512. If you really wanted to increase the minimum range, you would have to modify the callers values too (which definitely isn't legal for the maximum, I'm less sure on the minimums)

llvmbot added the backend:AMDGPU label Feb 5, 2025

shiltian requested review from arsenm and jdoerfert February 5, 2025 17:36

shiltian mentioned this pull request Feb 5, 2025

[AMDGPU][Attributor] Rework update of AAAMDWavesPerEU #123995

Merged

shiltian commented Feb 5, 2025

View reviewed changes

shiltian closed this Feb 6, 2025

shiltian deleted the users/shiltian/new-range-for-flat-wgrp-size-and-waves-per-eu branch February 6, 2025 04:18

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[AMDGPU][Attributor] Fix the merge of two range states of flat workgroup size #125889

[AMDGPU][Attributor] Fix the merge of two range states of flat workgroup size #125889

Uh oh!

shiltian commented Feb 5, 2025

Uh oh!

shiltian commented Feb 5, 2025

Uh oh!

llvmbot commented Feb 5, 2025

Uh oh!

shiltian Feb 5, 2025

Uh oh!

arsenm commented Feb 6, 2025

Uh oh!

shiltian commented Feb 6, 2025 •

edited

Loading

Uh oh!

arsenm commented Feb 6, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

[AMDGPU][Attributor] Fix the merge of two range states of flat workgroup size #125889

[AMDGPU][Attributor] Fix the merge of two range states of flat workgroup size #125889

Uh oh!

Conversation

shiltian commented Feb 5, 2025

Uh oh!

shiltian commented Feb 5, 2025

Uh oh!

llvmbot commented Feb 5, 2025

Uh oh!

shiltian Feb 5, 2025

Choose a reason for hiding this comment

Uh oh!

arsenm commented Feb 6, 2025

Uh oh!

shiltian commented Feb 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

arsenm commented Feb 6, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

shiltian commented Feb 6, 2025 •

edited

Loading