[AMDGPU] Limit promoting allocas that have users with dynamic index above a threshold on number of elements #170327

choikwa · 2025-12-02T17:04:59Z

AMDGPU backend has poor code generation (scalarized copy, but best compiler can do on CDNA on arbitrary vector IR) for extracting subvectors with dynamic index that can impact compile-time, reg-pressure, etc. For vectors with large number of elements (i.e. <128 x i8> with <32 x i8> subvector user), dynamic indexing will blow up compile-time in GreedyRA.

Added check in GEP to see if it's used in a load.
Added testcase to test different number of elements in subvector user.

Copilot

Pull request overview

This PR adds a threshold mechanism to prevent promoting allocas with dynamic indices when the number of vector elements exceeds a configurable limit. This addresses poor code generation and compile-time issues in the AMDGPU backend when extracting subvectors with dynamic indices from large vectors (e.g., <128 x i8> with <32 x i8> subvector users).

Key Changes:

Introduced a new command-line option DynIdxNumElmLimit (default: 8) to control the maximum number of elements for alloca promotion with dynamic indices
Added validation in GEP handling to check if dynamic indices are used in loads and reject promotion when element count exceeds the threshold
Added test cases demonstrating the behavior with different vector sizes (v32i8, v8i8) and non-load GEP usage

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File	Description
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp	Implements the dynamic index element limit check in GEP validation logic
llvm/test/CodeGen/AMDGPU/promote-alloca-vector-gep.ll	Adds test cases verifying the threshold behavior for different vector sizes and GEP usage patterns

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp

llvmbot · 2025-12-02T17:05:31Z

@llvm/pr-subscribers-backend-amdgpu

Author: Kevin Choi (choikwa)

Changes

AMDGPU backend has poor code generation (scalarized copy) for extracting subvectors with dynamic index that can impact compile-time, reg-pressure, etc. For vectors with large number of elements (i.e. <128 x i8> with <32 x i8> subvector user), dynamic indexing will blow up compile-time in GreedyRA.

Added check in GEP to see if it's used in a load.
Added testcase to test different number of elements in subvector user.

Full diff: https://github.com/llvm/llvm-project/pull/170327.diff

2 Files Affected:

(modified) llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp (+22)
(modified) llvm/test/CodeGen/AMDGPU/promote-alloca-vector-gep.ll (+80)

diff --git a/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp b/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
index bb95265a794a0..aba660ffb6e45 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
@@ -85,6 +85,11 @@ static cl::opt<unsigned>
                             "when sorting profitable allocas"),
                    cl::init(4));
 
+static cl::opt<unsigned> DynIdxNumElmLimit("dynamic-index-num-element-limit",
+    cl::desc("Maximum number of elements for promoting alloca with dynamic"
+      " index"),
+    cl::init(8));
+
 // Shared implementation which can do both promotion to vector and to LDS.
 class AMDGPUPromoteAllocaImpl {
 private:
@@ -919,6 +924,23 @@ bool AMDGPUPromoteAllocaImpl::tryPromoteAllocaToVector(AllocaInst &Alloca) {
       Value *Index = GEPToVectorIndex(GEP, &Alloca, VecEltTy, *DL, NewGEPInsts);
       if (!Index)
         return RejectUser(Inst, "cannot compute vector index for GEP");
+      
+      if (!isa<ConstantInt>(Index)) {
+        bool UsedInLoad = false;
+        for (auto *U : GEP->users()) {
+          if(isa<LoadInst>(U)) {
+            UsedInLoad = true;
+            break;
+          }
+        }
+        if (auto *UserVecTy = dyn_cast<FixedVectorType>(
+                GEP->getSourceElementType())) {
+          if (UsedInLoad && UserVecTy->getNumElements() > DynIdxNumElmLimit) {
+            return RejectUser(Inst, 
+              "user has too many number of elements for dynamic index");
+          }
+        }
+      }
 
       GEPVectorIdx[GEP] = Index;
       UsersToRemove.push_back(Inst);
diff --git a/llvm/test/CodeGen/AMDGPU/promote-alloca-vector-gep.ll b/llvm/test/CodeGen/AMDGPU/promote-alloca-vector-gep.ll
index 76e1868b3c4b9..caab29b58c13f 100644
--- a/llvm/test/CodeGen/AMDGPU/promote-alloca-vector-gep.ll
+++ b/llvm/test/CodeGen/AMDGPU/promote-alloca-vector-gep.ll
@@ -3,6 +3,8 @@
 
 ; Check that invalid IR is not produced on a vector typed
 ; getelementptr with a scalar alloca pointer base.
+; Also check if GEP with dynamic index is rejected above
+; threshold # of elements.
 
 define amdgpu_kernel void @scalar_alloca_ptr_with_vector_gep_offset() {
 ; CHECK-LABEL: define amdgpu_kernel void @scalar_alloca_ptr_with_vector_gep_offset() {
@@ -250,6 +252,84 @@ bb2:
   store i32 0, ptr addrspace(5) %extractelement
   ret void
 }
+
+define amdgpu_kernel void @GEP_dynamic_idx_v32i8(ptr addrspace(1) %out, i32 %idx) {
+; CHECK-LABEL: define amdgpu_kernel void @GEP_dynamic_idx_v32i8(
+; CHECK-SAME: ptr addrspace(1) [[OUT:%.*]], i32 [[IDX:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[ALLOCA:%.*]] = alloca [64 x i8], align 4, addrspace(5)
+; CHECK-NEXT:    [[GEP:%.*]] = getelementptr inbounds <16 x i8>, ptr addrspace(5) [[ALLOCA]], i32 [[IDX]]
+; CHECK-NEXT:    [[VEC:%.*]] = load <16 x i8>, ptr addrspace(5) [[GEP]], align 4
+; CHECK-NEXT:    store <16 x i8> [[VEC]], ptr addrspace(1) [[OUT]], align 4
+; CHECK-NEXT:    ret void
+;
+entry:
+  %alloca = alloca [64 x i8], align 4, addrspace(5)
+  %gep = getelementptr inbounds <16 x i8>, ptr addrspace(5) %alloca, i32 %idx
+  %vec = load <16 x i8>, ptr addrspace(5) %gep, align 4
+  store <16 x i8> %vec, ptr addrspace(1) %out, align 4
+  ret void
+}
+
+define amdgpu_kernel void @GEP_dynamic_idx_v8i8(ptr addrspace(1) %out, i32 %idx) {
+; CHECK-LABEL: define amdgpu_kernel void @GEP_dynamic_idx_v8i8(
+; CHECK-SAME: ptr addrspace(1) [[OUT:%.*]], i32 [[IDX:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[ALLOCA:%.*]] = freeze <64 x i8> poison
+; CHECK-NEXT:    [[TMP0:%.*]] = mul i32 [[IDX]], 8
+; CHECK-NEXT:    [[TMP1:%.*]] = extractelement <64 x i8> [[ALLOCA]], i32 [[TMP0]]
+; CHECK-NEXT:    [[TMP2:%.*]] = insertelement <8 x i8> poison, i8 [[TMP1]], i64 0
+; CHECK-NEXT:    [[TMP3:%.*]] = add i32 [[TMP0]], 1
+; CHECK-NEXT:    [[TMP4:%.*]] = extractelement <64 x i8> [[ALLOCA]], i32 [[TMP3]]
+; CHECK-NEXT:    [[TMP5:%.*]] = insertelement <8 x i8> [[TMP2]], i8 [[TMP4]], i64 1
+; CHECK-NEXT:    [[TMP6:%.*]] = add i32 [[TMP0]], 2
+; CHECK-NEXT:    [[TMP7:%.*]] = extractelement <64 x i8> [[ALLOCA]], i32 [[TMP6]]
+; CHECK-NEXT:    [[TMP8:%.*]] = insertelement <8 x i8> [[TMP5]], i8 [[TMP7]], i64 2
+; CHECK-NEXT:    [[TMP9:%.*]] = add i32 [[TMP0]], 3
+; CHECK-NEXT:    [[TMP10:%.*]] = extractelement <64 x i8> [[ALLOCA]], i32 [[TMP9]]
+; CHECK-NEXT:    [[TMP11:%.*]] = insertelement <8 x i8> [[TMP8]], i8 [[TMP10]], i64 3
+; CHECK-NEXT:    [[TMP12:%.*]] = add i32 [[TMP0]], 4
+; CHECK-NEXT:    [[TMP13:%.*]] = extractelement <64 x i8> [[ALLOCA]], i32 [[TMP12]]
+; CHECK-NEXT:    [[TMP14:%.*]] = insertelement <8 x i8> [[TMP11]], i8 [[TMP13]], i64 4
+; CHECK-NEXT:    [[TMP15:%.*]] = add i32 [[TMP0]], 5
+; CHECK-NEXT:    [[TMP16:%.*]] = extractelement <64 x i8> [[ALLOCA]], i32 [[TMP15]]
+; CHECK-NEXT:    [[TMP17:%.*]] = insertelement <8 x i8> [[TMP14]], i8 [[TMP16]], i64 5
+; CHECK-NEXT:    [[TMP18:%.*]] = add i32 [[TMP0]], 6
+; CHECK-NEXT:    [[TMP19:%.*]] = extractelement <64 x i8> [[ALLOCA]], i32 [[TMP18]]
+; CHECK-NEXT:    [[TMP20:%.*]] = insertelement <8 x i8> [[TMP17]], i8 [[TMP19]], i64 6
+; CHECK-NEXT:    [[TMP21:%.*]] = add i32 [[TMP0]], 7
+; CHECK-NEXT:    [[TMP22:%.*]] = extractelement <64 x i8> [[ALLOCA]], i32 [[TMP21]]
+; CHECK-NEXT:    [[TMP23:%.*]] = insertelement <8 x i8> [[TMP20]], i8 [[TMP22]], i64 7
+; CHECK-NEXT:    store <8 x i8> [[TMP23]], ptr addrspace(1) [[OUT]], align 4
+; CHECK-NEXT:    ret void
+;
+entry:
+  %alloca = alloca [64 x i8], align 4, addrspace(5)
+  %gep = getelementptr inbounds <8 x i8>, ptr addrspace(5) %alloca, i32 %idx
+  %vec = load <8 x i8>, ptr addrspace(5) %gep, align 4
+  store <8 x i8> %vec, ptr addrspace(1) %out, align 4
+  ret void
+}
+
+define amdgpu_kernel void @GEP_dynamic_idx_noload(ptr addrspace(1) %out, i32 %idx) {
+; CHECK-LABEL: define amdgpu_kernel void @GEP_dynamic_idx_noload(
+; CHECK-SAME: ptr addrspace(1) [[OUT:%.*]], i32 [[IDX:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[ALLOCA:%.*]] = alloca [64 x i8], align 4, addrspace(5)
+; CHECK-NEXT:    [[GEP:%.*]] = getelementptr inbounds <8 x i8>, ptr addrspace(5) [[ALLOCA]], i32 [[IDX]]
+; CHECK-NEXT:    [[GEPINT:%.*]] = ptrtoint ptr addrspace(5) [[GEP]] to i64
+; CHECK-NEXT:    store i64 [[GEPINT]], ptr addrspace(1) [[OUT]], align 4
+; CHECK-NEXT:    ret void
+;
+entry:
+  %alloca = alloca [64 x i8], align 4, addrspace(5)
+  %gep = getelementptr inbounds <8 x i8>, ptr addrspace(5) %alloca, i32 %idx
+  %gepint = ptrtoint ptr addrspace(5) %gep to i64
+  store i64 %gepint, ptr addrspace(1) %out, align 4
+  ret void
+}
+
+
 ;.
 ; CHECK: [[META0]] = !{}
 ; CHECK: [[RNG1]] = !{i32 0, i32 1025}

github-actions · 2025-12-02T17:06:40Z

✅ With the latest revision this PR passed the C/C++ code formatter.

…bove a threshold on number of elements AMDGPU backend has poor code generation (scalarized copy) for extracting subvectors with dynamic index that can impact compile-time, reg-pressure, etc. For vectors with large number of elements (i.e. <128 x i8> with <32 x i8> user), dynamic indexing will blow up compile-time in GreedyRA. Added check in GEP to see if it's used in a load. Added testcase to test different number of elements in subvector user.

Co-authored-by: Copilot <[email protected]>

perlfu

Is the fundamental limit here actually the number of elements in the GEP type, or rather the width of the GEP type in 32b VGPRs?
I guess alignment (or rather misalignment) drives the complexity explosion?

llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp

ruiling · 2025-12-03T13:33:18Z

dynamic indexing will blow up compile-time in GreedyRA.

Have you done some further investigation why it causes issues in GreedyRA?

Note that by adding another limit, we are also making the pass less useful in alloca promotion. Do you have the runtime performance and compile-time numbers with and without this change for your case? 8 sounds too small, maybe 16? (since the case you cared has 32 elements).

choikwa · 2025-12-03T16:50:30Z

dynamic indexing will blow up compile-time in GreedyRA.

Have you done some further investigation why it causes issues in GreedyRA?

Note that by adding another limit, we are also making the pass less useful in alloca promotion. Do you have the runtime performance and compile-time numbers with and without this change for your case? 8 sounds too small, maybe 16? (since the case you cared has 32 elements).

Yes, we had an MLIR testcase (SWDEV-559837) that would blow up in compile-time when promote alloca tried to create <128 x i8> with <16 x i8> users. After rejecting those cases, compile time dropped from ~2min to 0.5s in my sandbox. Investigation has shown that a long chain of extract/insert elements with dynamic index would end up creating 35x more LiveIntervals for GreedyRA to deal with, and ends up being bogged down in interference check in the eviction phase.
I've discussed with colleagues and the hope is that this fix is very surgical to avoid dropping runtime perf while targetting compile-time. Internally, we are tracking runtime perf and thought that this change was too small to warrant a custom request.

Edit: This was a regression from SWDEV-525817, but seeing how in that case the promote alloca needed to turn [16 x double] to <16 x double>, I don't suspect hipBone to regress with this change.

choikwa · 2025-12-03T19:13:35Z

Is the fundamental limit here actually the number of elements in the GEP type, or rather the width of the GEP type in 32b VGPRs? I guess alignment (or rather misalignment) drives the complexity explosion?

It looks like the IR count in SDag scales linearly with number of elements (roughly 4x after legalization etc per extract/insert). Problem seems especially bad in GreedyRA with O(n^2) or O(nlogn) interference check as seen in the compilation profile.

llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp

…ll, test different limit values

llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp

choikwa requested review from arsenm, Copilot and perlfu December 2, 2025 17:04

llvmbot added the backend:AMDGPU label Dec 2, 2025

Copilot AI reviewed Dec 2, 2025

View reviewed changes

llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp Outdated Show resolved Hide resolved

choikwa and others added 3 commits December 2, 2025 17:21

Update llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp

406a575

Co-authored-by: Copilot <[email protected]>

NFC, formatting

a311a65

choikwa force-pushed the lim-dyn-idx branch from 06ac1b1 to a311a65 Compare December 2, 2025 23:22

perlfu reviewed Dec 3, 2025

View reviewed changes

llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp Outdated Show resolved Hide resolved

llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp Outdated Show resolved Hide resolved

choikwa added 2 commits December 3, 2025 12:38

addressing feedback

80fb5a4

nfc, rename var

671de2f

choikwa added 2 commits December 3, 2025 13:14

nfc, formatting

3b72499

space

8e31b85

shiltian reviewed Dec 3, 2025

View reviewed changes

llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp Outdated Show resolved Hide resolved

choikwa added 3 commits December 3, 2025 13:54

addressing feedback, move tests to promote-alloca-vector-dynamic-idx.…

904e5e0

…ll, test different limit values

format

8700e91

remove comment

9d7b94b

ruiling reviewed Dec 4, 2025

View reviewed changes

llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp Outdated Show resolved Hide resolved

choikwa added 2 commits December 3, 2025 23:21

Look at User LoadInst's type

b7c57d4

format

c0fd36f

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[AMDGPU] Limit promoting allocas that have users with dynamic index above a threshold on number of elements #170327

[AMDGPU] Limit promoting allocas that have users with dynamic index above a threshold on number of elements #170327

choikwa commented Dec 2, 2025 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

llvmbot commented Dec 2, 2025

Uh oh!

github-actions bot commented Dec 2, 2025 •

edited

Loading

Uh oh!

perlfu left a comment

Uh oh!

Uh oh!

Uh oh!

ruiling commented Dec 3, 2025

Uh oh!

choikwa commented Dec 3, 2025 •

edited

Loading

Uh oh!

choikwa commented Dec 3, 2025

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

[AMDGPU] Limit promoting allocas that have users with dynamic index above a threshold on number of elements #170327

Are you sure you want to change the base?

[AMDGPU] Limit promoting allocas that have users with dynamic index above a threshold on number of elements #170327

Conversation

choikwa commented Dec 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

llvmbot commented Dec 2, 2025

Uh oh!

github-actions bot commented Dec 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

perlfu left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

ruiling commented Dec 3, 2025

Uh oh!

choikwa commented Dec 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

choikwa commented Dec 3, 2025

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

choikwa commented Dec 2, 2025 •

edited

Loading

github-actions bot commented Dec 2, 2025 •

edited

Loading

choikwa commented Dec 3, 2025 •

edited

Loading