
Conversation

@choikwa
Contributor

@choikwa choikwa commented Dec 2, 2025

The AMDGPU backend has poor code generation (a scalarized copy, though this is the best the compiler can do on CDNA for arbitrary vector IR) when extracting subvectors with a dynamic index, which can hurt compile time, register pressure, etc. For vectors with a large number of elements (e.g. <128 x i8> with a <32 x i8> subvector user), dynamic indexing blows up compile time in GreedyRA.

Added a check in the GEP handling to see whether the GEP is used in a load.
Added test cases covering different numbers of elements in the subvector user.

Contributor

Copilot AI left a comment


Pull request overview

This PR adds a threshold mechanism to prevent promoting allocas with dynamic indices when the number of vector elements exceeds a configurable limit. This addresses poor code generation and compile-time issues in the AMDGPU backend when extracting subvectors with dynamic indices from large vectors (e.g., <128 x i8> with <32 x i8> subvector users).

Key Changes:

  • Introduced a new command-line option DynIdxNumElmLimit (default: 8) to control the maximum number of elements for alloca promotion with dynamic indices
  • Added validation in GEP handling to check if dynamic indices are used in loads and reject promotion when element count exceeds the threshold
  • Added test cases demonstrating the behavior with different vector sizes (v32i8, v8i8) and non-load GEP usage

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File | Description
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp | Implements the dynamic-index element limit check in the GEP validation logic
llvm/test/CodeGen/AMDGPU/promote-alloca-vector-gep.ll | Adds test cases verifying the threshold behavior for different vector sizes and GEP usage patterns


@llvmbot
Member

llvmbot commented Dec 2, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Kevin Choi (choikwa)

Changes

The AMDGPU backend has poor code generation (a scalarized copy) when extracting subvectors with a dynamic index, which can hurt compile time, register pressure, etc. For vectors with a large number of elements (e.g. <128 x i8> with a <32 x i8> subvector user), dynamic indexing blows up compile time in GreedyRA.

Added a check in the GEP handling to see whether the GEP is used in a load.
Added test cases covering different numbers of elements in the subvector user.


Full diff: https://github.com/llvm/llvm-project/pull/170327.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp (+22)
  • (modified) llvm/test/CodeGen/AMDGPU/promote-alloca-vector-gep.ll (+80)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp b/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
index bb95265a794a0..aba660ffb6e45 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
@@ -85,6 +85,11 @@ static cl::opt<unsigned>
                             "when sorting profitable allocas"),
                    cl::init(4));
 
+static cl::opt<unsigned> DynIdxNumElmLimit("dynamic-index-num-element-limit",
+    cl::desc("Maximum number of elements for promoting alloca with dynamic"
+      " index"),
+    cl::init(8));
+
 // Shared implementation which can do both promotion to vector and to LDS.
 class AMDGPUPromoteAllocaImpl {
 private:
@@ -919,6 +924,23 @@ bool AMDGPUPromoteAllocaImpl::tryPromoteAllocaToVector(AllocaInst &Alloca) {
       Value *Index = GEPToVectorIndex(GEP, &Alloca, VecEltTy, *DL, NewGEPInsts);
       if (!Index)
         return RejectUser(Inst, "cannot compute vector index for GEP");
+
+      if (!isa<ConstantInt>(Index)) {
+        bool UsedInLoad = false;
+        for (auto *U : GEP->users()) {
+          if (isa<LoadInst>(U)) {
+            UsedInLoad = true;
+            break;
+          }
+        }
+        if (auto *UserVecTy =
+                dyn_cast<FixedVectorType>(GEP->getSourceElementType())) {
+          if (UsedInLoad && UserVecTy->getNumElements() > DynIdxNumElmLimit) {
+            return RejectUser(
+                Inst, "user has too many elements for dynamic index");
+          }
+        }
+      }
 
       GEPVectorIdx[GEP] = Index;
       UsersToRemove.push_back(Inst);
diff --git a/llvm/test/CodeGen/AMDGPU/promote-alloca-vector-gep.ll b/llvm/test/CodeGen/AMDGPU/promote-alloca-vector-gep.ll
index 76e1868b3c4b9..caab29b58c13f 100644
--- a/llvm/test/CodeGen/AMDGPU/promote-alloca-vector-gep.ll
+++ b/llvm/test/CodeGen/AMDGPU/promote-alloca-vector-gep.ll
@@ -3,6 +3,8 @@
 
 ; Check that invalid IR is not produced on a vector typed
 ; getelementptr with a scalar alloca pointer base.
+; Also check that a GEP with a dynamic index is rejected when the
+; number of elements is above the threshold.
 
 define amdgpu_kernel void @scalar_alloca_ptr_with_vector_gep_offset() {
 ; CHECK-LABEL: define amdgpu_kernel void @scalar_alloca_ptr_with_vector_gep_offset() {
@@ -250,6 +252,84 @@ bb2:
   store i32 0, ptr addrspace(5) %extractelement
   ret void
 }
+
+define amdgpu_kernel void @GEP_dynamic_idx_v32i8(ptr addrspace(1) %out, i32 %idx) {
+; CHECK-LABEL: define amdgpu_kernel void @GEP_dynamic_idx_v32i8(
+; CHECK-SAME: ptr addrspace(1) [[OUT:%.*]], i32 [[IDX:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[ALLOCA:%.*]] = alloca [64 x i8], align 4, addrspace(5)
+; CHECK-NEXT:    [[GEP:%.*]] = getelementptr inbounds <16 x i8>, ptr addrspace(5) [[ALLOCA]], i32 [[IDX]]
+; CHECK-NEXT:    [[VEC:%.*]] = load <16 x i8>, ptr addrspace(5) [[GEP]], align 4
+; CHECK-NEXT:    store <16 x i8> [[VEC]], ptr addrspace(1) [[OUT]], align 4
+; CHECK-NEXT:    ret void
+;
+entry:
+  %alloca = alloca [64 x i8], align 4, addrspace(5)
+  %gep = getelementptr inbounds <16 x i8>, ptr addrspace(5) %alloca, i32 %idx
+  %vec = load <16 x i8>, ptr addrspace(5) %gep, align 4
+  store <16 x i8> %vec, ptr addrspace(1) %out, align 4
+  ret void
+}
+
+define amdgpu_kernel void @GEP_dynamic_idx_v8i8(ptr addrspace(1) %out, i32 %idx) {
+; CHECK-LABEL: define amdgpu_kernel void @GEP_dynamic_idx_v8i8(
+; CHECK-SAME: ptr addrspace(1) [[OUT:%.*]], i32 [[IDX:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[ALLOCA:%.*]] = freeze <64 x i8> poison
+; CHECK-NEXT:    [[TMP0:%.*]] = mul i32 [[IDX]], 8
+; CHECK-NEXT:    [[TMP1:%.*]] = extractelement <64 x i8> [[ALLOCA]], i32 [[TMP0]]
+; CHECK-NEXT:    [[TMP2:%.*]] = insertelement <8 x i8> poison, i8 [[TMP1]], i64 0
+; CHECK-NEXT:    [[TMP3:%.*]] = add i32 [[TMP0]], 1
+; CHECK-NEXT:    [[TMP4:%.*]] = extractelement <64 x i8> [[ALLOCA]], i32 [[TMP3]]
+; CHECK-NEXT:    [[TMP5:%.*]] = insertelement <8 x i8> [[TMP2]], i8 [[TMP4]], i64 1
+; CHECK-NEXT:    [[TMP6:%.*]] = add i32 [[TMP0]], 2
+; CHECK-NEXT:    [[TMP7:%.*]] = extractelement <64 x i8> [[ALLOCA]], i32 [[TMP6]]
+; CHECK-NEXT:    [[TMP8:%.*]] = insertelement <8 x i8> [[TMP5]], i8 [[TMP7]], i64 2
+; CHECK-NEXT:    [[TMP9:%.*]] = add i32 [[TMP0]], 3
+; CHECK-NEXT:    [[TMP10:%.*]] = extractelement <64 x i8> [[ALLOCA]], i32 [[TMP9]]
+; CHECK-NEXT:    [[TMP11:%.*]] = insertelement <8 x i8> [[TMP8]], i8 [[TMP10]], i64 3
+; CHECK-NEXT:    [[TMP12:%.*]] = add i32 [[TMP0]], 4
+; CHECK-NEXT:    [[TMP13:%.*]] = extractelement <64 x i8> [[ALLOCA]], i32 [[TMP12]]
+; CHECK-NEXT:    [[TMP14:%.*]] = insertelement <8 x i8> [[TMP11]], i8 [[TMP13]], i64 4
+; CHECK-NEXT:    [[TMP15:%.*]] = add i32 [[TMP0]], 5
+; CHECK-NEXT:    [[TMP16:%.*]] = extractelement <64 x i8> [[ALLOCA]], i32 [[TMP15]]
+; CHECK-NEXT:    [[TMP17:%.*]] = insertelement <8 x i8> [[TMP14]], i8 [[TMP16]], i64 5
+; CHECK-NEXT:    [[TMP18:%.*]] = add i32 [[TMP0]], 6
+; CHECK-NEXT:    [[TMP19:%.*]] = extractelement <64 x i8> [[ALLOCA]], i32 [[TMP18]]
+; CHECK-NEXT:    [[TMP20:%.*]] = insertelement <8 x i8> [[TMP17]], i8 [[TMP19]], i64 6
+; CHECK-NEXT:    [[TMP21:%.*]] = add i32 [[TMP0]], 7
+; CHECK-NEXT:    [[TMP22:%.*]] = extractelement <64 x i8> [[ALLOCA]], i32 [[TMP21]]
+; CHECK-NEXT:    [[TMP23:%.*]] = insertelement <8 x i8> [[TMP20]], i8 [[TMP22]], i64 7
+; CHECK-NEXT:    store <8 x i8> [[TMP23]], ptr addrspace(1) [[OUT]], align 4
+; CHECK-NEXT:    ret void
+;
+entry:
+  %alloca = alloca [64 x i8], align 4, addrspace(5)
+  %gep = getelementptr inbounds <8 x i8>, ptr addrspace(5) %alloca, i32 %idx
+  %vec = load <8 x i8>, ptr addrspace(5) %gep, align 4
+  store <8 x i8> %vec, ptr addrspace(1) %out, align 4
+  ret void
+}
+
+define amdgpu_kernel void @GEP_dynamic_idx_noload(ptr addrspace(1) %out, i32 %idx) {
+; CHECK-LABEL: define amdgpu_kernel void @GEP_dynamic_idx_noload(
+; CHECK-SAME: ptr addrspace(1) [[OUT:%.*]], i32 [[IDX:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[ALLOCA:%.*]] = alloca [64 x i8], align 4, addrspace(5)
+; CHECK-NEXT:    [[GEP:%.*]] = getelementptr inbounds <8 x i8>, ptr addrspace(5) [[ALLOCA]], i32 [[IDX]]
+; CHECK-NEXT:    [[GEPINT:%.*]] = ptrtoint ptr addrspace(5) [[GEP]] to i64
+; CHECK-NEXT:    store i64 [[GEPINT]], ptr addrspace(1) [[OUT]], align 4
+; CHECK-NEXT:    ret void
+;
+entry:
+  %alloca = alloca [64 x i8], align 4, addrspace(5)
+  %gep = getelementptr inbounds <8 x i8>, ptr addrspace(5) %alloca, i32 %idx
+  %gepint = ptrtoint ptr addrspace(5) %gep to i64
+  store i64 %gepint, ptr addrspace(1) %out, align 4
+  ret void
+}
+
+
 ;.
 ; CHECK: [[META0]] = !{}
 ; CHECK: [[RNG1]] = !{i32 0, i32 1025}

@github-actions

github-actions bot commented Dec 2, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

choikwa and others added 3 commits December 2, 2025 17:21
…bove a threshold on number of elements

The AMDGPU backend has poor code generation (a scalarized copy) when extracting subvectors with a dynamic index, which can hurt compile time, register pressure, etc.
For vectors with a large number of elements (e.g. <128 x i8> with a <32 x i8> user), dynamic indexing blows up compile time in GreedyRA.

Added a check in the GEP handling to see whether the GEP is used in a load.
Added test cases covering different numbers of elements in the subvector user.
Contributor

@perlfu perlfu left a comment


Is the fundamental limit here actually the number of elements in the GEP type, or rather the width of the GEP type in 32-bit VGPRs?
I guess alignment (or rather misalignment) is what drives the complexity explosion?

@ruiling
Contributor

ruiling commented Dec 3, 2025

dynamic indexing will blow up compile-time in GreedyRA.

Have you done some further investigation why it causes issues in GreedyRA?

Note that by adding another limit, we are also making the pass less useful for alloca promotion. Do you have runtime performance and compile-time numbers with and without this change for your case? 8 sounds too small; maybe 16 (since the case you care about has 32 elements)?

@choikwa
Contributor Author

choikwa commented Dec 3, 2025

dynamic indexing will blow up compile-time in GreedyRA.

Have you done some further investigation why it causes issues in GreedyRA?

Note that by adding another limit, we are also making the pass less useful for alloca promotion. Do you have runtime performance and compile-time numbers with and without this change for your case? 8 sounds too small; maybe 16 (since the case you care about has 32 elements)?

Yes, we had an MLIR testcase (SWDEV-559837) that would blow up in compile time when promote alloca tried to create a <128 x i8> with <16 x i8> users. After rejecting those cases, compile time dropped from ~2min to 0.5s in my sandbox. Investigation showed that a long chain of extract/insert elements with a dynamic index ended up creating 35x more LiveIntervals for GreedyRA to deal with, which then got bogged down in the interference check during the eviction phase.
I've discussed this with colleagues, and the hope is that this fix is surgical enough to avoid hurting runtime performance while targeting compile time. Internally we are tracking runtime performance, and we thought this change was too small to warrant a custom request.

Edit: This was a regression from SWDEV-525817, but seeing how in that case promote alloca needed to turn [16 x double] into <16 x double>, I don't expect hipBone to regress with this change.

@choikwa
Contributor Author

choikwa commented Dec 3, 2025

Is the fundamental limit here actually the number of elements in the GEP type, or rather the width of the GEP type in 32b VGPRs? I guess alignment (or rather misalignment) drives the complexity explosion?

It looks like the IR count in SelectionDAG scales linearly with the number of elements (roughly 4x after legalization etc. per extract/insert). The problem seems especially bad in GreedyRA, with an O(n^2) or O(n log n) interference check, as seen in the compilation profile.

