[AMDGPU] Limit promoting allocas that have users with dynamic index above a threshold on number of elements #170327
Conversation
Pull request overview
This PR adds a threshold mechanism to prevent promoting allocas with dynamic indices when the number of vector elements exceeds a configurable limit. This addresses poor code generation and compile-time issues in the AMDGPU backend when extracting subvectors with dynamic indices from large vectors (e.g., <128 x i8> with <32 x i8> subvector users).
Key Changes:
- Introduced a new command-line option `DynIdxNumElmLimit` (default: 8) to control the maximum number of elements for alloca promotion with dynamic indices
- Added validation in GEP handling to check whether a dynamic index is used in a load, rejecting promotion when the element count exceeds the threshold
- Added test cases demonstrating the behavior with different vector sizes (v16i8, v8i8) and non-load GEP usage
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp | Implements the dynamic index element limit check in GEP validation logic |
| llvm/test/CodeGen/AMDGPU/promote-alloca-vector-gep.ll | Adds test cases verifying the threshold behavior for different vector sizes and GEP usage patterns |
@llvm/pr-subscribers-backend-amdgpu

Author: Kevin Choi (choikwa)

Changes

The AMDGPU backend has poor code generation (a scalarized copy) for extracting subvectors with a dynamic index, which can hurt compile time, register pressure, etc. For vectors with a large number of elements (e.g. <128 x i8> with a <32 x i8> subvector user), dynamic indexing blows up compile time in GreedyRA. Added a check on each GEP to see whether it is used in a load.

Full diff: https://github.com/llvm/llvm-project/pull/170327.diff

2 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp b/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
index bb95265a794a0..aba660ffb6e45 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
@@ -85,6 +85,11 @@ static cl::opt<unsigned>
"when sorting profitable allocas"),
cl::init(4));
+static cl::opt<unsigned> DynIdxNumElmLimit(
+    "dynamic-index-num-element-limit",
+    cl::desc("Maximum number of elements for promoting alloca with dynamic"
+             " index"),
+    cl::init(8));
+
// Shared implementation which can do both promotion to vector and to LDS.
class AMDGPUPromoteAllocaImpl {
private:
@@ -919,6 +924,23 @@ bool AMDGPUPromoteAllocaImpl::tryPromoteAllocaToVector(AllocaInst &Alloca) {
Value *Index = GEPToVectorIndex(GEP, &Alloca, VecEltTy, *DL, NewGEPInsts);
if (!Index)
return RejectUser(Inst, "cannot compute vector index for GEP");
+
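+      // Dynamically indexed GEPs that feed loads are scalarized into long
+      // extractelement/insertelement chains, so reject promotion when the
+      // accessed vector type is too wide.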
+      if (!isa<ConstantInt>(Index)) {
+        bool UsedInLoad = false;
+        for (auto *U : GEP->users()) {
+          if (isa<LoadInst>(U)) {
+            UsedInLoad = true;
+            break;
+          }
+        }
+        if (auto *UserVecTy =
+                dyn_cast<FixedVectorType>(GEP->getSourceElementType())) {
+          if (UsedInLoad && UserVecTy->getNumElements() > DynIdxNumElmLimit)
+            return RejectUser(Inst,
+                              "user has too many elements for dynamic index");
+        }
+      }
GEPVectorIdx[GEP] = Index;
UsersToRemove.push_back(Inst);
diff --git a/llvm/test/CodeGen/AMDGPU/promote-alloca-vector-gep.ll b/llvm/test/CodeGen/AMDGPU/promote-alloca-vector-gep.ll
index 76e1868b3c4b9..caab29b58c13f 100644
--- a/llvm/test/CodeGen/AMDGPU/promote-alloca-vector-gep.ll
+++ b/llvm/test/CodeGen/AMDGPU/promote-alloca-vector-gep.ll
@@ -3,6 +3,8 @@
; Check that invalid IR is not produced on a vector typed
; getelementptr with a scalar alloca pointer base.
+; Also check that a GEP with a dynamic index is rejected above a
+; threshold number of elements.
define amdgpu_kernel void @scalar_alloca_ptr_with_vector_gep_offset() {
; CHECK-LABEL: define amdgpu_kernel void @scalar_alloca_ptr_with_vector_gep_offset() {
@@ -250,6 +252,84 @@ bb2:
store i32 0, ptr addrspace(5) %extractelement
ret void
}
+
+define amdgpu_kernel void @GEP_dynamic_idx_v16i8(ptr addrspace(1) %out, i32 %idx) {
+; CHECK-LABEL: define amdgpu_kernel void @GEP_dynamic_idx_v16i8(
+; CHECK-SAME: ptr addrspace(1) [[OUT:%.*]], i32 [[IDX:%.*]]) {
+; CHECK-NEXT: [[ENTRY:.*:]]
+; CHECK-NEXT: [[ALLOCA:%.*]] = alloca [64 x i8], align 4, addrspace(5)
+; CHECK-NEXT: [[GEP:%.*]] = getelementptr inbounds <16 x i8>, ptr addrspace(5) [[ALLOCA]], i32 [[IDX]]
+; CHECK-NEXT: [[VEC:%.*]] = load <16 x i8>, ptr addrspace(5) [[GEP]], align 4
+; CHECK-NEXT: store <16 x i8> [[VEC]], ptr addrspace(1) [[OUT]], align 4
+; CHECK-NEXT: ret void
+;
+entry:
+ %alloca = alloca [64 x i8], align 4, addrspace(5)
+ %gep = getelementptr inbounds <16 x i8>, ptr addrspace(5) %alloca, i32 %idx
+ %vec = load <16 x i8>, ptr addrspace(5) %gep, align 4
+ store <16 x i8> %vec, ptr addrspace(1) %out, align 4
+ ret void
+}
+
+define amdgpu_kernel void @GEP_dynamic_idx_v8i8(ptr addrspace(1) %out, i32 %idx) {
+; CHECK-LABEL: define amdgpu_kernel void @GEP_dynamic_idx_v8i8(
+; CHECK-SAME: ptr addrspace(1) [[OUT:%.*]], i32 [[IDX:%.*]]) {
+; CHECK-NEXT: [[ENTRY:.*:]]
+; CHECK-NEXT: [[ALLOCA:%.*]] = freeze <64 x i8> poison
+; CHECK-NEXT: [[TMP0:%.*]] = mul i32 [[IDX]], 8
+; CHECK-NEXT: [[TMP1:%.*]] = extractelement <64 x i8> [[ALLOCA]], i32 [[TMP0]]
+; CHECK-NEXT: [[TMP2:%.*]] = insertelement <8 x i8> poison, i8 [[TMP1]], i64 0
+; CHECK-NEXT: [[TMP3:%.*]] = add i32 [[TMP0]], 1
+; CHECK-NEXT: [[TMP4:%.*]] = extractelement <64 x i8> [[ALLOCA]], i32 [[TMP3]]
+; CHECK-NEXT: [[TMP5:%.*]] = insertelement <8 x i8> [[TMP2]], i8 [[TMP4]], i64 1
+; CHECK-NEXT: [[TMP6:%.*]] = add i32 [[TMP0]], 2
+; CHECK-NEXT: [[TMP7:%.*]] = extractelement <64 x i8> [[ALLOCA]], i32 [[TMP6]]
+; CHECK-NEXT: [[TMP8:%.*]] = insertelement <8 x i8> [[TMP5]], i8 [[TMP7]], i64 2
+; CHECK-NEXT: [[TMP9:%.*]] = add i32 [[TMP0]], 3
+; CHECK-NEXT: [[TMP10:%.*]] = extractelement <64 x i8> [[ALLOCA]], i32 [[TMP9]]
+; CHECK-NEXT: [[TMP11:%.*]] = insertelement <8 x i8> [[TMP8]], i8 [[TMP10]], i64 3
+; CHECK-NEXT: [[TMP12:%.*]] = add i32 [[TMP0]], 4
+; CHECK-NEXT: [[TMP13:%.*]] = extractelement <64 x i8> [[ALLOCA]], i32 [[TMP12]]
+; CHECK-NEXT: [[TMP14:%.*]] = insertelement <8 x i8> [[TMP11]], i8 [[TMP13]], i64 4
+; CHECK-NEXT: [[TMP15:%.*]] = add i32 [[TMP0]], 5
+; CHECK-NEXT: [[TMP16:%.*]] = extractelement <64 x i8> [[ALLOCA]], i32 [[TMP15]]
+; CHECK-NEXT: [[TMP17:%.*]] = insertelement <8 x i8> [[TMP14]], i8 [[TMP16]], i64 5
+; CHECK-NEXT: [[TMP18:%.*]] = add i32 [[TMP0]], 6
+; CHECK-NEXT: [[TMP19:%.*]] = extractelement <64 x i8> [[ALLOCA]], i32 [[TMP18]]
+; CHECK-NEXT: [[TMP20:%.*]] = insertelement <8 x i8> [[TMP17]], i8 [[TMP19]], i64 6
+; CHECK-NEXT: [[TMP21:%.*]] = add i32 [[TMP0]], 7
+; CHECK-NEXT: [[TMP22:%.*]] = extractelement <64 x i8> [[ALLOCA]], i32 [[TMP21]]
+; CHECK-NEXT: [[TMP23:%.*]] = insertelement <8 x i8> [[TMP20]], i8 [[TMP22]], i64 7
+; CHECK-NEXT: store <8 x i8> [[TMP23]], ptr addrspace(1) [[OUT]], align 4
+; CHECK-NEXT: ret void
+;
+entry:
+ %alloca = alloca [64 x i8], align 4, addrspace(5)
+ %gep = getelementptr inbounds <8 x i8>, ptr addrspace(5) %alloca, i32 %idx
+ %vec = load <8 x i8>, ptr addrspace(5) %gep, align 4
+ store <8 x i8> %vec, ptr addrspace(1) %out, align 4
+ ret void
+}
+
+define amdgpu_kernel void @GEP_dynamic_idx_noload(ptr addrspace(1) %out, i32 %idx) {
+; CHECK-LABEL: define amdgpu_kernel void @GEP_dynamic_idx_noload(
+; CHECK-SAME: ptr addrspace(1) [[OUT:%.*]], i32 [[IDX:%.*]]) {
+; CHECK-NEXT: [[ENTRY:.*:]]
+; CHECK-NEXT: [[ALLOCA:%.*]] = alloca [64 x i8], align 4, addrspace(5)
+; CHECK-NEXT: [[GEP:%.*]] = getelementptr inbounds <8 x i8>, ptr addrspace(5) [[ALLOCA]], i32 [[IDX]]
+; CHECK-NEXT: [[GEPINT:%.*]] = ptrtoint ptr addrspace(5) [[GEP]] to i64
+; CHECK-NEXT: store i64 [[GEPINT]], ptr addrspace(1) [[OUT]], align 4
+; CHECK-NEXT: ret void
+;
+entry:
+ %alloca = alloca [64 x i8], align 4, addrspace(5)
+ %gep = getelementptr inbounds <8 x i8>, ptr addrspace(5) %alloca, i32 %idx
+ %gepint = ptrtoint ptr addrspace(5) %gep to i64
+ store i64 %gepint, ptr addrspace(1) %out, align 4
+ ret void
+}
+
;.
; CHECK: [[META0]] = !{}
; CHECK: [[RNG1]] = !{i32 0, i32 1025}
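For local experimentation, a minimal sketch of a RUN line that overrides the new threshold (the pass name and lit conventions follow existing promote-alloca tests; the value 4 is illustrative and not part of this patch):

; RUN: opt -S -mtriple=amdgcn-amd-amdhsa -passes=amdgpu-promote-alloca \
; RUN:   -dynamic-index-num-element-limit=4 %s | FileCheck %s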
✅ With the latest revision this PR passed the C/C++ code formatter.
[AMDGPU] Limit promoting allocas that have users with dynamic index above a threshold on number of elements

The AMDGPU backend has poor code generation (a scalarized copy) for extracting subvectors with a dynamic index, which can hurt compile time, register pressure, etc. For vectors with a large number of elements (e.g. <128 x i8> with a <32 x i8> user), dynamic indexing blows up compile time in GreedyRA.

Added a check on each GEP to see whether it is used in a load.
Added a testcase covering different numbers of elements in the subvector user.

Co-authored-by: Copilot <[email protected]>
perlfu left a comment
Is the fundamental limit here actually the number of elements in the GEP type, or rather the width of the GEP type in 32-bit VGPRs?
I guess alignment (or rather misalignment) drives the complexity explosion?
Have you done further investigation into why this causes issues in GreedyRA? Note that by adding another limit, we are also making the pass less useful for alloca promotion. Do you have runtime performance and compile-time numbers with and without this change for your case?
Yes, we had an MLIR testcase (SWDEV-559837) that would blow up compile time when promote alloca tried to create a <128 x i8> with <16 x i8> users. After rejecting those cases, compile time dropped from ~2 min to 0.5 s in my sandbox. Investigation showed that a long chain of extract/insert elements with a dynamic index created 35x more LiveIntervals for GreedyRA to deal with, and it got bogged down in the interference check during the eviction phase; the @GEP_dynamic_idx_v8i8 test above shows exactly this kind of expansion for an 8-element subvector.
Edit: This was a regression from SWDEV-525817, but since that case only needed promote alloca to turn [16 x double] into <16 x double>, I don't expect hipBone to regress with this change.
It looks like the IR count in SDag scales linearly with the number of elements (roughly 4x per extract/insert after legalization, etc.). The problem seems especially bad in GreedyRA, whose interference check is O(n^2) or O(n log n), as seen in the compilation profile.
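Back-of-the-envelope, extrapolating from the figures above (my estimate, not taken from the profile itself): a single dynamically indexed <32 x i8> access scalarizes into 32 extract/insert steps, each expanding to roughly 4 instructions after legalization, so on the order of 32 × 4 ≈ 128 instructions and live ranges per access; with an O(n^2) interference check, even a handful of such accesses multiplies GreedyRA's work dramatically.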
…ll, test different limit values
The AMDGPU backend has poor code generation (a scalarized copy, though that is the best the compiler can do on CDNA for arbitrary vector IR) for extracting subvectors with a dynamic index, which can hurt compile time, register pressure, etc. For vectors with a large number of elements (e.g. <128 x i8> with a <32 x i8> subvector user), dynamic indexing blows up compile time in GreedyRA.
Added a check on each GEP to see whether it is used in a load.
Added a testcase covering different numbers of elements in the subvector user.