[AMDGPU] AMDGPUPromoteAlloca: increase default max-regs to 32 #155076
Increase promote-alloca-to-vector-max-regs to 32 from 16. This restores default promotion of 16 x double which was disabled by llvm#127973.
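As a rough check of why the old default blocked this case, assume the limit is compared against the alloca's size counted in 32-bit registers, which is how the option's description words it:

$$
\text{regs}(16 \times \texttt{double}) = \frac{16 \cdot 64\ \text{bits}}{32\ \text{bits}} = 32, \qquad 16 < 32 \le 32
$$

Under that assumption a 16 x double array exceeds the previous limit of 16 but just fits the new limit of 32. The same counting is consistent with the test bumps in this patch: `[2 x <16 x i32>]`, `[16 x <2 x float>]` and `<32 x i32>` are each 32 registers, and the stack tests were enlarged to 48, 34 and 33 registers respectively, which keeps them above the new default. The size limit is only one condition; other checks in the pass may still prevent promotion.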
@llvm/pr-subscribers-backend-amdgpu
Author: Carl Ritson (perlfu)
Changes
Increase promote-alloca-to-vector-max-regs to 32 from 16. Fixes SWDEV-525817.
Patch is 27.70 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/155076.diff
7 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp b/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
index f226c7f381aa2..d988a89a506b9 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
@@ -70,7 +70,7 @@ static cl::opt<unsigned> PromoteAllocaToVectorMaxRegs(
"amdgpu-promote-alloca-to-vector-max-regs",
cl::desc(
"Maximum vector size (in 32b registers) to use when promoting alloca"),
- cl::init(16));
+ cl::init(32));
// Use up to 1/4 of available register budget for vectorization.
// FIXME: Increase the limit for whole function budgets? Perhaps x2?
diff --git a/llvm/test/CodeGen/AMDGPU/amdgpu.private-memory.ll b/llvm/test/CodeGen/AMDGPU/amdgpu.private-memory.ll
index f4b90b4293a46..4859e291b0613 100644
--- a/llvm/test/CodeGen/AMDGPU/amdgpu.private-memory.ll
+++ b/llvm/test/CodeGen/AMDGPU/amdgpu.private-memory.ll
@@ -441,8 +441,8 @@ entry:
; SI: buffer_load_dword
define amdgpu_kernel void @v16i32_stack(ptr addrspace(1) %out, i32 %a) {
- %alloca = alloca [2 x <16 x i32>], addrspace(5)
- %tmp0 = getelementptr [2 x <16 x i32>], ptr addrspace(5) %alloca, i32 0, i32 %a
+ %alloca = alloca [3 x <16 x i32>], addrspace(5)
+ %tmp0 = getelementptr [3 x <16 x i32>], ptr addrspace(5) %alloca, i32 0, i32 %a
%tmp5 = load <16 x i32>, ptr addrspace(5) %tmp0
store <16 x i32> %tmp5, ptr addrspace(1) %out
ret void
@@ -485,8 +485,8 @@ define amdgpu_kernel void @v16i32_stack(ptr addrspace(1) %out, i32 %a) {
; SI: buffer_load_dword
define amdgpu_kernel void @v16float_stack(ptr addrspace(1) %out, i32 %a) {
- %alloca = alloca [2 x <16 x float>], addrspace(5)
- %tmp0 = getelementptr [2 x <16 x float>], ptr addrspace(5) %alloca, i32 0, i32 %a
+ %alloca = alloca [3 x <16 x float>], addrspace(5)
+ %tmp0 = getelementptr [3 x <16 x float>], ptr addrspace(5) %alloca, i32 0, i32 %a
%tmp5 = load <16 x float>, ptr addrspace(5) %tmp0
store <16 x float> %tmp5, ptr addrspace(1) %out
ret void
@@ -501,8 +501,8 @@ define amdgpu_kernel void @v16float_stack(ptr addrspace(1) %out, i32 %a) {
; SI: buffer_load_dword
define amdgpu_kernel void @v2float_stack(ptr addrspace(1) %out, i32 %a) {
- %alloca = alloca [16 x <2 x float>], addrspace(5)
- %tmp0 = getelementptr [16 x <2 x float>], ptr addrspace(5) %alloca, i32 0, i32 %a
+ %alloca = alloca [17 x <2 x float>], addrspace(5)
+ %tmp0 = getelementptr [17 x <2 x float>], ptr addrspace(5) %alloca, i32 0, i32 %a
%tmp5 = load <2 x float>, ptr addrspace(5) %tmp0
store <2 x float> %tmp5, ptr addrspace(1) %out
ret void
diff --git a/llvm/test/CodeGen/AMDGPU/dynamic-vgpr-reserve-stack-for-cwsr.ll b/llvm/test/CodeGen/AMDGPU/dynamic-vgpr-reserve-stack-for-cwsr.ll
index 3f499535400ef..322703df1fcd9 100644
--- a/llvm/test/CodeGen/AMDGPU/dynamic-vgpr-reserve-stack-for-cwsr.ll
+++ b/llvm/test/CodeGen/AMDGPU/dynamic-vgpr-reserve-stack-for-cwsr.ll
@@ -150,7 +150,7 @@ define amdgpu_cs void @with_spills() #0 {
ret void
}
-define amdgpu_cs void @realign_stack(<32 x i32> %x) #0 {
+define amdgpu_cs void @realign_stack(<33 x i32> %x) #0 {
; CHECK-LABEL: realign_stack:
; CHECK: ; %bb.0:
; CHECK-NEXT: s_getreg_b32 s33, hwreg(HW_REG_HW_ID2, 8, 2)
@@ -158,8 +158,9 @@ define amdgpu_cs void @realign_stack(<32 x i32> %x) #0 {
; CHECK-NEXT: s_cmp_lg_u32 0, s33
; CHECK-NEXT: s_mov_b32 s0, callee@abs32@lo
; CHECK-NEXT: s_cmovk_i32 s33, 0x200
-; CHECK-NEXT: s_movk_i32 s32, 0x100
-; CHECK-NEXT: s_clause 0x7
+; CHECK-NEXT: s_movk_i32 s32, 0x180
+; CHECK-NEXT: s_clause 0x8
+; CHECK-NEXT: scratch_store_b32 off, v32, s33 offset:128
; CHECK-NEXT: scratch_store_b128 off, v[28:31], s33 offset:112
; CHECK-NEXT: scratch_store_b128 off, v[24:27], s33 offset:96
; CHECK-NEXT: scratch_store_b128 off, v[20:23], s33 offset:80
@@ -169,12 +170,12 @@ define amdgpu_cs void @realign_stack(<32 x i32> %x) #0 {
; CHECK-NEXT: scratch_store_b128 off, v[4:7], s33 offset:16
; CHECK-NEXT: scratch_store_b128 off, v[0:3], s33
; CHECK-NEXT: v_mov_b32_e32 v0, 0x47
-; CHECK-NEXT: s_cmovk_i32 s32, 0x300
+; CHECK-NEXT: s_cmovk_i32 s32, 0x380
; CHECK-NEXT: s_swappc_b64 s[30:31], s[0:1]
; CHECK-NEXT: s_alloc_vgpr 0
; CHECK-NEXT: s_endpgm
- %v = alloca <32 x i32>, align 128, addrspace(5)
- store <32 x i32> %x, ptr addrspace(5) %v
+ %v = alloca <33 x i32>, align 128, addrspace(5)
+ store <33 x i32> %x, ptr addrspace(5) %v
call amdgpu_gfx void @callee(i32 71)
ret void
}
diff --git a/llvm/test/CodeGen/AMDGPU/machine-function-info-cwsr.ll b/llvm/test/CodeGen/AMDGPU/machine-function-info-cwsr.ll
index cd428be729ae2..94ae31ccfb4ae 100644
--- a/llvm/test/CodeGen/AMDGPU/machine-function-info-cwsr.ll
+++ b/llvm/test/CodeGen/AMDGPU/machine-function-info-cwsr.ll
@@ -31,11 +31,11 @@ define amdgpu_cs void @with_calls() #0 {
ret void
}
-define amdgpu_cs void @realign_stack(<32 x i32> %x) #0 {
+define amdgpu_cs void @realign_stack(<33 x i32> %x) #0 {
; CHECK-LABEL: {{^}}name: realign_stack
; CHECK: scratchReservedForDynamicVGPRs: 512
- %v = alloca <32 x i32>, align 128, addrspace(5)
- store <32 x i32> %x, ptr addrspace(5) %v
+ %v = alloca <33 x i32>, align 128, addrspace(5)
+ store <33 x i32> %x, ptr addrspace(5) %v
call amdgpu_gfx void @callee(i32 71)
ret void
}
diff --git a/llvm/test/CodeGen/AMDGPU/promote-alloca-max-regs.ll b/llvm/test/CodeGen/AMDGPU/promote-alloca-max-regs.ll
index ad42748ab3d60..c1123d7b515be 100644
--- a/llvm/test/CodeGen/AMDGPU/promote-alloca-max-regs.ll
+++ b/llvm/test/CodeGen/AMDGPU/promote-alloca-max-regs.ll
@@ -1,9 +1,41 @@
; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
-; RUN: opt -S -mtriple=amdgcn-amd-amdhsa -passes=amdgpu-promote-alloca -disable-promote-alloca-to-lds=1 < %s | FileCheck --check-prefix=BASE --check-prefix=DEFAULT %s
+; RUN: opt -S -mtriple=amdgcn-amd-amdhsa -passes=amdgpu-promote-alloca -disable-promote-alloca-to-lds=1 -amdgpu-promote-alloca-to-vector-max-regs=16 < %s | FileCheck --check-prefix=BASE --check-prefix=MAX16 %s
; RUN: opt -S -mtriple=amdgcn-amd-amdhsa -passes=amdgpu-promote-alloca -disable-promote-alloca-to-lds=1 -amdgpu-promote-alloca-to-vector-max-regs=24 < %s | FileCheck --check-prefix=BASE %s --check-prefix=MAX24
-; RUN: opt -S -mtriple=amdgcn-amd-amdhsa -passes=amdgpu-promote-alloca -disable-promote-alloca-to-lds=1 -amdgpu-promote-alloca-to-vector-max-regs=32 < %s | FileCheck --check-prefix=BASE %s --check-prefix=MAX32
+; RUN: opt -S -mtriple=amdgcn-amd-amdhsa -passes=amdgpu-promote-alloca -disable-promote-alloca-to-lds=1 < %s | FileCheck --check-prefix=BASE %s --check-prefix=DEFAULT
define amdgpu_kernel void @i32_24_elements(ptr %out) #0 {
+; MAX16-LABEL: define amdgpu_kernel void @i32_24_elements(
+; MAX16-SAME: ptr [[OUT:%.*]]) #[[ATTR0:[0-9]+]] {
+; MAX16-NEXT: [[X:%.*]] = tail call i32 @llvm.amdgcn.workitem.id.x()
+; MAX16-NEXT: [[Y:%.*]] = tail call i32 @llvm.amdgcn.workitem.id.y()
+; MAX16-NEXT: [[C1:%.*]] = icmp uge i32 [[X]], 3
+; MAX16-NEXT: [[C2:%.*]] = icmp uge i32 [[Y]], 3
+; MAX16-NEXT: [[SEL1:%.*]] = select i1 [[C1]], i32 1, i32 2
+; MAX16-NEXT: [[SEL2:%.*]] = select i1 [[C2]], i32 0, i32 [[SEL1]]
+; MAX16-NEXT: [[ALLOCA:%.*]] = alloca [24 x i32], align 16, addrspace(5)
+; MAX16-NEXT: call void @llvm.memset.p5.i32(ptr addrspace(5) [[ALLOCA]], i8 0, i32 96, i1 false)
+; MAX16-NEXT: [[GEP_0:%.*]] = getelementptr inbounds [24 x i32], ptr addrspace(5) [[ALLOCA]], i32 0, i32 0
+; MAX16-NEXT: [[GEP_1:%.*]] = getelementptr inbounds [24 x i32], ptr addrspace(5) [[ALLOCA]], i32 0, i32 20
+; MAX16-NEXT: store i32 42, ptr addrspace(5) [[GEP_0]], align 4
+; MAX16-NEXT: store i32 43, ptr addrspace(5) [[GEP_1]], align 4
+; MAX16-NEXT: [[GEP:%.*]] = getelementptr inbounds [24 x i32], ptr addrspace(5) [[ALLOCA]], i32 0, i32 [[SEL2]]
+; MAX16-NEXT: [[LOAD:%.*]] = load i32, ptr addrspace(5) [[GEP]], align 4
+; MAX16-NEXT: store i32 [[LOAD]], ptr [[OUT]], align 4
+; MAX16-NEXT: ret void
+;
+; MAX24-LABEL: define amdgpu_kernel void @i32_24_elements(
+; MAX24-SAME: ptr [[OUT:%.*]]) #[[ATTR0:[0-9]+]] {
+; MAX24-NEXT: [[X:%.*]] = tail call i32 @llvm.amdgcn.workitem.id.x()
+; MAX24-NEXT: [[Y:%.*]] = tail call i32 @llvm.amdgcn.workitem.id.y()
+; MAX24-NEXT: [[C1:%.*]] = icmp uge i32 [[X]], 3
+; MAX24-NEXT: [[C2:%.*]] = icmp uge i32 [[Y]], 3
+; MAX24-NEXT: [[SEL1:%.*]] = select i1 [[C1]], i32 1, i32 2
+; MAX24-NEXT: [[SEL2:%.*]] = select i1 [[C2]], i32 0, i32 [[SEL1]]
+; MAX24-NEXT: [[ALLOCA:%.*]] = freeze <24 x i32> poison
+; MAX24-NEXT: [[TMP1:%.*]] = extractelement <24 x i32> <i32 42, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 43, i32 0, i32 0, i32 0>, i32 [[SEL2]]
+; MAX24-NEXT: store i32 [[TMP1]], ptr [[OUT]], align 4
+; MAX24-NEXT: ret void
+;
; DEFAULT-LABEL: define amdgpu_kernel void @i32_24_elements(
; DEFAULT-SAME: ptr [[OUT:%.*]]) #[[ATTR0:[0-9]+]] {
; DEFAULT-NEXT: [[X:%.*]] = tail call i32 @llvm.amdgcn.workitem.id.x()
@@ -19,12 +51,50 @@ define amdgpu_kernel void @i32_24_elements(ptr %out) #0 {
; DEFAULT-NEXT: store i32 42, ptr addrspace(5) [[GEP_0]], align 4
; DEFAULT-NEXT: store i32 43, ptr addrspace(5) [[GEP_1]], align 4
; DEFAULT-NEXT: [[GEP:%.*]] = getelementptr inbounds [24 x i32], ptr addrspace(5) [[ALLOCA]], i32 0, i32 [[SEL2]]
-; DEFAULT-NEXT: [[LOAD:%.*]] = load i32, ptr addrspace(5) [[GEP]], align 4
-; DEFAULT-NEXT: store i32 [[LOAD]], ptr [[OUT]], align 4
+; DEFAULT-NEXT: [[TMP1:%.*]] = load i32, ptr addrspace(5) [[GEP]], align 4
+; DEFAULT-NEXT: store i32 [[TMP1]], ptr [[OUT]], align 4
; DEFAULT-NEXT: ret void
;
-; MAX24-LABEL: define amdgpu_kernel void @i32_24_elements(
-; MAX24-SAME: ptr [[OUT:%.*]]) #[[ATTR0:[0-9]+]] {
+ %x = tail call i32 @llvm.amdgcn.workitem.id.x()
+ %y = tail call i32 @llvm.amdgcn.workitem.id.y()
+ %c1 = icmp uge i32 %x, 3
+ %c2 = icmp uge i32 %y, 3
+ %sel1 = select i1 %c1, i32 1, i32 2
+ %sel2 = select i1 %c2, i32 0, i32 %sel1
+ %alloca = alloca [24 x i32], align 16, addrspace(5)
+ call void @llvm.memset.p5.i32(ptr addrspace(5) %alloca, i8 0, i32 96, i1 false)
+ %gep.0 = getelementptr inbounds [24 x i32], ptr addrspace(5) %alloca, i32 0, i32 0
+ %gep.1 = getelementptr inbounds [24 x i32], ptr addrspace(5) %alloca, i32 0, i32 20
+ store i32 42, ptr addrspace(5) %gep.0
+ store i32 43, ptr addrspace(5) %gep.1
+ %gep = getelementptr inbounds [24 x i32], ptr addrspace(5) %alloca, i32 0, i32 %sel2
+ %load = load i32, ptr addrspace(5) %gep
+ store i32 %load, ptr %out
+ ret void
+}
+
+define amdgpu_kernel void @i32_24_elements_attrib(ptr %out) #1 {
+; MAX16-LABEL: define amdgpu_kernel void @i32_24_elements_attrib(
+; MAX16-SAME: ptr [[OUT:%.*]]) #[[ATTR1:[0-9]+]] {
+; MAX16-NEXT: [[X:%.*]] = tail call i32 @llvm.amdgcn.workitem.id.x()
+; MAX16-NEXT: [[Y:%.*]] = tail call i32 @llvm.amdgcn.workitem.id.y()
+; MAX16-NEXT: [[C1:%.*]] = icmp uge i32 [[X]], 3
+; MAX16-NEXT: [[C2:%.*]] = icmp uge i32 [[Y]], 3
+; MAX16-NEXT: [[SEL1:%.*]] = select i1 [[C1]], i32 1, i32 2
+; MAX16-NEXT: [[SEL2:%.*]] = select i1 [[C2]], i32 0, i32 [[SEL1]]
+; MAX16-NEXT: [[ALLOCA:%.*]] = alloca [24 x i32], align 16, addrspace(5)
+; MAX16-NEXT: call void @llvm.memset.p5.i32(ptr addrspace(5) [[ALLOCA]], i8 0, i32 96, i1 false)
+; MAX16-NEXT: [[GEP_0:%.*]] = getelementptr inbounds [24 x i32], ptr addrspace(5) [[ALLOCA]], i32 0, i32 0
+; MAX16-NEXT: [[GEP_1:%.*]] = getelementptr inbounds [24 x i32], ptr addrspace(5) [[ALLOCA]], i32 0, i32 20
+; MAX16-NEXT: store i32 42, ptr addrspace(5) [[GEP_0]], align 4
+; MAX16-NEXT: store i32 43, ptr addrspace(5) [[GEP_1]], align 4
+; MAX16-NEXT: [[GEP:%.*]] = getelementptr inbounds [24 x i32], ptr addrspace(5) [[ALLOCA]], i32 0, i32 [[SEL2]]
+; MAX16-NEXT: [[LOAD:%.*]] = load i32, ptr addrspace(5) [[GEP]], align 4
+; MAX16-NEXT: store i32 [[LOAD]], ptr [[OUT]], align 4
+; MAX16-NEXT: ret void
+;
+; MAX24-LABEL: define amdgpu_kernel void @i32_24_elements_attrib(
+; MAX24-SAME: ptr [[OUT:%.*]]) #[[ATTR1:[0-9]+]] {
; MAX24-NEXT: [[X:%.*]] = tail call i32 @llvm.amdgcn.workitem.id.x()
; MAX24-NEXT: [[Y:%.*]] = tail call i32 @llvm.amdgcn.workitem.id.y()
; MAX24-NEXT: [[C1:%.*]] = icmp uge i32 [[X]], 3
@@ -36,18 +106,18 @@ define amdgpu_kernel void @i32_24_elements(ptr %out) #0 {
; MAX24-NEXT: store i32 [[TMP1]], ptr [[OUT]], align 4
; MAX24-NEXT: ret void
;
-; MAX32-LABEL: define amdgpu_kernel void @i32_24_elements(
-; MAX32-SAME: ptr [[OUT:%.*]]) #[[ATTR0:[0-9]+]] {
-; MAX32-NEXT: [[X:%.*]] = tail call i32 @llvm.amdgcn.workitem.id.x()
-; MAX32-NEXT: [[Y:%.*]] = tail call i32 @llvm.amdgcn.workitem.id.y()
-; MAX32-NEXT: [[C1:%.*]] = icmp uge i32 [[X]], 3
-; MAX32-NEXT: [[C2:%.*]] = icmp uge i32 [[Y]], 3
-; MAX32-NEXT: [[SEL1:%.*]] = select i1 [[C1]], i32 1, i32 2
-; MAX32-NEXT: [[SEL2:%.*]] = select i1 [[C2]], i32 0, i32 [[SEL1]]
-; MAX32-NEXT: [[ALLOCA:%.*]] = freeze <24 x i32> poison
-; MAX32-NEXT: [[TMP1:%.*]] = extractelement <24 x i32> <i32 42, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 43, i32 0, i32 0, i32 0>, i32 [[SEL2]]
-; MAX32-NEXT: store i32 [[TMP1]], ptr [[OUT]], align 4
-; MAX32-NEXT: ret void
+; DEFAULT-LABEL: define amdgpu_kernel void @i32_24_elements_attrib(
+; DEFAULT-SAME: ptr [[OUT:%.*]]) #[[ATTR1:[0-9]+]] {
+; DEFAULT-NEXT: [[X:%.*]] = tail call i32 @llvm.amdgcn.workitem.id.x()
+; DEFAULT-NEXT: [[Y:%.*]] = tail call i32 @llvm.amdgcn.workitem.id.y()
+; DEFAULT-NEXT: [[C1:%.*]] = icmp uge i32 [[X]], 3
+; DEFAULT-NEXT: [[C2:%.*]] = icmp uge i32 [[Y]], 3
+; DEFAULT-NEXT: [[SEL1:%.*]] = select i1 [[C1]], i32 1, i32 2
+; DEFAULT-NEXT: [[SEL2:%.*]] = select i1 [[C2]], i32 0, i32 [[SEL1]]
+; DEFAULT-NEXT: [[ALLOCA:%.*]] = freeze <24 x i32> poison
+; DEFAULT-NEXT: [[TMP1:%.*]] = extractelement <24 x i32> <i32 42, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 43, i32 0, i32 0, i32 0>, i32 [[SEL2]]
+; DEFAULT-NEXT: store i32 [[TMP1]], ptr [[OUT]], align 4
+; DEFAULT-NEXT: ret void
;
%x = tail call i32 @llvm.amdgcn.workitem.id.x()
%y = tail call i32 @llvm.amdgcn.workitem.id.y()
@@ -67,18 +137,24 @@ define amdgpu_kernel void @i32_24_elements(ptr %out) #0 {
ret void
}
-define amdgpu_kernel void @i32_24_elements_attrib(ptr %out) #1 {
-; BASE-LABEL: define amdgpu_kernel void @i32_24_elements_attrib(
-; BASE-SAME: ptr [[OUT:%.*]]) #[[ATTR1:[0-9]+]] {
+define amdgpu_kernel void @i32_32_elements(ptr %out) #0 {
+; BASE-LABEL: define amdgpu_kernel void @i32_32_elements(
+; BASE-SAME: ptr [[OUT:%.*]]) #[[ATTR0:[0-9]+]] {
; BASE-NEXT: [[X:%.*]] = tail call i32 @llvm.amdgcn.workitem.id.x()
; BASE-NEXT: [[Y:%.*]] = tail call i32 @llvm.amdgcn.workitem.id.y()
; BASE-NEXT: [[C1:%.*]] = icmp uge i32 [[X]], 3
; BASE-NEXT: [[C2:%.*]] = icmp uge i32 [[Y]], 3
; BASE-NEXT: [[SEL1:%.*]] = select i1 [[C1]], i32 1, i32 2
; BASE-NEXT: [[SEL2:%.*]] = select i1 [[C2]], i32 0, i32 [[SEL1]]
-; BASE-NEXT: [[ALLOCA:%.*]] = freeze <24 x i32> poison
-; BASE-NEXT: [[TMP1:%.*]] = extractelement <24 x i32> <i32 42, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 43, i32 0, i32 0, i32 0>, i32 [[SEL2]]
-; BASE-NEXT: store i32 [[TMP1]], ptr [[OUT]], align 4
+; BASE-NEXT: [[ALLOCA:%.*]] = alloca [32 x i32], align 16, addrspace(5)
+; BASE-NEXT: call void @llvm.memset.p5.i32(ptr addrspace(5) [[ALLOCA]], i8 0, i32 128, i1 false)
+; BASE-NEXT: [[GEP_0:%.*]] = getelementptr inbounds [32 x i32], ptr addrspace(5) [[ALLOCA]], i32 0, i32 0
+; BASE-NEXT: [[GEP_1:%.*]] = getelementptr inbounds [32 x i32], ptr addrspace(5) [[ALLOCA]], i32 0, i32 30
+; BASE-NEXT: store i32 42, ptr addrspace(5) [[GEP_0]], align 4
+; BASE-NEXT: store i32 43, ptr addrspace(5) [[GEP_1]], align 4
+; BASE-NEXT: [[GEP:%.*]] = getelementptr inbounds [32 x i32], ptr addrspace(5) [[ALLOCA]], i32 0, i32 [[SEL2]]
+; BASE-NEXT: [[LOAD:%.*]] = load i32, ptr addrspace(5) [[GEP]], align 4
+; BASE-NEXT: store i32 [[LOAD]], ptr [[OUT]], align 4
; BASE-NEXT: ret void
;
%x = tail call i32 @llvm.amdgcn.workitem.id.x()
@@ -87,40 +163,40 @@ define amdgpu_kernel void @i32_24_elements_attrib(ptr %out) #1 {
%c2 = icmp uge i32 %y, 3
%sel1 = select i1 %c1, i32 1, i32 2
%sel2 = select i1 %c2, i32 0, i32 %sel1
- %alloca = alloca [24 x i32], align 16, addrspace(5)
- call void @llvm.memset.p5.i32(ptr addrspace(5) %alloca, i8 0, i32 96, i1 false)
- %gep.0 = getelementptr inbounds [24 x i32], ptr addrspace(5) %alloca, i32 0, i32 0
- %gep.1 = getelementptr inbounds [24 x i32], ptr addrspace(5) %alloca, i32 0, i32 20
+ %alloca = alloca [32 x i32], align 16, addrspace(5)
+ call void @llvm.memset.p5.i32(ptr addrspace(5) %alloca, i8 0, i32 128, i1 false)
+ %gep.0 = getelementptr inbounds [32 x i32], ptr addrspace(5) %alloca, i32 0, i32 0
+ %gep.1 = getelementptr inbounds [32 x i32], ptr addrspace(5) %alloca, i32 0, i32 30
store i32 42, ptr addrspace(5) %gep.0
store i32 43, ptr addrspace(5) %gep.1
- %gep = getelementptr inbounds [24 x i32], ptr addrspace(5) %alloca, i32 0, i32 %sel2
+ %gep = getelementptr inbounds [32 x i32], ptr addrspace(5) %alloca, i32 0, i32 %sel2
%load = load i32, ptr addrspace(5) %gep
store i32 %load, ptr %out
ret void
}
-define amdgpu_kernel void @i32_32_elements(ptr %out) #0 {
-; DEFAULT-LABEL: define amdgpu_kernel void @i32_32_elements(
-; DEFAULT-SAME: ptr [[OUT:%.*]]) #[[ATTR0]] {
-; DEFAULT-NEXT: [[X:%.*]] = tail call i32 @llvm.amdgcn.workitem.id.x()
-; DEFAULT-NEXT: [[Y:%.*]] = tail call i32 @llvm.amdgcn.workitem.id.y()
-; DEFAULT-NEXT: [[C1:%.*]] = icmp uge i32 [[X]], 3
-; DEFAULT-NEXT: [[C2:%.*]] = icmp uge i32 [[Y]], 3
-; DEFAULT-NEXT: [[SEL1:%.*]] = select i1 [[C1]], i32 1, i32 2
-; DEFAULT-NEXT: [[SEL2:%.*]] = select i1 [[C2]], i32 0, i32 [[SEL1]]
-; DEFAULT-NEXT: [[ALLOCA:%.*]] = alloca [32 x i32], align 16, addrspace(5)
-; DEFAULT-NEXT: call void @llvm.memset.p5.i32(ptr addrspace(5) [[ALLOCA]], i8 0, i32 128, i1 false)
-; DEFAULT-NEXT: [[GEP_0:%.*]] = getelementptr inbounds [32 x i32], ptr addrspace(5) [[ALLOCA]], i32 0, i32 0
-; DEFAULT-NEXT: [[GEP_1:%.*]] = getelementptr inbounds [32 x i32], ptr addrspace(5) [[ALLOCA]], i32 0, i32 30
-; DEFAULT-NEXT: store i32 42, ptr addrspace(5) [[GEP_0]], align 4
-; DEFAULT-NEXT: store i32 43, ptr addrspace(5) [[GEP_1]], align 4
-; DEFAULT-NEXT: [[GEP:%.*]] = getelementptr inbounds [32 x i32], ptr addrspace(5) [[ALLOCA]], i32 0, i32 [[SEL2]]
-; DEFAULT-NEXT: [[LOAD:%.*]] = load i32, ptr addrspace(5) [[GEP]], align 4
-; DEFAULT-NEXT: store i32 [[LOAD]], ptr [[OUT]], align 4
-; DEFAULT-NEXT: ret void
+define amdgpu_kernel void @i32_32_elements_attrib(ptr %out) #2 {
+; MAX16-LABEL: define amdgpu_kernel void @i32_32_elements_attrib(
+; MAX16-SAME: ptr [[OUT:%.*]]) #[[ATTR2:[0-9]+]] {
+; MAX16-NEXT: [[X:%.*]] = tail call i32 @llvm.amdgcn.workitem.id.x()
+; MAX16-NEXT: [[Y:%.*]] = tail call i32 @llvm.amdgcn.workitem.id.y()
+; MAX16-NEXT: [[C1:%.*]] = icmp uge i32 [[X]], 3
+; MAX16-NEXT: [[C2:%.*]] = icmp uge i32 [[Y]], 3
+; MAX16-NEXT: [[SEL1:%.*]] = select i1 [[C1]], i32 1, i32 2
+; MAX16-NEXT: [[SEL2:%.*]] = select i1 [[C2]], i32 0, i32 [[SEL1]]
+; MAX16-NEXT: [[ALLOCA:%.*]] = alloca [32 x i32], align 16, addrspace(5)
+; MAX16-NEXT: call void @llvm.memset.p5.i32(ptr addrspace(5) [[ALLOCA]], i8 0, i32 128, i1 false)
+; MAX16-NEXT: [[GEP_0:%.*]] = getelementptr inbounds [32 x i32], ptr addrspace(5) [[ALLOCA]], i32 0, i32 0
+; MAX16-NEXT: [[GEP_1:%.*]] = getelementptr inbounds [32 x i32], ptr addrspace(5) [[ALLOCA]], i32 0, i32 30
+; MAX16-NEXT: store i32 42, ptr addrspace(5) [[GEP_0]], align 4
+; MAX16-NEXT: store i32 43, ptr addrspace(5) [[GEP_1]], align 4
+; MAX16-NEXT: [[GEP:%.*]] = getelementptr inbounds [32 x i32], ptr addrspace(5) [[ALLOCA]], i32 0, i32 [[SEL2]]
+; MAX16-NEXT: [[TMP1:%.*]] = load i32, ptr addrspace(5) [[GEP]], align 4
+; MAX16-NEXT: store i32 [[TMP1]]...
[truncated]
arsenm left a comment:
Generally I think using volatile is the best way to defeat the optimization for the tests intended to use the stack, but it's not that important
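A minimal sketch of what the volatile approach might look like for one of the affected stack tests, assuming the promote-alloca pass does not touch allocas that have volatile accesses (this is the reviewer's suggestion, not what the patch does). The alloca keeps its original `[2 x <16 x i32>]` size and only the load changes:

```llvm
; Sketch only: variant of @v16i32_stack that stays on the stack via a volatile load.
define amdgpu_kernel void @v16i32_stack(ptr addrspace(1) %out, i32 %a) {
  %alloca = alloca [2 x <16 x i32>], addrspace(5)
  %tmp0 = getelementptr [2 x <16 x i32>], ptr addrspace(5) %alloca, i32 0, i32 %a
  ; volatile marks the access as one the optimizer must not remove or rewrite
  %tmp5 = load volatile <16 x i32>, ptr addrspace(5) %tmp0
  store <16 x i32> %tmp5, ptr addrspace(1) %out
  ret void
}
```

With this form the test would no longer depend on the exact value of the default limit, at the cost of regenerating any CHECK lines that match the non-volatile load.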
define amdgpu_kernel void @v16i32_stack(ptr addrspace(1) %out, i32 %a) {
- %alloca = alloca [2 x <16 x i32>], addrspace(5)
- %tmp0 = getelementptr [2 x <16 x i32>], ptr addrspace(5) %alloca, i32 0, i32 %a
+ %alloca = alloca [3 x <16 x i32>], addrspace(5)
These cases probably should have been defeated through volatile or adding the flag but I guess it doesn't really matter
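The flag route mentioned here could look like the RUN line below, which pins the limit back to the previous default so the original `[2 x <16 x i32>]` allocas would stay in scratch. This is only a sketch: the option name comes from the patch, but the triple and the check prefix are placeholders, not the RUN lines actually present in amdgpu.private-memory.ll.

```llvm
; Hypothetical RUN line: force the old 16-register limit for this test only.
; RUN: llc -mtriple=amdgcn-amd-amdhsa -amdgpu-promote-alloca-to-vector-max-regs=16 < %s | FileCheck %s
```

Compared with growing the allocas, this keeps the test inputs unchanged but ties the test to a command-line option rather than to the IR itself.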
- %v = alloca <32 x i32>, align 128, addrspace(5)
- store <32 x i32> %x, ptr addrspace(5) %v
+ %v = alloca <33 x i32>, align 128, addrspace(5)
+ store <33 x i32> %x, ptr addrspace(5) %v
volatile?
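Applied to this hunk, the suggestion would presumably mean keeping the original `<32 x i32>` size and marking the store volatile. A sketch under that reading; the test's attribute group `#0` is not shown in the truncated diff, so it is left out here, and the CHECK lines are omitted because they were not regenerated for this variant:

```llvm
; Sketch only: @realign_stack kept at <32 x i32> with a volatile store.
declare amdgpu_gfx void @callee(i32)

define amdgpu_cs void @realign_stack(<32 x i32> %x) {
  %v = alloca <32 x i32>, align 128, addrspace(5)
  ; volatile store: intended to keep %v as a real, 128-byte-aligned stack object
  store volatile <32 x i32> %x, ptr addrspace(5) %v
  call amdgpu_gfx void @callee(i32 71)
  ret void
}
```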
LLVM Buildbot has detected a new failure on a builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/143/builds/10298
Increase promote-alloca-to-vector-max-regs to 32 from 16.
This restores default promotion of 16 x double which was disabled by #127973.
Fixes SWDEV-525817.