[AMDGPU] Restrict promote alloca on pointers across address spaces #119762
Conversation
If a load/store of a pointer to the stack crosses address spaces, restrict the promote-alloca pass from vectorizing when the pointer storage sizes differ.
Example: in address space 0 the pointer size is 64 bits; in address space 5 it is 32 bits.
Casting the pointer across these address spaces is undefined behavior.
Assertion found through fuzzing.
@llvm/pr-subscribers-backend-amdgpu
Author: Sumanth Gundapaneni (sgundapa)
Changes: If a load/store of a pointer to the stack crosses address spaces, restrict the promote-alloca pass from vectorizing when the pointer storage sizes differ.
Full diff: https://github.com/llvm/llvm-project/pull/119762.diff (3 files affected)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp b/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
index e27ef71c1c0883..913a601b0e0888 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
@@ -796,6 +796,16 @@ bool AMDGPUPromoteAllocaImpl::tryPromoteAllocaToVector(AllocaInst &Alloca) {
if (!IsSimple)
return RejectUser(Inst, "not a simple load or store");
+ // If the access type is a pointer, reject the address spaces with
+ // different pointer sizes.
+ // store <2 x ptr> %arg, ptr addrspace(5) %alloca - Reject.
+ // %tmp = load <4 x ptr addrspace(3)>, ptr addrspace(5) %alloca - ok.
+ if (AccessTy->isPtrOrPtrVectorTy()) {
+ if (DL->getPointerSize(getLoadStoreAddressSpace(Inst)) !=
+ DL->getPointerSize(AccessTy->getPointerAddressSpace()))
+ return RejectUser(Inst, "pointers to incompatible address spaces");
+ }
+
Ptr = Ptr->stripPointerCasts();
// Alloca already accessed as vector.
diff --git a/llvm/test/CodeGen/AMDGPU/promote-alloca-loadstores.ll b/llvm/test/CodeGen/AMDGPU/promote-alloca-loadstores.ll
index 1e49500a243e10..2a22cdda7a7e79 100644
--- a/llvm/test/CodeGen/AMDGPU/promote-alloca-loadstores.ll
+++ b/llvm/test/CodeGen/AMDGPU/promote-alloca-loadstores.ll
@@ -93,21 +93,6 @@ end:
ret void
}
-define ptr @alloca_load_store_ptr64_full_ivec(ptr %arg) {
-; CHECK-LABEL: define ptr @alloca_load_store_ptr64_full_ivec
-; CHECK-SAME: (ptr [[ARG:%.*]]) {
-; CHECK-NEXT: entry:
-; CHECK-NEXT: [[TMP0:%.*]] = ptrtoint ptr [[ARG]] to i64
-; CHECK-NEXT: [[TMP1:%.*]] = bitcast i64 [[TMP0]] to <8 x i8>
-; CHECK-NEXT: ret ptr [[ARG]]
-;
-entry:
- %alloca = alloca [8 x i8], align 8, addrspace(5)
- store ptr %arg, ptr addrspace(5) %alloca, align 8
- %tmp = load ptr, ptr addrspace(5) %alloca, align 8
- ret ptr %tmp
-}
-
define ptr addrspace(3) @alloca_load_store_ptr32_full_ivec(ptr addrspace(3) %arg) {
; CHECK-LABEL: define ptr addrspace(3) @alloca_load_store_ptr32_full_ivec
; CHECK-SAME: (ptr addrspace(3) [[ARG:%.*]]) {
@@ -123,22 +108,6 @@ entry:
ret ptr addrspace(3) %tmp
}
-define <4 x ptr addrspace(3)> @alloca_load_store_ptr_mixed_full_ptrvec(<2 x ptr> %arg) {
-; CHECK-LABEL: define <4 x ptr addrspace(3)> @alloca_load_store_ptr_mixed_full_ptrvec
-; CHECK-SAME: (<2 x ptr> [[ARG:%.*]]) {
-; CHECK-NEXT: entry:
-; CHECK-NEXT: [[TMP0:%.*]] = ptrtoint <2 x ptr> [[ARG]] to <2 x i64>
-; CHECK-NEXT: [[TMP1:%.*]] = bitcast <2 x i64> [[TMP0]] to <4 x i32>
-; CHECK-NEXT: [[TMP2:%.*]] = inttoptr <4 x i32> [[TMP1]] to <4 x ptr addrspace(3)>
-; CHECK-NEXT: ret <4 x ptr addrspace(3)> [[TMP2]]
-;
-entry:
- %alloca = alloca [4 x i32], align 8, addrspace(5)
- store <2 x ptr> %arg, ptr addrspace(5) %alloca, align 8
- %tmp = load <4 x ptr addrspace(3)>, ptr addrspace(5) %alloca, align 8
- ret <4 x ptr addrspace(3)> %tmp
-}
-
define <8 x i16> @ptralloca_load_store_ints_full(<2 x i64> %arg) {
; CHECK-LABEL: define <8 x i16> @ptralloca_load_store_ints_full
; CHECK-SAME: (<2 x i64> [[ARG:%.*]]) {
@@ -198,3 +167,39 @@ entry:
%tmp = load ptr addrspace(3), ptr addrspace(5) %alloca, align 8
ret ptr addrspace(3) %tmp
}
+
+; Will not vectorize because we are doing a load/store of a pointer across
+; address spaces of varying pointer sizes.
+define ptr @alloca_load_store_ptr64_full_ivec(ptr %arg) {
+; CHECK-LABEL: define ptr @alloca_load_store_ptr64_full_ivec
+; CHECK-SAME: (ptr [[ARG:%.*]]) {
+; CHECK-NEXT: entry:
+; CHECK-NEXT: [[ALLOCA:%.*]] = alloca [8 x i8], align 8, addrspace(5)
+; CHECK-NEXT: store ptr [[ARG]], ptr addrspace(5) [[ALLOCA]], align 8
+; CHECK-NEXT: [[TMP:%.*]] = load ptr, ptr addrspace(5) [[ALLOCA]], align 8
+; CHECK-NEXT: ret ptr [[TMP]]
+;
+entry:
+ %alloca = alloca [8 x i8], align 8, addrspace(5)
+ store ptr %arg, ptr addrspace(5) %alloca, align 8
+ %tmp = load ptr, ptr addrspace(5) %alloca, align 8
+ ret ptr %tmp
+}
+
+; Will not vectorize because we are doing a load/store of a pointer across
+; address spaces of varying pointer sizes.
+define <4 x ptr addrspace(3)> @alloca_load_store_ptr_mixed_full_ptrvec(<2 x ptr> %arg) {
+; CHECK-LABEL: define <4 x ptr addrspace(3)> @alloca_load_store_ptr_mixed_full_ptrvec
+; CHECK-SAME: (<2 x ptr> [[ARG:%.*]]) {
+; CHECK-NEXT: entry:
+; CHECK-NEXT: [[ALLOCA:%.*]] = alloca [4 x i32], align 8, addrspace(5)
+; CHECK-NEXT: store <2 x ptr> [[ARG]], ptr addrspace(5) [[ALLOCA]], align 8
+; CHECK-NEXT: [[TMP:%.*]] = load <4 x ptr addrspace(3)>, ptr addrspace(5) [[ALLOCA]], align 8
+; CHECK-NEXT: ret <4 x ptr addrspace(3)> [[TMP]]
+;
+entry:
+ %alloca = alloca [4 x i32], align 8, addrspace(5)
+ store <2 x ptr> %arg, ptr addrspace(5) %alloca, align 8
+ %tmp = load <4 x ptr addrspace(3)>, ptr addrspace(5) %alloca, align 8
+ ret <4 x ptr addrspace(3)> %tmp
+}
diff --git a/llvm/test/CodeGen/AMDGPU/promote-alloca-subvecs.ll b/llvm/test/CodeGen/AMDGPU/promote-alloca-subvecs.ll
index 7c5410004ed5b7..b583f33a7d9e63 100644
--- a/llvm/test/CodeGen/AMDGPU/promote-alloca-subvecs.ll
+++ b/llvm/test/CodeGen/AMDGPU/promote-alloca-subvecs.ll
@@ -218,38 +218,35 @@ entry:
ret void
}
-define void @test_different_type_subvector_ptrs(<2 x ptr addrspace(1)> %val.0, <4 x ptr addrspace(3)> %val.1) {
+define void @test_different_type_subvector_ptrs(<2 x ptr addrspace(3)> %val.0, <4 x ptr addrspace(3)> %val.1) {
; CHECK-LABEL: define void @test_different_type_subvector_ptrs
-; CHECK-SAME: (<2 x ptr addrspace(1)> [[VAL_0:%.*]], <4 x ptr addrspace(3)> [[VAL_1:%.*]]) {
+; CHECK-SAME: (<2 x ptr addrspace(3)> [[VAL_0:%.*]], <4 x ptr addrspace(3)> [[VAL_1:%.*]]) {
; CHECK-NEXT: entry:
-; CHECK-NEXT: [[TMP0:%.*]] = ptrtoint <2 x ptr addrspace(1)> [[VAL_0]] to <2 x i64>
-; CHECK-NEXT: [[TMP1:%.*]] = extractelement <2 x i64> [[TMP0]], i64 0
-; CHECK-NEXT: [[TMP2:%.*]] = insertelement <4 x i64> undef, i64 [[TMP1]], i32 0
-; CHECK-NEXT: [[TMP3:%.*]] = extractelement <2 x i64> [[TMP0]], i64 1
-; CHECK-NEXT: [[TMP4:%.*]] = insertelement <4 x i64> [[TMP2]], i64 [[TMP3]], i32 1
-; CHECK-NEXT: [[TMP5:%.*]] = insertelement <2 x i64> poison, i64 [[TMP1]], i64 0
-; CHECK-NEXT: [[TMP6:%.*]] = insertelement <2 x i64> [[TMP5]], i64 [[TMP3]], i64 1
-; CHECK-NEXT: [[TMP7:%.*]] = inttoptr <2 x i64> [[TMP6]] to <2 x ptr addrspace(1)>
-; CHECK-NEXT: [[DUMMYUSER:%.*]] = freeze <2 x ptr addrspace(1)> [[TMP7]]
-; CHECK-NEXT: [[TMP8:%.*]] = ptrtoint <4 x ptr addrspace(3)> [[VAL_1]] to <4 x i32>
-; CHECK-NEXT: [[TMP9:%.*]] = bitcast <4 x i32> [[TMP8]] to <2 x i64>
-; CHECK-NEXT: [[TMP10:%.*]] = extractelement <2 x i64> [[TMP9]], i64 0
-; CHECK-NEXT: [[TMP11:%.*]] = insertelement <4 x i64> [[TMP4]], i64 [[TMP10]], i32 0
-; CHECK-NEXT: [[TMP12:%.*]] = extractelement <2 x i64> [[TMP9]], i64 1
-; CHECK-NEXT: [[TMP13:%.*]] = insertelement <4 x i64> [[TMP11]], i64 [[TMP12]], i32 1
-; CHECK-NEXT: [[TMP14:%.*]] = insertelement <2 x i64> poison, i64 [[TMP10]], i64 0
-; CHECK-NEXT: [[TMP15:%.*]] = insertelement <2 x i64> [[TMP14]], i64 [[TMP12]], i64 1
-; CHECK-NEXT: [[TMP16:%.*]] = bitcast <2 x i64> [[TMP15]] to <4 x i32>
-; CHECK-NEXT: [[TMP17:%.*]] = inttoptr <4 x i32> [[TMP16]] to <4 x ptr addrspace(3)>
-; CHECK-NEXT: [[DUMMYUSER_1:%.*]] = freeze <4 x ptr addrspace(3)> [[TMP17]]
-; CHECK-NEXT: ret void
-;
+; CHECK-NEXT: [[TMP0:%.*]] = ptrtoint <2 x ptr addrspace(3)> [[VAL_0]] to <2 x i32>
+; CHECK-NEXT: [[TMP1:%.*]] = bitcast <2 x i32> [[TMP0]] to <1 x i64>
+; CHECK-NEXT: [[TMP2:%.*]] = extractelement <1 x i64> [[TMP1]], i64 0
+; CHECK-NEXT: [[TMP3:%.*]] = insertelement <4 x i64> undef, i64 [[TMP2]], i32 0
+; CHECK-NEXT: [[TMP4:%.*]] = insertelement <1 x i64> poison, i64 [[TMP2]], i64 0
+; CHECK-NEXT: [[TMP5:%.*]] = bitcast <1 x i64> [[TMP4]] to <2 x i32>
+; CHECK-NEXT: [[TMP6:%.*]] = inttoptr <2 x i32> [[TMP5]] to <2 x ptr addrspace(3)>
+; CHECK-NEXT: [[DUMMYUSER:%.*]] = freeze <2 x ptr addrspace(3)> [[TMP6]]
+; CHECK-NEXT: [[TMP7:%.*]] = ptrtoint <4 x ptr addrspace(3)> [[VAL_1]] to <4 x i32>
+; CHECK-NEXT: [[TMP8:%.*]] = bitcast <4 x i32> [[TMP7]] to <2 x i64>
+; CHECK-NEXT: [[TMP9:%.*]] = extractelement <2 x i64> [[TMP8]], i64 0
+; CHECK-NEXT: [[TMP10:%.*]] = insertelement <4 x i64> [[TMP3]], i64 [[TMP9]], i32 0
+; CHECK-NEXT: [[TMP11:%.*]] = extractelement <2 x i64> [[TMP8]], i64 1
+; CHECK-NEXT: [[TMP12:%.*]] = insertelement <4 x i64> [[TMP10]], i64 [[TMP11]], i32 1
+; CHECK-NEXT: [[TMP13:%.*]] = insertelement <2 x i64> poison, i64 [[TMP9]], i64 0
+; CHECK-NEXT: [[TMP14:%.*]] = insertelement <2 x i64> [[TMP13]], i64 [[TMP11]], i64 1
+; CHECK-NEXT: [[TMP15:%.*]] = bitcast <2 x i64> [[TMP14]] to <4 x i32>
+; CHECK-NEXT: [[TMP16:%.*]] = inttoptr <4 x i32> [[TMP15]] to <4 x ptr addrspace(3)>
+; CHECK-NEXT: [[DUMMYUSER_1:%.*]] = freeze <4 x ptr addrspace(3)> [[TMP16]]
entry:
%stack = alloca [4 x i64], align 4, addrspace(5)
- store <2 x ptr addrspace(1)> %val.0, ptr addrspace(5) %stack
- %reload = load <2 x ptr addrspace(1)>, ptr addrspace(5) %stack
- %dummyuser = freeze <2 x ptr addrspace(1)> %reload
+ store <2 x ptr addrspace(3)> %val.0, ptr addrspace(5) %stack
+ %reload = load <2 x ptr addrspace(3)>, ptr addrspace(5) %stack
+ %dummyuser = freeze <2 x ptr addrspace(3)> %reload
store <4 x ptr addrspace(3)> %val.1, ptr addrspace(5) %stack
%reload.1 = load <4 x ptr addrspace(3)>, ptr addrspace(5) %stack
shiltian
left a comment
Casting the pointer across these address spaces is undefined behavior.
Are you sure?
I don't think this PR is right. It probably just treats the symptoms instead of the root cause of the assertion.
  store <2 x ptr> %arg, ptr addrspace(5) %alloca, align 8
  %tmp = load <4 x ptr addrspace(3)>, ptr addrspace(5) %alloca, align 8
  ret <4 x ptr addrspace(3)> %tmp
}
Should reproduce the same situation with one as a scalar, and some with int/fp types
Sure
Can you please be specific here about what you mean by scalar?
i32 and double
You can test this locally with the following command:

git diff -U0 --pickaxe-regex -S '([^a-zA-Z0-9#_-]undef[^a-zA-Z0-9_-]|UndefValue::get)' 988480323d5ef9bb658f13ac598d4ce2aa23c782 d52e0a7229f9ce471b0c15b3fe19772da2272fcd llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp llvm/test/CodeGen/AMDGPU/promote-alloca-loadstores.ll

The following files introduce new uses of undef:

undef is now deprecated and should only be used in the rare cases where no replacement is possible. For example, a load of uninitialized memory yields undef. In tests, avoid using undef where possible.

For example, this is considered a bad practice:

define void @fn() {
  ...
  br i1 undef, ...
}

Please use the following instead:

define void @fn(i1 %cond) {
  ...
  br i1 %cond, ...
}

Please refer to the Undefined Behavior Manual for more information.
bool canCopyAcrossAddressSpaces = true;
if (AccessTy->isPtrOrPtrVectorTy() && VecTy->isPtrOrPtrVectorTy()) {
  if (DL.getPointerSize(AccessTy->getPointerAddressSpace()) !=
      DL.getPointerSize(VecTy->getPointerAddressSpace()))
    canCopyAcrossAddressSpaces = false;
}
There is no copy across address spaces here, this check is conceptually wrong. You only need to verify the size is compatible. For the final code emission, you'll need to insert no-op casts to get the types to match
I think the existing code here is correct, and you would only need adjustment at the code transformation later
Let me explore the no-op casts.
So <2 x ptr addrspace(5)> can be stored into ptr addrspace(0)? Isn't this undefined behavior?
It may be illegal type punning but you don't have the context. It's just bytes
Thanks Matt. I will try a few ways to avoid this casting assert.
Another quick question: at which point in the optimization pipeline do we decide this is illegal type punning leading to undefined behavior?
It's not a property of the pass pipeline, it's a property of the system as a whole. The address space cast from addrspace(5) to 0 is not a no-op cast. If you reload it and use it as the wrong type, it will be an invalid pointer
define void @alloca_load_store_ptr_mixed_addrspace_ptrvec() {
entry:
  %A2 = alloca <2 x ptr addrspace(5)>, align 8, addrspace(5)
  store ptr undef, ptr addrspace(5) %A2, align 8
Should also test with a real value instead of an undef. A test with a constant leaf would be useful too
arsenm
left a comment
Check is not correct, and test isn't using the correct alloca address space