
Conversation

@sgundapa
Contributor

If a load/store of a pointer type to the stack uses a different address space, we restrict the promote-alloca pass from vectorizing when the pointer storage sizes differ.
Example: In address space 0, the pointer size is 64 bits.
In address space 5, the pointer size is 32 bits.
Casting a pointer across these address spaces is undefined behavior.
The assertion was found through fuzzing.
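
For illustration, a minimal reproducer sketch of the two cases, mirroring the test added in this patch (the opt invocation is an assumption; the RUN lines are not shown in the diff):

  ; Assumed invocation: opt -S -mtriple=amdgcn-- -passes=amdgpu-promote-alloca repro.ll
  define <4 x ptr addrspace(3)> @repro(<2 x ptr> %arg) {
  entry:
    %alloca = alloca [4 x i32], align 8, addrspace(5)
    ; Two 64-bit flat pointers stored through a 32-bit private pointer: rejected.
    store <2 x ptr> %arg, ptr addrspace(5) %alloca, align 8
    ; Four 32-bit local pointers loaded back: pointer sizes match, ok.
    %tmp = load <4 x ptr addrspace(3)>, ptr addrspace(5) %alloca, align 8
    ret <4 x ptr addrspace(3)> %tmp
  }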

@llvmbot
Member

llvmbot commented Dec 12, 2024

@llvm/pr-subscribers-backend-amdgpu

Author: Sumanth Gundapaneni (sgundapa)

Changes

If a load/store of a pointer type to the stack uses a different address space, we restrict the promote-alloca pass from vectorizing when the pointer storage sizes differ.
Example: In address space 0, the pointer size is 64 bits.
In address space 5, the pointer size is 32 bits.
Casting a pointer across these address spaces is undefined behavior.
The assertion was found through fuzzing.


Full diff: https://github.com/llvm/llvm-project/pull/119762.diff

3 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp (+10)
  • (modified) llvm/test/CodeGen/AMDGPU/promote-alloca-loadstores.ll (+36-31)
  • (modified) llvm/test/CodeGen/AMDGPU/promote-alloca-subvecs.ll (+24-27)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp b/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
index e27ef71c1c0883..913a601b0e0888 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
@@ -796,6 +796,16 @@ bool AMDGPUPromoteAllocaImpl::tryPromoteAllocaToVector(AllocaInst &Alloca) {
       if (!IsSimple)
         return RejectUser(Inst, "not a simple load or store");
 
+      // If the access type is a pointer, reject the address spaces with
+      // different pointer sizes.
+      // store <2 x ptr> %arg, ptr addrspace(5) %alloca - Reject.
+      // %tmp = load <4 x ptr addrspace(3)>, ptr addrspace(5) %alloca - ok.
+      if (AccessTy->isPtrOrPtrVectorTy()) {
+        if (DL->getPointerSize(getLoadStoreAddressSpace(Inst)) !=
+            DL->getPointerSize(AccessTy->getPointerAddressSpace()))
+          return RejectUser(Inst, "pointers to incompatible address spaces");
+      }
+
       Ptr = Ptr->stripPointerCasts();
 
       // Alloca already accessed as vector.
diff --git a/llvm/test/CodeGen/AMDGPU/promote-alloca-loadstores.ll b/llvm/test/CodeGen/AMDGPU/promote-alloca-loadstores.ll
index 1e49500a243e10..2a22cdda7a7e79 100644
--- a/llvm/test/CodeGen/AMDGPU/promote-alloca-loadstores.ll
+++ b/llvm/test/CodeGen/AMDGPU/promote-alloca-loadstores.ll
@@ -93,21 +93,6 @@ end:
   ret void
 }
 
-define ptr @alloca_load_store_ptr64_full_ivec(ptr %arg) {
-; CHECK-LABEL: define ptr @alloca_load_store_ptr64_full_ivec
-; CHECK-SAME: (ptr [[ARG:%.*]]) {
-; CHECK-NEXT:  entry:
-; CHECK-NEXT:    [[TMP0:%.*]] = ptrtoint ptr [[ARG]] to i64
-; CHECK-NEXT:    [[TMP1:%.*]] = bitcast i64 [[TMP0]] to <8 x i8>
-; CHECK-NEXT:    ret ptr [[ARG]]
-;
-entry:
-  %alloca = alloca [8 x i8], align 8, addrspace(5)
-  store ptr %arg, ptr addrspace(5) %alloca, align 8
-  %tmp = load ptr, ptr addrspace(5) %alloca, align 8
-  ret ptr %tmp
-}
-
 define ptr addrspace(3) @alloca_load_store_ptr32_full_ivec(ptr addrspace(3) %arg) {
 ; CHECK-LABEL: define ptr addrspace(3) @alloca_load_store_ptr32_full_ivec
 ; CHECK-SAME: (ptr addrspace(3) [[ARG:%.*]]) {
@@ -123,22 +108,6 @@ entry:
   ret ptr addrspace(3) %tmp
 }
 
-define <4 x ptr addrspace(3)> @alloca_load_store_ptr_mixed_full_ptrvec(<2 x ptr> %arg) {
-; CHECK-LABEL: define <4 x ptr addrspace(3)> @alloca_load_store_ptr_mixed_full_ptrvec
-; CHECK-SAME: (<2 x ptr> [[ARG:%.*]]) {
-; CHECK-NEXT:  entry:
-; CHECK-NEXT:    [[TMP0:%.*]] = ptrtoint <2 x ptr> [[ARG]] to <2 x i64>
-; CHECK-NEXT:    [[TMP1:%.*]] = bitcast <2 x i64> [[TMP0]] to <4 x i32>
-; CHECK-NEXT:    [[TMP2:%.*]] = inttoptr <4 x i32> [[TMP1]] to <4 x ptr addrspace(3)>
-; CHECK-NEXT:    ret <4 x ptr addrspace(3)> [[TMP2]]
-;
-entry:
-  %alloca = alloca [4 x i32], align 8, addrspace(5)
-  store <2 x ptr> %arg, ptr addrspace(5) %alloca, align 8
-  %tmp = load <4 x ptr addrspace(3)>, ptr addrspace(5) %alloca, align 8
-  ret <4 x ptr addrspace(3)> %tmp
-}
-
 define <8 x i16> @ptralloca_load_store_ints_full(<2 x i64> %arg) {
 ; CHECK-LABEL: define <8 x i16> @ptralloca_load_store_ints_full
 ; CHECK-SAME: (<2 x i64> [[ARG:%.*]]) {
@@ -198,3 +167,39 @@ entry:
   %tmp = load ptr addrspace(3), ptr addrspace(5) %alloca, align 8
   ret ptr addrspace(3) %tmp
 }
+
+; Will not vectorize because we are doing a load/store of a pointer across
+; address spaces of varying pointer sizes.
+define ptr @alloca_load_store_ptr64_full_ivec(ptr %arg) {
+; CHECK-LABEL: define ptr @alloca_load_store_ptr64_full_ivec
+; CHECK-SAME: (ptr [[ARG:%.*]]) {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[ALLOCA:%.*]] = alloca [8 x i8], align 8, addrspace(5)
+; CHECK-NEXT:    store ptr [[ARG]], ptr addrspace(5) [[ALLOCA]], align 8
+; CHECK-NEXT:    [[TMP:%.*]] = load ptr, ptr addrspace(5) [[ALLOCA]], align 8
+; CHECK-NEXT:    ret ptr [[TMP]]
+;
+entry:
+  %alloca = alloca [8 x i8], align 8, addrspace(5)
+  store ptr %arg, ptr addrspace(5) %alloca, align 8
+  %tmp = load ptr, ptr addrspace(5) %alloca, align 8
+  ret ptr %tmp
+}
+
+; Will not vectorize because we are doing a load/store of a pointer across
+; address spaces of varying pointer sizes.
+define <4 x ptr addrspace(3)> @alloca_load_store_ptr_mixed_full_ptrvec(<2 x ptr> %arg) {
+; CHECK-LABEL: define <4 x ptr addrspace(3)> @alloca_load_store_ptr_mixed_full_ptrvec
+; CHECK-SAME: (<2 x ptr> [[ARG:%.*]]) {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[ALLOCA:%.*]] = alloca [4 x i32], align 8, addrspace(5)
+; CHECK-NEXT:    store <2 x ptr> [[ARG]], ptr addrspace(5) [[ALLOCA]], align 8
+; CHECK-NEXT:    [[TMP:%.*]] = load <4 x ptr addrspace(3)>, ptr addrspace(5) [[ALLOCA]], align 8
+; CHECK-NEXT:    ret <4 x ptr addrspace(3)> [[TMP]]
+;
+entry:
+  %alloca = alloca [4 x i32], align 8, addrspace(5)
+  store <2 x ptr> %arg, ptr addrspace(5) %alloca, align 8
+  %tmp = load <4 x ptr addrspace(3)>, ptr addrspace(5) %alloca, align 8
+  ret <4 x ptr addrspace(3)> %tmp
+}
diff --git a/llvm/test/CodeGen/AMDGPU/promote-alloca-subvecs.ll b/llvm/test/CodeGen/AMDGPU/promote-alloca-subvecs.ll
index 7c5410004ed5b7..b583f33a7d9e63 100644
--- a/llvm/test/CodeGen/AMDGPU/promote-alloca-subvecs.ll
+++ b/llvm/test/CodeGen/AMDGPU/promote-alloca-subvecs.ll
@@ -218,38 +218,35 @@ entry:
   ret void
 }
 
-define void @test_different_type_subvector_ptrs(<2 x ptr addrspace(1)> %val.0, <4 x ptr addrspace(3)> %val.1) {
+define void @test_different_type_subvector_ptrs(<2 x ptr addrspace(3)> %val.0, <4 x ptr addrspace(3)> %val.1) {
 ; CHECK-LABEL: define void @test_different_type_subvector_ptrs
-; CHECK-SAME: (<2 x ptr addrspace(1)> [[VAL_0:%.*]], <4 x ptr addrspace(3)> [[VAL_1:%.*]]) {
+; CHECK-SAME: (<2 x ptr addrspace(3)> [[VAL_0:%.*]], <4 x ptr addrspace(3)> [[VAL_1:%.*]]) {
 ; CHECK-NEXT:  entry:
-; CHECK-NEXT:    [[TMP0:%.*]] = ptrtoint <2 x ptr addrspace(1)> [[VAL_0]] to <2 x i64>
-; CHECK-NEXT:    [[TMP1:%.*]] = extractelement <2 x i64> [[TMP0]], i64 0
-; CHECK-NEXT:    [[TMP2:%.*]] = insertelement <4 x i64> undef, i64 [[TMP1]], i32 0
-; CHECK-NEXT:    [[TMP3:%.*]] = extractelement <2 x i64> [[TMP0]], i64 1
-; CHECK-NEXT:    [[TMP4:%.*]] = insertelement <4 x i64> [[TMP2]], i64 [[TMP3]], i32 1
-; CHECK-NEXT:    [[TMP5:%.*]] = insertelement <2 x i64> poison, i64 [[TMP1]], i64 0
-; CHECK-NEXT:    [[TMP6:%.*]] = insertelement <2 x i64> [[TMP5]], i64 [[TMP3]], i64 1
-; CHECK-NEXT:    [[TMP7:%.*]] = inttoptr <2 x i64> [[TMP6]] to <2 x ptr addrspace(1)>
-; CHECK-NEXT:    [[DUMMYUSER:%.*]] = freeze <2 x ptr addrspace(1)> [[TMP7]]
-; CHECK-NEXT:    [[TMP8:%.*]] = ptrtoint <4 x ptr addrspace(3)> [[VAL_1]] to <4 x i32>
-; CHECK-NEXT:    [[TMP9:%.*]] = bitcast <4 x i32> [[TMP8]] to <2 x i64>
-; CHECK-NEXT:    [[TMP10:%.*]] = extractelement <2 x i64> [[TMP9]], i64 0
-; CHECK-NEXT:    [[TMP11:%.*]] = insertelement <4 x i64> [[TMP4]], i64 [[TMP10]], i32 0
-; CHECK-NEXT:    [[TMP12:%.*]] = extractelement <2 x i64> [[TMP9]], i64 1
-; CHECK-NEXT:    [[TMP13:%.*]] = insertelement <4 x i64> [[TMP11]], i64 [[TMP12]], i32 1
-; CHECK-NEXT:    [[TMP14:%.*]] = insertelement <2 x i64> poison, i64 [[TMP10]], i64 0
-; CHECK-NEXT:    [[TMP15:%.*]] = insertelement <2 x i64> [[TMP14]], i64 [[TMP12]], i64 1
-; CHECK-NEXT:    [[TMP16:%.*]] = bitcast <2 x i64> [[TMP15]] to <4 x i32>
-; CHECK-NEXT:    [[TMP17:%.*]] = inttoptr <4 x i32> [[TMP16]] to <4 x ptr addrspace(3)>
-; CHECK-NEXT:    [[DUMMYUSER_1:%.*]] = freeze <4 x ptr addrspace(3)> [[TMP17]]
-; CHECK-NEXT:    ret void
-;
+; CHECK-NEXT:    [[TMP0:%.*]] = ptrtoint <2 x ptr addrspace(3)> [[VAL_0]] to <2 x i32>
+; CHECK-NEXT:    [[TMP1:%.*]] = bitcast <2 x i32> [[TMP0]] to <1 x i64>
+; CHECK-NEXT:    [[TMP2:%.*]] = extractelement <1 x i64> [[TMP1]], i64 0
+; CHECK-NEXT:    [[TMP3:%.*]] = insertelement <4 x i64> undef, i64 [[TMP2]], i32 0
+; CHECK-NEXT:    [[TMP4:%.*]] = insertelement <1 x i64> poison, i64 [[TMP2]], i64 0
+; CHECK-NEXT:    [[TMP5:%.*]] = bitcast <1 x i64> [[TMP4]] to <2 x i32>
+; CHECK-NEXT:    [[TMP6:%.*]] = inttoptr <2 x i32> [[TMP5]] to <2 x ptr addrspace(3)>
+; CHECK-NEXT:    [[DUMMYUSER:%.*]] = freeze <2 x ptr addrspace(3)> [[TMP6]]
+; CHECK-NEXT:    [[TMP7:%.*]] = ptrtoint <4 x ptr addrspace(3)> [[VAL_1]] to <4 x i32>
+; CHECK-NEXT:    [[TMP8:%.*]] = bitcast <4 x i32> [[TMP7]] to <2 x i64>
+; CHECK-NEXT:    [[TMP9:%.*]] = extractelement <2 x i64> [[TMP8]], i64 0
+; CHECK-NEXT:    [[TMP10:%.*]] = insertelement <4 x i64> [[TMP3]], i64 [[TMP9]], i32 0
+; CHECK-NEXT:    [[TMP11:%.*]] = extractelement <2 x i64> [[TMP8]], i64 1
+; CHECK-NEXT:    [[TMP12:%.*]] = insertelement <4 x i64> [[TMP10]], i64 [[TMP11]], i32 1
+; CHECK-NEXT:    [[TMP13:%.*]] = insertelement <2 x i64> poison, i64 [[TMP9]], i64 0
+; CHECK-NEXT:    [[TMP14:%.*]] = insertelement <2 x i64> [[TMP13]], i64 [[TMP11]], i64 1
+; CHECK-NEXT:    [[TMP15:%.*]] = bitcast <2 x i64> [[TMP14]] to <4 x i32>
+; CHECK-NEXT:    [[TMP16:%.*]] = inttoptr <4 x i32> [[TMP15]] to <4 x ptr addrspace(3)>
+; CHECK-NEXT:    [[DUMMYUSER_1:%.*]] = freeze <4 x ptr addrspace(3)> [[TMP16]]
 entry:
   %stack = alloca [4 x i64], align 4, addrspace(5)
 
-  store <2 x ptr addrspace(1)> %val.0, ptr addrspace(5) %stack
-  %reload = load <2 x ptr addrspace(1)>, ptr addrspace(5) %stack
-  %dummyuser = freeze <2 x ptr addrspace(1)> %reload
+  store <2 x ptr addrspace(3)> %val.0, ptr addrspace(5) %stack
+  %reload = load <2 x ptr addrspace(3)>, ptr addrspace(5) %stack
+  %dummyuser = freeze <2 x ptr addrspace(3)> %reload
 
   store <4 x ptr addrspace(3)> %val.1, ptr addrspace(5) %stack
   %reload.1 = load <4 x ptr addrspace(3)>, ptr addrspace(5) %stack

Contributor

@shiltian shiltian left a comment


Casting the pointer across these address spaces is undefined behavior.

Are you sure?

I don't think this PR is right. It probably just treats the symptoms instead of the root cause of the assertion.

store <2 x ptr> %arg, ptr addrspace(5) %alloca, align 8
%tmp = load <4 x ptr addrspace(3)>, ptr addrspace(5) %alloca, align 8
ret <4 x ptr addrspace(3)> %tmp
}
Contributor


Should reproduce the same situation with a scalar as well, and some with int/fp types.

Contributor Author


Sure

Contributor Author


Can you please be specific here about what you mean by scalar?

Contributor


i32 and double

@github-actions

github-actions bot commented Dec 20, 2024

⚠️ undef deprecator found issues in your code. ⚠️

You can test this locally with the following command:
git diff -U0 --pickaxe-regex -S '([^a-zA-Z0-9#_-]undef[^a-zA-Z0-9_-]|UndefValue::get)' 988480323d5ef9bb658f13ac598d4ce2aa23c782 d52e0a7229f9ce471b0c15b3fe19772da2272fcd llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp llvm/test/CodeGen/AMDGPU/promote-alloca-loadstores.ll

The following files introduce new uses of undef:

  • llvm/test/CodeGen/AMDGPU/promote-alloca-loadstores.ll

Undef is now deprecated and should only be used in the rare cases where no replacement is possible. For example, a load of uninitialized memory yields undef. You should use poison values for placeholders instead.

In tests, avoid using undef and having tests that trigger undefined behavior. If you need an operand with some unimportant value, you can add a new argument to the function and use that instead.

For example, this is considered a bad practice:

define void @fn() {
  ...
  br i1 undef, ...
}

Please use the following instead:

define void @fn(i1 %cond) {
  ...
  br i1 %cond, ...
}

Please refer to the Undefined Behavior Manual for more information.

Comment on lines +680 to +685
bool canCopyAcrossAddressSpaces = true;
if (AccessTy->isPtrOrPtrVectorTy() && VecTy->isPtrOrPtrVectorTy()) {
if (DL.getPointerSize(AccessTy->getPointerAddressSpace()) !=
DL.getPointerSize(VecTy->getPointerAddressSpace()))
canCopyAcrossAddressSpaces = false;
}
Contributor


There is no copy across address spaces here; this check is conceptually wrong. You only need to verify the size is compatible. For the final code emission, you'll need to insert no-op casts to get the types to match.
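
For reference, the CHECK lines deleted from promote-alloca-loadstores.ll above show the size-preserving cast chain the pass already emits when the total bit widths line up; a sketch of that shape (value names are illustrative):

  %ints = ptrtoint <2 x ptr> %arg to <2 x i64>             ; pointer bits to integers
  %bits = bitcast <2 x i64> %ints to <4 x i32>             ; no-op reinterpretation of the same 128 bits
  %ptrs = inttoptr <4 x i32> %bits to <4 x ptr addrspace(3)> ; integers back to 32-bit pointers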

Contributor


I think the existing code here is correct; you would only need an adjustment in the code transformation later.

Contributor Author


Let me explore the no-op casts.
So a <2 x ptr addrspace(5)> can be stored into a ptr addrspace(0)? Isn't this undefined behavior?

Contributor


It may be illegal type punning, but you don't have the context. It's just bytes.

Contributor Author


Thanks Matt. I will try a few ways to avoid this casting assert.
Another quick question: at which point in the optimization pipeline do we decide this is illegal type punning leading to undefined behavior?

Contributor


It's not a property of the pass pipeline; it's a property of the system as a whole. The address space cast from addrspace(5) to 0 is not a no-op cast. If you reload it and use it as the wrong type, it will be an invalid pointer.
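
To make that failure mode concrete, a small sketch assuming the AMDGPU data layout described above (64-bit pointers in address space 0, 32-bit in address space 5):

  define ptr addrspace(5) @truncating_pun(ptr %flat) {
  entry:
    %alloca = alloca i64, align 8, addrspace(5)
    ; Store all 8 bytes of a flat pointer.
    store ptr %flat, ptr addrspace(5) %alloca, align 8
    ; Reload only 4 of those bytes as a private pointer.
    %p5 = load ptr addrspace(5), ptr addrspace(5) %alloca, align 8
    ; The bytes themselves are fine; using %p5 as a private pointer is not,
    ; since it holds only half of a flat pointer's bit pattern.
    ret ptr addrspace(5) %p5
  }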

define void @alloca_load_store_ptr_mixed_addrspace_ptrvec() {
entry:
%A2 = alloca <2 x ptr addrspace(5)>, align 8, addrspace(5)
store ptr undef, ptr addrspace(5) %A2, align 8
Contributor


Should also test with a real value instead of an undef. A test with a constant leaf would be useful too.

Contributor

@arsenm arsenm left a comment


The check is not correct, and the test isn't using the correct alloca address space.

@sgundapa sgundapa closed this Oct 30, 2025
@sgundapa sgundapa deleted the cast_assert branch October 30, 2025 14:24
