AMDGPU: Fix creating illegally typed readfirstlane in atomic optimizer #128388
Conversation
We need to promote 8/16-bit cases to 32-bit. Unfortunately we are missing demanded bits optimizations on readfirstlane, so we end up emitting an and instruction on the input.
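For reference, the broadcast sequence produced for a sub-32-bit type now has the following shape. This is a minimal sketch only: the function wrapper and value names are illustrative, and the pass itself emits numbered temporaries as in the tests below.

```llvm
declare i32 @llvm.amdgcn.readfirstlane.i32(i32)

; Sketch of the legally typed broadcast for an i8 atomic result, mirroring the
; uniform_or_i8 test added in this patch. %phi stands in for the phi of the
; atomicrmw result; here it is just an argument for illustration.
define i8 @broadcast_i8_sketch(i8 %phi) {
  %wide  = zext i8 %phi to i32                                 ; promote to a 32-bit value
  %bcast = call i32 @llvm.amdgcn.readfirstlane.i32(i32 %wide)  ; legally typed readfirstlane
  %res   = trunc i32 %bcast to i8                              ; narrow back to the original type
  ret i8 %res
}
```

The zext is where the redundant mask comes from: after the trunc only the low 8 bits are demanded, but without a demanded-bits fold on readfirstlane the backend still masks the input (the v_and_b32 with 0xff immediately before v_readfirstlane_b32 in the codegen checks below).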
@llvm/pr-subscribers-backend-amdgpu

Author: Matt Arsenault (arsenm)

Changes

We need to promote 8/16-bit cases to 32-bit. Unfortunately we are missing demanded bits optimizations on readfirstlane, so we end up emitting an and instruction on the input.

Patch is 196.33 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/128388.diff

3 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp b/llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
index 02f5ce2d18ff6..e46d0587e7943 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
@@ -898,8 +898,15 @@ void AMDGPUAtomicOptimizerImpl::optimizeAtomic(Instruction &I,
// We need to broadcast the value who was the lowest active lane (the first
// lane) to all other lanes in the wavefront.
- Value *BroadcastI = nullptr;
- BroadcastI = B.CreateIntrinsic(Ty, Intrinsic::amdgcn_readfirstlane, PHI);
+
+ Value *ReadlaneVal = PHI;
+ if (TyBitWidth < 32)
+ ReadlaneVal = B.CreateZExt(PHI, B.getInt32Ty());
+
+ Value *BroadcastI = B.CreateIntrinsic(
+ ReadlaneVal->getType(), Intrinsic::amdgcn_readfirstlane, ReadlaneVal);
+ if (TyBitWidth < 32)
+ BroadcastI = B.CreateTrunc(BroadcastI, Ty);
// Now that we have the result of our single atomic operation, we need to
// get our individual lane's slice into the result. We use the lane offset
diff --git a/llvm/test/CodeGen/AMDGPU/atomic-optimizer-promote-i8.ll b/llvm/test/CodeGen/AMDGPU/atomic-optimizer-promote-i8.ll
new file mode 100644
index 0000000000000..d3e591634503f
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/atomic-optimizer-promote-i8.ll
@@ -0,0 +1,176 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+; RUN: opt -S -mtriple=amdgcn-amd-amdhsa -mcpu=gfx908 -passes=amdgpu-atomic-optimizer %s | FileCheck %s
+
+define amdgpu_kernel void @uniform_or_i8(ptr addrspace(1) %result, ptr addrspace(1) %uniform.ptr, i8 %val) {
+; CHECK-LABEL: define amdgpu_kernel void @uniform_or_i8(
+; CHECK-SAME: ptr addrspace(1) [[RESULT:%.*]], ptr addrspace(1) [[UNIFORM_PTR:%.*]], i8 [[VAL:%.*]]) #[[ATTR0:[0-9]+]] {
+; CHECK-NEXT: [[TMP1:%.*]] = call i64 @llvm.amdgcn.ballot.i64(i1 true)
+; CHECK-NEXT: [[TMP2:%.*]] = trunc i64 [[TMP1]] to i32
+; CHECK-NEXT: [[TMP3:%.*]] = lshr i64 [[TMP1]], 32
+; CHECK-NEXT: [[TMP4:%.*]] = trunc i64 [[TMP3]] to i32
+; CHECK-NEXT: [[TMP5:%.*]] = call i32 @llvm.amdgcn.mbcnt.lo(i32 [[TMP2]], i32 0)
+; CHECK-NEXT: [[TMP6:%.*]] = call i32 @llvm.amdgcn.mbcnt.hi(i32 [[TMP4]], i32 [[TMP5]])
+; CHECK-NEXT: [[TMP7:%.*]] = icmp eq i32 [[TMP6]], 0
+; CHECK-NEXT: br i1 [[TMP7]], label %[[BB8:.*]], label %[[BB10:.*]]
+; CHECK: [[BB8]]:
+; CHECK-NEXT: [[TMP9:%.*]] = atomicrmw or ptr addrspace(1) [[UNIFORM_PTR]], i8 [[VAL]] monotonic, align 1
+; CHECK-NEXT: br label %[[BB10]]
+; CHECK: [[BB10]]:
+; CHECK-NEXT: [[TMP11:%.*]] = phi i8 [ poison, [[TMP0:%.*]] ], [ [[TMP9]], %[[BB8]] ]
+; CHECK-NEXT: [[TMP16:%.*]] = zext i8 [[TMP11]] to i32
+; CHECK-NEXT: [[TMP17:%.*]] = call i32 @llvm.amdgcn.readfirstlane.i32(i32 [[TMP16]])
+; CHECK-NEXT: [[TMP12:%.*]] = trunc i32 [[TMP17]] to i8
+; CHECK-NEXT: [[TMP13:%.*]] = trunc i32 [[TMP6]] to i8
+; CHECK-NEXT: [[TMP14:%.*]] = select i1 [[TMP7]], i8 0, i8 [[VAL]]
+; CHECK-NEXT: [[TMP15:%.*]] = or i8 [[TMP12]], [[TMP14]]
+; CHECK-NEXT: store i8 [[TMP15]], ptr addrspace(1) [[RESULT]], align 1
+; CHECK-NEXT: ret void
+;
+ %rmw = atomicrmw or ptr addrspace(1) %uniform.ptr, i8 %val monotonic, align 1
+ store i8 %rmw, ptr addrspace(1) %result
+ ret void
+}
+
+define amdgpu_kernel void @uniform_add_i8(ptr addrspace(1) %result, ptr addrspace(1) %uniform.ptr, i8 %val) {
+; CHECK-LABEL: define amdgpu_kernel void @uniform_add_i8(
+; CHECK-SAME: ptr addrspace(1) [[RESULT:%.*]], ptr addrspace(1) [[UNIFORM_PTR:%.*]], i8 [[VAL:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[TMP1:%.*]] = call i64 @llvm.amdgcn.ballot.i64(i1 true)
+; CHECK-NEXT: [[TMP2:%.*]] = trunc i64 [[TMP1]] to i32
+; CHECK-NEXT: [[TMP3:%.*]] = lshr i64 [[TMP1]], 32
+; CHECK-NEXT: [[TMP4:%.*]] = trunc i64 [[TMP3]] to i32
+; CHECK-NEXT: [[TMP5:%.*]] = call i32 @llvm.amdgcn.mbcnt.lo(i32 [[TMP2]], i32 0)
+; CHECK-NEXT: [[TMP6:%.*]] = call i32 @llvm.amdgcn.mbcnt.hi(i32 [[TMP4]], i32 [[TMP5]])
+; CHECK-NEXT: [[TMP7:%.*]] = call i64 @llvm.ctpop.i64(i64 [[TMP1]])
+; CHECK-NEXT: [[TMP8:%.*]] = trunc i64 [[TMP7]] to i8
+; CHECK-NEXT: [[TMP9:%.*]] = mul i8 [[VAL]], [[TMP8]]
+; CHECK-NEXT: [[TMP10:%.*]] = icmp eq i32 [[TMP6]], 0
+; CHECK-NEXT: br i1 [[TMP10]], label %[[BB11:.*]], label %[[BB13:.*]]
+; CHECK: [[BB11]]:
+; CHECK-NEXT: [[TMP12:%.*]] = atomicrmw add ptr addrspace(1) [[UNIFORM_PTR]], i8 [[TMP9]] monotonic, align 1
+; CHECK-NEXT: br label %[[BB13]]
+; CHECK: [[BB13]]:
+; CHECK-NEXT: [[TMP14:%.*]] = phi i8 [ poison, [[TMP0:%.*]] ], [ [[TMP12]], %[[BB11]] ]
+; CHECK-NEXT: [[TMP19:%.*]] = zext i8 [[TMP14]] to i32
+; CHECK-NEXT: [[TMP20:%.*]] = call i32 @llvm.amdgcn.readfirstlane.i32(i32 [[TMP19]])
+; CHECK-NEXT: [[TMP15:%.*]] = trunc i32 [[TMP20]] to i8
+; CHECK-NEXT: [[TMP16:%.*]] = trunc i32 [[TMP6]] to i8
+; CHECK-NEXT: [[TMP17:%.*]] = mul i8 [[VAL]], [[TMP16]]
+; CHECK-NEXT: [[TMP18:%.*]] = add i8 [[TMP15]], [[TMP17]]
+; CHECK-NEXT: store i8 [[TMP18]], ptr addrspace(1) [[RESULT]], align 1
+; CHECK-NEXT: ret void
+;
+ %rmw = atomicrmw add ptr addrspace(1) %uniform.ptr, i8 %val monotonic, align 1
+ store i8 %rmw, ptr addrspace(1) %result
+ ret void
+}
+
+define amdgpu_kernel void @uniform_xchg_i8(ptr addrspace(1) %result, ptr addrspace(1) %uniform.ptr, i8 %val) {
+; CHECK-LABEL: define amdgpu_kernel void @uniform_xchg_i8(
+; CHECK-SAME: ptr addrspace(1) [[RESULT:%.*]], ptr addrspace(1) [[UNIFORM_PTR:%.*]], i8 [[VAL:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[RMW:%.*]] = atomicrmw xchg ptr addrspace(1) [[UNIFORM_PTR]], i8 [[VAL]] monotonic, align 1
+; CHECK-NEXT: store i8 [[RMW]], ptr addrspace(1) [[RESULT]], align 1
+; CHECK-NEXT: ret void
+;
+ %rmw = atomicrmw xchg ptr addrspace(1) %uniform.ptr, i8 %val monotonic, align 1
+ store i8 %rmw, ptr addrspace(1) %result
+ ret void
+}
+
+define amdgpu_kernel void @uniform_or_i16(ptr addrspace(1) %result, ptr addrspace(1) %uniform.ptr, i16 %val) {
+; CHECK-LABEL: define amdgpu_kernel void @uniform_or_i16(
+; CHECK-SAME: ptr addrspace(1) [[RESULT:%.*]], ptr addrspace(1) [[UNIFORM_PTR:%.*]], i16 [[VAL:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[TMP1:%.*]] = call i64 @llvm.amdgcn.ballot.i64(i1 true)
+; CHECK-NEXT: [[TMP2:%.*]] = trunc i64 [[TMP1]] to i32
+; CHECK-NEXT: [[TMP3:%.*]] = lshr i64 [[TMP1]], 32
+; CHECK-NEXT: [[TMP4:%.*]] = trunc i64 [[TMP3]] to i32
+; CHECK-NEXT: [[TMP5:%.*]] = call i32 @llvm.amdgcn.mbcnt.lo(i32 [[TMP2]], i32 0)
+; CHECK-NEXT: [[TMP6:%.*]] = call i32 @llvm.amdgcn.mbcnt.hi(i32 [[TMP4]], i32 [[TMP5]])
+; CHECK-NEXT: [[TMP7:%.*]] = icmp eq i32 [[TMP6]], 0
+; CHECK-NEXT: br i1 [[TMP7]], label %[[BB8:.*]], label %[[BB10:.*]]
+; CHECK: [[BB8]]:
+; CHECK-NEXT: [[TMP9:%.*]] = atomicrmw or ptr addrspace(1) [[UNIFORM_PTR]], i16 [[VAL]] monotonic, align 2
+; CHECK-NEXT: br label %[[BB10]]
+; CHECK: [[BB10]]:
+; CHECK-NEXT: [[TMP11:%.*]] = phi i16 [ poison, [[TMP0:%.*]] ], [ [[TMP9]], %[[BB8]] ]
+; CHECK-NEXT: [[TMP16:%.*]] = zext i16 [[TMP11]] to i32
+; CHECK-NEXT: [[TMP17:%.*]] = call i32 @llvm.amdgcn.readfirstlane.i32(i32 [[TMP16]])
+; CHECK-NEXT: [[TMP12:%.*]] = trunc i32 [[TMP17]] to i16
+; CHECK-NEXT: [[TMP13:%.*]] = trunc i32 [[TMP6]] to i16
+; CHECK-NEXT: [[TMP14:%.*]] = select i1 [[TMP7]], i16 0, i16 [[VAL]]
+; CHECK-NEXT: [[TMP15:%.*]] = or i16 [[TMP12]], [[TMP14]]
+; CHECK-NEXT: store i16 [[TMP15]], ptr addrspace(1) [[RESULT]], align 2
+; CHECK-NEXT: ret void
+;
+ %rmw = atomicrmw or ptr addrspace(1) %uniform.ptr, i16 %val monotonic, align 2
+ store i16 %rmw, ptr addrspace(1) %result
+ ret void
+}
+
+define amdgpu_kernel void @uniform_add_i16(ptr addrspace(1) %result, ptr addrspace(1) %uniform.ptr, i16 %val) {
+; CHECK-LABEL: define amdgpu_kernel void @uniform_add_i16(
+; CHECK-SAME: ptr addrspace(1) [[RESULT:%.*]], ptr addrspace(1) [[UNIFORM_PTR:%.*]], i16 [[VAL:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[TMP1:%.*]] = call i64 @llvm.amdgcn.ballot.i64(i1 true)
+; CHECK-NEXT: [[TMP2:%.*]] = trunc i64 [[TMP1]] to i32
+; CHECK-NEXT: [[TMP3:%.*]] = lshr i64 [[TMP1]], 32
+; CHECK-NEXT: [[TMP4:%.*]] = trunc i64 [[TMP3]] to i32
+; CHECK-NEXT: [[TMP5:%.*]] = call i32 @llvm.amdgcn.mbcnt.lo(i32 [[TMP2]], i32 0)
+; CHECK-NEXT: [[TMP6:%.*]] = call i32 @llvm.amdgcn.mbcnt.hi(i32 [[TMP4]], i32 [[TMP5]])
+; CHECK-NEXT: [[TMP7:%.*]] = call i64 @llvm.ctpop.i64(i64 [[TMP1]])
+; CHECK-NEXT: [[TMP8:%.*]] = trunc i64 [[TMP7]] to i16
+; CHECK-NEXT: [[TMP9:%.*]] = mul i16 [[VAL]], [[TMP8]]
+; CHECK-NEXT: [[TMP10:%.*]] = icmp eq i32 [[TMP6]], 0
+; CHECK-NEXT: br i1 [[TMP10]], label %[[BB11:.*]], label %[[BB13:.*]]
+; CHECK: [[BB11]]:
+; CHECK-NEXT: [[TMP12:%.*]] = atomicrmw add ptr addrspace(1) [[UNIFORM_PTR]], i16 [[TMP9]] monotonic, align 2
+; CHECK-NEXT: br label %[[BB13]]
+; CHECK: [[BB13]]:
+; CHECK-NEXT: [[TMP14:%.*]] = phi i16 [ poison, [[TMP0:%.*]] ], [ [[TMP12]], %[[BB11]] ]
+; CHECK-NEXT: [[TMP19:%.*]] = zext i16 [[TMP14]] to i32
+; CHECK-NEXT: [[TMP20:%.*]] = call i32 @llvm.amdgcn.readfirstlane.i32(i32 [[TMP19]])
+; CHECK-NEXT: [[TMP15:%.*]] = trunc i32 [[TMP20]] to i16
+; CHECK-NEXT: [[TMP16:%.*]] = trunc i32 [[TMP6]] to i16
+; CHECK-NEXT: [[TMP17:%.*]] = mul i16 [[VAL]], [[TMP16]]
+; CHECK-NEXT: [[TMP18:%.*]] = add i16 [[TMP15]], [[TMP17]]
+; CHECK-NEXT: store i16 [[TMP18]], ptr addrspace(1) [[RESULT]], align 2
+; CHECK-NEXT: ret void
+;
+ %rmw = atomicrmw add ptr addrspace(1) %uniform.ptr, i16 %val monotonic, align 2
+ store i16 %rmw, ptr addrspace(1) %result
+ ret void
+}
+
+define amdgpu_kernel void @uniform_xchg_i16(ptr addrspace(1) %result, ptr addrspace(1) %uniform.ptr, i16 %val) {
+; CHECK-LABEL: define amdgpu_kernel void @uniform_xchg_i16(
+; CHECK-SAME: ptr addrspace(1) [[RESULT:%.*]], ptr addrspace(1) [[UNIFORM_PTR:%.*]], i16 [[VAL:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[RMW:%.*]] = atomicrmw xchg ptr addrspace(1) [[UNIFORM_PTR]], i16 [[VAL]] monotonic, align 2
+; CHECK-NEXT: store i16 [[RMW]], ptr addrspace(1) [[RESULT]], align 2
+; CHECK-NEXT: ret void
+;
+ %rmw = atomicrmw xchg ptr addrspace(1) %uniform.ptr, i16 %val monotonic, align 2
+ store i16 %rmw, ptr addrspace(1) %result
+ ret void
+}
+
+define amdgpu_kernel void @uniform_fadd_f16(ptr addrspace(1) %result, ptr addrspace(1) %uniform.ptr, half %val) {
+; CHECK-LABEL: define amdgpu_kernel void @uniform_fadd_f16(
+; CHECK-SAME: ptr addrspace(1) [[RESULT:%.*]], ptr addrspace(1) [[UNIFORM_PTR:%.*]], half [[VAL:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[RMW:%.*]] = atomicrmw fadd ptr addrspace(1) [[UNIFORM_PTR]], half [[VAL]] monotonic, align 2
+; CHECK-NEXT: store half [[RMW]], ptr addrspace(1) [[RESULT]], align 2
+; CHECK-NEXT: ret void
+;
+ %rmw = atomicrmw fadd ptr addrspace(1) %uniform.ptr, half %val monotonic, align 2
+ store half %rmw, ptr addrspace(1) %result
+ ret void
+}
+
+define amdgpu_kernel void @uniform_fadd_bf16(ptr addrspace(1) %result, ptr addrspace(1) %uniform.ptr, bfloat %val) {
+; CHECK-LABEL: define amdgpu_kernel void @uniform_fadd_bf16(
+; CHECK-SAME: ptr addrspace(1) [[RESULT:%.*]], ptr addrspace(1) [[UNIFORM_PTR:%.*]], bfloat [[VAL:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[RMW:%.*]] = atomicrmw fadd ptr addrspace(1) [[UNIFORM_PTR]], bfloat [[VAL]] monotonic, align 2
+; CHECK-NEXT: store bfloat [[RMW]], ptr addrspace(1) [[RESULT]], align 2
+; CHECK-NEXT: ret void
+;
+ %rmw = atomicrmw fadd ptr addrspace(1) %uniform.ptr, bfloat %val monotonic, align 2
+ store bfloat %rmw, ptr addrspace(1) %result
+ ret void
+}
diff --git a/llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll b/llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll
index bc89a186db010..3737cc414c58f 100644
--- a/llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll
+++ b/llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll
@@ -7149,3 +7149,4244 @@ entry:
store i64 %old, ptr addrspace(1) %out
ret void
}
+
+define amdgpu_kernel void @uniform_or_i8(ptr addrspace(1) %result, ptr addrspace(1) %uniform.ptr, i8 %val) {
+; GFX7LESS-LABEL: uniform_or_i8:
+; GFX7LESS: ; %bb.0:
+; GFX7LESS-NEXT: s_load_dwordx4 s[0:3], s[4:5], 0x9
+; GFX7LESS-NEXT: s_load_dword s6, s[4:5], 0xd
+; GFX7LESS-NEXT: v_mbcnt_lo_u32_b32_e64 v0, exec_lo, 0
+; GFX7LESS-NEXT: v_mbcnt_hi_u32_b32_e32 v0, exec_hi, v0
+; GFX7LESS-NEXT: v_cmp_eq_u32_e32 vcc, 0, v0
+; GFX7LESS-NEXT: ; implicit-def: $vgpr0
+; GFX7LESS-NEXT: s_and_saveexec_b64 s[4:5], vcc
+; GFX7LESS-NEXT: s_cbranch_execz .LBB12_2
+; GFX7LESS-NEXT: ; %bb.1:
+; GFX7LESS-NEXT: s_waitcnt lgkmcnt(0)
+; GFX7LESS-NEXT: s_and_b32 s8, s2, -4
+; GFX7LESS-NEXT: s_mov_b32 s11, 0xf000
+; GFX7LESS-NEXT: s_and_b32 s2, s2, 3
+; GFX7LESS-NEXT: s_lshl_b32 s2, s2, 3
+; GFX7LESS-NEXT: s_and_b32 s7, s6, 0xff
+; GFX7LESS-NEXT: s_lshl_b32 s7, s7, s2
+; GFX7LESS-NEXT: s_mov_b32 s10, -1
+; GFX7LESS-NEXT: s_mov_b32 s9, s3
+; GFX7LESS-NEXT: v_mov_b32_e32 v0, s7
+; GFX7LESS-NEXT: buffer_atomic_or v0, off, s[8:11], 0 glc
+; GFX7LESS-NEXT: s_waitcnt vmcnt(0) expcnt(0)
+; GFX7LESS-NEXT: v_lshrrev_b32_e32 v0, s2, v0
+; GFX7LESS-NEXT: .LBB12_2:
+; GFX7LESS-NEXT: s_or_b64 exec, exec, s[4:5]
+; GFX7LESS-NEXT: s_waitcnt lgkmcnt(0)
+; GFX7LESS-NEXT: s_mov_b32 s3, 0xf000
+; GFX7LESS-NEXT: s_mov_b32 s2, -1
+; GFX7LESS-NEXT: v_and_b32_e32 v0, 0xff, v0
+; GFX7LESS-NEXT: v_mov_b32_e32 v1, s6
+; GFX7LESS-NEXT: v_readfirstlane_b32 s4, v0
+; GFX7LESS-NEXT: v_cndmask_b32_e64 v0, v1, 0, vcc
+; GFX7LESS-NEXT: v_or_b32_e32 v0, s4, v0
+; GFX7LESS-NEXT: buffer_store_byte v0, off, s[0:3], 0
+; GFX7LESS-NEXT: s_endpgm
+;
+; GFX8-LABEL: uniform_or_i8:
+; GFX8: ; %bb.0:
+; GFX8-NEXT: s_load_dwordx4 s[0:3], s[4:5], 0x24
+; GFX8-NEXT: s_load_dword s6, s[4:5], 0x34
+; GFX8-NEXT: v_mbcnt_lo_u32_b32 v0, exec_lo, 0
+; GFX8-NEXT: v_mbcnt_hi_u32_b32 v0, exec_hi, v0
+; GFX8-NEXT: v_cmp_eq_u32_e32 vcc, 0, v0
+; GFX8-NEXT: ; implicit-def: $vgpr0
+; GFX8-NEXT: s_and_saveexec_b64 s[4:5], vcc
+; GFX8-NEXT: s_cbranch_execz .LBB12_2
+; GFX8-NEXT: ; %bb.1:
+; GFX8-NEXT: s_waitcnt lgkmcnt(0)
+; GFX8-NEXT: s_and_b32 s8, s2, -4
+; GFX8-NEXT: s_and_b32 s2, s2, 3
+; GFX8-NEXT: s_mov_b32 s9, s3
+; GFX8-NEXT: s_lshl_b32 s2, s2, 3
+; GFX8-NEXT: s_and_b32 s3, s6, 0xff
+; GFX8-NEXT: s_lshl_b32 s3, s3, s2
+; GFX8-NEXT: s_mov_b32 s11, 0xf000
+; GFX8-NEXT: s_mov_b32 s10, -1
+; GFX8-NEXT: v_mov_b32_e32 v0, s3
+; GFX8-NEXT: buffer_atomic_or v0, off, s[8:11], 0 glc
+; GFX8-NEXT: s_waitcnt vmcnt(0)
+; GFX8-NEXT: v_lshrrev_b32_e32 v0, s2, v0
+; GFX8-NEXT: .LBB12_2:
+; GFX8-NEXT: s_or_b64 exec, exec, s[4:5]
+; GFX8-NEXT: v_and_b32_e32 v0, 0xff, v0
+; GFX8-NEXT: v_readfirstlane_b32 s4, v0
+; GFX8-NEXT: s_waitcnt lgkmcnt(0)
+; GFX8-NEXT: v_mov_b32_e32 v0, s6
+; GFX8-NEXT: v_cndmask_b32_e64 v0, v0, 0, vcc
+; GFX8-NEXT: s_mov_b32 s3, 0xf000
+; GFX8-NEXT: s_mov_b32 s2, -1
+; GFX8-NEXT: v_or_b32_e32 v0, s4, v0
+; GFX8-NEXT: buffer_store_byte v0, off, s[0:3], 0
+; GFX8-NEXT: s_endpgm
+;
+; GFX9-LABEL: uniform_or_i8:
+; GFX9: ; %bb.0:
+; GFX9-NEXT: s_load_dwordx4 s[0:3], s[4:5], 0x24
+; GFX9-NEXT: s_load_dword s6, s[4:5], 0x34
+; GFX9-NEXT: v_mbcnt_lo_u32_b32 v0, exec_lo, 0
+; GFX9-NEXT: v_mbcnt_hi_u32_b32 v0, exec_hi, v0
+; GFX9-NEXT: v_cmp_eq_u32_e32 vcc, 0, v0
+; GFX9-NEXT: ; implicit-def: $vgpr0
+; GFX9-NEXT: s_and_saveexec_b64 s[4:5], vcc
+; GFX9-NEXT: s_cbranch_execz .LBB12_2
+; GFX9-NEXT: ; %bb.1:
+; GFX9-NEXT: s_waitcnt lgkmcnt(0)
+; GFX9-NEXT: s_and_b32 s8, s2, -4
+; GFX9-NEXT: s_and_b32 s2, s2, 3
+; GFX9-NEXT: s_mov_b32 s9, s3
+; GFX9-NEXT: s_lshl_b32 s2, s2, 3
+; GFX9-NEXT: s_and_b32 s3, s6, 0xff
+; GFX9-NEXT: s_lshl_b32 s3, s3, s2
+; GFX9-NEXT: s_mov_b32 s11, 0xf000
+; GFX9-NEXT: s_mov_b32 s10, -1
+; GFX9-NEXT: v_mov_b32_e32 v0, s3
+; GFX9-NEXT: buffer_atomic_or v0, off, s[8:11], 0 glc
+; GFX9-NEXT: s_waitcnt vmcnt(0)
+; GFX9-NEXT: v_lshrrev_b32_e32 v0, s2, v0
+; GFX9-NEXT: .LBB12_2:
+; GFX9-NEXT: s_or_b64 exec, exec, s[4:5]
+; GFX9-NEXT: v_and_b32_e32 v0, 0xff, v0
+; GFX9-NEXT: v_readfirstlane_b32 s4, v0
+; GFX9-NEXT: s_waitcnt lgkmcnt(0)
+; GFX9-NEXT: v_mov_b32_e32 v0, s6
+; GFX9-NEXT: v_cndmask_b32_e64 v0, v0, 0, vcc
+; GFX9-NEXT: s_mov_b32 s3, 0xf000
+; GFX9-NEXT: s_mov_b32 s2, -1
+; GFX9-NEXT: v_or_b32_e32 v0, s4, v0
+; GFX9-NEXT: buffer_store_byte v0, off, s[0:3], 0
+; GFX9-NEXT: s_endpgm
+;
+; GFX1064-LABEL: uniform_or_i8:
+; GFX1064: ; %bb.0:
+; GFX1064-NEXT: s_clause 0x1
+; GFX1064-NEXT: s_load_dwordx4 s[0:3], s[4:5], 0x24
+; GFX1064-NEXT: s_load_dword s6, s[4:5], 0x34
+; GFX1064-NEXT: v_mbcnt_lo_u32_b32 v0, exec_lo, 0
+; GFX1064-NEXT: v_mbcnt_hi_u32_b32 v0, exec_hi, v0
+; GFX1064-NEXT: v_cmp_eq_u32_e32 vcc, 0, v0
+; GFX1064-NEXT: ; implicit-def: $vgpr0
+; GFX1064-NEXT: s_and_saveexec_b64 s[4:5], vcc
+; GFX1064-NEXT: s_cbranch_execz .LBB12_2
+; GFX1064-NEXT: ; %bb.1:
+; GFX1064-NEXT: s_waitcnt lgkmcnt(0)
+; GFX1064-NEXT: s_and_b32 s7, s2, 3
+; GFX1064-NEXT: s_and_b32 s8, s6, 0xff
+; GFX1064-NEXT: s_lshl_b32 s7, s7, 3
+; GFX1064-NEXT: s_mov_b32 s11, 0x31016000
+; GFX1064-NEXT: s_lshl_b32 s9, s8, s7
+; GFX1064-NEXT: s_and_b32 s8, s2, -4
+; GFX1064-NEXT: v_mov_b32_e32 v0, s9
+; GFX1064-NEXT: s_mov_b32 s10, -1
+; GFX1064-NEXT: s_mov_b32 s9, s3
+; GFX1064-NEXT: buffer_atomic_or v0, off, s[8:11], 0 glc
+; GFX1064-NEXT: s_waitcnt vmcnt(0)
+; GFX1064-NEXT: v_lshrrev_b32_e32 v0, s7, v0
+; GFX1064-NEXT: .LBB12_2:
+; GFX1064-NEXT: s_or_b64 exec, exec, s[4:5]
+; GFX1064-NEXT: v_and_b32_e32 v0, 0xff, v0
+; GFX1064-NEXT: s_waitcnt lgkmcnt(0)
+; GFX1064-NEXT: s_mov_b32 s3, 0x31016000
+; GFX1064-NEXT: v_readfirstlane_b32 s2, v0
+; GFX1064-NEXT: v_cndmask_b32_e64 v0, s6, 0, vcc
+; GFX1064-NEXT: v_or_b32_e32 v0, s2, v0
+; GFX1064-NEXT: s_mov_b32 s2, -1
+; GFX1064-NEXT: buffer_store_byte v0, off, s[0:3], 0
+; GFX1064-NEXT: s_endpgm
+;
+; GFX1032-LABEL: uniform_or_i8:
+; GFX1032: ; %bb.0:
+; GFX1032-NEXT: s_clause 0x1
+; GFX1032-NEXT: s_load_dwordx4 s[0:3], s[4:5], 0x24
+; GFX1032-NEXT: s_load_dword s6, s[4:5], 0x34
+; GFX1032-NEXT: v_mbcnt_lo_u32_b32 v0, exec_lo, 0
+; GFX1032-NEXT: v_cmp_eq_u32_e32 vcc_lo, 0, v0
+; GFX1032-NEXT: ; implicit-def: $vgpr0
+; GFX1032-NEXT: s_and_saveexec_b32 s4, vcc_lo
+; GFX1032-NEXT: s_cbranch_execz .LBB12_2
+; GFX1032-NEXT: ; %bb.1:
+; GFX1032-NEXT: s_waitcnt lgkmcnt(0)
+; GFX1032-NEXT: s_and_b32 s5, s2, 3
+; GFX1032-NEXT: s_and_b32 s7, s6, 0xff
+; GFX1032-NEXT: s_lshl_b32 s5, s5, 3
+; GFX1032-NEXT: s_and_b32 s8, s2, -4
+; GFX1032-NEXT: s_lshl_b32 s7, s7, s5
+; GFX1032-NEXT: s_mov_b32 s11, 0x31016000
+; GFX1032-NEXT: v_mov_b32_e32 v0, s7
+; GFX1032-NEXT: s_mov_b32 s10, -1
+; GFX1032-NEXT: s_mov_b32 s9, s3
+; GFX1032-NEXT: buffer_atomic_or v0, off, s[8:11], 0 glc
+; GFX1032-NEXT: s_waitcnt vmcnt(0)
+; GFX1032-NEXT: v_lshrrev_b32_e32 v0, s5, v0
+; GFX1032-NEXT: .LBB12_2:
+; GFX1032-NEXT: s_or_b32 exec_lo, exec_lo, s4
+; GFX1032-NEXT: v_and_b32_e32 v0, 0xff, v0
+; GFX1032-NEXT: s_waitcnt lgkmcnt(0)
+; GFX1032-NEXT: s_mov_b32 s3, 0x31016000
+; GFX1032-NEXT: v_readfirstlane_b32 s2, v0
+; GFX1032-NEXT: v_cndmask_b32_e64 v0, s6, 0, vcc_lo
+; GFX1032-NEXT: v_or_b32_e32 v0, s2, v0
+; GFX1032-NEXT: s_mov_b32 s2, -1
+; GFX1032-NEXT: buffer_store_byte v0, off, s[0:3], 0
+; GFX1032-NEXT: s_endpgm
+;
+; GFX1164-LABEL: uniform_or_i8:
+; GFX1164: ; %bb.0:
+; GFX1164-NEXT: s_clause 0x1
+; GFX1164-NEXT: s_load_b128 s[0:3], s[4:5], 0x24
+; GFX1164-NEXT: s_load_b32 s6, s[4:5], 0x34
+; GFX1164-NEXT: v_mbcnt_lo_u32_b32 v0, exec_lo, 0
+; GFX1164-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1164-NEXT: v_mbcnt_hi_u32_b32 v0, exec_hi, v0
+; GFX1164-NEXT: v_cmp_eq_u32_e32 vcc, 0, v0
+; GFX1164-NEXT: ; implicit-def: $vgpr0
+; GFX1164-NEXT: s_and_saveexec_b64 s[4:5], vcc
+; GFX1164-NEXT: s_cbranch_execz .LBB12_2
+; GFX1164-NEXT: ; %bb.1:
+; GFX1164-NEXT: s_waitcnt lgkmcnt(0)
+; GFX1164-NEXT: s_and_b32 s7, s2, 3
...
[truncated]
jhuber6 left a comment
Thanks
Since 5feb32b this shouldn't be necessary, but it's not working.
Right, after 5feb32b i8/i16 are no longer illegal for the readfirstlane. What exactly is the problem? Can you please share the reproducer if it's handy?
The reproducer is most of the tests in the commit. The i8 intrinsics are not legalized. setOperationAction is probably missing for i8 or Other.
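For reference, the smallest of those tests, reduced to its input IR:

```llvm
; Copied (minus the CHECK lines) from the uniform_or_i8 case in
; atomic-optimizer-promote-i8.ll above; run it through
; opt -passes=amdgpu-atomic-optimizer as in that test's RUN line.
define amdgpu_kernel void @uniform_or_i8(ptr addrspace(1) %result, ptr addrspace(1) %uniform.ptr, i8 %val) {
  %rmw = atomicrmw or ptr addrspace(1) %uniform.ptr, i8 %val monotonic, align 1
  store i8 %rmw, ptr addrspace(1) %result
  ret void
}
```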
pravinjagtap left a comment
LGTM
Per the change description, i8 is not legal here. It is only handling legal types. This makes sense to me, although I could adjust the patch here to only use i16. However, that's just adding extra steps for legalization.

We need to promote 8/16-bit cases to 32-bit. Unfortunately we are missing demanded bits optimizations on readfirstlane, so we end up emitting an and instruction on the input. I'm also surprised this pass isn't handling half or bfloat yet.