Rework i1->i32 zext/anyext translation #114721
Conversation
@llvm/pr-subscribers-llvm-transforms @llvm/pr-subscribers-backend-amdgpu

Author: None (doraeneko)

Changes: Rework i1->i32 zext/anyext translation to distinguish uniform and divergent cases (#87938), similarly to the existing sext_inreg handling.

Patch is 1.38 MiB, truncated to 20.00 KiB below; full version: https://github.com/llvm/llvm-project/pull/114721.diff

76 Files Affected:
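For context, a minimal illustration of the two cases the new patterns distinguish (these functions are hypothetical, not taken from the patch): an i1 condition computed only from scalar kernel arguments is uniform and its extension can be produced on SCC, while one derived from the lane ID is divergent and needs a per-lane VALU select.

; Uniform: %c is the same in every lane, so the i32 can be produced
; with s_cselect_b32 reading SCC.
define amdgpu_kernel void @uniform_zext(ptr addrspace(1) %out, i32 %a) {
  %c = icmp eq i32 %a, 0
  %e = zext i1 %c to i32
  store i32 %e, ptr addrspace(1) %out
  ret void
}

; Divergent: %c varies per lane, so the i32 must be produced with
; v_cndmask_b32 against the lane mask.
define amdgpu_kernel void @divergent_zext(ptr addrspace(1) %out) {
  %tid = call i32 @llvm.amdgcn.workitem.id.x()
  %c = icmp eq i32 %tid, 0
  %e = zext i1 %c to i32
  %gep = getelementptr i32, ptr addrspace(1) %out, i32 %tid
  store i32 %e, ptr addrspace(1) %gep
  ret void
}

declare i32 @llvm.amdgcn.workitem.id.x()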
diff --git a/llvm/lib/Target/AMDGPU/SIInstructions.td b/llvm/lib/Target/AMDGPU/SIInstructions.td
index c8a46217190a1d..4d0fdc50a37070 100644
--- a/llvm/lib/Target/AMDGPU/SIInstructions.td
+++ b/llvm/lib/Target/AMDGPU/SIInstructions.td
@@ -2343,14 +2343,30 @@ def : GCNPat <
/*src1mod*/(i32 0), /*src1*/(i32 -1), i1:$src0)
>;
-class Ext32Pat <SDNode ext> : GCNPat <
- (i32 (ext i1:$src0)),
- (V_CNDMASK_B32_e64 /*src0mod*/(i32 0), /*src0*/(i32 0),
- /*src1mod*/(i32 0), /*src1*/(i32 1), i1:$src0)
+
+class UniformExt32<SDNode ext> : PatFrag<
+ (ops node:$src),
+ (i32 (ext $src)),
+ [{ return !N->isDivergent(); }]>;
+
+class DivergentExt32<SDNode ext> : PatFrag<
+ (ops node:$src),
+ (i32 (ext $src))>;
+
+class UniformExt32Pat<SDNode ext> : GCNPat<
+ (UniformExt32<ext> SCC),
+ (S_CSELECT_B32 (i32 1), (i32 0))
>;
-def : Ext32Pat <zext>;
-def : Ext32Pat <anyext>;
+class DivergentExt32Pat<SDNode ext> : GCNPat<
+ (DivergentExt32<ext> i1:$src),
+ (V_CNDMASK_B32_e64 /*src0mod*/(i32 0), /*src0*/(i32 0),
+ /*src1mod*/(i32 0), /*src1*/(i32 1), i1:$src)>;
+
+def : UniformExt32Pat<zext>;
+def : UniformExt32Pat<anyext>;
+def : DivergentExt32Pat<zext>;
+def : DivergentExt32Pat<anyext>;
// The multiplication scales from [0,1) to the unsigned integer range,
// rounding down a bit to avoid unwanted overflow.
diff --git a/llvm/test/CodeGen/AMDGPU/add.ll b/llvm/test/CodeGen/AMDGPU/add.ll
index 3c9d43a88a0fda..96d16ae968e1a2 100644
--- a/llvm/test/CodeGen/AMDGPU/add.ll
+++ b/llvm/test/CodeGen/AMDGPU/add.ll
@@ -1156,15 +1156,22 @@ define amdgpu_kernel void @add64_in_branch(ptr addrspace(1) %out, ptr addrspace(
; GFX6-NEXT: s_waitcnt lgkmcnt(0)
; GFX6-NEXT: v_cmp_ne_u64_e64 s[10:11], s[4:5], 0
; GFX6-NEXT: s_and_b64 vcc, exec, s[10:11]
-; GFX6-NEXT: s_cbranch_vccz .LBB9_4
+; GFX6-NEXT: s_cbranch_vccz .LBB9_2
; GFX6-NEXT: ; %bb.1: ; %else
; GFX6-NEXT: s_add_u32 s4, s4, s6
; GFX6-NEXT: s_addc_u32 s5, s5, s7
-; GFX6-NEXT: s_andn2_b64 vcc, exec, s[8:9]
-; GFX6-NEXT: s_cbranch_vccnz .LBB9_3
-; GFX6-NEXT: .LBB9_2: ; %if
+; GFX6-NEXT: s_branch .LBB9_3
+; GFX6-NEXT: .LBB9_2:
+; GFX6-NEXT: s_mov_b64 s[8:9], -1
+; GFX6-NEXT: ; implicit-def: $sgpr4_sgpr5
+; GFX6-NEXT: .LBB9_3: ; %Flow
+; GFX6-NEXT: s_and_b64 s[6:7], s[8:9], exec
+; GFX6-NEXT: s_cselect_b32 s6, 1, 0
+; GFX6-NEXT: s_cmp_lg_u32 s6, 1
+; GFX6-NEXT: s_cbranch_scc1 .LBB9_5
+; GFX6-NEXT: ; %bb.4: ; %if
; GFX6-NEXT: s_load_dwordx2 s[4:5], s[2:3], 0x0
-; GFX6-NEXT: .LBB9_3: ; %endif
+; GFX6-NEXT: .LBB9_5: ; %endif
; GFX6-NEXT: s_waitcnt lgkmcnt(0)
; GFX6-NEXT: v_mov_b32_e32 v0, s4
; GFX6-NEXT: s_mov_b32 s3, 0xf000
@@ -1172,9 +1179,6 @@ define amdgpu_kernel void @add64_in_branch(ptr addrspace(1) %out, ptr addrspace(
; GFX6-NEXT: v_mov_b32_e32 v1, s5
; GFX6-NEXT: buffer_store_dwordx2 v[0:1], off, s[0:3], 0
; GFX6-NEXT: s_endpgm
-; GFX6-NEXT: .LBB9_4:
-; GFX6-NEXT: ; implicit-def: $sgpr4_sgpr5
-; GFX6-NEXT: s_branch .LBB9_2
;
; GFX8-LABEL: add64_in_branch:
; GFX8: ; %bb.0: ; %entry
@@ -1182,15 +1186,22 @@ define amdgpu_kernel void @add64_in_branch(ptr addrspace(1) %out, ptr addrspace(
; GFX8-NEXT: s_mov_b64 s[8:9], 0
; GFX8-NEXT: s_waitcnt lgkmcnt(0)
; GFX8-NEXT: s_cmp_lg_u64 s[4:5], 0
-; GFX8-NEXT: s_cbranch_scc0 .LBB9_4
+; GFX8-NEXT: s_cbranch_scc0 .LBB9_2
; GFX8-NEXT: ; %bb.1: ; %else
; GFX8-NEXT: s_add_u32 s4, s4, s6
; GFX8-NEXT: s_addc_u32 s5, s5, s7
-; GFX8-NEXT: s_andn2_b64 vcc, exec, s[8:9]
-; GFX8-NEXT: s_cbranch_vccnz .LBB9_3
-; GFX8-NEXT: .LBB9_2: ; %if
+; GFX8-NEXT: s_branch .LBB9_3
+; GFX8-NEXT: .LBB9_2:
+; GFX8-NEXT: s_mov_b64 s[8:9], -1
+; GFX8-NEXT: ; implicit-def: $sgpr4_sgpr5
+; GFX8-NEXT: .LBB9_3: ; %Flow
+; GFX8-NEXT: s_and_b64 s[6:7], s[8:9], exec
+; GFX8-NEXT: s_cselect_b32 s6, 1, 0
+; GFX8-NEXT: s_cmp_lg_u32 s6, 1
+; GFX8-NEXT: s_cbranch_scc1 .LBB9_5
+; GFX8-NEXT: ; %bb.4: ; %if
; GFX8-NEXT: s_load_dwordx2 s[4:5], s[2:3], 0x0
-; GFX8-NEXT: .LBB9_3: ; %endif
+; GFX8-NEXT: .LBB9_5: ; %endif
; GFX8-NEXT: s_waitcnt lgkmcnt(0)
; GFX8-NEXT: v_mov_b32_e32 v2, s4
; GFX8-NEXT: v_mov_b32_e32 v0, s0
@@ -1198,9 +1209,6 @@ define amdgpu_kernel void @add64_in_branch(ptr addrspace(1) %out, ptr addrspace(
; GFX8-NEXT: v_mov_b32_e32 v3, s5
; GFX8-NEXT: flat_store_dwordx2 v[0:1], v[2:3]
; GFX8-NEXT: s_endpgm
-; GFX8-NEXT: .LBB9_4:
-; GFX8-NEXT: ; implicit-def: $sgpr4_sgpr5
-; GFX8-NEXT: s_branch .LBB9_2
;
; GFX9-LABEL: add64_in_branch:
; GFX9: ; %bb.0: ; %entry
@@ -1208,90 +1216,114 @@ define amdgpu_kernel void @add64_in_branch(ptr addrspace(1) %out, ptr addrspace(
; GFX9-NEXT: s_mov_b64 s[2:3], 0
; GFX9-NEXT: s_waitcnt lgkmcnt(0)
; GFX9-NEXT: s_cmp_lg_u64 s[8:9], 0
-; GFX9-NEXT: s_cbranch_scc0 .LBB9_4
+; GFX9-NEXT: s_cbranch_scc0 .LBB9_2
; GFX9-NEXT: ; %bb.1: ; %else
; GFX9-NEXT: s_add_u32 s0, s8, s10
; GFX9-NEXT: s_addc_u32 s1, s9, s11
-; GFX9-NEXT: s_andn2_b64 vcc, exec, s[2:3]
-; GFX9-NEXT: s_cbranch_vccnz .LBB9_3
-; GFX9-NEXT: .LBB9_2: ; %if
+; GFX9-NEXT: s_branch .LBB9_3
+; GFX9-NEXT: .LBB9_2:
+; GFX9-NEXT: s_mov_b64 s[2:3], -1
+; GFX9-NEXT: ; implicit-def: $sgpr0_sgpr1
+; GFX9-NEXT: .LBB9_3: ; %Flow
+; GFX9-NEXT: s_and_b64 s[2:3], s[2:3], exec
+; GFX9-NEXT: s_cselect_b32 s2, 1, 0
+; GFX9-NEXT: s_cmp_lg_u32 s2, 1
+; GFX9-NEXT: s_cbranch_scc1 .LBB9_5
+; GFX9-NEXT: ; %bb.4: ; %if
; GFX9-NEXT: s_load_dwordx2 s[0:1], s[6:7], 0x0
-; GFX9-NEXT: .LBB9_3: ; %endif
+; GFX9-NEXT: .LBB9_5: ; %endif
; GFX9-NEXT: s_waitcnt lgkmcnt(0)
; GFX9-NEXT: v_mov_b32_e32 v0, s0
; GFX9-NEXT: v_mov_b32_e32 v2, 0
; GFX9-NEXT: v_mov_b32_e32 v1, s1
; GFX9-NEXT: global_store_dwordx2 v2, v[0:1], s[4:5]
; GFX9-NEXT: s_endpgm
-; GFX9-NEXT: .LBB9_4:
-; GFX9-NEXT: ; implicit-def: $sgpr0_sgpr1
-; GFX9-NEXT: s_branch .LBB9_2
;
; GFX10-LABEL: add64_in_branch:
; GFX10: ; %bb.0: ; %entry
; GFX10-NEXT: s_load_dwordx8 s[4:11], s[2:3], 0x24
; GFX10-NEXT: s_waitcnt lgkmcnt(0)
; GFX10-NEXT: s_cmp_lg_u64 s[8:9], 0
-; GFX10-NEXT: s_cbranch_scc0 .LBB9_4
+; GFX10-NEXT: s_cbranch_scc0 .LBB9_2
; GFX10-NEXT: ; %bb.1: ; %else
; GFX10-NEXT: s_add_u32 s0, s8, s10
; GFX10-NEXT: s_addc_u32 s1, s9, s11
-; GFX10-NEXT: s_cbranch_execnz .LBB9_3
-; GFX10-NEXT: .LBB9_2: ; %if
+; GFX10-NEXT: s_mov_b32 s2, 0
+; GFX10-NEXT: s_branch .LBB9_3
+; GFX10-NEXT: .LBB9_2:
+; GFX10-NEXT: s_mov_b32 s2, -1
+; GFX10-NEXT: ; implicit-def: $sgpr0_sgpr1
+; GFX10-NEXT: .LBB9_3: ; %Flow
+; GFX10-NEXT: s_and_b32 s2, s2, exec_lo
+; GFX10-NEXT: s_cselect_b32 s2, 1, 0
+; GFX10-NEXT: s_cmp_lg_u32 s2, 1
+; GFX10-NEXT: s_cbranch_scc1 .LBB9_5
+; GFX10-NEXT: ; %bb.4: ; %if
; GFX10-NEXT: s_load_dwordx2 s[0:1], s[6:7], 0x0
-; GFX10-NEXT: .LBB9_3: ; %endif
+; GFX10-NEXT: .LBB9_5: ; %endif
; GFX10-NEXT: s_waitcnt lgkmcnt(0)
; GFX10-NEXT: v_mov_b32_e32 v0, s0
; GFX10-NEXT: v_mov_b32_e32 v2, 0
; GFX10-NEXT: v_mov_b32_e32 v1, s1
; GFX10-NEXT: global_store_dwordx2 v2, v[0:1], s[4:5]
; GFX10-NEXT: s_endpgm
-; GFX10-NEXT: .LBB9_4:
-; GFX10-NEXT: ; implicit-def: $sgpr0_sgpr1
-; GFX10-NEXT: s_branch .LBB9_2
;
; GFX11-LABEL: add64_in_branch:
; GFX11: ; %bb.0: ; %entry
; GFX11-NEXT: s_load_b256 s[0:7], s[2:3], 0x24
; GFX11-NEXT: s_waitcnt lgkmcnt(0)
; GFX11-NEXT: s_cmp_lg_u64 s[4:5], 0
-; GFX11-NEXT: s_cbranch_scc0 .LBB9_4
+; GFX11-NEXT: s_cbranch_scc0 .LBB9_2
; GFX11-NEXT: ; %bb.1: ; %else
; GFX11-NEXT: s_add_u32 s4, s4, s6
; GFX11-NEXT: s_addc_u32 s5, s5, s7
-; GFX11-NEXT: s_cbranch_execnz .LBB9_3
-; GFX11-NEXT: .LBB9_2: ; %if
+; GFX11-NEXT: s_mov_b32 s6, 0
+; GFX11-NEXT: s_branch .LBB9_3
+; GFX11-NEXT: .LBB9_2:
+; GFX11-NEXT: s_mov_b32 s6, -1
+; GFX11-NEXT: ; implicit-def: $sgpr4_sgpr5
+; GFX11-NEXT: .LBB9_3: ; %Flow
+; GFX11-NEXT: s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
+; GFX11-NEXT: s_and_b32 s6, s6, exec_lo
+; GFX11-NEXT: s_cselect_b32 s6, 1, 0
+; GFX11-NEXT: s_cmp_lg_u32 s6, 1
+; GFX11-NEXT: s_cbranch_scc1 .LBB9_5
+; GFX11-NEXT: ; %bb.4: ; %if
; GFX11-NEXT: s_load_b64 s[4:5], s[2:3], 0x0
-; GFX11-NEXT: .LBB9_3: ; %endif
+; GFX11-NEXT: .LBB9_5: ; %endif
; GFX11-NEXT: s_waitcnt lgkmcnt(0)
; GFX11-NEXT: v_mov_b32_e32 v0, s4
; GFX11-NEXT: v_dual_mov_b32 v2, 0 :: v_dual_mov_b32 v1, s5
; GFX11-NEXT: global_store_b64 v2, v[0:1], s[0:1]
; GFX11-NEXT: s_endpgm
-; GFX11-NEXT: .LBB9_4:
-; GFX11-NEXT: ; implicit-def: $sgpr4_sgpr5
-; GFX11-NEXT: s_branch .LBB9_2
;
; GFX12-LABEL: add64_in_branch:
; GFX12: ; %bb.0: ; %entry
; GFX12-NEXT: s_load_b256 s[0:7], s[2:3], 0x24
; GFX12-NEXT: s_wait_kmcnt 0x0
; GFX12-NEXT: s_cmp_lg_u64 s[4:5], 0
-; GFX12-NEXT: s_cbranch_scc0 .LBB9_4
+; GFX12-NEXT: s_cbranch_scc0 .LBB9_2
; GFX12-NEXT: ; %bb.1: ; %else
; GFX12-NEXT: s_add_nc_u64 s[4:5], s[4:5], s[6:7]
-; GFX12-NEXT: s_cbranch_execnz .LBB9_3
-; GFX12-NEXT: .LBB9_2: ; %if
+; GFX12-NEXT: s_mov_b32 s6, 0
+; GFX12-NEXT: s_branch .LBB9_3
+; GFX12-NEXT: .LBB9_2:
+; GFX12-NEXT: s_mov_b32 s6, -1
+; GFX12-NEXT: ; implicit-def: $sgpr4_sgpr5
+; GFX12-NEXT: .LBB9_3: ; %Flow
+; GFX12-NEXT: s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
+; GFX12-NEXT: s_and_b32 s6, s6, exec_lo
+; GFX12-NEXT: s_cselect_b32 s6, 1, 0
+; GFX12-NEXT: s_cmp_lg_u32 s6, 1
+; GFX12-NEXT: s_cbranch_scc1 .LBB9_5
+; GFX12-NEXT: ; %bb.4: ; %if
; GFX12-NEXT: s_load_b64 s[4:5], s[2:3], 0x0
-; GFX12-NEXT: .LBB9_3: ; %endif
+; GFX12-NEXT: .LBB9_5: ; %endif
; GFX12-NEXT: s_wait_kmcnt 0x0
; GFX12-NEXT: v_mov_b32_e32 v0, s4
; GFX12-NEXT: v_dual_mov_b32 v2, 0 :: v_dual_mov_b32 v1, s5
; GFX12-NEXT: global_store_b64 v2, v[0:1], s[0:1]
; GFX12-NEXT: s_endpgm
-; GFX12-NEXT: .LBB9_4:
-; GFX12-NEXT: ; implicit-def: $sgpr4_sgpr5
-; GFX12-NEXT: s_branch .LBB9_2
entry:
%0 = icmp eq i64 %a, 0
br i1 %0, label %if, label %else
diff --git a/llvm/test/CodeGen/AMDGPU/agpr-copy-no-free-registers.ll b/llvm/test/CodeGen/AMDGPU/agpr-copy-no-free-registers.ll
index 4d26453e1a0d6d..4688c7a6879bd5 100644
--- a/llvm/test/CodeGen/AMDGPU/agpr-copy-no-free-registers.ll
+++ b/llvm/test/CodeGen/AMDGPU/agpr-copy-no-free-registers.ll
@@ -557,31 +557,31 @@ define amdgpu_kernel void @introduced_copy_to_sgpr(i64 %arg, i32 %arg1, i32 %arg
; GFX908-NEXT: s_mul_hi_u32 s9, s0, s7
; GFX908-NEXT: s_mul_i32 s0, s0, s7
; GFX908-NEXT: s_add_i32 s1, s9, s1
-; GFX908-NEXT: s_lshl_b64 s[14:15], s[0:1], 5
+; GFX908-NEXT: s_lshl_b64 s[0:1], s[0:1], 5
; GFX908-NEXT: s_branch .LBB3_2
; GFX908-NEXT: .LBB3_1: ; %Flow20
; GFX908-NEXT: ; in Loop: Header=BB3_2 Depth=1
-; GFX908-NEXT: s_andn2_b64 vcc, exec, s[0:1]
-; GFX908-NEXT: s_cbranch_vccz .LBB3_12
+; GFX908-NEXT: s_and_b64 s[14:15], s[14:15], exec
+; GFX908-NEXT: s_cselect_b32 s7, 1, 0
+; GFX908-NEXT: s_cmp_lg_u32 s7, 1
+; GFX908-NEXT: s_cbranch_scc0 .LBB3_14
; GFX908-NEXT: .LBB3_2: ; %bb9
; GFX908-NEXT: ; =>This Loop Header: Depth=1
-; GFX908-NEXT: ; Child Loop BB3_5 Depth 2
+; GFX908-NEXT: ; Child Loop BB3_6 Depth 2
; GFX908-NEXT: s_mov_b64 s[16:17], -1
-; GFX908-NEXT: s_cbranch_scc0 .LBB3_10
+; GFX908-NEXT: s_cbranch_scc0 .LBB3_12
; GFX908-NEXT: ; %bb.3: ; %bb14
; GFX908-NEXT: ; in Loop: Header=BB3_2 Depth=1
; GFX908-NEXT: global_load_dwordx2 v[2:3], v[0:1], off
-; GFX908-NEXT: v_cmp_gt_i64_e64 s[0:1], s[4:5], -1
; GFX908-NEXT: s_mov_b32 s9, s8
-; GFX908-NEXT: v_cndmask_b32_e64 v6, 0, 1, s[0:1]
; GFX908-NEXT: v_mov_b32_e32 v4, s8
-; GFX908-NEXT: v_cmp_ne_u32_e64 s[0:1], 1, v6
; GFX908-NEXT: v_mov_b32_e32 v6, s8
; GFX908-NEXT: v_mov_b32_e32 v8, s8
; GFX908-NEXT: v_mov_b32_e32 v5, s9
; GFX908-NEXT: v_mov_b32_e32 v7, s9
; GFX908-NEXT: v_mov_b32_e32 v9, s9
-; GFX908-NEXT: v_cmp_lt_i64_e64 s[16:17], s[4:5], 0
+; GFX908-NEXT: v_cmp_lt_i64_e64 s[14:15], s[4:5], 0
+; GFX908-NEXT: v_cmp_gt_i64_e64 s[16:17], s[4:5], -1
; GFX908-NEXT: v_mov_b32_e32 v11, v5
; GFX908-NEXT: s_mov_b64 s[18:19], s[10:11]
; GFX908-NEXT: v_mov_b32_e32 v10, v4
@@ -596,18 +596,22 @@ define amdgpu_kernel void @introduced_copy_to_sgpr(i64 %arg, i32 %arg1, i32 %arg
; GFX908-NEXT: s_add_i32 s9, s20, s9
; GFX908-NEXT: s_mul_i32 s7, s2, s7
; GFX908-NEXT: s_add_i32 s9, s9, s21
-; GFX908-NEXT: s_branch .LBB3_5
+; GFX908-NEXT: s_branch .LBB3_6
; GFX908-NEXT: .LBB3_4: ; %bb58
-; GFX908-NEXT: ; in Loop: Header=BB3_5 Depth=2
+; GFX908-NEXT: ; in Loop: Header=BB3_6 Depth=2
; GFX908-NEXT: v_add_co_u32_sdwa v2, vcc, v2, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_0
; GFX908-NEXT: v_addc_co_u32_e32 v3, vcc, 0, v3, vcc
-; GFX908-NEXT: s_add_u32 s18, s18, s14
+; GFX908-NEXT: s_add_u32 s18, s18, s0
; GFX908-NEXT: v_cmp_lt_i64_e64 s[22:23], -1, v[2:3]
-; GFX908-NEXT: s_addc_u32 s19, s19, s15
+; GFX908-NEXT: s_addc_u32 s19, s19, s1
; GFX908-NEXT: s_mov_b64 s[20:21], 0
-; GFX908-NEXT: s_andn2_b64 vcc, exec, s[22:23]
-; GFX908-NEXT: s_cbranch_vccz .LBB3_9
-; GFX908-NEXT: .LBB3_5: ; %bb16
+; GFX908-NEXT: .LBB3_5: ; %Flow18
+; GFX908-NEXT: ; in Loop: Header=BB3_6 Depth=2
+; GFX908-NEXT: s_and_b64 s[22:23], s[22:23], exec
+; GFX908-NEXT: s_cselect_b32 s22, 1, 0
+; GFX908-NEXT: s_cmp_lg_u32 s22, 1
+; GFX908-NEXT: s_cbranch_scc0 .LBB3_11
+; GFX908-NEXT: .LBB3_6: ; %bb16
; GFX908-NEXT: ; Parent Loop BB3_2 Depth=1
; GFX908-NEXT: ; => This Inner Loop Header: Depth=2
; GFX908-NEXT: s_add_u32 s20, s18, s7
@@ -622,11 +626,13 @@ define amdgpu_kernel void @introduced_copy_to_sgpr(i64 %arg, i32 %arg1, i32 %arg
; GFX908-NEXT: s_waitcnt vmcnt(0)
; GFX908-NEXT: ds_read_b64 v[12:13], v19
; GFX908-NEXT: ds_read_b64 v[14:15], v0
-; GFX908-NEXT: s_and_b64 vcc, exec, s[0:1]
+; GFX908-NEXT: s_and_b64 s[20:21], s[16:17], exec
+; GFX908-NEXT: s_cselect_b32 s20, 1, 0
+; GFX908-NEXT: s_cmp_lg_u32 s20, 1
; GFX908-NEXT: s_waitcnt lgkmcnt(0)
-; GFX908-NEXT: s_cbranch_vccnz .LBB3_7
-; GFX908-NEXT: ; %bb.6: ; %bb51
-; GFX908-NEXT: ; in Loop: Header=BB3_5 Depth=2
+; GFX908-NEXT: s_cbranch_scc1 .LBB3_8
+; GFX908-NEXT: ; %bb.7: ; %bb51
+; GFX908-NEXT: ; in Loop: Header=BB3_6 Depth=2
; GFX908-NEXT: v_cvt_f32_f16_sdwa v22, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
; GFX908-NEXT: v_cvt_f32_f16_e32 v21, v21
; GFX908-NEXT: v_cvt_f32_f16_sdwa v23, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
@@ -648,31 +654,37 @@ define amdgpu_kernel void @introduced_copy_to_sgpr(i64 %arg, i32 %arg1, i32 %arg
; GFX908-NEXT: v_add_f32_e32 v10, v10, v12
; GFX908-NEXT: v_add_f32_e32 v11, v11, v13
; GFX908-NEXT: s_mov_b64 s[20:21], -1
-; GFX908-NEXT: s_branch .LBB3_4
-; GFX908-NEXT: .LBB3_7: ; in Loop: Header=BB3_5 Depth=2
-; GFX908-NEXT: s_mov_b64 s[20:21], s[16:17]
-; GFX908-NEXT: s_andn2_b64 vcc, exec, s[20:21]
-; GFX908-NEXT: s_cbranch_vccz .LBB3_4
-; GFX908-NEXT: ; %bb.8: ; in Loop: Header=BB3_2 Depth=1
+; GFX908-NEXT: s_branch .LBB3_9
+; GFX908-NEXT: .LBB3_8: ; in Loop: Header=BB3_6 Depth=2
+; GFX908-NEXT: s_mov_b64 s[20:21], s[14:15]
+; GFX908-NEXT: .LBB3_9: ; %Flow
+; GFX908-NEXT: ; in Loop: Header=BB3_6 Depth=2
+; GFX908-NEXT: s_and_b64 s[22:23], s[20:21], exec
+; GFX908-NEXT: s_cselect_b32 s22, 1, 0
+; GFX908-NEXT: s_cmp_lg_u32 s22, 1
+; GFX908-NEXT: s_cbranch_scc0 .LBB3_4
+; GFX908-NEXT: ; %bb.10: ; in Loop: Header=BB3_6 Depth=2
+; GFX908-NEXT: s_mov_b64 s[22:23], -1
; GFX908-NEXT: ; implicit-def: $vgpr2_vgpr3
; GFX908-NEXT: ; implicit-def: $sgpr18_sgpr19
-; GFX908-NEXT: .LBB3_9: ; %loop.exit.guard
+; GFX908-NEXT: s_branch .LBB3_5
+; GFX908-NEXT: .LBB3_11: ; %loop.exit.guard
; GFX908-NEXT: ; in Loop: Header=BB3_2 Depth=1
; GFX908-NEXT: s_xor_b64 s[16:17], s[20:21], -1
-; GFX908-NEXT: .LBB3_10: ; %Flow19
+; GFX908-NEXT: .LBB3_12: ; %Flow19
; GFX908-NEXT: ; in Loop: Header=BB3_2 Depth=1
-; GFX908-NEXT: s_mov_b64 s[0:1], -1
+; GFX908-NEXT: s_mov_b64 s[14:15], -1
; GFX908-NEXT: s_and_b64 vcc, exec, s[16:17]
; GFX908-NEXT: s_cbranch_vccz .LBB3_1
-; GFX908-NEXT: ; %bb.11: ; %bb12
+; GFX908-NEXT: ; %bb.13: ; %bb12
; GFX908-NEXT: ; in Loop: Header=BB3_2 Depth=1
; GFX908-NEXT: s_add_u32 s4, s4, s6
; GFX908-NEXT: s_addc_u32 s5, s5, 0
; GFX908-NEXT: s_add_u32 s10, s10, s12
; GFX908-NEXT: s_addc_u32 s11, s11, s13
-; GFX908-NEXT: s_mov_b64 s[0:1], 0
+; GFX908-NEXT: s_mov_b64 s[14:15], 0
; GFX908-NEXT: s_branch .LBB3_1
-; GFX908-NEXT: .LBB3_12: ; %DummyReturnBlock
+; GFX908-NEXT: .LBB3_14: ; %DummyReturnBlock
; GFX908-NEXT: s_endpgm
;
; GFX90A-LABEL: introduced_copy_to_sgpr:
@@ -720,28 +732,28 @@ define amdgpu_kernel void @introduced_copy_to_sgpr(i64 %arg, i32 %arg1, i32 %arg
; GFX90A-NEXT: s_mul_hi_u32 s9, s0, s7
; GFX90A-NEXT: s_mul_i32 s0, s0, s7
; GFX90A-NEXT: s_add_i32 s1, s9, s1
-; GFX90A-NEXT: s_lshl_b64 s[14:15], s[0:1], 5
+; GFX90A-NEXT: s_lshl_b64 s[0:1], s[0:1], 5
; GFX90A-NEXT: s_branch .LBB3_2
; GFX90A-NEXT: .LBB3_1: ; %Flow20
; GFX90A-NEXT: ; in Loop: Header=BB3_2 Depth=1
-; GFX90A-NEXT: s_andn2_b64 vcc, exec, s[0:1]
-; GFX90A-NEXT: s_cbranch_vccz .LBB3_12
+; GFX90A-NEXT: s_and_b64 s[14:15], s[14:15], exec
+; GFX90A-NEXT: s_cselect_b32 s7, 1, 0
+; GFX90A-NEXT: s_cmp_lg_u32 s7, 1
+; GFX90A-NEXT: s_cbranch_scc0 .LBB3_14
; GFX90A-NEXT: .LBB3_2: ; %bb9
; GFX90A-NEXT: ; =>This Loop Header: Depth=1
-; GFX90A-NEXT: ; Child Loop BB3_5 Depth 2
+; GFX90A-NEXT: ; Child Loop BB3_6 Depth 2
; GFX90A-NEXT: s_mov_b64 s[16:17], -1
-; GFX90A-NEXT: s_cbranch_scc0 .LBB3_10
+; GFX90A-NEXT: s_cbranch_scc0 .LBB3_12
; GFX90A-NEXT: ; %bb.3: ; %bb14
; GFX90A-NEXT: ; in Loop: Header=BB3_2 Depth=1
; GFX90A-NEXT: global_load_dwordx2 v[4:5], v[0:1], off
-; GFX90A-NEXT: v_cmp_gt_i64_e64 s[0:1], s[4:5], -1
; GFX90A-NEXT: s_mov_b32 s9, s8
-; GFX90A-NEXT: v_cndmask_b32_e64 v8, 0, 1, s[0:1]
; GFX90A-NEXT: v_pk_mov_b32 v[6:7], s[8:9], s[8:9] op_sel:[0,1]
-; GFX90A-NEXT: v_cmp_ne_u32_e64 s[0:1], 1, v8
; GFX90A-NEXT: v_pk_mov_b32 v[8:9], s[8:9], s[8:9] op_sel:[0,1]
; GFX90A-NEXT: v_pk_mov_b32 v[10:11], s[8:9], s[8:9] op_sel:[0,1]
-; GFX90A-NEXT: v_cmp_lt_i64_e64 s[16:17], s[4:5], 0
+; GFX90A-NEXT: v_cmp_lt_i64_e64 s[14:15], s[4:5], 0
+; GFX90A-NEXT: v_cmp_gt_i64_e64 s[16:17], s[4:5], -1
; GFX90A-NEXT: s_mov_b64 s[18:19], s[10:11]
; GFX90A-NEXT: v_pk_mov_b32 v[12:13], v[6:7], v[6:7] op_sel:[0,1]
; GFX90A-NEXT: s_waitcnt vmcnt(0)
@@ -755,18 +767,22 @@ define amdgpu_kernel void @introduced_copy_to_sgpr(i64 %arg, i32 %arg1, i32 %arg
; GFX90A-NEXT: s_add_i32 s9, s20, s9
; GFX90A-NEXT: s_mul_i32 s7, s2, s7
; GFX90A-NEXT: s_add_i32 s9, s9, s21
-; GFX90A-NEXT: s_branch .LBB3_5
+; GFX90A-NEXT: s_branch .LBB3_6
; GFX90A-NEXT: .LBB3_4: ; %bb58
-; GFX90A-NEXT: ; in Loop: Header=BB3_5 Depth=2
+; GFX90A-NEXT: ; in Loop: Header=BB3_6 Depth=2
; GFX90A-NEXT: v_add_co_u32_sdwa v4, vcc, v4, v18 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_0
; GFX90A-NEXT: v_addc_co_u32_e32 v5, vcc, 0, v5, vcc
-; GFX90A-NEXT: s_add_u32 s18, s18, s14
-; GFX90A-NEXT: s_addc_u32 s19, s19, s15
+; GFX90A-NEXT: s_add_u32 s18, s18, s0
+; GFX90A-NEXT: s_addc_u32 s19, s19, s1
; GFX90A-NEXT: v_cmp_lt_i64_e64 s[22:23], -1, v[4:5]
; GFX90A-NEXT: s_mov_b64 s[20:21], 0
-; GFX90A-NEXT: s_andn2_b64 vcc, exec, s[22:23]
-; GFX90A-NEXT: s_cbranch_vccz .LBB3_9
-; GFX90A-NEXT: .LBB3_5: ; %bb16
+; GFX90A-NEXT: .LBB3_5: ; %Flow18
+; GFX90A-NEXT: ; in Loop: Header=BB3_6 Depth=2
+; GFX90A-NEXT: s_and_b64 s[22:23], s[22:23], exec
+; GFX90A-NEXT: s_cselect_b32 s22, 1, 0
+; GFX90A-NEXT: s_cmp_lg_u32 s22, 1
+; GFX90A-NEXT: s_cbranch_scc0 .LBB3_11
+; GFX90A-NEXT: .LBB3_6: ; %bb16
; GFX90A-NEXT: ; Parent Loop BB3_2 Depth=1
; GFX90A-NEXT: ; => This Inner Loop Header: Depth=2
; GFX90A-NEXT: s_add_u32 s20, s18, s7
@@ -781,12 +797,14 @@ define amdgpu_kernel void @introduced_copy_to_sgpr(i64 %arg, i...
[truncated]
Force-pushed 7f90382 to 1cbb81f (commit message: "…ergent cases (llvm#87938), similarly to sext_inreg handling. adapt flat_atomic test case").
jayfoad left a comment:
This is a tricky area to get right. I don't think your patch should be merged as-is because it seems to cause a lot of code quality regressions. I've called out some of them inline, but there are plenty more if you look through the diffs.
; GFX7-NEXT: s_and_b64 s[4:5], s[4:5], exec
; GFX7-NEXT: s_cselect_b32 s4, 1, 0
; GFX7-NEXT: s_cmp_lg_u32 s4, 1
; GFX7-NEXT: s_cbranch_scc1 .LBB0_6
Regression here.
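For reference, the GFX6 part of the add.ll diff above shows the shape of this regression: the old two-instruction test-and-branch

s_andn2_b64 vcc, exec, s[8:9]
s_cbranch_vccnz .LBB9_3

becomes a four-instruction sequence:

s_and_b64 s[6:7], s[8:9], exec
s_cselect_b32 s6, 1, 0
s_cmp_lg_u32 s6, 1
s_cbranch_scc1 .LBB9_5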
; SI-NEXT: s_and_b64 s[4:5], s[4:5], exec
; SI-NEXT: s_cselect_b32 s4, 1, 0
; SI-NEXT: s_mov_b32 s2, -1
; SI-NEXT: v_mov_b32_e32 v0, s4
Regression here. This looks like a case where the result (s4) is uniform, but it will be needed in a vgpr (v0) anyway, so we might as well use v_cndmask in the first place. Maybe #113705 would help with this.
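A sketch of what this comment is suggesting (register numbers are illustrative, not from the test): since the value ends up in a VGPR anyway, the divergent pattern's single select is cheaper than selecting into an SGPR and then copying:

; hypothetical preferred codegen: produce the i32 directly in a VGPR
v_cndmask_b32_e64 v0, 0, 1, s[4:5]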
Does it make a difference if you do? I would expect the default priority would check the uniform predicate first, as it has the additional predicate.
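(Background on the mechanism: TableGen assigns predicated patterns a higher matching complexity, so the uniform PatFrag, which carries the extra isDivergent predicate check, should already be tried before the unpredicated divergent one. If that ordering ever proved insufficient, it could be forced explicitly; a hedged sketch, not part of this patch:

// Hypothetical: explicitly prioritize the uniform patterns.
let AddedComplexity = 1 in {
  def : UniformExt32Pat<zext>;
  def : UniformExt32Pat<anyext>;
}
)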
; SI-NEXT: v_cndmask_b32_e64 [[RESULT:v[0-9]+]], 0, 1, [[NEG]]
; SI: buffer_store_byte [[RESULT]]
; SI-NEXT: s_and_b64 {{s\[[0-9]+:[0-9]+\]}}, [[NEG]], exec
; SI: buffer_store_byte
This check is missing something: you can't go directly from this exec-mask AND to the store.
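Presumably the value has to pass through a scalar select and a VGPR first. A guess at the missing check lines (the capture variables here are hypothetical), consistent with the s_and/s_cselect/v_mov sequence quoted in the earlier comment:

; SI: s_and_b64 {{s\[[0-9]+:[0-9]+\]}}, [[NEG]], exec
; SI: s_cselect_b32 [[SRESULT:s[0-9]+]], 1, 0
; SI: v_mov_b32_e32 [[VRESULT:v[0-9]+]], [[SRESULT]]
; SI: buffer_store_byte [[VRESULT]]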
; GFX8-NEXT: v_readfirstlane_b32 s34, v0
; GFX8-NEXT: s_setreg_b32 hwreg(HW_REG_MODE, 0, 4), s34
; GFX8-NEXT: s_setpc_b64 s[30:31]
; GFX678-LABEL: s_set_rounding_select_0_1:
These are improvements.