-
Notifications
You must be signed in to change notification settings - Fork 15.2k
[AMDGPU][SIInsertWaitCnts] Gfx12.5 - Refactor xcnt optimization #164357
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[AMDGPU][SIInsertWaitCnts] Gfx12.5 - Refactor xcnt optimization #164357
Conversation
|
Thank you for submitting a Pull Request (PR) to the LLVM Project! This PR will be automatically labeled and the relevant teams will be notified. If you wish to, you can add reviewers by using the "Reviewers" section on this page. If this is not working for you, it is probably because you do not have write permissions for the repository. In which case you can instead tag reviewers by name in a comment by using If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers. If you have further questions, they may be answered by the LLVM GitHub User Guide. You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums. |
|
@llvm/pr-subscribers-backend-amdgpu Author: Ryan Mitchell (RyanRio) ChangesRefactor the XCnt optimization checks so that they can be checked when applying a pre-existing waitcnt. This removes unnecessary xcnt waits when taking a loop backedge. Patch is 26.44 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/164357.diff 5 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp b/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
index 6dcbced010a5a..1aeb25248436c 100644
--- a/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
@@ -646,6 +646,8 @@ class WaitcntBrackets {
void applyWaitcnt(const AMDGPU::Waitcnt &Wait);
void applyWaitcnt(InstCounterType T, unsigned Count);
+ bool hasRedundantXCntWithKmCnt(const AMDGPU::Waitcnt &Wait);
+ bool canOptimizeXCntWithLoadCnt(const AMDGPU::Waitcnt &Wait);
void applyXcnt(const AMDGPU::Waitcnt &Wait);
void updateByEvent(WaitEventType E, MachineInstr &MI);
@@ -1287,20 +1289,25 @@ void WaitcntBrackets::applyWaitcnt(InstCounterType T, unsigned Count) {
}
}
-void WaitcntBrackets::applyXcnt(const AMDGPU::Waitcnt &Wait) {
+bool WaitcntBrackets::hasRedundantXCntWithKmCnt(const AMDGPU::Waitcnt &Wait) {
// Wait on XCNT is redundant if we are already waiting for a load to complete.
// SMEM can return out of order, so only omit XCNT wait if we are waiting till
// zero.
- if (Wait.KmCnt == 0 && hasPendingEvent(SMEM_GROUP))
- return applyWaitcnt(X_CNT, 0);
+ return Wait.KmCnt == 0 && hasPendingEvent(SMEM_GROUP);
+}
+bool WaitcntBrackets::canOptimizeXCntWithLoadCnt(const AMDGPU::Waitcnt &Wait) {
// If we have pending store we cannot optimize XCnt because we do not wait for
// stores. VMEM loads retun in order, so if we only have loads XCnt is
// decremented to the same number as LOADCnt.
- if (Wait.LoadCnt != ~0u && hasPendingEvent(VMEM_GROUP) &&
- !hasPendingEvent(STORE_CNT))
- return applyWaitcnt(X_CNT, std::min(Wait.XCnt, Wait.LoadCnt));
+ return Wait.LoadCnt != ~0u && hasPendingEvent(VMEM_GROUP) && !hasPendingEvent(STORE_CNT);
+}
+void WaitcntBrackets::applyXcnt(const AMDGPU::Waitcnt &Wait) {
+ if (hasRedundantXCntWithKmCnt(Wait))
+ return applyWaitcnt(X_CNT, 0);
+ if (canOptimizeXCntWithLoadCnt(Wait))
+ return applyWaitcnt(X_CNT, std::min(Wait.XCnt, Wait.LoadCnt));
applyWaitcnt(X_CNT, Wait.XCnt);
}
@@ -1729,6 +1736,12 @@ bool WaitcntGeneratorGFX12Plus::applyPreexistingWaitcnt(
if (!WaitInstrs[CT])
continue;
+ if ((CT == KM_CNT && ScoreBrackets.hasRedundantXCntWithKmCnt(Wait)) ||
+ (CT == LOAD_CNT && ScoreBrackets.canOptimizeXCntWithLoadCnt(Wait)))
+ // Xcnt may need to be updated depending on a pre-existing KM/LOAD_CNT
+ // due to taking the backedge of a block.
+ ScoreBrackets.applyXcnt(Wait);
+
unsigned NewCnt = getWait(Wait, CT);
if (NewCnt != ~0u) {
Modified |= updateOperandIfDifferent(*WaitInstrs[CT],
diff --git a/llvm/test/CodeGen/AMDGPU/flat-saddr-load.ll b/llvm/test/CodeGen/AMDGPU/flat-saddr-load.ll
index b5b2655246c3f..fd1e8807885a5 100644
--- a/llvm/test/CodeGen/AMDGPU/flat-saddr-load.ll
+++ b/llvm/test/CodeGen/AMDGPU/flat-saddr-load.ll
@@ -2115,7 +2115,7 @@ define amdgpu_ps void @flat_addr_64bit_lsr_iv(ptr inreg %arg) {
; GFX1250-SDAG-NEXT: s_mov_b64 s[0:1], 0
; GFX1250-SDAG-NEXT: .LBB116_1: ; %bb3
; GFX1250-SDAG-NEXT: ; =>This Inner Loop Header: Depth=1
-; GFX1250-SDAG-NEXT: s_wait_xcnt 0x0
+; GFX1250-SDAG-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
; GFX1250-SDAG-NEXT: s_add_nc_u64 s[4:5], s[2:3], s[0:1]
; GFX1250-SDAG-NEXT: s_add_nc_u64 s[0:1], s[0:1], 4
; GFX1250-SDAG-NEXT: s_wait_dscnt 0x0
@@ -2134,7 +2134,6 @@ define amdgpu_ps void @flat_addr_64bit_lsr_iv(ptr inreg %arg) {
; GFX1250-GISEL-NEXT: .LBB116_1: ; %bb3
; GFX1250-GISEL-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX1250-GISEL-NEXT: s_wait_dscnt 0x0
-; GFX1250-GISEL-NEXT: s_wait_xcnt 0x0
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_add_co_u32 v4, vcc_lo, v0, v2
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v5, null, v1, v3, vcc_lo
@@ -2170,7 +2169,7 @@ define amdgpu_ps void @flat_addr_64bit_lsr_iv_multiload(ptr inreg %arg, ptr inre
; GFX1250-SDAG-NEXT: s_mov_b64 s[0:1], 0
; GFX1250-SDAG-NEXT: .LBB117_1: ; %bb3
; GFX1250-SDAG-NEXT: ; =>This Inner Loop Header: Depth=1
-; GFX1250-SDAG-NEXT: s_wait_xcnt 0x0
+; GFX1250-SDAG-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
; GFX1250-SDAG-NEXT: s_add_nc_u64 s[4:5], s[2:3], s[0:1]
; GFX1250-SDAG-NEXT: s_add_nc_u64 s[0:1], s[0:1], 4
; GFX1250-SDAG-NEXT: s_wait_dscnt 0x0
@@ -2191,7 +2190,7 @@ define amdgpu_ps void @flat_addr_64bit_lsr_iv_multiload(ptr inreg %arg, ptr inre
; GFX1250-GISEL-NEXT: v_mov_b64_e32 v[2:3], s[0:1]
; GFX1250-GISEL-NEXT: .LBB117_1: ; %bb3
; GFX1250-GISEL-NEXT: ; =>This Inner Loop Header: Depth=1
-; GFX1250-GISEL-NEXT: s_wait_xcnt 0x0
+; GFX1250-GISEL-NEXT: s_wait_dscnt 0x0
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_add_co_u32 v4, vcc_lo, v0, v2
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v5, null, v1, v3, vcc_lo
diff --git a/llvm/test/CodeGen/AMDGPU/fmin3.ll b/llvm/test/CodeGen/AMDGPU/fmin3.ll
index 6a6f232c55e24..2756472652bc9 100644
--- a/llvm/test/CodeGen/AMDGPU/fmin3.ll
+++ b/llvm/test/CodeGen/AMDGPU/fmin3.ll
@@ -1233,7 +1233,6 @@ define amdgpu_kernel void @test_fmin3_olt_0_f64(ptr addrspace(1) %out, ptr addrs
; GFX1250-NEXT: s_wait_loadcnt 0x0
; GFX1250-NEXT: buffer_load_b64 v[2:3], off, s[16:19], null scope:SCOPE_SYS
; GFX1250-NEXT: s_wait_loadcnt 0x0
-; GFX1250-NEXT: s_wait_xcnt 0x1
; GFX1250-NEXT: s_mov_b32 s4, s14
; GFX1250-NEXT: s_mov_b32 s5, s15
; GFX1250-NEXT: s_mov_b32 s0, s8
@@ -1443,7 +1442,6 @@ define amdgpu_kernel void @test_fmin3_olt_1_f64(ptr addrspace(1) %out, ptr addrs
; GFX1250-NEXT: s_wait_loadcnt 0x0
; GFX1250-NEXT: buffer_load_b64 v[2:3], off, s[16:19], null scope:SCOPE_SYS
; GFX1250-NEXT: s_wait_loadcnt 0x0
-; GFX1250-NEXT: s_wait_xcnt 0x1
; GFX1250-NEXT: s_mov_b32 s4, s14
; GFX1250-NEXT: s_mov_b32 s5, s15
; GFX1250-NEXT: s_mov_b32 s0, s8
diff --git a/llvm/test/CodeGen/AMDGPU/global-atomicrmw-fadd.ll b/llvm/test/CodeGen/AMDGPU/global-atomicrmw-fadd.ll
index a50791e10f5a2..ed565ca43f9a3 100644
--- a/llvm/test/CodeGen/AMDGPU/global-atomicrmw-fadd.ll
+++ b/llvm/test/CodeGen/AMDGPU/global-atomicrmw-fadd.ll
@@ -8814,7 +8814,7 @@ define half @global_agent_atomic_fadd_ret_f16__amdgpu_no_fine_grained_memory(ptr
; GFX1250-TRUE16-NEXT: global_inv scope:SCOPE_DEV
; GFX1250-TRUE16-NEXT: v_cmp_eq_u32_e32 vcc_lo, v5, v7
; GFX1250-TRUE16-NEXT: s_or_b32 s0, vcc_lo, s0
-; GFX1250-TRUE16-NEXT: s_wait_xcnt 0x0
+; GFX1250-TRUE16-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
; GFX1250-TRUE16-NEXT: s_and_not1_b32 exec_lo, exec_lo, s0
; GFX1250-TRUE16-NEXT: s_cbranch_execnz .LBB44_1
; GFX1250-TRUE16-NEXT: ; %bb.2: ; %atomicrmw.end
@@ -8857,7 +8857,7 @@ define half @global_agent_atomic_fadd_ret_f16__amdgpu_no_fine_grained_memory(ptr
; GFX1250-FAKE16-NEXT: global_inv scope:SCOPE_DEV
; GFX1250-FAKE16-NEXT: v_cmp_eq_u32_e32 vcc_lo, v5, v7
; GFX1250-FAKE16-NEXT: s_or_b32 s0, vcc_lo, s0
-; GFX1250-FAKE16-NEXT: s_wait_xcnt 0x0
+; GFX1250-FAKE16-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
; GFX1250-FAKE16-NEXT: s_and_not1_b32 exec_lo, exec_lo, s0
; GFX1250-FAKE16-NEXT: s_cbranch_execnz .LBB44_1
; GFX1250-FAKE16-NEXT: ; %bb.2: ; %atomicrmw.end
@@ -9322,7 +9322,7 @@ define half @global_agent_atomic_fadd_ret_f16__offset12b_pos__amdgpu_no_fine_gra
; GFX1250-TRUE16-NEXT: global_inv scope:SCOPE_DEV
; GFX1250-TRUE16-NEXT: v_cmp_eq_u32_e32 vcc_lo, v5, v7
; GFX1250-TRUE16-NEXT: s_or_b32 s0, vcc_lo, s0
-; GFX1250-TRUE16-NEXT: s_wait_xcnt 0x0
+; GFX1250-TRUE16-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
; GFX1250-TRUE16-NEXT: s_and_not1_b32 exec_lo, exec_lo, s0
; GFX1250-TRUE16-NEXT: s_cbranch_execnz .LBB45_1
; GFX1250-TRUE16-NEXT: ; %bb.2: ; %atomicrmw.end
@@ -9365,7 +9365,7 @@ define half @global_agent_atomic_fadd_ret_f16__offset12b_pos__amdgpu_no_fine_gra
; GFX1250-FAKE16-NEXT: global_inv scope:SCOPE_DEV
; GFX1250-FAKE16-NEXT: v_cmp_eq_u32_e32 vcc_lo, v5, v7
; GFX1250-FAKE16-NEXT: s_or_b32 s0, vcc_lo, s0
-; GFX1250-FAKE16-NEXT: s_wait_xcnt 0x0
+; GFX1250-FAKE16-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
; GFX1250-FAKE16-NEXT: s_and_not1_b32 exec_lo, exec_lo, s0
; GFX1250-FAKE16-NEXT: s_cbranch_execnz .LBB45_1
; GFX1250-FAKE16-NEXT: ; %bb.2: ; %atomicrmw.end
@@ -9844,7 +9844,7 @@ define half @global_agent_atomic_fadd_ret_f16__offset12b_neg__amdgpu_no_fine_gra
; GFX1250-TRUE16-NEXT: global_inv scope:SCOPE_DEV
; GFX1250-TRUE16-NEXT: v_cmp_eq_u32_e32 vcc_lo, v5, v7
; GFX1250-TRUE16-NEXT: s_or_b32 s0, vcc_lo, s0
-; GFX1250-TRUE16-NEXT: s_wait_xcnt 0x0
+; GFX1250-TRUE16-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
; GFX1250-TRUE16-NEXT: s_and_not1_b32 exec_lo, exec_lo, s0
; GFX1250-TRUE16-NEXT: s_cbranch_execnz .LBB46_1
; GFX1250-TRUE16-NEXT: ; %bb.2: ; %atomicrmw.end
@@ -9888,7 +9888,7 @@ define half @global_agent_atomic_fadd_ret_f16__offset12b_neg__amdgpu_no_fine_gra
; GFX1250-FAKE16-NEXT: global_inv scope:SCOPE_DEV
; GFX1250-FAKE16-NEXT: v_cmp_eq_u32_e32 vcc_lo, v5, v7
; GFX1250-FAKE16-NEXT: s_or_b32 s0, vcc_lo, s0
-; GFX1250-FAKE16-NEXT: s_wait_xcnt 0x0
+; GFX1250-FAKE16-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
; GFX1250-FAKE16-NEXT: s_and_not1_b32 exec_lo, exec_lo, s0
; GFX1250-FAKE16-NEXT: s_cbranch_execnz .LBB46_1
; GFX1250-FAKE16-NEXT: ; %bb.2: ; %atomicrmw.end
@@ -10365,7 +10365,6 @@ define void @global_agent_atomic_fadd_noret_f16__amdgpu_no_fine_grained_memory(p
; GFX1250-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX1250-TRUE16-NEXT: global_inv scope:SCOPE_DEV
; GFX1250-TRUE16-NEXT: v_cmp_eq_u32_e32 vcc_lo, v4, v5
-; GFX1250-TRUE16-NEXT: s_wait_xcnt 0x0
; GFX1250-TRUE16-NEXT: v_mov_b32_e32 v5, v4
; GFX1250-TRUE16-NEXT: s_or_b32 s0, vcc_lo, s0
; GFX1250-TRUE16-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
@@ -10407,7 +10406,6 @@ define void @global_agent_atomic_fadd_noret_f16__amdgpu_no_fine_grained_memory(p
; GFX1250-FAKE16-NEXT: s_wait_loadcnt 0x0
; GFX1250-FAKE16-NEXT: global_inv scope:SCOPE_DEV
; GFX1250-FAKE16-NEXT: v_cmp_eq_u32_e32 vcc_lo, v4, v5
-; GFX1250-FAKE16-NEXT: s_wait_xcnt 0x0
; GFX1250-FAKE16-NEXT: v_mov_b32_e32 v5, v4
; GFX1250-FAKE16-NEXT: s_or_b32 s0, vcc_lo, s0
; GFX1250-FAKE16-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
@@ -10857,7 +10855,6 @@ define void @global_agent_atomic_fadd_noret_f16__offset12b_pos__amdgpu_no_fine_g
; GFX1250-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX1250-TRUE16-NEXT: global_inv scope:SCOPE_DEV
; GFX1250-TRUE16-NEXT: v_cmp_eq_u32_e32 vcc_lo, v4, v5
-; GFX1250-TRUE16-NEXT: s_wait_xcnt 0x0
; GFX1250-TRUE16-NEXT: v_mov_b32_e32 v5, v4
; GFX1250-TRUE16-NEXT: s_or_b32 s0, vcc_lo, s0
; GFX1250-TRUE16-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
@@ -10899,7 +10896,6 @@ define void @global_agent_atomic_fadd_noret_f16__offset12b_pos__amdgpu_no_fine_g
; GFX1250-FAKE16-NEXT: s_wait_loadcnt 0x0
; GFX1250-FAKE16-NEXT: global_inv scope:SCOPE_DEV
; GFX1250-FAKE16-NEXT: v_cmp_eq_u32_e32 vcc_lo, v4, v5
-; GFX1250-FAKE16-NEXT: s_wait_xcnt 0x0
; GFX1250-FAKE16-NEXT: v_mov_b32_e32 v5, v4
; GFX1250-FAKE16-NEXT: s_or_b32 s0, vcc_lo, s0
; GFX1250-FAKE16-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
@@ -11363,7 +11359,6 @@ define void @global_agent_atomic_fadd_noret_f16__offset12b_neg__amdgpu_no_fine_g
; GFX1250-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX1250-TRUE16-NEXT: global_inv scope:SCOPE_DEV
; GFX1250-TRUE16-NEXT: v_cmp_eq_u32_e32 vcc_lo, v4, v5
-; GFX1250-TRUE16-NEXT: s_wait_xcnt 0x0
; GFX1250-TRUE16-NEXT: v_mov_b32_e32 v5, v4
; GFX1250-TRUE16-NEXT: s_or_b32 s0, vcc_lo, s0
; GFX1250-TRUE16-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
@@ -11406,7 +11401,6 @@ define void @global_agent_atomic_fadd_noret_f16__offset12b_neg__amdgpu_no_fine_g
; GFX1250-FAKE16-NEXT: s_wait_loadcnt 0x0
; GFX1250-FAKE16-NEXT: global_inv scope:SCOPE_DEV
; GFX1250-FAKE16-NEXT: v_cmp_eq_u32_e32 vcc_lo, v4, v5
-; GFX1250-FAKE16-NEXT: s_wait_xcnt 0x0
; GFX1250-FAKE16-NEXT: v_mov_b32_e32 v5, v4
; GFX1250-FAKE16-NEXT: s_or_b32 s0, vcc_lo, s0
; GFX1250-FAKE16-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
@@ -11861,7 +11855,7 @@ define half @global_agent_atomic_fadd_ret_f16__offset12b_pos__align4__amdgpu_no_
; GFX1250-TRUE16-NEXT: global_inv scope:SCOPE_DEV
; GFX1250-TRUE16-NEXT: v_cmp_eq_u32_e32 vcc_lo, v3, v5
; GFX1250-TRUE16-NEXT: s_or_b32 s0, vcc_lo, s0
-; GFX1250-TRUE16-NEXT: s_wait_xcnt 0x0
+; GFX1250-TRUE16-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
; GFX1250-TRUE16-NEXT: s_and_not1_b32 exec_lo, exec_lo, s0
; GFX1250-TRUE16-NEXT: s_cbranch_execnz .LBB50_1
; GFX1250-TRUE16-NEXT: ; %bb.2: ; %atomicrmw.end
@@ -11893,7 +11887,7 @@ define half @global_agent_atomic_fadd_ret_f16__offset12b_pos__align4__amdgpu_no_
; GFX1250-FAKE16-NEXT: global_inv scope:SCOPE_DEV
; GFX1250-FAKE16-NEXT: v_cmp_eq_u32_e32 vcc_lo, v3, v5
; GFX1250-FAKE16-NEXT: s_or_b32 s0, vcc_lo, s0
-; GFX1250-FAKE16-NEXT: s_wait_xcnt 0x0
+; GFX1250-FAKE16-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
; GFX1250-FAKE16-NEXT: s_and_not1_b32 exec_lo, exec_lo, s0
; GFX1250-FAKE16-NEXT: s_cbranch_execnz .LBB50_1
; GFX1250-FAKE16-NEXT: ; %bb.2: ; %atomicrmw.end
@@ -12245,7 +12239,6 @@ define void @global_agent_atomic_fadd_noret_f16__offset12b__align4_pos__amdgpu_n
; GFX1250-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX1250-TRUE16-NEXT: global_inv scope:SCOPE_DEV
; GFX1250-TRUE16-NEXT: v_cmp_eq_u32_e32 vcc_lo, v3, v5
-; GFX1250-TRUE16-NEXT: s_wait_xcnt 0x0
; GFX1250-TRUE16-NEXT: v_mov_b32_e32 v5, v3
; GFX1250-TRUE16-NEXT: s_or_b32 s0, vcc_lo, s0
; GFX1250-TRUE16-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
@@ -12276,7 +12269,6 @@ define void @global_agent_atomic_fadd_noret_f16__offset12b__align4_pos__amdgpu_n
; GFX1250-FAKE16-NEXT: s_wait_loadcnt 0x0
; GFX1250-FAKE16-NEXT: global_inv scope:SCOPE_DEV
; GFX1250-FAKE16-NEXT: v_cmp_eq_u32_e32 vcc_lo, v3, v5
-; GFX1250-FAKE16-NEXT: s_wait_xcnt 0x0
; GFX1250-FAKE16-NEXT: v_mov_b32_e32 v5, v3
; GFX1250-FAKE16-NEXT: s_or_b32 s0, vcc_lo, s0
; GFX1250-FAKE16-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
@@ -12631,7 +12623,7 @@ define half @global_system_atomic_fadd_ret_f16__offset12b_pos__amdgpu_no_fine_gr
; GFX1250-TRUE16-NEXT: global_inv scope:SCOPE_SYS
; GFX1250-TRUE16-NEXT: v_cmp_eq_u32_e32 vcc_lo, v5, v7
; GFX1250-TRUE16-NEXT: s_or_b32 s0, vcc_lo, s0
-; GFX1250-TRUE16-NEXT: s_wait_xcnt 0x0
+; GFX1250-TRUE16-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
; GFX1250-TRUE16-NEXT: s_and_not1_b32 exec_lo, exec_lo, s0
; GFX1250-TRUE16-NEXT: s_cbranch_execnz .LBB52_1
; GFX1250-TRUE16-NEXT: ; %bb.2: ; %atomicrmw.end
@@ -12674,7 +12666,7 @@ define half @global_system_atomic_fadd_ret_f16__offset12b_pos__amdgpu_no_fine_gr
; GFX1250-FAKE16-NEXT: global_inv scope:SCOPE_SYS
; GFX1250-FAKE16-NEXT: v_cmp_eq_u32_e32 vcc_lo, v5, v7
; GFX1250-FAKE16-NEXT: s_or_b32 s0, vcc_lo, s0
-; GFX1250-FAKE16-NEXT: s_wait_xcnt 0x0
+; GFX1250-FAKE16-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
; GFX1250-FAKE16-NEXT: s_and_not1_b32 exec_lo, exec_lo, s0
; GFX1250-FAKE16-NEXT: s_cbranch_execnz .LBB52_1
; GFX1250-FAKE16-NEXT: ; %bb.2: ; %atomicrmw.end
@@ -13154,7 +13146,6 @@ define void @global_system_atomic_fadd_noret_f16__offset12b_pos__amdgpu_no_fine_
; GFX1250-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX1250-TRUE16-NEXT: global_inv scope:SCOPE_SYS
; GFX1250-TRUE16-NEXT: v_cmp_eq_u32_e32 vcc_lo, v4, v5
-; GFX1250-TRUE16-NEXT: s_wait_xcnt 0x0
; GFX1250-TRUE16-NEXT: v_mov_b32_e32 v5, v4
; GFX1250-TRUE16-NEXT: s_or_b32 s0, vcc_lo, s0
; GFX1250-TRUE16-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
@@ -13196,7 +13187,6 @@ define void @global_system_atomic_fadd_noret_f16__offset12b_pos__amdgpu_no_fine_
; GFX1250-FAKE16-NEXT: s_wait_loadcnt 0x0
; GFX1250-FAKE16-NEXT: global_inv scope:SCOPE_SYS
; GFX1250-FAKE16-NEXT: v_cmp_eq_u32_e32 vcc_lo, v4, v5
-; GFX1250-FAKE16-NEXT: s_wait_xcnt 0x0
; GFX1250-FAKE16-NEXT: v_mov_b32_e32 v5, v4
; GFX1250-FAKE16-NEXT: s_or_b32 s0, vcc_lo, s0
; GFX1250-FAKE16-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
@@ -13676,7 +13666,7 @@ define bfloat @global_agent_atomic_fadd_ret_bf16__amdgpu_no_fine_grained_memory(
; GFX1250-TRUE16-NEXT: global_inv scope:SCOPE_DEV
; GFX1250-TRUE16-NEXT: v_cmp_eq_u32_e32 vcc_lo, v5, v7
; GFX1250-TRUE16-NEXT: s_or_b32 s0, vcc_lo, s0
-; GFX1250-TRUE16-NEXT: s_wait_xcnt 0x0
+; GFX1250-TRUE16-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
; GFX1250-TRUE16-NEXT: s_and_not1_b32 exec_lo, exec_lo, s0
; GFX1250-TRUE16-NEXT: s_cbranch_execnz .LBB54_1
; GFX1250-TRUE16-NEXT: ; %bb.2: ; %atomicrmw.end
@@ -13722,7 +13712,7 @@ define bfloat @global_agent_atomic_fadd_ret_bf16__amdgpu_no_fine_grained_memory(
; GFX1250-FAKE16-NEXT: global_inv scope:SCOPE_DEV
; GFX1250-FAKE16-NEXT: v_cmp_eq_u32_e32 vcc_lo, v5, v7
; GFX1250-FAKE16-NEXT: s_or_b32 s0, vcc_lo, s0
-; GFX1250-FAKE16-NEXT: s_wait_xcnt 0x0
+; GFX1250-FAKE16-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
; GFX1250-FAKE16-NEXT: s_and_not1_b32 exec_lo, exec_lo, s0
; GFX1250-FAKE16-NEXT: s_cbranch_execnz .LBB54_1
; GFX1250-FAKE16-NEXT: ; %bb.2: ; %atomicrmw.end
@@ -14273,7 +14263,7 @@ define bfloat @global_agent_atomic_fadd_ret_bf16__offset12b_pos__amdgpu_no_fine_
; GFX1250-TRUE16-NEXT: global_inv scope:SCOPE_DEV
; GFX1250-TRUE16-NEXT: v_cmp_eq_u32_e32 vcc_lo, v5, v7
; GFX1250-TRUE16-NEXT: s_or_b32 s0, vcc_lo, s0
-; GFX1250-TRUE16-NEXT: s_wait_xcnt 0x0
+; GFX1250-TRUE16-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
; GFX1250-TRUE16-NEXT: s_and_not1_b32 exec_lo, exec_lo, s0
; GFX1250-TRUE16-NEXT: s_cbranch_execnz .LBB55_1
; GFX1250-TRUE16-NEXT: ; %bb.2: ; %atomicrmw.end
@@ -14319,7 +14309,7 @@ define bfloat @global_agent_atomic_fadd_ret_bf16__offset12b_pos__amdgpu_no_fine_
; GFX1250-FAKE16-NEXT: global_inv scope:SCOPE_DEV
; GFX1250-FAKE16-NEXT: v_cmp_eq_u32_e32 vcc_lo, v5, v7
; GFX1250-FAKE16-NEXT: s_or_b32 s0, vcc_lo, s0
-; GFX1250-FAKE16-NEXT: s_wait_xcnt 0x0
+; GFX1250-FAKE16-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
; GFX1250-FAKE16-NEXT: s_and_not1_b32 exec_lo, exec_lo, s0
; GFX1250-FAKE16-NEXT: s_cbranch_execnz .LBB55_1
; GFX1250-FAKE16-NEXT: ; %bb.2: ; %atomicrmw.end
@@ -14888,7 +14878,7 @@ define bfloat @global_agent_atomic_fadd_ret_bf16__offset12b_neg__amdgpu_no_fine_
; GFX1250-TRUE16-NEXT: global_inv scope:SCOPE_DEV
; GFX1250-TRUE16-NEXT: v_cmp_eq_u32_e32 vcc_lo, v5, v7
; GFX1250-TRUE16-NEXT: s_or_b32 s0, vcc_lo, s0
-; GFX1250-TRUE16-NEXT: s_wait_xcnt 0x0
+; GFX1250-TRUE16-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
; GFX1250-TRUE16-NEXT: s_and_not1_b32 exec_lo, exec_lo, s0
; GFX1250-TRUE16-NEXT: s_cbranch_execnz .LBB56_1
; GFX1250-TRUE16-NEXT: ; %bb.2: ; %atomicrmw.end
@@ -14936,7 +14926,7 @@ define bfloat @global_agent_atomic_fadd_ret_bf16__offset12b_neg__amdgpu_no_fine_
; GFX1250-FAKE16-NEXT: global_inv scope:SCOPE_DEV
; GFX1250-FAKE16-NEXT: v_cmp_eq_u32_e32 vcc_lo, v5, v7
; GFX1250-FAKE16-NEXT: s_or_b32 s0, vcc_lo, s0
-; GFX1250-FAKE16-NEXT: s_wait_xcnt 0x0
+; GFX1250-FAKE16-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
; GFX1250-FAKE16-NEXT: s_and_not1_b32 exec_lo, exec_lo, s0
; GFX1250-FAKE16-NEXT: s_cbranch_execnz .LBB56_1
; GFX1250-FAKE16-NEXT: ; %bb.2: ; %atomicrmw.end
@@ -15502,7 +15492,6 @@ define void @global_agent_atomic_fadd_noret_bf16__amdgpu_no_fine_grained_memory(
; GFX1250-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX1250-TRUE16-NEXT: global_inv scope:SCOPE_DEV
; GFX1250-TRUE16-NEXT: v_cmp_eq_u32_e32 vcc_lo, v4, v5
-; GFX1250-TRUE16-NEXT: s_wait_xcnt 0x0
; GFX1250-TRUE16-NEXT: v_mov_b32_e32 v5, v4
; GFX1250-TRUE16-NEXT: s_or_b32 s0, vcc_lo, s0
; GFX1250-TRUE16-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
@@ -15547,7 +15536,6 @@ define void @global_agent_atomic_fadd_noret_bf16__amdgpu_no_fine_grained_memory(
; GFX1250-FAKE16-NEXT: s_wait_loadcnt 0x0
; GFX1250-FAKE16-NEXT: global_inv scope:SCOPE_DEV
; GFX1250-FAKE16-NEXT: v_cmp_eq_u32_e32 vcc_lo, v4, v5
-; GFX1250-FAKE16-NEXT: s_wait_xcnt 0x0
; GFX1250-FAKE16-NEXT: v_mov_b32_e32 v5, v4
; GFX1250-FAKE16-NEXT: s_or_b32 s0, vcc_lo, s0
; GFX1250-FAKE16-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
@@ -16...
[truncated]
|
8ad5479 to
921c950
Compare
nhaehnle
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems reasonable to me.
921c950 to
0f781b6
Compare
|
@jayfoad @easyonaadit Since my PR also makes xcnt changes, it would be great if we could collaborate on the required changes from #166154 #165201 here. I've pushed what I think is a fix, although more conservative in the LoadCnt case. The issue your test points out, Jay, is that xcnt can only be optimized to the min of loadcnt/xcnt, and so the event can't just be fully cleared unless it's 0, correct? Edit: I've added the test, all the prior tests remain the same and Jay's is now fixed. So I think mine is correct. |
|
✅ With the latest revision this PR passed the C/C++ code formatter. |
shiltian
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some format nits for code style consistency
|
@shiltian Addressed your nits, feel free to re-trigger the workflows, thanks. |
|
@jayfoad Feel free to merge if you agree with my changes. |
Hi @RyanRio, it looks to me like all the optimizations done in applyXcnt (called from applyWaitcnt) should instead be done in simplifyWaitcnt, which is already called on both paths (generating new waitcnts and applying pre-existing waitcnts). Can you try that please? (There are also some Xcnt optimizations in generateWaitcnt that I don't fully understand, but perhaps they could also be subsumed by simplifyWaitcnt?) |
|
@jayfoad retrigger and merge if acceptable, please. |
|
Is an xcnt necessary here? |
Dunno! :) I assume that global_wb does not increment xcnt since it has no address to translate. So the only pending translation would be for the s_load_b64, so I think you would only need an s_wait_xcnt if you are about to clobber s[4:5], which we're not. And also the s_wait_kmcnt would make it redundant. But maybe I'm missing something. |
Refactor the XCnt optimization checks so that they can be checked when applying a pre-existing waitcnt. This has the effect of removing unnecessary xcnt waits when taking a loop backedge.
This reverts commit d3ed9c9.
f4a5085 to
a1ee5b8
Compare
| // XCnt may be already consumed by a load wait. | ||
| if (Wait.XCnt != ~0u) { | ||
| if (Wait.KmCnt == 0 && !ScoreBrackets.hasPendingEvent(SMEM_GROUP)) | ||
| Wait.XCnt = ~0u; | ||
|
|
||
| if (Wait.LoadCnt == 0 && !ScoreBrackets.hasPendingEvent(VMEM_GROUP)) | ||
| Wait.XCnt = ~0u; | ||
|
|
||
| // Since the translation for VMEM addresses occur in-order, we can skip the | ||
| // XCnt if the current instruction is of VMEM type and has a memory | ||
| // dependency with another VMEM instruction in flight. | ||
| if (isVmemAccess(*It)) | ||
| Wait.XCnt = ~0u; | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nhaehnle @jayfoad Thoughts on this case (see changes in bf16.ll)? -
; GFX1250-LABEL: test_load_store_f32_to_bf16:
; GFX1250: ; %bb.0:
; GFX1250-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX1250-NEXT: s_wait_kmcnt 0x0
; GFX1250-NEXT: global_load_b32 v0, v[0:1], off
Debug output before the next instruction:
LOAD_CNT(1): 0:v0 0:v1
DS_CNT(0):
EXP_CNT(0):
STORE_CNT(63):
SAMPLE_CNT(0):
BVH_CNT(0):
KM_CNT(0):
X_CNT(1): 0:v0 0:v1 0:v2 0:v3 0:s126 0:s127
Pending Events: VMEM_READ_ACCESS, VMEM_WRITE_ACCESS, SCRATCH_WRITE_ACCESS, VMEM_GROUP
; GFX1250-NEXT: s_wait_loadcnt 0x0
; GFX1250-NEXT: s_wait_xcnt 0x0
; GFX1250-NEXT: v_cvt_pk_bf16_f32 v0, v0, s0
; GFX1250-NEXT: global_store_b16 v[2:3], v0, off
; GFX1250-NEXT: s_set_pc_i64 s[30:31]
%val = load float, ptr addrspace(1) %in
%val.bf16 = fptrunc float %val to bfloat
store bfloat %val.bf16, ptr addrspace(1) %out
ret void
}
Previously the xcnt was not waited on because of the check on L2168, but now it also requires !hasPendingEvent(STORE_CNT). I think my change is correct because only VMEM loads return in order.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is waiting for #166779 though (currently just includes Jay's fix in this PR.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could also just have this one supersede it, up to you Jay.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apparently there was a late change to gfx1250 that made it so every s_wait_loadcnt (and storecnt) also waits for the equivalent xcnt value. We need to get some clarity on that internally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting. I also see that hardware waits for xcnt==0 at every branch/call, which probably means that SIInsertWaitcnts does not after all have to handle a mixture of pending SMEM Xcnts and VMEM Xcnts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a todo to revisit.
|
@RyanRio Congratulations on having your first Pull Request (PR) merged into the LLVM Project! Your changes will be combined with recent changes from other authors, then tested by our build bots. If there is a problem with a build, you may receive a report in an email or a comment on this PR. Please check whether problems have been caused by your change specifically, as the builds can include changes from many authors. It is not uncommon for your change to be included in a build that fails due to someone else's changes, or infrastructure issues. How to do this, and the rest of the post-merge process, is covered in detail here. If your change does cause a problem, it may be reverted, or you can revert it yourself. This is a normal part of LLVM development. You can fix your changes and open a new PR to merge them again. If you don't get any reports, no action is required from you. Your changes are working as expected, well done! |
Refactor the XCnt optimization checks so that they can be checked when applying a pre-existing waitcnt. This removes unnecessary xcnt waits when taking a loop backedge.