[AMDGPU] Change default loop alignment #155343
Align small loops aggressively to 32 bytes and larger loops to 16 bytes
@llvm/pr-subscribers-backend-amdgpu

Author: None (hjagasiaAMD)

Changes: Align small loops aggressively to 32 bytes and larger loops to 16 bytes.

Patch is 3.86 MiB, truncated to 20.00 KiB below; full version: https://github.com/llvm/llvm-project/pull/155343.diff

153 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index 66c1dfc71c2f5..13fc92f64e8b1 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -17501,26 +17501,18 @@ Align SITargetLowering::computeKnownAlignForTargetInstr(
Align SITargetLowering::getPrefLoopAlignment(MachineLoop *ML) const {
const Align PrefAlign = TargetLowering::getPrefLoopAlignment(ML);
const Align CacheLineAlign = Align(64);
-
- // Pre-GFX10 target did not benefit from loop alignment
- if (!ML || DisableLoopAlignment || !getSubtarget()->hasInstPrefetch() ||
- getSubtarget()->hasInstFwdPrefetchBug())
+ if (!ML || DisableLoopAlignment)
return PrefAlign;
-
- // On GFX10 I$ is 4 x 64 bytes cache lines.
- // By default prefetcher keeps one cache line behind and reads two ahead.
- // We can modify it with S_INST_PREFETCH for larger loops to have two lines
- // behind and one ahead.
- // Therefor we can benefit from aligning loop headers if loop fits 192 bytes.
- // If loop fits 64 bytes it always spans no more than two cache lines and
- // does not need an alignment.
- // Else if loop is less or equal 128 bytes we do not need to modify prefetch,
- // Else if loop is less or equal 192 bytes we need two lines behind.
-
const SIInstrInfo *TII = getSubtarget()->getInstrInfo();
const MachineBasicBlock *Header = ML->getHeader();
if (Header->getAlignment() != PrefAlign)
return Header->getAlignment(); // Already processed.
+ const MachineFunction *MF = Header->getParent();
+ const Function &Fn = MF->getFunction();
+ for (auto &BB : Fn)
+ for (auto &I : BB)
+ if (isa<llvm::UnreachableInst>(&I))
+ return PrefAlign;
unsigned LoopSize = 0;
for (const MachineBasicBlock *MBB : ML->blocks()) {
@@ -17531,13 +17523,41 @@ Align SITargetLowering::getPrefLoopAlignment(MachineLoop *ML) const {
for (const MachineInstr &MI : *MBB) {
LoopSize += TII->getInstSizeInBytes(MI);
- if (LoopSize > 192)
- return PrefAlign;
}
+ if (LoopSize > 192)
+ break;
+ }
+
+ if (!getSubtarget()->hasInstPrefetch() ||
+ getSubtarget()->hasInstFwdPrefetchBug()) {
+ // Align loops <= 32 bytes aggressively
+ if (LoopSize <= 32)
+ return Align(32);
+ // Align larger loops less aggressively
+ if (!ML->isInnermost())
+ return PrefAlign;
+ return Align(16);
+ }
+
+ // On GFX10 I$ is 4 x 64 bytes cache lines.
+ // By default prefetcher keeps one cache line behind and reads two ahead.
+ // We can modify it with S_INST_PREFETCH for larger loops to have two lines
+ // behind and one ahead.
+ // Therefore we can benefit from aligning loop headers if loop fits 192 bytes.
+ // If loop fits 64 bytes it always spans no more than two cache lines and
+ // does not need an alignment driven by prefetch considerations.
+ // Else if loop is less or equal 128 bytes we do not need to modify prefetch,
+ // Else if loop is less or equal 192 bytes we need two lines behind.
+
+ // Align larger loops less aggressively
+ if (LoopSize > 192) {
+ if (!ML->isInnermost())
+ return PrefAlign;
+ return Align(16);
}
if (LoopSize <= 64)
- return PrefAlign;
+ return Align(32);
if (LoopSize <= 128)
return CacheLineAlign;
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmax.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmax.ll
index 666523c88860c..969a86c9810bc 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmax.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmax.ll
@@ -331,6 +331,7 @@ define float @global_agent_atomic_fmax_ret_f32__amdgpu_no_fine_grained_memory(pt
; GFX942-NEXT: global_load_dword v3, v[0:1], off
; GFX942-NEXT: s_mov_b64 s[0:1], 0
; GFX942-NEXT: v_max_f32_e32 v2, v2, v2
+; GFX942-NEXT: .p2align
; GFX942-NEXT: .LBB4_1: ; %atomicrmw.start
; GFX942-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX942-NEXT: s_waitcnt vmcnt(0)
@@ -376,6 +377,7 @@ define float @global_agent_atomic_fmax_ret_f32__amdgpu_no_fine_grained_memory(pt
; GFX90A-NEXT: global_load_dword v3, v[0:1], off
; GFX90A-NEXT: s_mov_b64 s[4:5], 0
; GFX90A-NEXT: v_max_f32_e32 v2, v2, v2
+; GFX90A-NEXT: .p2align
; GFX90A-NEXT: .LBB4_1: ; %atomicrmw.start
; GFX90A-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX90A-NEXT: s_waitcnt vmcnt(0)
@@ -400,6 +402,7 @@ define float @global_agent_atomic_fmax_ret_f32__amdgpu_no_fine_grained_memory(pt
; GFX908-NEXT: global_load_dword v3, v[0:1], off
; GFX908-NEXT: s_mov_b64 s[4:5], 0
; GFX908-NEXT: v_max_f32_e32 v2, v2, v2
+; GFX908-NEXT: .p2align
; GFX908-NEXT: .LBB4_1: ; %atomicrmw.start
; GFX908-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX908-NEXT: s_waitcnt vmcnt(0)
@@ -424,6 +427,7 @@ define float @global_agent_atomic_fmax_ret_f32__amdgpu_no_fine_grained_memory(pt
; GFX8-NEXT: flat_load_dword v3, v[0:1]
; GFX8-NEXT: s_mov_b64 s[4:5], 0
; GFX8-NEXT: v_mul_f32_e32 v2, 1.0, v2
+; GFX8-NEXT: .p2align
; GFX8-NEXT: .LBB4_1: ; %atomicrmw.start
; GFX8-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX8-NEXT: s_waitcnt vmcnt(0)
@@ -478,6 +482,7 @@ define void @global_agent_atomic_fmax_noret_f32__amdgpu_no_fine_grained_memory(p
; GFX942-NEXT: global_load_dword v3, v[0:1], off
; GFX942-NEXT: s_mov_b64 s[0:1], 0
; GFX942-NEXT: v_max_f32_e32 v4, v2, v2
+; GFX942-NEXT: .p2align
; GFX942-NEXT: .LBB5_1: ; %atomicrmw.start
; GFX942-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX942-NEXT: s_waitcnt vmcnt(0)
@@ -522,6 +527,7 @@ define void @global_agent_atomic_fmax_noret_f32__amdgpu_no_fine_grained_memory(p
; GFX90A-NEXT: global_load_dword v3, v[0:1], off
; GFX90A-NEXT: s_mov_b64 s[4:5], 0
; GFX90A-NEXT: v_max_f32_e32 v4, v2, v2
+; GFX90A-NEXT: .p2align
; GFX90A-NEXT: .LBB5_1: ; %atomicrmw.start
; GFX90A-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX90A-NEXT: s_waitcnt vmcnt(0)
@@ -545,6 +551,7 @@ define void @global_agent_atomic_fmax_noret_f32__amdgpu_no_fine_grained_memory(p
; GFX908-NEXT: global_load_dword v3, v[0:1], off
; GFX908-NEXT: s_mov_b64 s[4:5], 0
; GFX908-NEXT: v_max_f32_e32 v4, v2, v2
+; GFX908-NEXT: .p2align
; GFX908-NEXT: .LBB5_1: ; %atomicrmw.start
; GFX908-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX908-NEXT: s_waitcnt vmcnt(0)
@@ -568,6 +575,7 @@ define void @global_agent_atomic_fmax_noret_f32__amdgpu_no_fine_grained_memory(p
; GFX8-NEXT: flat_load_dword v3, v[0:1]
; GFX8-NEXT: s_mov_b64 s[4:5], 0
; GFX8-NEXT: v_mul_f32_e32 v4, 1.0, v2
+; GFX8-NEXT: .p2align
; GFX8-NEXT: .LBB5_1: ; %atomicrmw.start
; GFX8-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX8-NEXT: s_waitcnt vmcnt(0)
@@ -610,6 +618,7 @@ define double @global_agent_atomic_fmax_ret_f64__amdgpu_no_fine_grained_memory(p
; GFX12-NEXT: global_load_b64 v[4:5], v[0:1], off
; GFX12-NEXT: v_max_num_f64_e32 v[2:3], v[2:3], v[2:3]
; GFX12-NEXT: s_mov_b32 s0, 0
+; GFX12-NEXT: .p2align
; GFX12-NEXT: .LBB6_1: ; %atomicrmw.start
; GFX12-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-NEXT: s_wait_loadcnt 0x0
@@ -648,6 +657,7 @@ define double @global_agent_atomic_fmax_ret_f64__amdgpu_no_fine_grained_memory(p
; GFX11-NEXT: global_load_b64 v[4:5], v[0:1], off
; GFX11-NEXT: v_max_f64 v[2:3], v[2:3], v[2:3]
; GFX11-NEXT: s_mov_b32 s0, 0
+; GFX11-NEXT: .p2align
; GFX11-NEXT: .LBB6_1: ; %atomicrmw.start
; GFX11-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-NEXT: s_waitcnt vmcnt(0)
@@ -694,6 +704,7 @@ define double @global_agent_atomic_fmax_ret_f64__amdgpu_no_fine_grained_memory(p
; GFX908-NEXT: global_load_dwordx2 v[4:5], v[0:1], off
; GFX908-NEXT: v_max_f64 v[2:3], v[2:3], v[2:3]
; GFX908-NEXT: s_mov_b64 s[4:5], 0
+; GFX908-NEXT: .p2align
; GFX908-NEXT: .LBB6_1: ; %atomicrmw.start
; GFX908-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX908-NEXT: s_waitcnt vmcnt(0)
@@ -720,6 +731,7 @@ define double @global_agent_atomic_fmax_ret_f64__amdgpu_no_fine_grained_memory(p
; GFX8-NEXT: flat_load_dwordx2 v[4:5], v[0:1]
; GFX8-NEXT: v_max_f64 v[2:3], v[2:3], v[2:3]
; GFX8-NEXT: s_mov_b64 s[4:5], 0
+; GFX8-NEXT: .p2align
; GFX8-NEXT: .LBB6_1: ; %atomicrmw.start
; GFX8-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX8-NEXT: s_waitcnt vmcnt(0)
@@ -767,6 +779,7 @@ define void @global_agent_atomic_fmax_noret_f64__amdgpu_no_fine_grained_memory(p
; GFX12-NEXT: global_load_b64 v[4:5], v[0:1], off
; GFX12-NEXT: v_max_num_f64_e32 v[6:7], v[2:3], v[2:3]
; GFX12-NEXT: s_mov_b32 s0, 0
+; GFX12-NEXT: .p2align
; GFX12-NEXT: .LBB7_1: ; %atomicrmw.start
; GFX12-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-NEXT: s_wait_loadcnt 0x0
@@ -804,6 +817,7 @@ define void @global_agent_atomic_fmax_noret_f64__amdgpu_no_fine_grained_memory(p
; GFX11-NEXT: global_load_b64 v[4:5], v[0:1], off
; GFX11-NEXT: v_max_f64 v[6:7], v[2:3], v[2:3]
; GFX11-NEXT: s_mov_b32 s0, 0
+; GFX11-NEXT: .p2align
; GFX11-NEXT: .LBB7_1: ; %atomicrmw.start
; GFX11-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-NEXT: s_waitcnt vmcnt(0)
@@ -849,6 +863,7 @@ define void @global_agent_atomic_fmax_noret_f64__amdgpu_no_fine_grained_memory(p
; GFX908-NEXT: global_load_dwordx2 v[4:5], v[0:1], off
; GFX908-NEXT: v_max_f64 v[6:7], v[2:3], v[2:3]
; GFX908-NEXT: s_mov_b64 s[4:5], 0
+; GFX908-NEXT: .p2align
; GFX908-NEXT: .LBB7_1: ; %atomicrmw.start
; GFX908-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX908-NEXT: s_waitcnt vmcnt(0)
@@ -873,6 +888,7 @@ define void @global_agent_atomic_fmax_noret_f64__amdgpu_no_fine_grained_memory(p
; GFX8-NEXT: flat_load_dwordx2 v[4:5], v[0:1]
; GFX8-NEXT: v_max_f64 v[6:7], v[2:3], v[2:3]
; GFX8-NEXT: s_mov_b64 s[4:5], 0
+; GFX8-NEXT: .p2align
; GFX8-NEXT: .LBB7_1: ; %atomicrmw.start
; GFX8-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX8-NEXT: s_waitcnt vmcnt(0)
@@ -926,6 +942,7 @@ define float @flat_agent_atomic_fmax_ret_f32__amdgpu_no_fine_grained_memory(ptr
; GFX942-NEXT: flat_load_dword v3, v[0:1]
; GFX942-NEXT: s_mov_b64 s[0:1], 0
; GFX942-NEXT: v_max_f32_e32 v2, v2, v2
+; GFX942-NEXT: .p2align
; GFX942-NEXT: .LBB8_1: ; %atomicrmw.start
; GFX942-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX942-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
@@ -971,6 +988,7 @@ define float @flat_agent_atomic_fmax_ret_f32__amdgpu_no_fine_grained_memory(ptr
; GFX90A-NEXT: flat_load_dword v3, v[0:1]
; GFX90A-NEXT: s_mov_b64 s[4:5], 0
; GFX90A-NEXT: v_max_f32_e32 v2, v2, v2
+; GFX90A-NEXT: .p2align
; GFX90A-NEXT: .LBB8_1: ; %atomicrmw.start
; GFX90A-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX90A-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
@@ -995,6 +1013,7 @@ define float @flat_agent_atomic_fmax_ret_f32__amdgpu_no_fine_grained_memory(ptr
; GFX908-NEXT: flat_load_dword v3, v[0:1]
; GFX908-NEXT: s_mov_b64 s[4:5], 0
; GFX908-NEXT: v_max_f32_e32 v2, v2, v2
+; GFX908-NEXT: .p2align
; GFX908-NEXT: .LBB8_1: ; %atomicrmw.start
; GFX908-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX908-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
@@ -1019,6 +1038,7 @@ define float @flat_agent_atomic_fmax_ret_f32__amdgpu_no_fine_grained_memory(ptr
; GFX8-NEXT: flat_load_dword v3, v[0:1]
; GFX8-NEXT: s_mov_b64 s[4:5], 0
; GFX8-NEXT: v_mul_f32_e32 v2, 1.0, v2
+; GFX8-NEXT: .p2align
; GFX8-NEXT: .LBB8_1: ; %atomicrmw.start
; GFX8-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX8-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
@@ -1069,6 +1089,7 @@ define void @flat_agent_atomic_fmax_noret_f32__amdgpu_no_fine_grained_memory(ptr
; GFX942-NEXT: flat_load_dword v3, v[0:1]
; GFX942-NEXT: s_mov_b64 s[0:1], 0
; GFX942-NEXT: v_max_f32_e32 v4, v2, v2
+; GFX942-NEXT: .p2align
; GFX942-NEXT: .LBB9_1: ; %atomicrmw.start
; GFX942-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX942-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
@@ -1115,6 +1136,7 @@ define void @flat_agent_atomic_fmax_noret_f32__amdgpu_no_fine_grained_memory(ptr
; GFX90A-NEXT: flat_load_dword v3, v[0:1]
; GFX90A-NEXT: s_mov_b64 s[4:5], 0
; GFX90A-NEXT: v_max_f32_e32 v4, v2, v2
+; GFX90A-NEXT: .p2align
; GFX90A-NEXT: .LBB9_1: ; %atomicrmw.start
; GFX90A-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX90A-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
@@ -1138,6 +1160,7 @@ define void @flat_agent_atomic_fmax_noret_f32__amdgpu_no_fine_grained_memory(ptr
; GFX908-NEXT: flat_load_dword v3, v[0:1]
; GFX908-NEXT: s_mov_b64 s[4:5], 0
; GFX908-NEXT: v_max_f32_e32 v4, v2, v2
+; GFX908-NEXT: .p2align
; GFX908-NEXT: .LBB9_1: ; %atomicrmw.start
; GFX908-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX908-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
@@ -1161,6 +1184,7 @@ define void @flat_agent_atomic_fmax_noret_f32__amdgpu_no_fine_grained_memory(ptr
; GFX8-NEXT: flat_load_dword v3, v[0:1]
; GFX8-NEXT: s_mov_b64 s[4:5], 0
; GFX8-NEXT: v_mul_f32_e32 v4, 1.0, v2
+; GFX8-NEXT: .p2align
; GFX8-NEXT: .LBB9_1: ; %atomicrmw.start
; GFX8-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX8-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
@@ -1200,6 +1224,7 @@ define double @flat_agent_atomic_fmax_ret_f64__amdgpu_no_fine_grained_memory(ptr
; GFX12-NEXT: flat_load_b64 v[4:5], v[0:1]
; GFX12-NEXT: v_max_num_f64_e32 v[2:3], v[2:3], v[2:3]
; GFX12-NEXT: s_mov_b32 s0, 0
+; GFX12-NEXT: .p2align
; GFX12-NEXT: .LBB10_1: ; %atomicrmw.start
; GFX12-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-NEXT: s_wait_loadcnt_dscnt 0x0
@@ -1238,6 +1263,7 @@ define double @flat_agent_atomic_fmax_ret_f64__amdgpu_no_fine_grained_memory(ptr
; GFX11-NEXT: flat_load_b64 v[4:5], v[0:1]
; GFX11-NEXT: v_max_f64 v[2:3], v[2:3], v[2:3]
; GFX11-NEXT: s_mov_b32 s0, 0
+; GFX11-NEXT: .p2align
; GFX11-NEXT: .LBB10_1: ; %atomicrmw.start
; GFX11-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
@@ -1284,6 +1310,7 @@ define double @flat_agent_atomic_fmax_ret_f64__amdgpu_no_fine_grained_memory(ptr
; GFX908-NEXT: flat_load_dwordx2 v[4:5], v[0:1]
; GFX908-NEXT: v_max_f64 v[2:3], v[2:3], v[2:3]
; GFX908-NEXT: s_mov_b64 s[4:5], 0
+; GFX908-NEXT: .p2align
; GFX908-NEXT: .LBB10_1: ; %atomicrmw.start
; GFX908-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX908-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
@@ -1313,6 +1340,7 @@ define double @flat_agent_atomic_fmax_ret_f64__amdgpu_no_fine_grained_memory(ptr
; GFX8-NEXT: flat_load_dword v5, v[5:6]
; GFX8-NEXT: v_max_f64 v[2:3], v[2:3], v[2:3]
; GFX8-NEXT: s_mov_b64 s[4:5], 0
+; GFX8-NEXT: .p2align
; GFX8-NEXT: .LBB10_1: ; %atomicrmw.start
; GFX8-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX8-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
@@ -1355,6 +1383,7 @@ define void @flat_agent_atomic_fmax_noret_f64__amdgpu_no_fine_grained_memory(ptr
; GFX12-NEXT: flat_load_b64 v[4:5], v[0:1]
; GFX12-NEXT: v_max_num_f64_e32 v[6:7], v[2:3], v[2:3]
; GFX12-NEXT: s_mov_b32 s0, 0
+; GFX12-NEXT: .p2align
; GFX12-NEXT: .LBB11_1: ; %atomicrmw.start
; GFX12-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-NEXT: s_wait_loadcnt_dscnt 0x0
@@ -1392,6 +1421,7 @@ define void @flat_agent_atomic_fmax_noret_f64__amdgpu_no_fine_grained_memory(ptr
; GFX11-NEXT: flat_load_b64 v[4:5], v[0:1]
; GFX11-NEXT: v_max_f64 v[6:7], v[2:3], v[2:3]
; GFX11-NEXT: s_mov_b32 s0, 0
+; GFX11-NEXT: .p2align
; GFX11-NEXT: .LBB11_1: ; %atomicrmw.start
; GFX11-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
@@ -1438,6 +1468,7 @@ define void @flat_agent_atomic_fmax_noret_f64__amdgpu_no_fine_grained_memory(ptr
; GFX908-NEXT: flat_load_dwordx2 v[4:5], v[0:1]
; GFX908-NEXT: v_max_f64 v[6:7], v[2:3], v[2:3]
; GFX908-NEXT: s_mov_b64 s[4:5], 0
+; GFX908-NEXT: .p2align
; GFX908-NEXT: .LBB11_1: ; %atomicrmw.start
; GFX908-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX908-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
@@ -1465,6 +1496,7 @@ define void @flat_agent_atomic_fmax_noret_f64__amdgpu_no_fine_grained_memory(ptr
; GFX8-NEXT: flat_load_dword v5, v[5:6]
; GFX8-NEXT: v_max_f64 v[6:7], v[2:3], v[2:3]
; GFX8-NEXT: s_mov_b64 s[4:5], 0
+; GFX8-NEXT: .p2align
; GFX8-NEXT: .LBB11_1: ; %atomicrmw.start
; GFX8-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX8-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
@@ -1518,6 +1550,7 @@ define float @buffer_fat_ptr_agent_atomic_fmax_ret_f32__amdgpu_no_fine_grained_m
; GFX942-NEXT: buffer_load_dword v0, v2, s[0:3], 0 offen
; GFX942-NEXT: s_mov_b64 s[4:5], 0
; GFX942-NEXT: v_max_f32_e32 v3, v1, v1
+; GFX942-NEXT: .p2align
; GFX942-NEXT: .LBB12_1: ; %atomicrmw.start
; GFX942-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX942-NEXT: s_waitcnt vmcnt(0)
@@ -1567,6 +1600,7 @@ define float @buffer_fat_ptr_agent_atomic_fmax_ret_f32__amdgpu_no_fine_grained_m
; GFX90A-NEXT: buffer_load_dword v0, v2, s[16:19], 0 offen
; GFX90A-NEXT: s_mov_b64 s[4:5], 0
; GFX90A-NEXT: v_max_f32_e32 v3, v1, v1
+; GFX90A-NEXT: .p2align
; GFX90A-NEXT: .LBB12_1: ; %atomicrmw.start
; GFX90A-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX90A-NEXT: s_waitcnt vmcnt(0)
@@ -1593,6 +1627,7 @@ define float @buffer_fat_ptr_agent_atomic_fmax_ret_f32__amdgpu_no_fine_grained_m
; GFX908-NEXT: buffer_load_dword v0, v2, s[16:19], 0 offen
; GFX908-NEXT: s_mov_b64 s[4:5], 0
; GFX908-NEXT: v_max_f32_e32 v3, v1, v1
+; GFX908-NEXT: .p2align
; GFX908-NEXT: .LBB12_1: ; %atomicrmw.start
; GFX908-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX908-NEXT: s_waitcnt vmcnt(0)
@@ -1620,6 +1655,7 @@ define float @buffer_fat_ptr_agent_atomic_fmax_ret_f32__amdgpu_no_fine_grained_m
; GFX8-NEXT: buffer_load_dword v0, v2, s[16:19], 0 offen
; GFX8-NEXT: s_mov_b64 s[4:5], 0
; GFX8-NEXT: v_mul_f32_e32 v3, 1.0, v1
+; GFX8-NEXT: .p2align
; GFX8-NEXT: .LBB12_1: ; %atomicrmw.start
; GFX8-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX8-NEXT: s_waitcnt vmcnt(0)
@@ -1674,6 +1710,7 @@ define void @buffer_fat_ptr_agent_atomic_fmax_noret_f32__amdgpu_no_fine_grained_
; GFX942-NEXT: buffer_load_dword v1, v2, s[0:3], 0 offen
; GFX942-NEXT: s_mov_b64 s[4:5], 0
; GFX942-NEXT: v_max_f32_e32 v3, v0, v0
+; GFX942-NEXT: .p2align
; GFX942-NEXT: .LBB13_1: ; %atomicrmw.start
; GFX942-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX942-NEXT: s_waitcnt vmcnt(0)
@@ -1722,6 +1759,7 @@ define void @buffer_fat_ptr_agent_atomic_fmax_noret_f32__amdgpu_no_fine_grained_
; GFX90A-NEXT: buffer_load_dword v1, v2, s[16:19], 0 offen
; GFX90A-NEXT: s_mov_b64 s[4:5], 0
; GFX90A-NEXT: v_max_f32_e32 v3, v0, v0
+; GFX90A-NEXT: .p2align
; GFX90A-NEXT: .LBB13_1: ; %atomicrmw.start
; GFX90A-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX90A-NEXT: s_waitcnt vmcnt(0)
@@ -1747,6 +1785,7 @@ define void @buffer_fat_ptr_agent_atomic_fmax_noret_f32__amdgpu_no_fine_grained_
; GFX908-NEXT: buffer_load_dword v1, v2, s[16:19], 0 offen
; GFX908-NEXT: s_mov_b64 s[4:5], 0
; GFX908-NEXT: v_max_f32_e32 v3, v0, v0
+; GFX908-NEXT: .p2align
; GFX908-NEXT: .LBB13_1: ; %atomicrmw.start
; GFX908-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX908-NEXT: s_waitcnt vmcnt(0)
@@ -1773,6 +1812,7 @@ define void @buffer_fat_ptr_agent_atomic_fmax_noret_f32__amdgpu_no_fine_grained_
; GFX8-NEXT: buffer_load_dword v1, v2, s[16:19], 0 offen
; GFX8-NEXT: s_mov_b64 s[4:5], 0
; GFX8-NEXT: v_mul_f32_e32 v3, 1.0, v0
+; GFX8-NEXT: .p2align
; GFX8-NEXT: .LBB13_1: ; %atomicrmw.start
; GFX8-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX8-NEXT: s_waitcnt vmcnt(0)
@@ -1817,6 +1857,7 @@ define double @buffer_fat_ptr_agent_atomic_fmax_ret_f64__amdgpu_no_fine_grained_
; GFX12-NEXT: s_mov_b32 s4, 0
; GFX12-NEXT: buffer_load_b64 v[0:1], v6, s[0:3], null offen
; GFX12-NEXT: v_max_num_f64_e32 v[4:5], v[2:3], v[2:3]
+; GFX12-NEXT: .p2align
; GFX12-NEXT: .LBB14_1: ; %atomicrmw.start
; GFX12-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-NEXT: s_wait_loadcnt 0x0
@@ -1859,6 +1900,7 @@ define double @buffer_fat_ptr_agent_atomic_fmax_ret_f64__amdgpu_no_fine_grained_
; GFX11-NEXT: s_mov_b32 s4, 0
; GFX11-NEXT: buffer_loa...
[truncated]
@llvm/pr-subscribers-llvm-globalisel

Author: None (hjagasiaAMD)

Changes: Align small loops aggressively to 32 bytes and larger loops to 16 bytes.

Patch is 3.86 MiB, truncated to 20.00 KiB below; full version: https://github.com/llvm/llvm-project/pull/155343.diff

153 Files Affected:
; GFX12-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-NEXT: s_wait_loadcnt_dscnt 0x0
@@ -1238,6 +1263,7 @@ define double @flat_agent_atomic_fmax_ret_f64__amdgpu_no_fine_grained_memory(ptr
; GFX11-NEXT: flat_load_b64 v[4:5], v[0:1]
; GFX11-NEXT: v_max_f64 v[2:3], v[2:3], v[2:3]
; GFX11-NEXT: s_mov_b32 s0, 0
+; GFX11-NEXT: .p2align
; GFX11-NEXT: .LBB10_1: ; %atomicrmw.start
; GFX11-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
@@ -1284,6 +1310,7 @@ define double @flat_agent_atomic_fmax_ret_f64__amdgpu_no_fine_grained_memory(ptr
; GFX908-NEXT: flat_load_dwordx2 v[4:5], v[0:1]
; GFX908-NEXT: v_max_f64 v[2:3], v[2:3], v[2:3]
; GFX908-NEXT: s_mov_b64 s[4:5], 0
+; GFX908-NEXT: .p2align
; GFX908-NEXT: .LBB10_1: ; %atomicrmw.start
; GFX908-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX908-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
@@ -1313,6 +1340,7 @@ define double @flat_agent_atomic_fmax_ret_f64__amdgpu_no_fine_grained_memory(ptr
; GFX8-NEXT: flat_load_dword v5, v[5:6]
; GFX8-NEXT: v_max_f64 v[2:3], v[2:3], v[2:3]
; GFX8-NEXT: s_mov_b64 s[4:5], 0
+; GFX8-NEXT: .p2align
; GFX8-NEXT: .LBB10_1: ; %atomicrmw.start
; GFX8-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX8-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
@@ -1355,6 +1383,7 @@ define void @flat_agent_atomic_fmax_noret_f64__amdgpu_no_fine_grained_memory(ptr
; GFX12-NEXT: flat_load_b64 v[4:5], v[0:1]
; GFX12-NEXT: v_max_num_f64_e32 v[6:7], v[2:3], v[2:3]
; GFX12-NEXT: s_mov_b32 s0, 0
+; GFX12-NEXT: .p2align
; GFX12-NEXT: .LBB11_1: ; %atomicrmw.start
; GFX12-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-NEXT: s_wait_loadcnt_dscnt 0x0
@@ -1392,6 +1421,7 @@ define void @flat_agent_atomic_fmax_noret_f64__amdgpu_no_fine_grained_memory(ptr
; GFX11-NEXT: flat_load_b64 v[4:5], v[0:1]
; GFX11-NEXT: v_max_f64 v[6:7], v[2:3], v[2:3]
; GFX11-NEXT: s_mov_b32 s0, 0
+; GFX11-NEXT: .p2align
; GFX11-NEXT: .LBB11_1: ; %atomicrmw.start
; GFX11-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
@@ -1438,6 +1468,7 @@ define void @flat_agent_atomic_fmax_noret_f64__amdgpu_no_fine_grained_memory(ptr
; GFX908-NEXT: flat_load_dwordx2 v[4:5], v[0:1]
; GFX908-NEXT: v_max_f64 v[6:7], v[2:3], v[2:3]
; GFX908-NEXT: s_mov_b64 s[4:5], 0
+; GFX908-NEXT: .p2align
; GFX908-NEXT: .LBB11_1: ; %atomicrmw.start
; GFX908-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX908-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
@@ -1465,6 +1496,7 @@ define void @flat_agent_atomic_fmax_noret_f64__amdgpu_no_fine_grained_memory(ptr
; GFX8-NEXT: flat_load_dword v5, v[5:6]
; GFX8-NEXT: v_max_f64 v[6:7], v[2:3], v[2:3]
; GFX8-NEXT: s_mov_b64 s[4:5], 0
+; GFX8-NEXT: .p2align
; GFX8-NEXT: .LBB11_1: ; %atomicrmw.start
; GFX8-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX8-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
@@ -1518,6 +1550,7 @@ define float @buffer_fat_ptr_agent_atomic_fmax_ret_f32__amdgpu_no_fine_grained_m
; GFX942-NEXT: buffer_load_dword v0, v2, s[0:3], 0 offen
; GFX942-NEXT: s_mov_b64 s[4:5], 0
; GFX942-NEXT: v_max_f32_e32 v3, v1, v1
+; GFX942-NEXT: .p2align
; GFX942-NEXT: .LBB12_1: ; %atomicrmw.start
; GFX942-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX942-NEXT: s_waitcnt vmcnt(0)
@@ -1567,6 +1600,7 @@ define float @buffer_fat_ptr_agent_atomic_fmax_ret_f32__amdgpu_no_fine_grained_m
; GFX90A-NEXT: buffer_load_dword v0, v2, s[16:19], 0 offen
; GFX90A-NEXT: s_mov_b64 s[4:5], 0
; GFX90A-NEXT: v_max_f32_e32 v3, v1, v1
+; GFX90A-NEXT: .p2align
; GFX90A-NEXT: .LBB12_1: ; %atomicrmw.start
; GFX90A-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX90A-NEXT: s_waitcnt vmcnt(0)
@@ -1593,6 +1627,7 @@ define float @buffer_fat_ptr_agent_atomic_fmax_ret_f32__amdgpu_no_fine_grained_m
; GFX908-NEXT: buffer_load_dword v0, v2, s[16:19], 0 offen
; GFX908-NEXT: s_mov_b64 s[4:5], 0
; GFX908-NEXT: v_max_f32_e32 v3, v1, v1
+; GFX908-NEXT: .p2align
; GFX908-NEXT: .LBB12_1: ; %atomicrmw.start
; GFX908-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX908-NEXT: s_waitcnt vmcnt(0)
@@ -1620,6 +1655,7 @@ define float @buffer_fat_ptr_agent_atomic_fmax_ret_f32__amdgpu_no_fine_grained_m
; GFX8-NEXT: buffer_load_dword v0, v2, s[16:19], 0 offen
; GFX8-NEXT: s_mov_b64 s[4:5], 0
; GFX8-NEXT: v_mul_f32_e32 v3, 1.0, v1
+; GFX8-NEXT: .p2align
; GFX8-NEXT: .LBB12_1: ; %atomicrmw.start
; GFX8-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX8-NEXT: s_waitcnt vmcnt(0)
@@ -1674,6 +1710,7 @@ define void @buffer_fat_ptr_agent_atomic_fmax_noret_f32__amdgpu_no_fine_grained_
; GFX942-NEXT: buffer_load_dword v1, v2, s[0:3], 0 offen
; GFX942-NEXT: s_mov_b64 s[4:5], 0
; GFX942-NEXT: v_max_f32_e32 v3, v0, v0
+; GFX942-NEXT: .p2align
; GFX942-NEXT: .LBB13_1: ; %atomicrmw.start
; GFX942-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX942-NEXT: s_waitcnt vmcnt(0)
@@ -1722,6 +1759,7 @@ define void @buffer_fat_ptr_agent_atomic_fmax_noret_f32__amdgpu_no_fine_grained_
; GFX90A-NEXT: buffer_load_dword v1, v2, s[16:19], 0 offen
; GFX90A-NEXT: s_mov_b64 s[4:5], 0
; GFX90A-NEXT: v_max_f32_e32 v3, v0, v0
+; GFX90A-NEXT: .p2align
; GFX90A-NEXT: .LBB13_1: ; %atomicrmw.start
; GFX90A-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX90A-NEXT: s_waitcnt vmcnt(0)
@@ -1747,6 +1785,7 @@ define void @buffer_fat_ptr_agent_atomic_fmax_noret_f32__amdgpu_no_fine_grained_
; GFX908-NEXT: buffer_load_dword v1, v2, s[16:19], 0 offen
; GFX908-NEXT: s_mov_b64 s[4:5], 0
; GFX908-NEXT: v_max_f32_e32 v3, v0, v0
+; GFX908-NEXT: .p2align
; GFX908-NEXT: .LBB13_1: ; %atomicrmw.start
; GFX908-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX908-NEXT: s_waitcnt vmcnt(0)
@@ -1773,6 +1812,7 @@ define void @buffer_fat_ptr_agent_atomic_fmax_noret_f32__amdgpu_no_fine_grained_
; GFX8-NEXT: buffer_load_dword v1, v2, s[16:19], 0 offen
; GFX8-NEXT: s_mov_b64 s[4:5], 0
; GFX8-NEXT: v_mul_f32_e32 v3, 1.0, v0
+; GFX8-NEXT: .p2align
; GFX8-NEXT: .LBB13_1: ; %atomicrmw.start
; GFX8-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX8-NEXT: s_waitcnt vmcnt(0)
@@ -1817,6 +1857,7 @@ define double @buffer_fat_ptr_agent_atomic_fmax_ret_f64__amdgpu_no_fine_grained_
; GFX12-NEXT: s_mov_b32 s4, 0
; GFX12-NEXT: buffer_load_b64 v[0:1], v6, s[0:3], null offen
; GFX12-NEXT: v_max_num_f64_e32 v[4:5], v[2:3], v[2:3]
+; GFX12-NEXT: .p2align
; GFX12-NEXT: .LBB14_1: ; %atomicrmw.start
; GFX12-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-NEXT: s_wait_loadcnt 0x0
@@ -1859,6 +1900,7 @@ define double @buffer_fat_ptr_agent_atomic_fmax_ret_f64__amdgpu_no_fine_grained_
; GFX11-NEXT: s_mov_b32 s4, 0
; GFX11-NEXT: buffer_loa...
[truncated]
|
Gfx9 CK test results: overall, there is up to a 3% performance gain from alignment on average. The flag is also now being used for the branch-relaxation-gfx10-branch-offset-bug.ll lit test; otherwise .p2align changes the conditions of the test for targets with the branch_3f_offset_bug. (Attached: results without the patch for gfx1030, and with the patch.) |
|
I cannot reason about a 3% difference from a very small set of tests; the test coverage must be increased significantly. You also need to factor in that an inner loop may be nested 10 levels deep, which is not unusual for MIOpen, and in that case the inner loop tends to be small. |
Can you please elaborate on this? All small loops up to 3 cache lines (192 bytes) are aligned, nested or not. Innermost loops larger than 3 cache lines are also aligned. |
When you have a deep loop nest and align the inner loop it automatically grows the size of all outer loops. |
Yes, true, though we are already doing that for loops between 1 and 3 cache lines on gfx10/gfx11. |
|
triton-aiter.xlsx I see two options.
I would like reviewer suggestions. |
The hit is really big; I do not think we can afford it. Also, the current code mentions that pre-gfx10 targets do not benefit from the alignment but are affected by it. |
|
Abandoning the PR |
Align small loops aggressively to 32 bytes and larger loops to 16 bytes.