[AMDGPU] Implement Waitcnt Expansion for Profiling #169345
Conversation
@llvm/pr-subscribers-clang-driver @llvm/pr-subscribers-backend-amdgpu

Author: Pankaj Dwivedi (PankajDwivedi-25)

Changes: This is an initial attempt to understand and meet the requirement in issue ROCm#67. Implemented the compiler option `-mamdgpu-expand-waitcnt-profiling`. Related previous work is linked below for reference.
Patch is 21.02 KiB, truncated to 20.00 KiB below; full version: https://github.com/llvm/llvm-project/pull/169345.diff

7 Files Affected:
diff --git a/clang/include/clang/Driver/Options.td b/clang/include/clang/Driver/Options.td
index 11e81e032d5fc..c0ba716484b6a 100644
--- a/clang/include/clang/Driver/Options.td
+++ b/clang/include/clang/Driver/Options.td
@@ -5497,7 +5497,10 @@ defm wavefrontsize64 : SimpleMFlag<"wavefrontsize64",
" mode (AMDGPU only)">;
defm amdgpu_precise_memory_op
: SimpleMFlag<"amdgpu-precise-memory-op", "Enable", "Disable",
- " precise memory mode (AMDGPU only)">;
+ " precise memory mode (AMDGPU only)", m_amdgpu_Features_Group>;
+defm amdgpu_expand_waitcnt_profiling
+ : SimpleMFlag<"amdgpu-expand-waitcnt-profiling", "Enable", "Disable",
+ " waitcnt expansion for profiling (AMDGPU only)", m_amdgpu_Features_Group>;
def munsafe_fp_atomics : Flag<["-"], "munsafe-fp-atomics">,
Visibility<[ClangOption, CC1Option, FlangOption, FC1Option]>, Alias<fatomic_ignore_denormal_mode>;
diff --git a/clang/lib/Driver/ToolChains/AMDGPU.cpp b/clang/lib/Driver/ToolChains/AMDGPU.cpp
index 1a243fef9532d..f4ddb48c9abc6 100644
--- a/clang/lib/Driver/ToolChains/AMDGPU.cpp
+++ b/clang/lib/Driver/ToolChains/AMDGPU.cpp
@@ -700,6 +700,10 @@ void amdgpu::getAMDGPUTargetFeatures(const Driver &D,
options::OPT_mno_amdgpu_precise_memory_op, false))
Features.push_back("+precise-memory");
+ if (Args.hasFlag(options::OPT_mamdgpu_expand_waitcnt_profiling,
+ options::OPT_mno_amdgpu_expand_waitcnt_profiling, false))
+ Features.push_back("+expand-waitcnt-profiling");
+
handleTargetFeaturesGroup(D, Triple, Args, Features,
options::OPT_m_amdgpu_Features_Group);
}
diff --git a/clang/test/Driver/amdgpu-features.c b/clang/test/Driver/amdgpu-features.c
index 864744db203e9..16b3f4121ab7a 100644
--- a/clang/test/Driver/amdgpu-features.c
+++ b/clang/test/Driver/amdgpu-features.c
@@ -38,3 +38,9 @@
// RUN: %clang -### -target amdgcn -mcpu=gfx1010 -mno-amdgpu-precise-memory-op %s 2>&1 | FileCheck --check-prefix=NO-PREC-MEM %s
// NO-PREC-MEM-NOT: {{".*precise-memory"}}
+
+// RUN: %clang -### -target amdgcn -mcpu=gfx900 -mamdgpu-expand-waitcnt-profiling %s 2>&1 | FileCheck --check-prefix=EXPAND-WAITCNT %s
+// EXPAND-WAITCNT: "-target-feature" "+expand-waitcnt-profiling"
+
+// RUN: %clang -### -target amdgcn -mcpu=gfx900 -mno-amdgpu-expand-waitcnt-profiling %s 2>&1 | FileCheck --check-prefix=NO-EXPAND-WAITCNT %s
+// NO-EXPAND-WAITCNT-NOT: "{{[+]}}expand-waitcnt-profiling"
diff --git a/llvm/lib/Target/AMDGPU/AMDGPU.td b/llvm/lib/Target/AMDGPU/AMDGPU.td
index 54d94b1f8682e..3f9166f48ea22 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPU.td
+++ b/llvm/lib/Target/AMDGPU/AMDGPU.td
@@ -223,6 +223,10 @@ def FeaturePreciseMemory
: SubtargetFeature<"precise-memory", "EnablePreciseMemory",
"true", "Enable precise memory mode">;
+def FeatureExpandWaitcntProfiling
+ : SubtargetFeature<"expand-waitcnt-profiling", "EnableExpandWaitcntProfiling",
+ "true", "Expand waitcnt instructions for profiling">;
+
def FeatureSGPRInitBug : SubtargetFeature<"sgpr-init-bug",
"SGPRInitBug",
"true",
diff --git a/llvm/lib/Target/AMDGPU/GCNSubtarget.h b/llvm/lib/Target/AMDGPU/GCNSubtarget.h
index f377b8aaf1333..f2b885a790f41 100644
--- a/llvm/lib/Target/AMDGPU/GCNSubtarget.h
+++ b/llvm/lib/Target/AMDGPU/GCNSubtarget.h
@@ -90,6 +90,7 @@ class GCNSubtarget final : public AMDGPUGenSubtargetInfo,
bool EnableCuMode = false;
bool TrapHandler = false;
bool EnablePreciseMemory = false;
+ bool EnableExpandWaitcntProfiling = false;
// Used as options.
bool EnableLoadStoreOpt = false;
@@ -674,6 +675,10 @@ class GCNSubtarget final : public AMDGPUGenSubtargetInfo,
bool isPreciseMemoryEnabled() const { return EnablePreciseMemory; }
+ bool isExpandWaitcntProfilingEnabled() const {
+ return EnableExpandWaitcntProfiling;
+ }
+
bool hasFlatAddressSpace() const {
return FlatAddressSpace;
}
diff --git a/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp b/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
index b7fa899678ec7..4a70479358bad 100644
--- a/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
@@ -494,6 +494,16 @@ class SIInsertWaitcnts {
bool isVMEMOrFlatVMEM(const MachineInstr &MI) const;
bool run(MachineFunction &MF);
+ // Methods for expanding waitcnt instructions for profiling
+ bool expandWaitcntsForProfiling(MachineFunction &MF);
+ bool expandSingleWaitcnt(MachineInstr &MI, MachineBasicBlock &MBB);
+ bool expandSingleCounterWait(MachineInstr &MI, MachineBasicBlock &MBB,
+ InstCounterType CT);
+ bool expandCounterSequence(MachineBasicBlock &MBB,
+ MachineBasicBlock::iterator InsertPos,
+ InstCounterType CT, unsigned CountValue,
+ DebugLoc DL);
+
void setForceEmitWaitcnt() {
// For non-debug builds, ForceEmitWaitcnt has been initialized to false;
// For debug builds, get the debug counter info and adjust if need be
@@ -2725,6 +2735,156 @@ SIInsertWaitcntsPass::run(MachineFunction &MF,
.preserve<AAManager>();
}
+/// Expand waitcnt instructions for profiling by inserting a sequence of
+/// decreasing counter values. This helps identify which specific memory
+/// operation is a bottleneck during PC sampling.
+bool SIInsertWaitcnts::expandWaitcntsForProfiling(MachineFunction &MF) {
+ if (!ST->isExpandWaitcntProfilingEnabled())
+ return false;
+
+ bool Modified = false;
+
+ // Iterate through all basic blocks
+ for (MachineBasicBlock &MBB : MF) {
+ for (auto I = MBB.begin(), E = MBB.end(); I != E;) {
+ MachineInstr &MI = *I;
+ ++I; // Advance iterator before potential expansion
+
+ if (ST->hasExtendedWaitCounts()) {
+ // GFX12+: Handle separate wait instructions
+ if (auto CT = counterTypeForInstr(MI.getOpcode())) {
+ Modified |= expandSingleCounterWait(MI, MBB, *CT);
+ }
+ } else {
+ // Pre-GFX12: Handle combined S_WAITCNT
+ if (MI.getOpcode() == AMDGPU::S_WAITCNT) {
+ Modified |= expandSingleWaitcnt(MI, MBB);
+ }
+ }
+ }
+ }
+
+ return Modified;
+}
+
+/// Expand a single S_WAITCNT instruction (pre-GFX12)
+bool SIInsertWaitcnts::expandSingleWaitcnt(MachineInstr &MI,
+ MachineBasicBlock &MBB) {
+ assert(MI.getOpcode() == AMDGPU::S_WAITCNT);
+
+ // Decode the waitcnt immediate
+ unsigned Imm = MI.getOperand(0).getImm();
+ AMDGPU::IsaVersion IV = AMDGPU::getIsaVersion(ST->getCPU());
+ AMDGPU::Waitcnt Wait = AMDGPU::decodeWaitcnt(IV, Imm);
+
+ // Insert expanded waitcnts BEFORE the original instruction
+ auto InsertPos = MI.getIterator();
+ DebugLoc DL = MI.getDebugLoc();
+
+ bool Modified = false;
+
+ // Expand each counter independently
+ // For independent counters (Case 2 from requirements):
+ // vmcnt and lgkmcnt can be separated
+ Modified |= expandCounterSequence(MBB, InsertPos, LOAD_CNT, Wait.LoadCnt, DL);
+ Modified |= expandCounterSequence(MBB, InsertPos, DS_CNT, Wait.DsCnt, DL);
+ Modified |= expandCounterSequence(MBB, InsertPos, EXP_CNT, Wait.ExpCnt, DL);
+ Modified |=
+ expandCounterSequence(MBB, InsertPos, STORE_CNT, Wait.StoreCnt, DL);
+
+ // If we expanded anything, remove the original waitcnt
+ if (Modified) {
+ MI.eraseFromParent();
+ }
+
+ return Modified;
+}
+
+/// Expand a single counter wait instruction (GFX12+)
+bool SIInsertWaitcnts::expandSingleCounterWait(MachineInstr &MI,
+ MachineBasicBlock &MBB,
+ InstCounterType CT) {
+ // Get the counter value from the instruction
+ unsigned CountValue = MI.getOperand(0).getImm();
+
+ // Insert expanded waitcnts BEFORE the original instruction
+ auto InsertPos = MI.getIterator();
+ DebugLoc DL = MI.getDebugLoc();
+
+ bool Modified = expandCounterSequence(MBB, InsertPos, CT, CountValue, DL);
+
+ // If we expanded, remove the original instruction
+ if (Modified) {
+ MI.eraseFromParent();
+ }
+
+ return Modified;
+}
+
+/// Insert a sequence of wait instructions with decreasing counter values
+bool SIInsertWaitcnts::expandCounterSequence(
+ MachineBasicBlock &MBB, MachineBasicBlock::iterator InsertPos,
+ InstCounterType CT, unsigned CountValue, DebugLoc DL) {
+ // Skip if counter is already at zero, not active, or at max (wait not needed)
+ if (CountValue == 0 || CountValue == ~0u)
+ return false;
+
+ unsigned MaxCount = getWaitCountMax(CT);
+ if (CountValue >= MaxCount)
+ return false;
+
+ bool Modified = false;
+
+ // Generate decreasing sequence: CountValue-1, CountValue-2, ..., 1, 0
+ // We start from CountValue-1 because the original waitcnt already handles
+ // CountValue
+ for (int i = CountValue - 1; i >= 0; --i) {
+ if (ST->hasExtendedWaitCounts()) {
+ // GFX12+: Use separate wait instructions
+ unsigned Opcode = instrsForExtendedCounterTypes[CT];
+ BuildMI(MBB, InsertPos, DL, TII->get(Opcode)).addImm(i);
+ } else {
+ // Pre-GFX12: Use combined S_WAITCNT with only this counter set
+ AMDGPU::Waitcnt Wait;
+ switch (CT) {
+ case LOAD_CNT:
+ Wait.LoadCnt = i;
+ break;
+ case DS_CNT:
+ Wait.DsCnt = i;
+ break;
+ case EXP_CNT:
+ Wait.ExpCnt = i;
+ break;
+ case STORE_CNT:
+ Wait.StoreCnt = i;
+ break;
+ case SAMPLE_CNT:
+ Wait.SampleCnt = i;
+ break;
+ case BVH_CNT:
+ Wait.BvhCnt = i;
+ break;
+ case KM_CNT:
+ Wait.KmCnt = i;
+ break;
+ case X_CNT:
+ Wait.XCnt = i;
+ break;
+ default:
+ break;
+ }
+
+ AMDGPU::IsaVersion IV = AMDGPU::getIsaVersion(ST->getCPU());
+ unsigned Enc = AMDGPU::encodeWaitcnt(IV, Wait);
+ BuildMI(MBB, InsertPos, DL, TII->get(AMDGPU::S_WAITCNT)).addImm(Enc);
+ }
+ Modified = true;
+ }
+
+ return Modified;
+}
+
bool SIInsertWaitcnts::run(MachineFunction &MF) {
ST = &MF.getSubtarget<GCNSubtarget>();
TII = ST->getInstrInfo();
@@ -2963,5 +3123,10 @@ bool SIInsertWaitcnts::run(MachineFunction &MF) {
PreheadersToFlush.clear();
SLoadAddresses.clear();
+ // Expand waitcnts for profiling if requested
+ if (ST->isExpandWaitcntProfilingEnabled()) {
+ Modified |= expandWaitcntsForProfiling(MF);
+ }
+
return Modified;
}
diff --git a/llvm/test/CodeGen/AMDGPU/expand-waitcnt-profiling.ll b/llvm/test/CodeGen/AMDGPU/expand-waitcnt-profiling.ll
new file mode 100644
index 0000000000000..cc99c457677ad
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/expand-waitcnt-profiling.ll
@@ -0,0 +1,239 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
+; RUN: llc -mtriple=amdgcn -mcpu=gfx900 -mattr=+expand-waitcnt-profiling -verify-machineinstrs < %s | FileCheck --check-prefix=EXPAND %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx900 -mattr=-expand-waitcnt-profiling -verify-machineinstrs < %s | FileCheck --check-prefix=NOEXPAND %s
+
+; NOTE: These simple test cases are optimized to generate waitcnt(0) by the
+; time values are needed. The expansion feature correctly does NOT expand waitcnt(0).
+
+; Pattern: Multiple scalar loads that increment lgkmcnt, followed by use
+; Expected on real kernels with non-zero lgkmcnt:
+; WITHOUT expansion: s_waitcnt lgkmcnt(0)
+; WITH expansion: s_waitcnt lgkmcnt(2)
+; s_waitcnt lgkmcnt(1)
+; s_waitcnt lgkmcnt(0)
+
+define amdgpu_kernel void @case1_single_counter_lgkmcnt(
+; EXPAND-LABEL: case1_single_counter_lgkmcnt:
+; EXPAND: ; %bb.0:
+; EXPAND-NEXT: s_load_dwordx8 s[8:15], s[4:5], 0x24
+; EXPAND-NEXT: v_mov_b32_e32 v0, 0
+; EXPAND-NEXT: s_waitcnt lgkmcnt(0)
+; EXPAND-NEXT: s_load_dword s0, s[8:9], 0x0
+; EXPAND-NEXT: s_load_dword s1, s[10:11], 0x0
+; EXPAND-NEXT: s_load_dword s2, s[12:13], 0x0
+; EXPAND-NEXT: s_waitcnt lgkmcnt(0)
+; EXPAND-NEXT: s_add_i32 s0, s0, s1
+; EXPAND-NEXT: s_add_i32 s0, s0, s2
+; EXPAND-NEXT: v_mov_b32_e32 v1, s0
+; EXPAND-NEXT: global_store_dword v0, v1, s[14:15]
+; EXPAND-NEXT: s_endpgm
+;
+; NOEXPAND-LABEL: case1_single_counter_lgkmcnt:
+; NOEXPAND: ; %bb.0:
+; NOEXPAND-NEXT: s_load_dwordx8 s[8:15], s[4:5], 0x24
+; NOEXPAND-NEXT: v_mov_b32_e32 v0, 0
+; NOEXPAND-NEXT: s_waitcnt lgkmcnt(0)
+; NOEXPAND-NEXT: s_load_dword s0, s[8:9], 0x0
+; NOEXPAND-NEXT: s_load_dword s1, s[10:11], 0x0
+; NOEXPAND-NEXT: s_load_dword s2, s[12:13], 0x0
+; NOEXPAND-NEXT: s_waitcnt lgkmcnt(0)
+; NOEXPAND-NEXT: s_add_i32 s0, s0, s1
+; NOEXPAND-NEXT: s_add_i32 s0, s0, s2
+; NOEXPAND-NEXT: v_mov_b32_e32 v1, s0
+; NOEXPAND-NEXT: global_store_dword v0, v1, s[14:15]
+; NOEXPAND-NEXT: s_endpgm
+ ptr addrspace(4) %ptr_a,
+ ptr addrspace(4) %ptr_b,
+ ptr addrspace(4) %ptr_c,
+ ptr addrspace(1) %out) {
+ ; Three scalar loads - increment lgkmcnt
+ %val_a = load i32, ptr addrspace(4) %ptr_a, align 4
+ %val_b = load i32, ptr addrspace(4) %ptr_b, align 4
+ %val_c = load i32, ptr addrspace(4) %ptr_c, align 4
+
+ ; Use all three values
+ %sum1 = add i32 %val_a, %val_b
+ %sum2 = add i32 %sum1, %val_c
+
+ store i32 %sum2, ptr addrspace(1) %out, align 4
+ ret void
+}
+
+; Pattern: Global load (vmcnt) and scalar load (lgkmcnt) can be separated
+; Expected on real kernels with non-zero counters:
+; WITHOUT expansion: s_waitcnt vmcnt(0) lgkmcnt(0)
+; WITH expansion: s_waitcnt vmcnt(0)
+; s_waitcnt lgkmcnt(0)
+
+define amdgpu_kernel void @case2_independent_counters(
+; EXPAND-LABEL: case2_independent_counters:
+; EXPAND: ; %bb.0:
+; EXPAND-NEXT: s_load_dwordx4 s[0:3], s[4:5], 0x24
+; EXPAND-NEXT: s_load_dwordx2 s[6:7], s[4:5], 0x34
+; EXPAND-NEXT: v_mov_b32_e32 v0, 0
+; EXPAND-NEXT: s_waitcnt lgkmcnt(0)
+; EXPAND-NEXT: s_load_dword s4, s[0:1], 0x0
+; EXPAND-NEXT: s_load_dword s5, s[2:3], 0x0
+; EXPAND-NEXT: s_waitcnt lgkmcnt(0)
+; EXPAND-NEXT: s_add_i32 s0, s4, s5
+; EXPAND-NEXT: v_mov_b32_e32 v1, s0
+; EXPAND-NEXT: global_store_dword v0, v1, s[6:7]
+; EXPAND-NEXT: s_endpgm
+;
+; NOEXPAND-LABEL: case2_independent_counters:
+; NOEXPAND: ; %bb.0:
+; NOEXPAND-NEXT: s_load_dwordx4 s[0:3], s[4:5], 0x24
+; NOEXPAND-NEXT: s_load_dwordx2 s[6:7], s[4:5], 0x34
+; NOEXPAND-NEXT: v_mov_b32_e32 v0, 0
+; NOEXPAND-NEXT: s_waitcnt lgkmcnt(0)
+; NOEXPAND-NEXT: s_load_dword s4, s[0:1], 0x0
+; NOEXPAND-NEXT: s_load_dword s5, s[2:3], 0x0
+; NOEXPAND-NEXT: s_waitcnt lgkmcnt(0)
+; NOEXPAND-NEXT: s_add_i32 s0, s4, s5
+; NOEXPAND-NEXT: v_mov_b32_e32 v1, s0
+; NOEXPAND-NEXT: global_store_dword v0, v1, s[6:7]
+; NOEXPAND-NEXT: s_endpgm
+ ptr addrspace(1) %global_ptr,
+ ptr addrspace(4) %scalar_ptr,
+ ptr addrspace(1) %out) {
+ ; Global memory load - increments vmcnt
+ %global_val = load i32, ptr addrspace(1) %global_ptr, align 4
+
+ ; Scalar memory load - increments lgkmcnt
+ %scalar_val = load i32, ptr addrspace(4) %scalar_ptr, align 4
+
+ ; Use both values - compiler must wait for both counters
+ %result = add i32 %global_val, %scalar_val
+
+ store i32 %result, ptr addrspace(1) %out, align 4
+ ret void
+}
+
+; Pattern: Multiple buffer stores followed by a load (all affect vmcnt)
+; Expected on real kernels with many stores (e.g., 12 stores):
+; WITHOUT expansion: s_waitcnt vmcnt(0)
+; WITH expansion: s_waitcnt vmcnt(11)
+; s_waitcnt vmcnt(10)
+; ...
+; s_waitcnt vmcnt(1)
+; s_waitcnt vmcnt(0)
+
+define amdgpu_kernel void @case3_overlapping_counters(
+; EXPAND-LABEL: case3_overlapping_counters:
+; EXPAND: ; %bb.0:
+; EXPAND-NEXT: s_load_dwordx4 s[0:3], s[4:5], 0x24
+; EXPAND-NEXT: s_load_dwordx2 s[6:7], s[4:5], 0x34
+; EXPAND-NEXT: v_mov_b32_e32 v0, 0
+; EXPAND-NEXT: v_mov_b32_e32 v1, 1
+; EXPAND-NEXT: v_mov_b32_e32 v2, 2
+; EXPAND-NEXT: s_waitcnt lgkmcnt(0)
+; EXPAND-NEXT: global_store_dword v0, v1, s[0:1]
+; EXPAND-NEXT: s_waitcnt vmcnt(0)
+; EXPAND-NEXT: global_store_dword v0, v2, s[0:1] offset:4
+; EXPAND-NEXT: s_waitcnt vmcnt(0)
+; EXPAND-NEXT: global_store_dword v0, v1, s[0:1] offset:8
+; EXPAND-NEXT: s_waitcnt vmcnt(0)
+; EXPAND-NEXT: global_store_dword v0, v2, s[0:1] offset:12
+; EXPAND-NEXT: s_waitcnt vmcnt(0)
+; EXPAND-NEXT: global_store_dword v0, v1, s[0:1] offset:16
+; EXPAND-NEXT: s_waitcnt vmcnt(0)
+; EXPAND-NEXT: global_store_dword v0, v2, s[0:1] offset:20
+; EXPAND-NEXT: s_waitcnt vmcnt(0)
+; EXPAND-NEXT: global_store_dword v0, v1, s[0:1] offset:24
+; EXPAND-NEXT: s_waitcnt vmcnt(0)
+; EXPAND-NEXT: global_store_dword v0, v2, s[0:1] offset:28
+; EXPAND-NEXT: s_waitcnt vmcnt(0)
+; EXPAND-NEXT: global_store_dword v0, v1, s[0:1] offset:32
+; EXPAND-NEXT: s_waitcnt vmcnt(0)
+; EXPAND-NEXT: global_store_dword v0, v2, s[0:1] offset:36
+; EXPAND-NEXT: s_waitcnt vmcnt(0)
+; EXPAND-NEXT: global_store_dword v0, v1, s[0:1] offset:40
+; EXPAND-NEXT: s_waitcnt vmcnt(0)
+; EXPAND-NEXT: global_store_dword v0, v2, s[0:1] offset:44
+; EXPAND-NEXT: s_waitcnt vmcnt(0)
+; EXPAND-NEXT: s_add_u32 s2, s2, s6
+; EXPAND-NEXT: s_addc_u32 s3, s3, s7
+; EXPAND-NEXT: global_load_dword v1, v0, s[2:3] glc
+; EXPAND-NEXT: s_waitcnt vmcnt(0)
+; EXPAND-NEXT: global_store_dword v0, v1, s[0:1] offset:48
+; EXPAND-NEXT: s_waitcnt vmcnt(0)
+; EXPAND-NEXT: s_endpgm
+;
+; NOEXPAND-LABEL: case3_overlapping_counters:
+; NOEXPAND: ; %bb.0:
+; NOEXPAND-NEXT: s_load_dwordx4 s[0:3], s[4:5], 0x24
+; NOEXPAND-NEXT: s_load_dwordx2 s[6:7], s[4:5], 0x34
+; NOEXPAND-NEXT: v_mov_b32_e32 v0, 0
+; NOEXPAND-NEXT: v_mov_b32_e32 v1, 1
+; NOEXPAND-NEXT: v_mov_b32_e32 v2, 2
+; NOEXPAND-NEXT: s_waitcnt lgkmcnt(0)
+; NOEXPAND-NEXT: global_store_dword v0, v1, s[0:1]
+; NOEXPAND-NEXT: s_waitcnt vmcnt(0)
+; NOEXPAND-NEXT: global_store_dword v0, v2, s[0:1] offset:4
+; NOEXPAND-NEXT: s_waitcnt vmcnt(0)
+; NOEXPAND-NEXT: global_store_dword v0, v1, s[0:1] offset:8
+; NOEXPAND-NEXT: s_waitcnt vmcnt(0)
+; NOEXPAND-NEXT: global_store_dword v0, v2, s[0:1] offset:12
+; NOEXPAND-NEXT: s_waitcnt vmcnt(0)
+; NOEXPAND-NEXT: global_store_dword v0, v1, s[0:1] offset:16
+; NOEXPAND-NEXT: s_waitcnt vmcnt(0)
+; NOEXPAND-NEXT: global_store_dword v0, v2, s[0:1] offset:20
+; NOEXPAND-NEXT: s_waitcnt vmcnt(0)
+; NOEXPAND-NEXT: global_store_dword v0, v1, s[0:1] offset:24
+; NOEXPAND-NEXT: s_waitcnt vmcnt(0)
+; NOEXPAND-NEXT: global_store_dword v0, v2, s[0:1] offset:28
+; NOEXPAND-NEXT: s_waitcnt vmcnt(0)
+; NOEXPAND-NEXT: global_store_dword v0, v1, s[0:1] offset:32
+; NOEXPAND-NEXT: s_waitcnt vmcnt(0)
+; NOEXPAND-NEXT: global_store_dword v0, v2, s[0:1] offset:36
+; NOEXPAND-NEXT: s_waitcnt vmcnt(0)
+; NOEXPAND-NEXT: global_store_dword v0, v1, s[0:1] offset:40
+; NOEXPAND-NEXT: s_waitcnt vmcnt(0)
+; NOEXPAND-NEXT: global_store_dword v0, v2, s[0:1] offset:44
+; NOEXPAND-NEXT: s_waitcnt vmcnt(0)
+; NOEXPAND-NEXT: s_add_u32 s2, s2, s6
+; NOEXPAND-NEXT: s_addc_u32 s3, s3, s7
+; NOEXPAND-NEXT: global_load_dword v1, v0, s[2:3] glc
+; NOEXPAND-NEXT: s_waitcnt vmcnt(0)
+; NOEXPAND-NEXT: global_store_dword v0, v1, s[0:1] offset:48
+; NOEXPAND-NEXT: s_waitcnt vmcnt(0)
+; NOEXPAND-NEXT: s_endpgm
+ ptr addrspace(1) %buf,
+ ptr addrspace(1) %data,
+ i64 %offset) {
+ ; Issue 12 stores to buffer - each increments vmcnt
+ %ptr0 = getelementptr i32, ptr addrspace(1) %buf, i64 0
+ store volatile i32 1, ptr addrspace(1) %ptr0, align 4
+ %ptr1 = getelementptr i32, ptr addrspace(1) %buf, i64 1
+ store volatile i32 2, ptr addrspace(1) %ptr1, align 4
+ %ptr2 = getelementptr i32, ptr addrspace(1) %buf, i64 2
+ store volatile i32 1, ptr addrspace(1) %ptr2, align 4
+ %ptr3 = getelementptr i32, ptr addrspace(1) %buf, i64 3
+ store volatile i32 2, ptr addrspace(1) %ptr3, align 4
+ %ptr4 = getelementptr i32, ptr addrspace(1) %buf, i64 4
+ store volatile i32 1, ptr addrspace(1) %ptr4, align 4
+ %ptr5 = getelementptr i32, ptr addrspace(1) %buf, i64 5
+ store volatile i32 2, ptr addrspace(1) %ptr5, align 4
+ %ptr6 = getelementptr i32, ptr addrspa...
[truncated]
Why would you restrict this to "non-zero counter values"? Why does this need a new subtarget feature?
When a waitcnt already has a zero counter value, expanding it would just generate another waitcnt(0), which provides no additional profiling granularity.
On this, I am still not fully sure what the best approach would be. Is there any suggestion from you?
The requirement is that instead of emitting e.g. a single `s_waitcnt lgkmcnt(2)`, you emit the sequence `s_waitcnt lgkmcnt(4)`, `s_waitcnt lgkmcnt(3)`, `s_waitcnt lgkmcnt(2)`. The starting value "4" here is assuming that SIInsertWaitcnts already knows that the upper bound on this counter's value is 5, so 4 is the highest value you can wait for that will have any effect. Similarly for the other counters.
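The expansion rule the reviewer describes can be sketched outside of LLVM as a small helper (a hypothetical sketch, not part of the patch; `expandedWaitImms` is an invented name): given the known upper bound on outstanding operations and the target count, the expanded immediates run from `Outstanding - 1` down to `Target`.

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch of the requested expansion rule: with `Outstanding`
// operations pending and a wait target of `Target`, emit one wait per
// immediate from Outstanding-1 down to Target. Waiting for `Outstanding`
// itself would be a no-op, so the sequence starts one below it.
std::vector<unsigned> expandedWaitImms(unsigned Outstanding, unsigned Target) {
  std::vector<unsigned> Imms;
  for (unsigned I = Outstanding; I > Target; --I)
    Imms.push_back(I - 1);
  return Imms;
}
```

For the reviewer's example (upper bound 5, target 2) this yields {4, 3, 2}, matching `s_waitcnt lgkmcnt(4)` through `s_waitcnt lgkmcnt(2)`; when the target already equals the outstanding count, nothing is emitted.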
I don't know why you added a subtarget feature. I would suggest not adding it -- unless there is some reason for it that I am missing.
🐧 Linux x64 Test Results
All tests passed, but another part of the build failed: lib/Target/AMDGPU/CMakeFiles/LLVMAMDGPUCodeGen.dir/SIInsertWaitcnts.cpp.o
Looks like the outstanding value is always 0 in these cases.
    cl::desc("Force all waitcnt load counters to wait until 0"),
    cl::init(false), cl::Hidden);

static cl::opt<bool> ExpandWaitcntProfiling(
cl::opts should be reserved for debugging; if clang is going to set this, it should not be a cl::opt.
I think it is for debugging purposes only? I've removed all driver integration. This is similar to existing profiling flags like -amdgpu-waitcnt-forcezero.
Is cl::opt acceptable for this use case, or would you prefer a function attribute (per-kernel control)?
I'm still not sure how this works with out-of-order counters. This is only useful if you have multiple loads of the same type issued back-to-back and you're trying to find the problematic one, correct?
Memory operations to different addresses can complete in any order, even if the waitcnt is decremented in a fixed order. I'm not sure how useful this is in practice.
…ation
- Fix getWaitCountMax() to use correct bitmasks based on architecture:
  - Pre-GFX12: Use getVmcntBitMask/getLgkmcntBitMask for LOAD_CNT/DS_CNT
  - GFX12+: Use getLoadcntBitMask/getDscntBitMask for LOAD_CNT/DS_CNT
- Refactor repetitive if-blocks for LOAD_CNT, DS_CNT, EXP_CNT into a single loop using the getCounterRef helper function
- Fix X_CNT to return proper getXcntBitMask(IV) instead of 0
You're correct that if operations complete internally out of order, the last one to complete might not be the slowest. However, the information is still useful: it tells you which group of operations (in program order) is still pending, it's more granular than a single waitcnt(0) (which provides no insight), and it can serve as a hint, assuming real-world programs will be much bigger.
I see the point: you mean events which share the same counter might finish out of order, right? This will cause an issue because I haven't added any check for it here. Is it problematic, or is it still useful for profiling purposes? One quick fix that comes to mind is to skip expanding out-of-order events. I'm also exploring whether we can check if they are really different events rather than just returning true if more than one bit is set. @Pierre-vh
Yes, that's the right thing to do.
"the waitcnt is decremented in a fixed order" is all you need: that allows you to associate each s_waitcnt instruction with a corresponding load instruction (at least in straight line code). I can see how that would be useful for profiling. |
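The association described above can be modeled with a small sketch (hypothetical, not LLVM code; `waitToLoadMap` is an invented name): if N same-counter loads are outstanding in straight-line code, the wait for count K retires exactly when load N-1-K in program order completes, since the counter is decremented in issue order.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical model of wait-to-load association for profiling: for NumLoads
// back-to-back loads sharing one counter, pair each expanded wait immediate K
// with the 0-based program-order index of the load whose completion it
// observes. A wait for count K finishes when only K operations remain, i.e.
// when load NumLoads-1-K has retired.
std::vector<std::pair<unsigned, unsigned>> waitToLoadMap(unsigned NumLoads) {
  std::vector<std::pair<unsigned, unsigned>> Map;
  for (unsigned K = NumLoads; K-- > 0;)
    Map.push_back({K, NumLoads - 1 - K});
  return Map;
}
```

This is what makes the expanded sequence useful to a PC sampler: samples landing on the wait with immediate K implicate a specific load, at least while control flow stays straight-line.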
Thank you for the hint; I am still going through the details. Let me see how I can address this.
Reference issue: ROCm#67
Add a backend option to expand waitcnt instructions into sequences with decreasing counter values to support PC-sampling profiling. This allows profilers to identify which specific memory operation is causing a stall.
When enabled, instead of emitting a single waitcnt, the pass generates a sequence that waits for each outstanding operation individually. For example, if there are 5 outstanding SMEM operations and the target is to wait until 2 remain:
Original:
s_waitcnt lgkmcnt(2)
Expanded:
s_waitcnt lgkmcnt(4)
s_waitcnt lgkmcnt(3)
s_waitcnt lgkmcnt(2)
The expansion starts from (Outstanding - 1) down to the target value, since waitcnt(Outstanding) would be a no-op (the counter is already at that value).
Related previous work for reference: -amdgpu-waitcnt-forcezero