Conversation

@PankajDwivedi-25
Contributor

@PankajDwivedi-25 PankajDwivedi-25 commented Nov 24, 2025

Reference issue: ROCm#67

Add a backend option to expand waitcnt instructions into sequences with decreasing counter values to support PC-sampling profiling. This allows profilers to identify which specific memory operation is causing a stall.

When enabled, instead of emitting a single waitcnt, the pass generates a sequence that waits for each outstanding operation individually. For example, if there are 5 outstanding SMEM operations and the target is to wait until 2 remain:

Original:
s_waitcnt lgkmcnt(2)

Expanded:
s_waitcnt lgkmcnt(4)
s_waitcnt lgkmcnt(3)
s_waitcnt lgkmcnt(2)

The expansion starts from (Outstanding - 1) down to the target value, since waitcnt(Outstanding) would be a no-op (the counter is already at that value).

  • Uses ScoreBrackets to determine the actual number of outstanding operations
  • Only expands when there are operations to wait for (Outstanding > Target)
  • Memory operations may complete out-of-order, so profiling results are hints, not guarantees
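The expansion rule described above can be sketched as follows. This is an illustrative helper, not code from this patch; the name `expansionSequence` is invented here:

```cpp
#include <vector>

// Hypothetical sketch of the expansion rule: with `Outstanding` operations
// in flight and a target counter value of `Target`, emit waits for
// Outstanding - 1 down to Target. A wait for Outstanding itself would be a
// no-op (the counter is already at that value), so it is skipped; if
// nothing is outstanding beyond the target, no expansion happens.
std::vector<unsigned> expansionSequence(unsigned Outstanding, unsigned Target) {
  std::vector<unsigned> Seq;
  for (unsigned V = Outstanding; V-- > Target;)
    Seq.push_back(V);
  return Seq;
}
```

For the example above, `expansionSequence(5, 2)` yields {4, 3, 2}, matching the three emitted `s_waitcnt lgkmcnt(...)` instructions.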

Related previous work for reference

@llvmbot llvmbot added backend:AMDGPU clang:driver 'clang' and 'clang++' user-facing binaries. Not 'clang-cl' labels Nov 24, 2025
@llvmbot
Member

llvmbot commented Nov 24, 2025

@llvm/pr-subscribers-clang-driver

@llvm/pr-subscribers-backend-amdgpu

Author: Pankaj Dwivedi (PankajDwivedi-25)

Changes

This is an initial attempt to understand and implement the requirement in issue ROCm#67.

Implemented the compiler option -mamdgpu-expand-waitcnt-profiling to expand s_waitcnt instructions with non-zero counter values into sequences for PC-sampling profiling.
Simple test cases are optimized to the point of generating waitcnt(0), so the expansion logic is not exercised; I still need clarity on how to write a test that produces non-zero counter values.

Related previous work for reference

  • PR: llvm/llvm-project#79236 (related -amdgpu-waitcnt-forcezero)

Patch is 21.02 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/169345.diff

7 Files Affected:

  • (modified) clang/include/clang/Driver/Options.td (+4-1)
  • (modified) clang/lib/Driver/ToolChains/AMDGPU.cpp (+4)
  • (modified) clang/test/Driver/amdgpu-features.c (+6)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPU.td (+4)
  • (modified) llvm/lib/Target/AMDGPU/GCNSubtarget.h (+5)
  • (modified) llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp (+165)
  • (added) llvm/test/CodeGen/AMDGPU/expand-waitcnt-profiling.ll (+239)
diff --git a/clang/include/clang/Driver/Options.td b/clang/include/clang/Driver/Options.td
index 11e81e032d5fc..c0ba716484b6a 100644
--- a/clang/include/clang/Driver/Options.td
+++ b/clang/include/clang/Driver/Options.td
@@ -5497,7 +5497,10 @@ defm wavefrontsize64 : SimpleMFlag<"wavefrontsize64",
   " mode (AMDGPU only)">;
 defm amdgpu_precise_memory_op
     : SimpleMFlag<"amdgpu-precise-memory-op", "Enable", "Disable",
-                  " precise memory mode (AMDGPU only)">;
+                  " precise memory mode (AMDGPU only)", m_amdgpu_Features_Group>;
+defm amdgpu_expand_waitcnt_profiling
+    : SimpleMFlag<"amdgpu-expand-waitcnt-profiling", "Enable", "Disable",
+                  " waitcnt expansion for profiling (AMDGPU only)", m_amdgpu_Features_Group>;
 
 def munsafe_fp_atomics : Flag<["-"], "munsafe-fp-atomics">,
   Visibility<[ClangOption, CC1Option, FlangOption, FC1Option]>, Alias<fatomic_ignore_denormal_mode>;
diff --git a/clang/lib/Driver/ToolChains/AMDGPU.cpp b/clang/lib/Driver/ToolChains/AMDGPU.cpp
index 1a243fef9532d..f4ddb48c9abc6 100644
--- a/clang/lib/Driver/ToolChains/AMDGPU.cpp
+++ b/clang/lib/Driver/ToolChains/AMDGPU.cpp
@@ -700,6 +700,10 @@ void amdgpu::getAMDGPUTargetFeatures(const Driver &D,
                    options::OPT_mno_amdgpu_precise_memory_op, false))
     Features.push_back("+precise-memory");
 
+  if (Args.hasFlag(options::OPT_mamdgpu_expand_waitcnt_profiling,
+                   options::OPT_mno_amdgpu_expand_waitcnt_profiling, false))
+    Features.push_back("+expand-waitcnt-profiling");
+
   handleTargetFeaturesGroup(D, Triple, Args, Features,
                             options::OPT_m_amdgpu_Features_Group);
 }
diff --git a/clang/test/Driver/amdgpu-features.c b/clang/test/Driver/amdgpu-features.c
index 864744db203e9..16b3f4121ab7a 100644
--- a/clang/test/Driver/amdgpu-features.c
+++ b/clang/test/Driver/amdgpu-features.c
@@ -38,3 +38,9 @@
 
 // RUN: %clang -### -target amdgcn -mcpu=gfx1010 -mno-amdgpu-precise-memory-op %s 2>&1 | FileCheck --check-prefix=NO-PREC-MEM %s
 // NO-PREC-MEM-NOT: {{".*precise-memory"}}
+
+// RUN: %clang -### -target amdgcn -mcpu=gfx900 -mamdgpu-expand-waitcnt-profiling %s 2>&1 | FileCheck --check-prefix=EXPAND-WAITCNT %s
+// EXPAND-WAITCNT: "-target-feature" "+expand-waitcnt-profiling"
+
+// RUN: %clang -### -target amdgcn -mcpu=gfx900 -mno-amdgpu-expand-waitcnt-profiling %s 2>&1 | FileCheck --check-prefix=NO-EXPAND-WAITCNT %s
+// NO-EXPAND-WAITCNT-NOT: "{{[+]}}expand-waitcnt-profiling"
diff --git a/llvm/lib/Target/AMDGPU/AMDGPU.td b/llvm/lib/Target/AMDGPU/AMDGPU.td
index 54d94b1f8682e..3f9166f48ea22 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPU.td
+++ b/llvm/lib/Target/AMDGPU/AMDGPU.td
@@ -223,6 +223,10 @@ def FeaturePreciseMemory
     : SubtargetFeature<"precise-memory", "EnablePreciseMemory",
                        "true", "Enable precise memory mode">;
 
+def FeatureExpandWaitcntProfiling
+    : SubtargetFeature<"expand-waitcnt-profiling", "EnableExpandWaitcntProfiling",
+                       "true", "Expand waitcnt instructions for profiling">;
+
 def FeatureSGPRInitBug : SubtargetFeature<"sgpr-init-bug",
   "SGPRInitBug",
   "true",
diff --git a/llvm/lib/Target/AMDGPU/GCNSubtarget.h b/llvm/lib/Target/AMDGPU/GCNSubtarget.h
index f377b8aaf1333..f2b885a790f41 100644
--- a/llvm/lib/Target/AMDGPU/GCNSubtarget.h
+++ b/llvm/lib/Target/AMDGPU/GCNSubtarget.h
@@ -90,6 +90,7 @@ class GCNSubtarget final : public AMDGPUGenSubtargetInfo,
   bool EnableCuMode = false;
   bool TrapHandler = false;
   bool EnablePreciseMemory = false;
+  bool EnableExpandWaitcntProfiling = false;
 
   // Used as options.
   bool EnableLoadStoreOpt = false;
@@ -674,6 +675,10 @@ class GCNSubtarget final : public AMDGPUGenSubtargetInfo,
 
   bool isPreciseMemoryEnabled() const { return EnablePreciseMemory; }
 
+  bool isExpandWaitcntProfilingEnabled() const {
+    return EnableExpandWaitcntProfiling;
+  }
+
   bool hasFlatAddressSpace() const {
     return FlatAddressSpace;
   }
diff --git a/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp b/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
index b7fa899678ec7..4a70479358bad 100644
--- a/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
@@ -494,6 +494,16 @@ class SIInsertWaitcnts {
   bool isVMEMOrFlatVMEM(const MachineInstr &MI) const;
   bool run(MachineFunction &MF);
 
+  // Methods for expanding waitcnt instructions for profiling
+  bool expandWaitcntsForProfiling(MachineFunction &MF);
+  bool expandSingleWaitcnt(MachineInstr &MI, MachineBasicBlock &MBB);
+  bool expandSingleCounterWait(MachineInstr &MI, MachineBasicBlock &MBB,
+                               InstCounterType CT);
+  bool expandCounterSequence(MachineBasicBlock &MBB,
+                             MachineBasicBlock::iterator InsertPos,
+                             InstCounterType CT, unsigned CountValue,
+                             DebugLoc DL);
+
   void setForceEmitWaitcnt() {
 // For non-debug builds, ForceEmitWaitcnt has been initialized to false;
 // For debug builds, get the debug counter info and adjust if need be
@@ -2725,6 +2735,156 @@ SIInsertWaitcntsPass::run(MachineFunction &MF,
       .preserve<AAManager>();
 }
 
+/// Expand waitcnt instructions for profiling by inserting a sequence of
+/// decreasing counter values. This helps identify which specific memory
+/// operation is a bottleneck during PC sampling.
+bool SIInsertWaitcnts::expandWaitcntsForProfiling(MachineFunction &MF) {
+  if (!ST->isExpandWaitcntProfilingEnabled())
+    return false;
+
+  bool Modified = false;
+
+  // Iterate through all basic blocks
+  for (MachineBasicBlock &MBB : MF) {
+    for (auto I = MBB.begin(), E = MBB.end(); I != E;) {
+      MachineInstr &MI = *I;
+      ++I; // Advance iterator before potential expansion
+
+      if (ST->hasExtendedWaitCounts()) {
+        // GFX12+: Handle separate wait instructions
+        if (auto CT = counterTypeForInstr(MI.getOpcode())) {
+          Modified |= expandSingleCounterWait(MI, MBB, *CT);
+        }
+      } else {
+        // Pre-GFX12: Handle combined S_WAITCNT
+        if (MI.getOpcode() == AMDGPU::S_WAITCNT) {
+          Modified |= expandSingleWaitcnt(MI, MBB);
+        }
+      }
+    }
+  }
+
+  return Modified;
+}
+
+/// Expand a single S_WAITCNT instruction (pre-GFX12)
+bool SIInsertWaitcnts::expandSingleWaitcnt(MachineInstr &MI,
+                                           MachineBasicBlock &MBB) {
+  assert(MI.getOpcode() == AMDGPU::S_WAITCNT);
+
+  // Decode the waitcnt immediate
+  unsigned Imm = MI.getOperand(0).getImm();
+  AMDGPU::IsaVersion IV = AMDGPU::getIsaVersion(ST->getCPU());
+  AMDGPU::Waitcnt Wait = AMDGPU::decodeWaitcnt(IV, Imm);
+
+  // Insert expanded waitcnts BEFORE the original instruction
+  auto InsertPos = MI.getIterator();
+  DebugLoc DL = MI.getDebugLoc();
+
+  bool Modified = false;
+
+  // Expand each counter independently
+  // For independent counters (Case 2 from requirements):
+  // vmcnt and lgkmcnt can be separated
+  Modified |= expandCounterSequence(MBB, InsertPos, LOAD_CNT, Wait.LoadCnt, DL);
+  Modified |= expandCounterSequence(MBB, InsertPos, DS_CNT, Wait.DsCnt, DL);
+  Modified |= expandCounterSequence(MBB, InsertPos, EXP_CNT, Wait.ExpCnt, DL);
+  Modified |=
+      expandCounterSequence(MBB, InsertPos, STORE_CNT, Wait.StoreCnt, DL);
+
+  // If we expanded anything, remove the original waitcnt
+  if (Modified) {
+    MI.eraseFromParent();
+  }
+
+  return Modified;
+}
+
+/// Expand a single counter wait instruction (GFX12+)
+bool SIInsertWaitcnts::expandSingleCounterWait(MachineInstr &MI,
+                                               MachineBasicBlock &MBB,
+                                               InstCounterType CT) {
+  // Get the counter value from the instruction
+  unsigned CountValue = MI.getOperand(0).getImm();
+
+  // Insert expanded waitcnts BEFORE the original instruction
+  auto InsertPos = MI.getIterator();
+  DebugLoc DL = MI.getDebugLoc();
+
+  bool Modified = expandCounterSequence(MBB, InsertPos, CT, CountValue, DL);
+
+  // If we expanded, remove the original instruction
+  if (Modified) {
+    MI.eraseFromParent();
+  }
+
+  return Modified;
+}
+
+/// Insert a sequence of wait instructions with decreasing counter values
+bool SIInsertWaitcnts::expandCounterSequence(
+    MachineBasicBlock &MBB, MachineBasicBlock::iterator InsertPos,
+    InstCounterType CT, unsigned CountValue, DebugLoc DL) {
+  // Skip if counter is already at zero, not active, or at max (wait not needed)
+  if (CountValue == 0 || CountValue == ~0u)
+    return false;
+
+  unsigned MaxCount = getWaitCountMax(CT);
+  if (CountValue >= MaxCount)
+    return false;
+
+  bool Modified = false;
+
+  // Generate decreasing sequence: CountValue-1, CountValue-2, ..., 1, 0
+  // We start from CountValue-1 because the original waitcnt already handles
+  // CountValue
+  for (int i = CountValue - 1; i >= 0; --i) {
+    if (ST->hasExtendedWaitCounts()) {
+      // GFX12+: Use separate wait instructions
+      unsigned Opcode = instrsForExtendedCounterTypes[CT];
+      BuildMI(MBB, InsertPos, DL, TII->get(Opcode)).addImm(i);
+    } else {
+      // Pre-GFX12: Use combined S_WAITCNT with only this counter set
+      AMDGPU::Waitcnt Wait;
+      switch (CT) {
+      case LOAD_CNT:
+        Wait.LoadCnt = i;
+        break;
+      case DS_CNT:
+        Wait.DsCnt = i;
+        break;
+      case EXP_CNT:
+        Wait.ExpCnt = i;
+        break;
+      case STORE_CNT:
+        Wait.StoreCnt = i;
+        break;
+      case SAMPLE_CNT:
+        Wait.SampleCnt = i;
+        break;
+      case BVH_CNT:
+        Wait.BvhCnt = i;
+        break;
+      case KM_CNT:
+        Wait.KmCnt = i;
+        break;
+      case X_CNT:
+        Wait.XCnt = i;
+        break;
+      default:
+        break;
+      }
+
+      AMDGPU::IsaVersion IV = AMDGPU::getIsaVersion(ST->getCPU());
+      unsigned Enc = AMDGPU::encodeWaitcnt(IV, Wait);
+      BuildMI(MBB, InsertPos, DL, TII->get(AMDGPU::S_WAITCNT)).addImm(Enc);
+    }
+    Modified = true;
+  }
+
+  return Modified;
+}
+
 bool SIInsertWaitcnts::run(MachineFunction &MF) {
   ST = &MF.getSubtarget<GCNSubtarget>();
   TII = ST->getInstrInfo();
@@ -2963,5 +3123,10 @@ bool SIInsertWaitcnts::run(MachineFunction &MF) {
   PreheadersToFlush.clear();
   SLoadAddresses.clear();
 
+  // Expand waitcnts for profiling if requested
+  if (ST->isExpandWaitcntProfilingEnabled()) {
+    Modified |= expandWaitcntsForProfiling(MF);
+  }
+
   return Modified;
 }
diff --git a/llvm/test/CodeGen/AMDGPU/expand-waitcnt-profiling.ll b/llvm/test/CodeGen/AMDGPU/expand-waitcnt-profiling.ll
new file mode 100644
index 0000000000000..cc99c457677ad
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/expand-waitcnt-profiling.ll
@@ -0,0 +1,239 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
+; RUN: llc -mtriple=amdgcn -mcpu=gfx900 -mattr=+expand-waitcnt-profiling -verify-machineinstrs < %s | FileCheck --check-prefix=EXPAND %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx900 -mattr=-expand-waitcnt-profiling -verify-machineinstrs < %s | FileCheck --check-prefix=NOEXPAND %s
+
+; NOTE: These simple test cases are optimized to generate waitcnt(0) by the
+; time values are needed. The expansion feature correctly does NOT expand waitcnt(0).
+
+; Pattern: Multiple scalar loads that increment lgkmcnt, followed by use
+; Expected on real kernels with non-zero lgkmcnt:
+;   WITHOUT expansion: s_waitcnt lgkmcnt(0)
+;   WITH expansion:    s_waitcnt lgkmcnt(2)
+;                      s_waitcnt lgkmcnt(1)
+;                      s_waitcnt lgkmcnt(0)
+
+define amdgpu_kernel void @case1_single_counter_lgkmcnt(
+; EXPAND-LABEL: case1_single_counter_lgkmcnt:
+; EXPAND:       ; %bb.0:
+; EXPAND-NEXT:    s_load_dwordx8 s[8:15], s[4:5], 0x24
+; EXPAND-NEXT:    v_mov_b32_e32 v0, 0
+; EXPAND-NEXT:    s_waitcnt lgkmcnt(0)
+; EXPAND-NEXT:    s_load_dword s0, s[8:9], 0x0
+; EXPAND-NEXT:    s_load_dword s1, s[10:11], 0x0
+; EXPAND-NEXT:    s_load_dword s2, s[12:13], 0x0
+; EXPAND-NEXT:    s_waitcnt lgkmcnt(0)
+; EXPAND-NEXT:    s_add_i32 s0, s0, s1
+; EXPAND-NEXT:    s_add_i32 s0, s0, s2
+; EXPAND-NEXT:    v_mov_b32_e32 v1, s0
+; EXPAND-NEXT:    global_store_dword v0, v1, s[14:15]
+; EXPAND-NEXT:    s_endpgm
+;
+; NOEXPAND-LABEL: case1_single_counter_lgkmcnt:
+; NOEXPAND:       ; %bb.0:
+; NOEXPAND-NEXT:    s_load_dwordx8 s[8:15], s[4:5], 0x24
+; NOEXPAND-NEXT:    v_mov_b32_e32 v0, 0
+; NOEXPAND-NEXT:    s_waitcnt lgkmcnt(0)
+; NOEXPAND-NEXT:    s_load_dword s0, s[8:9], 0x0
+; NOEXPAND-NEXT:    s_load_dword s1, s[10:11], 0x0
+; NOEXPAND-NEXT:    s_load_dword s2, s[12:13], 0x0
+; NOEXPAND-NEXT:    s_waitcnt lgkmcnt(0)
+; NOEXPAND-NEXT:    s_add_i32 s0, s0, s1
+; NOEXPAND-NEXT:    s_add_i32 s0, s0, s2
+; NOEXPAND-NEXT:    v_mov_b32_e32 v1, s0
+; NOEXPAND-NEXT:    global_store_dword v0, v1, s[14:15]
+; NOEXPAND-NEXT:    s_endpgm
+    ptr addrspace(4) %ptr_a,
+    ptr addrspace(4) %ptr_b,
+    ptr addrspace(4) %ptr_c,
+    ptr addrspace(1) %out) {
+  ; Three scalar loads - increment lgkmcnt
+  %val_a = load i32, ptr addrspace(4) %ptr_a, align 4
+  %val_b = load i32, ptr addrspace(4) %ptr_b, align 4
+  %val_c = load i32, ptr addrspace(4) %ptr_c, align 4
+
+  ; Use all three values
+  %sum1 = add i32 %val_a, %val_b
+  %sum2 = add i32 %sum1, %val_c
+
+  store i32 %sum2, ptr addrspace(1) %out, align 4
+  ret void
+}
+
+; Pattern: Global load (vmcnt) and scalar load (lgkmcnt) can be separated
+; Expected on real kernels with non-zero counters:
+;   WITHOUT expansion: s_waitcnt vmcnt(0) lgkmcnt(0)
+;   WITH expansion:    s_waitcnt vmcnt(0)
+;                      s_waitcnt lgkmcnt(0)
+
+define amdgpu_kernel void @case2_independent_counters(
+; EXPAND-LABEL: case2_independent_counters:
+; EXPAND:       ; %bb.0:
+; EXPAND-NEXT:    s_load_dwordx4 s[0:3], s[4:5], 0x24
+; EXPAND-NEXT:    s_load_dwordx2 s[6:7], s[4:5], 0x34
+; EXPAND-NEXT:    v_mov_b32_e32 v0, 0
+; EXPAND-NEXT:    s_waitcnt lgkmcnt(0)
+; EXPAND-NEXT:    s_load_dword s4, s[0:1], 0x0
+; EXPAND-NEXT:    s_load_dword s5, s[2:3], 0x0
+; EXPAND-NEXT:    s_waitcnt lgkmcnt(0)
+; EXPAND-NEXT:    s_add_i32 s0, s4, s5
+; EXPAND-NEXT:    v_mov_b32_e32 v1, s0
+; EXPAND-NEXT:    global_store_dword v0, v1, s[6:7]
+; EXPAND-NEXT:    s_endpgm
+;
+; NOEXPAND-LABEL: case2_independent_counters:
+; NOEXPAND:       ; %bb.0:
+; NOEXPAND-NEXT:    s_load_dwordx4 s[0:3], s[4:5], 0x24
+; NOEXPAND-NEXT:    s_load_dwordx2 s[6:7], s[4:5], 0x34
+; NOEXPAND-NEXT:    v_mov_b32_e32 v0, 0
+; NOEXPAND-NEXT:    s_waitcnt lgkmcnt(0)
+; NOEXPAND-NEXT:    s_load_dword s4, s[0:1], 0x0
+; NOEXPAND-NEXT:    s_load_dword s5, s[2:3], 0x0
+; NOEXPAND-NEXT:    s_waitcnt lgkmcnt(0)
+; NOEXPAND-NEXT:    s_add_i32 s0, s4, s5
+; NOEXPAND-NEXT:    v_mov_b32_e32 v1, s0
+; NOEXPAND-NEXT:    global_store_dword v0, v1, s[6:7]
+; NOEXPAND-NEXT:    s_endpgm
+    ptr addrspace(1) %global_ptr,
+    ptr addrspace(4) %scalar_ptr,
+    ptr addrspace(1) %out) {
+  ; Global memory load - increments vmcnt
+  %global_val = load i32, ptr addrspace(1) %global_ptr, align 4
+
+  ; Scalar memory load - increments lgkmcnt
+  %scalar_val = load i32, ptr addrspace(4) %scalar_ptr, align 4
+
+  ; Use both values - compiler must wait for both counters
+  %result = add i32 %global_val, %scalar_val
+
+  store i32 %result, ptr addrspace(1) %out, align 4
+  ret void
+}
+
+; Pattern: Multiple buffer stores followed by a load (all affect vmcnt)
+; Expected on real kernels with many stores (e.g., 12 stores):
+;   WITHOUT expansion: s_waitcnt vmcnt(0)
+;   WITH expansion:    s_waitcnt vmcnt(11)
+;                      s_waitcnt vmcnt(10)
+;                      ...
+;                      s_waitcnt vmcnt(1)
+;                      s_waitcnt vmcnt(0)
+
+define amdgpu_kernel void @case3_overlapping_counters(
+; EXPAND-LABEL: case3_overlapping_counters:
+; EXPAND:       ; %bb.0:
+; EXPAND-NEXT:    s_load_dwordx4 s[0:3], s[4:5], 0x24
+; EXPAND-NEXT:    s_load_dwordx2 s[6:7], s[4:5], 0x34
+; EXPAND-NEXT:    v_mov_b32_e32 v0, 0
+; EXPAND-NEXT:    v_mov_b32_e32 v1, 1
+; EXPAND-NEXT:    v_mov_b32_e32 v2, 2
+; EXPAND-NEXT:    s_waitcnt lgkmcnt(0)
+; EXPAND-NEXT:    global_store_dword v0, v1, s[0:1]
+; EXPAND-NEXT:    s_waitcnt vmcnt(0)
+; EXPAND-NEXT:    global_store_dword v0, v2, s[0:1] offset:4
+; EXPAND-NEXT:    s_waitcnt vmcnt(0)
+; EXPAND-NEXT:    global_store_dword v0, v1, s[0:1] offset:8
+; EXPAND-NEXT:    s_waitcnt vmcnt(0)
+; EXPAND-NEXT:    global_store_dword v0, v2, s[0:1] offset:12
+; EXPAND-NEXT:    s_waitcnt vmcnt(0)
+; EXPAND-NEXT:    global_store_dword v0, v1, s[0:1] offset:16
+; EXPAND-NEXT:    s_waitcnt vmcnt(0)
+; EXPAND-NEXT:    global_store_dword v0, v2, s[0:1] offset:20
+; EXPAND-NEXT:    s_waitcnt vmcnt(0)
+; EXPAND-NEXT:    global_store_dword v0, v1, s[0:1] offset:24
+; EXPAND-NEXT:    s_waitcnt vmcnt(0)
+; EXPAND-NEXT:    global_store_dword v0, v2, s[0:1] offset:28
+; EXPAND-NEXT:    s_waitcnt vmcnt(0)
+; EXPAND-NEXT:    global_store_dword v0, v1, s[0:1] offset:32
+; EXPAND-NEXT:    s_waitcnt vmcnt(0)
+; EXPAND-NEXT:    global_store_dword v0, v2, s[0:1] offset:36
+; EXPAND-NEXT:    s_waitcnt vmcnt(0)
+; EXPAND-NEXT:    global_store_dword v0, v1, s[0:1] offset:40
+; EXPAND-NEXT:    s_waitcnt vmcnt(0)
+; EXPAND-NEXT:    global_store_dword v0, v2, s[0:1] offset:44
+; EXPAND-NEXT:    s_waitcnt vmcnt(0)
+; EXPAND-NEXT:    s_add_u32 s2, s2, s6
+; EXPAND-NEXT:    s_addc_u32 s3, s3, s7
+; EXPAND-NEXT:    global_load_dword v1, v0, s[2:3] glc
+; EXPAND-NEXT:    s_waitcnt vmcnt(0)
+; EXPAND-NEXT:    global_store_dword v0, v1, s[0:1] offset:48
+; EXPAND-NEXT:    s_waitcnt vmcnt(0)
+; EXPAND-NEXT:    s_endpgm
+;
+; NOEXPAND-LABEL: case3_overlapping_counters:
+; NOEXPAND:       ; %bb.0:
+; NOEXPAND-NEXT:    s_load_dwordx4 s[0:3], s[4:5], 0x24
+; NOEXPAND-NEXT:    s_load_dwordx2 s[6:7], s[4:5], 0x34
+; NOEXPAND-NEXT:    v_mov_b32_e32 v0, 0
+; NOEXPAND-NEXT:    v_mov_b32_e32 v1, 1
+; NOEXPAND-NEXT:    v_mov_b32_e32 v2, 2
+; NOEXPAND-NEXT:    s_waitcnt lgkmcnt(0)
+; NOEXPAND-NEXT:    global_store_dword v0, v1, s[0:1]
+; NOEXPAND-NEXT:    s_waitcnt vmcnt(0)
+; NOEXPAND-NEXT:    global_store_dword v0, v2, s[0:1] offset:4
+; NOEXPAND-NEXT:    s_waitcnt vmcnt(0)
+; NOEXPAND-NEXT:    global_store_dword v0, v1, s[0:1] offset:8
+; NOEXPAND-NEXT:    s_waitcnt vmcnt(0)
+; NOEXPAND-NEXT:    global_store_dword v0, v2, s[0:1] offset:12
+; NOEXPAND-NEXT:    s_waitcnt vmcnt(0)
+; NOEXPAND-NEXT:    global_store_dword v0, v1, s[0:1] offset:16
+; NOEXPAND-NEXT:    s_waitcnt vmcnt(0)
+; NOEXPAND-NEXT:    global_store_dword v0, v2, s[0:1] offset:20
+; NOEXPAND-NEXT:    s_waitcnt vmcnt(0)
+; NOEXPAND-NEXT:    global_store_dword v0, v1, s[0:1] offset:24
+; NOEXPAND-NEXT:    s_waitcnt vmcnt(0)
+; NOEXPAND-NEXT:    global_store_dword v0, v2, s[0:1] offset:28
+; NOEXPAND-NEXT:    s_waitcnt vmcnt(0)
+; NOEXPAND-NEXT:    global_store_dword v0, v1, s[0:1] offset:32
+; NOEXPAND-NEXT:    s_waitcnt vmcnt(0)
+; NOEXPAND-NEXT:    global_store_dword v0, v2, s[0:1] offset:36
+; NOEXPAND-NEXT:    s_waitcnt vmcnt(0)
+; NOEXPAND-NEXT:    global_store_dword v0, v1, s[0:1] offset:40
+; NOEXPAND-NEXT:    s_waitcnt vmcnt(0)
+; NOEXPAND-NEXT:    global_store_dword v0, v2, s[0:1] offset:44
+; NOEXPAND-NEXT:    s_waitcnt vmcnt(0)
+; NOEXPAND-NEXT:    s_add_u32 s2, s2, s6
+; NOEXPAND-NEXT:    s_addc_u32 s3, s3, s7
+; NOEXPAND-NEXT:    global_load_dword v1, v0, s[2:3] glc
+; NOEXPAND-NEXT:    s_waitcnt vmcnt(0)
+; NOEXPAND-NEXT:    global_store_dword v0, v1, s[0:1] offset:48
+; NOEXPAND-NEXT:    s_waitcnt vmcnt(0)
+; NOEXPAND-NEXT:    s_endpgm
+    ptr addrspace(1) %buf,
+    ptr addrspace(1) %data,
+    i64 %offset) {
+  ; Issue 12 stores to buffer - each increments vmcnt
+  %ptr0 = getelementptr i32, ptr addrspace(1) %buf, i64 0
+  store volatile i32 1, ptr addrspace(1) %ptr0, align 4
+  %ptr1 = getelementptr i32, ptr addrspace(1) %buf, i64 1
+  store volatile i32 2, ptr addrspace(1) %ptr1, align 4
+  %ptr2 = getelementptr i32, ptr addrspace(1) %buf, i64 2
+  store volatile i32 1, ptr addrspace(1) %ptr2, align 4
+  %ptr3 = getelementptr i32, ptr addrspace(1) %buf, i64 3
+  store volatile i32 2, ptr addrspace(1) %ptr3, align 4
+  %ptr4 = getelementptr i32, ptr addrspace(1) %buf, i64 4
+  store volatile i32 1, ptr addrspace(1) %ptr4, align 4
+  %ptr5 = getelementptr i32, ptr addrspace(1) %buf, i64 5
+  store volatile i32 2, ptr addrspace(1) %ptr5, align 4
+  %ptr6 = getelementptr i32, ptr addrspa...
[truncated]

@jayfoad
Contributor

jayfoad commented Nov 24, 2025

Implemented compiler option -mamdgpu-expand-waitcnt-profiling to expand s_waitcnt instructions with non-zero counter values into sequences for PC-sampling profiling.

Why would you restrict this to "non-zero counter values"?

Why does this need a new subtarget feature?

@PankajDwivedi-25
Contributor Author

Why would you restrict this to "non-zero counter values"?

When a waitcnt already has a zero counter value, expanding it would just generate another waitcnt(0), which provides no additional profiling granularity.
If you believe there's a use case for expanding waitcnt(0), I'd be happy to discuss it.

Why does this need a new subtarget feature?

On this, I am still not sure what the best approach would be. Is there any suggestion from you?

@jayfoad
Contributor

jayfoad commented Nov 24, 2025

Why would you restrict this to "non-zero counter values"?

When a waitcnt already has a zero counter value, expanding it would just generate another waitcnt(0), which provides no additional profiling granularity. If you believe there's a use case for expanding waitcnt(0), I'd be happy to discuss it.

The requirement is that instead of emitting e.g. s_waitcnt vmcnt(2) you should emit e.g.:

s_waitcnt vmcnt(4)
s_waitcnt vmcnt(3)
s_waitcnt vmcnt(2)

The starting value "4" here assumes that SIInsertWaitcnts already knows that the upper bound on this counter's value is 5, so 4 is the highest value you can wait for that will have any effect.

Similarly instead of s_waitcnt vmcnt(0) you should emit:

s_waitcnt vmcnt(4)
s_waitcnt vmcnt(3)
s_waitcnt vmcnt(2)
s_waitcnt vmcnt(1)
s_waitcnt vmcnt(0)
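Both sequences follow one rule. A minimal illustrative sketch (the helper name `expandedWaits` is invented here, not taken from the patch):

```cpp
#include <string>
#include <vector>

// Illustrative sketch only (not code from this PR): render the expanded
// wait sequence for one counter, given the known upper bound on its value
// and the target value of the original wait. UpperBound - 1 is the highest
// value worth waiting for, since a wait for UpperBound itself has no effect.
std::vector<std::string> expandedWaits(const std::string &Counter,
                                       unsigned UpperBound, unsigned Target) {
  std::vector<std::string> Lines;
  for (unsigned V = UpperBound; V-- > Target;)
    Lines.push_back("s_waitcnt " + Counter + "(" + std::to_string(V) + ")");
  return Lines;
}
```

With an upper bound of 5, a target of 2 yields the three-line sequence above, and a target of 0 yields the full five-line sequence.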

Why does this need a new subtarget feature?

On this, I am still not sure what the best approach would be. Is there any suggestion from you?

I don't know why you added a subtarget feature. I would suggest not adding it -- unless there is some reason for it that I am missing.

@github-actions

🐧 Linux x64 Test Results

  • 3053 tests passed
  • 7 tests skipped

All tests passed but another part of the build failed. Click on a failure below to see the details.

lib/Target/AMDGPU/CMakeFiles/LLVMAMDGPUCodeGen.dir/SIInsertWaitcnts.cpp.o
FAILED: lib/Target/AMDGPU/CMakeFiles/LLVMAMDGPUCodeGen.dir/SIInsertWaitcnts.cpp.o
sccache /opt/llvm/bin/clang++ -DGTEST_HAS_RTTI=0 -D_DEBUG -D_GLIBCXX_ASSERTIONS -D_GLIBCXX_USE_CXX11_ABI=1 -D_GNU_SOURCE -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -I/home/gha/actions-runner/_work/llvm-project/llvm-project/build/lib/Target/AMDGPU -I/home/gha/actions-runner/_work/llvm-project/llvm-project/llvm/lib/Target/AMDGPU -I/home/gha/actions-runner/_work/llvm-project/llvm-project/build/include -I/home/gha/actions-runner/_work/llvm-project/llvm-project/llvm/include -gmlt -fPIC -fno-semantic-interposition -fvisibility-inlines-hidden -Werror -Werror=date-time -Werror=unguarded-availability-new -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -pedantic -Wno-long-long -Wc++98-compat-extra-semi -Wimplicit-fallthrough -Wcovered-switch-default -Wno-noexcept-type -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wsuggest-override -Wstring-conversion -Wno-pass-failed -Wmisleading-indentation -Wctad-maybe-unsupported -fdiagnostics-color -ffunction-sections -fdata-sections -O3 -DNDEBUG -std=c++17 -fvisibility=hidden  -fno-exceptions -funwind-tables -fno-rtti -UNDEBUG -MD -MT lib/Target/AMDGPU/CMakeFiles/LLVMAMDGPUCodeGen.dir/SIInsertWaitcnts.cpp.o -MF lib/Target/AMDGPU/CMakeFiles/LLVMAMDGPUCodeGen.dir/SIInsertWaitcnts.cpp.o.d -o lib/Target/AMDGPU/CMakeFiles/LLVMAMDGPUCodeGen.dir/SIInsertWaitcnts.cpp.o -c /home/gha/actions-runner/_work/llvm-project/llvm-project/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
/home/gha/actions-runner/_work/llvm-project/llvm-project/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp:2748:12: error: no member named 'isExpandWaitcntProfilingEnabled' in 'llvm::GCNSubtarget'
2748 |   if (!ST->isExpandWaitcntProfilingEnabled())
|        ~~  ^
/home/gha/actions-runner/_work/llvm-project/llvm-project/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp:3133:11: error: no member named 'isExpandWaitcntProfilingEnabled' in 'llvm::GCNSubtarget'
3133 |   if (ST->isExpandWaitcntProfilingEnabled()) {
|       ~~  ^
2 errors generated.

If these failures are unrelated to your changes (for example tests are broken or flaky at HEAD), please open an issue at https://github.com/llvm/llvm-project/issues and add the infrastructure label.

@PankajDwivedi-25
Contributor Author

Why would you restrict this to "non-zero counter values"?

When a waitcnt already has a zero counter value, expanding it would just generate another waitcnt(0), which provides no additional profiling granularity. If you believe there's a use case for expanding waitcnt(0), I'd be happy to discuss it.

The requirement is that instead of emitting e.g. s_waitcnt vmcnt(2) you should emit e.g.:

s_waitcnt vmcnt(4)
s_waitcnt vmcnt(3)
s_waitcnt vmcnt(2)

The starting value "4" here assumes that SIInsertWaitcnts already knows that the upper bound on this counter's value is 5, so 4 is the highest value you can wait for that will have any effect.

Similarly instead of s_waitcnt vmcnt(0) you should emit:

s_waitcnt vmcnt(4)
s_waitcnt vmcnt(3)
s_waitcnt vmcnt(2)
s_waitcnt vmcnt(1)
s_waitcnt vmcnt(0)

It looks like the outstanding value is always 0 in these cases.

cl::desc("Force all waitcnt load counters to wait until 0"),
cl::init(false), cl::Hidden);

static cl::opt<bool> ExpandWaitcntProfiling(
Contributor


cl::opts should be reserved for debugging, if clang is going to set this it should not be a cl::opt

Contributor Author

@PankajDwivedi-25 PankajDwivedi-25 Nov 24, 2025


I think it is for debugging purposes only. I've removed all the driver integration. This is similar to existing profiling flags such as -amdgpu-waitcnt-forcezero.

Is cl::opt acceptable for this use case, or would you prefer a function attribute (per-kernel control)?

@PankajDwivedi-25 PankajDwivedi-25 force-pushed the users/Pankajdwivedi-25/Compiler_option_for_expanding_s_waitcnt_instructions branch from 5242cfc to 0745347 Compare November 25, 2025 17:02
@PankajDwivedi-25 PankajDwivedi-25 force-pushed the users/Pankajdwivedi-25/Compiler_option_for_expanding_s_waitcnt_instructions branch from 0745347 to a83972b Compare November 25, 2025 17:27
@Pierre-vh
Contributor

I'm still not sure how this works with out of order counters. This is only useful if you have multiple loads of the same type issued back-to-back and you're trying to find the problematic one, correct?

This allows profilers to identify which specific memory operation is causing a stall.

Memory operations to different addresses can complete in any order, even if the waitcnt is decremented in a fixed order. I'm not sure how useful this is in practice.

…ation

- Fix getWaitCountMax() to use correct bitmasks based on architecture:
  - Pre-GFX12: Use getVmcntBitMask/getLgkmcntBitMask for LOAD_CNT/DS_CNT
  - GFX12+: Use getLoadcntBitMask/getDscntBitMask for LOAD_CNT/DS_CNT
- Refactor repetitive if-blocks for LOAD_CNT, DS_CNT, EXP_CNT into
  a single loop using getCounterRef helper function
- Fix X_CNT to return proper getXcntBitMask(IV) instead of 0
@PankajDwivedi-25
Contributor Author

PankajDwivedi-25 commented Nov 26, 2025

I'm still not sure how this works with out of order counters. This is only useful if you have multiple loads of the same type issued back-to-back and you're trying to find the problematic one, correct?

Yes.

@PankajDwivedi-25
Contributor Author

PankajDwivedi-25 commented Nov 26, 2025

Memory operations to different addresses can complete in any order, even if the waitcnt is decremented in a fixed order. I'm not sure how useful this is in practice.

You're correct that if operations complete internally out of order, the last one to complete might not be the slowest. However, the information is still useful:

  • It tells you which group of operations (in program order) is still pending.
  • It is more granular than a single waitcnt(0), which provides no insight.
  • It can serve as a hint, given that real-world programs will be much bigger.

@PankajDwivedi-25
Contributor Author

I'm still not sure how this works with out of order counters.

I see the point: you mean that events sharing the same counter might finish out of order, right? This would cause an issue, because I haven't added any check for that here.

Is it problematic, or is it still useful for profiling purposes?

One quick fix that comes to mind is to skip expanding out-of-order events. I am also exploring whether we can check that they are really different events, rather than just returning true if more than one bit is set. @Pierre-vh

@jayfoad
Contributor

jayfoad commented Nov 27, 2025

skip expanding out of order events

Yes that's the right thing to do.

@jayfoad
Contributor

jayfoad commented Nov 27, 2025

Memory operations to different addresses can complete in any order, even if the waitcnt is decremented in a fixed order. I'm not sure how useful this is in practice.

"the waitcnt is decremented in a fixed order" is all you need: that allows you to associate each s_waitcnt instruction with a corresponding load instruction (at least in straight line code). I can see how that would be useful for profiling.
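The association can be illustrated with a tiny helper (hypothetical, for straight-line code; the name `loadForWait` is invented here):

```cpp
// Illustrative only: assuming N loads counted by the same counter are
// issued back-to-back in program order, and the counter decrements in
// issue order, the expanded wait s_waitcnt <cnt>(i) returns once the load
// at 0-based program-order position N - 1 - i has completed -- even if the
// data itself arrives out of order.
unsigned loadForWait(unsigned NumLoads, unsigned WaitValue) {
  return NumLoads - 1 - WaitValue;
}
```

So with 5 loads, a PC sample landing on `s_waitcnt vmcnt(4)` points at the first load, and one on `s_waitcnt vmcnt(0)` points at the last.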

@PankajDwivedi-25
Contributor Author

Memory operations to different addresses can complete in any order, even if the waitcnt is decremented in a fixed order. I'm not sure how useful this is in practice.

"the waitcnt is decremented in a fixed order" is all you need: that allows you to associate each s_waitcnt instruction with a corresponding load instruction (at least in straight line code). I can see how that would be useful for profiling.

Thank you so much for the hint, I am still going through details. Let me see how I can address this.
