[AMDGPU] Fix missing S_WAIT_XCNT with multiple pending VMEMs #166779

jayfoad · 2025-11-06T14:19:53Z

No description provided.

Using different offsets for the two global loads just makes it more obvious that the second one could fail address translation even if the first one succeeds.

This is just stylistic. It allows using the identical condition in the SMEM and VMEM cases and potentially guards against more X_CNT event types being added in future.
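
As a rough illustration of the remark above, a "mixed pending events" check can be written as a test for more than one set bit in a pending-event mask, which is why the same condition can serve both the SMEM and VMEM branches. The sketch below assumes a plain 32-bit mask; the event IDs and the free function `hasMixedPending` are hypothetical stand-ins, not the pass's own `hasMixedPendingEvents` member.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical event IDs for illustration only; the pass has its own enum.
enum XcntEvent : unsigned { SMEM_GROUP = 0, VMEM_GROUP = 1 };

// Returns true if more than one bit is set in the pending-event mask.
// Mask & (Mask - 1) clears the lowest set bit, so the result is non-zero
// exactly when at least two bits were set.
inline bool hasMixedPending(uint32_t Mask) {
  return (Mask & (Mask - 1)) != 0;
}

// Convenience helper (also hypothetical): build a mask from one event ID.
inline uint32_t eventBit(XcntEvent E) { return 1u << E; }
```

Because the check only counts bits, it would keep working unchanged if a third X_CNT event kind were added to the mask.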

llvmbot · 2025-11-06T14:20:24Z

@llvm/pr-subscribers-backend-amdgpu

Author: Jay Foad (jayfoad)

Changes

Full diff: https://github.com/llvm/llvm-project/pull/166779.diff

2 Files Affected:

(modified) llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp (+6-12)
(modified) llvm/test/CodeGen/AMDGPU/wait-xcnt.mir (+5-5)

diff --git a/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp b/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
index b7fa899678ec7..306d59d0867cd 100644
--- a/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
@@ -1291,21 +1291,15 @@ void WaitcntBrackets::applyXcnt(const AMDGPU::Waitcnt &Wait) {
   // On entry to a block with multiple predescessors, there may
   // be pending SMEM and VMEM events active at the same time.
   // In such cases, only clear one active event at a time.
-  auto applyPendingXcntGroup = [this](unsigned E) {
-    unsigned LowerBound = getScoreLB(X_CNT);
-    applyWaitcnt(X_CNT, 0);
-    PendingEvents |= (1 << E);
-    setScoreLB(X_CNT, LowerBound);
-  };
 
   // Wait on XCNT is redundant if we are already waiting for a load to complete.
   // SMEM can return out of order, so only omit XCNT wait if we are waiting till
   // zero.
   if (Wait.KmCnt == 0 && hasPendingEvent(SMEM_GROUP)) {
-    if (hasPendingEvent(VMEM_GROUP))
-      applyPendingXcntGroup(VMEM_GROUP);
-    else
+    if (!hasMixedPendingEvents(X_CNT))
       applyWaitcnt(X_CNT, 0);
+    else
+      PendingEvents &= ~(1 << SMEM_GROUP);
     return;
   }
 
@@ -1314,10 +1308,10 @@ void WaitcntBrackets::applyXcnt(const AMDGPU::Waitcnt &Wait) {
   // decremented to the same number as LOADCnt.
   if (Wait.LoadCnt != ~0u && hasPendingEvent(VMEM_GROUP) &&
       !hasPendingEvent(STORE_CNT)) {
-    if (hasPendingEvent(SMEM_GROUP))
-      applyPendingXcntGroup(SMEM_GROUP);
-    else
+    if (!hasMixedPendingEvents(X_CNT))
       applyWaitcnt(X_CNT, std::min(Wait.XCnt, Wait.LoadCnt));
+    else if (Wait.LoadCnt == 0)
+      PendingEvents &= ~(1 << VMEM_GROUP);
     return;
   }
 
diff --git a/llvm/test/CodeGen/AMDGPU/wait-xcnt.mir b/llvm/test/CodeGen/AMDGPU/wait-xcnt.mir
index f964480dcc633..fe16f0d44dd1c 100644
--- a/llvm/test/CodeGen/AMDGPU/wait-xcnt.mir
+++ b/llvm/test/CodeGen/AMDGPU/wait-xcnt.mir
@@ -1069,7 +1069,6 @@ body: |
     $sgpr0 = S_MOV_B32 $sgpr0
 ...
 
-# FIXME: Missing S_WAIT_XCNT before overwriting vgpr0.
 ---
 name: mixed_pending_events
 tracksRegLiveness: true
@@ -1088,8 +1087,8 @@ body: |
   ; GCN-NEXT:   successors: %bb.2(0x80000000)
   ; GCN-NEXT:   liveins: $vgpr0_vgpr1, $sgpr2
   ; GCN-NEXT: {{  $}}
-  ; GCN-NEXT:   $vgpr2 = GLOBAL_LOAD_DWORD $vgpr0_vgpr1, 0, 0, implicit $exec
-  ; GCN-NEXT:   $vgpr3 = GLOBAL_LOAD_DWORD $vgpr0_vgpr1, 0, 0, implicit $exec
+  ; GCN-NEXT:   $vgpr2 = GLOBAL_LOAD_DWORD $vgpr0_vgpr1, 100, 0, implicit $exec
+  ; GCN-NEXT:   $vgpr3 = GLOBAL_LOAD_DWORD $vgpr0_vgpr1, 200, 0, implicit $exec
   ; GCN-NEXT: {{  $}}
   ; GCN-NEXT: bb.2:
   ; GCN-NEXT:   liveins: $sgpr2, $vgpr2
@@ -1098,6 +1097,7 @@ body: |
   ; GCN-NEXT:   $vgpr2 = V_MOV_B32_e32 $vgpr2, implicit $exec
   ; GCN-NEXT:   S_WAIT_KMCNT 0
   ; GCN-NEXT:   $sgpr2 = S_MOV_B32 $sgpr2
+  ; GCN-NEXT:   S_WAIT_XCNT 0
   ; GCN-NEXT:   $vgpr0 = V_MOV_B32_e32 0, implicit $exec
   bb.0:
     liveins: $vgpr0_vgpr1, $sgpr0_sgpr1, $scc
@@ -1105,8 +1105,8 @@ body: |
     S_CBRANCH_SCC1 %bb.2, implicit $scc
   bb.1:
     liveins: $vgpr0_vgpr1, $sgpr2
-    $vgpr2 = GLOBAL_LOAD_DWORD $vgpr0_vgpr1, 0, 0, implicit $exec
-    $vgpr3 = GLOBAL_LOAD_DWORD $vgpr0_vgpr1, 0, 0, implicit $exec
+    $vgpr2 = GLOBAL_LOAD_DWORD $vgpr0_vgpr1, 100, 0, implicit $exec
+    $vgpr3 = GLOBAL_LOAD_DWORD $vgpr0_vgpr1, 200, 0, implicit $exec
   bb.2:
     liveins: $sgpr2, $vgpr2
     $vgpr2 = V_MOV_B32_e32 $vgpr2, implicit $exec

jayfoad · 2025-11-06T14:21:22Z

@RyanRio FYI

jayfoad · 2025-11-06T14:21:48Z

Split into multiple commits for ease of review and to make the final bug-fixing commit as small as possible.

jayfoad · 2025-11-11T14:12:47Z

Eager ping - I tried hard to make this easy to review. The topmost commit should be an obvious bug fix: only clear pending VMEM events if LoadCnt == 0. The other commits are all NFC refactorings leading up to the fix.
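
The rule stated above can be sketched in isolation. This is a minimal model under stated assumptions, not the code in SIInsertWaitcnts.cpp: `XcntModel`, `applyLoadWait`, and the event IDs are hypothetical names, and only the case where SMEM and VMEM events are pending at the same time is modelled.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified stand-ins for the pass's state (not LLVM types).
enum XcntEvent : unsigned { SMEM_GROUP = 0, VMEM_GROUP = 1 };

struct XcntModel {
  uint32_t PendingEvents = 0; // bit E set => events of kind E are outstanding

  bool hasPending(XcntEvent E) const {
    return (PendingEvents & (1u << E)) != 0;
  }

  bool hasMixedPending() const {
    return (PendingEvents & (PendingEvents - 1)) != 0;
  }

  // Models a wait on the load counter while both event kinds are pending.
  // LoadCnt is the counter value being waited for. Returns true if the
  // pending VMEM event was cleared.
  bool applyLoadWait(unsigned LoadCnt) {
    if (!hasPending(VMEM_GROUP) || !hasMixedPending())
      return false; // the non-mixed case is outside this sketch
    if (LoadCnt != 0)
      return false; // later VMEM ops may still be in flight: keep the event
    PendingEvents &= ~(1u << VMEM_GROUP);
    return true;
  }
};
```

With two loads outstanding, a wait that only guarantees the first load has completed (a non-zero LoadCnt) leaves the VMEM event pending in this model, so a later overwrite of the address registers would still call for an S_WAIT_XCNT.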

nhaehnle

Seems reasonable to me.

llvm-ci · 2025-11-12T10:01:03Z

LLVM Buildbot has detected a new failure on builder llvm-clang-aarch64-darwin running on doug-worker-5 while building llvm at step 6 "test-build-unified-tree-check-all".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/190/builds/30802

Here is the relevant piece of the build log for reference:

Step 6 (test-build-unified-tree-check-all) failure: test (failure)
******************** TEST 'LLVM :: ExecutionEngine/OrcLazy/multiple-compile-threads-basic.ll' FAILED ********************
Exit Code: 2

Command Output (stdout):
--
# RUN: at line 1
/Volumes/ExternalSSD/buildbot-root/aarch64-darwin/build/bin/lli -jit-kind=orc-lazy -compile-threads=2 -thread-entry hello /Users/buildbot/buildbot-root2/aarch64-darwin/llvm-project/llvm/test/ExecutionEngine/OrcLazy/multiple-compile-threads-basic.ll | /Volumes/ExternalSSD/buildbot-root/aarch64-darwin/build/bin/FileCheck /Users/buildbot/buildbot-root2/aarch64-darwin/llvm-project/llvm/test/ExecutionEngine/OrcLazy/multiple-compile-threads-basic.ll
# executed command: /Volumes/ExternalSSD/buildbot-root/aarch64-darwin/build/bin/lli -jit-kind=orc-lazy -compile-threads=2 -thread-entry hello /Users/buildbot/buildbot-root2/aarch64-darwin/llvm-project/llvm/test/ExecutionEngine/OrcLazy/multiple-compile-threads-basic.ll
# .---command stderr------------
# | PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace and instructions to reproduce the bug.
# | Stack dump:
# | 0.	Program arguments: /Volumes/ExternalSSD/buildbot-root/aarch64-darwin/build/bin/lli -jit-kind=orc-lazy -compile-threads=2 -thread-entry hello /Users/buildbot/buildbot-root2/aarch64-darwin/llvm-project/llvm/test/ExecutionEngine/OrcLazy/multiple-compile-threads-basic.ll
# |  #0 0x00000001050c4f48 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (/Volumes/ExternalSSD/buildbot-root/aarch64-darwin/build/bin/lli+0x100f38f48)
# |  #1 0x00000001050c2cf8 llvm::sys::RunSignalHandlers() (/Volumes/ExternalSSD/buildbot-root/aarch64-darwin/build/bin/lli+0x100f36cf8)
# |  #2 0x00000001050c5a48 SignalHandler(int, __siginfo*, void*) (/Volumes/ExternalSSD/buildbot-root/aarch64-darwin/build/bin/lli+0x100f39a48)
# |  #3 0x00000001860c3584 (/usr/lib/system/libsystem_platform.dylib+0x18047b584)
# |  #4 0x0000010104c177e0
# |  #5 0x0000000104c2153c llvm::orc::ExecutionSession::removeJITDylibs(std::__1::vector<llvm::IntrusiveRefCntPtr<llvm::orc::JITDylib>, std::__1::allocator<llvm::IntrusiveRefCntPtr<llvm::orc::JITDylib>>>) (/Volumes/ExternalSSD/buildbot-root/aarch64-darwin/build/bin/lli+0x100a9553c)
# |  #6 0x0000000104c212ec llvm::orc::ExecutionSession::endSession() (/Volumes/ExternalSSD/buildbot-root/aarch64-darwin/build/bin/lli+0x100a952ec)
# |  #7 0x0000000104cadf5c llvm::orc::LLJIT::~LLJIT() (/Volumes/ExternalSSD/buildbot-root/aarch64-darwin/build/bin/lli+0x100b21f5c)
# |  #8 0x0000000104cb28e8 llvm::orc::LLLazyJIT::~LLLazyJIT() (/Volumes/ExternalSSD/buildbot-root/aarch64-darwin/build/bin/lli+0x100b268e8)
# |  #9 0x0000000104195134 runOrcJIT(char const*) (/Volumes/ExternalSSD/buildbot-root/aarch64-darwin/build/bin/lli+0x100009134)
# | #10 0x0000000104190620 main (/Volumes/ExternalSSD/buildbot-root/aarch64-darwin/build/bin/lli+0x100004620)
# | #11 0x0000000185d07154
# `-----------------------------
# error: command failed with exit status: -11
# executed command: /Volumes/ExternalSSD/buildbot-root/aarch64-darwin/build/bin/FileCheck /Users/buildbot/buildbot-root2/aarch64-darwin/llvm-project/llvm/test/ExecutionEngine/OrcLazy/multiple-compile-threads-basic.ll
# .---command stderr------------
# | FileCheck error: '<stdin>' is empty.
# | FileCheck command line:  /Volumes/ExternalSSD/buildbot-root/aarch64-darwin/build/bin/FileCheck /Users/buildbot/buildbot-root2/aarch64-darwin/llvm-project/llvm/test/ExecutionEngine/OrcLazy/multiple-compile-threads-basic.ll
# `-----------------------------
# error: command failed with exit status: 2

--

********************

jayfoad added 5 commits November 6, 2025 13:55

Tweak test case

5806f11

Using different offsets for the two global loads just makes it more obvious that the second one could fail address translation even if the first one succeeds.

Simplify and inline applyPendingXcntGroup. NFC.

5b828b0

Refactor with hasMixedPendingEvents. NFC.

8042095

This is just stylistic. It allows using the identical condition in the SMEM and VMEM cases and potentially guards against more X_CNT event types being added in future.

Refactor. NFC.

fb6364c

[AMDGPU] Fix missing S_WAIT_XCNT with multiple pending VMEMs

e3e274f

llvmbot added the backend:AMDGPU label Nov 6, 2025

jayfoad requested review from Pierre-vh, cdevadas, easyonaadit and rampitec November 6, 2025 14:20

jayfoad mentioned this pull request Nov 6, 2025

[AMDGPU] Another test for missing S_WAIT_XCNT #166154

Merged

RyanRio mentioned this pull request Nov 10, 2025

[AMDGPU][SIInsertWaitCnts] Gfx12.5 - Refactor xcnt optimization #164357

Merged

nhaehnle approved these changes Nov 11, 2025

cdevadas approved these changes Nov 12, 2025

jayfoad merged commit 5e4f177 into llvm:main Nov 12, 2025
12 checks passed

jayfoad deleted the fix-missing-xcnt branch November 12, 2025 09:44

git-crd pushed a commit to git-crd/crd-llvm-project that referenced this pull request Nov 13, 2025

[AMDGPU] Fix missing S_WAIT_XCNT with multiple pending VMEMs (llvm#16…

a1c8750

…6779)

5 participants
