Skip to content

Conversation

choikwa
Copy link
Contributor

@choikwa choikwa commented Aug 20, 2025

This was motivated from looking at composable kernel benchmark where IPSCCP was observed replacing noalias ptr's and their derivatives with a global alias. Doing so would lose the noalias information and target backend was more pessimistic, emitting unneeded WAITCNT instructions.

Making it a draft as it's unclear if it's beneficially moving the needle.

…atives

This was motivated from looking at composable kernel benchmark where IPSCCP was observed replacing noalias ptr's and their derivatives with a global alias.
Doing so would lose the noalias information and target backend was more pessimistic, emitting unneeded WAITCNT instructions.
@llvmbot
Copy link
Member

llvmbot commented Aug 20, 2025

@llvm/pr-subscribers-function-specialization

@llvm/pr-subscribers-llvm-transforms

Author: choikwa (choikwa)

Changes

This was motivated from looking at composable kernel benchmark where IPSCCP was observed replacing noalias ptr's and their derivatives with a global alias. Doing so would lose the noalias information and target backend was more pessimistic, emitting unneeded WAITCNT instructions.

Making it a draft as it's unclear if it's beneficially moving the needle.


Full diff: https://github.com/llvm/llvm-project/pull/154522.diff

1 Files Affected:

  • (modified) llvm/lib/Transforms/Utils/SCCPSolver.cpp (+15)
diff --git a/llvm/lib/Transforms/Utils/SCCPSolver.cpp b/llvm/lib/Transforms/Utils/SCCPSolver.cpp
index 84485176ad4ff..3c06452a91f36 100644
--- a/llvm/lib/Transforms/Utils/SCCPSolver.cpp
+++ b/llvm/lib/Transforms/Utils/SCCPSolver.cpp
@@ -62,6 +62,21 @@ bool SCCPSolver::tryToReplaceWithConstant(Value *V) {
   Constant *Const = getConstantOrNull(V);
   if (!Const)
     return false;
+
+  // Don't replace noalias arg or derivatives
+  if (isa<PointerType>(V->getType())) {
+    SmallVector<const Value *, 4> Objects;
+    getUnderlyingObjects(V, Objects, nullptr);
+
+    for (const auto Obj : Objects) {
+      if (const auto *Arg = dyn_cast<Argument>(Obj)) {
+        if (isa<PointerType>(Arg->getType()) &&
+            Arg->hasNoAliasAttr())
+          return false;
+      }
+    }
+  }
+
   // Replacing `musttail` instructions with constant breaks `musttail` invariant
   // unless the call itself can be removed.
   // Calls with "clang.arc.attachedcall" implicitly use the return value and

@choikwa choikwa requested review from dobbelaj-snps and nikic August 20, 2025 12:22
Copy link

⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:
git-clang-format --diff HEAD~1 HEAD --extensions cpp -- llvm/lib/Transforms/Utils/SCCPSolver.cpp
View the diff from clang-format here.
diff --git a/llvm/lib/Transforms/Utils/SCCPSolver.cpp b/llvm/lib/Transforms/Utils/SCCPSolver.cpp
index 3c06452a9..ac1a615a2 100644
--- a/llvm/lib/Transforms/Utils/SCCPSolver.cpp
+++ b/llvm/lib/Transforms/Utils/SCCPSolver.cpp
@@ -70,8 +70,7 @@ bool SCCPSolver::tryToReplaceWithConstant(Value *V) {
 
     for (const auto Obj : Objects) {
       if (const auto *Arg = dyn_cast<Argument>(Obj)) {
-        if (isa<PointerType>(Arg->getType()) &&
-            Arg->hasNoAliasAttr())
+        if (isa<PointerType>(Arg->getType()) && Arg->hasNoAliasAttr())
           return false;
       }
     }

@choikwa choikwa marked this pull request as draft August 20, 2025 12:37
@dobbelaj-snps
Copy link
Contributor

@choikwa What is the driving testcase for this change ?

@choikwa
Copy link
Contributor Author

choikwa commented Aug 21, 2025

The testcase is from composable kernel. I've attached the module in question, but the log dump is far too large to attach so I've attached cmd in lieu.
composable_kernel_example.zip

Some excerpts that demonstrate my findings:
define internal %"struct.ck_tile::buffer_view" @ZN7ck_tile16make_buffer_viewILNS_18address_space_enumE1ELNS_25amd_buffer_coherence_enumE11EKDF16blEEDaPT1_T2(ptr noalias noundef %p, i64 noundef %buffer_size) #3 {
entry:
...
Constant: ptr addrspacecast (ptr addrspace(3) @_ZZZNK7ck_tile13FmhaFwdKernelINS_34BlockFmhaPipelineQRKSVSAsyncTrloadINS_24BlockFmhaPipelineProblemIDF16bDF16bDF16bffDF16bhfDF16bfDF16bNS_13TileFmhaShapeINS_8sequenceIJLi128ELi64ELi32ELi128ELi16ELi128EEEENS4_IJLi4ELi1ELi1EEEENS4_IJLi32ELi32ELi16EEEES6_S7_Lb1EEELb0ENS_17ComposedAttentionILj0ELb1EEENS_30SimplifiedGenericAttentionMaskILb0EEELb1ENS_14TileFmhaTraitsILb0ELb0ELb0ELb0ELb0ELNS_22BlockAttentionBiasEnumE0ELb0ELb0ELb0ELb0ELin1ELb0EEEEENS_47BlockFmhaPipelineQRKSVSAsyncTrloadDefaultPolicyEEENS_17Default2DEpilogueINS_24Default2DEpilogueProblemIfDF16bLb0ELb0ELb1ELNS_21memory_operation_enumE0EEEvEEE4run_ENSO_21FmhaFwdBatchModeKargsEENKUlvE7_clEvE10smem_ptrk0 to ptr) = ptr %p

@dobbelaj-snps
Copy link
Contributor

Can you create a reduced testcase that shows the wanted effect ? (llvm-ir based). This is needed any way if you want this change to be acceptable. As you have access to a compiler with and without your change, you could use creduce and/or llvm-reduce to help you out for this.

@dobbelaj-snps
Copy link
Contributor

Thanks for the testcase. Did you happen to try this out with the Full Restrict version. My understanding is that that should keep the 'restrict' information just fine ?

@choikwa
Copy link
Contributor Author

choikwa commented Aug 22, 2025

I've tried to run llvm-reduce on CK module in the past, but even after running for 1.5day, it would fail and the reduced llir was still close to original size. Instead, I was luckily able to cook up a simpler example that demonstrates aforementioned behavior in the testcase commit.

The Diff:
{5A2295E7-0EB9-47D2-B279-407CB09903B6}

log dump:
opt.log

The problem starts with IPSCCP replacing %p in the callee body with @arr, ignoring the fact that %p is noalias ptr argument :
Constant: @arr = global [100 x i32] zeroinitializer, align 16 = ptr %p

@choikwa
Copy link
Contributor Author

choikwa commented Aug 22, 2025

Thanks for the testcase. Did you happen to try this out with the Full Restrict version. My understanding is that that should keep the 'restrict' information just fine ?

I have not. I can try with the full_restrict-update-20231215-02_ptr_provenance branch.

@choikwa
Copy link
Contributor Author

choikwa commented Aug 22, 2025

noalias-test.gz

Not sure why but the loads and store have !noalias !(p, q, r)

@choikwa
Copy link
Contributor Author

choikwa commented Aug 26, 2025

Not sure if the full restrict branch handles this case, but one thing seems certain -- if IPSCCP (in the trunk form) is allowed to run before the lowering of noalias arg attribute, it could potentially replace uses of them losing the noalias info. I'm not aware if this issue was seen and addressed already, but it seems like it needs either phase ordering change or passes like IPSCCP needs to be made aware of losing noalias info.

Copy link
Contributor

@nikic nikic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to create a PhaseOrdering test that shows this causes end-to-end optimization issues? Just losing noalias metadata by itself may not be a problem, as accesses on globals have better analysis capabilities than random pointers (especially with GlobalsAA in use).

@choikwa
Copy link
Contributor Author

choikwa commented Aug 26, 2025

I tried with a simple unrolled body, but most backends were able to schedule stores to the end. And it looks like the reason may be that ScopedNoAliasAAResult::alias implementation does allow for asymmetry (ie. missing scope in store) when determining NoAlias.

For my specific case, the issue was related to the existence of scope metadata on the instruction (Using LDS, a special cache-like memory). AMDGPU would pessimize the load from LDS by forcing stalls to wait for all but one of the memory traffic to finish before continuing.

So yes, some backends like AMDGPU may have quirks that may be more pessimistic when asymmetry is found, but the general case seems to be unaffected.

ret void
}

define void @callee(ptr noalias noundef %p, ptr noalias noundef %q, ptr noalias noundef %r, i32 noundef %len) #1 align 2 {
Copy link
Contributor

@brunodf-snps brunodf-snps Aug 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the function @callee in your test case needs to be "internal" before IPSCCP can transform it. (Also, the alignment on the function is redundant for the test.)

@brunodf-snps
Copy link
Contributor

Is it possible to create a PhaseOrdering test that shows this causes end-to-end optimization issues?

FWIW, I think this can cause end-to-end optimization issues. Consider:

LINKAGE void loop(int * __restrict a, int *b, int *c) {
    for (int i = 0; i < 100; ++i)
        a[i] = b[i] + c[i];
}

extern int a[];

void test(int n) {
    loop(a, a + n, a + n + n);
}

Compiling this with an empty LINKAGE value prevents IPSCCP:

$ clang -target riscv32-unknown-linux-gnu -DLINKAGE= -S -emit-llvm -O2 -o - mytest.c | opt -disable-output -passes=aa-eval -evaluate-aa-metadata -print-all-alias-modref-info
...
Function: test: 3 pointers, 1 call sites
  MayAlias:     i32* %arrayidx.i, i32* %arrayidx1.i
  MayAlias:     i32* %arrayidx.i, i32* %arrayidx2.i
  MayAlias:     i32* %arrayidx1.i, i32* %arrayidx2.i
  NoAlias:   %0 = load i32, ptr %arrayidx.i, align 4, !tbaa !12, !noalias !9 <->   store i32 %add.i, ptr %arrayidx2.i, align 4, !tbaa !12, !alias.scope !9
  NoAlias:   %1 = load i32, ptr %arrayidx1.i, align 4, !tbaa !12, !noalias !9 <->   store i32 %add.i, ptr %arrayidx2.i, align 4, !tbaa !12, !alias.scope !9
  NoModRef:  Ptr: i32* %arrayidx.i      <->  tail call void @llvm.experimental.noalias.scope.decl(metadata !15)
  NoModRef:  Ptr: i32* %arrayidx1.i     <->  tail call void @llvm.experimental.noalias.scope.decl(metadata !15)
  NoModRef:  Ptr: i32* %arrayidx2.i     <->  tail call void @llvm.experimental.noalias.scope.decl(metadata !15)
...

Whereas with LINKAGE=static we enable IPSCPP:

$ clang -target riscv32-unknown-linux-gnu -DLINKAGE=static -S -emit-llvm -O2 -o - mytest.c | opt -disable-output -passes=aa-eval -evaluate-aa-metadata -print-all-alias-modref-info
Function: test: 3 pointers, 0 call sites
  MayAlias:     i32* %arrayidx.i, i32* %arrayidx1.i
  MayAlias:     i32* %arrayidx.i, i32* %arrayidx2.i
  MayAlias:     i32* %arrayidx1.i, i32* %arrayidx2.i
  MayAlias:   %0 = load i32, ptr %arrayidx.i, align 4, !tbaa !9 <->   store i32 %add.i, ptr %arrayidx2.i, align 4, !tbaa !9
  MayAlias:   %1 = load i32, ptr %arrayidx1.i, align 4, !tbaa !9 <->   store i32 %add.i, ptr %arrayidx2.i, align 4, !tbaa !9
...

So IPSCCP turns NoAlias load/store relations into MayAlias load/store relations.

I don't know if preventing the IPSCCP transformation is the answer though. With the full restrict patches, I think the restrict usage is captured in a noalias intrinsic and it would not show this problem.

@choikwa
Copy link
Contributor Author

choikwa commented Aug 28, 2025

I think with the trunk the only theoretical issue other than AMDGPU backend would be if IPSCCP replaced what would have been asymmetric scoped-noalias relation and stopped producing NoAlias result (but not sure how to generate such test or how likely that is).

choikwa added a commit that referenced this pull request Sep 12, 2025
…efore (#157821)

This change was motivated by CK where many VMCNT(0)'s were generated due
to instructions lacking !alias.scope metadata. The two causes of this
were:
1) LowerLDSModule not tacking on scope metadata on a single LDS variable
2) IPSCCP pass before inliner replacing noalias ptr derivative with a
global value, which made inliner unable to track it back to the noalias
   ptr argument.

However, it turns out that IPSCCP losing the scope information was
largely ineffectual as ScopedNoAliasAA was able to handle asymmetric
condition, where one MemLoc was missing scope, and still return NoAlias
result.

AMDGPU however was checking for existence of scope in SIInsertWaitcnts
and conservatively treating it as aliasing all and inserted VMCNT(0)
before DS_READs, forcing it to wait for all previous LDS DMA
instructions.

Since we know that ScopedNoAliasAA can handle asymmetry, we should also
allow AA query to determine if two MIs may alias.

Passed PSDB.

Previous attempt to address the issue in IPSCCP, likely stalled:
#154522
This solution may be preferrable over that as issue only affects AMDGPU.
@choikwa
Copy link
Contributor Author

choikwa commented Sep 13, 2025

Closing this; instead I have pursued #157821

@choikwa choikwa closed this Sep 13, 2025
@choikwa
Copy link
Contributor Author

choikwa commented Sep 15, 2025

@dobbelaj-snps FWIW, I can confirm there is still big desire to see the full-restrict patch land upstream, and I would like to lend a hand to effort if there is any need. Thank you for your work.

llvm-sync bot pushed a commit to arm/arm-toolchain that referenced this pull request Sep 16, 2025
…aitcntInstBefore (#157821)

This change was motivated by CK where many VMCNT(0)'s were generated due
to instructions lacking !alias.scope metadata. The two causes of this
were:
1) LowerLDSModule not tacking on scope metadata on a single LDS variable
2) IPSCCP pass before inliner replacing noalias ptr derivative with a
global value, which made inliner unable to track it back to the noalias
   ptr argument.

However, it turns out that IPSCCP losing the scope information was
largely ineffectual as ScopedNoAliasAA was able to handle asymmetric
condition, where one MemLoc was missing scope, and still return NoAlias
result.

AMDGPU however was checking for existence of scope in SIInsertWaitcnts
and conservatively treating it as aliasing all and inserted VMCNT(0)
before DS_READs, forcing it to wait for all previous LDS DMA
instructions.

Since we know that ScopedNoAliasAA can handle asymmetry, we should also
allow AA query to determine if two MIs may alias.

Passed PSDB.

Previous attempt to address the issue in IPSCCP, likely stalled:
llvm/llvm-project#154522
This solution may be preferrable over that as issue only affects AMDGPU.
searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Sep 25, 2025
…efore (llvm#157821)

This change was motivated by CK where many VMCNT(0)'s were generated due
to instructions lacking !alias.scope metadata. The two causes of this
were:
1) LowerLDSModule not tacking on scope metadata on a single LDS variable
2) IPSCCP pass before inliner replacing noalias ptr derivative with a
global value, which made inliner unable to track it back to the noalias
   ptr argument.

However, it turns out that IPSCCP losing the scope information was
largely ineffectual as ScopedNoAliasAA was able to handle asymmetric
condition, where one MemLoc was missing scope, and still return NoAlias
result.

AMDGPU however was checking for existence of scope in SIInsertWaitcnts
and conservatively treating it as aliasing all and inserted VMCNT(0)
before DS_READs, forcing it to wait for all previous LDS DMA
instructions.

Since we know that ScopedNoAliasAA can handle asymmetry, we should also
allow AA query to determine if two MIs may alias.

Passed PSDB.

Previous attempt to address the issue in IPSCCP, likely stalled:
llvm#154522
This solution may be preferrable over that as issue only affects AMDGPU.

Cherry-picked from 8ae3aea and ef7de8d
searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Oct 9, 2025
…efore (llvm#157821)

This change was motivated by CK where many VMCNT(0)'s were generated due
to instructions lacking !alias.scope metadata. The two causes of this
were:
1) LowerLDSModule not tacking on scope metadata on a single LDS variable
2) IPSCCP pass before inliner replacing noalias ptr derivative with a
global value, which made inliner unable to track it back to the noalias
   ptr argument.

However, it turns out that IPSCCP losing the scope information was
largely ineffectual as ScopedNoAliasAA was able to handle asymmetric
condition, where one MemLoc was missing scope, and still return NoAlias
result.

AMDGPU however was checking for existence of scope in SIInsertWaitcnts
and conservatively treating it as aliasing all and inserted VMCNT(0)
before DS_READs, forcing it to wait for all previous LDS DMA
instructions.

Since we know that ScopedNoAliasAA can handle asymmetry, we should also
allow AA query to determine if two MIs may alias.

Passed PSDB.

Previous attempt to address the issue in IPSCCP, likely stalled:
llvm#154522
This solution may be preferrable over that as issue only affects AMDGPU.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants