Skip to content

Conversation

@asb
Copy link
Contributor

@asb asb commented Feb 11, 2025

Note: This patch is fully ready for technical review, but is not ready for merging until ongoing testing that the altered codegen doesn't (as far as can be seen) regress code generation for the existing libcall, which may result in other patches that should land before this one. I post this for feedback now, because any alternate implementation approach would impact this testing.


In order to keep the change as incremental as possible, this only introduces the memset.pattern intrinsic in cases where memset_pattern16 would have been used. Future patches can enable it on targets that don't have the intrinsic, and select it in cases where the libcall isn't directly usable. As the memset.pattern intrinsic takes the number of times to store the pattern as an argument unlike memset_pattern16 which takes the number of bytes to write, we no longer try to form an i128 pattern.

Special care is taken for cases where multiple stores in the same loop iteration were combined to form a single pattern. For such cases, we inherit the limitation that loops such as the following are supported:

for (unsigned i = 0; i < 2 * n; i += 2) {
  f[i] = 2;
  f[i+1] = 2;
}

But the following doesn't result in a memset.pattern (even though it could be, by forming an appropriate pattern):

for (unsigned i = 0; i < 2 * n; i += 2) {
  f[i] = 2;
  f[i+1] = 3;
}

Addressing this existing deficiency is left for a follow-up due to a desire not to change too much at once (i.e. to target equivalence to the current codegen).

A command line option is introduced to force the selection of the intrinsic even in cases it wouldn't be (i.e. in cases where the libcall wouldn't have been selected). This is intended as a transitionary option for testing and experimentation, to be removed at a later point.

…than memset_pattern16 libcall

In order to keep the change as incremental as possible, this only
introduces the memset.pattern intrinsic in cases where memset_pattern16
would have been used. Future patches can enable it on targets that don't
have the intrinsic. As the memset.pattern intrinsic takes the number of
times to store the pattern as an argument unlike memset_pattern16 which
takes the number of bytes to write, we no longer try to form an i128
pattern.

Special care is taken for cases where multiple stores in the
same loop iteration were combined to form a single pattern. For such
cases, we inherit the limitation that loops such as the following are
supported:

```
for (unsigned i = 0; i < 2 * n; i += 2) {
  f[i] = 2;
  f[i+1] = 2;
}
```

But the following doesn't result in a memset.pattern (even though it
could be, by forming an appropriate pattern):
```
for (unsigned i = 0; i < 2 * n; i += 2) {
  f[i] = 2;
  f[i+1] = 3;
}
```

Addressing this existing deficiency is left for a follow-up due to a
desire not to change too much at once (i.e. to target equivalence to the
current codegen).

A command line option is introduced to force the selection of the
intrinsic even in cases it wouldn't be (i.e. in cases where the libcall
wouldn't have been selected). This is intended as a transitionary option
for testing and experimentation, to be removed at a later point.
@llvmbot
Copy link
Member

llvmbot commented Feb 11, 2025

@llvm/pr-subscribers-llvm-transforms

Author: Alex Bradbury (asb)

Changes

Note: This patch is fully ready for technical review, but is not ready for merging until ongoing testing that the altered codegen doesn't (as far as can be seen) regress code generation for the existing libcall, which may result in other patches that should land before this one. I post this for feedback now, because any alternate implementation approach would impact this testing.


In order to keep the change as incremental as possible, this only introduces the memset.pattern intrinsic in cases where memset_pattern16 would have been used. Future patches can enable it on targets that don't have the intrinsic, and select it in cases where the libcall isn't directly usable. As the memset.pattern intrinsic takes the number of times to store the pattern as an argument unlike memset_pattern16 which takes the number of bytes to write, we no longer try to form an i128 pattern.

Special care is taken for cases where multiple stores in the same loop iteration were combined to form a single pattern. For such cases, we inherit the limitation that loops such as the following are supported:

for (unsigned i = 0; i &lt; 2 * n; i += 2) {
  f[i] = 2;
  f[i+1] = 2;
}

But the following doesn't result in a memset.pattern (even though it could be, by forming an appropriate pattern):

for (unsigned i = 0; i &lt; 2 * n; i += 2) {
  f[i] = 2;
  f[i+1] = 3;
}

Addressing this existing deficiency is left for a follow-up due to a desire not to change too much at once (i.e. to target equivalence to the current codegen).

A command line option is introduced to force the selection of the intrinsic even in cases it wouldn't be (i.e. in cases where the libcall wouldn't have been selected). This is intended as a transitionary option for testing and experimentation, to be removed at a later point.


Patch is 23.99 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/126736.diff

7 Files Affected:

  • (modified) llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp (+94-42)
  • (added) llvm/test/Transforms/LoopIdiom/RISCV/memset-pattern.ll (+49)
  • (modified) llvm/test/Transforms/LoopIdiom/basic.ll (+5-6)
  • (modified) llvm/test/Transforms/LoopIdiom/memset-pattern-tbaa.ll (+5-11)
  • (modified) llvm/test/Transforms/LoopIdiom/struct_pattern.ll (+7-11)
  • (modified) llvm/test/Transforms/LoopIdiom/unroll-custom-dl.ll (+3-7)
  • (modified) llvm/test/Transforms/LoopIdiom/unroll.ll (+3-7)
diff --git a/llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp b/llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp
index 2462ec33e0c20..58c12298ca926 100644
--- a/llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp
+++ b/llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp
@@ -132,6 +132,11 @@ static cl::opt<bool> UseLIRCodeSizeHeurs(
              "with -Os/-Oz"),
     cl::init(true), cl::Hidden);
 
+static cl::opt<bool> ForceMemsetPatternIntrinsic(
+    "loop-idiom-force-memset-pattern-intrinsic",
+    cl::desc("Enable use of the memset.pattern intrinsic"), cl::init(false),
+    cl::Hidden);
+
 namespace {
 
 class LoopIdiomRecognize {
@@ -303,10 +308,15 @@ bool LoopIdiomRecognize::runOnLoop(Loop *L) {
       L->getHeader()->getParent()->hasOptSize() && UseLIRCodeSizeHeurs;
 
   HasMemset = TLI->has(LibFunc_memset);
+  // TODO: Unconditionally enable use of the memset pattern intrinsic (or at
+  // least, opt-in via target hook) once we are confident it will never result
+  // in worse codegen than without. For now, use it only when we would have
+  // previously emitted a libcall to memset_pattern16 (or unless this is
+  // overridden by command line option).
   HasMemsetPattern = TLI->has(LibFunc_memset_pattern16);
   HasMemcpy = TLI->has(LibFunc_memcpy);
 
-  if (HasMemset || HasMemsetPattern || HasMemcpy)
+  if (HasMemset || HasMemsetPattern || ForceMemsetPatternIntrinsic || HasMemcpy)
     if (SE->hasLoopInvariantBackedgeTakenCount(L))
       return runOnCountableLoop();
 
@@ -392,14 +402,7 @@ static Constant *getMemSetPatternValue(Value *V, const DataLayout *DL) {
   if (Size > 16)
     return nullptr;
 
-  // If the constant is exactly 16 bytes, just use it.
-  if (Size == 16)
-    return C;
-
-  // Otherwise, we'll use an array of the constants.
-  unsigned ArraySize = 16 / Size;
-  ArrayType *AT = ArrayType::get(V->getType(), ArraySize);
-  return ConstantArray::get(AT, std::vector<Constant *>(ArraySize, C));
+  return C;
 }
 
 LoopIdiomRecognize::LegalStoreKind
@@ -463,8 +466,9 @@ LoopIdiomRecognize::isLegalStore(StoreInst *SI) {
     // It looks like we can use SplatValue.
     return LegalStoreKind::Memset;
   }
-  if (!UnorderedAtomic && HasMemsetPattern && !DisableLIRP::Memset &&
-      // Don't create memset_pattern16s with address spaces.
+  if (!UnorderedAtomic && (HasMemsetPattern || ForceMemsetPatternIntrinsic) &&
+      !DisableLIRP::Memset &&
+      // Don't create memset.pattern intrinsic calls with address spaces.
       StorePtr->getType()->getPointerAddressSpace() == 0 &&
       getMemSetPatternValue(StoredVal, DL)) {
     // It looks like we can use PatternValue!
@@ -1064,53 +1068,101 @@ bool LoopIdiomRecognize::processLoopStridedStore(
     return Changed;
 
   // Okay, everything looks good, insert the memset.
+  // MemsetArg is the number of bytes for the memset libcall, and the number
+  // of pattern repetitions if the memset.pattern intrinsic is being used.
+  Value *MemsetArg;
+  std::optional<int64_t> BytesWritten = std::nullopt;
+
+  if (PatternValue && (HasMemsetPattern || ForceMemsetPatternIntrinsic)) {
+    const SCEV *TripCountS =
+        SE->getTripCountFromExitCount(BECount, IntIdxTy, CurLoop);
+    if (!Expander.isSafeToExpand(TripCountS))
+      return Changed;
+    const SCEVConstant *ConstStoreSize = dyn_cast<SCEVConstant>(StoreSizeSCEV);
+    if (!ConstStoreSize)
+      return Changed;
+    Value *TripCount = Expander.expandCodeFor(TripCountS, IntIdxTy,
+                                              Preheader->getTerminator());
+    uint64_t PatternRepsPerTrip =
+        (ConstStoreSize->getValue()->getZExtValue() * 8) /
+        DL->getTypeSizeInBits(PatternValue->getType());
+    // If ConstStoreSize is not equal to the width of PatternValue, then
+    // MemsetArg is TripCount * (ConstStoreSize/PatternValueWidth). Else
+    // MemSetArg is just TripCount.
+    MemsetArg =
+        PatternRepsPerTrip == 1
+            ? TripCount
+            : Builder.CreateMul(TripCount,
+                                Builder.getIntN(IntIdxTy->getIntegerBitWidth(),
+                                                PatternRepsPerTrip));
+    if (auto CI = dyn_cast<ConstantInt>(TripCount))
+      BytesWritten =
+          CI->getZExtValue() * ConstStoreSize->getValue()->getZExtValue();
+  } else {
+    const SCEV *NumBytesS =
+        getNumBytes(BECount, IntIdxTy, StoreSizeSCEV, CurLoop, DL, SE);
 
-  const SCEV *NumBytesS =
-      getNumBytes(BECount, IntIdxTy, StoreSizeSCEV, CurLoop, DL, SE);
-
-  // TODO: ideally we should still be able to generate memset if SCEV expander
-  // is taught to generate the dependencies at the latest point.
-  if (!Expander.isSafeToExpand(NumBytesS))
-    return Changed;
-
-  Value *NumBytes =
-      Expander.expandCodeFor(NumBytesS, IntIdxTy, Preheader->getTerminator());
+    // TODO: ideally we should still be able to generate memset if SCEV expander
+    // is taught to generate the dependencies at the latest point.
+    if (!Expander.isSafeToExpand(NumBytesS))
+      return Changed;
+    MemsetArg =
+        Expander.expandCodeFor(NumBytesS, IntIdxTy, Preheader->getTerminator());
+    if (auto CI = dyn_cast<ConstantInt>(MemsetArg))
+      BytesWritten = CI->getZExtValue();
+  }
+  assert(MemsetArg && "MemsetArg should have been set");
 
-  if (!SplatValue && !isLibFuncEmittable(M, TLI, LibFunc_memset_pattern16))
+  if (!SplatValue && !(ForceMemsetPatternIntrinsic ||
+                       isLibFuncEmittable(M, TLI, LibFunc_memset_pattern16)))
     return Changed;
 
   AAMDNodes AATags = TheStore->getAAMetadata();
   for (Instruction *Store : Stores)
     AATags = AATags.merge(Store->getAAMetadata());
-  if (auto CI = dyn_cast<ConstantInt>(NumBytes))
-    AATags = AATags.extendTo(CI->getZExtValue());
+  if (BytesWritten)
+    AATags = AATags.extendTo(BytesWritten.value());
   else
     AATags = AATags.extendTo(-1);
 
   CallInst *NewCall;
   if (SplatValue) {
     NewCall = Builder.CreateMemSet(
-        BasePtr, SplatValue, NumBytes, MaybeAlign(StoreAlignment),
+        BasePtr, SplatValue, MemsetArg, MaybeAlign(StoreAlignment),
         /*isVolatile=*/false, AATags.TBAA, AATags.Scope, AATags.NoAlias);
   } else {
-    assert (isLibFuncEmittable(M, TLI, LibFunc_memset_pattern16));
+    assert(ForceMemsetPatternIntrinsic ||
+           isLibFuncEmittable(M, TLI, LibFunc_memset_pattern16));
     // Everything is emitted in default address space
-    Type *Int8PtrTy = DestInt8PtrTy;
-
-    StringRef FuncName = "memset_pattern16";
-    FunctionCallee MSP = getOrInsertLibFunc(M, *TLI, LibFunc_memset_pattern16,
-                            Builder.getVoidTy(), Int8PtrTy, Int8PtrTy, IntIdxTy);
-    inferNonMandatoryLibFuncAttrs(M, FuncName, *TLI);
-
-    // Otherwise we should form a memset_pattern16.  PatternValue is known to be
-    // an constant array of 16-bytes.  Plop the value into a mergable global.
-    GlobalVariable *GV = new GlobalVariable(*M, PatternValue->getType(), true,
-                                            GlobalValue::PrivateLinkage,
-                                            PatternValue, ".memset_pattern");
-    GV->setUnnamedAddr(GlobalValue::UnnamedAddr::Global); // Ok to merge these.
-    GV->setAlignment(Align(16));
-    Value *PatternPtr = GV;
-    NewCall = Builder.CreateCall(MSP, {BasePtr, PatternPtr, NumBytes});
+
+    assert(isa<SCEVConstant>(StoreSizeSCEV) && "Expected constant store size");
+
+    Value *PatternArg;
+    IntegerType *PatternArgTy =
+        Builder.getIntNTy(DL->getTypeSizeInBits(PatternValue->getType()));
+
+    // If the pattern value can be casted directly to an integer argument, use
+    // that. Otherwise (e.g. if the value is a global pointer), create a
+    // GlobalVariable and load from it.
+    if (isa<ConstantInt>(PatternValue)) {
+      PatternArg = PatternValue;
+    } else if (isa<ConstantFP>(PatternValue)) {
+      PatternArg = Builder.CreateBitCast(PatternValue, PatternArgTy);
+    } else {
+      GlobalVariable *GV = new GlobalVariable(*M, PatternValue->getType(), true,
+                                              GlobalValue::PrivateLinkage,
+                                              PatternValue, ".memset_pattern");
+      GV->setUnnamedAddr(
+          GlobalValue::UnnamedAddr::Global); // Ok to merge these.
+      GV->setAlignment(Align(PatternArgTy->getPrimitiveSizeInBits()));
+      PatternArg = Builder.CreateLoad(PatternArgTy, GV);
+    }
+    assert(PatternArg);
+
+    NewCall = Builder.CreateIntrinsic(Intrinsic::experimental_memset_pattern,
+                                      {DestInt8PtrTy, PatternArgTy, IntIdxTy},
+                                      {BasePtr, PatternArg, MemsetArg,
+                                       ConstantInt::getFalse(M->getContext())});
 
     // Set the TBAA info if present.
     if (AATags.TBAA)
diff --git a/llvm/test/Transforms/LoopIdiom/RISCV/memset-pattern.ll b/llvm/test/Transforms/LoopIdiom/RISCV/memset-pattern.ll
new file mode 100644
index 0000000000000..b3cee756076af
--- /dev/null
+++ b/llvm/test/Transforms/LoopIdiom/RISCV/memset-pattern.ll
@@ -0,0 +1,49 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --check-globals
+; RUN: opt -passes=loop-idiom -mtriple=riscv64 < %s -S | FileCheck %s
+; RUN: opt -passes=loop-idiom -mtriple=riscv64 -loop-idiom-force-memset-pattern-intrinsic < %s -S \
+; RUN:   | FileCheck -check-prefix=CHECK-INTRIN %s
+
+define dso_local void @double_memset(ptr nocapture %p) {
+; CHECK-LABEL: @double_memset(
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    br label [[FOR_BODY:%.*]]
+; CHECK:       for.cond.cleanup:
+; CHECK-NEXT:    ret void
+; CHECK:       for.body:
+; CHECK-NEXT:    [[I_07:%.*]] = phi i64 [ [[INC:%.*]], [[FOR_BODY]] ], [ 0, [[ENTRY:%.*]] ]
+; CHECK-NEXT:    [[PTR1:%.*]] = getelementptr inbounds double, ptr [[P:%.*]], i64 [[I_07]]
+; CHECK-NEXT:    store double 3.141590e+00, ptr [[PTR1]], align 1
+; CHECK-NEXT:    [[INC]] = add nuw nsw i64 [[I_07]], 1
+; CHECK-NEXT:    [[EXITCOND_NOT:%.*]] = icmp eq i64 [[INC]], 16
+; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label [[FOR_COND_CLEANUP:%.*]], label [[FOR_BODY]]
+;
+; CHECK-INTRIN-LABEL: @double_memset(
+; CHECK-INTRIN-NEXT:  entry:
+; CHECK-INTRIN-NEXT:    call void @llvm.experimental.memset.pattern.p0.i64.i64(ptr [[P:%.*]], i64 4614256650576692846, i64 16, i1 false)
+; CHECK-INTRIN-NEXT:    br label [[FOR_BODY:%.*]]
+; CHECK-INTRIN:       for.cond.cleanup:
+; CHECK-INTRIN-NEXT:    ret void
+; CHECK-INTRIN:       for.body:
+; CHECK-INTRIN-NEXT:    [[I_07:%.*]] = phi i64 [ [[INC:%.*]], [[FOR_BODY]] ], [ 0, [[ENTRY:%.*]] ]
+; CHECK-INTRIN-NEXT:    [[PTR1:%.*]] = getelementptr inbounds double, ptr [[P]], i64 [[I_07]]
+; CHECK-INTRIN-NEXT:    [[INC]] = add nuw nsw i64 [[I_07]], 1
+; CHECK-INTRIN-NEXT:    [[EXITCOND_NOT:%.*]] = icmp eq i64 [[INC]], 16
+; CHECK-INTRIN-NEXT:    br i1 [[EXITCOND_NOT]], label [[FOR_COND_CLEANUP:%.*]], label [[FOR_BODY]]
+;
+entry:
+  br label %for.body
+
+for.cond.cleanup:
+  ret void
+
+for.body:
+  %i.07 = phi i64 [ %inc, %for.body ], [ 0, %entry ]
+  %ptr1 = getelementptr inbounds double, ptr %p, i64 %i.07
+  store double 3.14159e+00, ptr %ptr1, align 1
+  %inc = add nuw nsw i64 %i.07, 1
+  %exitcond.not = icmp eq i64 %inc, 16
+  br i1 %exitcond.not, label %for.cond.cleanup, label %for.body
+}
+;.
+; CHECK-INTRIN: attributes #[[ATTR0:[0-9]+]] = { nocallback nofree nounwind willreturn memory(argmem: write) }
+;.
diff --git a/llvm/test/Transforms/LoopIdiom/basic.ll b/llvm/test/Transforms/LoopIdiom/basic.ll
index e6fc11625317b..0fe8cd747408f 100644
--- a/llvm/test/Transforms/LoopIdiom/basic.ll
+++ b/llvm/test/Transforms/LoopIdiom/basic.ll
@@ -7,8 +7,7 @@ target triple = "x86_64-apple-darwin10.0.0"
 ;.
 ; CHECK: @G = global i32 5
 ; CHECK: @g_50 = global [7 x i32] [i32 0, i32 0, i32 0, i32 0, i32 1, i32 0, i32 0], align 16
-; CHECK: @.memset_pattern = private unnamed_addr constant [4 x i32] [i32 1, i32 1, i32 1, i32 1], align 16
-; CHECK: @.memset_pattern.1 = private unnamed_addr constant [2 x ptr] [ptr @G, ptr @G], align 16
+; CHECK: @.memset_pattern = private unnamed_addr constant ptr @G, align 64
 ;.
 define void @test1(ptr %Base, i64 %Size) nounwind ssp {
 ; CHECK-LABEL: @test1(
@@ -533,7 +532,7 @@ for.end13:                                        ; preds = %for.inc10
 define void @test11_pattern(ptr nocapture %P) nounwind ssp {
 ; CHECK-LABEL: @test11_pattern(
 ; CHECK-NEXT:  entry:
-; CHECK-NEXT:    call void @memset_pattern16(ptr [[P:%.*]], ptr @.memset_pattern, i64 40000)
+; CHECK-NEXT:    call void @llvm.experimental.memset.pattern.p0.i32.i64(ptr [[P:%.*]], i32 1, i64 10000, i1 false)
 ; CHECK-NEXT:    br label [[FOR_BODY:%.*]]
 ; CHECK:       for.body:
 ; CHECK-NEXT:    [[INDVAR:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[INDVAR_NEXT:%.*]], [[FOR_BODY]] ]
@@ -596,7 +595,8 @@ for.end:                                          ; preds = %for.body
 define void @test13_pattern(ptr nocapture %P) nounwind ssp {
 ; CHECK-LABEL: @test13_pattern(
 ; CHECK-NEXT:  entry:
-; CHECK-NEXT:    call void @memset_pattern16(ptr [[P:%.*]], ptr @.memset_pattern.1, i64 80000)
+; CHECK-NEXT:    [[TMP0:%.*]] = load i64, ptr @.memset_pattern, align 8
+; CHECK-NEXT:    call void @llvm.experimental.memset.pattern.p0.i64.i64(ptr [[P:%.*]], i64 [[TMP0]], i64 10000, i1 false)
 ; CHECK-NEXT:    br label [[FOR_BODY:%.*]]
 ; CHECK:       for.body:
 ; CHECK-NEXT:    [[INDVAR:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[INDVAR_NEXT:%.*]], [[FOR_BODY]] ]
@@ -1625,6 +1625,5 @@ define noalias ptr @_ZN8CMSPULog9beginImplEja(ptr nocapture writeonly %0) local_
 ; CHECK: attributes #[[ATTR1:[0-9]+]] = { nounwind }
 ; CHECK: attributes #[[ATTR2:[0-9]+]] = { nocallback nofree nounwind willreturn memory(argmem: readwrite) }
 ; CHECK: attributes #[[ATTR3:[0-9]+]] = { nocallback nofree nounwind willreturn memory(argmem: write) }
-; CHECK: attributes #[[ATTR4:[0-9]+]] = { nofree nounwind willreturn memory(argmem: readwrite) }
-; CHECK: attributes #[[ATTR5:[0-9]+]] = { nocallback nofree nosync nounwind speculatable willreturn memory(none) }
+; CHECK: attributes #[[ATTR4:[0-9]+]] = { nocallback nofree nosync nounwind speculatable willreturn memory(none) }
 ;.
diff --git a/llvm/test/Transforms/LoopIdiom/memset-pattern-tbaa.ll b/llvm/test/Transforms/LoopIdiom/memset-pattern-tbaa.ll
index 57a91a3bf6e2c..98521ef82fbe7 100644
--- a/llvm/test/Transforms/LoopIdiom/memset-pattern-tbaa.ll
+++ b/llvm/test/Transforms/LoopIdiom/memset-pattern-tbaa.ll
@@ -6,15 +6,10 @@ target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f3
 target triple = "x86_64-apple-darwin10.0.0"
 
 
-;.
-; CHECK: @.memset_pattern = private unnamed_addr constant [2 x double] [double 3.141590e+00, double 3.141590e+00], align 16
-; CHECK: @.memset_pattern.1 = private unnamed_addr constant [2 x double] [double 3.141590e+00, double 3.141590e+00], align 16
-; CHECK: @.memset_pattern.2 = private unnamed_addr constant [2 x double] [double 3.141590e+00, double 3.141590e+00], align 16
-;.
 define dso_local void @double_memset(ptr nocapture %p) {
 ; CHECK-LABEL: @double_memset(
 ; CHECK-NEXT:  entry:
-; CHECK-NEXT:    call void @memset_pattern16(ptr [[P:%.*]], ptr @.memset_pattern, i64 128), !tbaa [[TBAA0:![0-9]+]]
+; CHECK-NEXT:    call void @llvm.experimental.memset.pattern.p0.i64.i64(ptr [[P:%.*]], i64 4614256650576692846, i64 16, i1 false), !tbaa [[TBAA0:![0-9]+]]
 ; CHECK-NEXT:    br label [[FOR_BODY:%.*]]
 ; CHECK:       for.cond.cleanup:
 ; CHECK-NEXT:    ret void
@@ -44,7 +39,7 @@ for.body:
 define dso_local void @struct_memset(ptr nocapture %p) {
 ; CHECK-LABEL: @struct_memset(
 ; CHECK-NEXT:  entry:
-; CHECK-NEXT:    call void @memset_pattern16(ptr [[P:%.*]], ptr @.memset_pattern.1, i64 128), !tbaa [[TBAA4:![0-9]+]]
+; CHECK-NEXT:    call void @llvm.experimental.memset.pattern.p0.i64.i64(ptr [[P:%.*]], i64 4614256650576692846, i64 16, i1 false), !tbaa [[TBAA4:![0-9]+]]
 ; CHECK-NEXT:    br label [[FOR_BODY:%.*]]
 ; CHECK:       for.cond.cleanup:
 ; CHECK-NEXT:    ret void
@@ -73,8 +68,7 @@ for.body:
 define dso_local void @var_memset(ptr nocapture %p, i64 %len) {
 ; CHECK-LABEL: @var_memset(
 ; CHECK-NEXT:  entry:
-; CHECK-NEXT:    [[TMP0:%.*]] = shl nuw i64 [[LEN:%.*]], 3
-; CHECK-NEXT:    call void @memset_pattern16(ptr [[P:%.*]], ptr @.memset_pattern.2, i64 [[TMP0]])
+; CHECK-NEXT:    call void @llvm.experimental.memset.pattern.p0.i64.i64(ptr [[P:%.*]], i64 4614256650576692846, i64 [[TMP0:%.*]], i1 false)
 ; CHECK-NEXT:    br label [[FOR_BODY:%.*]]
 ; CHECK:       for.cond.cleanup:
 ; CHECK-NEXT:    ret void
@@ -82,7 +76,7 @@ define dso_local void @var_memset(ptr nocapture %p, i64 %len) {
 ; CHECK-NEXT:    [[I_07:%.*]] = phi i64 [ [[INC:%.*]], [[FOR_BODY]] ], [ 0, [[ENTRY:%.*]] ]
 ; CHECK-NEXT:    [[PTR1:%.*]] = getelementptr inbounds double, ptr [[P]], i64 [[I_07]]
 ; CHECK-NEXT:    [[INC]] = add nuw nsw i64 [[I_07]], 1
-; CHECK-NEXT:    [[EXITCOND_NOT:%.*]] = icmp eq i64 [[INC]], [[LEN]]
+; CHECK-NEXT:    [[EXITCOND_NOT:%.*]] = icmp eq i64 [[INC]], [[TMP0]]
 ; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label [[FOR_COND_CLEANUP:%.*]], label [[FOR_BODY]]
 ;
 entry:
@@ -116,7 +110,7 @@ for.body:
 !21 = !{!22, !20, i64 0}
 !22 = !{!"B", !20, i64 0}
 ;.
-; CHECK: attributes #[[ATTR0:[0-9]+]] = { nofree nounwind willreturn memory(argmem: readwrite) }
+; CHECK: attributes #[[ATTR0:[0-9]+]] = { nocallback nofree nounwind willreturn memory(argmem: write) }
 ;.
 ; CHECK: [[TBAA0]] = !{[[META1:![0-9]+]], [[META1]], i64 0}
 ; CHECK: [[META1]] = !{!"double", [[META2:![0-9]+]], i64 0}
diff --git a/llvm/test/Transforms/LoopIdiom/struct_pattern.ll b/llvm/test/Transforms/LoopIdiom/struct_pattern.ll
index b65e95353ab3e..f5be8e71cf7bd 100644
--- a/llvm/test/Transforms/LoopIdiom/struct_pattern.ll
+++ b/llvm/test/Transforms/LoopIdiom/struct_pattern.ll
@@ -16,11 +16,6 @@ target triple = "x86_64-apple-darwin10.0.0"
 ;}
 
 
-;.
-; CHECK: @.memset_pattern = private unnamed_addr constant [4 x i32] [i32 2, i32 2, i32 2, i32 2], align 16
-; CHECK: @.memset_pattern.1 = private unnamed_addr constant [4 x i32] [i32 2, i32 2, i32 2, i32 2], align 16
-; CHECK: @.memset_pattern.2 = private unnamed_addr constant [4 x i32] [i32 2, i32 2, i32 2, i32 2], align 16
-;.
 define void @bar1(ptr %f, i32 %n) nounwind ssp {
 ; CHECK-LABEL: @bar1(
 ; CHECK-NEXT:  entry:
@@ -28,8 +23,8 @@ define void @bar1(ptr %f, i32 %n) nounwind ssp {
 ; CHECK-NEXT:    br i1 [[CMP1]], label [[FOR_END:%.*]], label [[FOR_BODY_PREHEADER:%.*]]
 ; CHECK:       for.body.preheader:
 ; CHECK-NEXT:    [[TMP0:%.*]] = zext i32 [[N]] to i64
-; CHECK-NEXT:    [[TMP1:%.*]] = shl nuw nsw i64 [[TMP0]], 3
-; CHECK-NEXT:    call void @memset_pattern16(ptr [[F:%.*]], ptr @.memset_pattern, i64 [[TMP1]])
+; CHECK-NEXT:    [[TMP1:%.*]] = mul i64 [[TMP0]], 2
+; CHECK-NEXT:    call void @llvm.experimental.memset.pattern.p0.i32.i64(ptr [[F:%.*]], i32 2, i64 [[TMP1]], i1 false)
 ; CHECK-NEXT:    br label [[FOR_BODY:%.*]]
 ; CHECK:       for.body:
 ; CHECK-NEXT:    [[INDVARS_IV:%.*]] = phi i64 [ 0, [[FOR_BODY_PREHEADER]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY]] ]
@@ -82,8 +77,8 @@ define void @bar2(ptr %f, i32 %n) nounwind ssp {
 ; CHECK-NEXT:    br i1 [[CMP1]], label [[FOR_END:%.*]], label [[FOR_BODY_PREHEADER:%.*]]
 ; CHECK:       for.body.preheader:
 ; CHECK-NEXT:    [[TMP0:%.*]] = zext i32 [[N]] to i64
-; CHECK-NEXT:    [[TMP1:%.*]] = shl nuw nsw i64 [[TMP0]], 3
-; CHECK-NEXT:    call void @memset_pattern16(ptr [[F:%.*]], ptr @.memset_pattern.1, i64 [[TMP1]])
+; CHECK-NEXT:    [[TMP1:%.*]] = mul i64 [[TMP0]], 2
+; CHECK-NEXT:    call void @llvm.experimental.memset.pattern.p0.i32.i64(ptr [[F:%.*]], i32 2, i64 [[TMP1]], i1 false)
 ; CHECK-NEXT:    br label [[FOR_BODY:%.*]]
 ; CHECK:       for.body:
 ; CHECK-NEXT:    [[INDVARS_IV:%.*]] = phi i64 [ 0, [[FOR_BODY_PREHEADER]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY]] ]
@@ -142,7 +137,8 @@ define void @bar3(ptr nocapture %f, i32 %n) nounwind ssp {
 ; CHECK-NEXT:    [[TMP4:%.*]] = shl nuw nsw i64 [[TMP3]], 3
 ; CHECK-NEXT:    [[TMP5:%.*]] = sub i64 [[TMP1]], [[TMP4]]
 ; CHECK-NEXT:    [[UGLYGEP:%.*]] = getelementptr i8, ptr [[F:%.*]], i64 [[TMP5]]
-; CHECK-NEXT:    call void @memset_pattern16(ptr [[UGLYGEP]], ptr @.memset_pattern.2, i64 [[TMP1]])
+; CHECK-NEXT:    [[TMP7:%.*]] = mul i64 [[TMP0]], 2
+; CHECK-NEXT:    call void @llvm.experimental.memset.pattern.p0.i32.i64(ptr [[UGLYGEP]], i32 2, i64 [[TMP7]], i1 false)
 ; CHECK-NEXT:    br label [[FOR_BODY:%.*...
[truncated]

asb added a commit to asb/llvm-project that referenced this pull request Feb 28, 2025
…ading from constant global

This is motivated by llvm#126736, and catches a case that would have
resulted in memset_pattern16 being produced by LoopIdiomRecognize
previously but is missed after moving to the intrinsic in llvm#126736 and
relying on PreISelintrinsicLoewring to produce the libcall when
available.

The logic for handling load instructions that access constant globals
could be made more extensive, but it's not clear it would be worthwhile.
For now we prioritise the patterns that could be produced by
LoopIdiomRecognize.
@topperc
Copy link
Collaborator

topperc commented Apr 14, 2025

Reverse ping. Can you rebase this?

@asb
Copy link
Contributor Author

asb commented Apr 17, 2025

Reverse ping. Can you rebase this?

Just catching up post EuroLLVM. I've resolved the merge conflicts now, and fixed and outdated comment and accepted your suggestion (thanks!). And now putting some more focused effort in getting this closed off.

I have a flow for looking for unexpected codegen differences (which should only show up on darwin where the libcall is supported) by applying a hacky patch that unconditionally marks memset_pattern16 as available on all platforms, then building the test-suite with -k0 (to ignore link errors) and looking at assembly diffs. Going through this, there are a few things I'm looking at that need resolution before this can be merged:

  • More minor
    • Duplicated globals vs before. The linker can clean it up, and tracing through it's just because of things like inlining, where before we made one global in LoopIdiomRecognize and later may have cloned the call that uses it. But now, we may clone the intrinsic call and then create a global for each one that is lowered to memset_pattern16.
    • Loosing some type information in the emitted globals (bit patterns are the same, but before commented as 'float' etc appropriately). I've not looked at this as it's not a functional change.
    • In some examples we no longer tailcall memset_pattern16. Marking as minor as I didn't see a case it's likely to make a big difference, but it would be nice to clean up
  • The bigger one is I see some examples with repeated memset_pattern16 calls that wasn't there before. After looking at some of the others, I'm now isolating a smaller example.

It's possible some of the above results in changes to this patch, but also equally possible they result in separate PRs that end up being effectively pre-requisites for this one.

@nikic
Copy link
Contributor

nikic commented Apr 17, 2025

Duplicated globals vs before. The linker can clean it up, and tracing through it's just because of things like inlining, where before we made one global in LoopIdiomRecognize and later may have cloned the call that uses it. But now, we may clone the intrinsic call and then create a global for each one that is lowered to memset_pattern16.

Interesting case. This makes me wonder whether it would make sense to do the expansion for this intrinsic earlier, in the late middle-end pipeline (we'd still have to keep it in the backend for O0 fallback). If we do it before ConstantMerge, it would merge the constants back together. More interestingly, if we do it before addVectorPasses, the loop expansion case would still get the usual vectorization and unrolling heuristics, which would avoid the need to consider those in the backend expansion.

I don't think the duplicate globals are a big problem though, so that's just a thought on how to improve things in the future...

@preames
Copy link
Collaborator

preames commented May 5, 2025

The bigger one is I see some examples with repeated memset_pattern16 calls that wasn't there before. After looking at some of the others, I'm now isolating a smaller example.

I wrote a bunch of test cases to see if I could identity your problem. I believe this is a consequence of the second parameter (the value param) not being known to be readnone. There might be other cases, but I can definitely see this case failing to fold in DSE and GVN. I added those tests in 15c2f79.

Unfortunately, the fact that the value param is only sometimes a pointer fits badly into our intrinsic attribute system. I have a local patch which hacks in a fix, but it sure seems like I'm missing something in terms of code structure as I'm having to change multiple places.

Edited to expand comment slightly.

@preames
Copy link
Collaborator

preames commented May 5, 2025

First attempt at a partial fix for the pointer value operand issue above: #138559

std::optional<int64_t> BytesWritten;

if (PatternValue && (HasMemsetPattern || ForceMemsetPatternIntrinsic)) {
const SCEV *TripCountS =
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a bunch of complexity in this section of the change that I'm not getting. Since we need the NumBytes (in the original code) for the AAInfo, why not just leave that alone? Isn't count (for memset.pattern intrinsic) just the trip count of the loop by definition?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think this might be reworked, but we're down to little enough code here I'm okay with that being a followup.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the problem is that there may be multiple stores in one loop iteration?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's the case of multiple stores in a loop iteration. I could perhaps build the necessary value with numbytes and a div, but I'm wary of cases where it may not be folded away.

// If ConstStoreSize is not equal to the width of PatternValue, then
// MemsetArg is TripCount * (ConstStoreSize/PatternValueWidth). Else
// MemSetArg is just TripCount.
MemsetArg =
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this just weird way of compute NumBytes?

Copy link
Collaborator

@preames preames left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, I reorganized the code a bit, so you'll have to rebase. Feel free to defer until after answering my high level question though - I'm suspecting we can simplify this a bit.

@asb
Copy link
Contributor Author

asb commented Jun 11, 2025

Just a holding pattern update to explain the last push:
I've merged in the latest changes and adjusted to match the upstream. I've pushed so we can work from a relatively up to date base.

  • I know my changes slightly undo some of the refactoring Philip did. I will revisit.
  • I do still see some unwanted/unexpected changes and am looking to understand those at the moment.

asb added a commit to asb/llvm-project that referenced this pull request Jun 18, 2025
…for memset_pattern16 when possible

As Constants are already uniquified, we can use a map to keep track of
whether a GlobalVariable was produced for a given Constant or not.
Repeated globals with the same value was one of the codegen differences
noted in llvm#126736. This patch removes that diff, producing cleaner
output.
…for memset_pattern16 when possible

As Constants are already uniquified, we can use a map to keep track of
whether a GlobalVariable was produced for a given Constant or not.
Repeated globals with the same value was one of the codegen differences
noted in llvm#126736. This patch removes that diff, producing cleaner
output.
@asb
Copy link
Contributor Author

asb commented Jun 18, 2025

I've further reduced diffs and investigated the changes that were unexpected and am satisfied there's nothing untoward going on:

  • (Minor) We were always casting to integer types. Now that the memset.pattern intrinsic was expanded to allow any sized type this is unnecessary. Deleted the code that did this casting, and diffs vs the old codepath are reduced as generated globals in the output .s now have comments reflecting the type.
  • We were previously making a new GlobalVariable for every memset_pattern16 in a Module. I've added a map that tracks constants that have already been emitted and avoid this. Again, this reduces diff when comparing to the current codegen path. I've cherry-picked the patch here, but it's posted as an independent PR [PreISelIntrinsicLowering] Reuse previously generated GlobalVariable for memset_pattern16 when possible #144677
  • The remaining unexpected codegen diff I was seeing was again around unrolling. This time, TTI isLoweredToCall being called from getUnrollingPreferences. This returns false with the new codegen path but not the old one, affecting unrolling. For local testing I have a minimal patch that teaches isLoweredToCall about the memset.pattern intrinsic but don't intend to upstream it (the method has very little coverage of intrinsics as-is and there doesn't seem to be a justification in adding just memset.pattern).

TODO: re-review initial patch in light of some review comments and upstream refactorings.

@preames
Copy link
Collaborator

preames commented Jun 23, 2025

JFYI, @asb and I discussed this patch offline. I asked if any of the remaining differences were likely to be material performance wise, and the tentative answer was no. Given that, he's going to update the patch and I intend to approve (after normal review). We're at the point where enabling and watching for real regressions is likely the best path forward.

Once this goes in, I intend to implement some of the obvious combines on the new intrinsic. We've been deliberately waiting on that to have a fair comparison baseline.

asb added a commit that referenced this pull request Jun 23, 2025
…for memset_pattern16 when possible (#144677)

As Constants are already uniquified, we can use a map to keep track of
whether a GlobalVariable was produced for a given Constant or not.
Repeated globals with the same value was one of the codegen differences
noted in #126736. This patch removes that diff, producing cleaner
output.
@asb
Copy link
Contributor Author

asb commented Jun 25, 2025

I've merged in the latest changes and will re-check my comparison.

My methdology has been:

  • Apply a patch to make memset_pattern16 be exposed on non-Darwin (so I can test the codepath easily)
  • Build llvm-test-suite and diff
  • Apply any other needed downstream hacks for diffs that don't appear important but may mask bigger issues

For the last bullet, I've found there is a difference in unrolling behaviour due to TargetTransformInfo::isLoweredToCall not recognising that the intrinsic may be lowered to call (ultimately called via gatherUnrollingPreferences). In general we don't do anything very exhaustive at all here re libcalls so I don't propose to change that. But locally, I work around this with:

--- a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
@@ -179,6 +179,9 @@ public:
     // ported from existing analysis heuristics here so that such refactorings
     // can take place in the future.

+    if (F->isIntrinsic() && (F->getIntrinsicID() == Intrinsic::experimental_memset_pattern))
+      return true;
+
     if (F->isIntrinsic())
       return false;

The hacky patch to expose the libcalls on non-Darwin is:

--- a/llvm/lib/Analysis/TargetLibraryInfo.cpp
+++ b/llvm/lib/Analysis/TargetLibraryInfo.cpp
@@ -226,11 +226,12 @@ static void initializeLibCalls(TargetLibraryInfoImpl &TLI, const Triple &T,
       TLI.setUnavailable(LibFunc_memset_pattern8);
       TLI.setUnavailable(LibFunc_memset_pattern16);
     }
-  } else if (!T.isWatchOS()) {
-    TLI.setUnavailable(LibFunc_memset_pattern4);
-    TLI.setUnavailable(LibFunc_memset_pattern8);
-    TLI.setUnavailable(LibFunc_memset_pattern16);
   }
+//} else if (!T.isWatchOS()) {
+//  TLI.setUnavailable(LibFunc_memset_pattern4);
+//  TLI.setUnavailable(LibFunc_memset_pattern8);
+//  TLI.setUnavailable(LibFunc_memset_pattern16);
+//}

   if (!hasSinCosPiStret(T)) {
     TLI.setUnavailable(LibFunc_sinpi);

(Then build llvm-test-suite with -k0 and ignore link errors, diff the .s).

Copy link
Collaborator

@preames preames left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

"with -Os/-Oz"),
cl::init(true), cl::Hidden);

static cl::opt<bool> ForceMemsetPatternIntrinsic(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would mildly suggest dropping this cl::opt from the final patch. We probably should have a generic means to force enable a TLI function, but having one only for memset.pattern seems slightly odd. Not a strongly held view, will leave to your discretion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do have a patch somewhere that adds an opt flag to do this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think force enabling of a TLI function is slightly orthogonal. This patch aims to introduce memset.pattern in just those cases where memset_pattern16 would have been available, even though we do have codegen support for targets without memset_pattern16. The rationale is that using memset.pattern on all targets in this way would be a big codegen change. So the flag means it's easy to test the more eager selection of the intrinsic.

std::optional<int64_t> BytesWritten;

if (PatternValue && (HasMemsetPattern || ForceMemsetPatternIntrinsic)) {
const SCEV *TripCountS =
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think this might be reworked, but we're down to little enough code here I'm okay with that being a followup.

; CHECK-LABEL: @test13_pattern(
; CHECK-NEXT: entry:
; CHECK-NEXT: call void @memset_pattern16(ptr [[P:%.*]], ptr @.memset_pattern.1, i64 80000)
; CHECK-NEXT: call void @llvm.experimental.memset.pattern.p0.p0.i64(ptr align 4 [[P:%.*]], ptr @G, i64 10000, i1 false)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to note, you dropped the pointer to int conversion from the prior version of the patch, and thus until a reworked #138559 lands we have an aliasing imprecision here. I'm explicitly okay with that being true for the landed patch, and will follow up on my patch post-commit.

anthonyhatran pushed a commit to anthonyhatran/llvm-project that referenced this pull request Jun 26, 2025
…for memset_pattern16 when possible (llvm#144677)

As Constants are already uniquified, we can use a map to keep track of
whether a GlobalVariable was produced for a given Constant or not.
Repeated globals with the same value was one of the codegen differences
noted in llvm#126736. This patch removes that diff, producing cleaner
output.
@asb
Copy link
Contributor Author

asb commented Jul 9, 2025

I've merged in the latest changes and will re-check my comparison.

Confirming that no codegen changes stand out to me - there are some minor diffs, but this is vastly reduced vs when this started and most seem to be from slightly different loop unrolling behaviour.

I'm going to fix up the commit message and merge now, as this has had positive review and I believe comments are all addressed.

@asb asb merged commit 3877039 into llvm:main Jul 9, 2025
7 of 9 checks passed
@llvm-ci
Copy link
Collaborator

llvm-ci commented Jul 9, 2025

LLVM Buildbot has detected a new failure on builder clang-hip-vega20 running on hip-vega20-0 while building llvm at step 3 "annotate".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/123/builds/22990

Here is the relevant piece of the build log for the reference
Step 3 (annotate) failure: '../llvm-zorg/zorg/buildbot/builders/annotated/hip-build.sh --jobs=' (failure)
...
[57/59] Linking CXX executable External/HIP/cmath-hip-6.3.0
[58/59] Building CXX object External/HIP/CMakeFiles/TheNextWeek-hip-6.3.0.dir/workload/ray-tracing/TheNextWeek/main.cc.o
[59/59] Linking CXX executable External/HIP/TheNextWeek-hip-6.3.0
+ build_step 'Testing HIP test-suite'
+ echo '@@@BUILD_STEP Testing HIP test-suite@@@'
+ ninja check-hip-simple
@@@BUILD_STEP Testing HIP test-suite@@@
[0/1] cd /home/botworker/bbot/clang-hip-vega20/botworker/clang-hip-vega20/test-suite-build/External/HIP && /home/botworker/bbot/clang-hip-vega20/botworker/clang-hip-vega20/llvm/bin/llvm-lit -sv array-hip-6.3.0.test empty-hip-6.3.0.test with-fopenmp-hip-6.3.0.test saxpy-hip-6.3.0.test memmove-hip-6.3.0.test split-kernel-args-hip-6.3.0.test builtin-logb-scalbn-hip-6.3.0.test TheNextWeek-hip-6.3.0.test algorithm-hip-6.3.0.test cmath-hip-6.3.0.test complex-hip-6.3.0.test math_h-hip-6.3.0.test new-hip-6.3.0.test blender.test
-- Testing: 14 tests, 14 workers --
Testing:  0.. 10.. 20.. 30.. 40.. 50.. 60.. 70.. 80.. 90
FAIL: test-suite :: External/HIP/blender.test (14 of 14)
******************** TEST 'test-suite :: External/HIP/blender.test' FAILED ********************

/home/botworker/bbot/clang-hip-vega20/botworker/clang-hip-vega20/test-suite-build/tools/timeit-target --timeout 7200 --limit-core 0 --limit-cpu 7200 --limit-file-size 209715200 --limit-rss-size 838860800 --append-exitstatus --redirect-output /home/botworker/bbot/clang-hip-vega20/botworker/clang-hip-vega20/test-suite-build/External/HIP/Output/blender.test.out --redirect-input /dev/null --summary /home/botworker/bbot/clang-hip-vega20/botworker/clang-hip-vega20/test-suite-build/External/HIP/Output/blender.test.time /bin/bash test_blender.sh
/bin/bash verify_blender.sh /home/botworker/bbot/clang-hip-vega20/botworker/clang-hip-vega20/test-suite-build/External/HIP/Output/blender.test.out
Begin Blender test.
TEST_SUITE_HIP_ROOT=/opt/botworker/llvm/External/hip
Render /opt/botworker/llvm/External/hip/Blender_Scenes/290skydemo_release.blend
Blender 4.1.1 (hash e1743a0317bc built 2024-04-15 23:47:45)
Read blend: "/opt/botworker/llvm/External/hip/Blender_Scenes/290skydemo_release.blend"
Could not open as Ogawa file from provided streams.
Unable to open /opt/botworker/llvm/External/hip/Blender_Scenes/290skydemo2_flags.abc
WARN (bke.modifier): source/blender/blenkernel/intern/modifier.cc:425 BKE_modifier_set_error: Object: "GEO-flag.002", Modifier: "MeshSequenceCache", Could not create reader for file //290skydemo2_flags.abc
WARN (bke.modifier): source/blender/blenkernel/intern/modifier.cc:425 BKE_modifier_set_error: Object: "GEO-flag.003", Modifier: "MeshSequenceCache", Could not create reader for file //290skydemo2_flags.abc
WARN (bke.modifier): source/blender/blenkernel/intern/modifier.cc:425 BKE_modifier_set_error: Object: "GEO-flag", Modifier: "MeshSequenceCache", Could not create reader for file //290skydemo2_flags.abc
WARN (bke.modifier): source/blender/blenkernel/intern/modifier.cc:425 BKE_modifier_set_error: Object: "GEO-flag.004", Modifier: "MeshSequenceCache", Could not create reader for file //290skydemo2_flags.abc
WARN (bke.modifier): source/blender/blenkernel/intern/modifier.cc:425 BKE_modifier_set_error: Object: "GEO-flag.001", Modifier: "MeshSequenceCache", Could not create reader for file //290skydemo2_flags.abc
Could not open as Ogawa file from provided streams.
Unable to open /opt/botworker/llvm/External/hip/Blender_Scenes/290skydemo2_flags.abc
WARN (bke.modifier): source/blender/blenkernel/intern/modifier.cc:425 BKE_modifier_set_error: Object: "GEO-flag.004", Modifier: "MeshSequenceCache", Could not create reader for file //290skydemo2_flags.abc
WARN (bke.modifier): source/blender/blenkernel/intern/modifier.cc:425 BKE_modifier_set_error: Object: "GEO-flag.003", Modifier: "MeshSequenceCache", Could not create reader for file //290skydemo2_flags.abc
WARN (bke.modifier): source/blender/blenkernel/intern/modifier.cc:425 BKE_modifier_set_error: Object: "GEO-flag", Modifier: "MeshSequenceCache", Could not create reader for file //290skydemo2_flags.abc
WARN (bke.modifier): source/blender/blenkernel/intern/modifier.cc:425 BKE_modifier_set_error: Object: "GEO-flag.001", Modifier: "MeshSequenceCache", Could not create reader for file //290skydemo2_flags.abc
WARN (bke.modifier): source/blender/blenkernel/intern/modifier.cc:425 BKE_modifier_set_error: Object: "GEO-flag.002", Modifier: "MeshSequenceCache", Could not create reader for file //290skydemo2_flags.abc
I0709 12:54:21.188710 2370810 device.cpp:39] HIPEW initialization succeeded
I0709 12:54:21.190714 2370810 device.cpp:45] Found HIPCC hipcc
I0709 12:54:21.260131 2370810 device.cpp:207] Device has compute preemption or is not used for display.
I0709 12:54:21.260182 2370810 device.cpp:211] Added device "" with id "HIP__0000:a3:00".
I0709 12:54:21.260268 2370810 device.cpp:568] Mapped host memory limit set to 536,444,985,344 bytes. (499.60G)
I0709 12:54:21.260520 2370810 device_impl.cpp:63] Using AVX2 CPU kernels.
Fra:1 Mem:524.00M (Peak 524.71M) | Time:00:00.67 | Mem:0.00M, Peak:0.00M | Scene, View Layer | Synchronizing object | GEO-Eyepiece_rim
Fra:1 Mem:524.00M (Peak 524.71M) | Time:00:00.67 | Mem:0.00M, Peak:0.00M | Scene, View Layer | Synchronizing object | GEO-Rivets.016
Fra:1 Mem:524.00M (Peak 524.71M) | Time:00:00.67 | Mem:0.00M, Peak:0.00M | Scene, View Layer | Synchronizing object | GEO-Curve_Cables.004
Fra:1 Mem:524.19M (Peak 524.71M) | Time:00:00.67 | Mem:0.00M, Peak:0.00M | Scene, View Layer | Synchronizing object | GEO-Rivets.020
Fra:1 Mem:526.03M (Peak 526.03M) | Time:00:00.67 | Mem:0.00M, Peak:0.00M | Scene, View Layer | Synchronizing object | GEO-Curve_Wires
Fra:1 Mem:526.03M (Peak 526.03M) | Time:00:00.67 | Mem:0.00M, Peak:0.00M | Scene, View Layer | Synchronizing object | GEO-Rivets.021
Fra:1 Mem:526.07M (Peak 526.07M) | Time:00:00.67 | Mem:0.00M, Peak:0.00M | Scene, View Layer | Synchronizing object | GEO-Rivets.022
Fra:1 Mem:526.40M (Peak 526.40M) | Time:00:00.67 | Mem:0.00M, Peak:0.00M | Scene, View Layer | Synchronizing object | GEO-Rivets.023
Fra:1 Mem:526.42M (Peak 526.42M) | Time:00:00.67 | Mem:0.00M, Peak:0.00M | Scene, View Layer | Synchronizing object | GEO-Rivets.024
Step 12 (Testing HIP test-suite) failure: Testing HIP test-suite (failure)
@@@BUILD_STEP Testing HIP test-suite@@@
[0/1] cd /home/botworker/bbot/clang-hip-vega20/botworker/clang-hip-vega20/test-suite-build/External/HIP && /home/botworker/bbot/clang-hip-vega20/botworker/clang-hip-vega20/llvm/bin/llvm-lit -sv array-hip-6.3.0.test empty-hip-6.3.0.test with-fopenmp-hip-6.3.0.test saxpy-hip-6.3.0.test memmove-hip-6.3.0.test split-kernel-args-hip-6.3.0.test builtin-logb-scalbn-hip-6.3.0.test TheNextWeek-hip-6.3.0.test algorithm-hip-6.3.0.test cmath-hip-6.3.0.test complex-hip-6.3.0.test math_h-hip-6.3.0.test new-hip-6.3.0.test blender.test
-- Testing: 14 tests, 14 workers --
Testing:  0.. 10.. 20.. 30.. 40.. 50.. 60.. 70.. 80.. 90
FAIL: test-suite :: External/HIP/blender.test (14 of 14)
******************** TEST 'test-suite :: External/HIP/blender.test' FAILED ********************

/home/botworker/bbot/clang-hip-vega20/botworker/clang-hip-vega20/test-suite-build/tools/timeit-target --timeout 7200 --limit-core 0 --limit-cpu 7200 --limit-file-size 209715200 --limit-rss-size 838860800 --append-exitstatus --redirect-output /home/botworker/bbot/clang-hip-vega20/botworker/clang-hip-vega20/test-suite-build/External/HIP/Output/blender.test.out --redirect-input /dev/null --summary /home/botworker/bbot/clang-hip-vega20/botworker/clang-hip-vega20/test-suite-build/External/HIP/Output/blender.test.time /bin/bash test_blender.sh
/bin/bash verify_blender.sh /home/botworker/bbot/clang-hip-vega20/botworker/clang-hip-vega20/test-suite-build/External/HIP/Output/blender.test.out
Begin Blender test.
TEST_SUITE_HIP_ROOT=/opt/botworker/llvm/External/hip
Render /opt/botworker/llvm/External/hip/Blender_Scenes/290skydemo_release.blend
Blender 4.1.1 (hash e1743a0317bc built 2024-04-15 23:47:45)
Read blend: "/opt/botworker/llvm/External/hip/Blender_Scenes/290skydemo_release.blend"
Could not open as Ogawa file from provided streams.
Unable to open /opt/botworker/llvm/External/hip/Blender_Scenes/290skydemo2_flags.abc
WARN (bke.modifier): source/blender/blenkernel/intern/modifier.cc:425 BKE_modifier_set_error: Object: "GEO-flag.002", Modifier: "MeshSequenceCache", Could not create reader for file //290skydemo2_flags.abc
WARN (bke.modifier): source/blender/blenkernel/intern/modifier.cc:425 BKE_modifier_set_error: Object: "GEO-flag.003", Modifier: "MeshSequenceCache", Could not create reader for file //290skydemo2_flags.abc
WARN (bke.modifier): source/blender/blenkernel/intern/modifier.cc:425 BKE_modifier_set_error: Object: "GEO-flag", Modifier: "MeshSequenceCache", Could not create reader for file //290skydemo2_flags.abc
WARN (bke.modifier): source/blender/blenkernel/intern/modifier.cc:425 BKE_modifier_set_error: Object: "GEO-flag.004", Modifier: "MeshSequenceCache", Could not create reader for file //290skydemo2_flags.abc
WARN (bke.modifier): source/blender/blenkernel/intern/modifier.cc:425 BKE_modifier_set_error: Object: "GEO-flag.001", Modifier: "MeshSequenceCache", Could not create reader for file //290skydemo2_flags.abc
Could not open as Ogawa file from provided streams.
Unable to open /opt/botworker/llvm/External/hip/Blender_Scenes/290skydemo2_flags.abc
WARN (bke.modifier): source/blender/blenkernel/intern/modifier.cc:425 BKE_modifier_set_error: Object: "GEO-flag.004", Modifier: "MeshSequenceCache", Could not create reader for file //290skydemo2_flags.abc
WARN (bke.modifier): source/blender/blenkernel/intern/modifier.cc:425 BKE_modifier_set_error: Object: "GEO-flag.003", Modifier: "MeshSequenceCache", Could not create reader for file //290skydemo2_flags.abc
WARN (bke.modifier): source/blender/blenkernel/intern/modifier.cc:425 BKE_modifier_set_error: Object: "GEO-flag", Modifier: "MeshSequenceCache", Could not create reader for file //290skydemo2_flags.abc
WARN (bke.modifier): source/blender/blenkernel/intern/modifier.cc:425 BKE_modifier_set_error: Object: "GEO-flag.001", Modifier: "MeshSequenceCache", Could not create reader for file //290skydemo2_flags.abc
WARN (bke.modifier): source/blender/blenkernel/intern/modifier.cc:425 BKE_modifier_set_error: Object: "GEO-flag.002", Modifier: "MeshSequenceCache", Could not create reader for file //290skydemo2_flags.abc
I0709 12:54:21.188710 2370810 device.cpp:39] HIPEW initialization succeeded
I0709 12:54:21.190714 2370810 device.cpp:45] Found HIPCC hipcc
I0709 12:54:21.260131 2370810 device.cpp:207] Device has compute preemption or is not used for display.
I0709 12:54:21.260182 2370810 device.cpp:211] Added device "" with id "HIP__0000:a3:00".
I0709 12:54:21.260268 2370810 device.cpp:568] Mapped host memory limit set to 536,444,985,344 bytes. (499.60G)
I0709 12:54:21.260520 2370810 device_impl.cpp:63] Using AVX2 CPU kernels.
Fra:1 Mem:524.00M (Peak 524.71M) | Time:00:00.67 | Mem:0.00M, Peak:0.00M | Scene, View Layer | Synchronizing object | GEO-Eyepiece_rim
Fra:1 Mem:524.00M (Peak 524.71M) | Time:00:00.67 | Mem:0.00M, Peak:0.00M | Scene, View Layer | Synchronizing object | GEO-Rivets.016
Fra:1 Mem:524.00M (Peak 524.71M) | Time:00:00.67 | Mem:0.00M, Peak:0.00M | Scene, View Layer | Synchronizing object | GEO-Curve_Cables.004
Fra:1 Mem:524.19M (Peak 524.71M) | Time:00:00.67 | Mem:0.00M, Peak:0.00M | Scene, View Layer | Synchronizing object | GEO-Rivets.020
Fra:1 Mem:526.03M (Peak 526.03M) | Time:00:00.67 | Mem:0.00M, Peak:0.00M | Scene, View Layer | Synchronizing object | GEO-Curve_Wires
Fra:1 Mem:526.03M (Peak 526.03M) | Time:00:00.67 | Mem:0.00M, Peak:0.00M | Scene, View Layer | Synchronizing object | GEO-Rivets.021
Fra:1 Mem:526.07M (Peak 526.07M) | Time:00:00.67 | Mem:0.00M, Peak:0.00M | Scene, View Layer | Synchronizing object | GEO-Rivets.022
Fra:1 Mem:526.40M (Peak 526.40M) | Time:00:00.67 | Mem:0.00M, Peak:0.00M | Scene, View Layer | Synchronizing object | GEO-Rivets.023
Fra:1 Mem:526.42M (Peak 526.42M) | Time:00:00.67 | Mem:0.00M, Peak:0.00M | Scene, View Layer | Synchronizing object | GEO-Rivets.024
Fra:1 Mem:526.48M (Peak 526.48M) | Time:00:00.67 | Mem:0.00M, Peak:0.00M | Scene, View Layer | Synchronizing object | GEO-Rivets.025
Fra:1 Mem:526.71M (Peak 526.71M) | Time:00:00.67 | Mem:0.00M, Peak:0.00M | Scene, View Layer | Synchronizing object | GEO-Rivets.039
Fra:1 Mem:526.98M (Peak 526.98M) | Time:00:00.67 | Mem:0.00M, Peak:0.00M | Scene, View Layer | Synchronizing object | GEO-Hoses.003
Fra:1 Mem:527.67M (Peak 527.67M) | Time:00:00.67 | Mem:0.00M, Peak:0.00M | Scene, View Layer | Synchronizing object | GEO-Curve_wires
Fra:1 Mem:527.85M (Peak 527.85M) | Time:00:00.67 | Mem:0.00M, Peak:0.00M | Scene, View Layer | Synchronizing object | GEO-Curve_Connectors
Fra:1 Mem:529.16M (Peak 529.16M) | Time:00:00.68 | Mem:0.00M, Peak:0.00M | Scene, View Layer | Synchronizing object | GEO-Curve_Connectors.001
Fra:1 Mem:529.64M (Peak 529.64M) | Time:00:00.68 | Mem:0.00M, Peak:0.00M | Scene, View Layer | Synchronizing object | GEO-Curve_Connectors.002

llvm-sync bot pushed a commit to arm/arm-toolchain that referenced this pull request Jul 9, 2025
…sic rather than memset_pattern16 libcall (#126736)

In order to keep the change as incremental as possible, this only
introduces the memset.pattern intrinsic in cases where memset_pattern16
would have been used. Future patches can enable it on targets that don't
have the intrinsic, and select it in cases where the libcall isn't
directly usable. As the memset.pattern intrinsic takes the number of
times to store the pattern as an argument unlike memset_pattern16 which
takes the number of bytes to write, we no longer try to form an i128
pattern.

Special care is taken for cases where multiple stores in the same loop
iteration were combined to form a single pattern. For such cases, we
inherit the limitation that loops such as the following are supported:

```
for (unsigned i = 0; i < 2 * n; i += 2) {
  f[i] = 2;
  f[i+1] = 2;
}
```

But the following doesn't result in a memset.pattern (even though it
could be, by forming an appropriate pattern):
```
for (unsigned i = 0; i < 2 * n; i += 2) {
  f[i] = 2;
  f[i+1] = 3;
}
```

Addressing this existing deficiency is left for a follow-up due to a
desire not to change too much at once (i.e. to target equivalence to the
current codegen).

A command line option is introduced to force the selection of the
intrinsic even in cases it wouldn't be (i.e. in cases where the libcall
wouldn't have been selected). This is intended as a transitionary option
for testing and experimentation, to be removed at a later point.

The only platforms this should impact are those that have the memset_pattern16 libcall (Apple platforms). Testing performed to check for no unexpected codegen changes is described here llvm/llvm-project#126736 (comment)
@asb
Copy link
Contributor Author

asb commented Jul 9, 2025

LLVM Buildbot has detected a new failure on builder clang-hip-vega20 running on hip-vega20-0 while building llvm at step 3 "annotate".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/123/builds/22990

A subsequent build https://lab.llvm.org/buildbot/#/builders/123/builds/22992 passed, so I think there is not a failure introduced by this patch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants