
Conversation

@MDevereau
Contributor

Scalable masked loads and stores whose get.active.lane.mask covers a number of elements less than or equal to the scalable type's minimum element count can be proven to have a fixed size. Adding this information allows scalable masked loads and stores to benefit from alias analysis optimizations.
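
For illustration only (not part of the patch), a minimal sketch of the size reasoning; the function name and parameters below are placeholders standing in for the DataLayout and ScalableVectorType queries the real code uses:

  // Sketch: given get.active.lane.mask(Lo, Hi) with constant operands,
  // the masked op touches at most (Hi - Lo) leading elements, so its
  // location size is fixed even though the vector type is scalable.
  #include <cstdint>
  #include <optional>

  std::optional<uint64_t> fixedStoreSizeInBytes(uint64_t Lo, uint64_t Hi,
                                                uint64_t MinNumElts,
                                                uint64_t EltSizeInBytes) {
    if (Hi <= Lo)
      return std::nullopt;            // empty or malformed mask range
    uint64_t NumElts = Hi - Lo;       // active lanes form a contiguous prefix
    if (NumElts > MinNumElts)
      return std::nullopt;            // may spill past vscale * MinNumElts
    return NumElts * EltSizeInBytes;  // e.g. (0, 4) on nxv4f32 -> 16 bytes
  }

For example, a masked store of <vscale x 4 x float> guarded by get.active.lane.mask(i32 0, i32 4) writes at most 16 bytes, which is what lets DSE treat overlapping stores as dead in the tests below.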

@llvmbot llvmbot added the llvm:analysis Includes value tracking, cost tables and constant folding label Aug 21, 2025
@llvmbot
Member

llvmbot commented Aug 21, 2025

@llvm/pr-subscribers-llvm-analysis

Author: Matthew Devereau (MDevereau)

Changes

Scalable masked loads and stores whose get.active.lane.mask covers a number of elements less than or equal to the scalable type's minimum element count can be proven to have a fixed size. Adding this information allows scalable masked loads and stores to benefit from alias analysis optimizations.


Full diff: https://github.com/llvm/llvm-project/pull/154785.diff

2 Files Affected:

  • (modified) llvm/lib/Analysis/MemoryLocation.cpp (+44-11)
  • (added) llvm/test/Analysis/BasicAA/scalable-dse-aa.ll (+145)
diff --git a/llvm/lib/Analysis/MemoryLocation.cpp b/llvm/lib/Analysis/MemoryLocation.cpp
index 72b643c56a994..f2c3b843f70f6 100644
--- a/llvm/lib/Analysis/MemoryLocation.cpp
+++ b/llvm/lib/Analysis/MemoryLocation.cpp
@@ -150,6 +150,29 @@ MemoryLocation::getForDest(const CallBase *CB, const TargetLibraryInfo &TLI) {
   return MemoryLocation::getBeforeOrAfter(UsedV, CB->getAAMetadata());
 }
 
+static std::optional<FixedVectorType *>
+getFixedTypeFromScalableMemOp(Value *Mask, Type *Ty) {
+  auto ActiveLaneMask = dyn_cast<IntrinsicInst>(Mask);
+  if (!ActiveLaneMask ||
+      ActiveLaneMask->getIntrinsicID() != Intrinsic::get_active_lane_mask)
+    return std::nullopt;
+
+  auto ScalableTy = dyn_cast<ScalableVectorType>(Ty);
+  if (!ScalableTy)
+    return std::nullopt;
+
+  auto LaneMaskLo = dyn_cast<ConstantInt>(ActiveLaneMask->getOperand(0));
+  auto LaneMaskHi = dyn_cast<ConstantInt>(ActiveLaneMask->getOperand(1));
+  if (!LaneMaskLo || !LaneMaskHi)
+    return std::nullopt;
+
+  uint64_t NumElts = LaneMaskHi->getZExtValue() - LaneMaskLo->getZExtValue();
+  if (NumElts > ScalableTy->getMinNumElements())
+    return std::nullopt;
+
+  return FixedVectorType::get(ScalableTy->getElementType(), NumElts);
+}
+
 MemoryLocation MemoryLocation::getForArgument(const CallBase *Call,
                                               unsigned ArgIdx,
                                               const TargetLibraryInfo *TLI) {
@@ -213,20 +236,30 @@ MemoryLocation MemoryLocation::getForArgument(const CallBase *Call,
               cast<ConstantInt>(II->getArgOperand(0))->getZExtValue()),
           AATags);
 
-    case Intrinsic::masked_load:
+    case Intrinsic::masked_load: {
       assert(ArgIdx == 0 && "Invalid argument index");
-      return MemoryLocation(
-          Arg,
-          LocationSize::upperBound(DL.getTypeStoreSize(II->getType())),
-          AATags);
 
-    case Intrinsic::masked_store:
+      Type *Ty = II->getType();
+      auto KnownScalableSize =
+          getFixedTypeFromScalableMemOp(II->getOperand(2), Ty);
+      if (KnownScalableSize)
+        return MemoryLocation(Arg, DL.getTypeStoreSize(*KnownScalableSize),
+                              AATags);
+
+      return MemoryLocation(Arg, DL.getTypeStoreSize(Ty), AATags);
+    }
+    case Intrinsic::masked_store: {
       assert(ArgIdx == 1 && "Invalid argument index");
-      return MemoryLocation(
-          Arg,
-          LocationSize::upperBound(
-              DL.getTypeStoreSize(II->getArgOperand(0)->getType())),
-          AATags);
+
+      Type *Ty = II->getArgOperand(0)->getType();
+      auto KnownScalableSize =
+          getFixedTypeFromScalableMemOp(II->getOperand(3), Ty);
+      if (KnownScalableSize)
+        return MemoryLocation(Arg, DL.getTypeStoreSize(*KnownScalableSize),
+                              AATags);
+
+      return MemoryLocation(Arg, DL.getTypeStoreSize(Ty), AATags);
+    }
 
     case Intrinsic::invariant_end:
       // The first argument to an invariant.end is a "descriptor" type (e.g. a
diff --git a/llvm/test/Analysis/BasicAA/scalable-dse-aa.ll b/llvm/test/Analysis/BasicAA/scalable-dse-aa.ll
new file mode 100644
index 0000000000000..c12d1c2f25835
--- /dev/null
+++ b/llvm/test/Analysis/BasicAA/scalable-dse-aa.ll
@@ -0,0 +1,145 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+; RUN: opt < %s -aa-pipeline=basic-aa -passes=dse -S | FileCheck %s
+
+define <vscale x 4 x float> @dead_scalable_store(i32 %0, ptr %1) {
+; CHECK-LABEL: define <vscale x 4 x float> @dead_scalable_store(
+; CHECK: call void @llvm.masked.store.nxv4f32.p0(<vscale x 4 x float> %load.1.16, ptr nonnull %gep.arr.16, i32 1, <vscale x 4 x i1> %mask)
+; CHECK-NOT: call void @llvm.masked.store.nxv4f32.p0(<vscale x 4 x float> %load.1.32, ptr nonnull %gep.arr.32, i32 1, <vscale x 4 x i1> %mask)
+; CHECK: call void @llvm.masked.store.nxv4f32.p0(<vscale x 4 x float> %load.1.48, ptr nonnull %gep.arr.48, i32 1, <vscale x 4 x i1> %mask)
+;
+  %arr = alloca [64 x i32], align 4
+  %mask = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i32(i32 0, i32 4)
+
+  %gep.1.16 = getelementptr inbounds nuw i8, ptr %1, i64 16
+  %gep.1.32 = getelementptr inbounds nuw i8, ptr %1, i64 32
+  %gep.1.48 = getelementptr inbounds nuw i8, ptr %1, i64 48
+  %gep.arr.16 = getelementptr inbounds nuw i8, ptr %arr, i64 16
+  %gep.arr.32 = getelementptr inbounds nuw i8, ptr %arr, i64 32
+  %gep.arr.48 = getelementptr inbounds nuw i8, ptr %arr, i64 48
+
+  %load.1.16 = tail call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0(ptr nonnull %gep.1.16, i32 1, <vscale x 4 x i1> %mask, <vscale x 4 x float> zeroinitializer)
+  call void @llvm.masked.store.nxv4f32.p0(<vscale x 4 x float> %load.1.16, ptr nonnull %gep.arr.16, i32 1, <vscale x 4 x i1> %mask)
+
+  %load.1.32 = tail call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0(ptr nonnull %gep.1.32, i32 1, <vscale x 4 x i1> %mask, <vscale x 4 x float> zeroinitializer)
+  call void @llvm.masked.store.nxv4f32.p0(<vscale x 4 x float> %load.1.32, ptr nonnull %gep.arr.32, i32 1, <vscale x 4 x i1> %mask)
+
+  %load.1.48 = tail call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0(ptr nonnull %gep.1.48, i32 1, <vscale x 4 x i1> %mask, <vscale x 4 x float> zeroinitializer)
+  call void @llvm.masked.store.nxv4f32.p0(<vscale x 4 x float> %load.1.48, ptr nonnull %gep.arr.48, i32 1, <vscale x 4 x i1> %mask)
+
+  %faddop0 = call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0(ptr nonnull %gep.arr.16, i32 1, <vscale x 4 x i1> %mask, <vscale x 4 x float> zeroinitializer)
+  %faddop1 = call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0(ptr nonnull %gep.arr.48, i32 1, <vscale x 4 x i1> %mask, <vscale x 4 x float> zeroinitializer)
+  %fadd = fadd <vscale x 4 x float> %faddop0, %faddop1
+
+  ret <vscale x 4 x float> %fadd
+}
+
+define <vscale x 4 x float> @scalable_store_partial_overwrite(ptr %0) {
+; CHECK-LABEL: define <vscale x 4 x float> @scalable_store_partial_overwrite(
+; CHECK: call void @llvm.masked.store.nxv4f32.p0(<vscale x 4 x float> %load.0.16, ptr nonnull %gep.arr.16, i32 1, <vscale x 4 x i1> %mask)
+; CHECK: call void @llvm.masked.store.nxv4f32.p0(<vscale x 4 x float> %load.0.30, ptr nonnull %gep.arr.30, i32 1, <vscale x 4 x i1> %mask)
+; CHECK: call void @llvm.masked.store.nxv4f32.p0(<vscale x 4 x float> %load.0.48, ptr nonnull %gep.arr.48, i32 1, <vscale x 4 x i1> %mask)
+;
+  %arr = alloca [64 x i32], align 4
+  %mask = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i32(i32 0, i32 4)
+
+  %gep.0.16 = getelementptr inbounds nuw i8, ptr %0, i64 16
+  %gep.0.30 = getelementptr inbounds nuw i8, ptr %0, i64 30
+  %gep.0.48 = getelementptr inbounds nuw i8, ptr %0, i64 48
+  %gep.arr.16 = getelementptr inbounds nuw i8, ptr %arr, i64 16
+  %gep.arr.30 = getelementptr inbounds nuw i8, ptr %arr, i64 30
+  %gep.arr.48 = getelementptr inbounds nuw i8, ptr %arr, i64 48
+
+  %load.0.16 = tail call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0(ptr nonnull %gep.0.16, i32 1, <vscale x 4 x i1> %mask, <vscale x 4 x float> zeroinitializer)
+  call void @llvm.masked.store.nxv4f32.p0(<vscale x 4 x float> %load.0.16, ptr nonnull %gep.arr.16, i32 1, <vscale x 4 x i1> %mask)
+
+  %load.0.30 = tail call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0(ptr nonnull %gep.0.30, i32 1, <vscale x 4 x i1> %mask, <vscale x 4 x float> zeroinitializer)
+  call void @llvm.masked.store.nxv4f32.p0(<vscale x 4 x float> %load.0.30, ptr nonnull %gep.arr.30, i32 1, <vscale x 4 x i1> %mask)
+
+  %load.0.48 = tail call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0(ptr nonnull %gep.0.48, i32 1, <vscale x 4 x i1> %mask, <vscale x 4 x float> zeroinitializer)
+  call void @llvm.masked.store.nxv4f32.p0(<vscale x 4 x float> %load.0.48, ptr nonnull %gep.arr.48, i32 1, <vscale x 4 x i1> %mask)
+
+  %faddop0 = call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0(ptr nonnull %gep.arr.16, i32 1, <vscale x 4 x i1> %mask, <vscale x 4 x float> zeroinitializer)
+  %faddop1 = call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0(ptr nonnull %gep.arr.48, i32 1, <vscale x 4 x i1> %mask, <vscale x 4 x float> zeroinitializer)
+  %fadd = fadd <vscale x 4 x float> %faddop0, %faddop1
+
+  ret <vscale x 4 x float> %fadd
+}
+
+define <vscale x 4 x float> @dead_scalable_store_small_mask(ptr %0) {
+; CHECK-LABEL: define <vscale x 4 x float> @dead_scalable_store_small_mask(
+; CHECK: call void @llvm.masked.store.nxv4f32.p0(<vscale x 4 x float> %load.0.16, ptr nonnull %gep.arr.16, i32 1, <vscale x 4 x i1> %mask)
+; CHECK-NOT: call void @llvm.masked.store.nxv4f32.p0(<vscale x 4 x float> %load.0.30, ptr nonnull %gep.arr.30, i32 1, <vscale x 4 x i1> %mask)
+; CHECK: call void @llvm.masked.store.nxv4f32.p0(<vscale x 4 x float> %load.0.46, ptr nonnull %gep.arr.46, i32 1, <vscale x 4 x i1> %mask)
+  %arr = alloca [64 x i32], align 4
+  %mask = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i32(i32 0, i32 4)
+
+  %gep.0.16 = getelementptr inbounds nuw i8, ptr %0, i64 16
+  %gep.0.30 = getelementptr inbounds nuw i8, ptr %0, i64 30
+  %gep.0.46 = getelementptr inbounds nuw i8, ptr %0, i64 46
+  %gep.arr.16 = getelementptr inbounds nuw i8, ptr %arr, i64 16
+  %gep.arr.30 = getelementptr inbounds nuw i8, ptr %arr, i64 30
+  %gep.arr.46 = getelementptr inbounds nuw i8, ptr %arr, i64 46
+
+  %load.0.16 = tail call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0(ptr nonnull %gep.0.16, i32 1, <vscale x 4 x i1> %mask, <vscale x 4 x float> zeroinitializer)
+  call void @llvm.masked.store.nxv4f32.p0(<vscale x 4 x float> %load.0.16, ptr nonnull %gep.arr.16, i32 1, <vscale x 4 x i1> %mask)
+
+  %load.0.30 = tail call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0(ptr nonnull %gep.0.30, i32 1, <vscale x 4 x i1> %mask, <vscale x 4 x float> zeroinitializer)
+  call void @llvm.masked.store.nxv4f32.p0(<vscale x 4 x float> %load.0.30, ptr nonnull %gep.arr.30, i32 1, <vscale x 4 x i1> %mask)
+
+  %load.0.46 = tail call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0(ptr nonnull %gep.0.46, i32 1, <vscale x 4 x i1> %mask, <vscale x 4 x float> zeroinitializer)
+  call void @llvm.masked.store.nxv4f32.p0(<vscale x 4 x float> %load.0.46, ptr nonnull %gep.arr.46, i32 1, <vscale x 4 x i1> %mask)
+
+  %smallmask = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.32(i32 0, i32 2)
+  %faddop0 = call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0(ptr nonnull %gep.arr.16, i32 1, <vscale x 4 x i1> %smallmask, <vscale x 4 x float> zeroinitializer)
+  %faddop1 = call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0(ptr nonnull %gep.arr.46, i32 1, <vscale x 4 x i1> %mask, <vscale x 4 x float> zeroinitializer)
+  %fadd = fadd <vscale x 4 x float> %faddop0, %faddop1
+
+  ret <vscale x 4 x float> %fadd
+}
+
+define <vscale x 4 x float> @dead_scalar_store(ptr noalias %0, ptr %1) {
+; CHECK-LABEL: define <vscale x 4 x float> @dead_scalar_store(
+; CHECK-NOT: store i32 20, ptr %gep.1.12
+;
+  %mask = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i32(i32 0, i32 4)
+  %gep.1.12 = getelementptr inbounds nuw i8, ptr %1, i64 12
+  store i32 20, ptr %gep.1.12
+
+  %load.0 = call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0(ptr nonnull %0, i32 1, <vscale x 4 x i1> %mask, <vscale x 4 x float> zeroinitializer)
+  call void @llvm.masked.store.nxv4f32.p0(<vscale x 4 x float> %load.0, ptr nonnull %1, i32 1, <vscale x 4 x i1> %mask)
+  %retval = call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0(ptr nonnull %1, i32 1, <vscale x 4 x i1> %mask, <vscale x 4 x float> zeroinitializer)
+  ret <vscale x 4 x float> %retval
+}
+
+; We don't know if the scalar store is dead as we can't determine vscale.
+; This get active lane mask may cover 4 or 8 integers
+define <vscale x 4 x float> @mask_gt_minimum_num_elts(ptr noalias %0, ptr %1) {
+; CHECK-LABEL: define <vscale x 4 x float> @mask_gt_minimum_num_elts(
+; CHECK: store i32 20, ptr %gep.1.28
+;
+  %mask = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i32(i32 0, i32 8)
+  %gep.1.28 = getelementptr inbounds nuw i8, ptr %1, i64 28
+  store i32 20, ptr %gep.1.28
+
+  %load.0 = tail call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0(ptr nonnull %0, i32 1, <vscale x 4 x i1> %mask, <vscale x 4 x float> zeroinitializer)
+  call void @llvm.masked.store.nxv4f32.p0(<vscale x 4 x float> %load.0, ptr nonnull %1, i32 1, <vscale x 4 x i1> %mask)
+  %retval = call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0(ptr nonnull %1, i32 1, <vscale x 4 x i1> %mask, <vscale x 4 x float> zeroinitializer)
+  ret <vscale x 4 x float> %retval
+}
+
+define <vscale x 16 x i8> @scalar_stores_small_mask(ptr noalias %0, ptr %1) {
+; CHECK-LABEL: define <vscale x 16 x i8> @scalar_stores_small_mask(
+; CHECK-NOT: store i8 60, ptr %gep.1.6
+; CHECK: store i8 120, ptr %gep.1.8
+;
+  %mask = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i8.i32(i32 0, i32 7)
+  %gep.1.6 = getelementptr inbounds nuw i8, ptr %1, i64 6
+  store i8 60, ptr %gep.1.6
+  %gep.1.8 = getelementptr inbounds nuw i8, ptr %1, i64 8
+  store i8 120,   ptr %gep.1.8
+
+  %load.0 = tail call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr nonnull %0, i32 1, <vscale x 16 x i1> %mask, <vscale x 16 x i8> zeroinitializer)
+  call void @llvm.masked.store.nxv16i8.p0(<vscale x 16 x i8> %load.0, ptr %1, i32 1, <vscale x 16 x i1> %mask)
+  %retval = call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr %1, i32 1, <vscale x 16 x i1> %mask, <vscale x 16 x i8> zeroinitializer)
+  ret <vscale x 16 x i8> %retval
+}


static std::optional<FixedVectorType *>
getFixedTypeFromScalableMemOp(Value *Mask, Type *Ty) {
auto ActiveLaneMask = dyn_cast<IntrinsicInst>(Mask);
Contributor

Suggested change
auto ActiveLaneMask = dyn_cast<IntrinsicInst>(Mask);
auto *ActiveLaneMask = dyn_cast<IntrinsicInst>(Mask);

etc

if (!LaneMaskLo || !LaneMaskHi)
return std::nullopt;

uint64_t NumElts = LaneMaskHi->getZExtValue() - LaneMaskLo->getZExtValue();
Contributor

Shouldn't this just use LaneMaskHi, without subtracting LaneMaskLo?

NumElts is the number of active elements, but the memory location is relative to the original base pointer, so inactive elements at the start need to be counted as well.

It looks like you are missing tests for a non-zero base index.

Contributor Author

inactive elements at the start need to be counted as well.

From the LangRef for get.active.lane.mask:

%m[i] = icmp ult (%base + i), %n

From this definition I interpreted that it's not possible for the first N elements to be 0 and then have 1s following: the mask returned from this intrinsic either begins with a 1, is all false, or is poison. If there is a range of 3 between %base and %n, the new mask size would be 3. I added the test dead_scalar_store_offset for this scenario.
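
A small standalone model of that LangRef rule (plain C++, purely illustrative) shows why the active lanes are always a leading run when %n > %base:

  #include <cstdint>
  #include <vector>

  // %m[i] = icmp ult (%base + i), %n  -- from the LangRef.
  // Because (base + i) grows with i, once the comparison fails it stays
  // false, so the active lanes are a contiguous prefix of length
  // max(n - base, 0), capped at the vector length.
  std::vector<bool> activeLaneMask(uint64_t Base, uint64_t N, unsigned VL) {
    std::vector<bool> M(VL);
    for (unsigned I = 0; I < VL; ++I)
      M[I] = (Base + I) < N;   // overflow ignored in this toy model
    return M;
  }
  // activeLaneMask(2, 5, 4) == {1, 1, 1, 0}: three active lanes starting at
  // lane 0, matching NumElts = Hi - Lo = 3 relative to the base pointer.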

Type *Ty = II->getType();
auto KnownScalableSize =
getFixedTypeFromScalableMemOp(II->getOperand(2), Ty);
if (KnownScalableSize)
Contributor

You could do:

  if (auto KnownScalableSize = getFixedTypeFromScalableMemOp(II->getOperand(2), Ty))
    Ty = *KnownScalableSize;
  return MemoryLocation(Arg, DL.getTypeStoreSize(Ty), AATags);

and similarly for the store below.


static std::optional<FixedVectorType *>
getFixedTypeFromScalableMemOp(Value *Mask, Type *Ty) {
auto ActiveLaneMask = dyn_cast<IntrinsicInst>(Mask);
Contributor

I think you can rewrite this and the code to get the lo and hi values below by using match, i.e.

  auto ScalableTy = dyn_cast<ScalableVectorType>(Ty);
  if (!ScalableTy)
    return std::nullopt;

  Value *LaneMaskLo, *LaneMaskHi;
  if (!match(Mask, m_Intrinsic<Intrinsic::get_active_lane_mask>(m_Value(LaneMaskLo), m_Value(LaneMaskHi))))
    return std::nullopt;

  uint64_t NumElts = LaneMaskHi->getZExtValue() - LaneMaskLo->getZExtValue();
  ...

Contributor Author

I avoided this because I was worried about compile time. This module doesn't include any match headers, plus it's quite low level and small sized. I haven't tested this though so it might not be a big deal.

Contributor Author

I've added matching logic, it's simple to remove if it does hamper compile time.

if (!LaneMaskLo || !LaneMaskHi)
return std::nullopt;

uint64_t NumElts = LaneMaskHi->getZExtValue() - LaneMaskLo->getZExtValue();
Contributor

I think you need to be careful with logic like this because it's possible for hi to be lower than lo. I think you need an extra check like:

   if (LaneMaskHi->getZExtValue() <= LaneMaskLo->getZExtValue())
     return std::nullopt;

If the mask would return all-false then essentially this operation doesn't touch memory at all, although I don't know if we have to worry about that.

From the LangRef:

The '``llvm.get.active.lane.mask.*``' intrinsics are semantically equivalent
to:

::

      %m[i] = icmp ult (%base + i), %n

Contributor Author

I've added bail outs for (LaneMaskHi <= LaneMaskLo) and (LaneMaskHi == 0)

ActiveLaneMask->getIntrinsicID() != Intrinsic::get_active_lane_mask)
return std::nullopt;

auto ScalableTy = dyn_cast<ScalableVectorType>(Ty);
Contributor

I think we can probably remove this restriction as the mask works equally well for fixed-width vectors too.

Contributor Author

I've reworked it to work with fixed vectors too


auto LaneMaskLo = dyn_cast<ConstantInt>(ActiveLaneMask->getOperand(0));
auto LaneMaskHi = dyn_cast<ConstantInt>(ActiveLaneMask->getOperand(1));
if (!LaneMaskLo || !LaneMaskHi)
Contributor

Until we have clarity on #152140 I think we should explicitly bail out if LaneMaskHi is 0 for now.

Use patternmatch logic
Add pointer tokens to auto declarations
Add offset test dead_scalar_store_offset
Contributor

@david-arm david-arm left a comment

Thanks for addressing all the comments so far @MDevereau! I have a few more ...

%load.1.16 = tail call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0(ptr nonnull %gep.1.16, i32 1, <vscale x 4 x i1> %mask, <vscale x 4 x float> zeroinitializer)
call void @llvm.masked.store.nxv4f32.p0(<vscale x 4 x float> %load.1.16, ptr nonnull %gep.arr.16, i32 1, <vscale x 4 x i1> %mask)

%load.1.32 = tail call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0(ptr nonnull %gep.1.32, i32 1, <vscale x 4 x i1> %mask, <vscale x 4 x float> zeroinitializer)
Contributor

Is it safe to add tail to memory intrinsic calls? Might be good to remove from all masked load/store intrinsics.

Contributor Author

Done.

@@ -0,0 +1,196 @@
; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
Contributor

Can you remove this NOTE line please? It looks like the CHECK lines below have been edited so they're not really autogenerated.

Contributor Author

Done

; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
; RUN: opt < %s -aa-pipeline=basic-aa -passes=dse -S | FileCheck %s

define <vscale x 4 x float> @dead_scalable_store(i32 %0, ptr %1) {
Contributor

Looks like %0 is unused and can be removed?

Contributor Author

Done


; We don't know if the scalar store is dead as we can't determine vscale.
; This get active lane mask may cover 4 or 8 integers
define <vscale x 4 x float> @mask_gt_minimum_num_elts(ptr noalias %0, ptr %1) {
Contributor

Is it worth having a variant of this where the scalar store is to a lower address, i.e.

%gep.1.12 = getelementptr inbounds nuw i8, ptr %1, i64 12

I can imagine in future the MemoryLocation class might be able to handle the concept of a memory range, which in this case is equivalent to a range of 16-32 bytes. In theory, we should still be able to remove this scalar store, but that's probably a much bigger piece of work. The same applies to functions with a vscale_range attribute.

Contributor Author

I've added a store and check for byte 12 to this test.

ret <vscale x 4 x float> %retval
}

; Don't do anything if the 2nd Op of get active lane mask is 0. This currently generates poison
Contributor

nit: Might be worth phrasing this a bit differently, i.e.

; TODO: Improve this once we have clarity in the LangRef for get.active.lane.mask
; regarding the expected behaviour when the second operand is 0.

What do you think?

Contributor Author

Since the arguments to get.active.lane.mask are unsigned, I've removed this test and the LaneMaskHi == 0 check, since the LaneMaskHi <= LaneMaskLo bail-out also covers that case.

; CHECK-LABEL: define <vscale x 4 x float> @mask_hi_0(
; CHECK: store i32 20, ptr %1
;
%mask = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i32(i32 0, i32 0)
Contributor

I'm not sure how useful this test is because if we did treat (0,0) as returning an all-false mask (instead of poison according to the LangRef), then you'd expect the masked store below to be a nop, since it doesn't write to anything. In this case I'd still expect the scalar store to remain as before. Perhaps a test like this?

define <4 x i32> @mask_hi_0(ptr %0) {
  %mask = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i32(i32 0, i32 0)
  call void @llvm.masked.store.nxv4i32.p0(<vscale x 4 x i32> splat(i32 1), ptr nonnull %0, i32 1, <vscale x 4 x i1> %mask)
  store i8 3, ptr %0
  store i32 20, ptr %0
  %retval = load <4 x i32>, ptr %0, align 1
  ret <4 x i32> %retval
}

You can see that the store i8 3, ptr %0 gets deleted, but the masked store remains. We should be able to kill off the masked store as well.

Incidentally, the pass doesn't deal with fixed-width constant masks either, such as <i32 1, i32 1, i32 0, i32 0> or zeroinitializer, but perhaps that can be done in a different patch?
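
A rough sketch of what such a follow-up could look like (hypothetical helper, not part of this PR): derive a fixed element count from a fixed-width constant mask by counting its contiguous active prefix.

  #include "llvm/IR/Constants.h"
  #include <optional>
  using namespace llvm;

  // Hypothetical follow-up sketch: returns the length of the leading run of
  // active lanes in a constant mask such as <i1 1, i1 1, i1 0, i1 0>, or
  // std::nullopt if the active lanes are not a leading run.
  static std::optional<unsigned>
  getFixedPrefixFromConstantMask(Constant *Mask, unsigned NumElts) {
    unsigned Prefix = 0;
    bool SeenZero = false;
    for (unsigned I = 0; I < NumElts; ++I) {
      Constant *Elt = Mask->getAggregateElement(I);
      if (!Elt)
        return std::nullopt;
      if (Elt->isOneValue()) {
        if (SeenZero)
          return std::nullopt; // active lane after an inactive one: give up
        ++Prefix;
      } else if (Elt->isZeroValue()) {
        SeenZero = true;
      } else {
        return std::nullopt;   // undef/poison element: be conservative
      }
    }
    return Prefix;             // zeroinitializer yields 0 active lanes
  }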

Contributor Author

We discussed this privately and decided that for the (0,0) case, we cannot confirm it as a no-op because it will return poison. We also decided that a case such as (3,3) could be lowered to a no-op by a later patch if any interest in the case arises.

return std::nullopt;

uint64_t NumElts = LaneMaskHi - LaneMaskLo;
if (NumElts > Ty->getElementCount().getKnownMinValue())
Contributor

I think we only need to bail out if it's a scalable vector, e.g. we can support get.active.lane.mask with arguments (0,8) when the vector type is <4 x i32>. The maximum permitted by the type is 4.

If you change this code, would be good to have a fixed-width test showing it works. Thanks!

Contributor Author

@MDevereau MDevereau Sep 1, 2025

I've added a clamp for the fixed width. I've added the test dead_scalable_store_fixed_large_mask to assert it.
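
The clamp presumably has roughly this shape (a sketch under that assumption, not the exact committed code): for fixed vectors the mask count can simply be clamped to the type's element count, while scalable vectors still have to bail out past the minimum.

  #include "llvm/IR/DerivedTypes.h"
  #include "llvm/Support/Casting.h"
  #include <algorithm>
  #include <optional>
  using namespace llvm;

  // Assumed shape of the clamp, not the exact committed code.
  static std::optional<uint64_t> clampedNumElts(uint64_t LaneMaskLo,
                                                uint64_t LaneMaskHi,
                                                VectorType *Ty) {
    if (LaneMaskHi <= LaneMaskLo)
      return std::nullopt;
    uint64_t NumElts = LaneMaskHi - LaneMaskLo;
    // Fixed width: (0, 8) on <4 x i32> touches at most 4 elements.
    if (auto *FixedTy = dyn_cast<FixedVectorType>(Ty))
      return std::min<uint64_t>(NumElts, FixedTy->getNumElements());
    // Scalable: only a fixed bound if the mask fits in the minimum number
    // of elements, since the real length is vscale * min.
    if (NumElts > cast<ScalableVectorType>(Ty)->getMinNumElements())
      return std::nullopt;
    return NumElts;
  }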

}

; Don't do anything if the 2nd Op is gt/eq the 1st
define <vscale x 4 x float> @active_lane_mask_gt_eq(ptr noalias %0, ptr %1) {
Contributor

The comment above for @mask_hi_0 applies here too I think.

Contributor Author

I've split this out into active_lane_mask_eq and active_lane_mask_lt

%gep.1.6 = getelementptr inbounds nuw i8, ptr %1, i64 6
store i8 60, ptr %gep.1.6
%gep.1.8 = getelementptr inbounds nuw i8, ptr %1, i64 8
store i8 120, ptr %gep.1.8
Contributor

nit: whitespace before ptr

Contributor Author

Done

if (NumElts > Ty->getElementCount().getKnownMinValue())
return std::nullopt;

return FixedVectorType::get(Ty->getElementType(), NumElts);
Contributor

Would be good to have at least one test for fixed-width masked stores as well. Maybe just a variant of @dead_scalable_store?

Contributor Author

I've added the test dead_scalable_store_fixed

Use APInt maths
Refine tests
Contributor

@david-arm david-arm left a comment

LGTM!

LocationSize::upperBound(
DL.getTypeStoreSize(II->getArgOperand(0)->getType())),
AATags);
Arg, LocationSize::upperBound(DL.getTypeStoreSize(Ty)), AATags);
Contributor

Not for this PR, but upperBound is currently extremely pessimistic for scalable vectors because for unknown pointers the upper bound is just the top of the address space. In future, we could improve the upper bound by taking vscale_range into account, or just add the ScalableBit to ImpreciseBit in the LocationSize object.

For normal stores we call this function instead:

  static LocationSize precise(TypeSize Value) {
    return LocationSize(Value.getKnownMinValue(), Value.isScalable());
  }

@MDevereau MDevereau merged commit f831463 into llvm:main Sep 4, 2025
9 checks passed