
Conversation

@MDevereau
Contributor

Scalable masked loads and stores whose get.active.lane.mask covers a number of elements less than or equal to the scalable type's minimum element count can be proven to have a fixed size. Adding this information allows scalable masked loads and stores to benefit from alias analysis optimizations.
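
For illustration only (not part of the patch), a minimal sketch of the size reasoning; the function name and parameters below are placeholders standing in for the DataLayout and ScalableVectorType queries the real code uses:

  // Sketch: given get.active.lane.mask(Lo, Hi) with constant operands,
  // the masked op touches at most (Hi - Lo) leading elements, so its
  // location size is fixed even though the vector type is scalable.
  #include <cstdint>
  #include <optional>

  std::optional<uint64_t> fixedStoreSizeInBytes(uint64_t Lo, uint64_t Hi,
                                                uint64_t MinNumElts,
                                                uint64_t EltSizeInBytes) {
    if (Hi <= Lo)
      return std::nullopt;            // empty or malformed mask range
    uint64_t NumElts = Hi - Lo;       // active lanes form a contiguous prefix
    if (NumElts > MinNumElts)
      return std::nullopt;            // may spill past vscale * MinNumElts
    return NumElts * EltSizeInBytes;  // e.g. (0, 4) on nxv4f32 -> 16 bytes
  }

For example, a masked store of <vscale x 4 x float> guarded by get.active.lane.mask(i32 0, i32 4) writes at most 16 bytes, which is what lets DSE treat overlapping stores as dead in the tests below.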

@llvmbot llvmbot added the llvm:analysis Includes value tracking, cost tables and constant folding label Aug 21, 2025
@llvmbot
Member

llvmbot commented Aug 21, 2025

@llvm/pr-subscribers-llvm-analysis

Author: Matthew Devereau (MDevereau)

Changes

Scalable masked loads and stores whose get.active.lane.mask covers a number of elements less than or equal to the scalable type's minimum element count can be proven to have a fixed size. Adding this information allows scalable masked loads and stores to benefit from alias analysis optimizations.


Full diff: https://github.com/llvm/llvm-project/pull/154785.diff

2 Files Affected:

  • (modified) llvm/lib/Analysis/MemoryLocation.cpp (+44-11)
  • (added) llvm/test/Analysis/BasicAA/scalable-dse-aa.ll (+145)
diff --git a/llvm/lib/Analysis/MemoryLocation.cpp b/llvm/lib/Analysis/MemoryLocation.cpp
index 72b643c56a994..f2c3b843f70f6 100644
--- a/llvm/lib/Analysis/MemoryLocation.cpp
+++ b/llvm/lib/Analysis/MemoryLocation.cpp
@@ -150,6 +150,29 @@ MemoryLocation::getForDest(const CallBase *CB, const TargetLibraryInfo &TLI) {
   return MemoryLocation::getBeforeOrAfter(UsedV, CB->getAAMetadata());
 }
 
+static std::optional<FixedVectorType *>
+getFixedTypeFromScalableMemOp(Value *Mask, Type *Ty) {
+  auto ActiveLaneMask = dyn_cast<IntrinsicInst>(Mask);
+  if (!ActiveLaneMask ||
+      ActiveLaneMask->getIntrinsicID() != Intrinsic::get_active_lane_mask)
+    return std::nullopt;
+
+  auto ScalableTy = dyn_cast<ScalableVectorType>(Ty);
+  if (!ScalableTy)
+    return std::nullopt;
+
+  auto LaneMaskLo = dyn_cast<ConstantInt>(ActiveLaneMask->getOperand(0));
+  auto LaneMaskHi = dyn_cast<ConstantInt>(ActiveLaneMask->getOperand(1));
+  if (!LaneMaskLo || !LaneMaskHi)
+    return std::nullopt;
+
+  uint64_t NumElts = LaneMaskHi->getZExtValue() - LaneMaskLo->getZExtValue();
+  if (NumElts > ScalableTy->getMinNumElements())
+    return std::nullopt;
+
+  return FixedVectorType::get(ScalableTy->getElementType(), NumElts);
+}
+
 MemoryLocation MemoryLocation::getForArgument(const CallBase *Call,
                                               unsigned ArgIdx,
                                               const TargetLibraryInfo *TLI) {
@@ -213,20 +236,30 @@ MemoryLocation MemoryLocation::getForArgument(const CallBase *Call,
               cast<ConstantInt>(II->getArgOperand(0))->getZExtValue()),
           AATags);
 
-    case Intrinsic::masked_load:
+    case Intrinsic::masked_load: {
       assert(ArgIdx == 0 && "Invalid argument index");
-      return MemoryLocation(
-          Arg,
-          LocationSize::upperBound(DL.getTypeStoreSize(II->getType())),
-          AATags);
 
-    case Intrinsic::masked_store:
+      Type *Ty = II->getType();
+      auto KnownScalableSize =
+          getFixedTypeFromScalableMemOp(II->getOperand(2), Ty);
+      if (KnownScalableSize)
+        return MemoryLocation(Arg, DL.getTypeStoreSize(*KnownScalableSize),
+                              AATags);
+
+      return MemoryLocation(Arg, DL.getTypeStoreSize(Ty), AATags);
+    }
+    case Intrinsic::masked_store: {
       assert(ArgIdx == 1 && "Invalid argument index");
-      return MemoryLocation(
-          Arg,
-          LocationSize::upperBound(
-              DL.getTypeStoreSize(II->getArgOperand(0)->getType())),
-          AATags);
+
+      Type *Ty = II->getArgOperand(0)->getType();
+      auto KnownScalableSize =
+          getFixedTypeFromScalableMemOp(II->getOperand(3), Ty);
+      if (KnownScalableSize)
+        return MemoryLocation(Arg, DL.getTypeStoreSize(*KnownScalableSize),
+                              AATags);
+
+      return MemoryLocation(Arg, DL.getTypeStoreSize(Ty), AATags);
+    }
 
     case Intrinsic::invariant_end:
       // The first argument to an invariant.end is a "descriptor" type (e.g. a
diff --git a/llvm/test/Analysis/BasicAA/scalable-dse-aa.ll b/llvm/test/Analysis/BasicAA/scalable-dse-aa.ll
new file mode 100644
index 0000000000000..c12d1c2f25835
--- /dev/null
+++ b/llvm/test/Analysis/BasicAA/scalable-dse-aa.ll
@@ -0,0 +1,145 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+; RUN: opt < %s -aa-pipeline=basic-aa -passes=dse -S | FileCheck %s
+
+define <vscale x 4 x float> @dead_scalable_store(i32 %0, ptr %1) {
+; CHECK-LABEL: define <vscale x 4 x float> @dead_scalable_store(
+; CHECK: call void @llvm.masked.store.nxv4f32.p0(<vscale x 4 x float> %load.1.16, ptr nonnull %gep.arr.16, i32 1, <vscale x 4 x i1> %mask)
+; CHECK-NOT: call void @llvm.masked.store.nxv4f32.p0(<vscale x 4 x float> %load.1.32, ptr nonnull %gep.arr.32, i32 1, <vscale x 4 x i1> %mask)
+; CHECK: call void @llvm.masked.store.nxv4f32.p0(<vscale x 4 x float> %load.1.48, ptr nonnull %gep.arr.48, i32 1, <vscale x 4 x i1> %mask)
+;
+  %arr = alloca [64 x i32], align 4
+  %mask = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i32(i32 0, i32 4)
+
+  %gep.1.16 = getelementptr inbounds nuw i8, ptr %1, i64 16
+  %gep.1.32 = getelementptr inbounds nuw i8, ptr %1, i64 32
+  %gep.1.48 = getelementptr inbounds nuw i8, ptr %1, i64 48
+  %gep.arr.16 = getelementptr inbounds nuw i8, ptr %arr, i64 16
+  %gep.arr.32 = getelementptr inbounds nuw i8, ptr %arr, i64 32
+  %gep.arr.48 = getelementptr inbounds nuw i8, ptr %arr, i64 48
+
+  %load.1.16 = tail call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0(ptr nonnull %gep.1.16, i32 1, <vscale x 4 x i1> %mask, <vscale x 4 x float> zeroinitializer)
+  call void @llvm.masked.store.nxv4f32.p0(<vscale x 4 x float> %load.1.16, ptr nonnull %gep.arr.16, i32 1, <vscale x 4 x i1> %mask)
+
+  %load.1.32 = tail call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0(ptr nonnull %gep.1.32, i32 1, <vscale x 4 x i1> %mask, <vscale x 4 x float> zeroinitializer)
+  call void @llvm.masked.store.nxv4f32.p0(<vscale x 4 x float> %load.1.32, ptr nonnull %gep.arr.32, i32 1, <vscale x 4 x i1> %mask)
+
+  %load.1.48 = tail call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0(ptr nonnull %gep.1.48, i32 1, <vscale x 4 x i1> %mask, <vscale x 4 x float> zeroinitializer)
+  call void @llvm.masked.store.nxv4f32.p0(<vscale x 4 x float> %load.1.48, ptr nonnull %gep.arr.48, i32 1, <vscale x 4 x i1> %mask)
+
+  %faddop0 = call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0(ptr nonnull %gep.arr.16, i32 1, <vscale x 4 x i1> %mask, <vscale x 4 x float> zeroinitializer)
+  %faddop1 = call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0(ptr nonnull %gep.arr.48, i32 1, <vscale x 4 x i1> %mask, <vscale x 4 x float> zeroinitializer)
+  %fadd = fadd <vscale x 4 x float> %faddop0, %faddop1
+
+  ret <vscale x 4 x float> %fadd
+}
+
+define <vscale x 4 x float> @scalable_store_partial_overwrite(ptr %0) {
+; CHECK-LABEL: define <vscale x 4 x float> @scalable_store_partial_overwrite(
+; CHECK: call void @llvm.masked.store.nxv4f32.p0(<vscale x 4 x float> %load.0.16, ptr nonnull %gep.arr.16, i32 1, <vscale x 4 x i1> %mask)
+; CHECK: call void @llvm.masked.store.nxv4f32.p0(<vscale x 4 x float> %load.0.30, ptr nonnull %gep.arr.30, i32 1, <vscale x 4 x i1> %mask)
+; CHECK: call void @llvm.masked.store.nxv4f32.p0(<vscale x 4 x float> %load.0.48, ptr nonnull %gep.arr.48, i32 1, <vscale x 4 x i1> %mask)
+;
+  %arr = alloca [64 x i32], align 4
+  %mask = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i32(i32 0, i32 4)
+
+  %gep.0.16 = getelementptr inbounds nuw i8, ptr %0, i64 16
+  %gep.0.30 = getelementptr inbounds nuw i8, ptr %0, i64 30
+  %gep.0.48 = getelementptr inbounds nuw i8, ptr %0, i64 48
+  %gep.arr.16 = getelementptr inbounds nuw i8, ptr %arr, i64 16
+  %gep.arr.30 = getelementptr inbounds nuw i8, ptr %arr, i64 30
+  %gep.arr.48 = getelementptr inbounds nuw i8, ptr %arr, i64 48
+
+  %load.0.16 = tail call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0(ptr nonnull %gep.0.16, i32 1, <vscale x 4 x i1> %mask, <vscale x 4 x float> zeroinitializer)
+  call void @llvm.masked.store.nxv4f32.p0(<vscale x 4 x float> %load.0.16, ptr nonnull %gep.arr.16, i32 1, <vscale x 4 x i1> %mask)
+
+  %load.0.30 = tail call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0(ptr nonnull %gep.0.30, i32 1, <vscale x 4 x i1> %mask, <vscale x 4 x float> zeroinitializer)
+  call void @llvm.masked.store.nxv4f32.p0(<vscale x 4 x float> %load.0.30, ptr nonnull %gep.arr.30, i32 1, <vscale x 4 x i1> %mask)
+
+  %load.0.48 = tail call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0(ptr nonnull %gep.0.48, i32 1, <vscale x 4 x i1> %mask, <vscale x 4 x float> zeroinitializer)
+  call void @llvm.masked.store.nxv4f32.p0(<vscale x 4 x float> %load.0.48, ptr nonnull %gep.arr.48, i32 1, <vscale x 4 x i1> %mask)
+
+  %faddop0 = call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0(ptr nonnull %gep.arr.16, i32 1, <vscale x 4 x i1> %mask, <vscale x 4 x float> zeroinitializer)
+  %faddop1 = call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0(ptr nonnull %gep.arr.48, i32 1, <vscale x 4 x i1> %mask, <vscale x 4 x float> zeroinitializer)
+  %fadd = fadd <vscale x 4 x float> %faddop0, %faddop1
+
+  ret <vscale x 4 x float> %fadd
+}
+
+define <vscale x 4 x float> @dead_scalable_store_small_mask(ptr %0) {
+; CHECK-LABEL: define <vscale x 4 x float> @dead_scalable_store_small_mask(
+; CHECK: call void @llvm.masked.store.nxv4f32.p0(<vscale x 4 x float> %load.0.16, ptr nonnull %gep.arr.16, i32 1, <vscale x 4 x i1> %mask)
+; CHECK-NOT: call void @llvm.masked.store.nxv4f32.p0(<vscale x 4 x float> %load.0.30, ptr nonnull %gep.arr.30, i32 1, <vscale x 4 x i1> %mask)
+; CHECK: call void @llvm.masked.store.nxv4f32.p0(<vscale x 4 x float> %load.0.46, ptr nonnull %gep.arr.46, i32 1, <vscale x 4 x i1> %mask)
+  %arr = alloca [64 x i32], align 4
+  %mask = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i32(i32 0, i32 4)
+
+  %gep.0.16 = getelementptr inbounds nuw i8, ptr %0, i64 16
+  %gep.0.30 = getelementptr inbounds nuw i8, ptr %0, i64 30
+  %gep.0.46 = getelementptr inbounds nuw i8, ptr %0, i64 46
+  %gep.arr.16 = getelementptr inbounds nuw i8, ptr %arr, i64 16
+  %gep.arr.30 = getelementptr inbounds nuw i8, ptr %arr, i64 30
+  %gep.arr.46 = getelementptr inbounds nuw i8, ptr %arr, i64 46
+
+  %load.0.16 = tail call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0(ptr nonnull %gep.0.16, i32 1, <vscale x 4 x i1> %mask, <vscale x 4 x float> zeroinitializer)
+  call void @llvm.masked.store.nxv4f32.p0(<vscale x 4 x float> %load.0.16, ptr nonnull %gep.arr.16, i32 1, <vscale x 4 x i1> %mask)
+
+  %load.0.30 = tail call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0(ptr nonnull %gep.0.30, i32 1, <vscale x 4 x i1> %mask, <vscale x 4 x float> zeroinitializer)
+  call void @llvm.masked.store.nxv4f32.p0(<vscale x 4 x float> %load.0.30, ptr nonnull %gep.arr.30, i32 1, <vscale x 4 x i1> %mask)
+
+  %load.0.46 = tail call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0(ptr nonnull %gep.0.46, i32 1, <vscale x 4 x i1> %mask, <vscale x 4 x float> zeroinitializer)
+  call void @llvm.masked.store.nxv4f32.p0(<vscale x 4 x float> %load.0.46, ptr nonnull %gep.arr.46, i32 1, <vscale x 4 x i1> %mask)
+
+  %smallmask = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.32(i32 0, i32 2)
+  %faddop0 = call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0(ptr nonnull %gep.arr.16, i32 1, <vscale x 4 x i1> %smallmask, <vscale x 4 x float> zeroinitializer)
+  %faddop1 = call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0(ptr nonnull %gep.arr.46, i32 1, <vscale x 4 x i1> %mask, <vscale x 4 x float> zeroinitializer)
+  %fadd = fadd <vscale x 4 x float> %faddop0, %faddop1
+
+  ret <vscale x 4 x float> %fadd
+}
+
+define <vscale x 4 x float> @dead_scalar_store(ptr noalias %0, ptr %1) {
+; CHECK-LABEL: define <vscale x 4 x float> @dead_scalar_store(
+; CHECK-NOT: store i32 20, ptr %gep.1.12
+;
+  %mask = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i32(i32 0, i32 4)
+  %gep.1.12 = getelementptr inbounds nuw i8, ptr %1, i64 12
+  store i32 20, ptr %gep.1.12
+
+  %load.0 = call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0(ptr nonnull %0, i32 1, <vscale x 4 x i1> %mask, <vscale x 4 x float> zeroinitializer)
+  call void @llvm.masked.store.nxv4f32.p0(<vscale x 4 x float> %load.0, ptr nonnull %1, i32 1, <vscale x 4 x i1> %mask)
+  %retval = call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0(ptr nonnull %1, i32 1, <vscale x 4 x i1> %mask, <vscale x 4 x float> zeroinitializer)
+  ret <vscale x 4 x float> %retval
+}
+
+; We don't know if the scalar store is dead as we can't determine vscale.
+; This get active lane mask may cover 4 or 8 integers
+define <vscale x 4 x float> @mask_gt_minimum_num_elts(ptr noalias %0, ptr %1) {
+; CHECK-LABEL: define <vscale x 4 x float> @mask_gt_minimum_num_elts(
+; CHECK: store i32 20, ptr %gep.1.28
+;
+  %mask = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i32(i32 0, i32 8)
+  %gep.1.28 = getelementptr inbounds nuw i8, ptr %1, i64 28
+  store i32 20, ptr %gep.1.28
+
+  %load.0 = tail call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0(ptr nonnull %0, i32 1, <vscale x 4 x i1> %mask, <vscale x 4 x float> zeroinitializer)
+  call void @llvm.masked.store.nxv4f32.p0(<vscale x 4 x float> %load.0, ptr nonnull %1, i32 1, <vscale x 4 x i1> %mask)
+  %retval = call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0(ptr nonnull %1, i32 1, <vscale x 4 x i1> %mask, <vscale x 4 x float> zeroinitializer)
+  ret <vscale x 4 x float> %retval
+}
+
+define <vscale x 16 x i8> @scalar_stores_small_mask(ptr noalias %0, ptr %1) {
+; CHECK-LABEL: define <vscale x 16 x i8> @scalar_stores_small_mask(
+; CHECK-NOT: store i8 60, ptr %gep.1.6
+; CHECK: store i8 120, ptr %gep.1.8
+;
+  %mask = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i8.i32(i32 0, i32 7)
+  %gep.1.6 = getelementptr inbounds nuw i8, ptr %1, i64 6
+  store i8 60, ptr %gep.1.6
+  %gep.1.8 = getelementptr inbounds nuw i8, ptr %1, i64 8
+  store i8 120,   ptr %gep.1.8
+
+  %load.0 = tail call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr nonnull %0, i32 1, <vscale x 16 x i1> %mask, <vscale x 16 x i8> zeroinitializer)
+  call void @llvm.masked.store.nxv16i8.p0(<vscale x 16 x i8> %load.0, ptr %1, i32 1, <vscale x 16 x i1> %mask)
+  %retval = call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr %1, i32 1, <vscale x 16 x i1> %mask, <vscale x 16 x i8> zeroinitializer)
+  ret <vscale x 16 x i8> %retval
+}


static std::optional<FixedVectorType *>
getFixedTypeFromScalableMemOp(Value *Mask, Type *Ty) {
auto ActiveLaneMask = dyn_cast<IntrinsicInst>(Mask);
Contributor

Suggested change
auto ActiveLaneMask = dyn_cast<IntrinsicInst>(Mask);
auto *ActiveLaneMask = dyn_cast<IntrinsicInst>(Mask);

etc

if (!LaneMaskLo || !LaneMaskHi)
return std::nullopt;

uint64_t NumElts = LaneMaskHi->getZExtValue() - LaneMaskLo->getZExtValue();
Contributor

Shouldn't this just use LaneMaskHi, without subtracting LaneMaskLo?

NumElts is the number of active elements, but the memory location is relative to the original base pointer, so inactive elements at the start need to be counted as well.

It looks like you are missing tests for a non-zero base index.

Contributor Author

inactive elements at the start need to be counted as well.

From the LangRef for get.active.lane.mask:

%m[i] = icmp ult (%base + i), %n

From this definition I interpreted that it's not possible for the first N elements to be 0 and then have 1s following: the mask returned from this intrinsic either begins with a 1, is all false, or is poison. If there is a range of 3 between %base and %n, the new mask size would be 3. I added the test dead_scalar_store_offset for this scenario.
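
A small standalone model of that LangRef rule (plain C++, purely illustrative) shows why the active lanes are always a leading run when %n > %base:

  #include <cstdint>
  #include <vector>

  // %m[i] = icmp ult (%base + i), %n  -- from the LangRef.
  // Because (base + i) grows with i, once the comparison fails it stays
  // false, so the active lanes are a contiguous prefix of length
  // max(n - base, 0), capped at the vector length.
  std::vector<bool> activeLaneMask(uint64_t Base, uint64_t N, unsigned VL) {
    std::vector<bool> M(VL);
    for (unsigned I = 0; I < VL; ++I)
      M[I] = (Base + I) < N;   // overflow ignored in this toy model
    return M;
  }
  // activeLaneMask(2, 5, 4) == {1, 1, 1, 0}: three active lanes starting at
  // lane 0, matching NumElts = Hi - Lo = 3 relative to the base pointer.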

Type *Ty = II->getType();
auto KnownScalableSize =
getFixedTypeFromScalableMemOp(II->getOperand(2), Ty);
if (KnownScalableSize)
Contributor

You could do:

  if (auto KnownScalableSize = getFixedTypeFromScalableMemOp(II->getOperand(2), Ty))
    Ty = *KnownScalableSize;
  return MemoryLocation(Arg, DL.getTypeStoreSize(Ty), AATags);

and similarly for the store below.


static std::optional<FixedVectorType *>
getFixedTypeFromScalableMemOp(Value *Mask, Type *Ty) {
auto ActiveLaneMask = dyn_cast<IntrinsicInst>(Mask);
Contributor

I think you can rewrite this and the code to get the lo and hi values below by using match, i.e.

  auto ScalableTy = dyn_cast<ScalableVectorType>(Ty);
  if (!ScalableTy)
    return std::nullopt;

  Value *LaneMaskLo, *LaneMaskHi;
  if (!match(Mask, m_Intrinsic<Intrinsic::get_active_lane_mask>(m_Value(LaneMaskLo), m_Value(LaneMaskHi))))
    return std::nullopt;

  uint64_t NumElts = LaneMaskHi->getZExtValue() - LaneMaskLo->getZExtValue();
  ...

Contributor Author

I avoided this because I was worried about compile time. This module doesn't include any match headers, plus it's quite low level and small sized. I haven't tested this though so it might not be a big deal.

Contributor Author

I've added matching logic, it's simple to remove if it does hamper compile time.

if (!LaneMaskLo || !LaneMaskHi)
return std::nullopt;

uint64_t NumElts = LaneMaskHi->getZExtValue() - LaneMaskLo->getZExtValue();
Contributor

I think you need to be careful with logic like this because it's possible for hi to be lower than lo. I think you need an extra check like:

   if (LaneMaskHi->getZExtValue() <= LaneMaskLo->getZExtValue())
     return std::nullopt;

If the mask would return all-false then essentially this operation doesn't touch memory at all, although I don't know if we have to worry about that.

From the LangRef:

The '``llvm.get.active.lane.mask.*``' intrinsics are semantically equivalent
to:

::

      %m[i] = icmp ult (%base + i), %n

Contributor Author

I've added bail outs for (LaneMaskHi <= LaneMaskLo) and (LaneMaskHi == 0)

ActiveLaneMask->getIntrinsicID() != Intrinsic::get_active_lane_mask)
return std::nullopt;

auto ScalableTy = dyn_cast<ScalableVectorType>(Ty);
Contributor

I think we can probably remove this restriction as the mask works equally well for fixed-width vectors too.

Contributor Author

I've reworked it to work with fixed vectors too


auto LaneMaskLo = dyn_cast<ConstantInt>(ActiveLaneMask->getOperand(0));
auto LaneMaskHi = dyn_cast<ConstantInt>(ActiveLaneMask->getOperand(1));
if (!LaneMaskLo || !LaneMaskHi)
Contributor

Until we have clarity on #152140 I think we should explicitly bail out if LaneMaskHi is 0 for now.

Use patternmatch logic
Add pointer tokens to auto declarations
Add offset test dead_scalar_store_offset
Contributor

@david-arm david-arm left a comment

Thanks for addressing all the comments so far @MDevereau! I have a few more ...

%load.1.16 = tail call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0(ptr nonnull %gep.1.16, i32 1, <vscale x 4 x i1> %mask, <vscale x 4 x float> zeroinitializer)
call void @llvm.masked.store.nxv4f32.p0(<vscale x 4 x float> %load.1.16, ptr nonnull %gep.arr.16, i32 1, <vscale x 4 x i1> %mask)

%load.1.32 = tail call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0(ptr nonnull %gep.1.32, i32 1, <vscale x 4 x i1> %mask, <vscale x 4 x float> zeroinitializer)
Contributor

Is it safe to add tail to memory intrinsic calls? Might be good to remove from all masked load/store intrinsics.

Contributor Author

Done.

@@ -0,0 +1,196 @@
; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
Contributor

Can you remove this NOTE line please? It looks like the CHECK lines below have been edited so they're not really autogenerated.

Contributor Author

Done

; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
; RUN: opt < %s -aa-pipeline=basic-aa -passes=dse -S | FileCheck %s

define <vscale x 4 x float> @dead_scalable_store(i32 %0, ptr %1) {
Contributor

Looks like %0 is unused and can be removed?

Contributor Author

Done


; We don't know if the scalar store is dead as we can't determine vscale.
; This get active lane mask may cover 4 or 8 integers
define <vscale x 4 x float> @mask_gt_minimum_num_elts(ptr noalias %0, ptr %1) {
Contributor

Is it worth having a variant of this where the scalar store is to a lower address, i.e.

%gep.1.12 = getelementptr inbounds nuw i8, ptr %1, i64 12

I can imagine in future the MemoryLocation class might be able to handle the concept of a memory range, which in this case is equivalent to a range of 16-32 bytes. In theory, we should still be able to remove this scalar store, but that's probably a much bigger piece of work. The same applies to functions with a vscale_range attribute.

Contributor Author

I've added a store and check for byte 12 to this test.

ret <vscale x 4 x float> %retval
}

; Don't do anything if the 2nd Op of get active lane mask is 0. This currently generates poison
Contributor

nit: Might be worth phrasing this a bit differently, i.e.

; TODO: Improve this once we have clarity in the LangRef for get.active.lane.mask
; regarding the expected behaviour when the second operand is 0.

What do you think?

Contributor Author

Since the arguments to get.active.lane.mask are unsigned, I've removed this test and the LaneMaskHi == 0 check, since the LaneMaskHi <= LaneMaskLo bail-out also covers that case.

; CHECK-LABEL: define <vscale x 4 x float> @mask_hi_0(
; CHECK: store i32 20, ptr %1
;
%mask = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i32(i32 0, i32 0)
Contributor

I'm not sure how useful this test is because if we did treat (0,0) as returning an all-false mask (instead of poison according to the LangRef), then you'd expect the masked store below to be a nop, since it doesn't write to anything. In this case I'd still expect the scalar store to remain as before. Perhaps a test like this?

define <4 x i32> @mask_hi_0(ptr %0) {
  %mask = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i32(i32 0, i32 0)
  call void @llvm.masked.store.nxv4i32.p0(<vscale x 4 x i32> splat(i32 1), ptr nonnull %0, i32 1, <vscale x 4 x i1> %mask)
  store i8 3, ptr %0
  store i32 20, ptr %0
  %retval = load <4 x i32>, ptr %0, align 1
  ret <4 x i32> %retval
}

You can see that the store i8 3, ptr %0 gets deleted, but the masked store remains. We should be able to kill off the masked store as well.

Incidentally, the pass doesn't deal with fixed-width constant masks either, such as <i32 1, i32 1, i32 0, i32 0> or zeroinitializer, but perhaps that can be done in a different patch?
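
A rough sketch of what such a follow-up could look like (hypothetical helper, not part of this PR): derive a fixed element count from a fixed-width constant mask by counting its contiguous active prefix.

  #include "llvm/IR/Constants.h"
  #include <optional>
  using namespace llvm;

  // Hypothetical follow-up sketch: returns the length of the leading run of
  // active lanes in a constant mask such as <i1 1, i1 1, i1 0, i1 0>, or
  // std::nullopt if the active lanes are not a leading run.
  static std::optional<unsigned>
  getFixedPrefixFromConstantMask(Constant *Mask, unsigned NumElts) {
    unsigned Prefix = 0;
    bool SeenZero = false;
    for (unsigned I = 0; I < NumElts; ++I) {
      Constant *Elt = Mask->getAggregateElement(I);
      if (!Elt)
        return std::nullopt;
      if (Elt->isOneValue()) {
        if (SeenZero)
          return std::nullopt; // active lane after an inactive one: give up
        ++Prefix;
      } else if (Elt->isZeroValue()) {
        SeenZero = true;
      } else {
        return std::nullopt;   // undef/poison element: be conservative
      }
    }
    return Prefix;             // zeroinitializer yields 0 active lanes
  }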

Contributor Author

We discussed this privately and decided that for the (0,0) case, we cannot confirm it as a no-op because it will return poison. We also decided that a case such as (3,3) could be lowered to a no-op by a later patch if any interest in the case arises.

return std::nullopt;

uint64_t NumElts = LaneMaskHi - LaneMaskLo;
if (NumElts > Ty->getElementCount().getKnownMinValue())
Contributor

I think we only need to bail out if it's a scalable vector, e.g. we can support get.active.lane.mask with arguments (0,8) when the vector type is <4 x i32>. The maximum permitted by the type is 4.

If you change this code, would be good to have a fixed-width test showing it works. Thanks!

Contributor Author

@MDevereau MDevereau Sep 1, 2025

I've added a clamp for the fixed width. I've added the test dead_scalable_store_fixed_large_mask to assert it.
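
The clamp presumably has roughly this shape (a sketch under that assumption, not the exact committed code): for fixed vectors the mask count can simply be clamped to the type's element count, while scalable vectors still have to bail out past the minimum.

  #include "llvm/IR/DerivedTypes.h"
  #include "llvm/Support/Casting.h"
  #include <algorithm>
  #include <optional>
  using namespace llvm;

  // Assumed shape of the clamp, not the exact committed code.
  static std::optional<uint64_t> clampedNumElts(uint64_t LaneMaskLo,
                                                uint64_t LaneMaskHi,
                                                VectorType *Ty) {
    if (LaneMaskHi <= LaneMaskLo)
      return std::nullopt;
    uint64_t NumElts = LaneMaskHi - LaneMaskLo;
    // Fixed width: (0, 8) on <4 x i32> touches at most 4 elements.
    if (auto *FixedTy = dyn_cast<FixedVectorType>(Ty))
      return std::min<uint64_t>(NumElts, FixedTy->getNumElements());
    // Scalable: only a fixed bound if the mask fits in the minimum number
    // of elements, since the real length is vscale * min.
    if (NumElts > cast<ScalableVectorType>(Ty)->getMinNumElements())
      return std::nullopt;
    return NumElts;
  }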

}

; Don't do anything if the 2nd Op is gt/eq the 1st
define <vscale x 4 x float> @active_lane_mask_gt_eq(ptr noalias %0, ptr %1) {
Contributor

The comment above for @mask_hi_0 applies here too I think.

Contributor Author

I've split this out into active_lane_mask_eq and active_lane_mask_lt

%gep.1.6 = getelementptr inbounds nuw i8, ptr %1, i64 6
store i8 60, ptr %gep.1.6
%gep.1.8 = getelementptr inbounds nuw i8, ptr %1, i64 8
store i8 120, ptr %gep.1.8
Contributor

nit: whitespace before ptr

Contributor Author

Done

if (NumElts > Ty->getElementCount().getKnownMinValue())
return std::nullopt;

return FixedVectorType::get(Ty->getElementType(), NumElts);
Contributor

Would be good to have at least one test for fixed-width masked stores as well. Maybe just a variant of @dead_scalable_store?

Contributor Author

I've added the test dead_scalable_store_fixed

Use APInt maths
Refine tests
Contributor

@david-arm david-arm left a comment

LGTM!

LocationSize::upperBound(
DL.getTypeStoreSize(II->getArgOperand(0)->getType())),
AATags);
Arg, LocationSize::upperBound(DL.getTypeStoreSize(Ty)), AATags);
Contributor

Not for this PR, but upperBound is currently extremely pessimistic for scalable vectors because for unknown pointers the upper bound is just the top of the address space. In future, we could improve the upper bound by taking vscale_range into account, or just add the ScalableBit to ImpreciseBit in the LocationSize object.

For normal stores we call this function instead:

  static LocationSize precise(TypeSize Value) {
    return LocationSize(Value.getKnownMinValue(), Value.isScalable());
  }

@MDevereau MDevereau merged commit f831463 into llvm:main Sep 4, 2025
9 checks passed