[MemoryLocation] Size Scalable Masked MemOps #154785
Conversation
Scalable masked loads and stores with a get.active.lane.mask whose active lane count is less than or equal to the scalable type's minimum number of elements can be proven to have a fixed size. Adding this information allows scalable masked loads and stores to benefit from alias analysis optimizations.
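As a rough worked example of that bound (a standalone sketch with illustrative names, not code from this patch): for <vscale x 4 x float>, a mask of get.active.lane.mask(0, 4) touches at most 4 floats (16 bytes) whatever vscale is, whereas get.active.lane.mask(0, 8) might touch 4 or 8 lanes, so no fixed size can be claimed.

  #include <cstdint>
  #include <optional>

  // Sketch: a constant get.active.lane.mask(Lo, Hi) activates a leading prefix
  // of Hi - Lo lanes, so the access size is fixed whenever that prefix fits
  // inside the scalable type's known-minimum lane count.
  std::optional<uint64_t> knownAccessBytes(uint64_t Lo, uint64_t Hi,
                                           uint64_t MinLanes, uint64_t EltBytes) {
    if (Hi <= Lo)
      return std::nullopt;          // all-false (or poison) mask: give up
    uint64_t ActiveLanes = Hi - Lo; // active lanes form a leading prefix
    if (ActiveLanes > MinLanes)
      return std::nullopt;          // vscale > 1 could make more lanes real
    return ActiveLanes * EltBytes;  // fixed, independent of vscale
  }

  // e.g. knownAccessBytes(0, 4, /*MinLanes=*/4, /*EltBytes=*/4) == 16 bytes,
  // while knownAccessBytes(0, 8, 4, 4) has no fixed answer.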
@llvm/pr-subscribers-llvm-analysis

Author: Matthew Devereau (MDevereau)

Changes: Scalable masked loads and stores with a get.active.lane.mask whose active lane count is less than or equal to the scalable type's minimum number of elements can be proven to have a fixed size. Adding this information allows scalable masked loads and stores to benefit from alias analysis optimizations.

Full diff: https://github.com/llvm/llvm-project/pull/154785.diff

2 Files Affected:
diff --git a/llvm/lib/Analysis/MemoryLocation.cpp b/llvm/lib/Analysis/MemoryLocation.cpp
index 72b643c56a994..f2c3b843f70f6 100644
--- a/llvm/lib/Analysis/MemoryLocation.cpp
+++ b/llvm/lib/Analysis/MemoryLocation.cpp
@@ -150,6 +150,29 @@ MemoryLocation::getForDest(const CallBase *CB, const TargetLibraryInfo &TLI) {
return MemoryLocation::getBeforeOrAfter(UsedV, CB->getAAMetadata());
}
+static std::optional<FixedVectorType *>
+getFixedTypeFromScalableMemOp(Value *Mask, Type *Ty) {
+ auto ActiveLaneMask = dyn_cast<IntrinsicInst>(Mask);
+ if (!ActiveLaneMask ||
+ ActiveLaneMask->getIntrinsicID() != Intrinsic::get_active_lane_mask)
+ return std::nullopt;
+
+ auto ScalableTy = dyn_cast<ScalableVectorType>(Ty);
+ if (!ScalableTy)
+ return std::nullopt;
+
+ auto LaneMaskLo = dyn_cast<ConstantInt>(ActiveLaneMask->getOperand(0));
+ auto LaneMaskHi = dyn_cast<ConstantInt>(ActiveLaneMask->getOperand(1));
+ if (!LaneMaskLo || !LaneMaskHi)
+ return std::nullopt;
+
+ uint64_t NumElts = LaneMaskHi->getZExtValue() - LaneMaskLo->getZExtValue();
+ if (NumElts > ScalableTy->getMinNumElements())
+ return std::nullopt;
+
+ return FixedVectorType::get(ScalableTy->getElementType(), NumElts);
+}
+
MemoryLocation MemoryLocation::getForArgument(const CallBase *Call,
unsigned ArgIdx,
const TargetLibraryInfo *TLI) {
@@ -213,20 +236,30 @@ MemoryLocation MemoryLocation::getForArgument(const CallBase *Call,
cast<ConstantInt>(II->getArgOperand(0))->getZExtValue()),
AATags);
- case Intrinsic::masked_load:
+ case Intrinsic::masked_load: {
assert(ArgIdx == 0 && "Invalid argument index");
- return MemoryLocation(
- Arg,
- LocationSize::upperBound(DL.getTypeStoreSize(II->getType())),
- AATags);
- case Intrinsic::masked_store:
+ Type *Ty = II->getType();
+ auto KnownScalableSize =
+ getFixedTypeFromScalableMemOp(II->getOperand(2), Ty);
+ if (KnownScalableSize)
+ return MemoryLocation(Arg, DL.getTypeStoreSize(*KnownScalableSize),
+ AATags);
+
+ return MemoryLocation(Arg, DL.getTypeStoreSize(Ty), AATags);
+ }
+ case Intrinsic::masked_store: {
assert(ArgIdx == 1 && "Invalid argument index");
- return MemoryLocation(
- Arg,
- LocationSize::upperBound(
- DL.getTypeStoreSize(II->getArgOperand(0)->getType())),
- AATags);
+
+ Type *Ty = II->getArgOperand(0)->getType();
+ auto KnownScalableSize =
+ getFixedTypeFromScalableMemOp(II->getOperand(3), Ty);
+ if (KnownScalableSize)
+ return MemoryLocation(Arg, DL.getTypeStoreSize(*KnownScalableSize),
+ AATags);
+
+ return MemoryLocation(Arg, DL.getTypeStoreSize(Ty), AATags);
+ }
case Intrinsic::invariant_end:
// The first argument to an invariant.end is a "descriptor" type (e.g. a
diff --git a/llvm/test/Analysis/BasicAA/scalable-dse-aa.ll b/llvm/test/Analysis/BasicAA/scalable-dse-aa.ll
new file mode 100644
index 0000000000000..c12d1c2f25835
--- /dev/null
+++ b/llvm/test/Analysis/BasicAA/scalable-dse-aa.ll
@@ -0,0 +1,145 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+; RUN: opt < %s -aa-pipeline=basic-aa -passes=dse -S | FileCheck %s
+
+define <vscale x 4 x float> @dead_scalable_store(i32 %0, ptr %1) {
+; CHECK-LABEL: define <vscale x 4 x float> @dead_scalable_store(
+; CHECK: call void @llvm.masked.store.nxv4f32.p0(<vscale x 4 x float> %load.1.16, ptr nonnull %gep.arr.16, i32 1, <vscale x 4 x i1> %mask)
+; CHECK-NOT: call void @llvm.masked.store.nxv4f32.p0(<vscale x 4 x float> %load.1.32, ptr nonnull %gep.arr.32, i32 1, <vscale x 4 x i1> %mask)
+; CHECK: call void @llvm.masked.store.nxv4f32.p0(<vscale x 4 x float> %load.1.48, ptr nonnull %gep.arr.48, i32 1, <vscale x 4 x i1> %mask)
+;
+ %arr = alloca [64 x i32], align 4
+ %mask = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i32(i32 0, i32 4)
+
+ %gep.1.16 = getelementptr inbounds nuw i8, ptr %1, i64 16
+ %gep.1.32 = getelementptr inbounds nuw i8, ptr %1, i64 32
+ %gep.1.48 = getelementptr inbounds nuw i8, ptr %1, i64 48
+ %gep.arr.16 = getelementptr inbounds nuw i8, ptr %arr, i64 16
+ %gep.arr.32 = getelementptr inbounds nuw i8, ptr %arr, i64 32
+ %gep.arr.48 = getelementptr inbounds nuw i8, ptr %arr, i64 48
+
+ %load.1.16 = tail call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0(ptr nonnull %gep.1.16, i32 1, <vscale x 4 x i1> %mask, <vscale x 4 x float> zeroinitializer)
+ call void @llvm.masked.store.nxv4f32.p0(<vscale x 4 x float> %load.1.16, ptr nonnull %gep.arr.16, i32 1, <vscale x 4 x i1> %mask)
+
+ %load.1.32 = tail call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0(ptr nonnull %gep.1.32, i32 1, <vscale x 4 x i1> %mask, <vscale x 4 x float> zeroinitializer)
+ call void @llvm.masked.store.nxv4f32.p0(<vscale x 4 x float> %load.1.32, ptr nonnull %gep.arr.32, i32 1, <vscale x 4 x i1> %mask)
+
+ %load.1.48 = tail call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0(ptr nonnull %gep.1.48, i32 1, <vscale x 4 x i1> %mask, <vscale x 4 x float> zeroinitializer)
+ call void @llvm.masked.store.nxv4f32.p0(<vscale x 4 x float> %load.1.48, ptr nonnull %gep.arr.48, i32 1, <vscale x 4 x i1> %mask)
+
+ %faddop0 = call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0(ptr nonnull %gep.arr.16, i32 1, <vscale x 4 x i1> %mask, <vscale x 4 x float> zeroinitializer)
+ %faddop1 = call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0(ptr nonnull %gep.arr.48, i32 1, <vscale x 4 x i1> %mask, <vscale x 4 x float> zeroinitializer)
+ %fadd = fadd <vscale x 4 x float> %faddop0, %faddop1
+
+ ret <vscale x 4 x float> %fadd
+}
+
+define <vscale x 4 x float> @scalable_store_partial_overwrite(ptr %0) {
+; CHECK-LABEL: define <vscale x 4 x float> @scalable_store_partial_overwrite(
+; CHECK: call void @llvm.masked.store.nxv4f32.p0(<vscale x 4 x float> %load.0.16, ptr nonnull %gep.arr.16, i32 1, <vscale x 4 x i1> %mask)
+; CHECK: call void @llvm.masked.store.nxv4f32.p0(<vscale x 4 x float> %load.0.30, ptr nonnull %gep.arr.30, i32 1, <vscale x 4 x i1> %mask)
+; CHECK: call void @llvm.masked.store.nxv4f32.p0(<vscale x 4 x float> %load.0.48, ptr nonnull %gep.arr.48, i32 1, <vscale x 4 x i1> %mask)
+;
+ %arr = alloca [64 x i32], align 4
+ %mask = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i32(i32 0, i32 4)
+
+ %gep.0.16 = getelementptr inbounds nuw i8, ptr %0, i64 16
+ %gep.0.30 = getelementptr inbounds nuw i8, ptr %0, i64 30
+ %gep.0.48 = getelementptr inbounds nuw i8, ptr %0, i64 48
+ %gep.arr.16 = getelementptr inbounds nuw i8, ptr %arr, i64 16
+ %gep.arr.30 = getelementptr inbounds nuw i8, ptr %arr, i64 30
+ %gep.arr.48 = getelementptr inbounds nuw i8, ptr %arr, i64 48
+
+ %load.0.16 = tail call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0(ptr nonnull %gep.0.16, i32 1, <vscale x 4 x i1> %mask, <vscale x 4 x float> zeroinitializer)
+ call void @llvm.masked.store.nxv4f32.p0(<vscale x 4 x float> %load.0.16, ptr nonnull %gep.arr.16, i32 1, <vscale x 4 x i1> %mask)
+
+ %load.0.30 = tail call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0(ptr nonnull %gep.0.30, i32 1, <vscale x 4 x i1> %mask, <vscale x 4 x float> zeroinitializer)
+ call void @llvm.masked.store.nxv4f32.p0(<vscale x 4 x float> %load.0.30, ptr nonnull %gep.arr.30, i32 1, <vscale x 4 x i1> %mask)
+
+ %load.0.48 = tail call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0(ptr nonnull %gep.0.48, i32 1, <vscale x 4 x i1> %mask, <vscale x 4 x float> zeroinitializer)
+ call void @llvm.masked.store.nxv4f32.p0(<vscale x 4 x float> %load.0.48, ptr nonnull %gep.arr.48, i32 1, <vscale x 4 x i1> %mask)
+
+ %faddop0 = call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0(ptr nonnull %gep.arr.16, i32 1, <vscale x 4 x i1> %mask, <vscale x 4 x float> zeroinitializer)
+ %faddop1 = call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0(ptr nonnull %gep.arr.48, i32 1, <vscale x 4 x i1> %mask, <vscale x 4 x float> zeroinitializer)
+ %fadd = fadd <vscale x 4 x float> %faddop0, %faddop1
+
+ ret <vscale x 4 x float> %fadd
+}
+
+define <vscale x 4 x float> @dead_scalable_store_small_mask(ptr %0) {
+; CHECK-LABEL: define <vscale x 4 x float> @dead_scalable_store_small_mask(
+; CHECK: call void @llvm.masked.store.nxv4f32.p0(<vscale x 4 x float> %load.0.16, ptr nonnull %gep.arr.16, i32 1, <vscale x 4 x i1> %mask)
+; CHECK-NOT: call void @llvm.masked.store.nxv4f32.p0(<vscale x 4 x float> %load.0.30, ptr nonnull %gep.arr.30, i32 1, <vscale x 4 x i1> %mask)
+; CHECK: call void @llvm.masked.store.nxv4f32.p0(<vscale x 4 x float> %load.0.46, ptr nonnull %gep.arr.46, i32 1, <vscale x 4 x i1> %mask)
+ %arr = alloca [64 x i32], align 4
+ %mask = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i32(i32 0, i32 4)
+
+ %gep.0.16 = getelementptr inbounds nuw i8, ptr %0, i64 16
+ %gep.0.30 = getelementptr inbounds nuw i8, ptr %0, i64 30
+ %gep.0.46 = getelementptr inbounds nuw i8, ptr %0, i64 46
+ %gep.arr.16 = getelementptr inbounds nuw i8, ptr %arr, i64 16
+ %gep.arr.30 = getelementptr inbounds nuw i8, ptr %arr, i64 30
+ %gep.arr.46 = getelementptr inbounds nuw i8, ptr %arr, i64 46
+
+ %load.0.16 = tail call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0(ptr nonnull %gep.0.16, i32 1, <vscale x 4 x i1> %mask, <vscale x 4 x float> zeroinitializer)
+ call void @llvm.masked.store.nxv4f32.p0(<vscale x 4 x float> %load.0.16, ptr nonnull %gep.arr.16, i32 1, <vscale x 4 x i1> %mask)
+
+ %load.0.30 = tail call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0(ptr nonnull %gep.0.30, i32 1, <vscale x 4 x i1> %mask, <vscale x 4 x float> zeroinitializer)
+ call void @llvm.masked.store.nxv4f32.p0(<vscale x 4 x float> %load.0.30, ptr nonnull %gep.arr.30, i32 1, <vscale x 4 x i1> %mask)
+
+ %load.0.46 = tail call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0(ptr nonnull %gep.0.46, i32 1, <vscale x 4 x i1> %mask, <vscale x 4 x float> zeroinitializer)
+ call void @llvm.masked.store.nxv4f32.p0(<vscale x 4 x float> %load.0.46, ptr nonnull %gep.arr.46, i32 1, <vscale x 4 x i1> %mask)
+
+ %smallmask = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.32(i32 0, i32 2)
+ %faddop0 = call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0(ptr nonnull %gep.arr.16, i32 1, <vscale x 4 x i1> %smallmask, <vscale x 4 x float> zeroinitializer)
+ %faddop1 = call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0(ptr nonnull %gep.arr.46, i32 1, <vscale x 4 x i1> %mask, <vscale x 4 x float> zeroinitializer)
+ %fadd = fadd <vscale x 4 x float> %faddop0, %faddop1
+
+ ret <vscale x 4 x float> %fadd
+}
+
+define <vscale x 4 x float> @dead_scalar_store(ptr noalias %0, ptr %1) {
+; CHECK-LABEL: define <vscale x 4 x float> @dead_scalar_store(
+; CHECK-NOT: store i32 20, ptr %gep.1.12
+;
+ %mask = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i32(i32 0, i32 4)
+ %gep.1.12 = getelementptr inbounds nuw i8, ptr %1, i64 12
+ store i32 20, ptr %gep.1.12
+
+ %load.0 = call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0(ptr nonnull %0, i32 1, <vscale x 4 x i1> %mask, <vscale x 4 x float> zeroinitializer)
+ call void @llvm.masked.store.nxv4f32.p0(<vscale x 4 x float> %load.0, ptr nonnull %1, i32 1, <vscale x 4 x i1> %mask)
+ %retval = call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0(ptr nonnull %1, i32 1, <vscale x 4 x i1> %mask, <vscale x 4 x float> zeroinitializer)
+ ret <vscale x 4 x float> %retval
+}
+
+; We don't know if the scalar store is dead as we can't determine vscale.
+; This get active lane mask may cover 4 or 8 integers
+define <vscale x 4 x float> @mask_gt_minimum_num_elts(ptr noalias %0, ptr %1) {
+; CHECK-LABEL: define <vscale x 4 x float> @mask_gt_minimum_num_elts(
+; CHECK: store i32 20, ptr %gep.1.28
+;
+ %mask = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i32(i32 0, i32 8)
+ %gep.1.28 = getelementptr inbounds nuw i8, ptr %1, i64 28
+ store i32 20, ptr %gep.1.28
+
+ %load.0 = tail call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0(ptr nonnull %0, i32 1, <vscale x 4 x i1> %mask, <vscale x 4 x float> zeroinitializer)
+ call void @llvm.masked.store.nxv4f32.p0(<vscale x 4 x float> %load.0, ptr nonnull %1, i32 1, <vscale x 4 x i1> %mask)
+ %retval = call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0(ptr nonnull %1, i32 1, <vscale x 4 x i1> %mask, <vscale x 4 x float> zeroinitializer)
+ ret <vscale x 4 x float> %retval
+}
+
+define <vscale x 16 x i8> @scalar_stores_small_mask(ptr noalias %0, ptr %1) {
+; CHECK-LABEL: define <vscale x 16 x i8> @scalar_stores_small_mask(
+; CHECK-NOT: store i8 60, ptr %gep.1.6
+; CHECK: store i8 120, ptr %gep.1.8
+;
+ %mask = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i8.i32(i32 0, i32 7)
+ %gep.1.6 = getelementptr inbounds nuw i8, ptr %1, i64 6
+ store i8 60, ptr %gep.1.6
+ %gep.1.8 = getelementptr inbounds nuw i8, ptr %1, i64 8
+ store i8 120, ptr %gep.1.8
+
+ %load.0 = tail call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr nonnull %0, i32 1, <vscale x 16 x i1> %mask, <vscale x 16 x i8> zeroinitializer)
+ call void @llvm.masked.store.nxv16i8.p0(<vscale x 16 x i8> %load.0, ptr %1, i32 1, <vscale x 16 x i1> %mask)
+ %retval = call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr %1, i32 1, <vscale x 16 x i1> %mask, <vscale x 16 x i8> zeroinitializer)
+ ret <vscale x 16 x i8> %retval
+}
llvm/lib/Analysis/MemoryLocation.cpp
Outdated
static std::optional<FixedVectorType *>
getFixedTypeFromScalableMemOp(Value *Mask, Type *Ty) {
  auto ActiveLaneMask = dyn_cast<IntrinsicInst>(Mask);
Suggested change:
-  auto ActiveLaneMask = dyn_cast<IntrinsicInst>(Mask);
+  auto *ActiveLaneMask = dyn_cast<IntrinsicInst>(Mask);
etc
llvm/lib/Analysis/MemoryLocation.cpp
Outdated
  if (!LaneMaskLo || !LaneMaskHi)
    return std::nullopt;

  uint64_t NumElts = LaneMaskHi->getZExtValue() - LaneMaskLo->getZExtValue();
Shouldn't this just use LaneMaskHi, without subtracting LaneMaskLo?
NumElts is the number of active elements, but the memory location is relative to the original base pointer, so inactive elements at the start need to be counted as well.
It looks like you are missing tests for a non-zero base index.
> inactive elements at the start need to be counted as well.

From the LangRef for get.active.lane.mask:

  %m[i] = icmp ult (%base + i), %n

From this definition, it's not possible for the first N elements to be 0 and then be followed by 1s: the returned mask either begins with a 1, is all-false, or is poison. If the range between %base and %n is 3, the new mask size would be 3. I added the test dead_scalar_store_offset for this scenario.
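To make that prefix property concrete, here is a small standalone sketch of the reference semantics quoted above (the helper name is made up for illustration; it is not part of the patch or of LLVM):

  #include <cstdint>
  #include <vector>

  // Lane i is active iff (Base + i) < N, i.e. %m[i] = icmp ult (%base + i), %n.
  // With constant operands the active lanes therefore always form a leading
  // prefix of N - Base lanes when Base < N, and the mask is all-false otherwise.
  std::vector<bool> referenceLaneMask(uint64_t Base, uint64_t N,
                                      unsigned NumLanes) {
    std::vector<bool> Mask(NumLanes);
    for (unsigned I = 0; I < NumLanes; ++I)
      Mask[I] = (Base + I) < N;
    return Mask;
  }

  // e.g. referenceLaneMask(2, 5, 4) == {1, 1, 1, 0}: even with a non-zero base
  // there are never inactive lanes followed by active ones, and the active
  // range is N - Base = 3 elements starting at the access's base pointer.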
llvm/lib/Analysis/MemoryLocation.cpp
Outdated
  Type *Ty = II->getType();
  auto KnownScalableSize =
      getFixedTypeFromScalableMemOp(II->getOperand(2), Ty);
  if (KnownScalableSize)
You could do:
  if (auto KnownScalableSize = getFixedTypeFromScalableMemOp(II->getOperand(2), Ty))
    Ty = *KnownScalableSize;
  return MemoryLocation(Arg, DL.getTypeStoreSize(Ty), AATags);
and similarly for the store below.
llvm/lib/Analysis/MemoryLocation.cpp
Outdated
static std::optional<FixedVectorType *>
getFixedTypeFromScalableMemOp(Value *Mask, Type *Ty) {
  auto ActiveLaneMask = dyn_cast<IntrinsicInst>(Mask);
I think you can rewrite this and the code to get the lo and hi values below by using match, i.e.
  auto *ScalableTy = dyn_cast<ScalableVectorType>(Ty);
  if (!ScalableTy)
    return std::nullopt;

  ConstantInt *LaneMaskLo, *LaneMaskHi;
  if (!match(Mask, m_Intrinsic<Intrinsic::get_active_lane_mask>(
                       m_ConstantInt(LaneMaskLo), m_ConstantInt(LaneMaskHi))))
    return std::nullopt;

  uint64_t NumElts = LaneMaskHi->getZExtValue() - LaneMaskLo->getZExtValue();
  ...
I avoided this because I was worried about compile time. This file doesn't currently include any PatternMatch headers, and it's quite low-level and small. I haven't tested it though, so it might not be a big deal.
I've added matching logic, it's simple to remove if it does hamper compile time.
llvm/lib/Analysis/MemoryLocation.cpp
Outdated
  if (!LaneMaskLo || !LaneMaskHi)
    return std::nullopt;

  uint64_t NumElts = LaneMaskHi->getZExtValue() - LaneMaskLo->getZExtValue();
I think you need to be careful with logic like this because it's possible for hi to be lower than lo. I think you need an extra check like:

  if (LaneMaskHi <= LaneMaskLo)
    return std::nullopt;

If the mask would return all-false then essentially this operation doesn't touch memory at all, although I don't know if we have to worry about that.
From the LangRef:

  The '``llvm.get.active.lane.mask.*``' intrinsics are semantically equivalent
  to:

    %m[i] = icmp ult (%base + i), %n
I've added bail outs for (LaneMaskHi <= LaneMaskLo) and (LaneMaskHi == 0)
llvm/lib/Analysis/MemoryLocation.cpp
Outdated
      ActiveLaneMask->getIntrinsicID() != Intrinsic::get_active_lane_mask)
    return std::nullopt;

  auto ScalableTy = dyn_cast<ScalableVectorType>(Ty);
I think we can probably remove this restriction as the mask works equally well for fixed-width vectors too.
I've reworked it to work with fixed vectors too
llvm/lib/Analysis/MemoryLocation.cpp
Outdated
  auto LaneMaskLo = dyn_cast<ConstantInt>(ActiveLaneMask->getOperand(0));
  auto LaneMaskHi = dyn_cast<ConstantInt>(ActiveLaneMask->getOperand(1));
  if (!LaneMaskLo || !LaneMaskHi)
Until we have clarity on #152140 I think we should explicitly bail out if LaneMaskHi is 0 for now.
Use patternmatch logic
Add pointer tokens to auto declarations
Add offset test dead_scalar_store_offset
david-arm left a comment:
Thanks for addressing all the comments so far @MDevereau! I have a few more ...
  %load.1.16 = tail call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0(ptr nonnull %gep.1.16, i32 1, <vscale x 4 x i1> %mask, <vscale x 4 x float> zeroinitializer)
  call void @llvm.masked.store.nxv4f32.p0(<vscale x 4 x float> %load.1.16, ptr nonnull %gep.arr.16, i32 1, <vscale x 4 x i1> %mask)

  %load.1.32 = tail call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0(ptr nonnull %gep.1.32, i32 1, <vscale x 4 x i1> %mask, <vscale x 4 x float> zeroinitializer)
Is it safe to add tail to memory intrinsic calls? Might be good to remove from all masked load/store intrinsics.
Done.
@@ -0,0 +1,196 @@
; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
Can you remove this NOTE line please? It looks like the CHECK lines below have been edited so they're not really autogenerated.
Done
; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
; RUN: opt < %s -aa-pipeline=basic-aa -passes=dse -S | FileCheck %s

define <vscale x 4 x float> @dead_scalable_store(i32 %0, ptr %1) {
Looks like %0 is unused and can be removed?
Done
; We don't know if the scalar store is dead as we can't determine vscale.
; This get active lane mask may cover 4 or 8 integers
define <vscale x 4 x float> @mask_gt_minimum_num_elts(ptr noalias %0, ptr %1) {
Is it worth having a variant of this where the scalar store is to a lower address, i.e.
%gep.1.12 = getelementptr inbounds nuw i8, ptr %1, i64 12
I can imagine in future the MemoryLocation class might be able to handle the concept of a memory range, which in this case is equivalent to a range of 16-32 bytes. In theory, we should still be able to remove this scalar store, but that's probably a much bigger piece of work. The same applies to functions with a vscale_range attribute.
I've added a store and check for byte 12 to this test.
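For reference, the arithmetic behind the 16-32 byte range mentioned above, using the values from this test (a standalone sketch, not code from the patch):

  #include <cstdint>

  // <vscale x 4 x float> store under get.active.lane.mask(0, 8):
  // lower bound: vscale >= 1, so at least 4 lanes of 4 bytes are written;
  // upper bound: the mask activates at most 8 lanes, so at most 32 bytes.
  constexpr uint64_t EltBytes = 4;                    // sizeof(float)
  constexpr uint64_t MinLanes = 4;                    // known minimum lane count
  constexpr uint64_t MaskLanes = 8;                   // second lane-mask operand
  constexpr uint64_t MinBytes = MinLanes * EltBytes;  // 16: always overwritten
  constexpr uint64_t MaxBytes = MaskLanes * EltBytes; // 32: never exceeded
  static_assert(MinBytes == 16 && MaxBytes == 32,
                "a scalar store below offset 16 is always covered, while one "
                "at offset 28 may or may not be");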
  ret <vscale x 4 x float> %retval
}

; Don't do anything if the 2nd Op of get active lane mask is 0. This currently generates poison
nit: Might be worth phrasing this a bit differently, i.e.
; TODO: Improve this once we have clarity in the LangRef for get.active.lane.mask
; regarding the expected behaviour when the second operand is 0.
What do you think?
Since the args for get.active.lane.mask are unsigned, I've removed this test and the LaneMaskHi == 0 check; the LaneMaskHi <= LaneMaskLo bail-out also covers that case.
; CHECK-LABEL: define <vscale x 4 x float> @mask_hi_0(
; CHECK: store i32 20, ptr %1
;
  %mask = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i32(i32 0, i32 0)
I'm not sure how useful this test is because if we did treat (0,0) as returning an all-false mask (instead of poison according to the LangRef), then you'd expect the masked store below to be a nop, since it doesn't write to anything. In this case I'd still expect the scalar store to remain as before. Perhaps a test like this?
define <4 x i32> @mask_hi_0(ptr %0) {
  %mask = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i32(i32 0, i32 0)
  call void @llvm.masked.store.nxv4i32.p0(<vscale x 4 x i32> splat(i32 1), ptr nonnull %0, i32 1, <vscale x 4 x i1> %mask)
  store i8 3, ptr %0
  store i32 20, ptr %0
  %retval = load <4 x i32>, ptr %0, align 1
  ret <4 x i32> %retval
}
You can see that the store i8 3, ptr %0 gets deleted, but the masked store remains. We should be able to kill off the masked store as well.
Incidentally, the pass doesn't deal with fixed-width constant masks either, such as <i32 1, i32 1, i32 0, i32 0> or zeroinitializer, but perhaps that can be done in a different patch?
We discussed this privately and decided that for the (0,0) case, we cannot confirm it as a no-op because it will return poison. We also decided that a case such as (3,3) could be lowered to a no-op by a later patch if any interest in the case arises.
llvm/lib/Analysis/MemoryLocation.cpp
Outdated
    return std::nullopt;

  uint64_t NumElts = LaneMaskHi - LaneMaskLo;
  if (NumElts > Ty->getElementCount().getKnownMinValue())
I think we only need to bail out if it's a scalable vector, e.g. we can support get.active.lane.mask with arguments (0,8) when the vector type is <4 x i32>. The maximum permitted by the type is 4.
If you change this code, would be good to have a fixed-width test showing it works. Thanks!
I've added a clamp for the fixed width. I've added the test dead_scalable_store_fixed_large_mask to assert it.
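A rough sketch of the clamping rule being discussed here (illustrative names and shape only, not the exact code that went into the patch):

  #include <algorithm>
  #include <cstdint>
  #include <optional>

  // A fixed-width vector can never be accessed beyond its own element count,
  // so an over-long mask such as get.active.lane.mask(0, 8) on <4 x i32> is
  // simply clamped to 4 lanes. A scalable vector cannot be clamped, because a
  // larger vscale may make the extra mask lanes real.
  std::optional<uint64_t> clampedMaskLanes(uint64_t MaskLanes,
                                           uint64_t KnownMinLanes,
                                           bool IsScalable) {
    if (IsScalable && MaskLanes > KnownMinLanes)
      return std::nullopt;                     // size depends on vscale
    return std::min(MaskLanes, KnownMinLanes); // fixed width: clamp
  }

  // e.g. clampedMaskLanes(8, 4, /*IsScalable=*/false) == 4, which is the case
  // exercised by the dead_scalable_store_fixed_large_mask test mentioned above.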
}

; Don't do anything if the 2nd Op is gt/eq the 1st
define <vscale x 4 x float> @active_lane_mask_gt_eq(ptr noalias %0, ptr %1) {
The comment above for @mask_hi_0 applies here too I think.
I've split this out into active_lane_mask_eq and active_lane_mask_lt
  %gep.1.6 = getelementptr inbounds nuw i8, ptr %1, i64 6
  store i8 60, ptr %gep.1.6
  %gep.1.8 = getelementptr inbounds nuw i8, ptr %1, i64 8
  store i8 120, ptr %gep.1.8
nit: whitespace before ptr
Done
llvm/lib/Analysis/MemoryLocation.cpp
Outdated
  if (NumElts > Ty->getElementCount().getKnownMinValue())
    return std::nullopt;

  return FixedVectorType::get(Ty->getElementType(), NumElts);
Would be good to have at least one test for fixed-width masked stores as well. Maybe just a variant of @dead_scalable_store?
I've added the test dead_scalable_store_fixed
Use APInt maths
Refine tests
david-arm left a comment:
LGTM!
          LocationSize::upperBound(
              DL.getTypeStoreSize(II->getArgOperand(0)->getType())),
          AATags);
      Arg, LocationSize::upperBound(DL.getTypeStoreSize(Ty)), AATags);
Not for this PR, but upperBound is currently extremely pessimistic for scalable vectors because for unknown pointers the upper bound is just the top of the address space. In future, we could improve the upper bound by taking vscale_range into account, or just add the ScalableBit to ImpreciseBit in the LocationSize object.
For normal stores we call this function instead:
  static LocationSize precise(TypeSize Value) {
    return LocationSize(Value.getKnownMinValue(), Value.isScalable());
  }