
Conversation

@SamTebbs33
Collaborator

@SamTebbs33 SamTebbs33 commented Jul 25, 2024

When vectorising a loop that uses loads and stores, those pointers with unknown bounds could overlap if their difference is less than the vector factor. For example, if address 20 is being stored to and address 23 is being loaded from, they overlap when the vector factor is 4 or higher. Currently LoopVectorize branches to a scalar loop in these cases with a runtime check. However, if we construct a mask that disables the overlapping (aliasing) lanes, then the vectorised loop can be safely entered, as long as the loads and stores are masked off.

This PR modifies the LoopVectorizer and VPlan to create such a mask and branch to the vector loop if the mask has active lanes.
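
To make the shape of the emitted IR concrete, here is a simplified, self-contained sketch of the alias-mask construction, distilled from the autogenerated test checks quoted further down (byte-sized elements, a VF of vscale x 16; the value and function names are illustrative rather than taken verbatim from the patch):

declare <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64, i64)

; Lane i of the result is enabled when i < (%load.addr - %store.addr), or for
; every lane when that difference is negative; overlapping (aliasing) lanes
; come back disabled so they can be masked off in the loop body.
define <vscale x 16 x i1> @alias_lane_mask(i64 %load.addr, i64 %store.addr) {
entry:
  %sub.diff = sub i64 %load.addr, %store.addr
  %neg.cmp = icmp slt i64 %sub.diff, 0
  %splat.ins = insertelement <vscale x 16 x i1> poison, i1 %neg.cmp, i64 0
  %neg.splat = shufflevector <vscale x 16 x i1> %splat.ins, <vscale x 16 x i1> poison, <vscale x 16 x i32> zeroinitializer
  %diff.mask = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 0, i64 %sub.diff)
  %alias.mask = or <vscale x 16 x i1> %diff.mask, %neg.splat
  ret <vscale x 16 x i1> %alias.mask
}

In the vector body this mask is ANDed with the loop's active lane mask and used as the mask operand of the widened loads and stores, and the induction variable is advanced by the number of enabled lanes (a popcount of the mask) rather than by the full VF.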

@llvmbot llvmbot added the clang, vectorizers, llvm:analysis and llvm:transforms labels Jul 25, 2024
@llvmbot
Member

llvmbot commented Jul 25, 2024

@llvm/pr-subscribers-llvm-ir
@llvm/pr-subscribers-backend-aarch64
@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-llvm-analysis

Author: Sam Tebbs (SamTebbs33)

Changes

When vectorising a loop that uses loads and stores, those pointers with unknown bounds could overlap if their difference is less than the vector factor. For example, if address 20 is being stored to and address 23 is being loaded from, they overlap when the vector factor is 4 or higher. Currently LoopVectorize branches to a scalar loop in these cases with a runtime check. However, if we construct a mask that disables the overlapping (aliasing) lanes, then the vectorised loop can be safely entered, as long as the loads and stores are masked off.

This PR modifies the LoopVectorizer and VPlan to create such a mask and always branch to the vector loop. Currently this is only done if we're tail-predicating, but more work will come in the future to do this in other cases as well.


Patch is 103.32 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/100579.diff

11 Files Affected:

  • (added) clang/test/CodeGen/loop-alias-mask.c (+404)
  • (modified) llvm/include/llvm/Analysis/LoopAccessAnalysis.h (+15-1)
  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h (+21-3)
  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+112-31)
  • (modified) llvm/lib/Transforms/Vectorize/VPlan.h (+2)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp (+90-1)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp (+21-10)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanTransforms.h (+4-1)
  • (added) llvm/test/Transforms/LoopVectorize/AArch64/whilewr-opt.ll (+369)
  • (modified) llvm/test/Transforms/LoopVectorize/runtime-check-small-clamped-bounds.ll (+11-11)
  • (modified) llvm/test/Transforms/LoopVectorize/runtime-checks-difference.ll (+53-49)
diff --git a/clang/test/CodeGen/loop-alias-mask.c b/clang/test/CodeGen/loop-alias-mask.c
new file mode 100644
index 0000000000000..76c3b5deddfa0
--- /dev/null
+++ b/clang/test/CodeGen/loop-alias-mask.c
@@ -0,0 +1,404 @@
+// NOTE: Assertions have been autogenerated by utils/update_cc_test_checks.py UTC_ARGS: --version 4
+// RUN: %clang --target=aarch64-linux-gnu -march=armv9+sme2 -emit-llvm -S -g0 -O3 -mllvm -prefer-predicate-over-epilogue=predicate-dont-vectorize %s -o - | FileCheck %s
+#include <stdint.h>
+
+// CHECK-LABEL: define dso_local void @alias_mask_8(
+// CHECK-SAME: ptr noalias nocapture noundef readonly [[A:%.*]], ptr nocapture noundef readonly [[B:%.*]], ptr nocapture noundef writeonly [[C:%.*]], i32 noundef [[N:%.*]]) local_unnamed_addr #[[ATTR0:[0-9]+]] {
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    [[CMP11:%.*]] = icmp sgt i32 [[N]], 0
+// CHECK-NEXT:    br i1 [[CMP11]], label [[FOR_BODY_PREHEADER:%.*]], label [[FOR_COND_CLEANUP:%.*]]
+// CHECK:       for.body.preheader:
+// CHECK-NEXT:    [[C14:%.*]] = ptrtoint ptr [[C]] to i64
+// CHECK-NEXT:    [[B15:%.*]] = ptrtoint ptr [[B]] to i64
+// CHECK-NEXT:    [[WIDE_TRIP_COUNT:%.*]] = zext nneg i32 [[N]] to i64
+// CHECK-NEXT:    [[SUB_DIFF:%.*]] = sub i64 [[B15]], [[C14]]
+// CHECK-NEXT:    [[NEG_COMPARE:%.*]] = icmp slt i64 [[SUB_DIFF]], 0
+// CHECK-NEXT:    [[DOTSPLATINSERT:%.*]] = insertelement <vscale x 16 x i1> poison, i1 [[NEG_COMPARE]], i64 0
+// CHECK-NEXT:    [[DOTSPLAT:%.*]] = shufflevector <vscale x 16 x i1> [[DOTSPLATINSERT]], <vscale x 16 x i1> poison, <vscale x 16 x i32> zeroinitializer
+// CHECK-NEXT:    [[PTR_DIFF_LANE_MASK:%.*]] = tail call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 0, i64 [[SUB_DIFF]])
+// CHECK-NEXT:    [[ACTIVE_LANE_MASK_ALIAS:%.*]] = or <vscale x 16 x i1> [[PTR_DIFF_LANE_MASK]], [[DOTSPLAT]]
+// CHECK-NEXT:    [[ACTIVE_LANE_MASK_ENTRY:%.*]] = tail call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 0, i64 [[WIDE_TRIP_COUNT]])
+// CHECK-NEXT:    [[TMP0:%.*]] = zext <vscale x 16 x i1> [[ACTIVE_LANE_MASK_ALIAS]] to <vscale x 16 x i8>
+// CHECK-NEXT:    [[TMP1:%.*]] = tail call i8 @llvm.vector.reduce.add.nxv16i8(<vscale x 16 x i8> [[TMP0]])
+// CHECK-NEXT:    [[TMP2:%.*]] = zext i8 [[TMP1]] to i64
+// CHECK-NEXT:    br label [[VECTOR_BODY:%.*]]
+// CHECK:       vector.body:
+// CHECK-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[FOR_BODY_PREHEADER]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
+// CHECK-NEXT:    [[ACTIVE_LANE_MASK:%.*]] = phi <vscale x 16 x i1> [ [[ACTIVE_LANE_MASK_ENTRY]], [[FOR_BODY_PREHEADER]] ], [ [[ACTIVE_LANE_MASK_NEXT:%.*]], [[VECTOR_BODY]] ]
+// CHECK-NEXT:    [[TMP3:%.*]] = and <vscale x 16 x i1> [[ACTIVE_LANE_MASK]], [[ACTIVE_LANE_MASK_ALIAS]]
+// CHECK-NEXT:    [[TMP4:%.*]] = getelementptr inbounds i8, ptr [[A]], i64 [[INDEX]]
+// CHECK-NEXT:    [[WIDE_MASKED_LOAD:%.*]] = tail call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr [[TMP4]], i32 1, <vscale x 16 x i1> [[TMP3]], <vscale x 16 x i8> poison), !tbaa [[TBAA6:![0-9]+]]
+// CHECK-NEXT:    [[TMP5:%.*]] = getelementptr inbounds i8, ptr [[B]], i64 [[INDEX]]
+// CHECK-NEXT:    [[WIDE_MASKED_LOAD16:%.*]] = tail call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr [[TMP5]], i32 1, <vscale x 16 x i1> [[TMP3]], <vscale x 16 x i8> poison), !tbaa [[TBAA6]]
+// CHECK-NEXT:    [[TMP6:%.*]] = add <vscale x 16 x i8> [[WIDE_MASKED_LOAD16]], [[WIDE_MASKED_LOAD]]
+// CHECK-NEXT:    [[TMP7:%.*]] = getelementptr inbounds i8, ptr [[C]], i64 [[INDEX]]
+// CHECK-NEXT:    tail call void @llvm.masked.store.nxv16i8.p0(<vscale x 16 x i8> [[TMP6]], ptr [[TMP7]], i32 1, <vscale x 16 x i1> [[TMP3]]), !tbaa [[TBAA6]]
+// CHECK-NEXT:    [[INDEX_NEXT]] = add i64 [[INDEX]], [[TMP2]]
+// CHECK-NEXT:    [[ACTIVE_LANE_MASK_NEXT]] = tail call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 [[INDEX_NEXT]], i64 [[WIDE_TRIP_COUNT]])
+// CHECK-NEXT:    [[TMP8:%.*]] = extractelement <vscale x 16 x i1> [[ACTIVE_LANE_MASK_NEXT]], i64 0
+// CHECK-NEXT:    br i1 [[TMP8]], label [[VECTOR_BODY]], label [[FOR_COND_CLEANUP]], !llvm.loop [[LOOP9:![0-9]+]]
+// CHECK:       for.cond.cleanup:
+// CHECK-NEXT:    ret void
+//
+void alias_mask_8(uint8_t *restrict a, uint8_t * b, uint8_t * c, int n) {
+  #pragma clang loop vectorize(enable)
+  for (int i = 0; i < n; i++) {
+    c[i] = a[i] + b[i];
+  }
+}
+
+// CHECK-LABEL: define dso_local void @alias_mask_16(
+// CHECK-SAME: ptr noalias nocapture noundef readonly [[A:%.*]], ptr nocapture noundef readonly [[B:%.*]], ptr nocapture noundef writeonly [[C:%.*]], i32 noundef [[N:%.*]]) local_unnamed_addr #[[ATTR0]] {
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    [[CMP11:%.*]] = icmp sgt i32 [[N]], 0
+// CHECK-NEXT:    br i1 [[CMP11]], label [[FOR_BODY_PREHEADER:%.*]], label [[FOR_COND_CLEANUP:%.*]]
+// CHECK:       for.body.preheader:
+// CHECK-NEXT:    [[C14:%.*]] = ptrtoint ptr [[C]] to i64
+// CHECK-NEXT:    [[B15:%.*]] = ptrtoint ptr [[B]] to i64
+// CHECK-NEXT:    [[WIDE_TRIP_COUNT:%.*]] = zext nneg i32 [[N]] to i64
+// CHECK-NEXT:    [[SUB_DIFF:%.*]] = sub i64 [[B15]], [[C14]]
+// CHECK-NEXT:    [[DIFF:%.*]] = sdiv i64 [[SUB_DIFF]], 2
+// CHECK-NEXT:    [[NEG_COMPARE:%.*]] = icmp slt i64 [[SUB_DIFF]], -1
+// CHECK-NEXT:    [[DOTSPLATINSERT:%.*]] = insertelement <vscale x 8 x i1> poison, i1 [[NEG_COMPARE]], i64 0
+// CHECK-NEXT:    [[DOTSPLAT:%.*]] = shufflevector <vscale x 8 x i1> [[DOTSPLATINSERT]], <vscale x 8 x i1> poison, <vscale x 8 x i32> zeroinitializer
+// CHECK-NEXT:    [[PTR_DIFF_LANE_MASK:%.*]] = tail call <vscale x 8 x i1> @llvm.get.active.lane.mask.nxv8i1.i64(i64 0, i64 [[DIFF]])
+// CHECK-NEXT:    [[ACTIVE_LANE_MASK_ALIAS:%.*]] = or <vscale x 8 x i1> [[PTR_DIFF_LANE_MASK]], [[DOTSPLAT]]
+// CHECK-NEXT:    [[ACTIVE_LANE_MASK_ENTRY:%.*]] = tail call <vscale x 8 x i1> @llvm.get.active.lane.mask.nxv8i1.i64(i64 0, i64 [[WIDE_TRIP_COUNT]])
+// CHECK-NEXT:    [[TMP0:%.*]] = zext <vscale x 8 x i1> [[ACTIVE_LANE_MASK_ALIAS]] to <vscale x 8 x i8>
+// CHECK-NEXT:    [[TMP1:%.*]] = tail call i8 @llvm.vector.reduce.add.nxv8i8(<vscale x 8 x i8> [[TMP0]])
+// CHECK-NEXT:    [[TMP2:%.*]] = zext i8 [[TMP1]] to i64
+// CHECK-NEXT:    br label [[VECTOR_BODY:%.*]]
+// CHECK:       vector.body:
+// CHECK-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[FOR_BODY_PREHEADER]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
+// CHECK-NEXT:    [[ACTIVE_LANE_MASK:%.*]] = phi <vscale x 8 x i1> [ [[ACTIVE_LANE_MASK_ENTRY]], [[FOR_BODY_PREHEADER]] ], [ [[ACTIVE_LANE_MASK_NEXT:%.*]], [[VECTOR_BODY]] ]
+// CHECK-NEXT:    [[TMP3:%.*]] = and <vscale x 8 x i1> [[ACTIVE_LANE_MASK]], [[ACTIVE_LANE_MASK_ALIAS]]
+// CHECK-NEXT:    [[TMP4:%.*]] = getelementptr inbounds i16, ptr [[A]], i64 [[INDEX]]
+// CHECK-NEXT:    [[WIDE_MASKED_LOAD:%.*]] = tail call <vscale x 8 x i16> @llvm.masked.load.nxv8i16.p0(ptr [[TMP4]], i32 2, <vscale x 8 x i1> [[TMP3]], <vscale x 8 x i16> poison), !tbaa [[TBAA13:![0-9]+]]
+// CHECK-NEXT:    [[TMP5:%.*]] = getelementptr inbounds i16, ptr [[B]], i64 [[INDEX]]
+// CHECK-NEXT:    [[WIDE_MASKED_LOAD16:%.*]] = tail call <vscale x 8 x i16> @llvm.masked.load.nxv8i16.p0(ptr [[TMP5]], i32 2, <vscale x 8 x i1> [[TMP3]], <vscale x 8 x i16> poison), !tbaa [[TBAA13]]
+// CHECK-NEXT:    [[TMP6:%.*]] = add <vscale x 8 x i16> [[WIDE_MASKED_LOAD16]], [[WIDE_MASKED_LOAD]]
+// CHECK-NEXT:    [[TMP7:%.*]] = getelementptr inbounds i16, ptr [[C]], i64 [[INDEX]]
+// CHECK-NEXT:    tail call void @llvm.masked.store.nxv8i16.p0(<vscale x 8 x i16> [[TMP6]], ptr [[TMP7]], i32 2, <vscale x 8 x i1> [[TMP3]]), !tbaa [[TBAA13]]
+// CHECK-NEXT:    [[INDEX_NEXT]] = add i64 [[INDEX]], [[TMP2]]
+// CHECK-NEXT:    [[ACTIVE_LANE_MASK_NEXT]] = tail call <vscale x 8 x i1> @llvm.get.active.lane.mask.nxv8i1.i64(i64 [[INDEX_NEXT]], i64 [[WIDE_TRIP_COUNT]])
+// CHECK-NEXT:    [[TMP8:%.*]] = extractelement <vscale x 8 x i1> [[ACTIVE_LANE_MASK_NEXT]], i64 0
+// CHECK-NEXT:    br i1 [[TMP8]], label [[VECTOR_BODY]], label [[FOR_COND_CLEANUP]], !llvm.loop [[LOOP15:![0-9]+]]
+// CHECK:       for.cond.cleanup:
+// CHECK-NEXT:    ret void
+//
+void alias_mask_16(uint16_t *restrict a, uint16_t * b, uint16_t * c, int n) {
+  #pragma clang loop vectorize(enable)
+  for (int i = 0; i < n; i++) {
+    c[i] = a[i] + b[i];
+  }
+}
+
+// CHECK-LABEL: define dso_local void @alias_mask_32(
+// CHECK-SAME: ptr noalias nocapture noundef readonly [[A:%.*]], ptr nocapture noundef readonly [[B:%.*]], ptr nocapture noundef writeonly [[C:%.*]], i32 noundef [[N:%.*]]) local_unnamed_addr #[[ATTR0]] {
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    [[CMP9:%.*]] = icmp sgt i32 [[N]], 0
+// CHECK-NEXT:    br i1 [[CMP9]], label [[FOR_BODY_PREHEADER:%.*]], label [[FOR_COND_CLEANUP:%.*]]
+// CHECK:       for.body.preheader:
+// CHECK-NEXT:    [[C12:%.*]] = ptrtoint ptr [[C]] to i64
+// CHECK-NEXT:    [[B13:%.*]] = ptrtoint ptr [[B]] to i64
+// CHECK-NEXT:    [[WIDE_TRIP_COUNT:%.*]] = zext nneg i32 [[N]] to i64
+// CHECK-NEXT:    [[SUB_DIFF:%.*]] = sub i64 [[B13]], [[C12]]
+// CHECK-NEXT:    [[DIFF:%.*]] = sdiv i64 [[SUB_DIFF]], 4
+// CHECK-NEXT:    [[NEG_COMPARE:%.*]] = icmp slt i64 [[SUB_DIFF]], -3
+// CHECK-NEXT:    [[DOTSPLATINSERT:%.*]] = insertelement <vscale x 4 x i1> poison, i1 [[NEG_COMPARE]], i64 0
+// CHECK-NEXT:    [[DOTSPLAT:%.*]] = shufflevector <vscale x 4 x i1> [[DOTSPLATINSERT]], <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer
+// CHECK-NEXT:    [[PTR_DIFF_LANE_MASK:%.*]] = tail call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 0, i64 [[DIFF]])
+// CHECK-NEXT:    [[ACTIVE_LANE_MASK_ALIAS:%.*]] = or <vscale x 4 x i1> [[PTR_DIFF_LANE_MASK]], [[DOTSPLAT]]
+// CHECK-NEXT:    [[ACTIVE_LANE_MASK_ENTRY:%.*]] = tail call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 0, i64 [[WIDE_TRIP_COUNT]])
+// CHECK-NEXT:    [[TMP0:%.*]] = zext <vscale x 4 x i1> [[ACTIVE_LANE_MASK_ALIAS]] to <vscale x 4 x i8>
+// CHECK-NEXT:    [[TMP1:%.*]] = tail call i8 @llvm.vector.reduce.add.nxv4i8(<vscale x 4 x i8> [[TMP0]])
+// CHECK-NEXT:    [[TMP2:%.*]] = zext i8 [[TMP1]] to i64
+// CHECK-NEXT:    br label [[VECTOR_BODY:%.*]]
+// CHECK:       vector.body:
+// CHECK-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[FOR_BODY_PREHEADER]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
+// CHECK-NEXT:    [[ACTIVE_LANE_MASK:%.*]] = phi <vscale x 4 x i1> [ [[ACTIVE_LANE_MASK_ENTRY]], [[FOR_BODY_PREHEADER]] ], [ [[ACTIVE_LANE_MASK_NEXT:%.*]], [[VECTOR_BODY]] ]
+// CHECK-NEXT:    [[TMP3:%.*]] = and <vscale x 4 x i1> [[ACTIVE_LANE_MASK]], [[ACTIVE_LANE_MASK_ALIAS]]
+// CHECK-NEXT:    [[TMP4:%.*]] = getelementptr inbounds i32, ptr [[A]], i64 [[INDEX]]
+// CHECK-NEXT:    [[WIDE_MASKED_LOAD:%.*]] = tail call <vscale x 4 x i32> @llvm.masked.load.nxv4i32.p0(ptr [[TMP4]], i32 4, <vscale x 4 x i1> [[TMP3]], <vscale x 4 x i32> poison), !tbaa [[TBAA16:![0-9]+]]
+// CHECK-NEXT:    [[TMP5:%.*]] = getelementptr inbounds i32, ptr [[B]], i64 [[INDEX]]
+// CHECK-NEXT:    [[WIDE_MASKED_LOAD14:%.*]] = tail call <vscale x 4 x i32> @llvm.masked.load.nxv4i32.p0(ptr [[TMP5]], i32 4, <vscale x 4 x i1> [[TMP3]], <vscale x 4 x i32> poison), !tbaa [[TBAA16]]
+// CHECK-NEXT:    [[TMP6:%.*]] = add <vscale x 4 x i32> [[WIDE_MASKED_LOAD14]], [[WIDE_MASKED_LOAD]]
+// CHECK-NEXT:    [[TMP7:%.*]] = getelementptr inbounds i32, ptr [[C]], i64 [[INDEX]]
+// CHECK-NEXT:    tail call void @llvm.masked.store.nxv4i32.p0(<vscale x 4 x i32> [[TMP6]], ptr [[TMP7]], i32 4, <vscale x 4 x i1> [[TMP3]]), !tbaa [[TBAA16]]
+// CHECK-NEXT:    [[INDEX_NEXT]] = add i64 [[INDEX]], [[TMP2]]
+// CHECK-NEXT:    [[ACTIVE_LANE_MASK_NEXT]] = tail call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[INDEX_NEXT]], i64 [[WIDE_TRIP_COUNT]])
+// CHECK-NEXT:    [[TMP8:%.*]] = extractelement <vscale x 4 x i1> [[ACTIVE_LANE_MASK_NEXT]], i64 0
+// CHECK-NEXT:    br i1 [[TMP8]], label [[VECTOR_BODY]], label [[FOR_COND_CLEANUP]], !llvm.loop [[LOOP18:![0-9]+]]
+// CHECK:       for.cond.cleanup:
+// CHECK-NEXT:    ret void
+//
+void alias_mask_32(uint32_t *restrict a, uint32_t * b, uint32_t * c, int n) {
+  #pragma clang loop vectorize(enable)
+  for (int i = 0; i < n; i++) {
+    c[i] = a[i] + b[i];
+  }
+}
+
+// CHECK-LABEL: define dso_local void @alias_mask_64(
+// CHECK-SAME: ptr noalias nocapture noundef readonly [[A:%.*]], ptr nocapture noundef readonly [[B:%.*]], ptr nocapture noundef writeonly [[C:%.*]], i32 noundef [[N:%.*]]) local_unnamed_addr #[[ATTR0]] {
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    [[CMP9:%.*]] = icmp sgt i32 [[N]], 0
+// CHECK-NEXT:    br i1 [[CMP9]], label [[FOR_BODY_PREHEADER:%.*]], label [[FOR_COND_CLEANUP:%.*]]
+// CHECK:       for.body.preheader:
+// CHECK-NEXT:    [[C12:%.*]] = ptrtoint ptr [[C]] to i64
+// CHECK-NEXT:    [[B13:%.*]] = ptrtoint ptr [[B]] to i64
+// CHECK-NEXT:    [[WIDE_TRIP_COUNT:%.*]] = zext nneg i32 [[N]] to i64
+// CHECK-NEXT:    [[SUB_DIFF:%.*]] = sub i64 [[B13]], [[C12]]
+// CHECK-NEXT:    [[DIFF:%.*]] = sdiv i64 [[SUB_DIFF]], 8
+// CHECK-NEXT:    [[NEG_COMPARE:%.*]] = icmp slt i64 [[SUB_DIFF]], -7
+// CHECK-NEXT:    [[DOTSPLATINSERT:%.*]] = insertelement <vscale x 2 x i1> poison, i1 [[NEG_COMPARE]], i64 0
+// CHECK-NEXT:    [[DOTSPLAT:%.*]] = shufflevector <vscale x 2 x i1> [[DOTSPLATINSERT]], <vscale x 2 x i1> poison, <vscale x 2 x i32> zeroinitializer
+// CHECK-NEXT:    [[PTR_DIFF_LANE_MASK:%.*]] = tail call <vscale x 2 x i1> @llvm.get.active.lane.mask.nxv2i1.i64(i64 0, i64 [[DIFF]])
+// CHECK-NEXT:    [[ACTIVE_LANE_MASK_ALIAS:%.*]] = or <vscale x 2 x i1> [[PTR_DIFF_LANE_MASK]], [[DOTSPLAT]]
+// CHECK-NEXT:    [[ACTIVE_LANE_MASK_ENTRY:%.*]] = tail call <vscale x 2 x i1> @llvm.get.active.lane.mask.nxv2i1.i64(i64 0, i64 [[WIDE_TRIP_COUNT]])
+// CHECK-NEXT:    [[TMP0:%.*]] = zext <vscale x 2 x i1> [[ACTIVE_LANE_MASK_ALIAS]] to <vscale x 2 x i8>
+// CHECK-NEXT:    [[TMP1:%.*]] = tail call i8 @llvm.vector.reduce.add.nxv2i8(<vscale x 2 x i8> [[TMP0]])
+// CHECK-NEXT:    [[TMP2:%.*]] = zext i8 [[TMP1]] to i64
+// CHECK-NEXT:    br label [[VECTOR_BODY:%.*]]
+// CHECK:       vector.body:
+// CHECK-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[FOR_BODY_PREHEADER]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
+// CHECK-NEXT:    [[ACTIVE_LANE_MASK:%.*]] = phi <vscale x 2 x i1> [ [[ACTIVE_LANE_MASK_ENTRY]], [[FOR_BODY_PREHEADER]] ], [ [[ACTIVE_LANE_MASK_NEXT:%.*]], [[VECTOR_BODY]] ]
+// CHECK-NEXT:    [[TMP3:%.*]] = and <vscale x 2 x i1> [[ACTIVE_LANE_MASK]], [[ACTIVE_LANE_MASK_ALIAS]]
+// CHECK-NEXT:    [[TMP4:%.*]] = getelementptr inbounds i64, ptr [[A]], i64 [[INDEX]]
+// CHECK-NEXT:    [[WIDE_MASKED_LOAD:%.*]] = tail call <vscale x 2 x i64> @llvm.masked.load.nxv2i64.p0(ptr [[TMP4]], i32 8, <vscale x 2 x i1> [[TMP3]], <vscale x 2 x i64> poison), !tbaa [[TBAA19:![0-9]+]]
+// CHECK-NEXT:    [[TMP5:%.*]] = getelementptr inbounds i64, ptr [[B]], i64 [[INDEX]]
+// CHECK-NEXT:    [[WIDE_MASKED_LOAD14:%.*]] = tail call <vscale x 2 x i64> @llvm.masked.load.nxv2i64.p0(ptr [[TMP5]], i32 8, <vscale x 2 x i1> [[TMP3]], <vscale x 2 x i64> poison), !tbaa [[TBAA19]]
+// CHECK-NEXT:    [[TMP6:%.*]] = add <vscale x 2 x i64> [[WIDE_MASKED_LOAD14]], [[WIDE_MASKED_LOAD]]
+// CHECK-NEXT:    [[TMP7:%.*]] = getelementptr inbounds i64, ptr [[C]], i64 [[INDEX]]
+// CHECK-NEXT:    tail call void @llvm.masked.store.nxv2i64.p0(<vscale x 2 x i64> [[TMP6]], ptr [[TMP7]], i32 8, <vscale x 2 x i1> [[TMP3]]), !tbaa [[TBAA19]]
+// CHECK-NEXT:    [[INDEX_NEXT]] = add i64 [[INDEX]], [[TMP2]]
+// CHECK-NEXT:    [[ACTIVE_LANE_MASK_NEXT]] = tail call <vscale x 2 x i1> @llvm.get.active.lane.mask.nxv2i1.i64(i64 [[INDEX_NEXT]], i64 [[WIDE_TRIP_COUNT]])
+// CHECK-NEXT:    [[TMP8:%.*]] = extractelement <vscale x 2 x i1> [[ACTIVE_LANE_MASK_NEXT]], i64 0
+// CHECK-NEXT:    br i1 [[TMP8]], label [[VECTOR_BODY]], label [[FOR_COND_CLEANUP]], !llvm.loop [[LOOP21:![0-9]+]]
+// CHECK:       for.cond.cleanup:
+// CHECK-NEXT:    ret void
+//
+void alias_mask_64(uint64_t *restrict a, uint64_t * b, uint64_t * c, int n) {
+  #pragma clang loop vectorize(enable)
+  for (int i = 0; i < n; i++) {
+    c[i] = a[i] + b[i];
+  }
+}
+
+// CHECK-LABEL: define dso_local void @alias_mask_multiple_8(
+// CHECK-SAME: ptr nocapture noundef readonly [[A:%.*]], ptr nocapture noundef readonly [[B:%.*]], ptr nocapture noundef writeonly [[C:%.*]], i32 noundef [[N:%.*]]) local_unnamed_addr #[[ATTR0]] {
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    [[CMP11:%.*]] = icmp sgt i32 [[N]], 0
+// CHECK-NEXT:    br i1 [[CMP11]], label [[FOR_BODY_PREHEADER:%.*]], label [[FOR_COND_CLEANUP:%.*]]
+// CHECK:       for.body.preheader:
+// CHECK-NEXT:    [[C14:%.*]] = ptrtoint ptr [[C]] to i64
+// CHECK-NEXT:    [[A15:%.*]] = ptrtoint ptr [[A]] to i64
+// CHECK-NEXT:    [[B16:%.*]] = ptrtoint ptr [[B]] to i64
+// CHECK-NEXT:    [[WIDE_TRIP_COUNT:%.*]] = zext nneg i32 [[N]] to i64
+// CHECK-NEXT:    [[SUB_DIFF:%.*]] = sub i64 [[A15]], [[C14]]
+// CHECK-NEXT:    [[NEG_COMPARE:%.*]] = icmp slt i64 [[SUB_DIFF]], 0
+// CHECK-NEXT:    [[DOTSPLATINSERT:%.*]] = insertelement <vscale x 16 x i1> poison, i1 [[NEG_COMPARE]], i64 0
+// CHECK-NEXT:    [[DOTSPLAT:%.*]] = shufflevector <vscale x 16 x i1> [[DOTSPLATINSERT]], <vscale x 16 x i1> poison, <vscale x 16 x i32> zeroinitializer
+// CHECK-NEXT:    [[PTR_DIFF_LANE_MASK:%.*]] = tail call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 0, i64 [[SUB_DIFF]])
+// CHECK-NEXT:    [[ACTIVE_LANE_MASK_ALIAS:%.*]] = or <vscale x 16 x i1> [[PTR_DIFF_LANE_MASK]], [[DOTSPLAT]]
+// CHECK-NEXT:    [[SUB_DIFF18:%.*]] = sub i64 [[B16]], [[C14]]
+// CHECK-NEXT:    [[NEG_COMPARE20:%.*]] = icmp slt i64 [[SUB_DIFF18]], 0
+// CHECK-NEXT:    [[DOTSPLATINSERT21:%.*]] = insertelement <vscale x 16 x i1> poison, i1 [[NEG_COMPARE20]], i64 0
+// CHECK-NEXT:    [[DOTSPLAT22:%.*]] = shufflevector <vscale x 16 x i1> [[DOTSPLATINSERT21]], <vscale x 16 x i1> poison, <vscale x 16 x i32> zeroinitializer
+// CHECK-NEXT:    [[PTR_DIFF_LANE_MASK23:%.*]] = tail call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 0, i64 [[SUB_DIFF18]])
+// CHECK-NEXT:    [[ACTIVE_LANE_MASK_ALIAS24:%.*]] = or <vscale x 16 x i1> [[PTR_DIFF_LANE_MASK23]], [[DOTSPLAT22]]
+// CHECK-NEXT:    [[TMP0:%.*]] = and <vscale x 16 x i1> [[ACTIVE_LANE_MASK_ALIAS]], [[ACTIVE_LANE_MASK_ALIAS24]]
+// CHECK-NEXT:    [[ACTIVE_LANE_MASK_ENTRY:%.*]] = tail call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 0, i64 [[WIDE_TRIP_COUNT]])
+// CHECK-NEXT:    [[TMP1:%.*]] = zext <vscale x 16 x i1> [[TMP0]] to <vscale x 16 x i8>
+// CHECK-NEXT:    [[TMP2:%.*]] = tail call i8 @llvm.vector.reduce.add.nxv16i8(<vscale x 16 x i8> [[TMP1]])
+// CHECK-NEXT:    [[TMP3:%.*]] = zext i8 [[TMP2]] to i64
+// CHECK-NEXT:    br label [[VECTOR_BODY:%.*]]
+// CHECK:       vector.body:
+// CHECK-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[FOR_BODY_PREHEADER]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
+// CHECK-NEXT:    [[ACTIVE_LANE_MASK:%.*]] = phi <vscale x 16 x i1> [ [[ACTIVE_LANE_MASK_ENTRY]], [[FOR_BODY_PREHEADER]] ], [ [[ACTIVE_LANE_MASK_NEXT:%.*]], [[VECTOR_BODY]] ]
+// CHECK-NEXT:    [[TMP4:%.*]] = and <vscale x 16 x i1> [[ACTIVE_LANE_MASK]], [[TMP0]]
+// CHECK-NEXT:    [[TMP5:%.*]] = getelementptr inbounds i8, ptr [[A]], i64 [[INDEX]]
+// CHECK-NEXT:    [[WIDE_MASKED_LOAD:%.*]] = tail call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr [[TMP5]], i32 1, <vscale x 16 x i1> [[TMP4]], <vscale x 16 x i8> poison), !tbaa [[TBAA6]]
+// CHECK-NEXT:    [[TMP6:%.*]] = getelementptr inbounds i8, ptr [[B]], i64 [[INDEX]]
+// CHECK-NEXT:    [[WIDE_MASKED_LOAD25:%.*]] = tail call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr [[TMP6]], i32 1, <vscale x 16 x i1> [[TMP4]], <vscale x 16 x i8> poison), !tbaa [[TBAA6]]
+// CHECK-NEXT:    [[TMP7:%.*]] = add <vscale x 16 x i8> [[WIDE_MASKED_LOAD25]], [[WIDE_MASKED_LOAD]]
+// CHECK-NEXT:    [[TMP8:%.*]] = getelementptr inbounds i8, ptr [[C]], i64 [[INDEX]]
+// CHECK-NEXT:    tail call void @llvm.masked.store.nxv16i8.p0(<vscale x 16 x i8> [[TMP7]], ptr [[TMP8]], i32 1, <vscale x 16 x i1> [[TMP4]]), !tbaa [[TBAA6]]
+// CHECK-NEXT:    [[INDEX_NEXT]] = add i64 [[INDEX]], [[TMP3]]
+// CHECK-NEXT:    [[ACTIVE_LANE_MASK_NEXT]] = tail call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 [[INDEX_NEXT]], i64 [[WIDE_TRIP_COUNT]])
+// CHECK-NEXT:    [[TMP9:%.*]] = extractelement <vscale x 16 x i1> [[ACTIVE_LANE_MASK_NEXT]], i64 0
+// CHECK-NEXT:    br i1 [[TMP9]], label [[VECTOR_BODY]], label [[FOR_COND_CLEANUP]], !llvm.loop [[LOOP22:![0-9]+]]
+// CHECK:       for.cond.cleanup:
+// CHECK-NEXT:    ret void
+//
+void alias_mask_multiple_8(ui...
[truncated]

SamTebbs33 added a commit to SamTebbs33/llvm-project that referenced this pull request Jul 26, 2024
llvm#100579 emits IR that creates a
mask disabling lanes that could alias within a loop iteration, based on
a pair of pointers. This PR lowers that IR to a WHILEWR instruction for
AArch64.
@SamTebbs33 SamTebbs33 requested a review from davemgreen July 30, 2024 09:47
@@ -0,0 +1,404 @@
// NOTE: Assertions have been autogenerated by utils/update_cc_test_checks.py UTC_ARGS: --version 4
Collaborator

@huntergr-arm huntergr-arm Jul 30, 2024

We generally don't add source level tests for llvm optimizations; aiui clang's CodeGen test directory is mainly to test lowering the AST to LLVM IR, not end-to-end.

We might want a different way of testing the feature from source that we could add to a buildbot.

Collaborator Author

That's true, I've removed this file and have started work on adding a test to the llvm-test-suite to replace it. A statistic variable will help make sure the alias masking is happening.

NeedsFreeze(NeedsFreeze) {}
};

/// A pair of pointers that could overlap across a loop iteration.
Collaborator

If the analysis has been removed from LAA, then I suggest the struct can also be moved elsewhere. Maybe VPlan.h?

Collaborator Author

That sounds good to me. I originally put it here so it was next to the related PointerDiffInfo

SamTebbs33 added a commit that referenced this pull request Aug 2, 2024
#100579 emits IR that creates a
mask disabling lanes that could alias within a loop iteration, based on
a pair of pointers. This PR lowers that IR to the WHILEWR instruction
for AArch64.
#include "llvm/Analysis/LoopAnalysisManager.h"
#include "llvm/Analysis/ScalarEvolutionExpressions.h"
#include "llvm/IR/DiagnosticInfo.h"
#include "llvm/Transforms/Utils/ScalarEvolutionExpander.h"
Contributor

Unrelated changes?

Collaborator Author

Yep this is left over from previous changes, thanks. Removed now.

bool HasNUW = Style == TailFoldingStyle::None;
addCanonicalIVRecipes(*Plan, Legal->getWidestInductionType(), HasNUW, DL);

VPValue *AliasMask = nullptr;
Contributor

Does this need to be done at initial construction? IIUC this could be framed as optimization of the original VPlan, done as a separate VPlan-to-VPlan transform?

At the moment this requires threading a bunch of information to a lot of places; it would be good to limit this to a single dedicated transform.

Collaborator Author

Currently the mask is only generated when tail predication is enabled, in which case it perhaps could be done with a transform since all the vector loads and stores will be masked anyway. However the goal is to generate the mask even when tail predication is disabled, to capture more cases, and so the transform would have to handle swapping the non-masked loads and stores to masked variants, which I don't think is best done with a transform since the original VPlan does a good job of that anyway.

I'd also argue that it's a good idea to get the masking as we want it to be from the beginning rather than setting up an active lane mask and building on top of it later on.

Contributor

If it's possible to do so in a separate transform, that would be preferable IMO, as now the general VPlan construction has a bunch of added complexity for something that is quite specific.

A similar approach was taken for adding various options to control tail-folding (addActiveLaneMask, tryAddExplicitVectorLength). This also has the advantage of keeping the code responsible for generating the new construct in a single place, rather than spreading it across multiple places?

Having it as a separate transform also makes it possible to generate VPlans with different strategies (e.g. with and without this new mask) and pick the best option based on cost.

GeneratedRTChecks Checks(*PSE.getSE(), DT, LI, TTI, F->getDataLayout(),
AddBranchWeights);

// VPlan needs the aliasing pointers as Values and not SCEVs, so expand them
Contributor

If expanded SCEVs are needed in VPlan, they should be expanded during VPlan execution using VPExpandSCEVRecipe if possible.

Collaborator Author

Oh perfect, I didn't know that existed, thanks.

Collaborator Author

Done.

EltVT->getArrayNumElements();
else
ElementSize =
GEP->getSourceElementType()->getScalarSizeInBits() / 8;
Contributor

Not sure if using the GEP source element type here is correct; should this use the accessed type from the load/store, which could be different?

Collaborator Author

I've managed to figure it out now and am using the AccessSize provided by PointerDiffInfo. Thanks for pointing out the flakiness here!

; CHECK-NEXT: [[B1:%.*]] = ptrtoint ptr [[B:%.*]] to i64
; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[N:%.*]], 4
; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK]], label %scalar.ph, label %vector.memcheck
; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_MEMCHECK:%.*]]
Contributor

Unrelated changes?

Collaborator Author

Indeed, the update script did this. I've removed it now.

%cmp11 = icmp sgt i32 %n, 0
br i1 %cmp11, label %for.body.preheader, label %for.cond.cleanup

for.body.preheader:
Contributor

I might be missing something but the input is already vectorized?

Collaborator Author

You're right, thanks for pointing that out. I've improved the test so that it shows the loop vectoriser's work.

@github-actions

github-actions bot commented Aug 6, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@SamTebbs33
Collaborator Author

Ping.

}
}
}
assert(ElementSize > 0 && "Couldn't get element size from pointer");
Collaborator

This assert fires when compiling MicroBenchmarks/ImageProcessing/Dilate/CMakeFiles/Dilate.dir/dilateKernel.c from the test suite, so that needs fixing at least.

But I think we shouldn't be trying to figure out the element size at recipe execution time; it should be determined when trying to build the recipe and the value stored as part of the class. If we can't find the size then we stop trying to build the recipe and fall back to normal RT checks. Does that sound sensible to you?

Collaborator Author

Thanks Graham, I've had a deeper look and PointerDiffInfo provides an AccessSize which corresponds to what we want, so I'm using that now.

@SamTebbs33 SamTebbs33 force-pushed the whilewr branch 2 times, most recently from 7a87bf4 to 96b8b65 on October 22, 2024 07:57
@SamTebbs33
Collaborator Author

I've moved the read-after-write lowering to #114028

@SamTebbs33
Collaborator Author

I've now fixed the comparison that's emitted since it should have been comparing to 0. See the Arm WHILERW instruction.

SamTebbs33 added a commit to SamTebbs33/llvm-project that referenced this pull request Nov 20, 2024
It can be unsafe to load a vector from an address and write a vector to
an address if those two addresses have overlapping lanes within a
vectorised loop iteration.

This PR adds an intrinsic designed to create a mask with lanes disabled
if they overlap between the two pointer arguments, so that only safe
lanes are loaded, operated on and stored.

Along with the two pointer parameters, the intrinsic also takes an
immediate that represents the size in bytes of the vector element
types, as well as an immediate i1 that is true if there is a
write-after-read hazard or false if there is a read-after-write hazard.

This will be used by llvm#100579 and replaces the existing lowering for
whilewr since that isn't needed now that we have the intrinsic.
SamTebbs33 added a commit to SamTebbs33/llvm-project that referenced this pull request Aug 15, 2025
It can be unsafe to load a vector from an address and write a vector to
an address if those two addresses have overlapping lanes within a
vectorised loop iteration.

This PR adds an intrinsic designed to create a mask with lanes disabled
if they overlap between the two pointer arguments, so that only safe
lanes are loaded, operated on and stored.

Along with the two pointer parameters, the intrinsic also takes an
immediate that represents the size in bytes of the vector element
types, as well as an immediate i1 that is true if there is a
write-after-read hazard or false if there is a read-after-write hazard.

This will be used by llvm#100579 and replaces the existing lowering for
whilewr since that isn't needed now that we have the intrinsic.
@SamTebbs33 SamTebbs33 force-pushed the users/SamTebbs33/alias-intrinsic branch from b856ecf to 5402e27 on August 15, 2025 16:16
SamTebbs33 added a commit that referenced this pull request Sep 2, 2025
…lanes (#117007)

It can be unsafe to load a vector from an address and write a vector to
an address if those two addresses have overlapping lanes within a
vectorised loop iteration.

This PR adds intrinsics designed to create a mask with lanes disabled if
they overlap between the two pointer arguments, so that only safe lanes
are loaded, operated on and stored. The `loop.dependence.war.mask`
intrinsic represents cases where the store occurs after the load, and
the opposite for `loop.dependence.raw.mask`. The distinction between
write-after-read and read-after-write is important, since the ordering
of the read and write operations affects if the chain of those
instructions can be done safely.

Along with the two pointer parameters, the intrinsics also take an
immediate that represents the size in bytes of the vector element types.

This will be used by #100579.
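
A minimal usage sketch of the intrinsics described above; the overload suffix, the i64 type of the element-size immediate and the operand order are assumptions based on the commit message rather than code taken from the patch:

declare <vscale x 16 x i1> @llvm.loop.dependence.war.mask.nxv16i1(ptr, ptr, i64 immarg)
declare <vscale x 16 x i1> @llvm.loop.dependence.raw.mask.nxv16i1(ptr, ptr, i64 immarg)

; Safe-lane mask for a load from %load.ptr followed by a store to %store.ptr
; within the same vector iteration (write-after-read ordering), with the
; element size given in bytes as the third operand.
define <vscale x 16 x i1> @war_safe_lanes(ptr %load.ptr, ptr %store.ptr) {
  %mask = call <vscale x 16 x i1> @llvm.loop.dependence.war.mask.nxv16i1(ptr %load.ptr, ptr %store.ptr, i64 1)
  ret <vscale x 16 x i1> %mask
}

Lanes that would overlap between the two pointers within one vector iteration come back disabled, so only the safe lanes are loaded, operated on and stored.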
@SamTebbs33
Collaborator Author

Ping.

@@ -2117,8 +2170,8 @@ Value *llvm::addDiffRuntimeChecks(
Value *Diff =
Expander.expandCodeFor(SE.getMinusSCEV(SinkStart, SrcStart), Ty, Loc);

// Check if the same compare has already been created earlier. In that case,
// there is no need to check it again.
// Check if the same compare has already been created earlier. In that
Collaborator

nit: unnecessary change.

Collaborator Author

Done.


// Map to keep track of created compares, The key is the pair of operands for
// the compare, to allow detecting and re-using redundant compares.
DenseMap<std::pair<Value *, Value *>, Value *> SeenCompares;
Collaborator

nit: add newline after DenseMap to distinguish the comment from the code below.

Collaborator Author

No longer relevant now that this function has been removed, due to the new mask insertion code. Likewise for the other comments here.

Comment on lines 2114 to 2115
if (SeenCompares.lookup({Sink, Src}))
continue;
Collaborator

Can we not assume that {Sink, Src} tuples are always unique? (note that when I comment this out, there's no failing test)

Comment on lines 2119 to 2120
Value *SourceAsPtr = ChkBuilder.CreateCast(Instruction::IntToPtr, Src,
ChkBuilder.getPtrTy());
Collaborator

Suggested change
Value *SourceAsPtr = ChkBuilder.CreateCast(Instruction::IntToPtr, Src,
ChkBuilder.getPtrTy());
Value *SourceAsPtr = ChkBuilder.CreateIntToPtr(Src, ChkBuilder.getPtrTy());

(same below)

}
assert(AliasLaneMask && "Expected an alias lane mask to have been created.");
auto *VecVT = VectorType::get(ChkBuilder.getInt1Ty(), VF);
// Extend to an i8 since i1 is too small to add with
Collaborator

Is the intention to compare with 0? (i.e. any lanes set) If so, then I would think an or reduction on i1's would be better.
Alternatively, if we only want to execute the vector loop if at least VF/2 elements are handled in the vectorised loop (for VF > 2), then doing a popcount makes sense. But in that case, we should be cautious about the element type used: i8 may not be enough for very large vectors, since that will depend on the VF and vscale values.
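
For reference, here is a sketch of the two alternatives discussed here, illustrative only and not code from the patch:

declare i1 @llvm.vector.reduce.or.nxv16i1(<vscale x 16 x i1>)
declare i8 @llvm.vector.reduce.add.nxv16i8(<vscale x 16 x i8>)

; "Any lane enabled?" as a plain OR reduction on the i1 mask.
define i1 @any_lane_set(<vscale x 16 x i1> %mask) {
  %any = call i1 @llvm.vector.reduce.or.nxv16i1(<vscale x 16 x i1> %mask)
  ret i1 %any
}

; Popcount-style check: widen the i1 lanes, add-reduce, then compare with 0.
; The widened element type must be able to hold the full lane count once
; vscale is taken into account.
define i1 @any_lane_set_via_popcount(<vscale x 16 x i1> %mask) {
  %ext = zext <vscale x 16 x i1> %mask to <vscale x 16 x i8>
  %cnt = call i8 @llvm.vector.reduce.add.nxv16i8(<vscale x 16 x i8> %ext)
  %any = icmp ne i8 %cnt, 0
  ret i1 %any
}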

Collaborator Author

At the moment the intention is to compare with zero, but the popcount instruction is generic (the new mask insertion approach uses the instruction rather than constructing the intrinsic) so I'd like to keep it as an add. I'm sure I've seen it being optimised to an or somewhere!

; CHECK-NEXT: [[TMP1:%.*]] = call i8 @llvm.vector.reduce.add.nxv16i8(<vscale x 16 x i8> [[TMP0]])
; CHECK-NEXT: [[TMP2:%.*]] = zext i8 [[TMP1]] to i64
; CHECK-NEXT: [[TMP3:%.*]] = icmp ugt i64 [[TMP2]], 0
; CHECK-NEXT: br i1 [[TMP3]], label [[SCALAR_PH]], label [[VECTOR_PH:%.*]]
Collaborator

If the popcount is higher than 0, it branches to the scalar preheader; isn't it supposed to branch to the vector preheader instead?

Collaborator Author

Fixed now, thank you. I also realised I don't have to AND or OR with the existing MemCheckCond since we want to override it.

case TailFoldingStyle::Data:
case TailFoldingStyle::DataAndControlFlow:
case TailFoldingStyle::DataAndControlFlowWithoutRuntimeCheck:
return RTCheckStyle::UseSafeEltsMask;
Collaborator

Do we need to tie this to tail folding? We would be able to get more coverage if we could enable this for all loops. If there are reasons for tying this to tail folding, then I'd prefer to do that under an option that we can disable for testing purposes.

Collaborator Author

I believe it was because the loop dependence mask was inserted at the same time as the active lane mask, but with the new approach they're de-coupled, so I have removed this restriction.

// [...]
// store %b, %a.vec ; only s and t can be stored as their addresses don't
// overlap with %a + (VF - 1)
class VPAliasLaneMaskRecipe : public VPSingleDefRecipe {
Collaborator

I'm sure there's a good reason for it, but why is this a VPRecipe as opposed to being a VPInstruction?

Collaborator Author

@SamTebbs33 SamTebbs33 Nov 12, 2025

I'm going to try to use VPWidenIntrinsicRecipe instead but have run into an issue.

Collaborator Author

Replaced with the intrinsic recipe now.

Builder.getPtrTy());
Value *SinkAsPtr =
Builder.CreateCast(Instruction::IntToPtr, SinkValue, Builder.getPtrTy());
Value *AliasMask = Builder.CreateIntrinsic(
Contributor

This basically lowers to a wide intrinsic; can this simply use VPWidenIntrinsicRecipe?

Collaborator Author

The loop dependence intrinsics expect pointer arguments, but the scev expansion turns the pointer arguments into integers. This is fine for the dedicated recipe, since it converts those integers to pointers for the intrinsic call, but VPWidenIntrinsicRecipe won't do that. I've tried adding inttoptr casts before the VPWidenIntrinsicRecipe, but then an assertion fails saying "VPExpandSCEVRecipes must be at the beginning of the entry block, after any VPIRInstructions", since there's now a cast instruction after the scev expansion. It's very annoying.

Contributor

Right, but can't all needed SCEVs be expanded first, then casted? Or if a pointer is needed, avoid converting them to integers first?

Collaborator Author

Figured it out now! I think I was getting the VPBuilder usage wrong so the recipes were being inserted in the wrong place.

// Otherwise, fallback to default scalarization cost.
break;
}
case Intrinsic::loop_dependence_raw_mask:
Contributor

I think you should be able to move this to an independent PR, and also test this with dedicated cost-model tests.

Collaborator Author

Raised that PR here and I've rebased this work on top of it.

Contributor

Yep that looks like a good first step. Would be good if updating the masks could also be done there (the PopCount introduction) and we could avoid making addActiveLaneMask more complex

    When vectorising a loop that uses loads and stores, those pointers could
    overlap if their difference is less than the vector factor. For example,
    if address 20 is being stored to and address 23 is being loaded from, they
    overlap when the vector factor is 4 or higher. Currently LoopVectorize
    branches to a scalar loop in these cases with a runtime check. However, if
    we construct a mask that disables the overlapping (aliasing) lanes then
    the vectorised loop can be safely entered, as long as the loads and
    stores are masked off.
@SamTebbs33 SamTebbs33 changed the base branch from main to users/SamTebbs33/loop-dependence-costmodel November 13, 2025 10:54
// Replace uses of the loop body's active lane mask phi with an AND of the
// phi and the alias mask.
for (VPRecipeBase &R : *LoopBody) {
auto *MaskPhi = dyn_cast<VPActiveLaneMaskPHIRecipe>(&R);
Collaborator

I believe the transform is currently incorrect. When there is no active lane mask, it would create an unpredicated vector loop that handles e.g. VF=16 lanes in a loop, even when the result of the alias.mask would say that only 3 lanes could be safely handled, for example. It would then increment the IV by 3 elements, but that doesn't mean only 3 lanes are handled each iteration. Without predication, it still handles 16 lanes each iteration.

I think there are two options here:

  1. if there is no active lane mask in the loop, we could bail out to the scalar loop if the number of lanes < VF
  2. request the use of an active lane mask in the loop for data when there is an alias mask required and the target supports the use of an active lane mask.

I wouldn't mind taking approach 1 first, so that we can already use the whilerw instructions for the alias checks in the check block, rather than a bunch of scalar instructions, and then follow this up by option 2.

Contributor

I don't think we necessarily need an active-lane-mask, as long as either all recipes that need predication (memory ops, ops that are immediate UB on poison, reduction/recurrences) are already predicated (could be due to tail-folding without active-lane-mask) or we could convert them to predicated variants using the alias mask.

Also, an active-lane-mask also does not necessarily mean all required recipes are predicated and use the active-lane-mask (e.g. a transform may convert a masked memory access to an unmasked one, if it is guaranteed dereferenceable for the whole loop).

So it would probably be good to check if all required recipes are masked and make sure their masks include AliasMask after the transform.

if (VecVT->getElementType()->getScalarSizeInBits() < 8) {
Cnt = Builder.CreateCast(
Instruction::ZExt, Cnt,
VectorType::get(Builder.getInt8Ty(), VecVT->getElementCount()));
Collaborator

We can't use a hard-coded i8, as this may not be sufficient to represent the number of bits from popcount.

CM.Legal->getRuntimePointerChecking()->getDiffChecks();

// Create a mask enabling safe elements for each iteration.
if (CM.getRTCheckStyle(TTI) == RTCheckStyle::UseSafeEltsMask &&
Contributor

would be good to outline to a separate function + document the transform

Labels: backend:AArch64, clang, llvm:analysis, llvm:ir, llvm:transforms, vectorizers