[VPlan] Support scalable VFs in narrowInterleaveGroups. #154842
Conversation
Update narrowInterleaveGroups to support scalable VFs. After the transform, the vector loop will process a single iteration of the original vector loop for fixed-width vectors and vscale iterations for scalable vectors.
@llvm/pr-subscribers-llvm-transforms @llvm/pr-subscribers-vectorizers

Author: Florian Hahn (fhahn)

Changes: Update narrowInterleaveGroups to support scalable VFs. After the transform, the vector loop will process a single iteration of the original vector loop for fixed-width vectors and vscale iterations for scalable vectors.

Full diff: https://github.com/llvm/llvm-project/pull/154842.diff (2 files affected)
diff --git a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
index 504dd9a7c22c7..2d8ad7daf2da6 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
@@ -3533,12 +3533,12 @@ static bool isAlreadyNarrow(VPValue *VPV) {
void VPlanTransforms::narrowInterleaveGroups(VPlan &Plan, ElementCount VF,
unsigned VectorRegWidth) {
VPRegionBlock *VectorLoop = Plan.getVectorLoopRegion();
- if (VF.isScalable() || !VectorLoop)
+ if (!VectorLoop)
return;
VPTypeAnalysis TypeInfo(Plan);
- unsigned FixedVF = VF.getFixedValue();
+ unsigned VFMinVal = VF.getKnownMinValue();
SmallVector<VPInterleaveRecipe *> StoreGroups;
for (auto &R : *VectorLoop->getEntryBasicBlock()) {
if (isa<VPCanonicalIVPHIRecipe>(&R) ||
@@ -3574,7 +3574,7 @@ void VPlanTransforms::narrowInterleaveGroups(VPlan &Plan, ElementCount VF,
continue;
// Bail out on non-consecutive interleave groups.
- if (!isConsecutiveInterleaveGroup(InterleaveR, FixedVF, TypeInfo,
+ if (!isConsecutiveInterleaveGroup(InterleaveR, VFMinVal, TypeInfo,
VectorRegWidth))
return;
@@ -3693,10 +3693,21 @@ void VPlanTransforms::narrowInterleaveGroups(VPlan &Plan, ElementCount VF,
// original iteration.
auto *CanIV = Plan.getCanonicalIV();
auto *Inc = cast<VPInstruction>(CanIV->getBackedgeValue());
- Inc->setOperand(1, Plan.getOrAddLiveIn(ConstantInt::get(
- CanIV->getScalarType(), 1 * Plan.getUF())));
- Plan.getVF().replaceAllUsesWith(
- Plan.getOrAddLiveIn(ConstantInt::get(CanIV->getScalarType(), 1)));
+ VPBuilder PHBuilder(Plan.getVectorPreheader());
+
+ VPValue *UF = Plan.getOrAddLiveIn(
+ ConstantInt::get(CanIV->getScalarType(), 1 * Plan.getUF()));
+ if (VF.isScalable()) {
+ VPValue *VScale = PHBuilder.createElementCount(
+ CanIV->getScalarType(), ElementCount::getScalable(1));
+ VPValue *VScaleUF = PHBuilder.createNaryOp(Instruction::Mul, {VScale, UF});
+ Inc->setOperand(1, VScaleUF);
+ Plan.getVF().replaceAllUsesWith(VScale);
+ } else {
+ Inc->setOperand(1, UF);
+ Plan.getVF().replaceAllUsesWith(
+ Plan.getOrAddLiveIn(ConstantInt::get(CanIV->getScalarType(), 1)));
+ }
removeDeadRecipes(Plan);
}
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/transform-narrow-interleave-to-widen-memory-scalable.ll b/llvm/test/Transforms/LoopVectorize/AArch64/transform-narrow-interleave-to-widen-memory-scalable.ll
index 7533636f9d41c..46ba7f645a03e 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/transform-narrow-interleave-to-widen-memory-scalable.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/transform-narrow-interleave-to-widen-memory-scalable.ll
@@ -16,18 +16,15 @@ define void @load_store_interleave_group(ptr noalias %data) {
; CHECK-NEXT: [[TMP3:%.*]] = mul nuw i64 [[TMP2]], 2
; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 100, [[TMP3]]
; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 100, [[N_MOD_VF]]
+; CHECK-NEXT: [[TMP6:%.*]] = call i64 @llvm.vscale.i64()
; CHECK-NEXT: br label %[[VECTOR_BODY:.*]]
; CHECK: [[VECTOR_BODY]]:
; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
; CHECK-NEXT: [[TMP0:%.*]] = shl nsw i64 [[INDEX]], 1
; CHECK-NEXT: [[TMP1:%.*]] = getelementptr inbounds i64, ptr [[DATA]], i64 [[TMP0]]
-; CHECK-NEXT: [[WIDE_VEC:%.*]] = load <vscale x 4 x i64>, ptr [[TMP1]], align 8
-; CHECK-NEXT: [[STRIDED_VEC:%.*]] = call { <vscale x 2 x i64>, <vscale x 2 x i64> } @llvm.vector.deinterleave2.nxv4i64(<vscale x 4 x i64> [[WIDE_VEC]])
-; CHECK-NEXT: [[TMP6:%.*]] = extractvalue { <vscale x 2 x i64>, <vscale x 2 x i64> } [[STRIDED_VEC]], 0
-; CHECK-NEXT: [[TMP7:%.*]] = extractvalue { <vscale x 2 x i64>, <vscale x 2 x i64> } [[STRIDED_VEC]], 1
-; CHECK-NEXT: [[INTERLEAVED_VEC:%.*]] = call <vscale x 4 x i64> @llvm.vector.interleave2.nxv4i64(<vscale x 2 x i64> [[TMP6]], <vscale x 2 x i64> [[TMP7]])
-; CHECK-NEXT: store <vscale x 4 x i64> [[INTERLEAVED_VEC]], ptr [[TMP1]], align 8
-; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP3]]
+; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <vscale x 2 x i64>, ptr [[TMP1]], align 8
+; CHECK-NEXT: store <vscale x 2 x i64> [[WIDE_LOAD]], ptr [[TMP1]], align 8
+; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP6]]
; CHECK-NEXT: [[TMP8:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
; CHECK-NEXT: br i1 [[TMP8]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
; CHECK: [[MIDDLE_BLOCK]]:
@@ -69,20 +66,16 @@ define void @test_2xi64_unary_op_load_interleave_group(ptr noalias %data, ptr no
; CHECK-NEXT: [[TMP3:%.*]] = mul nuw i64 [[TMP2]], 2
; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 1111, [[TMP3]]
; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 1111, [[N_MOD_VF]]
+; CHECK-NEXT: [[TMP6:%.*]] = call i64 @llvm.vscale.i64()
; CHECK-NEXT: br label %[[VECTOR_BODY:.*]]
; CHECK: [[VECTOR_BODY]]:
; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
; CHECK-NEXT: [[TMP0:%.*]] = shl nsw i64 [[INDEX]], 1
; CHECK-NEXT: [[TMP1:%.*]] = getelementptr inbounds double, ptr [[DATA]], i64 [[TMP0]]
-; CHECK-NEXT: [[WIDE_VEC:%.*]] = load <vscale x 4 x double>, ptr [[TMP1]], align 8
-; CHECK-NEXT: [[STRIDED_VEC:%.*]] = call { <vscale x 2 x double>, <vscale x 2 x double> } @llvm.vector.deinterleave2.nxv4f64(<vscale x 4 x double> [[WIDE_VEC]])
-; CHECK-NEXT: [[TMP6:%.*]] = extractvalue { <vscale x 2 x double>, <vscale x 2 x double> } [[STRIDED_VEC]], 0
-; CHECK-NEXT: [[TMP7:%.*]] = extractvalue { <vscale x 2 x double>, <vscale x 2 x double> } [[STRIDED_VEC]], 1
-; CHECK-NEXT: [[TMP8:%.*]] = fneg <vscale x 2 x double> [[TMP6]]
+; CHECK-NEXT: [[TMP7:%.*]] = load <vscale x 2 x double>, ptr [[TMP1]], align 8
; CHECK-NEXT: [[TMP9:%.*]] = fneg <vscale x 2 x double> [[TMP7]]
-; CHECK-NEXT: [[INTERLEAVED_VEC:%.*]] = call <vscale x 4 x double> @llvm.vector.interleave2.nxv4f64(<vscale x 2 x double> [[TMP8]], <vscale x 2 x double> [[TMP9]])
-; CHECK-NEXT: store <vscale x 4 x double> [[INTERLEAVED_VEC]], ptr [[TMP1]], align 8
-; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP3]]
+; CHECK-NEXT: store <vscale x 2 x double> [[TMP9]], ptr [[TMP1]], align 8
+; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP6]]
; CHECK-NEXT: [[TMP10:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
; CHECK-NEXT: br i1 [[TMP10]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
; CHECK: [[MIDDLE_BLOCK]]:
      // Bail out on non-consecutive interleave groups.
-     if (!isConsecutiveInterleaveGroup(InterleaveR, FixedVF, TypeInfo,
+     if (!isConsecutiveInterleaveGroup(InterleaveR, VFMinVal, TypeInfo,
Stylistically, this is a bit confusing. Maybe instead pass in the ElementCount, and adjust the routine to work in terms of TypeSize and perform the final check in terms of not being equal to VF, but a multiple of?
Though, I think I'm a bit confused on what this check is actually doing. This seems to be disallowing fixed VFs which are a multiple of the interleave group size? Is that an intentional restriction?
Yes, at the moment this will only transform fixed vectors if the interleave group processes exactly VF elements, without gaps. It can be extended to handle multiples of a fixed VF, but that should probably be a separate change.
 ; CHECK-NEXT: [[TMP3:%.*]] = mul nuw i64 [[TMP2]], 2
 ; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 100, [[TMP3]]
 ; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 100, [[N_MOD_VF]]
+; CHECK-NEXT: [[TMP6:%.*]] = call i64 @llvm.vscale.i64()
Hmm, is the calculation of N_VEC still correct here? The vector loop is processing half the number of iterations compared to before, so I thought the calculation needed updating. Does the narrowing only take place for power-of-2 interleave factors?
The vector trip count should still be correct, although we may be able to process more iterations in it, as we now step by vscale instead of vscale x 2. But that is also a separate change.
We update the step in the loop to only step by vscale, so using the original vector trip count should be fine, I think.
Ah I think I see. We only perform the transform if the VF is divisible by the interleave factor, which currently excludes interleave factors that aren't powers of 2. So doesn't have to be done in this PR, but I do think N_VEC should be recalculated because we're making the scalar tail longer than it needs to be. Suppose the original trip count was 19, the interleave factor is 4 and the VF is 4. N_VEC will be 19 - (19 % 4) = 16, which means we're only processing 16 iterations when in reality we can process all 19 and delete the tail completely. For scalable VFs we can't delete the tail, but we can still process more iterations in the vector loop, if that makes sense?
When interleaving there will still be a scalar tail of course, but for VF=4,IC=2 that's just a single remaining iteration.
Yep, the minimum iteration check could be improved. This should now be do-able as we create the minimum iteration check directly in VPlan (when not vectorizing the epilogue). I can check separately.
david-arm left a comment:
LGTM!