[VectorCombine] Scalarize bin ops and cmps with two splatted operands #137786
Conversation
@llvm/pr-subscribers-vectorizers @llvm/pr-subscribers-llvm-transforms

Author: Luke Lau (lukel97)

Changes

Consider this C loop, simplified from a case in SPEC CPU 2017:

```c
void f(int *x, int *y) {
  for (int i = 0; i < 1024; i++)
    x[i] += *y * 2;
}
```

On RISC-V we will currently generate this loop. We could avoid the splat `vmv.v.x v8, a5` by using `vadd.vx v10, v10, a5`, but this doesn't happen today:

- `x` and `y` aren't `noalias`, so LICM can't initially hoist out the `*y * 2`.
- The loop is vectorized, so the `*y * 2` is now vectorized.
- LICM can then hoist the vectorized `*y * 2` into the preheader.
- DAGCombiner scalarizes the `*y * 2` once it detects it is a splat. But this is too late, as we can no longer sink the splat into the loop body, so we can't match `vadd.vx`.

However DAGCombiner isn't the only place that can scalarize bin ops of splats. `VectorCombine::scalarizeBinopOrCmp` can also scalarize bin ops, and because it's before CodeGenPrepare this allows the splat to be sunk into the loop body. However it only matches the canonical form that InstCombine produces for two non-constant splatted operands, e.g. `x[i] += *y * *z`:

```llvm
%x.insert = insertelement <vscale x 4 x i32> poison, i32 %x, i64 0
%y.insert = insertelement <vscale x 4 x i32> poison, i32 %y, i64 0
%0 = add <vscale x 4 x i32> %x.insert, %y.insert
%res = shufflevector <vscale x 4 x i32> %0, <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
```

If one operand is constant, i.e. `x[i] += *y * 2`, then the IR will be in the form:

```llvm
%x.insert = insertelement <vscale x 4 x i32> poison, i32 %x, i64 0
%x.splat = shufflevector <vscale x 4 x i32> %x.insert, <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
%res = add <vscale x 4 x i32> %x.splat, splat (i32 42)
```

This patch teaches VectorCombine to handle this case, and in doing so allows the splat to be avoided on RISC-V. I initially created a separate transform for this, but there ended up being a lot of duplicated code, so I've tried my best to reuse the existing transform which operates on insertelements. Since there doesn't seem to be a dedicated "splat from scalar" cost hook I've just reused the insertelement cost, which is hopefully close enough.

Full diff: https://github.com/llvm/llvm-project/pull/137786.diff

3 Files Affected:
diff --git a/llvm/lib/Transforms/Vectorize/VectorCombine.cpp b/llvm/lib/Transforms/Vectorize/VectorCombine.cpp
index 04c084ffdda97..cc6ef1656914c 100644
--- a/llvm/lib/Transforms/Vectorize/VectorCombine.cpp
+++ b/llvm/lib/Transforms/Vectorize/VectorCombine.cpp
@@ -1035,50 +1035,61 @@ bool VectorCombine::scalarizeBinopOrCmp(Instruction &I) {
if (match(U, m_Select(m_Specific(&I), m_Value(), m_Value())))
return false;
- // Match against one or both scalar values being inserted into constant
- // vectors:
- // vec_op VecC0, (inselt VecC1, V1, Index)
- // vec_op (inselt VecC0, V0, Index), VecC1
- // vec_op (inselt VecC0, V0, Index), (inselt VecC1, V1, Index)
- // TODO: Deal with mismatched index constants and variable indexes?
Constant *VecC0 = nullptr, *VecC1 = nullptr;
Value *V0 = nullptr, *V1 = nullptr;
- uint64_t Index0 = 0, Index1 = 0;
- if (!match(Ins0, m_InsertElt(m_Constant(VecC0), m_Value(V0),
- m_ConstantInt(Index0))) &&
- !match(Ins0, m_Constant(VecC0)))
- return false;
- if (!match(Ins1, m_InsertElt(m_Constant(VecC1), m_Value(V1),
- m_ConstantInt(Index1))) &&
- !match(Ins1, m_Constant(VecC1)))
- return false;
+ std::optional<uint64_t> Index;
+
+ // Try and match against two splatted operands first.
+ // vec_op (splat V0), (splat V1)
+ V0 = getSplatValue(Ins0);
+ V1 = getSplatValue(Ins1);
+ if (!V0 || !V1) {
+ // Match against one or both scalar values being inserted into constant
+ // vectors:
+ // vec_op VecC0, (inselt VecC1, V1, Index)
+ // vec_op (inselt VecC0, V0, Index), VecC1
+ // vec_op (inselt VecC0, V0, Index), (inselt VecC1, V1, Index)
+ // TODO: Deal with mismatched index constants and variable indexes?
+ V0 = nullptr, V1 = nullptr;
+ uint64_t Index0 = 0, Index1 = 0;
+ if (!match(Ins0, m_InsertElt(m_Constant(VecC0), m_Value(V0),
+ m_ConstantInt(Index0))) &&
+ !match(Ins0, m_Constant(VecC0)))
+ return false;
+ if (!match(Ins1, m_InsertElt(m_Constant(VecC1), m_Value(V1),
+ m_ConstantInt(Index1))) &&
+ !match(Ins1, m_Constant(VecC1)))
+ return false;
- bool IsConst0 = !V0;
- bool IsConst1 = !V1;
- if (IsConst0 && IsConst1)
- return false;
- if (!IsConst0 && !IsConst1 && Index0 != Index1)
- return false;
+ bool IsConst0 = !V0;
+ bool IsConst1 = !V1;
+ if (IsConst0 && IsConst1)
+ return false;
+ if (!IsConst0 && !IsConst1 && Index0 != Index1)
+ return false;
- auto *VecTy0 = cast<VectorType>(Ins0->getType());
- auto *VecTy1 = cast<VectorType>(Ins1->getType());
- if (VecTy0->getElementCount().getKnownMinValue() <= Index0 ||
- VecTy1->getElementCount().getKnownMinValue() <= Index1)
- return false;
+ auto *VecTy0 = cast<VectorType>(Ins0->getType());
+ auto *VecTy1 = cast<VectorType>(Ins1->getType());
+ if (VecTy0->getElementCount().getKnownMinValue() <= Index0 ||
+ VecTy1->getElementCount().getKnownMinValue() <= Index1)
+ return false;
- // Bail for single insertion if it is a load.
- // TODO: Handle this once getVectorInstrCost can cost for load/stores.
- auto *I0 = dyn_cast_or_null<Instruction>(V0);
- auto *I1 = dyn_cast_or_null<Instruction>(V1);
- if ((IsConst0 && I1 && I1->mayReadFromMemory()) ||
- (IsConst1 && I0 && I0->mayReadFromMemory()))
- return false;
+ // Bail for single insertion if it is a load.
+ // TODO: Handle this once getVectorInstrCost can cost for load/stores.
+ auto *I0 = dyn_cast_or_null<Instruction>(V0);
+ auto *I1 = dyn_cast_or_null<Instruction>(V1);
+ if ((IsConst0 && I1 && I1->mayReadFromMemory()) ||
+ (IsConst1 && I0 && I0->mayReadFromMemory()))
+ return false;
+
+ Index = IsConst0 ? Index1 : Index0;
+ }
- uint64_t Index = IsConst0 ? Index1 : Index0;
- Type *ScalarTy = IsConst0 ? V1->getType() : V0->getType();
- Type *VecTy = I.getType();
+ auto *VecTy = cast<VectorType>(I.getType());
+ Type *ScalarTy = VecTy->getElementType();
assert(VecTy->isVectorTy() &&
- (IsConst0 || IsConst1 || V0->getType() == V1->getType()) &&
+ (isa<Constant>(Ins0) || isa<Constant>(Ins1) ||
+ V0->getType() == V1->getType()) &&
(ScalarTy->isIntegerTy() || ScalarTy->isFloatingPointTy() ||
ScalarTy->isPointerTy()) &&
"Unexpected types for insert element into binop or cmp");
@@ -1099,12 +1110,14 @@ bool VectorCombine::scalarizeBinopOrCmp(Instruction &I) {
// Get cost estimate for the insert element. This cost will factor into
// both sequences.
InstructionCost InsertCost = TTI.getVectorInstrCost(
- Instruction::InsertElement, VecTy, CostKind, Index);
- InstructionCost OldCost =
- (IsConst0 ? 0 : InsertCost) + (IsConst1 ? 0 : InsertCost) + VectorOpCost;
- InstructionCost NewCost = ScalarOpCost + InsertCost +
- (IsConst0 ? 0 : !Ins0->hasOneUse() * InsertCost) +
- (IsConst1 ? 0 : !Ins1->hasOneUse() * InsertCost);
+ Instruction::InsertElement, VecTy, CostKind, Index.value_or(0));
+ InstructionCost OldCost = (isa<Constant>(Ins0) ? 0 : InsertCost) +
+ (isa<Constant>(Ins1) ? 0 : InsertCost) +
+ VectorOpCost;
+ InstructionCost NewCost =
+ ScalarOpCost + InsertCost +
+ (isa<Constant>(Ins0) ? 0 : !Ins0->hasOneUse() * InsertCost) +
+ (isa<Constant>(Ins1) ? 0 : !Ins1->hasOneUse() * InsertCost);
// We want to scalarize unless the vector variant actually has lower cost.
if (OldCost < NewCost || !NewCost.isValid())
@@ -1112,16 +1125,18 @@ bool VectorCombine::scalarizeBinopOrCmp(Instruction &I) {
// vec_op (inselt VecC0, V0, Index), (inselt VecC1, V1, Index) -->
// inselt NewVecC, (scalar_op V0, V1), Index
+ //
+ // vec_op (splat V0), (splat V1) --> splat (scalar_op V0, V1)
if (IsCmp)
++NumScalarCmp;
else
++NumScalarBO;
// For constant cases, extract the scalar element, this should constant fold.
- if (IsConst0)
- V0 = ConstantExpr::getExtractElement(VecC0, Builder.getInt64(Index));
- if (IsConst1)
- V1 = ConstantExpr::getExtractElement(VecC1, Builder.getInt64(Index));
+ if (Index && isa<Constant>(Ins0))
+ V0 = ConstantExpr::getExtractElement(VecC0, Builder.getInt64(*Index));
+ if (Index && isa<Constant>(Ins1))
+ V1 = ConstantExpr::getExtractElement(VecC1, Builder.getInt64(*Index));
Value *Scalar =
IsCmp ? Builder.CreateCmp(Pred, V0, V1)
@@ -1134,12 +1149,16 @@ bool VectorCombine::scalarizeBinopOrCmp(Instruction &I) {
if (auto *ScalarInst = dyn_cast<Instruction>(Scalar))
ScalarInst->copyIRFlags(&I);
- // Fold the vector constants in the original vectors into a new base vector.
- Value *NewVecC =
- IsCmp ? Builder.CreateCmp(Pred, VecC0, VecC1)
- : Builder.CreateBinOp((Instruction::BinaryOps)Opcode, VecC0, VecC1);
- Value *Insert = Builder.CreateInsertElement(NewVecC, Scalar, Index);
- replaceValue(I, *Insert);
+ Value *Result;
+ if (Index) {
+ // Fold the vector constants in the original vectors into a new base vector.
+ Value *NewVecC = IsCmp ? Builder.CreateCmp(Pred, VecC0, VecC1)
+ : Builder.CreateBinOp((Instruction::BinaryOps)Opcode,
+ VecC0, VecC1);
+ Result = Builder.CreateInsertElement(NewVecC, Scalar, *Index);
+ } else
+ Result = Builder.CreateVectorSplat(VecTy->getElementCount(), Scalar);
+ replaceValue(I, *Result);
return true;
}
diff --git a/llvm/test/Transforms/VectorCombine/X86/shuffle-of-cmps.ll b/llvm/test/Transforms/VectorCombine/X86/shuffle-of-cmps.ll
index f9108efa7ee79..0da4a72291380 100644
--- a/llvm/test/Transforms/VectorCombine/X86/shuffle-of-cmps.ll
+++ b/llvm/test/Transforms/VectorCombine/X86/shuffle-of-cmps.ll
@@ -271,12 +271,24 @@ define <4 x i32> @shuf_icmp_ugt_v4i32_use(<4 x i32> %x, <4 x i32> %y, <4 x i32>
; PR121110 - don't merge equivalent (but not matching) predicates
define <2 x i1> @PR121110() {
-; CHECK-LABEL: define <2 x i1> @PR121110(
-; CHECK-SAME: ) #[[ATTR0]] {
-; CHECK-NEXT: [[UGT:%.*]] = icmp samesign ugt <2 x i32> zeroinitializer, zeroinitializer
-; CHECK-NEXT: [[SGT:%.*]] = icmp sgt <2 x i32> zeroinitializer, <i32 6, i32 -4>
-; CHECK-NEXT: [[RES:%.*]] = shufflevector <2 x i1> [[UGT]], <2 x i1> [[SGT]], <2 x i32> <i32 0, i32 3>
-; CHECK-NEXT: ret <2 x i1> [[RES]]
+; SSE-LABEL: define <2 x i1> @PR121110(
+; SSE-SAME: ) #[[ATTR0]] {
+; SSE-NEXT: [[SGT:%.*]] = icmp sgt <2 x i32> zeroinitializer, <i32 6, i32 -4>
+; SSE-NEXT: [[RES:%.*]] = shufflevector <2 x i1> zeroinitializer, <2 x i1> [[SGT]], <2 x i32> <i32 0, i32 3>
+; SSE-NEXT: ret <2 x i1> [[RES]]
+;
+; AVX2-LABEL: define <2 x i1> @PR121110(
+; AVX2-SAME: ) #[[ATTR0]] {
+; AVX2-NEXT: [[SGT:%.*]] = icmp sgt <2 x i32> zeroinitializer, <i32 6, i32 -4>
+; AVX2-NEXT: [[RES:%.*]] = shufflevector <2 x i1> zeroinitializer, <2 x i1> [[SGT]], <2 x i32> <i32 0, i32 3>
+; AVX2-NEXT: ret <2 x i1> [[RES]]
+;
+; AVX512-LABEL: define <2 x i1> @PR121110(
+; AVX512-SAME: ) #[[ATTR0]] {
+; AVX512-NEXT: [[UGT:%.*]] = icmp samesign ugt <2 x i32> zeroinitializer, zeroinitializer
+; AVX512-NEXT: [[SGT:%.*]] = icmp sgt <2 x i32> zeroinitializer, <i32 6, i32 -4>
+; AVX512-NEXT: [[RES:%.*]] = shufflevector <2 x i1> [[UGT]], <2 x i1> [[SGT]], <2 x i32> <i32 0, i32 3>
+; AVX512-NEXT: ret <2 x i1> [[RES]]
;
%ugt = icmp samesign ugt <2 x i32> < i32 0, i32 0 >, < i32 0, i32 0 >
%sgt = icmp sgt <2 x i32> < i32 0, i32 0 >, < i32 6, i32 4294967292 >
@@ -285,12 +297,24 @@ define <2 x i1> @PR121110() {
}
define <2 x i1> @PR121110_commute() {
-; CHECK-LABEL: define <2 x i1> @PR121110_commute(
-; CHECK-SAME: ) #[[ATTR0]] {
-; CHECK-NEXT: [[SGT:%.*]] = icmp sgt <2 x i32> zeroinitializer, <i32 6, i32 -4>
-; CHECK-NEXT: [[UGT:%.*]] = icmp samesign ugt <2 x i32> zeroinitializer, zeroinitializer
-; CHECK-NEXT: [[RES:%.*]] = shufflevector <2 x i1> [[SGT]], <2 x i1> [[UGT]], <2 x i32> <i32 0, i32 3>
-; CHECK-NEXT: ret <2 x i1> [[RES]]
+; SSE-LABEL: define <2 x i1> @PR121110_commute(
+; SSE-SAME: ) #[[ATTR0]] {
+; SSE-NEXT: [[SGT:%.*]] = icmp sgt <2 x i32> zeroinitializer, <i32 6, i32 -4>
+; SSE-NEXT: [[RES:%.*]] = shufflevector <2 x i1> [[SGT]], <2 x i1> zeroinitializer, <2 x i32> <i32 0, i32 3>
+; SSE-NEXT: ret <2 x i1> [[RES]]
+;
+; AVX2-LABEL: define <2 x i1> @PR121110_commute(
+; AVX2-SAME: ) #[[ATTR0]] {
+; AVX2-NEXT: [[SGT:%.*]] = icmp sgt <2 x i32> zeroinitializer, <i32 6, i32 -4>
+; AVX2-NEXT: [[RES:%.*]] = shufflevector <2 x i1> [[SGT]], <2 x i1> zeroinitializer, <2 x i32> <i32 0, i32 3>
+; AVX2-NEXT: ret <2 x i1> [[RES]]
+;
+; AVX512-LABEL: define <2 x i1> @PR121110_commute(
+; AVX512-SAME: ) #[[ATTR0]] {
+; AVX512-NEXT: [[SGT:%.*]] = icmp sgt <2 x i32> zeroinitializer, <i32 6, i32 -4>
+; AVX512-NEXT: [[UGT:%.*]] = icmp samesign ugt <2 x i32> zeroinitializer, zeroinitializer
+; AVX512-NEXT: [[RES:%.*]] = shufflevector <2 x i1> [[SGT]], <2 x i1> [[UGT]], <2 x i32> <i32 0, i32 3>
+; AVX512-NEXT: ret <2 x i1> [[RES]]
;
%sgt = icmp sgt <2 x i32> < i32 0, i32 0 >, < i32 6, i32 4294967292 >
%ugt = icmp samesign ugt <2 x i32> < i32 0, i32 0 >, < i32 0, i32 0 >
diff --git a/llvm/test/Transforms/VectorCombine/scalarize-binop.ll b/llvm/test/Transforms/VectorCombine/scalarize-binop.ll
new file mode 100644
index 0000000000000..0bc3826e73902
--- /dev/null
+++ b/llvm/test/Transforms/VectorCombine/scalarize-binop.ll
@@ -0,0 +1,94 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+; RUN: opt -S -passes=vector-combine < %s | FileCheck %s
+
+define <4 x i32> @add_v4i32(i32 %x, i32 %y) {
+; CHECK-LABEL: define <4 x i32> @add_v4i32(
+; CHECK-SAME: i32 [[X:%.*]], i32 [[Y:%.*]]) {
+; CHECK-NEXT: [[RES_SCALAR:%.*]] = add i32 [[X]], 42
+; CHECK-NEXT: [[X_HEAD:%.*]] = insertelement <4 x i32> poison, i32 [[RES_SCALAR]], i64 0
+; CHECK-NEXT: [[X_SPLAT:%.*]] = shufflevector <4 x i32> [[X_HEAD]], <4 x i32> poison, <4 x i32> zeroinitializer
+; CHECK-NEXT: ret <4 x i32> [[X_SPLAT]]
+;
+ %x.head = insertelement <4 x i32> poison, i32 %x, i32 0
+ %x.splat = shufflevector <4 x i32> %x.head, <4 x i32> poison, <4 x i32> zeroinitializer
+ %res = add <4 x i32> %x.splat, splat (i32 42)
+ ret <4 x i32> %res
+}
+
+define <vscale x 4 x i32> @add_nxv4i32(i32 %x, i32 %y) {
+; CHECK-LABEL: define <vscale x 4 x i32> @add_nxv4i32(
+; CHECK-SAME: i32 [[X:%.*]], i32 [[Y:%.*]]) {
+; CHECK-NEXT: [[RES_SCALAR:%.*]] = add i32 [[X]], 42
+; CHECK-NEXT: [[Y_HEAD1:%.*]] = insertelement <vscale x 4 x i32> poison, i32 [[RES_SCALAR]], i64 0
+; CHECK-NEXT: [[Y_SPLAT1:%.*]] = shufflevector <vscale x 4 x i32> [[Y_HEAD1]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
+; CHECK-NEXT: ret <vscale x 4 x i32> [[Y_SPLAT1]]
+;
+ %x.head = insertelement <vscale x 4 x i32> poison, i32 %x, i32 0
+ %x.splat = shufflevector <vscale x 4 x i32> %x.head, <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
+ %res = add <vscale x 4 x i32> %x.splat, splat (i32 42)
+ ret <vscale x 4 x i32> %res
+}
+
+; Make sure that we can scalarize sequences of vector instructions.
+define <4 x i32> @add_mul_v4i32(i32 %x, i32 %y, i32 %z) {
+; CHECK-LABEL: define <4 x i32> @add_mul_v4i32(
+; CHECK-SAME: i32 [[X:%.*]], i32 [[Y:%.*]], i32 [[Z:%.*]]) {
+; CHECK-NEXT: [[RES0_SCALAR:%.*]] = add i32 [[X]], 42
+; CHECK-NEXT: [[RES1_SCALAR:%.*]] = mul i32 [[RES0_SCALAR]], 42
+; CHECK-NEXT: [[Z_HEAD1:%.*]] = insertelement <4 x i32> poison, i32 [[RES1_SCALAR]], i64 0
+; CHECK-NEXT: [[Z_SPLAT1:%.*]] = shufflevector <4 x i32> [[Z_HEAD1]], <4 x i32> poison, <4 x i32> zeroinitializer
+; CHECK-NEXT: ret <4 x i32> [[Z_SPLAT1]]
+;
+ %x.head = insertelement <4 x i32> poison, i32 %x, i32 0
+ %x.splat = shufflevector <4 x i32> %x.head, <4 x i32> poison, <4 x i32> zeroinitializer
+ %res0 = add <4 x i32> %x.splat, splat (i32 42)
+ %res1 = mul <4 x i32> %res0, splat (i32 42)
+ ret <4 x i32> %res1
+}
+
+; Shouldn't be scalarized since %x.splat and %y.splat have other users.
+define <4 x i32> @other_users_v4i32(i32 %x, i32 %y, ptr %p, ptr %q) {
+; CHECK-LABEL: define <4 x i32> @other_users_v4i32(
+; CHECK-SAME: i32 [[X:%.*]], i32 [[Y:%.*]], ptr [[P:%.*]], ptr [[Q:%.*]]) {
+; CHECK-NEXT: [[DOTSPLATINSERT:%.*]] = insertelement <4 x i32> poison, i32 [[X]], i32 0
+; CHECK-NEXT: [[RES:%.*]] = shufflevector <4 x i32> [[DOTSPLATINSERT]], <4 x i32> poison, <4 x i32> zeroinitializer
+; CHECK-NEXT: [[RES1:%.*]] = add <4 x i32> [[RES]], splat (i32 42)
+; CHECK-NEXT: store <4 x i32> [[RES]], ptr [[P]], align 16
+; CHECK-NEXT: store <4 x i32> [[RES]], ptr [[Q]], align 16
+; CHECK-NEXT: ret <4 x i32> [[RES1]]
+;
+ %x.head = insertelement <4 x i32> poison, i32 %x, i32 0
+ %x.splat = shufflevector <4 x i32> %x.head, <4 x i32> poison, <4 x i32> zeroinitializer
+ %res = add <4 x i32> %x.splat, splat (i32 42)
+ store <4 x i32> %x.splat, ptr %p
+ store <4 x i32> %x.splat, ptr %q
+ ret <4 x i32> %res
+}
+
+define <4 x i1> @icmp_v4i32(i32 %x, i32 %y) {
+; CHECK-LABEL: define <4 x i1> @icmp_v4i32(
+; CHECK-SAME: i32 [[X:%.*]], i32 [[Y:%.*]]) {
+; CHECK-NEXT: [[RES_SCALAR:%.*]] = icmp eq i32 [[X]], 42
+; CHECK-NEXT: [[DOTSPLATINSERT:%.*]] = insertelement <4 x i1> poison, i1 [[RES_SCALAR]], i64 0
+; CHECK-NEXT: [[RES:%.*]] = shufflevector <4 x i1> [[DOTSPLATINSERT]], <4 x i1> poison, <4 x i32> zeroinitializer
+; CHECK-NEXT: ret <4 x i1> [[RES]]
+;
+ %x.head = insertelement <4 x i32> poison, i32 %x, i32 0
+ %x.splat = shufflevector <4 x i32> %x.head, <4 x i32> poison, <4 x i32> zeroinitializer
+ %res = icmp eq <4 x i32> %x.splat, splat (i32 42)
+ ret <4 x i1> %res
+}
+
+define <vscale x 4 x i1> @icmp_nxv4i32(i32 %x, i32 %y) {
+; CHECK-LABEL: define <vscale x 4 x i1> @icmp_nxv4i32(
+; CHECK-SAME: i32 [[X:%.*]], i32 [[Y:%.*]]) {
+; CHECK-NEXT: [[RES_SCALAR:%.*]] = icmp eq i32 [[X]], 42
+; CHECK-NEXT: [[DOTSPLATINSERT:%.*]] = insertelement <vscale x 4 x i1> poison, i1 [[RES_SCALAR]], i64 0
+; CHECK-NEXT: [[RES:%.*]] = shufflevector <vscale x 4 x i1> [[DOTSPLATINSERT]], <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer
+; CHECK-NEXT: ret <vscale x 4 x i1> [[RES]]
+;
+ %x.head = insertelement <vscale x 4 x i32> poison, i32 %x, i32 0
+ %x.splat = shufflevector <vscale x 4 x i32> %x.head, <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
+ %res = icmp eq <vscale x 4 x i32> %x.splat, splat (i32 42)
+ ret <vscale x 4 x i1> %res
+}
I'm a bit surprised the new pattern you proposed here wasn't created first; maybe it's because historically splats have been considered cheap?
```cpp
        : Builder.CreateBinOp((Instruction::BinaryOps)Opcode,
                              VecC0, VecC1);
    Result = Builder.CreateInsertElement(NewVecC, Scalar, *Index);
  } else
```
matching curly braces
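For illustration, the braced form this comment asks for might look like the following sketch, built from the hunk quoted above (not committed code):

```cpp
// Sketch: give the `else` arm braces to match the braced `if` arm.
if (Index) {
  // Fold the vector constants in the original vectors into a new base
  // vector, then insert the scalar result at the matched index.
  Value *NewVecC = IsCmp ? Builder.CreateCmp(Pred, VecC0, VecC1)
                         : Builder.CreateBinOp((Instruction::BinaryOps)Opcode,
                                               VecC0, VecC1);
  Result = Builder.CreateInsertElement(NewVecC, Scalar, *Index);
} else {
  // Both operands were splats: splat the scalar result instead.
  Result = Builder.CreateVectorSplat(VecTy->getElementCount(), Scalar);
}
```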
Couple of high level comments:

Honestly, I think this pushes things in the direction of a new transform, but I'm open to being convinced I'm wrong there. Note that scalarizeVPIntrinsic does exactly the proposed transform, but only for VP. I don't see obvious code sharing there, but maybe there's something? If you do want to take the single transform route, please make sure you update the block comments.

This looks like a missing case in the shuffle canonicalization done by InstCombine in foldVectorBinop. We should be able to convert this to a form where the binop is applied before the splat shuffle. In fact, I'm confused why the block which starts with "If one argument is a shuffle within one vector and the other is a constant" doesn't transform this case. Now, I'm not really sure that's a good canonicalization overall (since it breaks the splat recognition), but why isn't the instcombine code able to handle this?
```cpp
    uint64_t Index0 = 0, Index1 = 0;
    if (!match(Ins0, m_InsertElt(m_Constant(VecC0), m_Value(V0),
                                 m_ConstantInt(Index0))) &&
        !match(Ins0, m_Constant(VecC0)))
```
It looks like the prior code did handle constant splats here. I think most of the m_Constant dependent bits become dead with your change? Oh, this is because of the coupled condition noted just above.
```cpp
  // vec_op (splat V0), (splat V1)
  V0 = getSplatValue(Ins0);
  V1 = getSplatValue(Ins1);
  if (!V0 || !V1) {
```
I think you're going to need this structured as: if (!V0) { do something; } if (!V1) { do something; } if (indices don't match) { bail; }
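A sketch of the control flow being suggested (hypothetical shape, reusing the names from the hunk above; not the committed code):

```cpp
// Sketch: fall back to the insertelement match per operand, rather
// than only when both splat matches fail, then reconcile the indices.
uint64_t Index0 = 0, Index1 = 0;
Value *V0 = getSplatValue(Ins0);
Value *V1 = getSplatValue(Ins1);
if (!V0) {
  // Match Ins0 as (inselt VecC0, V0, Index0) or a plain constant,
  // setting Index0 on success.
}
if (!V1) {
  // Match Ins1 as (inselt VecC1, V1, Index1) or a plain constant,
  // setting Index1 on success.
}
if (Index0 != Index1)
  return false; // Bail when the indices don't match.
```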
```cpp
                            (IsConst0 ? 0 : !Ins0->hasOneUse() * InsertCost) +
                            (IsConst1 ? 0 : !Ins1->hasOneUse() * InsertCost);
      Instruction::InsertElement, VecTy, CostKind, Index.value_or(0));
  InstructionCost OldCost = (isa<Constant>(Ins0) ? 0 : InsertCost) +
```
The costing here looks suspect. In particular, I would assume that (splat x) + (splat y) needs to be costed differently than (insertelt undef, x, 0) + (insertelt undef, y, 0). Consider the case where this is an m8 vector type.
Just noting that as far as I can tell we don't have a "splat from scalar" cost hook, which is why I originally kept it as just an insert. I think the technically correct cost would be an insert element + broadcast shuffle. But that would overcost it on RISC-V I presume?
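A sketch of what the "insert element + broadcast shuffle" costing mentioned here could look like (hypothetical; exact TTI signatures vary between LLVM versions, and the patch itself just reuses the insert cost):

```cpp
// Sketch: cost the splat as an insertelement at lane 0 plus a
// broadcast shuffle, rather than as a lone insertelement.
InstructionCost SplatCost =
    TTI.getVectorInstrCost(Instruction::InsertElement, VecTy, CostKind,
                           /*Index=*/0) +
    TTI.getShuffleCost(TargetTransformInfo::SK_Broadcast, VecTy,
                       /*Mask=*/{}, CostKind);
```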
```cpp
  InstructionCost OldCost = (isa<Constant>(Ins0) ? 0 : InsertCost) +
                            (isa<Constant>(Ins1) ? 0 : InsertCost) +
                            VectorOpCost;
  InstructionCost NewCost =
```
Same basic costing problem with the NewCost too.
I think the reason is relatively subtle: because that rule currently matches the constant with …
Yeah, and on top of that I agree that cramming two patterns into one transform is less than ideal. I really want to avoid making a separate transform, since to handle the actual cases I'm seeing in the wild we need to extend scalarization to not just handle bin ops, but unary ops and intrinsics too, potentially with 3 operands.

It actually looks like the canonicalization only fails on scalable vectors. E.g.

```llvm
define <4 x i32> @fixed(i32 %x) {
  %x.insert = insertelement <4 x i32> poison, i32 %x, i64 0
  %x.splat = shufflevector <4 x i32> %x.insert, <4 x i32> poison, <4 x i32> zeroinitializer
  %res = add <4 x i32> %x.splat, splat (i32 42)
  ret <4 x i32> %res
}
```

will get canonicalized to this:

```llvm
define <4 x i32> @fixed(i32 %x) {
  %x.insert = insertelement <4 x i32> poison, i32 %x, i64 0
  %1 = add <4 x i32> %x.insert, <i32 42, i32 poison, i32 poison, i32 poison>
  %res = shufflevector <4 x i32> %1, <4 x i32> poison, <4 x i32> zeroinitializer
  ret <4 x i32> %res
}
```

I will try canonicalizing scalable vectors this way in InstCombine and see what breaks. If that works then we don't need this PR.
Thanks for the reviews, I've opened up #137948 to fix this from the InstCombine side, which I think is a better way to solve it. The codegen diff on RISC-V from that PR is actually a lot more minor than I expected, and I didn't notice any missed splat patterns. The only thing is that it also needs #137823 to make sure that everything still gets scalarized. Otherwise it subsumes the improvements seen from this PR, so I'll close this now.
Given a binary op on a splatted vector and a splatted constant, InstCombine will normally pull the shuffle out in `InstCombinerImpl::foldVectorBinop`:

```llvm
define <4 x i32> @f(i32 %x) {
  %x.insert = insertelement <4 x i32> poison, i32 %x, i64 0
  %x.splat = shufflevector <4 x i32> %x.insert, <4 x i32> poison, <4 x i32> zeroinitializer
  %res = add <4 x i32> %x.splat, splat (i32 42)
  ret <4 x i32> %res
}
```

```llvm
define <4 x i32> @f(i32 %x) {
  %x.insert = insertelement <4 x i32> poison, i32 %x, i64 0
  %1 = add <4 x i32> %x.insert, <i32 42, i32 poison, i32 poison, i32 poison>
  %res = shufflevector <4 x i32> %1, <4 x i32> poison, <4 x i32> zeroinitializer
  ret <4 x i32> %res
}
```

However, this currently only operates on fixed-length vectors. Splats of scalable vectors don't currently have their shuffle pulled out, e.g.:

```llvm
define <vscale x 4 x i32> @f(i32 %x) {
  %x.insert = insertelement <vscale x 4 x i32> poison, i32 %x, i64 0
  %x.splat = shufflevector <vscale x 4 x i32> %x.insert, <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
  %res = add <vscale x 4 x i32> %x.splat, splat (i32 42)
  ret <vscale x 4 x i32> %res
}
```

Having this canonical form with the shuffle pulled out is important, as VectorCombine relies on it in order to scalarize binary ops in `scalarizeBinopOrCmp`, which would prevent the need for #137786. This also brings it in line with scalable binary ops that have two non-constant operands: https://godbolt.org/z/M9f7ebzca

This adds a combine just after the fixed-length version, but restricted to splats at index 0 so that it also handles the scalable case. So whilst the existing combine looks like:

`Op(shuffle(V1, Mask), C) -> shuffle(Op(V1, NewC), Mask)`

This patch adds:

`Op(shuffle(V1, 0), (splat C)) -> shuffle(Op(V1, (splat C)), 0)`

I think this could be generalized to other splat indexes that aren't zero, but I think it would be dead code, since only fixed-length vectors can have non-zero shuffle indices, which would be covered by the existing combine.
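A sketch of the pattern such a combine would need to match (hypothetical code and names, not the committed #137948 patch):

```cpp
// Sketch: match Op(shuffle(V1, 0), splat C) so the binop can be moved
// before the zero-mask (splat) shuffle.
Value *V1;
Constant *SplatC;
if (match(&Inst,
          m_c_BinOp(m_Shuffle(m_Value(V1), m_Poison(), m_ZeroMask()),
                    m_ImmConstant(SplatC))) &&
    SplatC->getSplatValue()) {
  // Create Op(V1, SplatC) and then splat lane 0 of the result; the
  // poison lanes of V1 don't matter after the final zero-mask shuffle.
}
```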
Currently VectorCombine can scalarize vector compares and binary ops. This extends it to also scalarize binary-op-like intrinsics such as umax, minnum, etc.

The motivation behind this is to scalarize more intrinsics in VectorCombine rather than in DAGCombine, so we can sink splats across basic blocks: see #137786

This currently has very little effect on generated code because InstCombine doesn't yet canonicalize binary intrinsics where one operand is a constant into the form that VectorCombine expects, i.e. `binop (shuffle insert) const --> shuffle (binop insert const)`. The plan is to land this first and then in a subsequent patch teach InstCombine to do the canonicalization to avoid regressions in the meantime.

This uses `isTriviallyVectorizable` to determine whether or not an intrinsic is safe to scalarize. There's also `isTriviallyScalarizable`, but this seems more geared towards the Scalarizer pass and includes intrinsics with multiple return values.

It also only handles intrinsics with two operands with the same type as the return type. In the future we would generalize this to handle arbitrary numbers of operands, including unary operators too, e.g. fneg or fma, as well as different operand types, e.g. powi or scmp
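The safety check described above might look roughly like this (a sketch under the stated constraints, not the committed code):

```cpp
// Sketch: only scalarize intrinsics that are trivially vectorizable
// and whose two operands have the same type as the return type.
if (auto *II = dyn_cast<IntrinsicInst>(&I)) {
  if (!isTriviallyVectorizable(II->getIntrinsicID()))
    return false;
  if (II->arg_size() != 2 ||
      II->getArgOperand(0)->getType() != II->getType() ||
      II->getArgOperand(1)->getType() != II->getType())
    return false;
}
```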
…Cmp (#138406)

This adds support for unary operands, and unary + ternary intrinsics in scalarizeOpOrCmp (FKA scalarizeBinOpOrCmp).

The motivation behind this is to scalarize more intrinsics in VectorCombine rather than in DAGCombine, so we can sink splats across basic blocks: see #137786

The main change required is to generalize the existing VecC0/VecC1 rules across n-ary ops:

- An operand can either be a constant vector or an insert of a scalar into a constant vector
- If it's an insert, the index needs to be static and in bounds
- If it's an insert, all indices need to be the same across all operands
- If all the operands are constant vectors, bail as it will get constant folded anyway
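A rough sketch of the generalized per-operand rules in the list above (hypothetical shape, not the committed #138406 code):

```cpp
// Sketch: validate each operand of the n-ary op against the rules.
std::optional<uint64_t> Index;
for (Value *Op : I.operands()) {
  Constant *VecC;
  Value *Scalar;
  uint64_t Idx;
  if (match(Op, m_Constant(VecC)))
    continue; // A constant vector operand is always fine.
  if (!match(Op, m_InsertElt(m_Constant(VecC), m_Value(Scalar),
                             m_ConstantInt(Idx))))
    return false; // Neither a constant nor an insert of a scalar.
  // The insert index must be static, in bounds, and agree across all
  // non-constant operands.
  auto *VecTy = cast<VectorType>(Op->getType());
  if (Idx >= VecTy->getElementCount().getKnownMinValue())
    return false;
  if (Index && *Index != Idx)
    return false;
  Index = Idx;
}
if (!Index)
  return false; // All operands constant: constant folding handles it.
```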
Consider this C loop, simplified from a case in SPEC CPU 2017:

```c
void f(int *x, int *y) {
  for (int i = 0; i < 1024; i++)
    x[i] += *y * 2;
}
```

On RISC-V we will currently generate this loop. We could avoid the splat `vmv.v.x v8, a5` by using `vadd.vx v10, v10, a5`, but this doesn't happen today:

- `x` and `y` aren't `noalias`, so LICM can't initially hoist out the `*y * 2`.
- The loop is vectorized, so the `*y * 2` is now vectorized.
- LICM can then hoist the vectorized `*y * 2` into the preheader.
- DAGCombiner scalarizes the `*y * 2` once it detects it is a splat. But this is too late, as we can no longer sink the splat into the loop body, so we can't match `vadd.vx`.

However DAGCombiner isn't the only place that can scalarize bin ops of splats. `VectorCombine::scalarizeBinopOrCmp` can also scalarize bin ops, and because it's before CodeGenPrepare this allows the splat to be sunk into the loop body.

However it only matches the canonical form that InstCombine produces for two non-constant splatted operands, e.g. `x[i] += *y * *z`:

```llvm
%x.insert = insertelement <vscale x 4 x i32> poison, i32 %x, i64 0
%y.insert = insertelement <vscale x 4 x i32> poison, i32 %y, i64 0
%0 = add <vscale x 4 x i32> %x.insert, %y.insert
%res = shufflevector <vscale x 4 x i32> %0, <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
```

If one operand is constant, i.e. `x[i] += *y * 2`, then the IR will be in the form:

```llvm
%x.insert = insertelement <vscale x 4 x i32> poison, i32 %x, i64 0
%x.splat = shufflevector <vscale x 4 x i32> %x.insert, <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
%res = add <vscale x 4 x i32> %x.splat, splat (i32 42)
```

This patch teaches VectorCombine to handle this case, and in doing so allows the splat to be avoided on RISC-V.

I initially created a separate transform for this, but there ended up being a lot of duplicated code, so I've tried my best to reuse the existing transform which operates on insertelements.

Since there doesn't seem to be a dedicated "splat from scalar" cost hook I've just reused the insertelement cost, which is hopefully close enough.