[RISCV][DAGCombiner] Fix potential missed combine in VL->VW extension #168026
@llvm/pr-subscribers-backend-risc-v

Author: None (OMG-link)

Changes

Problem: The previous implementation of combineOp_VLToVWOp_VL maintained a private worklist to fold multiple nodes, but did not inform the external DAGCombiner. This could lead to missed combines, because the combine logic for some users of the updated nodes lives outside this function.

Changes:
- Simplify combineOp_VLToVWOp_VL.
- Remove the obsolete command-line option from tests: -ext-max-web-size was only used to limit the size of the private worklist, is now unused, and has been deleted. The RUN lines containing it were removed from fixed-vectors-vfw-web-simplification.ll, fixed-vectors-vw-web-simplification.ll, and vscale-vw-web-simplification.ll, keeping the remaining test coverage.
- Add a new test: combine-vl-vw-macc.ll verifies that consecutive vwmacc-like operations are correctly generated.

Behavioral Improvement

A minimal reproducer (combine-vl-vw-macc.ll) is included in the test suite. Before this patch, the DAG produces:
- vwmacc.vx
- vwmul.vx
- vadd.vv

After this patch, both mul+add chains are folded into:
- vwmacc.vx

Verification:
- Ran ninja check-llvm-codegen-riscv and confirmed the new test passes.
- Confirmed the reproducer now emits vwmacc.vx for both chains.
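To make the before/after concrete, here is a condensed sketch of the reproducer pattern (function and value names are shortened for illustration; the committed combine-vl-vw-macc.ll test in the diff below is the authoritative version). One sign-extended vector is shared by two mul+add chains, and both should become vwmacc.vx:

; Condensed sketch of the reproducer: one sext with two mul+add users.
define void @two_macc_chains(<32 x i8> %v, i8 %a, i8 %b, <32 x i16> %acc0, <32 x i16> %acc1, ptr %p0, ptr %p1) {
  %wide = sext <32 x i8> %v to <32 x i16>      ; shared extend, two users
  %a16 = sext i8 %a to i16
  %ins0 = insertelement <32 x i16> poison, i16 %a16, i32 0
  %splat0 = shufflevector <32 x i16> %ins0, <32 x i16> poison, <32 x i32> zeroinitializer
  %mul0 = mul <32 x i16> %splat0, %wide
  %add0 = add <32 x i16> %mul0, %acc0          ; chain 0 -> vwmacc.vx
  %b16 = sext i8 %b to i16
  %ins1 = insertelement <32 x i16> poison, i16 %b16, i32 0
  %splat1 = shufflevector <32 x i16> %ins1, <32 x i16> poison, <32 x i32> zeroinitializer
  %mul1 = mul <32 x i16> %splat1, %wide
  %add1 = add <32 x i16> %mul1, %acc1          ; chain 1: previously vwmul.vx + vadd.vv
  store <32 x i16> %add0, ptr %p0
  store <32 x i16> %add1, ptr %p1
  ret void
}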
Patch is 51.57 KiB, truncated to 20.00 KiB below; full version: https://github.com/llvm/llvm-project/pull/168026.diff

6 Files Affected:
diff --git a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
index 38cce26e44af4..ae29b39d5b675 100644
--- a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
+++ b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
@@ -57,12 +57,6 @@ using namespace llvm;
STATISTIC(NumTailCalls, "Number of tail calls");
-static cl::opt<unsigned> ExtensionMaxWebSize(
- DEBUG_TYPE "-ext-max-web-size", cl::Hidden,
- cl::desc("Give the maximum size (in number of nodes) of the web of "
- "instructions that we will consider for VW expansion"),
- cl::init(18));
-
static cl::opt<bool>
AllowSplatInVW_W(DEBUG_TYPE "-form-vw-w-with-splat", cl::Hidden,
cl::desc("Allow the formation of VW_W operations (e.g., "
@@ -18201,109 +18195,30 @@ static SDValue combineOp_VLToVWOp_VL(SDNode *N,
if (!NodeExtensionHelper::isSupportedRoot(N, Subtarget))
return SDValue();
- SmallVector<SDNode *> Worklist;
- SmallPtrSet<SDNode *, 8> Inserted;
- SmallPtrSet<SDNode *, 8> ExtensionsToRemove;
- Worklist.push_back(N);
- Inserted.insert(N);
- SmallVector<CombineResult> CombinesToApply;
-
- while (!Worklist.empty()) {
- SDNode *Root = Worklist.pop_back_val();
-
- NodeExtensionHelper LHS(Root, 0, DAG, Subtarget);
- NodeExtensionHelper RHS(Root, 1, DAG, Subtarget);
- auto AppendUsersIfNeeded =
- [&Worklist, &Subtarget, &Inserted,
- &ExtensionsToRemove](const NodeExtensionHelper &Op) {
- if (Op.needToPromoteOtherUsers()) {
- // Remember that we're supposed to remove this extension.
- ExtensionsToRemove.insert(Op.OrigOperand.getNode());
- for (SDUse &Use : Op.OrigOperand->uses()) {
- SDNode *TheUser = Use.getUser();
- if (!NodeExtensionHelper::isSupportedRoot(TheUser, Subtarget))
- return false;
- // We only support the first 2 operands of FMA.
- if (Use.getOperandNo() >= 2)
- return false;
- if (Inserted.insert(TheUser).second)
- Worklist.push_back(TheUser);
- }
- }
- return true;
- };
+ NodeExtensionHelper LHS(N, 0, DAG, Subtarget);
+ NodeExtensionHelper RHS(N, 1, DAG, Subtarget);
- // Control the compile time by limiting the number of node we look at in
- // total.
- if (Inserted.size() > ExtensionMaxWebSize)
- return SDValue();
+ SmallVector<NodeExtensionHelper::CombineToTry> FoldingStrategies =
+ NodeExtensionHelper::getSupportedFoldings(N);
- SmallVector<NodeExtensionHelper::CombineToTry> FoldingStrategies =
- NodeExtensionHelper::getSupportedFoldings(Root);
-
- assert(!FoldingStrategies.empty() && "Nothing to be folded");
- bool Matched = false;
- for (int Attempt = 0;
- (Attempt != 1 + NodeExtensionHelper::isCommutative(Root)) && !Matched;
- ++Attempt) {
-
- for (NodeExtensionHelper::CombineToTry FoldingStrategy :
- FoldingStrategies) {
- std::optional<CombineResult> Res =
- FoldingStrategy(Root, LHS, RHS, DAG, Subtarget);
- if (Res) {
- // If this strategy wouldn't remove an extension we're supposed to
- // remove, reject it.
- if (!Res->LHSExt.has_value() &&
- ExtensionsToRemove.contains(LHS.OrigOperand.getNode()))
- continue;
- if (!Res->RHSExt.has_value() &&
- ExtensionsToRemove.contains(RHS.OrigOperand.getNode()))
- continue;
+ if (FoldingStrategies.empty())
+ return SDValue();
- Matched = true;
- CombinesToApply.push_back(*Res);
- // All the inputs that are extended need to be folded, otherwise
- // we would be leaving the old input (since it is may still be used),
- // and the new one.
- if (Res->LHSExt.has_value())
- if (!AppendUsersIfNeeded(LHS))
- return SDValue();
- if (Res->RHSExt.has_value())
- if (!AppendUsersIfNeeded(RHS))
- return SDValue();
- break;
- }
+ bool IsComm = NodeExtensionHelper::isCommutative(N);
+ for (int Attempt = 0; Attempt != 1 + IsComm; ++Attempt) {
+ for (NodeExtensionHelper::CombineToTry FoldingStrategy :
+ FoldingStrategies) {
+ std::optional<CombineResult> Res =
+ FoldingStrategy(N, LHS, RHS, DAG, Subtarget);
+ if (Res) {
+ SDValue NewValue = Res->materialize(DAG, Subtarget);
+ return NewValue;
}
- std::swap(LHS, RHS);
- }
- // Right now we do an all or nothing approach.
- if (!Matched)
- return SDValue();
- }
- // Store the value for the replacement of the input node separately.
- SDValue InputRootReplacement;
- // We do the RAUW after we materialize all the combines, because some replaced
- // nodes may be feeding some of the yet-to-be-replaced nodes. Put differently,
- // some of these nodes may appear in the NodeExtensionHelpers of some of the
- // yet-to-be-visited CombinesToApply roots.
- SmallVector<std::pair<SDValue, SDValue>> ValuesToReplace;
- ValuesToReplace.reserve(CombinesToApply.size());
- for (CombineResult Res : CombinesToApply) {
- SDValue NewValue = Res.materialize(DAG, Subtarget);
- if (!InputRootReplacement) {
- assert(Res.Root == N &&
- "First element is expected to be the current node");
- InputRootReplacement = NewValue;
- } else {
- ValuesToReplace.emplace_back(SDValue(Res.Root, 0), NewValue);
}
+ std::swap(LHS, RHS);
}
- for (std::pair<SDValue, SDValue> OldNewValues : ValuesToReplace) {
- DAG.ReplaceAllUsesOfValueWith(OldNewValues.first, OldNewValues.second);
- DCI.AddToWorklist(OldNewValues.second.getNode());
- }
- return InputRootReplacement;
+
+ return SDValue();
}
// Fold (vwadd(u).wv y, (vmerge cond, x, 0)) -> vwadd(u).wv y, x, y, cond
diff --git a/llvm/test/CodeGen/RISCV/rvv/combine-vl-vw-macc.ll b/llvm/test/CodeGen/RISCV/rvv/combine-vl-vw-macc.ll
new file mode 100644
index 0000000000000..6c179e4f1472c
--- /dev/null
+++ b/llvm/test/CodeGen/RISCV/rvv/combine-vl-vw-macc.ll
@@ -0,0 +1,54 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
+; RUN: llc -mtriple=riscv32 -mattr=+v -verify-machineinstrs < %s | FileCheck %s --check-prefixes=CHECK,RV32
+; RUN: llc -mtriple=riscv64 -mattr=+v -verify-machineinstrs < %s | FileCheck %s --check-prefixes=CHECK,RV64
+
+define void @matmul_min(<32 x i8>* %vptr, i8* %scalars, <32 x i16>* %acc0_ptr, <32 x i16>* %acc1_ptr) {
+; CHECK-LABEL: matmul_min:
+; CHECK: # %bb.0: # %entry
+; CHECK-NEXT: li a4, 64
+; CHECK-NEXT: li a5, 32
+; CHECK-NEXT: vsetvli zero, a5, e8, m2, ta, ma
+; CHECK-NEXT: vle8.v v16, (a0)
+; CHECK-NEXT: lb a0, 0(a1)
+; CHECK-NEXT: lb a1, 1(a1)
+; CHECK-NEXT: vsetvli zero, a4, e8, m4, ta, ma
+; CHECK-NEXT: vle8.v v8, (a2)
+; CHECK-NEXT: vle8.v v12, (a3)
+; CHECK-NEXT: vsetvli zero, a5, e8, m2, ta, ma
+; CHECK-NEXT: vwmacc.vx v8, a0, v16
+; CHECK-NEXT: vwmacc.vx v12, a1, v16
+; CHECK-NEXT: vsetvli zero, a4, e8, m4, ta, ma
+; CHECK-NEXT: vse8.v v8, (a2)
+; CHECK-NEXT: vse8.v v12, (a3)
+; CHECK-NEXT: ret
+entry:
+ %acc0 = load <32 x i16>, <32 x i16>* %acc0_ptr, align 1
+ %acc1 = load <32 x i16>, <32 x i16>* %acc1_ptr, align 1
+
+ %v8 = load <32 x i8>, <32 x i8>* %vptr, align 1
+ %v16 = sext <32 x i8> %v8 to <32 x i16>
+
+ %s0_ptr = getelementptr i8, i8* %scalars, i32 0
+ %s0_i8 = load i8, i8* %s0_ptr, align 1
+ %s0_i16 = sext i8 %s0_i8 to i16
+ %tmp0 = insertelement <32 x i16> undef, i16 %s0_i16, i32 0
+ %splat0 = shufflevector <32 x i16> %tmp0, <32 x i16> undef, <32 x i32> zeroinitializer
+ %mul0 = mul <32 x i16> %splat0, %v16
+ %add0 = add <32 x i16> %mul0, %acc0
+
+ %s1_ptr = getelementptr i8, i8* %scalars, i32 1
+ %s1_i8 = load i8, i8* %s1_ptr, align 1
+ %s1_i16 = sext i8 %s1_i8 to i16
+ %tmp1 = insertelement <32 x i16> undef, i16 %s1_i16, i32 0
+ %splat1 = shufflevector <32 x i16> %tmp1, <32 x i16> undef, <32 x i32> zeroinitializer
+ %mul1 = mul <32 x i16> %splat1, %v16
+ %add1 = add <32 x i16> %mul1, %acc1
+
+ store <32 x i16> %add0, <32 x i16>* %acc0_ptr, align 1
+ store <32 x i16> %add1, <32 x i16>* %acc1_ptr, align 1
+
+ ret void
+}
+;; NOTE: These prefixes are unused and the list is autogenerated. Do not add tests below this line:
+; RV32: {{.*}}
+; RV64: {{.*}}
diff --git a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-vfw-web-simplification.ll b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-vfw-web-simplification.ll
deleted file mode 100644
index 0561ee9addc7b..0000000000000
--- a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-vfw-web-simplification.ll
+++ /dev/null
@@ -1,238 +0,0 @@
-; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -mtriple=riscv64 -mattr=+v,+zvfh,+f,+d -verify-machineinstrs %s -o - --riscv-lower-ext-max-web-size=1 | FileCheck %s --check-prefixes=NO_FOLDING,NO_FOLDING1
-; RUN: llc -mtriple=riscv64 -mattr=+v,+zvfh,+f,+d -verify-machineinstrs %s -o - --riscv-lower-ext-max-web-size=2 | FileCheck %s --check-prefixes=NO_FOLDING,NO_FOLDING2
-; RUN: llc -mtriple=riscv64 -mattr=+v,+zvfh,+f,+d -verify-machineinstrs %s -o - --riscv-lower-ext-max-web-size=3 | FileCheck %s --check-prefixes=FOLDING,ZVFH
-; RUN: llc -mtriple=riscv64 -mattr=+v,+zvfhmin,+f,+d -verify-machineinstrs %s -o - --riscv-lower-ext-max-web-size=3 | FileCheck %s --check-prefixes=FOLDING,ZVFHMIN
-; Check that the default value enables the web folding and
-; that it is bigger than 3.
-; RUN: llc -mtriple=riscv64 -mattr=+v,+zvfh,+f,+d -verify-machineinstrs %s -o - | FileCheck %s --check-prefixes=FOLDING,ZVFH
-
-define void @vfwmul_v2f116_multiple_users(ptr %x, ptr %y, ptr %z, <2 x half> %a, <2 x half> %b, <2 x half> %b2) {
-; NO_FOLDING1-LABEL: vfwmul_v2f116_multiple_users:
-; NO_FOLDING1: # %bb.0:
-; NO_FOLDING1-NEXT: vsetivli zero, 2, e16, mf4, ta, ma
-; NO_FOLDING1-NEXT: vfwcvt.f.f.v v11, v8
-; NO_FOLDING1-NEXT: vfwcvt.f.f.v v8, v9
-; NO_FOLDING1-NEXT: vfwcvt.f.f.v v9, v10
-; NO_FOLDING1-NEXT: vsetvli zero, zero, e32, mf2, ta, ma
-; NO_FOLDING1-NEXT: vfmul.vv v10, v11, v8
-; NO_FOLDING1-NEXT: vfadd.vv v11, v11, v9
-; NO_FOLDING1-NEXT: vfsub.vv v8, v8, v9
-; NO_FOLDING1-NEXT: vse32.v v10, (a0)
-; NO_FOLDING1-NEXT: vse32.v v11, (a1)
-; NO_FOLDING1-NEXT: vse32.v v8, (a2)
-; NO_FOLDING1-NEXT: ret
-;
-; NO_FOLDING2-LABEL: vfwmul_v2f116_multiple_users:
-; NO_FOLDING2: # %bb.0:
-; NO_FOLDING2-NEXT: vsetivli zero, 2, e16, mf4, ta, ma
-; NO_FOLDING2-NEXT: vfwcvt.f.f.v v11, v8
-; NO_FOLDING2-NEXT: vfwcvt.f.f.v v8, v9
-; NO_FOLDING2-NEXT: vsetvli zero, zero, e32, mf2, ta, ma
-; NO_FOLDING2-NEXT: vfmul.vv v9, v11, v8
-; NO_FOLDING2-NEXT: vsetvli zero, zero, e16, mf4, ta, ma
-; NO_FOLDING2-NEXT: vfwadd.wv v11, v11, v10
-; NO_FOLDING2-NEXT: vfwsub.wv v8, v8, v10
-; NO_FOLDING2-NEXT: vse32.v v9, (a0)
-; NO_FOLDING2-NEXT: vse32.v v11, (a1)
-; NO_FOLDING2-NEXT: vse32.v v8, (a2)
-; NO_FOLDING2-NEXT: ret
-;
-; ZVFH-LABEL: vfwmul_v2f116_multiple_users:
-; ZVFH: # %bb.0:
-; ZVFH-NEXT: vsetivli zero, 2, e16, mf4, ta, ma
-; ZVFH-NEXT: vfwmul.vv v11, v8, v9
-; ZVFH-NEXT: vfwadd.vv v12, v8, v10
-; ZVFH-NEXT: vfwsub.vv v8, v9, v10
-; ZVFH-NEXT: vse32.v v11, (a0)
-; ZVFH-NEXT: vse32.v v12, (a1)
-; ZVFH-NEXT: vse32.v v8, (a2)
-; ZVFH-NEXT: ret
-;
-; ZVFHMIN-LABEL: vfwmul_v2f116_multiple_users:
-; ZVFHMIN: # %bb.0:
-; ZVFHMIN-NEXT: vsetivli zero, 2, e16, mf4, ta, ma
-; ZVFHMIN-NEXT: vfwcvt.f.f.v v11, v8
-; ZVFHMIN-NEXT: vfwcvt.f.f.v v8, v9
-; ZVFHMIN-NEXT: vfwcvt.f.f.v v9, v10
-; ZVFHMIN-NEXT: vsetvli zero, zero, e32, mf2, ta, ma
-; ZVFHMIN-NEXT: vfmul.vv v10, v11, v8
-; ZVFHMIN-NEXT: vfadd.vv v11, v11, v9
-; ZVFHMIN-NEXT: vfsub.vv v8, v8, v9
-; ZVFHMIN-NEXT: vse32.v v10, (a0)
-; ZVFHMIN-NEXT: vse32.v v11, (a1)
-; ZVFHMIN-NEXT: vse32.v v8, (a2)
-; ZVFHMIN-NEXT: ret
- %c = fpext <2 x half> %a to <2 x float>
- %d = fpext <2 x half> %b to <2 x float>
- %d2 = fpext <2 x half> %b2 to <2 x float>
- %e = fmul <2 x float> %c, %d
- %f = fadd <2 x float> %c, %d2
- %g = fsub <2 x float> %d, %d2
- store <2 x float> %e, ptr %x
- store <2 x float> %f, ptr %y
- store <2 x float> %g, ptr %z
- ret void
-}
-
-define void @vfwmul_v2f32_multiple_users(ptr %x, ptr %y, ptr %z, <2 x float> %a, <2 x float> %b, <2 x float> %b2) {
-; NO_FOLDING1-LABEL: vfwmul_v2f32_multiple_users:
-; NO_FOLDING1: # %bb.0:
-; NO_FOLDING1-NEXT: vsetivli zero, 2, e32, mf2, ta, ma
-; NO_FOLDING1-NEXT: vfwcvt.f.f.v v11, v8
-; NO_FOLDING1-NEXT: vfwcvt.f.f.v v8, v9
-; NO_FOLDING1-NEXT: vfwcvt.f.f.v v9, v10
-; NO_FOLDING1-NEXT: vsetvli zero, zero, e64, m1, ta, ma
-; NO_FOLDING1-NEXT: vfmul.vv v10, v11, v8
-; NO_FOLDING1-NEXT: vfadd.vv v11, v11, v9
-; NO_FOLDING1-NEXT: vfsub.vv v8, v8, v9
-; NO_FOLDING1-NEXT: vse64.v v10, (a0)
-; NO_FOLDING1-NEXT: vse64.v v11, (a1)
-; NO_FOLDING1-NEXT: vse64.v v8, (a2)
-; NO_FOLDING1-NEXT: ret
-;
-; NO_FOLDING2-LABEL: vfwmul_v2f32_multiple_users:
-; NO_FOLDING2: # %bb.0:
-; NO_FOLDING2-NEXT: vsetivli zero, 2, e32, mf2, ta, ma
-; NO_FOLDING2-NEXT: vfwcvt.f.f.v v11, v8
-; NO_FOLDING2-NEXT: vfwcvt.f.f.v v8, v9
-; NO_FOLDING2-NEXT: vsetvli zero, zero, e64, m1, ta, ma
-; NO_FOLDING2-NEXT: vfmul.vv v9, v11, v8
-; NO_FOLDING2-NEXT: vsetvli zero, zero, e32, mf2, ta, ma
-; NO_FOLDING2-NEXT: vfwadd.wv v11, v11, v10
-; NO_FOLDING2-NEXT: vfwsub.wv v8, v8, v10
-; NO_FOLDING2-NEXT: vse64.v v9, (a0)
-; NO_FOLDING2-NEXT: vse64.v v11, (a1)
-; NO_FOLDING2-NEXT: vse64.v v8, (a2)
-; NO_FOLDING2-NEXT: ret
-;
-; FOLDING-LABEL: vfwmul_v2f32_multiple_users:
-; FOLDING: # %bb.0:
-; FOLDING-NEXT: vsetivli zero, 2, e32, mf2, ta, ma
-; FOLDING-NEXT: vfwmul.vv v11, v8, v9
-; FOLDING-NEXT: vfwadd.vv v12, v8, v10
-; FOLDING-NEXT: vfwsub.vv v8, v9, v10
-; FOLDING-NEXT: vse64.v v11, (a0)
-; FOLDING-NEXT: vse64.v v12, (a1)
-; FOLDING-NEXT: vse64.v v8, (a2)
-; FOLDING-NEXT: ret
- %c = fpext <2 x float> %a to <2 x double>
- %d = fpext <2 x float> %b to <2 x double>
- %d2 = fpext <2 x float> %b2 to <2 x double>
- %e = fmul <2 x double> %c, %d
- %f = fadd <2 x double> %c, %d2
- %g = fsub <2 x double> %d, %d2
- store <2 x double> %e, ptr %x
- store <2 x double> %f, ptr %y
- store <2 x double> %g, ptr %z
- ret void
-}
-
-define void @vfwmacc_v2f32_multiple_users(ptr %x, ptr %y, ptr %z, <2 x float> %a, <2 x float> %b, <2 x float> %b2, <2 x double> %w) {
-; NO_FOLDING-LABEL: vfwmacc_v2f32_multiple_users:
-; NO_FOLDING: # %bb.0:
-; NO_FOLDING-NEXT: vsetivli zero, 2, e32, mf2, ta, ma
-; NO_FOLDING-NEXT: vfwcvt.f.f.v v12, v8
-; NO_FOLDING-NEXT: vfwcvt.f.f.v v8, v9
-; NO_FOLDING-NEXT: vfwcvt.f.f.v v9, v10
-; NO_FOLDING-NEXT: vsetvli zero, zero, e64, m1, ta, ma
-; NO_FOLDING-NEXT: vfmul.vv v10, v12, v8
-; NO_FOLDING-NEXT: vfmadd.vv v12, v9, v11
-; NO_FOLDING-NEXT: vfsub.vv v8, v8, v9
-; NO_FOLDING-NEXT: vse64.v v10, (a0)
-; NO_FOLDING-NEXT: vse64.v v12, (a1)
-; NO_FOLDING-NEXT: vse64.v v8, (a2)
-; NO_FOLDING-NEXT: ret
-;
-; FOLDING-LABEL: vfwmacc_v2f32_multiple_users:
-; FOLDING: # %bb.0:
-; FOLDING-NEXT: vsetivli zero, 2, e32, mf2, ta, ma
-; FOLDING-NEXT: vfwmul.vv v12, v8, v9
-; FOLDING-NEXT: vfwsub.vv v13, v9, v10
-; FOLDING-NEXT: vfwmacc.vv v11, v8, v10
-; FOLDING-NEXT: vse64.v v12, (a0)
-; FOLDING-NEXT: vse64.v v11, (a1)
-; FOLDING-NEXT: vse64.v v13, (a2)
-; FOLDING-NEXT: ret
- %c = fpext <2 x float> %a to <2 x double>
- %d = fpext <2 x float> %b to <2 x double>
- %d2 = fpext <2 x float> %b2 to <2 x double>
- %e = fmul <2 x double> %c, %d
- %f = call <2 x double> @llvm.fma(<2 x double> %c, <2 x double> %d2, <2 x double> %w)
- %g = fsub <2 x double> %d, %d2
- store <2 x double> %e, ptr %x
- store <2 x double> %f, ptr %y
- store <2 x double> %g, ptr %z
- ret void
-}
-
-; Negative test. We can't fold because the FMA addend is a user.
-define void @vfwmacc_v2f32_multiple_users_addend_user(ptr %x, ptr %y, ptr %z, <2 x float> %a, <2 x float> %b, <2 x float> %b2) {
-; NO_FOLDING-LABEL: vfwmacc_v2f32_multiple_users_addend_user:
-; NO_FOLDING: # %bb.0:
-; NO_FOLDING-NEXT: vsetivli zero, 2, e32, mf2, ta, ma
-; NO_FOLDING-NEXT: vfwcvt.f.f.v v11, v8
-; NO_FOLDING-NEXT: vfwcvt.f.f.v v8, v9
-; NO_FOLDING-NEXT: vfwcvt.f.f.v v9, v10
-; NO_FOLDING-NEXT: vsetvli zero, zero, e64, m1, ta, ma
-; NO_FOLDING-NEXT: vfmul.vv v10, v11, v8
-; NO_FOLDING-NEXT: vfmadd.vv v11, v9, v8
-; NO_FOLDING-NEXT: vfsub.vv v8, v8, v9
-; NO_FOLDING-NEXT: vse64.v v10, (a0)
-; NO_FOLDING-NEXT: vse64.v v11, (a1)
-; NO_FOLDING-NEXT: vse64.v v8, (a2)
-; NO_FOLDING-NEXT: ret
-;
-; FOLDING-LABEL: vfwmacc_v2f32_multiple_users_addend_user:
-; FOLDING: # %bb.0:
-; FOLDING-NEXT: vsetivli zero, 2, e32, mf2, ta, ma
-; FOLDING-NEXT: vfwcvt.f.f.v v11, v8
-; FOLDING-NEXT: vfwcvt.f.f.v v8, v9
-; FOLDING-NEXT: vfwcvt.f.f.v v9, v10
-; FOLDING-NEXT: vsetvli zero, zero, e64, m1, ta, ma
-; FOLDING-NEXT: vfmul.vv v10, v11, v8
-; FOLDING-NEXT: vfmadd.vv v11, v9, v8
-; FOLDING-NEXT: vfsub.vv v8, v8, v9
-; FOLDING-NEXT: vse64.v v10, (a0)
-; FOLDING-NEXT: vse64.v v11, (a1)
-; FOLDING-NEXT: vse64.v v8, (a2)
-; FOLDING-NEXT: ret
- %c = fpext <2 x float> %a to <2 x double>
- %d = fpext <2 x float> %b to <2 x double>
- %d2 = fpext <2 x float> %b2 to <2 x double>
- %e = fmul <2 x double> %c, %d
- %f = call <2 x double> @llvm.fma(<2 x double> %c, <2 x double> %d2, <2 x double> %d)
- %g = fsub <2 x double> %d, %d2
- store <2 x double> %e, ptr %x
- store <2 x double> %f, ptr %y
- store <2 x double> %g, ptr %z
- ret void
-}
-
-; Negative test. We can't fold because the FMA addend is a user.
-define void @vfwmacc_v2f32_addend_user(ptr %x, <2 x float> %a, <2 x float> %b) {
-; NO_FOLDING-LABEL: vfwmacc_v2f32_addend_user:
-; NO_FOLDING: # %bb.0:
-; NO_FOLDING-NEXT: vsetivli zero, 2, e32, mf2, ta, ma
-; NO_FOLDING-NEXT: vfwcvt.f.f.v v10, v8
-; NO_FOLDING-NEXT: vfwcvt.f.f.v v8, v9
-; NO_FOLDING-NEXT: vsetvli zero, zero, e64, m1, ta, ma
-; NO_FOLDING-NEXT: vfmadd.vv v8, v10, v8
-; NO_FOLDING-NEXT: vse64.v v8, (a0)
-; NO_FOLDING-NEXT: ret
-;
-; FOLDING-LABEL: vfwmacc_v2f32_addend_user:
-; FOLDING: # %bb.0:
-; FOLDING-NEXT: vsetivli zero, 2, e32, mf2, ta, ma
-; FOLDING-NEXT: vfwcvt.f.f.v v10, v8
-; FOLDING-NEXT: vfwcvt.f.f.v v8, v9
-; FOLDING-NEXT: vsetvli zero, zero, e64, m1, ta, ma
-; FOLDING-NEXT: vfmadd.vv v8, v10, v8
-; FOLDING-NEXT: vse64.v v8, (a0)
-; FOLDING-NEXT: ret
- %c = fpext <2 x float> %a to <2 x double>
- %d = fpext <2 x float> %b to <2 x double>
- %f = call <2 x double> @llvm.fma(<2 x double> %c, <2 x double> %d, <2 x double> %d)
- store <2 x double> %f, ptr %x
- ret void
-}
diff --git a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-vw-web-simplification.ll b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-vw-web-simplification.ll
deleted file mode 100644
index b9466e9f7cc15..0000000000000
--- a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-vw-web-simplification.ll
+++ /dev/null
@@ -1,100 +0,0 @@
-; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -mtriple=riscv32 -mattr=+v -verify-machineinstrs %s -o - --riscv-lower-ext-max-web-size=1 | FileCheck %s --check-prefixes=NO_FOLDING1
-; RUN: llc -mtriple=riscv64 -mattr=+v -verify-machineinstrs %s -o - --riscv-lower-ext-max-web-size=1 | FileCheck %s --check-prefixes=NO_FOLDING1
-; RUN: llc -mtriple=riscv32 -mattr=+v -verify-machineinstrs %s -o - --riscv-lower-ext-max-web-size=2 | FileCheck %s --check-prefixes=NO_FOLDING2
-; RUN: llc -mtriple=riscv64 -mattr=+v -verify-machineinstrs %s -o - --riscv-lower-ext-max-web-size=2 | FileCheck %s --check-prefixes=NO_FOLDING2
-; RUN: llc -mtriple=riscv32 -mattr=+v -verify-machineinstrs %s -o - --riscv-lower-ext-max-web-size=3 | FileCheck %s --check-prefixes=FOLDING
-; RUN: llc -mtriple=riscv64 -mattr=+v -verify-machineinstrs %s -o - --riscv-lower-ext-max-web-size=3 | FileCheck %s --check-prefixes=FOLDING
-; Check that the default value enables the web folding and
-; that it is bigger than 3.
-; RUN: llc -mtriple=riscv32 -mattr=+v -verify-machineinstrs %...
[truncated]
Force-pushed from b501480 to 4a1918d.
This patch fixes the reported issue and still allows the whole web to be handled at once. The important part is that […]
This actually works. But duplicating similar logic in two places often leads to problems. We should remove the internal worklist unless it is necessary.
Force-pushed from 4a1918d to 66af1db.
I looked at the changes @topperc made last week. To stay compatible with those changes, the only approach I can currently think of is, as mentioned, to keep the internal worklist but also add the users of the updated nodes to the outer worklist. In any case, I don't think maintaining another worklist internally is good practice, but for now it seems to be the only option we have.
Adding this worklist was not taken lightly. Originally we did this per node, with a check that the add/sub/mul was the only user of the extend. The worklist system was added so we could remove the one-use check and ensure all users of the extend could be widened. It looks like we did not add any negative tests showing cases where the extend is used by a node that can't be widened; if we had, I think this patch would show a change.
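For illustration, a minimal sketch of such a negative case (a hypothetical test, not part of this PR): the extend %ea is used both by a mul, which could become vwmul.vv, and by a udiv, for which no widening instruction exists, so an all-or-nothing fold must leave the extends in place:

; Hypothetical negative case: %ea has a user (%div) that cannot be widened.
define <2 x i16> @ext_user_not_widenable(<2 x i8> %a, <2 x i8> %b) {
  %ea = sext <2 x i8> %a to <2 x i16>
  %eb = sext <2 x i8> %b to <2 x i16>
  %mul = mul <2 x i16> %ea, %eb    ; widenable: could become vwmul.vv
  %div = udiv <2 x i16> %ea, %eb   ; not widenable: there is no vwdiv
  %or = or <2 x i16> %mul, %div
  ret <2 x i16> %or
}

If the mul were widened anyway, %ea and %eb would still be needed in wide form for the udiv, so the extends would not disappear; this is the same situation as the vwmul_v2i16_multiuse case discussed below.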
@preames and I had a conversation about this last week when I was preparing my change. We think the right long-term direction is to move this entire transform to a MachineIR pass. That would still need some kind of worklist and visited set, but it wouldn't be nested inside DAG combine. It would also allow the optimization to be shared with GlobalISel.
@OMG-link Do you mind if I take your test and create a PR with my change?

I was planning to update my PR tomorrow because I can't connect to my server right now. But of course, you can do it if you want.
Force-pushed from 66af1db to 4b50405.
I have rebuilt the commit. This version keeps the private worklist and, when inserting newly created nodes into the outer worklist, preserves the same insertion logic the outer worklist itself uses. Although this leads to duplicated code, it is the only way to keep the worklist behavior consistent as long as the private worklist exists. In addition, this version of the changes does not affect the results of any existing tests; it only changes the new test case from mul+add to macc.

diff --git a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
index 38cce26e44af..75fab219b6c6 100644
--- a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
+++ b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
@@ -18300,8 +18300,24 @@ static SDValue combineOp_VLToVWOp_VL(SDNode *N,
}
}
for (std::pair<SDValue, SDValue> OldNewValues : ValuesToReplace) {
- DAG.ReplaceAllUsesOfValueWith(OldNewValues.first, OldNewValues.second);
- DCI.AddToWorklist(OldNewValues.second.getNode());
+ // This helper function does what is done for RV in `DAGCombiner::Run`.
+ [&DAG, &DCI](SDNode *N, SDValue RV) -> void {
+ if (N->getNumValues() == RV->getNumValues())
+ DAG.ReplaceAllUsesWith(N, RV.getNode());
+ else {
+ assert(N->getValueType(0) == RV.getValueType() &&
+ N->getNumValues() == 1 && "Type mismatch");
+ DAG.ReplaceAllUsesWith(N, &RV);
+ }
+
+ if (RV.getOpcode() != ISD::EntryToken) {
+ DCI.AddToWorklist(RV.getNode());
+ for (SDNode *Node : RV.getNode()->users())
+ DCI.AddToWorklist(Node);
+ }
+
+ DCI.recursivelyDeleteUnusedNodes(N);
+ }(OldNewValues.first.getNode(), OldNewValues.second);
}
return InputRootReplacement;
}

However, I noticed that in the initial version of the changes, there was the following transformation:

diff --git a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-vwmul.ll b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-vwmul.ll
}However, I noticed that in the initial version of the changes, there was the following transformation: diff --git a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-vwmul.ll b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-vwmul.ll
index 5ec5ce11f85a..94c3138fd330 100644
--- a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-vwmul.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-vwmul.ll
@@ -866,16 +866,16 @@ define <2 x i16> @vwmul_v2i16_multiuse(ptr %x, ptr %y, ptr %z, ptr %w) {
; CHECK-NEXT: vle8.v v8, (a1)
; CHECK-NEXT: vle8.v v9, (a2)
; CHECK-NEXT: vsext.vf2 v10, v8
-; CHECK-NEXT: vsext.vf2 v11, v9
-; CHECK-NEXT: vdivu.vv v9, v10, v11
-; CHECK-NEXT: vle8.v v10, (a0)
+; CHECK-NEXT: vsext.vf2 v8, v9
+; CHECK-NEXT: vdivu.vv v8, v10, v8
+; CHECK-NEXT: vle8.v v9, (a0)
; CHECK-NEXT: vle8.v v11, (a3)
-; CHECK-NEXT: vsetvli zero, zero, e8, mf8, ta, ma
-; CHECK-NEXT: vwmul.vv v12, v10, v11
-; CHECK-NEXT: vwmul.vv v10, v8, v11
-; CHECK-NEXT: vsetvli zero, zero, e16, mf4, ta, ma
-; CHECK-NEXT: vor.vv v8, v12, v10
-; CHECK-NEXT: vor.vv v8, v8, v9
+; CHECK-NEXT: vsext.vf2 v12, v9
+; CHECK-NEXT: vsext.vf2 v9, v11
+; CHECK-NEXT: vmul.vv v11, v12, v9
+; CHECK-NEXT: vmul.vv v9, v10, v9
+; CHECK-NEXT: vor.vv v9, v11, v9
+; CHECK-NEXT: vor.vv v8, v9, v8
; CHECK-NEXT: ret
%a = load <2 x i8>, ptr %x
%b = load <2 x i8>, ptr %y

The new version can no longer discover this transformation (it introduces two additional vsext.vf2 instead of using vwmul.vv).
Looks like it doesn't do the transform because one of the vsext is used by the vdivu and one of the multiplies. So the transform gives up because it can't widen all users of that vsext. Perhaps we should loosen that restriction for multiplies since we don't have a .wv version of multiply. The restriction probably makes sense for add/sub where we can widen a single operand.
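A sketch of that asymmetry (hypothetical functions, grounded only in the fact that RVV provides vwadd.wv/vwsub.wv but no .wv form of vwmul): an add with one already-wide operand can still use a widening instruction, while a mul needs both operands narrow:

; add: one wide operand is fine -> can lower to vwadd.wv %wide, %narrow
define <2 x i16> @add_one_wide(<2 x i16> %wide, <2 x i8> %narrow) {
  %ext = sext <2 x i8> %narrow to <2 x i16>
  %r = add <2 x i16> %wide, %ext
  ret <2 x i16> %r
}

; mul: no vwmul.wv exists, so this stays a plain vmul.vv on wide operands
define <2 x i16> @mul_one_wide(<2 x i16> %wide, <2 x i8> %narrow) {
  %ext = sext <2 x i8> %narrow to <2 x i16>
  %r = mul <2 x i16> %wide, %ext
  ret <2 x i16> %r
}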
Force-pushed from 4b50405 to 5147468.
topperc left a comment:

LGTM
Review comment on the new test:

store <32 x i16> %add0, <32 x i16>* %acc0_ptr, align 1
store <32 x i16> %add1, <32 x i16>* %acc1_ptr, align 1

Typed pointers are deprecated.

Suggested change:
- store <32 x i16> %add0, <32 x i16>* %acc0_ptr, align 1
- store <32 x i16> %add1, <32 x i16>* %acc1_ptr, align 1
+ store <32 x i16> %add0, ptr %acc0_ptr, align 1
+ store <32 x i16> %add1, ptr %acc1_ptr, align 1
Updated. @topperc
diff --git a/llvm/test/CodeGen/RISCV/rvv/combine-vl-vw-macc.ll b/llvm/test/CodeGen/RISCV/rvv/combine-vl-vw-macc.ll
index 2bbaf73e6e0d..f4615df5d310 100644
--- a/llvm/test/CodeGen/RISCV/rvv/combine-vl-vw-macc.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/combine-vl-vw-macc.ll
@@ -2,7 +2,7 @@
; RUN: llc -mtriple=riscv32 -mattr=+v -verify-machineinstrs < %s | FileCheck %s --check-prefixes=CHECK,RV32
; RUN: llc -mtriple=riscv64 -mattr=+v -verify-machineinstrs < %s | FileCheck %s --check-prefixes=CHECK,RV64
-define void @matmul_min(<32 x i8>* %vptr, i8* %scalars, <32 x i16>* %acc0_ptr, <32 x i16>* %acc1_ptr) {
+define void @matmul_min(ptr %vptr, ptr %scalars, ptr %acc0_ptr, ptr %acc1_ptr) {
; CHECK-LABEL: matmul_min:
; CHECK: # %bb.0: # %entry
; CHECK-NEXT: li a4, 64
@@ -25,30 +25,30 @@ define void @matmul_min(<32 x i8>* %vptr, i8* %scalars, <32 x i16>* %acc0_ptr, <
; CHECK-NEXT: vse8.v v12, (a3)
; CHECK-NEXT: ret
entry:
- %acc0 = load <32 x i16>, <32 x i16>* %acc0_ptr, align 1
- %acc1 = load <32 x i16>, <32 x i16>* %acc1_ptr, align 1
+ %acc0 = load <32 x i16>, ptr %acc0_ptr, align 1
+ %acc1 = load <32 x i16>, ptr %acc1_ptr, align 1
- %v8 = load <32 x i8>, <32 x i8>* %vptr, align 1
+ %v8 = load <32 x i8>, ptr %vptr, align 1
%v16 = sext <32 x i8> %v8 to <32 x i16>
- %s0_ptr = getelementptr i8, i8* %scalars, i32 0
- %s0_i8 = load i8, i8* %s0_ptr, align 1
+ %s0_ptr = getelementptr i8, ptr %scalars, i32 0
+ %s0_i8 = load i8, ptr %s0_ptr, align 1
%s0_i16 = sext i8 %s0_i8 to i16
%tmp0 = insertelement <32 x i16> undef, i16 %s0_i16, i32 0
%splat0 = shufflevector <32 x i16> %tmp0, <32 x i16> undef, <32 x i32> zeroinitializer
%mul0 = mul <32 x i16> %splat0, %v16
%add0 = add <32 x i16> %mul0, %acc0
- %s1_ptr = getelementptr i8, i8* %scalars, i32 1
- %s1_i8 = load i8, i8* %s1_ptr, align 1
+ %s1_ptr = getelementptr i8, ptr %scalars, i32 1
+ %s1_i8 = load i8, ptr %s1_ptr, align 1
%s1_i16 = sext i8 %s1_i8 to i16
%tmp1 = insertelement <32 x i16> undef, i16 %s1_i16, i32 0
%splat1 = shufflevector <32 x i16> %tmp1, <32 x i16> undef, <32 x i32> zeroinitializer
%mul1 = mul <32 x i16> %splat1, %v16
%add1 = add <32 x i16> %mul1, %acc1
- store <32 x i16> %add0, <32 x i16>* %acc0_ptr, align 1
- store <32 x i16> %add1, <32 x i16>* %acc1_ptr, align 1
+ store <32 x i16> %add0, ptr %acc0_ptr, align 1
+ store <32 x i16> %add1, ptr %acc1_ptr, align 1
+ ret void

Force-pushed from 5147468 to 590531a.
topperc left a comment:

LGTM
✅ With the latest revision this PR passed the undef deprecator.
Review comment on the new test:

%s0_ptr = getelementptr i8, ptr %scalars, i32 0
%s0_i8 = load i8, ptr %s0_ptr, align 1
%s0_i16 = sext i8 %s0_i8 to i16
%tmp0 = insertelement <32 x i16> undef, i16 %s0_i16, i32 0

Suggested change, to fix the warning about new uses of undef:
- %tmp0 = insertelement <32 x i16> undef, i16 %s0_i16, i32 0
+ %tmp0 = insertelement <32 x i16> poison, i16 %s0_i16, i32 0

Review comment on the new test:

%s1_ptr = getelementptr i8, ptr %scalars, i32 1
%s1_i8 = load i8, ptr %s1_ptr, align 1
%s1_i16 = sext i8 %s1_i8 to i16
%tmp1 = insertelement <32 x i16> undef, i16 %s1_i16, i32 0

Suggested change:
- %tmp1 = insertelement <32 x i16> undef, i16 %s1_i16, i32 0
+ %tmp1 = insertelement <32 x i16> poison, i16 %s1_i16, i32 0
Add a minimal reproducer for consecutive vwmacc-like operations to illustrate that the previous DAG combine logic may miss combining mul+add chains into a single vwmacc.vx instruction.
The previous implementation of combineOp_VLToVWOp_VL manually replaced old nodes with newly created widened nodes, but only added the new node itself to the DAGCombiner worklist. Since the users of the new node were not added, some combine opportunities could be missed when external DAGCombiner passes expected those users to be reconsidered.

This patch replaces the custom replacement logic with a call to DCI.CombineTo(), which performs node replacement in a way consistent with DAGCombiner::Run:
- Replace all uses of the old node.
- Add the new node and its users to the worklist.
- Clean up unused nodes when appropriate.

Using CombineTo ensures that combineOp_VLToVWOp_VL behaves consistently with the standard DAGCombiner update model, avoiding discrepancies between the private worklist inside this routine and the global worklist managed by the combiner. This resolves missed combine cases involving VL -> VW operator widening.
Force-pushed from 590531a to 16d98bc.
Problem:

The previous implementation of combineOp_VLToVWOp_VL maintained a private worklist to fold multiple nodes, but did not inform the external DAGCombiner. This could lead to missed combines, because the combine logic for some users of the updated nodes is outside this function.

Changes:
- Simplify combineOp_VLToVWOp_VL.
- Remove the obsolete command-line option from tests: the -ext-max-web-size option was only used to limit the size of the private worklist. It is now unused and has been deleted. Updated tests by removing RUN lines containing this option, keeping the test coverage: fixed-vectors-vfw-web-simplification.ll, fixed-vectors-vw-web-simplification.ll, vscale-vw-web-simplification.ll.
- Add a new test: combine-vl-vw-macc.ll verifies that consecutive vwmacc-like operations are correctly generated.

Behavioral Improvement

A minimal reproducer (combine-vl-vw-macc.ll) is included in the test suite.

Before this patch, the DAG produces:
- vwmacc.vx
- vwmul.vx
- vadd.vv

After this patch, both mul+add chains are folded into:
- vwmacc.vx

Verification:
- Ran ninja check-llvm-codegen-riscv and confirmed the new test passes.
- Confirmed the reproducer now emits vwmacc.vx for both chains.