
Conversation


@OMG-link OMG-link commented Nov 14, 2025

Problem:

The previous implementation of combineOp_VLToVWOp_VL maintained a private worklist to fold multiple nodes, but did not inform the external DAGCombiner. This could lead to missed combines, because the combine logic for some users of the updated nodes lives outside this function.
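
For illustration, here is a minimal sketch of the failure mode, distilled from the replacement code being removed in the diff below (the wrapper function is a placeholder for exposition, not LLVM's API):

#include "llvm/CodeGen/SelectionDAG.h"
#include "llvm/CodeGen/TargetLowering.h"
using namespace llvm;

// Sketch only: the replacement pattern the old code used. The new node is
// re-queued, but its *users* are not, so a combine that matches at a user
// of NewVal (e.g. folding a widened mul into an add to form vwmacc) is
// never retried by the external DAGCombiner.
static void replaceWithoutRequeueingUsers(SelectionDAG &DAG,
                                          TargetLowering::DAGCombinerInfo &DCI,
                                          SDNode *OldNode, SDValue NewVal) {
  DAG.ReplaceAllUsesOfValueWith(SDValue(OldNode, 0), NewVal);
  DCI.AddToWorklist(NewVal.getNode()); // users of NewVal are not re-queued
}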

Changes:

  1. Simplify combineOp_VLToVWOp_VL

    • Attempts to combine only the input node.
    • Leaves replacement and user updates to the external DAGCombiner worklist.
  2. Remove the obsolete command-line option and update tests

    • The -ext-max-web-size option was only used to limit the size of the private worklist. It is now unused and has been deleted.

    • Updated the following tests by removing the RUN lines that used this option, keeping their coverage:

      • fixed-vectors-vfw-web-simplification.ll
      • fixed-vectors-vw-web-simplification.ll
      • vscale-vw-web-simplification.ll
  3. Add a new test

    • combine-vl-vw-macc.ll verifies that consecutive vwmacc-like operations are correctly generated.

Behavioral Improvement:

A minimal reproducer (combine-vl-vw-macc.ll) is included in the test suite.

Before this patch, the DAG produces:

  • vwmacc.vx
  • vwmul.vx
  • vadd.vv

After this patch, both mul+add chains are folded:

  • vwmacc.vx (one per chain)
  • no redundant vwmul.vx/vadd.vv

Verification:

  • Ran ninja check-llvm-codegen-riscv and confirmed the new test passes.
  • Checked that the patch generates the intended vwmacc.vx instructions.



llvmbot commented Nov 14, 2025

@llvm/pr-subscribers-backend-risc-v

Author: None (OMG-link)

Changes

Problem:

The previous implementation of combineOp_VLToVWOp_VL maintained a private worklist to fold multiple nodes, but did not inform the external DAGCombiner. This could lead to missed combines, because the combine logic for some users of the updated nodes lives outside this function.

Changes:

  1. Simplify combineOp_VLToVWOp_VL

    • Attempts to combine only the input node.
    • Leaves replacement and user updates to the external DAGCombiner worklist.
  2. Remove obsolete code and tests

    • The -ext-max-web-size option was only used to limit the size of this private worklist. It is now unused and has been deleted.

    • Removed tests that relied on this option:

      • fixed-vectors-vfw-web-simplification.ll
      • fixed-vectors-vw-web-simplification.ll
      • vscale-vw-web-simplification.ll
  3. Add a new test

    • combine-vl-vw-macc.ll verifies that consecutive vwmacc-like operations are correctly generated.

Behavioral Improvement:

A minimal reproducer (combine-vl-vw-macc.ll) is included in the test suite.

Before this patch, the DAG produces:

  • vwmacc.vx
  • vwmul.vx
  • vadd.vv

After this patch, both mul+add chains are folded:

  • vwmacc.vx (one per chain)
  • no redundant vwmul.vx/vadd.vv

Verification:

  • Ran ninja check-llvm-codegen-riscv and confirmed the new test passes.
  • Checked that the patch generates the intended vwmacc.vx instructions.

Patch is 51.57 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/168026.diff

6 Files Affected:

  • (modified) llvm/lib/Target/RISCV/RISCVISelLowering.cpp (+18-103)
  • (added) llvm/test/CodeGen/RISCV/rvv/combine-vl-vw-macc.ll (+54)
  • (removed) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-vfw-web-simplification.ll (-238)
  • (removed) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-vw-web-simplification.ll (-100)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-vwmul.ll (+9-9)
  • (removed) llvm/test/CodeGen/RISCV/rvv/vscale-vw-web-simplification.ll (-652)
diff --git a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
index 38cce26e44af4..ae29b39d5b675 100644
--- a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
+++ b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
@@ -57,12 +57,6 @@ using namespace llvm;
 
 STATISTIC(NumTailCalls, "Number of tail calls");
 
-static cl::opt<unsigned> ExtensionMaxWebSize(
-    DEBUG_TYPE "-ext-max-web-size", cl::Hidden,
-    cl::desc("Give the maximum size (in number of nodes) of the web of "
-             "instructions that we will consider for VW expansion"),
-    cl::init(18));
-
 static cl::opt<bool>
     AllowSplatInVW_W(DEBUG_TYPE "-form-vw-w-with-splat", cl::Hidden,
                      cl::desc("Allow the formation of VW_W operations (e.g., "
@@ -18201,109 +18195,30 @@ static SDValue combineOp_VLToVWOp_VL(SDNode *N,
   if (!NodeExtensionHelper::isSupportedRoot(N, Subtarget))
     return SDValue();
 
-  SmallVector<SDNode *> Worklist;
-  SmallPtrSet<SDNode *, 8> Inserted;
-  SmallPtrSet<SDNode *, 8> ExtensionsToRemove;
-  Worklist.push_back(N);
-  Inserted.insert(N);
-  SmallVector<CombineResult> CombinesToApply;
-
-  while (!Worklist.empty()) {
-    SDNode *Root = Worklist.pop_back_val();
-
-    NodeExtensionHelper LHS(Root, 0, DAG, Subtarget);
-    NodeExtensionHelper RHS(Root, 1, DAG, Subtarget);
-    auto AppendUsersIfNeeded =
-        [&Worklist, &Subtarget, &Inserted,
-         &ExtensionsToRemove](const NodeExtensionHelper &Op) {
-          if (Op.needToPromoteOtherUsers()) {
-            // Remember that we're supposed to remove this extension.
-            ExtensionsToRemove.insert(Op.OrigOperand.getNode());
-            for (SDUse &Use : Op.OrigOperand->uses()) {
-              SDNode *TheUser = Use.getUser();
-              if (!NodeExtensionHelper::isSupportedRoot(TheUser, Subtarget))
-                return false;
-              // We only support the first 2 operands of FMA.
-              if (Use.getOperandNo() >= 2)
-                return false;
-              if (Inserted.insert(TheUser).second)
-                Worklist.push_back(TheUser);
-            }
-          }
-          return true;
-        };
+  NodeExtensionHelper LHS(N, 0, DAG, Subtarget);
+  NodeExtensionHelper RHS(N, 1, DAG, Subtarget);
 
-    // Control the compile time by limiting the number of node we look at in
-    // total.
-    if (Inserted.size() > ExtensionMaxWebSize)
-      return SDValue();
+  SmallVector<NodeExtensionHelper::CombineToTry> FoldingStrategies =
+      NodeExtensionHelper::getSupportedFoldings(N);
 
-    SmallVector<NodeExtensionHelper::CombineToTry> FoldingStrategies =
-        NodeExtensionHelper::getSupportedFoldings(Root);
-
-    assert(!FoldingStrategies.empty() && "Nothing to be folded");
-    bool Matched = false;
-    for (int Attempt = 0;
-         (Attempt != 1 + NodeExtensionHelper::isCommutative(Root)) && !Matched;
-         ++Attempt) {
-
-      for (NodeExtensionHelper::CombineToTry FoldingStrategy :
-           FoldingStrategies) {
-        std::optional<CombineResult> Res =
-            FoldingStrategy(Root, LHS, RHS, DAG, Subtarget);
-        if (Res) {
-          // If this strategy wouldn't remove an extension we're supposed to
-          // remove, reject it.
-          if (!Res->LHSExt.has_value() &&
-              ExtensionsToRemove.contains(LHS.OrigOperand.getNode()))
-            continue;
-          if (!Res->RHSExt.has_value() &&
-              ExtensionsToRemove.contains(RHS.OrigOperand.getNode()))
-            continue;
+  if (FoldingStrategies.empty())
+    return SDValue();
 
-          Matched = true;
-          CombinesToApply.push_back(*Res);
-          // All the inputs that are extended need to be folded, otherwise
-          // we would be leaving the old input (since it is may still be used),
-          // and the new one.
-          if (Res->LHSExt.has_value())
-            if (!AppendUsersIfNeeded(LHS))
-              return SDValue();
-          if (Res->RHSExt.has_value())
-            if (!AppendUsersIfNeeded(RHS))
-              return SDValue();
-          break;
-        }
+  bool IsComm = NodeExtensionHelper::isCommutative(N);
+  for (int Attempt = 0; Attempt != 1 + IsComm; ++Attempt) {
+    for (NodeExtensionHelper::CombineToTry FoldingStrategy :
+         FoldingStrategies) {
+      std::optional<CombineResult> Res =
+          FoldingStrategy(N, LHS, RHS, DAG, Subtarget);
+      if (Res) {
+        SDValue NewValue = Res->materialize(DAG, Subtarget);
+        return NewValue;
       }
-      std::swap(LHS, RHS);
-    }
-    // Right now we do an all or nothing approach.
-    if (!Matched)
-      return SDValue();
-  }
-  // Store the value for the replacement of the input node separately.
-  SDValue InputRootReplacement;
-  // We do the RAUW after we materialize all the combines, because some replaced
-  // nodes may be feeding some of the yet-to-be-replaced nodes. Put differently,
-  // some of these nodes may appear in the NodeExtensionHelpers of some of the
-  // yet-to-be-visited CombinesToApply roots.
-  SmallVector<std::pair<SDValue, SDValue>> ValuesToReplace;
-  ValuesToReplace.reserve(CombinesToApply.size());
-  for (CombineResult Res : CombinesToApply) {
-    SDValue NewValue = Res.materialize(DAG, Subtarget);
-    if (!InputRootReplacement) {
-      assert(Res.Root == N &&
-             "First element is expected to be the current node");
-      InputRootReplacement = NewValue;
-    } else {
-      ValuesToReplace.emplace_back(SDValue(Res.Root, 0), NewValue);
     }
+    std::swap(LHS, RHS);
   }
-  for (std::pair<SDValue, SDValue> OldNewValues : ValuesToReplace) {
-    DAG.ReplaceAllUsesOfValueWith(OldNewValues.first, OldNewValues.second);
-    DCI.AddToWorklist(OldNewValues.second.getNode());
-  }
-  return InputRootReplacement;
+
+  return SDValue();
 }
 
 // Fold (vwadd(u).wv y, (vmerge cond, x, 0)) -> vwadd(u).wv y, x, y, cond
diff --git a/llvm/test/CodeGen/RISCV/rvv/combine-vl-vw-macc.ll b/llvm/test/CodeGen/RISCV/rvv/combine-vl-vw-macc.ll
new file mode 100644
index 0000000000000..6c179e4f1472c
--- /dev/null
+++ b/llvm/test/CodeGen/RISCV/rvv/combine-vl-vw-macc.ll
@@ -0,0 +1,54 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
+; RUN: llc -mtriple=riscv32 -mattr=+v -verify-machineinstrs < %s | FileCheck %s --check-prefixes=CHECK,RV32
+; RUN: llc -mtriple=riscv64 -mattr=+v -verify-machineinstrs < %s | FileCheck %s --check-prefixes=CHECK,RV64
+
+define void @matmul_min(<32 x i8>* %vptr, i8* %scalars, <32 x i16>* %acc0_ptr, <32 x i16>* %acc1_ptr) {
+; CHECK-LABEL: matmul_min:
+; CHECK:       # %bb.0: # %entry
+; CHECK-NEXT:    li a4, 64
+; CHECK-NEXT:    li a5, 32
+; CHECK-NEXT:    vsetvli zero, a5, e8, m2, ta, ma
+; CHECK-NEXT:    vle8.v v16, (a0)
+; CHECK-NEXT:    lb a0, 0(a1)
+; CHECK-NEXT:    lb a1, 1(a1)
+; CHECK-NEXT:    vsetvli zero, a4, e8, m4, ta, ma
+; CHECK-NEXT:    vle8.v v8, (a2)
+; CHECK-NEXT:    vle8.v v12, (a3)
+; CHECK-NEXT:    vsetvli zero, a5, e8, m2, ta, ma
+; CHECK-NEXT:    vwmacc.vx v8, a0, v16
+; CHECK-NEXT:    vwmacc.vx v12, a1, v16
+; CHECK-NEXT:    vsetvli zero, a4, e8, m4, ta, ma
+; CHECK-NEXT:    vse8.v v8, (a2)
+; CHECK-NEXT:    vse8.v v12, (a3)
+; CHECK-NEXT:    ret
+entry:
+  %acc0 = load <32 x i16>, <32 x i16>* %acc0_ptr, align 1
+  %acc1 = load <32 x i16>, <32 x i16>* %acc1_ptr, align 1
+
+  %v8 = load <32 x i8>, <32 x i8>* %vptr, align 1
+  %v16 = sext <32 x i8> %v8 to <32 x i16>
+
+  %s0_ptr = getelementptr i8, i8* %scalars, i32 0
+  %s0_i8 = load i8, i8* %s0_ptr, align 1
+  %s0_i16 = sext i8 %s0_i8 to i16
+  %tmp0 = insertelement <32 x i16> undef, i16 %s0_i16, i32 0
+  %splat0 = shufflevector <32 x i16> %tmp0, <32 x i16> undef, <32 x i32> zeroinitializer
+  %mul0 = mul <32 x i16> %splat0, %v16
+  %add0 = add <32 x i16> %mul0, %acc0
+
+  %s1_ptr = getelementptr i8, i8* %scalars, i32 1
+  %s1_i8 = load i8, i8* %s1_ptr, align 1
+  %s1_i16 = sext i8 %s1_i8 to i16
+  %tmp1 = insertelement <32 x i16> undef, i16 %s1_i16, i32 0
+  %splat1 = shufflevector <32 x i16> %tmp1, <32 x i16> undef, <32 x i32> zeroinitializer
+  %mul1 = mul <32 x i16> %splat1, %v16
+  %add1 = add <32 x i16> %mul1, %acc1
+
+  store <32 x i16> %add0, <32 x i16>* %acc0_ptr, align 1
+  store <32 x i16> %add1, <32 x i16>* %acc1_ptr, align 1
+
+  ret void
+}
+;; NOTE: These prefixes are unused and the list is autogenerated. Do not add tests below this line:
+; RV32: {{.*}}
+; RV64: {{.*}}
diff --git a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-vfw-web-simplification.ll b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-vfw-web-simplification.ll
deleted file mode 100644
index 0561ee9addc7b..0000000000000
--- a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-vfw-web-simplification.ll
+++ /dev/null
@@ -1,238 +0,0 @@
-; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -mtriple=riscv64 -mattr=+v,+zvfh,+f,+d -verify-machineinstrs %s -o - --riscv-lower-ext-max-web-size=1 | FileCheck %s --check-prefixes=NO_FOLDING,NO_FOLDING1
-; RUN: llc -mtriple=riscv64 -mattr=+v,+zvfh,+f,+d -verify-machineinstrs %s -o - --riscv-lower-ext-max-web-size=2 | FileCheck %s --check-prefixes=NO_FOLDING,NO_FOLDING2
-; RUN: llc -mtriple=riscv64 -mattr=+v,+zvfh,+f,+d -verify-machineinstrs %s -o - --riscv-lower-ext-max-web-size=3 | FileCheck %s --check-prefixes=FOLDING,ZVFH
-; RUN: llc -mtriple=riscv64 -mattr=+v,+zvfhmin,+f,+d -verify-machineinstrs %s -o - --riscv-lower-ext-max-web-size=3 | FileCheck %s --check-prefixes=FOLDING,ZVFHMIN
-; Check that the default value enables the web folding and
-; that it is bigger than 3.
-; RUN: llc -mtriple=riscv64 -mattr=+v,+zvfh,+f,+d -verify-machineinstrs %s -o - | FileCheck %s --check-prefixes=FOLDING,ZVFH
-
-define void @vfwmul_v2f116_multiple_users(ptr %x, ptr %y, ptr %z, <2 x half> %a, <2 x half> %b, <2 x half> %b2) {
-; NO_FOLDING1-LABEL: vfwmul_v2f116_multiple_users:
-; NO_FOLDING1:       # %bb.0:
-; NO_FOLDING1-NEXT:    vsetivli zero, 2, e16, mf4, ta, ma
-; NO_FOLDING1-NEXT:    vfwcvt.f.f.v v11, v8
-; NO_FOLDING1-NEXT:    vfwcvt.f.f.v v8, v9
-; NO_FOLDING1-NEXT:    vfwcvt.f.f.v v9, v10
-; NO_FOLDING1-NEXT:    vsetvli zero, zero, e32, mf2, ta, ma
-; NO_FOLDING1-NEXT:    vfmul.vv v10, v11, v8
-; NO_FOLDING1-NEXT:    vfadd.vv v11, v11, v9
-; NO_FOLDING1-NEXT:    vfsub.vv v8, v8, v9
-; NO_FOLDING1-NEXT:    vse32.v v10, (a0)
-; NO_FOLDING1-NEXT:    vse32.v v11, (a1)
-; NO_FOLDING1-NEXT:    vse32.v v8, (a2)
-; NO_FOLDING1-NEXT:    ret
-;
-; NO_FOLDING2-LABEL: vfwmul_v2f116_multiple_users:
-; NO_FOLDING2:       # %bb.0:
-; NO_FOLDING2-NEXT:    vsetivli zero, 2, e16, mf4, ta, ma
-; NO_FOLDING2-NEXT:    vfwcvt.f.f.v v11, v8
-; NO_FOLDING2-NEXT:    vfwcvt.f.f.v v8, v9
-; NO_FOLDING2-NEXT:    vsetvli zero, zero, e32, mf2, ta, ma
-; NO_FOLDING2-NEXT:    vfmul.vv v9, v11, v8
-; NO_FOLDING2-NEXT:    vsetvli zero, zero, e16, mf4, ta, ma
-; NO_FOLDING2-NEXT:    vfwadd.wv v11, v11, v10
-; NO_FOLDING2-NEXT:    vfwsub.wv v8, v8, v10
-; NO_FOLDING2-NEXT:    vse32.v v9, (a0)
-; NO_FOLDING2-NEXT:    vse32.v v11, (a1)
-; NO_FOLDING2-NEXT:    vse32.v v8, (a2)
-; NO_FOLDING2-NEXT:    ret
-;
-; ZVFH-LABEL: vfwmul_v2f116_multiple_users:
-; ZVFH:       # %bb.0:
-; ZVFH-NEXT:    vsetivli zero, 2, e16, mf4, ta, ma
-; ZVFH-NEXT:    vfwmul.vv v11, v8, v9
-; ZVFH-NEXT:    vfwadd.vv v12, v8, v10
-; ZVFH-NEXT:    vfwsub.vv v8, v9, v10
-; ZVFH-NEXT:    vse32.v v11, (a0)
-; ZVFH-NEXT:    vse32.v v12, (a1)
-; ZVFH-NEXT:    vse32.v v8, (a2)
-; ZVFH-NEXT:    ret
-;
-; ZVFHMIN-LABEL: vfwmul_v2f116_multiple_users:
-; ZVFHMIN:       # %bb.0:
-; ZVFHMIN-NEXT:    vsetivli zero, 2, e16, mf4, ta, ma
-; ZVFHMIN-NEXT:    vfwcvt.f.f.v v11, v8
-; ZVFHMIN-NEXT:    vfwcvt.f.f.v v8, v9
-; ZVFHMIN-NEXT:    vfwcvt.f.f.v v9, v10
-; ZVFHMIN-NEXT:    vsetvli zero, zero, e32, mf2, ta, ma
-; ZVFHMIN-NEXT:    vfmul.vv v10, v11, v8
-; ZVFHMIN-NEXT:    vfadd.vv v11, v11, v9
-; ZVFHMIN-NEXT:    vfsub.vv v8, v8, v9
-; ZVFHMIN-NEXT:    vse32.v v10, (a0)
-; ZVFHMIN-NEXT:    vse32.v v11, (a1)
-; ZVFHMIN-NEXT:    vse32.v v8, (a2)
-; ZVFHMIN-NEXT:    ret
-  %c = fpext <2 x half> %a to <2 x float>
-  %d = fpext <2 x half> %b to <2 x float>
-  %d2 = fpext <2 x half> %b2 to <2 x float>
-  %e = fmul <2 x float> %c, %d
-  %f = fadd <2 x float> %c, %d2
-  %g = fsub <2 x float> %d, %d2
-  store <2 x float> %e, ptr %x
-  store <2 x float> %f, ptr %y
-  store <2 x float> %g, ptr %z
-  ret void
-}
-
-define void @vfwmul_v2f32_multiple_users(ptr %x, ptr %y, ptr %z, <2 x float> %a, <2 x float> %b, <2 x float> %b2) {
-; NO_FOLDING1-LABEL: vfwmul_v2f32_multiple_users:
-; NO_FOLDING1:       # %bb.0:
-; NO_FOLDING1-NEXT:    vsetivli zero, 2, e32, mf2, ta, ma
-; NO_FOLDING1-NEXT:    vfwcvt.f.f.v v11, v8
-; NO_FOLDING1-NEXT:    vfwcvt.f.f.v v8, v9
-; NO_FOLDING1-NEXT:    vfwcvt.f.f.v v9, v10
-; NO_FOLDING1-NEXT:    vsetvli zero, zero, e64, m1, ta, ma
-; NO_FOLDING1-NEXT:    vfmul.vv v10, v11, v8
-; NO_FOLDING1-NEXT:    vfadd.vv v11, v11, v9
-; NO_FOLDING1-NEXT:    vfsub.vv v8, v8, v9
-; NO_FOLDING1-NEXT:    vse64.v v10, (a0)
-; NO_FOLDING1-NEXT:    vse64.v v11, (a1)
-; NO_FOLDING1-NEXT:    vse64.v v8, (a2)
-; NO_FOLDING1-NEXT:    ret
-;
-; NO_FOLDING2-LABEL: vfwmul_v2f32_multiple_users:
-; NO_FOLDING2:       # %bb.0:
-; NO_FOLDING2-NEXT:    vsetivli zero, 2, e32, mf2, ta, ma
-; NO_FOLDING2-NEXT:    vfwcvt.f.f.v v11, v8
-; NO_FOLDING2-NEXT:    vfwcvt.f.f.v v8, v9
-; NO_FOLDING2-NEXT:    vsetvli zero, zero, e64, m1, ta, ma
-; NO_FOLDING2-NEXT:    vfmul.vv v9, v11, v8
-; NO_FOLDING2-NEXT:    vsetvli zero, zero, e32, mf2, ta, ma
-; NO_FOLDING2-NEXT:    vfwadd.wv v11, v11, v10
-; NO_FOLDING2-NEXT:    vfwsub.wv v8, v8, v10
-; NO_FOLDING2-NEXT:    vse64.v v9, (a0)
-; NO_FOLDING2-NEXT:    vse64.v v11, (a1)
-; NO_FOLDING2-NEXT:    vse64.v v8, (a2)
-; NO_FOLDING2-NEXT:    ret
-;
-; FOLDING-LABEL: vfwmul_v2f32_multiple_users:
-; FOLDING:       # %bb.0:
-; FOLDING-NEXT:    vsetivli zero, 2, e32, mf2, ta, ma
-; FOLDING-NEXT:    vfwmul.vv v11, v8, v9
-; FOLDING-NEXT:    vfwadd.vv v12, v8, v10
-; FOLDING-NEXT:    vfwsub.vv v8, v9, v10
-; FOLDING-NEXT:    vse64.v v11, (a0)
-; FOLDING-NEXT:    vse64.v v12, (a1)
-; FOLDING-NEXT:    vse64.v v8, (a2)
-; FOLDING-NEXT:    ret
-  %c = fpext <2 x float> %a to <2 x double>
-  %d = fpext <2 x float> %b to <2 x double>
-  %d2 = fpext <2 x float> %b2 to <2 x double>
-  %e = fmul <2 x double> %c, %d
-  %f = fadd <2 x double> %c, %d2
-  %g = fsub <2 x double> %d, %d2
-  store <2 x double> %e, ptr %x
-  store <2 x double> %f, ptr %y
-  store <2 x double> %g, ptr %z
-  ret void
-}
-
-define void @vfwmacc_v2f32_multiple_users(ptr %x, ptr %y, ptr %z, <2 x float> %a, <2 x float> %b, <2 x float> %b2, <2 x double> %w) {
-; NO_FOLDING-LABEL: vfwmacc_v2f32_multiple_users:
-; NO_FOLDING:       # %bb.0:
-; NO_FOLDING-NEXT:    vsetivli zero, 2, e32, mf2, ta, ma
-; NO_FOLDING-NEXT:    vfwcvt.f.f.v v12, v8
-; NO_FOLDING-NEXT:    vfwcvt.f.f.v v8, v9
-; NO_FOLDING-NEXT:    vfwcvt.f.f.v v9, v10
-; NO_FOLDING-NEXT:    vsetvli zero, zero, e64, m1, ta, ma
-; NO_FOLDING-NEXT:    vfmul.vv v10, v12, v8
-; NO_FOLDING-NEXT:    vfmadd.vv v12, v9, v11
-; NO_FOLDING-NEXT:    vfsub.vv v8, v8, v9
-; NO_FOLDING-NEXT:    vse64.v v10, (a0)
-; NO_FOLDING-NEXT:    vse64.v v12, (a1)
-; NO_FOLDING-NEXT:    vse64.v v8, (a2)
-; NO_FOLDING-NEXT:    ret
-;
-; FOLDING-LABEL: vfwmacc_v2f32_multiple_users:
-; FOLDING:       # %bb.0:
-; FOLDING-NEXT:    vsetivli zero, 2, e32, mf2, ta, ma
-; FOLDING-NEXT:    vfwmul.vv v12, v8, v9
-; FOLDING-NEXT:    vfwsub.vv v13, v9, v10
-; FOLDING-NEXT:    vfwmacc.vv v11, v8, v10
-; FOLDING-NEXT:    vse64.v v12, (a0)
-; FOLDING-NEXT:    vse64.v v11, (a1)
-; FOLDING-NEXT:    vse64.v v13, (a2)
-; FOLDING-NEXT:    ret
-  %c = fpext <2 x float> %a to <2 x double>
-  %d = fpext <2 x float> %b to <2 x double>
-  %d2 = fpext <2 x float> %b2 to <2 x double>
-  %e = fmul <2 x double> %c, %d
-  %f = call <2 x double> @llvm.fma(<2 x double> %c, <2 x double> %d2, <2 x double> %w)
-  %g = fsub <2 x double> %d, %d2
-  store <2 x double> %e, ptr %x
-  store <2 x double> %f, ptr %y
-  store <2 x double> %g, ptr %z
-  ret void
-}
-
-; Negative test. We can't fold because the FMA addend is a user.
-define void @vfwmacc_v2f32_multiple_users_addend_user(ptr %x, ptr %y, ptr %z, <2 x float> %a, <2 x float> %b, <2 x float> %b2) {
-; NO_FOLDING-LABEL: vfwmacc_v2f32_multiple_users_addend_user:
-; NO_FOLDING:       # %bb.0:
-; NO_FOLDING-NEXT:    vsetivli zero, 2, e32, mf2, ta, ma
-; NO_FOLDING-NEXT:    vfwcvt.f.f.v v11, v8
-; NO_FOLDING-NEXT:    vfwcvt.f.f.v v8, v9
-; NO_FOLDING-NEXT:    vfwcvt.f.f.v v9, v10
-; NO_FOLDING-NEXT:    vsetvli zero, zero, e64, m1, ta, ma
-; NO_FOLDING-NEXT:    vfmul.vv v10, v11, v8
-; NO_FOLDING-NEXT:    vfmadd.vv v11, v9, v8
-; NO_FOLDING-NEXT:    vfsub.vv v8, v8, v9
-; NO_FOLDING-NEXT:    vse64.v v10, (a0)
-; NO_FOLDING-NEXT:    vse64.v v11, (a1)
-; NO_FOLDING-NEXT:    vse64.v v8, (a2)
-; NO_FOLDING-NEXT:    ret
-;
-; FOLDING-LABEL: vfwmacc_v2f32_multiple_users_addend_user:
-; FOLDING:       # %bb.0:
-; FOLDING-NEXT:    vsetivli zero, 2, e32, mf2, ta, ma
-; FOLDING-NEXT:    vfwcvt.f.f.v v11, v8
-; FOLDING-NEXT:    vfwcvt.f.f.v v8, v9
-; FOLDING-NEXT:    vfwcvt.f.f.v v9, v10
-; FOLDING-NEXT:    vsetvli zero, zero, e64, m1, ta, ma
-; FOLDING-NEXT:    vfmul.vv v10, v11, v8
-; FOLDING-NEXT:    vfmadd.vv v11, v9, v8
-; FOLDING-NEXT:    vfsub.vv v8, v8, v9
-; FOLDING-NEXT:    vse64.v v10, (a0)
-; FOLDING-NEXT:    vse64.v v11, (a1)
-; FOLDING-NEXT:    vse64.v v8, (a2)
-; FOLDING-NEXT:    ret
-  %c = fpext <2 x float> %a to <2 x double>
-  %d = fpext <2 x float> %b to <2 x double>
-  %d2 = fpext <2 x float> %b2 to <2 x double>
-  %e = fmul <2 x double> %c, %d
-  %f = call <2 x double> @llvm.fma(<2 x double> %c, <2 x double> %d2, <2 x double> %d)
-  %g = fsub <2 x double> %d, %d2
-  store <2 x double> %e, ptr %x
-  store <2 x double> %f, ptr %y
-  store <2 x double> %g, ptr %z
-  ret void
-}
-
-; Negative test. We can't fold because the FMA addend is a user.
-define void @vfwmacc_v2f32_addend_user(ptr %x, <2 x float> %a, <2 x float> %b) {
-; NO_FOLDING-LABEL: vfwmacc_v2f32_addend_user:
-; NO_FOLDING:       # %bb.0:
-; NO_FOLDING-NEXT:    vsetivli zero, 2, e32, mf2, ta, ma
-; NO_FOLDING-NEXT:    vfwcvt.f.f.v v10, v8
-; NO_FOLDING-NEXT:    vfwcvt.f.f.v v8, v9
-; NO_FOLDING-NEXT:    vsetvli zero, zero, e64, m1, ta, ma
-; NO_FOLDING-NEXT:    vfmadd.vv v8, v10, v8
-; NO_FOLDING-NEXT:    vse64.v v8, (a0)
-; NO_FOLDING-NEXT:    ret
-;
-; FOLDING-LABEL: vfwmacc_v2f32_addend_user:
-; FOLDING:       # %bb.0:
-; FOLDING-NEXT:    vsetivli zero, 2, e32, mf2, ta, ma
-; FOLDING-NEXT:    vfwcvt.f.f.v v10, v8
-; FOLDING-NEXT:    vfwcvt.f.f.v v8, v9
-; FOLDING-NEXT:    vsetvli zero, zero, e64, m1, ta, ma
-; FOLDING-NEXT:    vfmadd.vv v8, v10, v8
-; FOLDING-NEXT:    vse64.v v8, (a0)
-; FOLDING-NEXT:    ret
-  %c = fpext <2 x float> %a to <2 x double>
-  %d = fpext <2 x float> %b to <2 x double>
-  %f = call <2 x double> @llvm.fma(<2 x double> %c, <2 x double> %d, <2 x double> %d)
-  store <2 x double> %f, ptr %x
-  ret void
-}
diff --git a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-vw-web-simplification.ll b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-vw-web-simplification.ll
deleted file mode 100644
index b9466e9f7cc15..0000000000000
--- a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-vw-web-simplification.ll
+++ /dev/null
@@ -1,100 +0,0 @@
-; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -mtriple=riscv32 -mattr=+v -verify-machineinstrs %s -o - --riscv-lower-ext-max-web-size=1 | FileCheck %s --check-prefixes=NO_FOLDING1
-; RUN: llc -mtriple=riscv64 -mattr=+v -verify-machineinstrs %s -o - --riscv-lower-ext-max-web-size=1 | FileCheck %s --check-prefixes=NO_FOLDING1
-; RUN: llc -mtriple=riscv32 -mattr=+v -verify-machineinstrs %s -o - --riscv-lower-ext-max-web-size=2 | FileCheck %s --check-prefixes=NO_FOLDING2
-; RUN: llc -mtriple=riscv64 -mattr=+v -verify-machineinstrs %s -o - --riscv-lower-ext-max-web-size=2 | FileCheck %s --check-prefixes=NO_FOLDING2
-; RUN: llc -mtriple=riscv32 -mattr=+v -verify-machineinstrs %s -o - --riscv-lower-ext-max-web-size=3 | FileCheck %s --check-prefixes=FOLDING
-; RUN: llc -mtriple=riscv64 -mattr=+v -verify-machineinstrs %s -o - --riscv-lower-ext-max-web-size=3 | FileCheck %s --check-prefixes=FOLDING
-; Check that the default value enables the web folding and
-; that it is bigger than 3.
-; RUN: llc -mtriple=riscv32 -mattr=+v -verify-machineinstrs %...
[truncated]

@OMG-link OMG-link force-pushed the fix-dagcombine-vl-vw-fold branch from b501480 to 4a1918d on November 14, 2025 08:00

topperc commented Nov 14, 2025

This patch fixes the reported issue and still allows the whole web to be handled at once. The important part is that CombineTo also adds the users of the replaced node to the worklist.

diff --git a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
index 38cce26e44af..8ba1215561dc 100644
--- a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
+++ b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
@@ -18300,8 +18300,7 @@ static SDValue combineOp_VLToVWOp_VL(SDNode *N,
     }
   }
   for (std::pair<SDValue, SDValue> OldNewValues : ValuesToReplace) {
-    DAG.ReplaceAllUsesOfValueWith(OldNewValues.first, OldNewValues.second);
-    DCI.AddToWorklist(OldNewValues.second.getNode());
+    DCI.CombineTo(OldNewValues.first.getNode(), OldNewValues.second);
   }
   return InputRootReplacement;
 }
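
For readers unfamiliar with the helper: DCI.CombineTo bundles the replacement and the worklist maintenance together. Below is a simplified sketch of the behavior being relied on here, assuming the single-result case (the real implementation is DAGCombiner::CombineTo in DAGCombiner.cpp and differs in detail):

#include "llvm/CodeGen/SelectionDAG.h"
#include "llvm/CodeGen/TargetLowering.h"
using namespace llvm;

// Sketch only: what CombineTo effectively does for a single-value node.
static void combineToSketch(SelectionDAG &DAG,
                            TargetLowering::DAGCombinerInfo &DCI,
                            SDNode *OldNode, SDValue NewVal) {
  DAG.ReplaceAllUsesOfValueWith(SDValue(OldNode, 0), NewVal);
  DCI.AddToWorklist(NewVal.getNode());
  // The key difference from plain RAUW + AddToWorklist: the users of the
  // replacement are re-queued too, so user-side combines (e.g.
  // add(vwmul, acc) -> vwmacc) get another chance to fire.
  for (SDNode *User : NewVal.getNode()->users())
    DCI.AddToWorklist(User);
}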


OMG-link commented Nov 14, 2025

> This patch fixes the reported issue and still allows the whole web to be handled at once. The important part is that CombineTo also adds the users of the replaced node to the worklist. [diff quoted above]

This actually works. But duplicating similar logic in two places often leads to problems. We should remove the internal worklist unless it is truly necessary.

@OMG-link OMG-link force-pushed the fix-dagcombine-vl-vw-fold branch from 4a1918d to 66af1db on November 14, 2025 10:31
@OMG-link

I looked at the changes @topperc made last week. To stay compatible with those changes, the only approach I can currently think of is the one mentioned above: keep the internal worklist, but also add the users of the updated nodes to the outer worklist.

In any case, I don’t think maintaining another worklist internally is good practice. But for now, it seems to be the only option we have.


topperc commented Nov 14, 2025

> This actually works. But duplicating similar logic in two places often leads to problems. We should remove the internal worklist unless it is truly necessary.

Adding this worklist was not taken lightly. Originally we did the transform for each node individually, with a check that the add/sub/mul was the only user of the extend. The worklist system was added so we could remove the one-use check and ensure all users of the extend could be widened. It looks like we did not add any negative tests that show cases where the extend is used by a node that can't be widened. If we had, I think this patch would show a change.


topperc commented Nov 14, 2025

> I looked at the changes @topperc made last week. To stay compatible with those changes, the only approach I can currently think of is the one mentioned above: keep the internal worklist, but also add the users of the updated nodes to the outer worklist.
>
> In any case, I don’t think maintaining another worklist internally is good practice. But for now, it seems to be the only option we have.

@preames and I had a conversation about this last week when I was preparing my change. We think the right long-term direction is to move this entire transform to a MachineIR pass. That would still need some kind of worklist and visited set, but it wouldn't be nested inside DAG combine. It would also allow the optimization to be shared with GlobalISel.
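
For context, a hypothetical skeleton of that direction (no such pass exists today; the pass name and structure below are assumptions, not an agreed design):

#include "llvm/ADT/SmallPtrSet.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/CodeGen/MachineFunction.h"
#include "llvm/CodeGen/MachineFunctionPass.h"
using namespace llvm;

namespace {
// Hypothetical sketch: a MIR pass that owns its own worklist and visited
// set for widening webs, independent of DAG combine, so the same logic
// could also run on GlobalISel-produced MIR.
struct RISCVFoldVWWeb : MachineFunctionPass {
  static char ID;
  RISCVFoldVWWeb() : MachineFunctionPass(ID) {}

  bool runOnMachineFunction(MachineFunction &MF) override {
    SmallVector<MachineInstr *> Worklist;
    SmallPtrSet<MachineInstr *, 16> Visited;
    // Seed with candidate add/sub/mul instructions, then grow the web
    // through extends and their users, visiting each instruction once.
    (void)MF;
    (void)Worklist;
    (void)Visited;
    return false; // sketch: no rewriting performed
  }
};
char RISCVFoldVWWeb::ID = 0;
} // namespace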


topperc commented Nov 14, 2025

@OMG-link Do you mind if I take your test and create a PR with my change?

@OMG-link

> @OMG-link Do you mind if I take your test and create a PR with my change?

I was planning to change my PR tomorrow because I can't connect to my server right now. But of course, you can do it if you want.

@OMG-link OMG-link force-pushed the fix-dagcombine-vl-vw-fold branch from 66af1db to 4b50405 on November 15, 2025 08:52
@OMG-link

I have rebuilt the commit. This version keeps the private worklist and, when inserting newly created nodes into the outer worklist, mirrors the insertion logic that the outer worklist (DAGCombiner::Run) uses. Although this leads to duplicated code, it’s the only way to maintain consistent worklist behavior as long as the private worklist exists.

In addition, this version of the changes does not affect the results of any existing tests; it only changes the new test case from mul+add to macc.

diff --git a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
index 38cce26e44af..75fab219b6c6 100644
--- a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
+++ b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
@@ -18300,8 +18300,24 @@ static SDValue combineOp_VLToVWOp_VL(SDNode *N,
     }
   }
   for (std::pair<SDValue, SDValue> OldNewValues : ValuesToReplace) {
-    DAG.ReplaceAllUsesOfValueWith(OldNewValues.first, OldNewValues.second);
-    DCI.AddToWorklist(OldNewValues.second.getNode());
+    // This helper function does what is done for RV in `DAGCombiner::Run`.
+    [&DAG, &DCI](SDNode *N, SDValue RV) -> void {
+      if (N->getNumValues() == RV->getNumValues())
+        DAG.ReplaceAllUsesWith(N, RV.getNode());
+      else {
+        assert(N->getValueType(0) == RV.getValueType() &&
+               N->getNumValues() == 1 && "Type mismatch");
+        DAG.ReplaceAllUsesWith(N, &RV);
+      }
+
+      if (RV.getOpcode() != ISD::EntryToken) {
+        DCI.AddToWorklist(RV.getNode());
+        for (SDNode *Node : RV.getNode()->users())
+          DCI.AddToWorklist(Node);
+      }
+
+      DCI.recursivelyDeleteUnusedNodes(N);
+    }(OldNewValues.first.getNode(), OldNewValues.second);
   }
   return InputRootReplacement;
 }

However, I noticed that in the initial version of the changes, there was the following transformation:

diff --git a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-vwmul.ll b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-vwmul.ll
index 5ec5ce11f85a..94c3138fd330 100644
--- a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-vwmul.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-vwmul.ll
@@ -866,16 +866,16 @@ define <2 x i16> @vwmul_v2i16_multiuse(ptr %x, ptr %y, ptr %z, ptr %w) {
 ; CHECK-NEXT:    vle8.v v8, (a1)
 ; CHECK-NEXT:    vle8.v v9, (a2)
 ; CHECK-NEXT:    vsext.vf2 v10, v8
-; CHECK-NEXT:    vsext.vf2 v11, v9
-; CHECK-NEXT:    vdivu.vv v9, v10, v11
-; CHECK-NEXT:    vle8.v v10, (a0)
+; CHECK-NEXT:    vsext.vf2 v8, v9
+; CHECK-NEXT:    vdivu.vv v8, v10, v8
+; CHECK-NEXT:    vle8.v v9, (a0)
 ; CHECK-NEXT:    vle8.v v11, (a3)
-; CHECK-NEXT:    vsetvli zero, zero, e8, mf8, ta, ma
-; CHECK-NEXT:    vwmul.vv v12, v10, v11
-; CHECK-NEXT:    vwmul.vv v10, v8, v11
-; CHECK-NEXT:    vsetvli zero, zero, e16, mf4, ta, ma
-; CHECK-NEXT:    vor.vv v8, v12, v10
-; CHECK-NEXT:    vor.vv v8, v8, v9
+; CHECK-NEXT:    vsext.vf2 v12, v9
+; CHECK-NEXT:    vsext.vf2 v9, v11
+; CHECK-NEXT:    vmul.vv v11, v12, v9
+; CHECK-NEXT:    vmul.vv v9, v10, v9
+; CHECK-NEXT:    vor.vv v9, v11, v9
+; CHECK-NEXT:    vor.vv v8, v9, v8
 ; CHECK-NEXT:    ret
   %a = load <2 x i8>, ptr %x
   %b = load <2 x i8>, ptr %y

The new version can no longer discover this transformation (it introduces two additional vsetvli instructions, but this is clearly not the reason the current code fails to detect this opportunity). Perhaps we should open a separate PR to discuss this issue.

@topperc


topperc commented Nov 15, 2025

> I have rebuilt the commit. This version keeps the private worklist and, when inserting newly created nodes into the outer worklist, mirrors the insertion logic that the outer worklist (DAGCombiner::Run) uses. [...]
>
> The new version can no longer discover this transformation (it introduces two additional vsetvli instructions, but this is clearly not the reason the current code fails to detect this opportunity). Perhaps we should open a separate PR to discuss this issue.

Looks like it doesn't do the transform because one of the vsext results is used by both the vdivu and one of the multiplies. So the transform gives up because it can't widen all users of that vsext. Perhaps we should loosen that restriction for multiplies, since we don't have a .wv version of multiply. The restriction probably makes sense for add/sub, where we can widen a single operand.
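
A hypothetical shape for that relaxation (sketch only, untested; the predicate below is distilled from this comment, not code from the PR):

// Hypothetical policy: must every other user of an extend also be
// widenable before we fold? For add/sub, where a single operand can be
// widened via a .wv form, keeping the requirement probably still makes
// sense; multiplies have no vwmul.wv, so a surviving extend feeding a
// non-widenable user costs nothing extra and the fold could proceed.
static bool mustPromoteAllUsers(bool ExtendHasOtherUsers, bool RootIsMul) {
  return ExtendHasOtherUsers && !RootIsMul;
}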

@OMG-link OMG-link force-pushed the fix-dagcombine-vl-vw-fold branch from 4b50405 to 5147468 on November 15, 2025 09:22

@topperc topperc left a comment


LGTM

Comment on lines 47 to 48
store <32 x i16> %add0, <32 x i16>* %acc0_ptr, align 1
store <32 x i16> %add1, <32 x i16>* %acc1_ptr, align 1

Typed pointers are deprecated.

Suggested change
store <32 x i16> %add0, <32 x i16>* %acc0_ptr, align 1
store <32 x i16> %add1, <32 x i16>* %acc1_ptr, align 1
store <32 x i16> %add0, ptr %acc0_ptr, align 1
store <32 x i16> %add1, ptr %acc1_ptr, align 1


@OMG-link OMG-link Nov 16, 2025


Updated. @topperc

diff --git a/llvm/test/CodeGen/RISCV/rvv/combine-vl-vw-macc.ll b/llvm/test/CodeGen/RISCV/rvv/combine-vl-vw-macc.ll
index 2bbaf73e6e0d..f4615df5d310 100644
--- a/llvm/test/CodeGen/RISCV/rvv/combine-vl-vw-macc.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/combine-vl-vw-macc.ll
@@ -2,7 +2,7 @@
 ; RUN: llc -mtriple=riscv32 -mattr=+v -verify-machineinstrs < %s | FileCheck %s --check-prefixes=CHECK,RV32
 ; RUN: llc -mtriple=riscv64 -mattr=+v -verify-machineinstrs < %s | FileCheck %s --check-prefixes=CHECK,RV64
 
-define void @matmul_min(<32 x i8>* %vptr, i8* %scalars, <32 x i16>* %acc0_ptr, <32 x i16>* %acc1_ptr) {
+define void @matmul_min(ptr %vptr, ptr %scalars, ptr %acc0_ptr, ptr %acc1_ptr) {
 ; CHECK-LABEL: matmul_min:
 ; CHECK:       # %bb.0: # %entry
 ; CHECK-NEXT:    li a4, 64
@@ -25,30 +25,30 @@ define void @matmul_min(<32 x i8>* %vptr, i8* %scalars, <32 x i16>* %acc0_ptr, <
 ; CHECK-NEXT:    vse8.v v12, (a3)
 ; CHECK-NEXT:    ret
 entry:
-  %acc0 = load <32 x i16>, <32 x i16>* %acc0_ptr, align 1
-  %acc1 = load <32 x i16>, <32 x i16>* %acc1_ptr, align 1
+  %acc0 = load <32 x i16>, ptr %acc0_ptr, align 1
+  %acc1 = load <32 x i16>, ptr %acc1_ptr, align 1
 
-  %v8 = load <32 x i8>, <32 x i8>* %vptr, align 1
+  %v8 = load <32 x i8>, ptr %vptr, align 1
   %v16 = sext <32 x i8> %v8 to <32 x i16>
 
-  %s0_ptr = getelementptr i8, i8* %scalars, i32 0
-  %s0_i8 = load i8, i8* %s0_ptr, align 1
+  %s0_ptr = getelementptr i8, ptr %scalars, i32 0
+  %s0_i8 = load i8, ptr %s0_ptr, align 1
   %s0_i16 = sext i8 %s0_i8 to i16
   %tmp0 = insertelement <32 x i16> undef, i16 %s0_i16, i32 0
   %splat0 = shufflevector <32 x i16> %tmp0, <32 x i16> undef, <32 x i32> zeroinitializer
   %mul0 = mul <32 x i16> %splat0, %v16
   %add0 = add <32 x i16> %mul0, %acc0
 
-  %s1_ptr = getelementptr i8, i8* %scalars, i32 1
-  %s1_i8 = load i8, i8* %s1_ptr, align 1
+  %s1_ptr = getelementptr i8, ptr %scalars, i32 1
+  %s1_i8 = load i8, ptr %s1_ptr, align 1
   %s1_i16 = sext i8 %s1_i8 to i16
   %tmp1 = insertelement <32 x i16> undef, i16 %s1_i16, i32 0
   %splat1 = shufflevector <32 x i16> %tmp1, <32 x i16> undef, <32 x i32> zeroinitializer
   %mul1 = mul <32 x i16> %splat1, %v16
   %add1 = add <32 x i16> %mul1, %acc1
 
-  store <32 x i16> %add0, <32 x i16>* %acc0_ptr, align 1
-  store <32 x i16> %add1, <32 x i16>* %acc1_ptr, align 1
+  store <32 x i16> %add0, ptr %acc0_ptr, align 1
+  store <32 x i16> %add1, ptr %acc1_ptr, align 1
 
   ret void

@OMG-link OMG-link force-pushed the fix-dagcombine-vl-vw-fold branch from 5147468 to 590531a on November 16, 2025 05:53

@topperc topperc left a comment


LGTM


github-actions bot commented Nov 17, 2025

✅ With the latest revision this PR passed the undef deprecator.

%s0_ptr = getelementptr i8, ptr %scalars, i32 0
%s0_i8 = load i8, ptr %s0_ptr, align 1
%s0_i16 = sext i8 %s0_i8 to i16
%tmp0 = insertelement <32 x i16> undef, i16 %s0_i16, i32 0

Suggested change
%tmp0 = insertelement <32 x i16> undef, i16 %s0_i16, i32 0
%tmp0 = insertelement <32 x i16> poison, i16 %s0_i16, i32 0

To fix the warning about new uses of undef.

%s1_ptr = getelementptr i8, ptr %scalars, i32 1
%s1_i8 = load i8, ptr %s1_ptr, align 1
%s1_i16 = sext i8 %s1_i8 to i16
%tmp1 = insertelement <32 x i16> undef, i16 %s1_i16, i32 0

Suggested change
%tmp1 = insertelement <32 x i16> undef, i16 %s1_i16, i32 0
%tmp1 = insertelement <32 x i16> poison, i16 %s1_i16, i32 0

Kai Lin added 2 commits November 17, 2025 13:01
Add a minimal reproducer for consecutive vwmacc-like operations
to illustrate that the previous DAG combine logic may miss combining
mul+add chains into a single vwmacc.vx instruction.
The previous implementation of combineOp_VLToVWOp_VL manually replaced old
nodes with newly created widened nodes, but only added the new node itself to
the DAGCombiner worklist. Since the users of the new node were not added,
some combine opportunities could be missed when external DAGCombiner passes
expected those users to be reconsidered.

This patch replaces the custom replacement logic with a call to
DCI.CombineTo(), which performs node replacement in a way consistent with
DAGCombiner::Run:
- Replace all uses of the old node.
- Add the new node and its users to the worklist.
- Clean up unused nodes when appropriate.

Using CombineTo ensures that combineOp_VLToVWOp_VL behaves consistently with
the standard DAGCombiner update model, avoiding discrepancies between the
private worklist inside this routine and the global worklist managed by the
combiner.

This resolves missed combine cases involving VL -> VW operator widening.
@OMG-link OMG-link force-pushed the fix-dagcombine-vl-vw-fold branch from 590531a to 16d98bc on November 17, 2025 05:03