[LoongArch] Custom legalize vector_shuffle to xvinsve0.{w/d} when possible
#160857
Closed: zhaoqi5 wants to merge 220 commits into users/zhaoqi5/tests-for-xvinsve0 from users/zhaoqi5/legalize-xvinsve0
Conversation
Factored out from #151275. Remove all UnsafeFPMath uses except the ABI-tags-related part.
Windows paths have different slashes, but I don't think we care about the exact paths there anyway so I've just checked for the final filename. Fixes #160652
An inline asm constraint "Jr", in AArch32, means that if the input value
is a compile-time constant in the range -4095 to +4095, then it can be
inserted into the assembly language as an immediate operand, and
otherwise it will be placed in a register.
The comment in the Arm backend said "It is not clear what this
constraint is intended for". I believe the answer is that that range of
immediate values is the one you can use in an LDR or STR instruction.
So it's suitable for cases like this:
asm("str %0,[%1,%2]" : : "r"(data), "r"(base), "Jr"(offset) : "memory");
in the same way that the "Ir" constraint is suitable for the immediate
in a data-processing instruction such as ADD or EOR.
We were looking for any mention of the feature name in cpuinfo, which could have matched anything, including features with common prefixes like sme, sme2 and smefa64. Luckily this was not a problem, but I'm changing this to find the Features line and split the features into a list, so that we only look for exact matches. Here's the information for one core as an example:

```
processor       : 7
BogoMIPS        : 200.00
Features        : fp asimd evtstrm crc32 atomics fphp asimdhp cpuid <...>
CPU implementer : 0x41
CPU architecture: 8
CPU variant     : 0x0
CPU part        : 0xd0f
CPU revision    : 0
```

(And to avoid any doubt: this is from a CPU simulated in Arm's FVP, not real hardware.) Note that the layout of the label, colon and values is sometimes aligned but not always, so I trim whitespace a few times to normalise that. This block repeats once for each core, so we only need to find one Features line.
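A minimal C++ sketch of that exact-match logic (illustrative only; the real change lives in the test infrastructure, and the function name here is made up):

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Find the "Features" line in a cpuinfo-style file and require a
// whole-token match, so "sme" does not match "sme2" or "smefa64".
static bool cpuinfoHasFeature(const std::string &Path,
                              const std::string &Feature) {
  std::ifstream In(Path);
  std::string Line;
  while (std::getline(In, Line)) {
    std::string::size_type Colon = Line.find(':');
    if (Colon == std::string::npos)
      continue;
    // Trim trailing whitespace; label/colon/value alignment varies.
    std::string Label = Line.substr(0, Colon);
    Label.erase(Label.find_last_not_of(" \t") + 1);
    if (Label != "Features")
      continue;
    // Split the value into whitespace-separated tokens.
    std::istringstream Tokens(Line.substr(Colon + 1));
    std::string Tok;
    while (Tokens >> Tok)
      if (Tok == Feature)
        return true;
    return false; // Every core repeats the same line; one is enough.
  }
  return false;
}

int main() { return cpuinfoHasFeature("/proc/cpuinfo", "sme") ? 0 : 1; }
```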
#160823)
- Fix #156591 (comment)
- As per https://cdrdv2.intel.com/v1/dl/getContent/671200, the default rounding mode is **round to nearest**.
Split out from #151300 to isolate TargetTransformInfo cost modelling for fault-only-first loads from VPlan implementation details. This change adds costing support for vp.load.ff independently of the VPlan work. For now, model a vp.load.ff as cost-equivalent to a vp.load.
This fixes the ifdefs added in e9e166e; we need to include int_lib.h before we can expect these defines to be set. Also remove the XFAILs for aarch64 windows: as this test is now a no-op on platforms that lack CRT_HAS_128BIT or CRT_HAS_F128 (aarch64 windows lacks the latter), it no longer fails.
…remat (#159110) Currently, something like:

```
$eax = MOV32ri -11, implicit-def $rax
%al = COPY $eax
```

can be rematerialized as:

```
dead $eax = MOV32ri -11, implicit-def $rax
```

which marks the full $rax as used, not just $al. With this change, it is rematerialized as:

```
dead $eax = MOV32ri -11, implicit-def dead $rax, implicit-def $al
```

to indicate that only $al is used. Note: this issue is latent right now, but it is exposed once #134408 is applied, as that results in the register pressure being incorrectly calculated (unless this patch is applied too). I think this change is in line with past fixes in this area, notably 059cead and 69cd121.
…if successor is loop header (#154063) This addresses a performance issue for our downstream GPU target, which sets requiresStructuredCFG to true. The issue is that the EarlyMachineLICM pass does not hoist loop invariants because a critical edge is not split. The critical edge's destination is a loop header, and splitting the critical edge will not break structured CFG. Add an nvptx test to demonstrate the issue, since that target also requires structured CFG. --------- Co-authored-by: Matt Arsenault <[email protected]>
In the im2col decomposition, propagate the filter tensor encoding (if specified) through the tensor.collapse_shape op, so that it can be used by the consuming linalg.generic matmul op. Signed-off-by: Fabrizio Indirli <[email protected]>
Additional CSE opportunities are exposed after converting to concrete recipes/dissolving regions and materializing various expressions. Run CSE later, to capitalize on some of the late opportunities. PR: #160572
…freeze(x),freeze(y)) (#160835)
…s(freeze(x),freeze(y)) (#160837)
…ilvar(freeze(x),freeze(y)) (#160836)
This flag enables the compiler to generate most of the debug information in a separate file, which can be useful for executable size and link times. Clang already supports this flag. I have tried to follow the logic of the clang implementation where possible. Some functions were moved to where they could be used by both clang and flang. The `addOtherOptions` function was renamed to `addDebugOptions` to better reflect its purpose. Clang also sets the `splitDebugFilename` field of the `DICompileUnit` in the IR when this option is present. That part is currently missing from this patch and will come in a follow-up PR.
…160021) This patch makes the following updates to the `QualGroup` documentation:
1. Move to Reference section: relocated the Qualification Working Group (QualGroup) docs from the main index into the Reference section for better organization and consistency.
2. Add link in GettingInvolved: inserted a proper link to the QualGroup documentation in the GettingInvolved sync-ups table, improving discoverability for newcomers.
3. Align structure with Security Group: revised the documentation layout to follow the same structure pattern as the Security Group docs, ensuring consistency across LLVM working group references.
… `_LIBCPP_VERSION` (#160627) And add some guaranteed cases (namely, for `expected`, `optional`, and `variant`) to `is_implicit_lifetime.pass.cpp`. It's somewhat unfortunate that `pair` and `tuple` are not guaranteed to propagate triviality of copy/move constructors, and that MSVC STL fails to do so due to ABI compatibility; this affects the implicit-lifetime property.
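As a small illustration of what is being guaranteed (a sketch assuming a C++23 standard library that ships the trait; this is not the actual test file):

```cpp
#include <optional>
#include <type_traits>

int main() {
#ifdef __cpp_lib_is_implicit_lifetime
  // std::optional<int> has a trivial destructor and a trivial copy
  // constructor, which makes it an implicit-lifetime type; the patch
  // adds cases like this as tested guarantees.
  static_assert(std::is_implicit_lifetime_v<std::optional<int>>);
#endif
}
```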
…oisonForTargetNode - add X86ISD::PSHUFB handling (#160842) X86ISD::PSHUFB shuffles can't create undef/poison itself, allowing us to fold freeze(pshufb(x,y)) -> pshufb(freeze(x),freeze(y))
On targets where f32 maximumnum is legal, but maximumnum on vectors of smaller types is not legal (e.g. v2f16), try unrolling the vector first as part of the expansion. Only fall back to expanding the full maximumnum computation into compares + selects if maximumnum on the scalar element type cannot be supported.
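For context, the scalar fallback mentioned above boils down to compare+select logic with IEEE-754 maximumNumber semantics; a minimal C++ sketch (not the legalizer code) might look like:

```cpp
#include <cassert>
#include <cmath>

// maximumnum semantics: a NaN input is treated as missing data, and
// +0.0 is ordered above -0.0; otherwise return the larger value.
static float maximumnumF32(float A, float B) {
  if (std::isnan(A))
    return B;
  if (std::isnan(B))
    return A;
  if (A == 0.0f && B == 0.0f)
    return std::signbit(A) ? B : A;
  return A > B ? A : B;
}

int main() {
  assert(maximumnumF32(NAN, 2.0f) == 2.0f);
  assert(!std::signbit(maximumnumF32(-0.0f, +0.0f)));
}
```

Unrolling applies this per element, which the patch prefers over expanding the full vector computation into vector compares and selects.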
Program itself is unused in that file, so just include the needed headers.
… clang (#160605) When cross-compiling the LLVM project as a whole (from llvm/), if it cannot find presupplied tools it will create a native build environment to build the tools it needs. However, when doing a standalone build of clang (that is, from clang/ and linking against an existing libLLVM) this doesn't work. Instead a _target_ binary is built, which predictably then fails.

The conventional workaround for this is to build the native tools in a separate native compile phase and pass the paths to the cross build; for example, see OpenEmbedded[1] or Nix[2]. But we can do better!

The first problem is that LLVM_USE_HOST_TOOLS is only set in the llvm/ CMakeLists.txt, so setup_host_tool() will never consider building a native binary. This can be solved by setting LLVM_USE_HOST_TOOLS based on CMAKE_CROSSCOMPILING in clang/CMakeLists.txt in the standalone case. Now setup_host_tool() will try to build a native tool, but it needs build_native_tool() from CrossCompile.cmake, so that also needs to be included. Finally, the native binary then fails because there's no provider for the dependency "CONFIGURE_Clang_NATIVE", so use llvm_create_cross_target to create the native environment. These few lines mirror what the lldb CMakeLists.txt does in the standalone case, so there is prior art for this.

[1] https://git.openembedded.org/openembedded-core/tree/meta/recipes-devtools/clang/clang_git.bb?id=e18d697e92b55e57124e80234369d46575226386#n212
[2] https://github.com/NixOS/nixpkgs/blob/3354d448f2a26117a74638957b0131ce3da9c8c4/pkgs/development/compilers/llvm/common/tblgen.nix#L54
…oisonForTargetNode - add X86ISD::VPERMV handling (#160845) X86ISD::VPERMV shuffles can't create undef/poison itself, allowing us to fold freeze(vpermps(x,y)) -> vpermps(freeze(x),freeze(y))
llvm.convert.to.fp16 and llvm.convert.from.fp16 are deprecated and no longer used, so they do not need to be tested any more.
Mostly mechanical changes to add the missing field.
…ertps(freeze(x),freeze(y),i) (#160852)
…oisonForTargetNode - add X86ISD::VPERMILPV handling (#160849) X86ISD::VPERMILPV shuffles can't create undef/poison itself, allowing us to fold freeze(vpermilps(x,y)) -> vpermilps(freeze(x),freeze(y))
@llvm/pr-subscribers-backend-loongarch

Author: ZhaoQi (zhaoqi5)

Changes: Patch is 39.04 KiB, truncated to 20.00 KiB below; full version: https://github.com/llvm/llvm-project/pull/160857.diff

4 Files Affected:
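Before the diff, here is a standalone sketch of the mask check the patch introduces (it mirrors the `checkReplaceOne` helper visible below; illustrative, not the patch code itself):

```cpp
#include <cassert>
#include <vector>

// Returns the index of the single lane of the identity mask
// {Base+0, ..., Base+N-1} that is replaced by 'Replaced' (undef lanes,
// encoded as -1, are ignored), or -1 if the mask has any other shape.
static int checkReplaceOne(const std::vector<int> &Mask, int Base,
                           int Replaced) {
  int Idx = -1;
  for (int I = 0, E = (int)Mask.size(); I != E; ++I) {
    if (Mask[I] == Base + I || Mask[I] == -1)
      continue;
    if (Mask[I] != Replaced || Idx != -1)
      return -1; // Wrong value, or more than one lane replaced.
    Idx = I;
  }
  return Idx;
}

int main() {
  // v8i32: lane 2 of V1 replaced by lane 0 of V2 -> xvinsve0.w $xd,$xj,2.
  assert(checkReplaceOne({0, 1, 8, 3, 4, 5, 6, 7}, 0, 8) == 2);
  // Two replaced lanes cannot be expressed by a single xvinsve0.
  assert(checkReplaceOne({8, 1, 8, 3, 4, 5, 6, 7}, 0, 8) == -1);
}
```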
diff --git a/llvm/lib/Target/LoongArch/LoongArchISelLowering.cpp b/llvm/lib/Target/LoongArch/LoongArchISelLowering.cpp
index 5d4a8fd080202..194f42995d55a 100644
--- a/llvm/lib/Target/LoongArch/LoongArchISelLowering.cpp
+++ b/llvm/lib/Target/LoongArch/LoongArchISelLowering.cpp
@@ -2317,6 +2317,54 @@ static SDValue lowerVECTOR_SHUFFLE_XVPICKOD(const SDLoc &DL, ArrayRef<int> Mask,
return DAG.getNode(LoongArchISD::VPICKOD, DL, VT, V2, V1);
}
+// Check if exactly one element of the Mask is replaced by 'Replaced', while
+// all other elements are either 'Base + i' or undef (-1). On success, return
+// the index of the replaced element. Otherwise, just return -1.
+static int checkReplaceOne(ArrayRef<int> Mask, int Base, int Replaced) {
+ int MaskSize = Mask.size();
+ int Idx = -1;
+ for (int i = 0; i < MaskSize; ++i) {
+ if (Mask[i] == Base + i || Mask[i] == -1)
+ continue;
+ if (Mask[i] != Replaced)
+ return -1;
+ if (Idx == -1)
+ Idx = i;
+ else
+ return -1;
+ }
+ return Idx;
+}
+
+/// Lower VECTOR_SHUFFLE into XVINSVE0 (if possible).
+static SDValue
+lowerVECTOR_SHUFFLE_XVINSVE0(const SDLoc &DL, ArrayRef<int> Mask, MVT VT,
+ SDValue V1, SDValue V2, SelectionDAG &DAG,
+ const LoongArchSubtarget &Subtarget) {
+ // LoongArch LASX only supports xvinsve0.{w/d}.
+ if (VT != MVT::v8i32 && VT != MVT::v8f32 && VT != MVT::v4i64 &&
+ VT != MVT::v4f64)
+ return SDValue();
+
+ MVT GRLenVT = Subtarget.getGRLenVT();
+ int MaskSize = Mask.size();
+ assert(MaskSize == (int)VT.getVectorNumElements() && "Unexpected mask size");
+
+ // Case 1: the lowest element of V2 replaces one element in V1.
+ int Idx = checkReplaceOne(Mask, 0, MaskSize);
+ if (Idx != -1)
+ return DAG.getNode(LoongArchISD::XVINSVE0, DL, VT, V1, V2,
+ DAG.getConstant(Idx, DL, GRLenVT));
+
+ // Case 2: the lowest element of V1 replaces one element in V2.
+ Idx = checkReplaceOne(Mask, MaskSize, 0);
+ if (Idx != -1)
+ return DAG.getNode(LoongArchISD::XVINSVE0, DL, VT, V2, V1,
+ DAG.getConstant(Idx, DL, GRLenVT));
+
+ return SDValue();
+}
+
/// Lower VECTOR_SHUFFLE into XVSHUF (if possible).
static SDValue lowerVECTOR_SHUFFLE_XVSHUF(const SDLoc &DL, ArrayRef<int> Mask,
MVT VT, SDValue V1, SDValue V2,
@@ -2593,6 +2641,9 @@ static SDValue lower256BitShuffle(const SDLoc &DL, ArrayRef<int> Mask, MVT VT,
if ((Result = lowerVECTOR_SHUFFLEAsShift(DL, Mask, VT, V1, V2, DAG, Subtarget,
Zeroable)))
return Result;
+ if ((Result =
+ lowerVECTOR_SHUFFLE_XVINSVE0(DL, Mask, VT, V1, V2, DAG, Subtarget)))
+ return Result;
if ((Result = lowerVECTOR_SHUFFLEAsByteRotate(DL, Mask, VT, V1, V2, DAG,
Subtarget)))
return Result;
@@ -7450,6 +7501,7 @@ const char *LoongArchTargetLowering::getTargetNodeName(unsigned Opcode) const {
NODE_NAME_CASE(XVPERM)
NODE_NAME_CASE(XVREPLVE0)
NODE_NAME_CASE(XVREPLVE0Q)
+ NODE_NAME_CASE(XVINSVE0)
NODE_NAME_CASE(VPICK_SEXT_ELT)
NODE_NAME_CASE(VPICK_ZEXT_ELT)
NODE_NAME_CASE(VREPLVE)
diff --git a/llvm/lib/Target/LoongArch/LoongArchISelLowering.h b/llvm/lib/Target/LoongArch/LoongArchISelLowering.h
index b2fccf59169ff..3e7ea5ebba79e 100644
--- a/llvm/lib/Target/LoongArch/LoongArchISelLowering.h
+++ b/llvm/lib/Target/LoongArch/LoongArchISelLowering.h
@@ -151,6 +151,7 @@ enum NodeType : unsigned {
XVPERM,
XVREPLVE0,
XVREPLVE0Q,
+ XVINSVE0,
// Extended vector element extraction
VPICK_SEXT_ELT,
diff --git a/llvm/lib/Target/LoongArch/LoongArchLASXInstrInfo.td b/llvm/lib/Target/LoongArch/LoongArchLASXInstrInfo.td
index adfe990ba1234..dfcbfff2a9a72 100644
--- a/llvm/lib/Target/LoongArch/LoongArchLASXInstrInfo.td
+++ b/llvm/lib/Target/LoongArch/LoongArchLASXInstrInfo.td
@@ -20,6 +20,7 @@ def loongarch_xvpermi: SDNode<"LoongArchISD::XVPERMI", SDT_LoongArchV1RUimm>;
def loongarch_xvperm: SDNode<"LoongArchISD::XVPERM", SDT_LoongArchXVPERM>;
def loongarch_xvreplve0: SDNode<"LoongArchISD::XVREPLVE0", SDT_LoongArchXVREPLVE0>;
def loongarch_xvreplve0q: SDNode<"LoongArchISD::XVREPLVE0Q", SDT_LoongArchXVREPLVE0>;
+def loongarch_xvinsve0 : SDNode<"LoongArchISD::XVINSVE0", SDT_LoongArchV2RUimm>;
def loongarch_xvmskltz: SDNode<"LoongArchISD::XVMSKLTZ", SDT_LoongArchVMSKCOND>;
def loongarch_xvmskgez: SDNode<"LoongArchISD::XVMSKGEZ", SDT_LoongArchVMSKCOND>;
def loongarch_xvmskeqz: SDNode<"LoongArchISD::XVMSKEQZ", SDT_LoongArchVMSKCOND>;
@@ -1708,6 +1709,14 @@ def : Pat<(vector_insert v4f64:$xd, (f64(bitconvert i64:$rj)), uimm2:$imm),
(XVINSGR2VR_D v4f64:$xd, GPR:$rj, uimm2:$imm)>;
// XVINSVE0_{W/D}
+def : Pat<(loongarch_xvinsve0 v8i32:$xd, v8i32:$xj, uimm3:$imm),
+ (XVINSVE0_W v8i32:$xd, v8i32:$xj, uimm3:$imm)>;
+def : Pat<(loongarch_xvinsve0 v4i64:$xd, v4i64:$xj, uimm2:$imm),
+ (XVINSVE0_D v4i64:$xd, v4i64:$xj, uimm2:$imm)>;
+def : Pat<(loongarch_xvinsve0 v8f32:$xd, v8f32:$xj, uimm3:$imm),
+ (XVINSVE0_W v8f32:$xd, v8f32:$xj, uimm3:$imm)>;
+def : Pat<(loongarch_xvinsve0 v4f64:$xd, v4f64:$xj, uimm2:$imm),
+ (XVINSVE0_D v4f64:$xd, v4f64:$xj, uimm2:$imm)>;
def : Pat<(vector_insert v8f32:$xd, FPR32:$fj, uimm3:$imm),
(XVINSVE0_W v8f32:$xd, (SUBREG_TO_REG(i64 0), FPR32:$fj, sub_32),
uimm3:$imm)>;
diff --git a/llvm/test/CodeGen/LoongArch/lasx/ir-instruction/shuffle-as-xvinsve0.ll b/llvm/test/CodeGen/LoongArch/lasx/ir-instruction/shuffle-as-xvinsve0.ll
index b6c9c4da05e5a..d5a7dbf5d57af 100644
--- a/llvm/test/CodeGen/LoongArch/lasx/ir-instruction/shuffle-as-xvinsve0.ll
+++ b/llvm/test/CodeGen/LoongArch/lasx/ir-instruction/shuffle-as-xvinsve0.ll
@@ -1,6 +1,6 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
-; RUN: llc --mtriple=loongarch32 --mattr=+32s,+lasx < %s | FileCheck %s --check-prefixes=CHECK,LA32
-; RUN: llc --mtriple=loongarch64 --mattr=+lasx < %s | FileCheck %s --check-prefixes=CHECK,LA64
+; RUN: llc --mtriple=loongarch32 --mattr=+32s,+lasx < %s | FileCheck %s
+; RUN: llc --mtriple=loongarch64 --mattr=+lasx < %s | FileCheck %s
;; xvinsve0.w
define void @xvinsve0_v8i32_l_0(ptr %d, ptr %a, ptr %b) nounwind {
@@ -8,10 +8,8 @@ define void @xvinsve0_v8i32_l_0(ptr %d, ptr %a, ptr %b) nounwind {
; CHECK: # %bb.0: # %entry
; CHECK-NEXT: xvld $xr0, $a1, 0
; CHECK-NEXT: xvld $xr1, $a2, 0
-; CHECK-NEXT: pcalau12i $a1, %pc_hi20(.LCPI0_0)
-; CHECK-NEXT: xvld $xr2, $a1, %pc_lo12(.LCPI0_0)
-; CHECK-NEXT: xvshuf.w $xr2, $xr1, $xr0
-; CHECK-NEXT: xvst $xr2, $a0, 0
+; CHECK-NEXT: xvinsve0.w $xr0, $xr1, 0
+; CHECK-NEXT: xvst $xr0, $a0, 0
; CHECK-NEXT: ret
entry:
%va = load <8 x i32>, ptr %a
@@ -26,10 +24,8 @@ define void @xvinsve0_v8i32_l_1(ptr %d, ptr %a, ptr %b) nounwind {
; CHECK: # %bb.0: # %entry
; CHECK-NEXT: xvld $xr0, $a1, 0
; CHECK-NEXT: xvld $xr1, $a2, 0
-; CHECK-NEXT: pcalau12i $a1, %pc_hi20(.LCPI1_0)
-; CHECK-NEXT: xvld $xr2, $a1, %pc_lo12(.LCPI1_0)
-; CHECK-NEXT: xvshuf.w $xr2, $xr1, $xr0
-; CHECK-NEXT: xvst $xr2, $a0, 0
+; CHECK-NEXT: xvinsve0.w $xr0, $xr1, 1
+; CHECK-NEXT: xvst $xr0, $a0, 0
; CHECK-NEXT: ret
entry:
%va = load <8 x i32>, ptr %a
@@ -44,10 +40,8 @@ define void @xvinsve0_v8i32_l_2(ptr %d, ptr %a, ptr %b) nounwind {
; CHECK: # %bb.0: # %entry
; CHECK-NEXT: xvld $xr0, $a1, 0
; CHECK-NEXT: xvld $xr1, $a2, 0
-; CHECK-NEXT: pcalau12i $a1, %pc_hi20(.LCPI2_0)
-; CHECK-NEXT: xvld $xr2, $a1, %pc_lo12(.LCPI2_0)
-; CHECK-NEXT: xvshuf.w $xr2, $xr1, $xr0
-; CHECK-NEXT: xvst $xr2, $a0, 0
+; CHECK-NEXT: xvinsve0.w $xr0, $xr1, 2
+; CHECK-NEXT: xvst $xr0, $a0, 0
; CHECK-NEXT: ret
entry:
%va = load <8 x i32>, ptr %a
@@ -62,10 +56,8 @@ define void @xvinsve0_v8i32_l_3(ptr %d, ptr %a, ptr %b) nounwind {
; CHECK: # %bb.0: # %entry
; CHECK-NEXT: xvld $xr0, $a1, 0
; CHECK-NEXT: xvld $xr1, $a2, 0
-; CHECK-NEXT: pcalau12i $a1, %pc_hi20(.LCPI3_0)
-; CHECK-NEXT: xvld $xr2, $a1, %pc_lo12(.LCPI3_0)
-; CHECK-NEXT: xvshuf.w $xr2, $xr1, $xr0
-; CHECK-NEXT: xvst $xr2, $a0, 0
+; CHECK-NEXT: xvinsve0.w $xr0, $xr1, 3
+; CHECK-NEXT: xvst $xr0, $a0, 0
; CHECK-NEXT: ret
entry:
%va = load <8 x i32>, ptr %a
@@ -76,52 +68,13 @@ entry:
}
define void @xvinsve0_v8i32_l_4(ptr %d, ptr %a, ptr %b) nounwind {
-; LA32-LABEL: xvinsve0_v8i32_l_4:
-; LA32: # %bb.0: # %entry
-; LA32-NEXT: ld.w $a2, $a2, 0
-; LA32-NEXT: xvld $xr0, $a1, 0
-; LA32-NEXT: vinsgr2vr.w $vr1, $a2, 0
-; LA32-NEXT: xvpickve2gr.w $a1, $xr0, 5
-; LA32-NEXT: vinsgr2vr.w $vr1, $a1, 1
-; LA32-NEXT: xvpickve2gr.w $a1, $xr0, 6
-; LA32-NEXT: vinsgr2vr.w $vr1, $a1, 2
-; LA32-NEXT: xvpickve2gr.w $a1, $xr0, 7
-; LA32-NEXT: vinsgr2vr.w $vr1, $a1, 3
-; LA32-NEXT: xvpickve2gr.w $a1, $xr0, 0
-; LA32-NEXT: vinsgr2vr.w $vr2, $a1, 0
-; LA32-NEXT: xvpickve2gr.w $a1, $xr0, 1
-; LA32-NEXT: vinsgr2vr.w $vr2, $a1, 1
-; LA32-NEXT: xvpickve2gr.w $a1, $xr0, 2
-; LA32-NEXT: vinsgr2vr.w $vr2, $a1, 2
-; LA32-NEXT: xvpickve2gr.w $a1, $xr0, 3
-; LA32-NEXT: vinsgr2vr.w $vr2, $a1, 3
-; LA32-NEXT: xvpermi.q $xr2, $xr1, 2
-; LA32-NEXT: xvst $xr2, $a0, 0
-; LA32-NEXT: ret
-;
-; LA64-LABEL: xvinsve0_v8i32_l_4:
-; LA64: # %bb.0: # %entry
-; LA64-NEXT: xvld $xr0, $a2, 0
-; LA64-NEXT: xvld $xr1, $a1, 0
-; LA64-NEXT: xvpickve2gr.w $a1, $xr0, 0
-; LA64-NEXT: vinsgr2vr.w $vr0, $a1, 0
-; LA64-NEXT: xvpickve2gr.w $a1, $xr1, 5
-; LA64-NEXT: vinsgr2vr.w $vr0, $a1, 1
-; LA64-NEXT: xvpickve2gr.w $a1, $xr1, 6
-; LA64-NEXT: vinsgr2vr.w $vr0, $a1, 2
-; LA64-NEXT: xvpickve2gr.w $a1, $xr1, 7
-; LA64-NEXT: vinsgr2vr.w $vr0, $a1, 3
-; LA64-NEXT: xvpickve2gr.w $a1, $xr1, 0
-; LA64-NEXT: vinsgr2vr.w $vr2, $a1, 0
-; LA64-NEXT: xvpickve2gr.w $a1, $xr1, 1
-; LA64-NEXT: vinsgr2vr.w $vr2, $a1, 1
-; LA64-NEXT: xvpickve2gr.w $a1, $xr1, 2
-; LA64-NEXT: vinsgr2vr.w $vr2, $a1, 2
-; LA64-NEXT: xvpickve2gr.w $a1, $xr1, 3
-; LA64-NEXT: vinsgr2vr.w $vr2, $a1, 3
-; LA64-NEXT: xvpermi.q $xr2, $xr0, 2
-; LA64-NEXT: xvst $xr2, $a0, 0
-; LA64-NEXT: ret
+; CHECK-LABEL: xvinsve0_v8i32_l_4:
+; CHECK: # %bb.0: # %entry
+; CHECK-NEXT: xvld $xr0, $a1, 0
+; CHECK-NEXT: xvld $xr1, $a2, 0
+; CHECK-NEXT: xvinsve0.w $xr0, $xr1, 4
+; CHECK-NEXT: xvst $xr0, $a0, 0
+; CHECK-NEXT: ret
entry:
%va = load <8 x i32>, ptr %a
%vb = load <8 x i32>, ptr %b
@@ -131,52 +84,13 @@ entry:
}
define void @xvinsve0_v8i32_l_5(ptr %d, ptr %a, ptr %b) nounwind {
-; LA32-LABEL: xvinsve0_v8i32_l_5:
-; LA32: # %bb.0: # %entry
-; LA32-NEXT: xvld $xr0, $a1, 0
-; LA32-NEXT: ld.w $a1, $a2, 0
-; LA32-NEXT: xvpickve2gr.w $a2, $xr0, 4
-; LA32-NEXT: vinsgr2vr.w $vr1, $a2, 0
-; LA32-NEXT: vinsgr2vr.w $vr1, $a1, 1
-; LA32-NEXT: xvpickve2gr.w $a1, $xr0, 6
-; LA32-NEXT: vinsgr2vr.w $vr1, $a1, 2
-; LA32-NEXT: xvpickve2gr.w $a1, $xr0, 7
-; LA32-NEXT: vinsgr2vr.w $vr1, $a1, 3
-; LA32-NEXT: xvpickve2gr.w $a1, $xr0, 0
-; LA32-NEXT: vinsgr2vr.w $vr2, $a1, 0
-; LA32-NEXT: xvpickve2gr.w $a1, $xr0, 1
-; LA32-NEXT: vinsgr2vr.w $vr2, $a1, 1
-; LA32-NEXT: xvpickve2gr.w $a1, $xr0, 2
-; LA32-NEXT: vinsgr2vr.w $vr2, $a1, 2
-; LA32-NEXT: xvpickve2gr.w $a1, $xr0, 3
-; LA32-NEXT: vinsgr2vr.w $vr2, $a1, 3
-; LA32-NEXT: xvpermi.q $xr2, $xr1, 2
-; LA32-NEXT: xvst $xr2, $a0, 0
-; LA32-NEXT: ret
-;
-; LA64-LABEL: xvinsve0_v8i32_l_5:
-; LA64: # %bb.0: # %entry
-; LA64-NEXT: xvld $xr0, $a1, 0
-; LA64-NEXT: xvld $xr1, $a2, 0
-; LA64-NEXT: xvpickve2gr.w $a1, $xr0, 4
-; LA64-NEXT: vinsgr2vr.w $vr2, $a1, 0
-; LA64-NEXT: xvpickve2gr.w $a1, $xr1, 0
-; LA64-NEXT: vinsgr2vr.w $vr2, $a1, 1
-; LA64-NEXT: xvpickve2gr.w $a1, $xr0, 6
-; LA64-NEXT: vinsgr2vr.w $vr2, $a1, 2
-; LA64-NEXT: xvpickve2gr.w $a1, $xr0, 7
-; LA64-NEXT: vinsgr2vr.w $vr2, $a1, 3
-; LA64-NEXT: xvpickve2gr.w $a1, $xr0, 0
-; LA64-NEXT: vinsgr2vr.w $vr1, $a1, 0
-; LA64-NEXT: xvpickve2gr.w $a1, $xr0, 1
-; LA64-NEXT: vinsgr2vr.w $vr1, $a1, 1
-; LA64-NEXT: xvpickve2gr.w $a1, $xr0, 2
-; LA64-NEXT: vinsgr2vr.w $vr1, $a1, 2
-; LA64-NEXT: xvpickve2gr.w $a1, $xr0, 3
-; LA64-NEXT: vinsgr2vr.w $vr1, $a1, 3
-; LA64-NEXT: xvpermi.q $xr1, $xr2, 2
-; LA64-NEXT: xvst $xr1, $a0, 0
-; LA64-NEXT: ret
+; CHECK-LABEL: xvinsve0_v8i32_l_5:
+; CHECK: # %bb.0: # %entry
+; CHECK-NEXT: xvld $xr0, $a1, 0
+; CHECK-NEXT: xvld $xr1, $a2, 0
+; CHECK-NEXT: xvinsve0.w $xr0, $xr1, 5
+; CHECK-NEXT: xvst $xr0, $a0, 0
+; CHECK-NEXT: ret
entry:
%va = load <8 x i32>, ptr %a
%vb = load <8 x i32>, ptr %b
@@ -186,52 +100,13 @@ entry:
}
define void @xvinsve0_v8i32_l_6(ptr %d, ptr %a, ptr %b) nounwind {
-; LA32-LABEL: xvinsve0_v8i32_l_6:
-; LA32: # %bb.0: # %entry
-; LA32-NEXT: xvld $xr0, $a1, 0
-; LA32-NEXT: xvpickve2gr.w $a1, $xr0, 4
-; LA32-NEXT: ld.w $a2, $a2, 0
-; LA32-NEXT: vinsgr2vr.w $vr1, $a1, 0
-; LA32-NEXT: xvpickve2gr.w $a1, $xr0, 5
-; LA32-NEXT: vinsgr2vr.w $vr1, $a1, 1
-; LA32-NEXT: vinsgr2vr.w $vr1, $a2, 2
-; LA32-NEXT: xvpickve2gr.w $a1, $xr0, 7
-; LA32-NEXT: vinsgr2vr.w $vr1, $a1, 3
-; LA32-NEXT: xvpickve2gr.w $a1, $xr0, 0
-; LA32-NEXT: vinsgr2vr.w $vr2, $a1, 0
-; LA32-NEXT: xvpickve2gr.w $a1, $xr0, 1
-; LA32-NEXT: vinsgr2vr.w $vr2, $a1, 1
-; LA32-NEXT: xvpickve2gr.w $a1, $xr0, 2
-; LA32-NEXT: vinsgr2vr.w $vr2, $a1, 2
-; LA32-NEXT: xvpickve2gr.w $a1, $xr0, 3
-; LA32-NEXT: vinsgr2vr.w $vr2, $a1, 3
-; LA32-NEXT: xvpermi.q $xr2, $xr1, 2
-; LA32-NEXT: xvst $xr2, $a0, 0
-; LA32-NEXT: ret
-;
-; LA64-LABEL: xvinsve0_v8i32_l_6:
-; LA64: # %bb.0: # %entry
-; LA64-NEXT: xvld $xr0, $a1, 0
-; LA64-NEXT: xvld $xr1, $a2, 0
-; LA64-NEXT: xvpickve2gr.w $a1, $xr0, 4
-; LA64-NEXT: vinsgr2vr.w $vr2, $a1, 0
-; LA64-NEXT: xvpickve2gr.w $a1, $xr0, 5
-; LA64-NEXT: vinsgr2vr.w $vr2, $a1, 1
-; LA64-NEXT: xvpickve2gr.w $a1, $xr1, 0
-; LA64-NEXT: vinsgr2vr.w $vr2, $a1, 2
-; LA64-NEXT: xvpickve2gr.w $a1, $xr0, 7
-; LA64-NEXT: vinsgr2vr.w $vr2, $a1, 3
-; LA64-NEXT: xvpickve2gr.w $a1, $xr0, 0
-; LA64-NEXT: vinsgr2vr.w $vr1, $a1, 0
-; LA64-NEXT: xvpickve2gr.w $a1, $xr0, 1
-; LA64-NEXT: vinsgr2vr.w $vr1, $a1, 1
-; LA64-NEXT: xvpickve2gr.w $a1, $xr0, 2
-; LA64-NEXT: vinsgr2vr.w $vr1, $a1, 2
-; LA64-NEXT: xvpickve2gr.w $a1, $xr0, 3
-; LA64-NEXT: vinsgr2vr.w $vr1, $a1, 3
-; LA64-NEXT: xvpermi.q $xr1, $xr2, 2
-; LA64-NEXT: xvst $xr1, $a0, 0
-; LA64-NEXT: ret
+; CHECK-LABEL: xvinsve0_v8i32_l_6:
+; CHECK: # %bb.0: # %entry
+; CHECK-NEXT: xvld $xr0, $a1, 0
+; CHECK-NEXT: xvld $xr1, $a2, 0
+; CHECK-NEXT: xvinsve0.w $xr0, $xr1, 6
+; CHECK-NEXT: xvst $xr0, $a0, 0
+; CHECK-NEXT: ret
entry:
%va = load <8 x i32>, ptr %a
%vb = load <8 x i32>, ptr %b
@@ -241,52 +116,13 @@ entry:
}
define void @xvinsve0_v8i32_l_7(ptr %d, ptr %a, ptr %b) nounwind {
-; LA32-LABEL: xvinsve0_v8i32_l_7:
-; LA32: # %bb.0: # %entry
-; LA32-NEXT: xvld $xr0, $a1, 0
-; LA32-NEXT: xvpickve2gr.w $a1, $xr0, 4
-; LA32-NEXT: vinsgr2vr.w $vr1, $a1, 0
-; LA32-NEXT: xvpickve2gr.w $a1, $xr0, 5
-; LA32-NEXT: ld.w $a2, $a2, 0
-; LA32-NEXT: vinsgr2vr.w $vr1, $a1, 1
-; LA32-NEXT: xvpickve2gr.w $a1, $xr0, 6
-; LA32-NEXT: vinsgr2vr.w $vr1, $a1, 2
-; LA32-NEXT: vinsgr2vr.w $vr1, $a2, 3
-; LA32-NEXT: xvpickve2gr.w $a1, $xr0, 0
-; LA32-NEXT: vinsgr2vr.w $vr2, $a1, 0
-; LA32-NEXT: xvpickve2gr.w $a1, $xr0, 1
-; LA32-NEXT: vinsgr2vr.w $vr2, $a1, 1
-; LA32-NEXT: xvpickve2gr.w $a1, $xr0, 2
-; LA32-NEXT: vinsgr2vr.w $vr2, $a1, 2
-; LA32-NEXT: xvpickve2gr.w $a1, $xr0, 3
-; LA32-NEXT: vinsgr2vr.w $vr2, $a1, 3
-; LA32-NEXT: xvpermi.q $xr2, $xr1, 2
-; LA32-NEXT: xvst $xr2, $a0, 0
-; LA32-NEXT: ret
-;
-; LA64-LABEL: xvinsve0_v8i32_l_7:
-; LA64: # %bb.0: # %entry
-; LA64-NEXT: xvld $xr0, $a1, 0
-; LA64-NEXT: xvld $xr1, $a2, 0
-; LA64-NEXT: xvpickve2gr.w $a1, $xr0, 4
-; LA64-NEXT: vinsgr2vr.w $vr2, $a1, 0
-; LA64-NEXT: xvpickve2gr.w $a1, $xr0, 5
-; LA64-NEXT: vinsgr2vr.w $vr2, $a1, 1
-; LA64-NEXT: xvpickve2gr.w $a1, $xr0, 6
-; LA64-NEXT: vinsgr2vr.w $vr2, $a1, 2
-; LA64-NEXT: xvpickve2gr.w $a1, $xr1, 0
-; LA64-NEXT: vinsgr2vr.w $vr2, $a1, 3
-; LA64-NEXT: xvpickve2gr.w $a1, $xr0, 0
-; LA64-NEXT: vinsgr2vr.w $vr1, $a1, 0
-; LA64-NEXT: xvpickve2gr.w $a1, $xr0, 1
-; LA64-NEXT: vinsgr2vr.w $vr1, $a1, 1
-; LA64-NEXT: xvpickve2gr.w $a1, $xr0, 2
-; LA64-NEXT: vinsgr2vr.w $vr1, $a1, 2
-; LA64-NEXT: xvpickve2gr.w $a1, $xr0, 3
-; LA64-NEXT: vinsgr2vr.w $vr1, $a1, 3
-; LA64-NEXT: xvpermi.q $xr1, $xr2, 2
-; LA64-NEXT: xvst $xr1, $a0, 0
-; LA64-NEXT: ret
+; CHECK-LABEL: xvinsve0_v8i32_l_7:
+; CHECK: # %bb.0: # %entry
+; CHECK-NEXT: xvld $xr0, $a1, 0
+; CHECK-NEXT: xvld $xr1, $a2, 0
+; CHECK-NEXT: xvinsve0.w $xr0, $xr1, 7
+; CHECK-NEXT: xvst $xr0, $a0, 0
+; CHECK-NEXT: ret
entry:
%va = load <8 x i32>, ptr %a
%vb = load <8 x i32>, ptr %b
@@ -300,10 +136,8 @@ define void @xvinsve0_v8f32_l(ptr %d, ptr %a, ptr %b) nounwind {
; CHECK: # %bb.0: # %entry
; CHECK-NEXT: xvld $xr0, $a1, 0
; CHECK-NEXT: xvld $xr1, $a2, 0
-; CHECK-NEXT: pcalau12i $a1, %pc_hi20(.LCPI8_0)
-; CHECK-NEXT: xvld $xr2, $a1, %pc_lo12(.LCPI8_0)
-; CHECK-NEXT: xvshuf.w $xr2, $xr1, $xr0
-; CHECK-NEXT: xvst $xr2, $a0, 0
+; CHECK-NEXT: xvinsve0.w $xr0, $xr1, 0
+; CHECK-NEXT: xvst $xr0, $a0, 0
; CHECK-NEXT: ret
entry:
%va = load <8 x float>, ptr %a
@@ -318,10 +152,8 @@ define void @xvinsve0_v8i32_h_0(ptr %d, ptr %a, ptr %b) nounwind {
; CHECK: # %bb.0: # %entry
; CHECK-NEXT: xvld $xr0, $a1, 0
; CHECK-NEXT: xvld $xr1, $a2, 0
-; CHECK-NEXT: pcalau12i $a1, %pc_hi20(.LCPI9_0)
-; CHECK-NEXT: xvld $xr2, $a1, %pc_lo12(.LCPI9_0)
-; CHECK-NEXT: xvshuf.w $xr2, $xr1, $xr0
-; CHECK-NEXT: xvst $xr2, $a0, 0
+; CHECK-NEXT: xvinsve0.w $xr1, $xr0, 0
+; CHECK-NEXT: xvst $xr1, $a0, 0
; CHECK-NEXT: ret
entry:
%va = load <8 x i32>, ptr %a
@@ -336,10 +168,8 @@ define void @xvinsve0_v8i32_h_1(ptr %d, ptr %a, ptr %b) nounwind {
; CHECK: # %bb.0: # %entry
; CHECK-NEXT: xvld $xr0, $a1, 0
; CHECK-NEXT: xvld $xr1, $a2, 0
-; CHECK-NEXT: pcalau12i $a1, %pc_hi20(.LCPI10_0)
-; CHECK-NEXT: xvld $xr2, $a1, %pc_lo12(.LCPI10_0)
-; CHECK-NEXT: xvshuf.w $xr2, $xr1, $xr0
-; CHECK-NEXT: xvst $xr2, $a0, 0
+; CHECK-NEXT: xvinsve0.w $xr1, $xr0, 1
+; CHECK-NEXT: xvst $xr1, $a0, 0
; CHECK-NEXT: ret
entry:
%va = load <8 x i32>, ptr %a
@@ -354,10 +184,8 @@ define void @xvinsve0_v8i32_h_2(ptr %d, ptr %a, ptr %b) nounwind {
; CHECK: # %bb.0: # %entry
; CHECK-NEXT: xvld $xr0, $a1, 0
; CHECK-NEXT: xvld $xr1, $a2, 0
-; CHECK-NEXT: pcalau12i $a1, %pc_hi20(.LCPI11_0)
-; CHECK-NEXT: xvld $xr2, $a1, %pc_lo12(.LCPI11_0)
-; CHECK-NEXT: xvshuf.w $xr2, $xr1, $xr0
-; CHECK-NEXT: xvst $xr2, $a0, 0
+; CHECK-NEXT: xvinsve0.w $xr1, $xr0, 2
+; CHECK-NEXT: xvst $xr1, $a0, 0
; CHECK-NEXT: ret
entry:
%va = load <8 x i32>, ptr %a
@@ -372,10 +200,8 @@ define void @xvinsve0_v8i32_h_3(ptr %d, ptr %a, ptr %b) nounwind {
; CHECK: # %bb.0: # %entry
; CHECK-NEXT: xvld $xr0, $a1, 0
; CHECK-NEXT: xvld $xr1, $a2, 0
-; CHECK-NEXT: pcalau12i $a1, %pc_hi20(.LCPI12_0)
-; CHECK-NEXT: xvld $xr2, $a1, %pc_lo12(.LCPI12_0)
-; CHECK-NEXT: xvshuf.w $xr2, $xr1, $xr0
-; CHECK-NEXT: xvst $xr2, $a0, 0
+; CHECK-NEXT: xvinsve0.w $xr1, $xr0, 3
+; CHECK-NEXT: xvst $xr1, $a0, 0
; CHECK-NEXT: ret
entry:
%va = load <8 x i32>, ptr %a
@@ -386,52 +212,13 @@ entry:
}
define void @xvinsve0_v8i32_h_4(ptr %d, ptr %a, ptr %b) nounwind {
-; LA32-LABEL: xvinsve0_v8i32_h_4:
-; LA32: # %bb.0: # %entry
-; LA32-NEXT: ld.w $a1, $a1, 0
-; LA32-NEXT: xvld $xr0, $a2, 0
-; LA32-NEXT: vinsgr2vr.w $vr1, $a1, 0
-; LA32-NEXT: xvpickve2gr.w $a1, $xr0, 5
-; LA32-NEXT: vinsgr2vr.w $vr1, $a1, 1
-; LA32-NEXT: xvpickve2gr.w $a1, $xr0, 6
-; LA32-NEXT: vinsgr2vr.w $vr1, $a1, 2
-; LA32-NEXT: xvpickve2gr.w $a1, $xr0, 7
-; LA32-NEXT: vinsgr2vr.w $vr1, $a1, 3
-; LA32-NEX...
[truncated]
zhaoqi5 (Author): ... Sorry for my mistake. Please forgive me for disturbing all of you.