Skip to content

Conversation

@abhishek-kaushik22
Copy link
Contributor

@abhishek-kaushik22 abhishek-kaushik22 commented Jan 2, 2025

When lowering _mm512_mul_epu32 intrinsic if the generated value if later used in a vector shuffle we generate vpmullq instead of vpmuludq (https://godbolt.org/z/WbaGMqs8e) because SimplifyDemandedVectorElts simplifies the arguments and we fail the combine to PMULDQ.

Added an override to shouldSimplifyDemandedVectorElts in X86TargetLowering to check if we can combine the MUL to PMULUDQ first.

When lowering `_mm512_mul_epu32` intrinsic if the generated value if later used in a vector shuffle we generate `vpmullq` instead of `vpmuludq` (https://godbolt.org/z/WbaGMqs8e) because `SimplifyDemandedVectorElts` simplifies the arguments and we fail the combine to `PMULDQ`.

Added an override to `shouldSimplifyDemandedVectorElts` in `X86TargetLowering` to check if we can combine the `MUL` to `PMULDQ` first.
@llvmbot
Copy link
Member

llvmbot commented Jan 2, 2025

@llvm/pr-subscribers-backend-x86

Author: None (abhishek-kaushik22)

Changes

When lowering _mm512_mul_epu32 intrinsic if the generated value if later used in a vector shuffle we generate vpmullq instead of vpmuludq (https://godbolt.org/z/WbaGMqs8e) because SimplifyDemandedVectorElts simplifies the arguments and we fail the combine to PMULDQ.

Added an override to shouldSimplifyDemandedVectorElts in X86TargetLowering to check if we can combine the MUL to PMULDQ first.


Full diff: https://github.com/llvm/llvm-project/pull/121456.diff

2 Files Affected:

  • (modified) llvm/lib/Target/X86/X86ISelLowering.cpp (+21)
  • (modified) llvm/lib/Target/X86/X86ISelLowering.h (+3)
diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index a0514e93d6598b..e104264bcbf918 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -60832,3 +60832,24 @@ Align X86TargetLowering::getPrefLoopAlignment(MachineLoop *ML) const {
     return Align(1ULL << ExperimentalPrefInnermostLoopAlignment);
   return TargetLowering::getPrefLoopAlignment();
 }
+
+bool X86TargetLowering::shouldSimplifyDemandedVectorElts(
+    SDValue Op, const TargetLoweringOpt &TLO) const {
+  if (Op.getOpcode() == ISD::VECTOR_SHUFFLE) {
+    SDValue V0 = peekThroughBitcasts(Op.getOperand(0));
+    SDValue V1 = peekThroughBitcasts(Op.getOperand(1));
+
+    if (V0.getOpcode() == ISD::MUL || V1.getOpcode() == ISD::MUL) {
+      SDNode *Mul = V0.getOpcode() == ISD::MUL ? V0.getNode() : V1.getNode();
+      SelectionDAG &DAG = TLO.DAG;
+      const X86Subtarget &Subtarget = DAG.getSubtarget<X86Subtarget>();
+      const SDLoc DL(Mul);
+
+      if (SDValue V = combineMulToPMULDQ(Mul, DL, DAG, Subtarget)) {
+        DAG.ReplaceAllUsesWith(Mul, V.getNode());
+        return false;
+      }
+    }
+  }
+  return true;
+}
diff --git a/llvm/lib/Target/X86/X86ISelLowering.h b/llvm/lib/Target/X86/X86ISelLowering.h
index 2b7a8eaf249d83..0a6cd53f557bb2 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.h
+++ b/llvm/lib/Target/X86/X86ISelLowering.h
@@ -1207,6 +1207,9 @@ namespace llvm {
 
     bool hasBitTest(SDValue X, SDValue Y) const override;
 
+    bool shouldSimplifyDemandedVectorElts(
+        SDValue Op, const TargetLoweringOpt &TLO) const override;
+
     bool shouldProduceAndByConstByHoistingConstFromShiftsLHSOfAnd(
         SDValue X, ConstantSDNode *XC, ConstantSDNode *CC, SDValue Y,
         unsigned OldShiftOpcode, unsigned NewShiftOpcode,

@abhishek-kaushik22
Copy link
Contributor Author

@phoebewang @e-kud @RKSimon can you please review?

@phoebewang
Copy link
Contributor

because SimplifyDemandedVectorElts simplifies the arguments and we fail the combine to PMULDQ.

Is it possible to combine the new patten instead of disble SimplifyDemandedVectorElts?

@abhishek-kaushik22
Copy link
Contributor Author

because SimplifyDemandedVectorElts simplifies the arguments and we fail the combine to PMULDQ.

Is it possible to combine the new patten instead of disble SimplifyDemandedVectorElts?

I'm not sure. We start with

t2: v8i64,ch = CopyFromReg t0, Register:v8i64 %0
t4: v8i64,ch = CopyFromReg t0, Register:v8i64 %1
t6: v8i64 = BUILD_VECTOR Constant:i64<4294967295>, Constant:i64<4294967295>, Constant:i64<4294967295>, Constant:i64<4294967295>, Constant:i64<4294967295>, Constant:i64<4294967295>, Constant:i64<4294967295>, Constant:i64<4294967295>
t7: v8i64 = and t2, t6
t8: v8i64 = and t4, t6
t9: v8i64 = mul nuw t8, t7

and this gets replaced by

t2: v8i64,ch = CopyFromReg t0, Register:v8i64 %0
t4: v8i64,ch = CopyFromReg t0, Register:v8i64 %1
t9: v8i64 = mul t4, t2

Is this safe to replace with PMULUDQ? We don't have any info about the mul being unsigned here.

@RKSimon RKSimon self-requested a review January 2, 2025 11:22
Copy link
Collaborator

@RKSimon RKSimon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure shouldSimplifyDemandedVectorElts is the best way to handle this tbh

%4 = shufflevector <16 x i32> <i32 0, i32 poison, i32 0, i32 poison, i32 0, i32 poison, i32 0, i32 poison, i32 0, i32 poison, i32 0, i32 poison, i32 0, i32 poison, i32 0, i32 poison>, <16 x i32> %3, <16 x i32> <i32 0, i32 16, i32 2, i32 18, i32 4, i32 20, i32 6, i32 22, i32 8, i32 24, i32 10, i32 26, i32 12, i32 28, i32 14, i32 30>
%5 = bitcast <16 x i32> %4 to <8 x i64>
ret <8 x i64> %5
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't this be happening for v2i64/v4i64 as well on avx512dq targets? Please can you add test coverage for those as well.

; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
; RUN: llc < %s -mtriple=x86_64-- -mattr=+avx512dq | FileCheck %s

define <8 x i64> @pr121456(<8 x i64> %a, <8 x i64> %b) {
Copy link
Collaborator

@RKSimon RKSimon Jan 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't name the test file after a pull request - "pr" is the old llvm term for problem report - the number should be based off a reported issue number (is there one?) - otherwise I'd probably suggest adding these tests to combine-pmuldq.ll instead


bool X86TargetLowering::shouldSimplifyDemandedVectorElts(
SDValue Op, const TargetLoweringOpt &TLO) const {
if (Op.getOpcode() == ISD::VECTOR_SHUFFLE) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't going to work in the general case, it will just help shuffles:
https://llvm.godbolt.org/z/no1Gc6fzT

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants