
Conversation

MatzeB (Contributor) commented Oct 21, 2024

Improve cost modeling for x86 __fp16 conversions so the SLPVectorizer
vectorizes scalar fp16 conversion patterns (a sketch of such a pattern
follows the list):

- `setOperationAction` of `ISD::STORE` for v4f16, v8f16 and v16f16 to `Custom` so
  `TargetTransformInfo::getStoreMinimumVF` reports them as acceptable.
- Add missing cost entries to `X86TTIImpl::getCastInstrCost` for
  conversions from/to fp16. Note that conversion from f64 to f16 is not
  supported by a single X86 instruction.
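
For context, a minimal sketch of the kind of scalar pattern this lets the
SLPVectorizer vectorize on F16C targets. The functions below are hypothetical
(not from the patch), assume Clang's `__fp16` extension, and mirror the
straight-line code in the new test file:

    // Widening: with the new cost entries, SLP can turn the four scalar
    // conversions into one <4 x half> load, a <4 x float> fpext, and a
    // <4 x float> store (vcvtph2ps on F16C).
    void widen4(const __fp16 *s, float *d) {
      d[0] = s[0];
      d[1] = s[1];
      d[2] = s[2];
      d[3] = s[3];
    }

    // Narrowing: the vectorized fptrunc+store maps to vcvtps2ph via the new
    // Custom store lowering.
    void narrow4(const float *s, __fp16 *d) {
      d[0] = (__fp16)s[0];
      d[1] = (__fp16)s[1];
      d[2] = (__fp16)s[2];
      d[3] = (__fp16)s[3];
    }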
llvmbot (Member) commented Oct 21, 2024

@llvm/pr-subscribers-llvm-transforms

Author: Matthias Braun (MatzeB)

Changes

Improve cost modeling for x86 __fp16 conversions so the SLPVectorizer
vectorizes scalar fp16 conversion patterns:

  • setOperationAction of ISD::STORE for v4f16, v8f16 and v16f16 to Custom so
    TargetTransformInfo::getStoreMinimumVF reports them as acceptable.
  • Add missing cost entries to X86TTIImpl::getCastInstrCost for
    conversions from/to fp16. Note that conversion from f64 to f16 is not
    supported by a single X86 instruction.

Patch is 35.46 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/113195.diff

3 Files Affected:

  • (modified) llvm/lib/Target/X86/X86ISelLowering.cpp (+6)
  • (modified) llvm/lib/Target/X86/X86TargetTransformInfo.cpp (+25)
  • (added) llvm/test/Transforms/SLPVectorizer/X86/conversion-fp16.ll (+601)
diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index bcb84add65d83e..da88a1a0a5a3b8 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -1714,6 +1714,9 @@ X86TargetLowering::X86TargetLowering(const X86TargetMachine &TM,
       setOperationPromotedToType(Opc, MVT::v8f16, MVT::v8f32);
       setOperationPromotedToType(Opc, MVT::v16f16, MVT::v16f32);
     }
+    // trunc+store via vcvtps2ph
+    setOperationAction(ISD::STORE, MVT::v4f16, Custom);
+    setOperationAction(ISD::STORE, MVT::v8f16, Custom);
   }
 
   // This block controls legalization of the mask vector sizes that are
@@ -1784,6 +1787,9 @@ X86TargetLowering::X86TargetLowering(const X86TargetMachine &TM,
 
     for (auto VT : { MVT::v1i1, MVT::v2i1, MVT::v4i1, MVT::v8i1 })
       setOperationAction(ISD::EXTRACT_SUBVECTOR, VT, Custom);
+
+    // trunc+store via vcvtps2ph
+    setOperationAction(ISD::STORE, MVT::v16f16, Custom);
   }
   if (Subtarget.hasDQI() && Subtarget.hasVLX()) {
     for (MVT VT : {MVT::v4f32, MVT::v8f32, MVT::v2f64, MVT::v4f64}) {
diff --git a/llvm/lib/Target/X86/X86TargetTransformInfo.cpp b/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
index 413ef0136d5c06..2d2c804ed46e54 100644
--- a/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
+++ b/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
@@ -2296,7 +2296,10 @@ InstructionCost X86TTIImpl::getCastInstrCost(unsigned Opcode, Type *Dst,
     { ISD::FP_EXTEND, MVT::v8f64,   MVT::v8f32,   { 1, 1, 1, 1 } },
     { ISD::FP_EXTEND, MVT::v8f64,   MVT::v16f32,  { 3, 1, 1, 1 } },
     { ISD::FP_EXTEND, MVT::v16f64,  MVT::v16f32,  { 4, 1, 1, 1 } }, // 2*vcvtps2pd+vextractf64x4
+    { ISD::FP_EXTEND, MVT::v16f32,  MVT::v16f16,  { 1, 1, 1, 1 } }, // vcvtph2ps
+    { ISD::FP_EXTEND, MVT::v8f64,   MVT::v8f16,   { 2, 1, 1, 1 } }, // vcvtph2ps+vcvtps2pd
     { ISD::FP_ROUND,  MVT::v8f32,   MVT::v8f64,   { 1, 1, 1, 1 } },
+    { ISD::FP_ROUND,  MVT::v16f16,  MVT::v16f32,  { 1, 1, 1, 1 } }, // vcvtps2ph
 
     { ISD::TRUNCATE,  MVT::v2i1,    MVT::v2i8,    { 3, 1, 1, 1 } }, // sext+vpslld+vptestmd
     { ISD::TRUNCATE,  MVT::v4i1,    MVT::v4i8,    { 3, 1, 1, 1 } }, // sext+vpslld+vptestmd
@@ -2973,6 +2976,14 @@ InstructionCost X86TTIImpl::getCastInstrCost(unsigned Opcode, Type *Dst,
     { ISD::TRUNCATE,    MVT::v4i32,  MVT::v2i64,  { 1, 1, 1, 1 } }, // PSHUFD
   };
 
+  static const TypeConversionCostKindTblEntry F16ConversionTbl[] = {
+    { ISD::FP_ROUND,  MVT::v8f16,   MVT::v8f32,   { 1, 1, 1, 1 } }, // vcvtps2ph
+    { ISD::FP_ROUND,  MVT::v4f16,   MVT::v4f32,   { 1, 1, 1, 1 } }, // vcvtps2ph
+    { ISD::FP_EXTEND, MVT::v8f32,   MVT::v8f16,   { 1, 1, 1, 1 } }, // vcvtph2ps
+    { ISD::FP_EXTEND, MVT::v4f32,   MVT::v4f16,   { 1, 1, 1, 1 } }, // vcvtph2ps
+    { ISD::FP_EXTEND, MVT::v4f64,   MVT::v4f16,   { 2, 1, 1, 1 } }, // vcvtph2ps+vcvtps2pd
+  };
+
   // Attempt to map directly to (simple) MVT types to let us match custom entries.
   EVT SrcTy = TLI->getValueType(DL, Src);
   EVT DstTy = TLI->getValueType(DL, Dst);
@@ -3034,6 +3045,13 @@ InstructionCost X86TTIImpl::getCastInstrCost(unsigned Opcode, Type *Dst,
           return *KindCost;
     }
 
+    if (ST->hasF16C()) {
+      if (const auto *Entry = ConvertCostTableLookup(F16ConversionTbl, ISD,
+                                                     SimpleDstTy, SimpleSrcTy))
+        if (auto KindCost = Entry->Cost[CostKind])
+          return *KindCost;
+    }
+
     if (ST->hasSSE41()) {
       if (const auto *Entry = ConvertCostTableLookup(SSE41ConversionTbl, ISD,
                                                      SimpleDstTy, SimpleSrcTy))
@@ -3107,6 +3125,13 @@ InstructionCost X86TTIImpl::getCastInstrCost(unsigned Opcode, Type *Dst,
       if (auto KindCost = Entry->Cost[CostKind])
         return std::max(LTSrc.first, LTDest.first) * *KindCost;
 
+  if (ST->hasF16C()) {
+    if (const auto *Entry = ConvertCostTableLookup(F16ConversionTbl, ISD,
+                                                   LTDest.second, LTSrc.second))
+      if (auto KindCost = Entry->Cost[CostKind])
+        return std::max(LTSrc.first, LTDest.first) * *KindCost;
+  }
+
   if (ST->hasSSE41())
     if (const auto *Entry = ConvertCostTableLookup(SSE41ConversionTbl, ISD,
                                                    LTDest.second, LTSrc.second))
diff --git a/llvm/test/Transforms/SLPVectorizer/X86/conversion-fp16.ll b/llvm/test/Transforms/SLPVectorizer/X86/conversion-fp16.ll
new file mode 100644
index 00000000000000..1d5dee6cb8121c
--- /dev/null
+++ b/llvm/test/Transforms/SLPVectorizer/X86/conversion-fp16.ll
@@ -0,0 +1,601 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+; RUN: opt < %s -mtriple=x86_64-- -passes=slp-vectorizer -S -mattr=+avx2 | FileCheck %s --check-prefix=CHECK
+; RUN: opt < %s -mtriple=x86_64-- -passes=slp-vectorizer -S -mattr=+avx2 -mattr=+f16c | FileCheck %s --check-prefix=CHECK-F16C
+; RUN: opt < %s -mtriple=x86_64-- -passes=slp-vectorizer -S -mattr=+avx512f | FileCheck %s --check-prefix=CHECK-AVX512
+
+define void @fpext_v4xf16_v4xf32(ptr %s0, ptr %d0) {
+; CHECK-LABEL: define void @fpext_v4xf16_v4xf32(
+; CHECK-SAME: ptr [[S0:%.*]], ptr [[D0:%.*]]) #[[ATTR0:[0-9]+]] {
+; CHECK-NEXT:    [[S1:%.*]] = getelementptr inbounds half, ptr [[S0]], i64 1
+; CHECK-NEXT:    [[S2:%.*]] = getelementptr inbounds half, ptr [[S0]], i64 2
+; CHECK-NEXT:    [[S3:%.*]] = getelementptr inbounds half, ptr [[S0]], i64 3
+; CHECK-NEXT:    [[L0:%.*]] = load half, ptr [[S0]], align 2
+; CHECK-NEXT:    [[L1:%.*]] = load half, ptr [[S1]], align 2
+; CHECK-NEXT:    [[L2:%.*]] = load half, ptr [[S2]], align 2
+; CHECK-NEXT:    [[L3:%.*]] = load half, ptr [[S3]], align 2
+; CHECK-NEXT:    [[E0:%.*]] = fpext half [[L0]] to float
+; CHECK-NEXT:    [[E1:%.*]] = fpext half [[L1]] to float
+; CHECK-NEXT:    [[E2:%.*]] = fpext half [[L2]] to float
+; CHECK-NEXT:    [[E3:%.*]] = fpext half [[L3]] to float
+; CHECK-NEXT:    [[D1:%.*]] = getelementptr inbounds float, ptr [[D0]], i64 1
+; CHECK-NEXT:    [[D2:%.*]] = getelementptr inbounds float, ptr [[D0]], i64 2
+; CHECK-NEXT:    [[D3:%.*]] = getelementptr inbounds float, ptr [[D0]], i64 3
+; CHECK-NEXT:    store float [[E0]], ptr [[D0]], align 8
+; CHECK-NEXT:    store float [[E1]], ptr [[D1]], align 8
+; CHECK-NEXT:    store float [[E2]], ptr [[D2]], align 8
+; CHECK-NEXT:    store float [[E3]], ptr [[D3]], align 8
+; CHECK-NEXT:    ret void
+;
+; CHECK-F16C-LABEL: define void @fpext_v4xf16_v4xf32(
+; CHECK-F16C-SAME: ptr [[S0:%.*]], ptr [[D0:%.*]]) #[[ATTR0:[0-9]+]] {
+; CHECK-F16C-NEXT:    [[TMP1:%.*]] = load <4 x half>, ptr [[S0]], align 2
+; CHECK-F16C-NEXT:    [[TMP2:%.*]] = fpext <4 x half> [[TMP1]] to <4 x float>
+; CHECK-F16C-NEXT:    store <4 x float> [[TMP2]], ptr [[D0]], align 8
+; CHECK-F16C-NEXT:    ret void
+;
+; CHECK-AVX512-LABEL: define void @fpext_v4xf16_v4xf32(
+; CHECK-AVX512-SAME: ptr [[S0:%.*]], ptr [[D0:%.*]]) #[[ATTR0:[0-9]+]] {
+; CHECK-AVX512-NEXT:    [[TMP1:%.*]] = load <4 x half>, ptr [[S0]], align 2
+; CHECK-AVX512-NEXT:    [[TMP2:%.*]] = fpext <4 x half> [[TMP1]] to <4 x float>
+; CHECK-AVX512-NEXT:    store <4 x float> [[TMP2]], ptr [[D0]], align 8
+; CHECK-AVX512-NEXT:    ret void
+;
+  %s1 = getelementptr inbounds half, ptr %s0, i64 1
+  %s2 = getelementptr inbounds half, ptr %s0, i64 2
+  %s3 = getelementptr inbounds half, ptr %s0, i64 3
+  %l0 = load half, ptr %s0, align 2
+  %l1 = load half, ptr %s1, align 2
+  %l2 = load half, ptr %s2, align 2
+  %l3 = load half, ptr %s3, align 2
+
+  %e0 = fpext half %l0 to float
+  %e1 = fpext half %l1 to float
+  %e2 = fpext half %l2 to float
+  %e3 = fpext half %l3 to float
+
+  %d1 = getelementptr inbounds float, ptr %d0, i64 1
+  %d2 = getelementptr inbounds float, ptr %d0, i64 2
+  %d3 = getelementptr inbounds float, ptr %d0, i64 3
+  store float %e0, ptr %d0, align 8
+  store float %e1, ptr %d1, align 8
+  store float %e2, ptr %d2, align 8
+  store float %e3, ptr %d3, align 8
+  ret void
+}
+
+define void @fpext_v4xf16_v4xf64(ptr %s0, ptr %d0) {
+; CHECK-LABEL: define void @fpext_v4xf16_v4xf64(
+; CHECK-SAME: ptr [[S0:%.*]], ptr [[D0:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT:    [[S1:%.*]] = getelementptr inbounds half, ptr [[S0]], i64 1
+; CHECK-NEXT:    [[S2:%.*]] = getelementptr inbounds half, ptr [[S0]], i64 2
+; CHECK-NEXT:    [[S3:%.*]] = getelementptr inbounds half, ptr [[S0]], i64 3
+; CHECK-NEXT:    [[L0:%.*]] = load half, ptr [[S0]], align 2
+; CHECK-NEXT:    [[L1:%.*]] = load half, ptr [[S1]], align 2
+; CHECK-NEXT:    [[L2:%.*]] = load half, ptr [[S2]], align 2
+; CHECK-NEXT:    [[L3:%.*]] = load half, ptr [[S3]], align 2
+; CHECK-NEXT:    [[E0:%.*]] = fpext half [[L0]] to double
+; CHECK-NEXT:    [[E1:%.*]] = fpext half [[L1]] to double
+; CHECK-NEXT:    [[E2:%.*]] = fpext half [[L2]] to double
+; CHECK-NEXT:    [[E3:%.*]] = fpext half [[L3]] to double
+; CHECK-NEXT:    [[D1:%.*]] = getelementptr inbounds double, ptr [[D0]], i64 1
+; CHECK-NEXT:    [[D2:%.*]] = getelementptr inbounds double, ptr [[D0]], i64 2
+; CHECK-NEXT:    [[D3:%.*]] = getelementptr inbounds double, ptr [[D0]], i64 3
+; CHECK-NEXT:    store double [[E0]], ptr [[D0]], align 8
+; CHECK-NEXT:    store double [[E1]], ptr [[D1]], align 8
+; CHECK-NEXT:    store double [[E2]], ptr [[D2]], align 8
+; CHECK-NEXT:    store double [[E3]], ptr [[D3]], align 8
+; CHECK-NEXT:    ret void
+;
+; CHECK-F16C-LABEL: define void @fpext_v4xf16_v4xf64(
+; CHECK-F16C-SAME: ptr [[S0:%.*]], ptr [[D0:%.*]]) #[[ATTR0]] {
+; CHECK-F16C-NEXT:    [[TMP1:%.*]] = load <4 x half>, ptr [[S0]], align 2
+; CHECK-F16C-NEXT:    [[TMP2:%.*]] = fpext <4 x half> [[TMP1]] to <4 x double>
+; CHECK-F16C-NEXT:    store <4 x double> [[TMP2]], ptr [[D0]], align 8
+; CHECK-F16C-NEXT:    ret void
+;
+; CHECK-AVX512-LABEL: define void @fpext_v4xf16_v4xf64(
+; CHECK-AVX512-SAME: ptr [[S0:%.*]], ptr [[D0:%.*]]) #[[ATTR0]] {
+; CHECK-AVX512-NEXT:    [[TMP1:%.*]] = load <4 x half>, ptr [[S0]], align 2
+; CHECK-AVX512-NEXT:    [[TMP2:%.*]] = fpext <4 x half> [[TMP1]] to <4 x double>
+; CHECK-AVX512-NEXT:    store <4 x double> [[TMP2]], ptr [[D0]], align 8
+; CHECK-AVX512-NEXT:    ret void
+;
+  %s1 = getelementptr inbounds half, ptr %s0, i64 1
+  %s2 = getelementptr inbounds half, ptr %s0, i64 2
+  %s3 = getelementptr inbounds half, ptr %s0, i64 3
+  %l0 = load half, ptr %s0, align 2
+  %l1 = load half, ptr %s1, align 2
+  %l2 = load half, ptr %s2, align 2
+  %l3 = load half, ptr %s3, align 2
+
+  %e0 = fpext half %l0 to double
+  %e1 = fpext half %l1 to double
+  %e2 = fpext half %l2 to double
+  %e3 = fpext half %l3 to double
+
+  %d1 = getelementptr inbounds double, ptr %d0, i64 1
+  %d2 = getelementptr inbounds double, ptr %d0, i64 2
+  %d3 = getelementptr inbounds double, ptr %d0, i64 3
+  store double %e0, ptr %d0, align 8
+  store double %e1, ptr %d1, align 8
+  store double %e2, ptr %d2, align 8
+  store double %e3, ptr %d3, align 8
+  ret void
+}
+
+define void @fpext_v16xf16_v16xf32(ptr %s0, ptr %d0) {
+; CHECK-LABEL: define void @fpext_v16xf16_v16xf32(
+; CHECK-SAME: ptr [[S0:%.*]], ptr [[D0:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT:    [[S1:%.*]] = getelementptr inbounds half, ptr [[S0]], i64 1
+; CHECK-NEXT:    [[S2:%.*]] = getelementptr inbounds half, ptr [[S0]], i64 2
+; CHECK-NEXT:    [[S3:%.*]] = getelementptr inbounds half, ptr [[S0]], i64 3
+; CHECK-NEXT:    [[S4:%.*]] = getelementptr inbounds half, ptr [[S0]], i64 4
+; CHECK-NEXT:    [[S5:%.*]] = getelementptr inbounds half, ptr [[S0]], i64 5
+; CHECK-NEXT:    [[S6:%.*]] = getelementptr inbounds half, ptr [[S0]], i64 6
+; CHECK-NEXT:    [[S7:%.*]] = getelementptr inbounds half, ptr [[S0]], i64 7
+; CHECK-NEXT:    [[S8:%.*]] = getelementptr inbounds half, ptr [[S0]], i64 8
+; CHECK-NEXT:    [[S9:%.*]] = getelementptr inbounds half, ptr [[S0]], i64 9
+; CHECK-NEXT:    [[S10:%.*]] = getelementptr inbounds half, ptr [[S0]], i64 10
+; CHECK-NEXT:    [[S11:%.*]] = getelementptr inbounds half, ptr [[S0]], i64 11
+; CHECK-NEXT:    [[S12:%.*]] = getelementptr inbounds half, ptr [[S0]], i64 12
+; CHECK-NEXT:    [[S13:%.*]] = getelementptr inbounds half, ptr [[S0]], i64 13
+; CHECK-NEXT:    [[S14:%.*]] = getelementptr inbounds half, ptr [[S0]], i64 14
+; CHECK-NEXT:    [[S15:%.*]] = getelementptr inbounds half, ptr [[S0]], i64 15
+; CHECK-NEXT:    [[L0:%.*]] = load half, ptr [[S0]], align 2
+; CHECK-NEXT:    [[L1:%.*]] = load half, ptr [[S1]], align 2
+; CHECK-NEXT:    [[L2:%.*]] = load half, ptr [[S2]], align 2
+; CHECK-NEXT:    [[L3:%.*]] = load half, ptr [[S3]], align 2
+; CHECK-NEXT:    [[L4:%.*]] = load half, ptr [[S4]], align 2
+; CHECK-NEXT:    [[L5:%.*]] = load half, ptr [[S5]], align 2
+; CHECK-NEXT:    [[L6:%.*]] = load half, ptr [[S6]], align 2
+; CHECK-NEXT:    [[L7:%.*]] = load half, ptr [[S7]], align 2
+; CHECK-NEXT:    [[L8:%.*]] = load half, ptr [[S8]], align 2
+; CHECK-NEXT:    [[L9:%.*]] = load half, ptr [[S9]], align 2
+; CHECK-NEXT:    [[L10:%.*]] = load half, ptr [[S10]], align 2
+; CHECK-NEXT:    [[L11:%.*]] = load half, ptr [[S11]], align 2
+; CHECK-NEXT:    [[L12:%.*]] = load half, ptr [[S12]], align 2
+; CHECK-NEXT:    [[L13:%.*]] = load half, ptr [[S13]], align 2
+; CHECK-NEXT:    [[L14:%.*]] = load half, ptr [[S14]], align 2
+; CHECK-NEXT:    [[L15:%.*]] = load half, ptr [[S15]], align 2
+; CHECK-NEXT:    [[E0:%.*]] = fpext half [[L0]] to float
+; CHECK-NEXT:    [[E1:%.*]] = fpext half [[L1]] to float
+; CHECK-NEXT:    [[E2:%.*]] = fpext half [[L2]] to float
+; CHECK-NEXT:    [[E3:%.*]] = fpext half [[L3]] to float
+; CHECK-NEXT:    [[E4:%.*]] = fpext half [[L4]] to float
+; CHECK-NEXT:    [[E5:%.*]] = fpext half [[L5]] to float
+; CHECK-NEXT:    [[E6:%.*]] = fpext half [[L6]] to float
+; CHECK-NEXT:    [[E7:%.*]] = fpext half [[L7]] to float
+; CHECK-NEXT:    [[E8:%.*]] = fpext half [[L8]] to float
+; CHECK-NEXT:    [[E9:%.*]] = fpext half [[L9]] to float
+; CHECK-NEXT:    [[E10:%.*]] = fpext half [[L10]] to float
+; CHECK-NEXT:    [[E11:%.*]] = fpext half [[L11]] to float
+; CHECK-NEXT:    [[E12:%.*]] = fpext half [[L12]] to float
+; CHECK-NEXT:    [[E13:%.*]] = fpext half [[L13]] to float
+; CHECK-NEXT:    [[E14:%.*]] = fpext half [[L14]] to float
+; CHECK-NEXT:    [[E15:%.*]] = fpext half [[L15]] to float
+; CHECK-NEXT:    [[D1:%.*]] = getelementptr inbounds float, ptr [[D0]], i64 1
+; CHECK-NEXT:    [[D2:%.*]] = getelementptr inbounds float, ptr [[D0]], i64 2
+; CHECK-NEXT:    [[D15:%.*]] = getelementptr inbounds float, ptr [[D0]], i64 3
+; CHECK-NEXT:    [[D4:%.*]] = getelementptr inbounds float, ptr [[D0]], i64 4
+; CHECK-NEXT:    [[D5:%.*]] = getelementptr inbounds float, ptr [[D0]], i64 5
+; CHECK-NEXT:    [[D6:%.*]] = getelementptr inbounds float, ptr [[D0]], i64 6
+; CHECK-NEXT:    [[D7:%.*]] = getelementptr inbounds float, ptr [[D0]], i64 7
+; CHECK-NEXT:    [[D8:%.*]] = getelementptr inbounds float, ptr [[D0]], i64 8
+; CHECK-NEXT:    [[D9:%.*]] = getelementptr inbounds float, ptr [[D0]], i64 9
+; CHECK-NEXT:    [[D10:%.*]] = getelementptr inbounds float, ptr [[D0]], i64 10
+; CHECK-NEXT:    [[D11:%.*]] = getelementptr inbounds float, ptr [[D0]], i64 11
+; CHECK-NEXT:    [[D12:%.*]] = getelementptr inbounds float, ptr [[D0]], i64 12
+; CHECK-NEXT:    [[D13:%.*]] = getelementptr inbounds float, ptr [[D0]], i64 13
+; CHECK-NEXT:    [[D14:%.*]] = getelementptr inbounds float, ptr [[D0]], i64 14
+; CHECK-NEXT:    [[D16:%.*]] = getelementptr inbounds float, ptr [[D0]], i64 15
+; CHECK-NEXT:    store float [[E0]], ptr [[D0]], align 8
+; CHECK-NEXT:    store float [[E1]], ptr [[D1]], align 8
+; CHECK-NEXT:    store float [[E2]], ptr [[D2]], align 8
+; CHECK-NEXT:    store float [[E3]], ptr [[D15]], align 8
+; CHECK-NEXT:    store float [[E4]], ptr [[D4]], align 8
+; CHECK-NEXT:    store float [[E5]], ptr [[D5]], align 8
+; CHECK-NEXT:    store float [[E6]], ptr [[D6]], align 8
+; CHECK-NEXT:    store float [[E7]], ptr [[D7]], align 8
+; CHECK-NEXT:    store float [[E8]], ptr [[D8]], align 8
+; CHECK-NEXT:    store float [[E9]], ptr [[D9]], align 8
+; CHECK-NEXT:    store float [[E10]], ptr [[D10]], align 8
+; CHECK-NEXT:    store float [[E11]], ptr [[D11]], align 8
+; CHECK-NEXT:    store float [[E12]], ptr [[D12]], align 8
+; CHECK-NEXT:    store float [[E13]], ptr [[D13]], align 8
+; CHECK-NEXT:    store float [[E14]], ptr [[D14]], align 8
+; CHECK-NEXT:    store float [[E15]], ptr [[D16]], align 8
+; CHECK-NEXT:    ret void
+;
+; CHECK-F16C-LABEL: define void @fpext_v16xf16_v16xf32(
+; CHECK-F16C-SAME: ptr [[S0:%.*]], ptr [[D0:%.*]]) #[[ATTR0]] {
+; CHECK-F16C-NEXT:    [[S8:%.*]] = getelementptr inbounds half, ptr [[S0]], i64 8
+; CHECK-F16C-NEXT:    [[D8:%.*]] = getelementptr inbounds float, ptr [[D0]], i64 8
+; CHECK-F16C-NEXT:    [[TMP1:%.*]] = load <8 x half>, ptr [[S0]], align 2
+; CHECK-F16C-NEXT:    [[TMP2:%.*]] = fpext <8 x half> [[TMP1]] to <8 x float>
+; CHECK-F16C-NEXT:    [[TMP3:%.*]] = load <8 x half>, ptr [[S8]], align 2
+; CHECK-F16C-NEXT:    [[TMP4:%.*]] = fpext <8 x half> [[TMP3]] to <8 x float>
+; CHECK-F16C-NEXT:    store <8 x float> [[TMP2]], ptr [[D0]], align 8
+; CHECK-F16C-NEXT:    store <8 x float> [[TMP4]], ptr [[D8]], align 8
+; CHECK-F16C-NEXT:    ret void
+;
+; CHECK-AVX512-LABEL: define void @fpext_v16xf16_v16xf32(
+; CHECK-AVX512-SAME: ptr [[S0:%.*]], ptr [[D0:%.*]]) #[[ATTR0]] {
+; CHECK-AVX512-NEXT:    [[TMP1:%.*]] = load <16 x half>, ptr [[S0]], align 2
+; CHECK-AVX512-NEXT:    [[TMP2:%.*]] = fpext <16 x half> [[TMP1]] to <16 x float>
+; CHECK-AVX512-NEXT:    store <16 x float> [[TMP2]], ptr [[D0]], align 8
+; CHECK-AVX512-NEXT:    ret void
+;
+  %s1 = getelementptr inbounds half, ptr %s0, i64 1
+  %s2 = getelementptr inbounds half, ptr %s0, i64 2
+  %s3 = getelementptr inbounds half, ptr %s0, i64 3
+  %s4 = getelementptr inbounds half, ptr %s0, i64 4
+  %s5 = getelementptr inbounds half, ptr %s0, i64 5
+  %s6 = getelementptr inbounds half, ptr %s0, i64 6
+  %s7 = getelementptr inbounds half, ptr %s0, i64 7
+  %s8 = getelementptr inbounds half, ptr %s0, i64 8
+  %s9 = getelementptr inbounds half, ptr %s0, i64 9
+  %s10 = getelementptr inbounds half, ptr %s0, i64 10
+  %s11 = getelementptr inbounds half, ptr %s0, i64 11
+  %s12 = getelementptr inbounds half, ptr %s0, i64 12
+  %s13 = getelementptr inbounds half, ptr %s0, i64 13
+  %s14 = getelementptr inbounds half, ptr %s0, i64 14
+  %s15 = getelementptr inbounds half, ptr %s0, i64 15
+  %l0 = load half, ptr %s0, align 2
+  %l1 = load half, ptr %s1, align 2
+  %l2 = load half, ptr %s2, align 2
+  %l3 = load half, ptr %s3, align 2
+  %l4 = load half, ptr %s4, align 2
+  %l5 = load half, ptr %s5, align 2
+  %l6 = load half, ptr %s6, align 2
+  %l7 = load half, ptr %s7, align 2
+  %l8 = load half, ptr %s8, align 2
+  %l9 = load half, ptr %s9, align 2
+  %l10 = load half, ptr %s10, align 2
+  %l11 = load half, ptr %s11, align 2
+  %l12 = load half, ptr %s12, align 2
+  %l13 = load half, ptr %s13, align 2
+  %l14 = load half, ptr %s14, align 2
+  %l15 = load half, ptr %s15, align 2
+
+  %e0 = fpext half %l0 to float
+  %e1 = fpext half %l1 to float
+  %e2 = fpext half %l2 to float
+  %e3 = fpext half %l3 to float
+  %e4 = fpext half %l4 to float
+  %e5 = fpext half %l5 to float
+  %e6 = fpext half %l6 to float
+  %e7 = fpext half %l7 to float
+  %e8 = fpext half %l8 to float
+  %e9 = fpext half %l9 to float
+  %e10 = fpext half %l10 to float
+  %e11 = fpext half %l11 to float
+  %e12 = fpext half %l12 to float
+  %e13 = fpext half %l13 to float
+  %e14 = fpext half %l14 to float
+  %e15 = fpext half %l15 to float
+
+  %d1 = getelementptr inbounds float, ptr %d0, i64 1
+  %d2 = getelementptr inbounds float, ptr %d0, i64 2
+  %d3 = getelementptr inbounds float, ptr %d0, i64 3
+  %d4 = getelementptr inbounds float, ptr %d0, i64 4
+  %d5 = getelementptr inbounds float, ptr %d0, i64 5
+  %d6 = getelementptr inbounds float, ptr %d0, i64 6
+  %d7 = getelementptr inbounds float, ptr %d0, i64 7
+  %d8 = getelementptr inbounds float, ptr %d0, i64 8
+  %d9 = getelementptr inbounds float, ptr %d0, i64 9
+  %d10 = getelementptr inbounds float, ptr %d0, i64 10
+  %d11 = getelementptr inbounds float, ptr %d0, i64 11
+  %d12 = getelementptr inbounds float, ptr %d0, i64 12
+  %d13 = getelementptr inbounds float, ptr %d0, i64 13
+  %d14 = getelementptr inbounds float, ptr %d0, i64 14
+  %d15 = getelementptr inbounds float, ptr %d0, i64 15
+  store float %e0, ptr %d0, align 8
+  store float %e1, ptr %d1, ali...
[truncated]
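
For readers unfamiliar with the cost-table scheme used in the diff above: the
new F16ConversionTbl is consulted before the SSE4.1 table when the subtarget
has F16C. A simplified, self-contained sketch of the lookup mechanism (types
and names are stand-ins, not LLVM's actual declarations):

    #include <cstddef>
    #include <optional>

    // Stand-in for TypeConversionCostKindTblEntry: one row per
    // (opcode, destination type, source type), with one cost per cost kind
    // (throughput, latency, code size, size-and-latency).
    struct ConvEntry {
      int ISD;          // e.g. FP_ROUND or FP_EXTEND
      int DstTy, SrcTy; // e.g. v8f16, v8f32
      int Cost[4];
    };

    // Mirrors what ConvertCostTableLookup does: linearly scan for the first
    // row matching (ISD, DstTy, SrcTy). No match means the caller falls
    // through to the next, more generic table.
    template <std::size_t N>
    std::optional<int> lookupCost(const ConvEntry (&Tbl)[N], int ISD,
                                  int DstTy, int SrcTy, int CostKind) {
      for (const ConvEntry &E : Tbl)
        if (E.ISD == ISD && E.DstTy == DstTy && E.SrcTy == SrcTy)
          return E.Cost[CostKind];
      return std::nullopt;
    }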

llvmbot (Member) commented Oct 21, 2024

@llvm/pr-subscribers-backend-x86

Author: Matthias Braun (MatzeB)


github-actions bot commented Oct 21, 2024

⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:
git-clang-format --diff 8ae39c8e34de2d24c46827b324c76bac845c18b0 84a22ef1a5ced6fc2383c38b99fb040df17db76e --extensions cpp,h -- llvm/lib/Target/X86/X86TargetTransformInfo.cpp llvm/lib/Target/X86/X86TargetTransformInfo.h
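
As a usage note (standard git-clang-format behavior, not specific to this PR):
running the same command with only the base commit and without --diff applies
the formatting to the working tree instead of printing it:

    git-clang-format 8ae39c8e34de2d24c46827b324c76bac845c18b0 --extensions cpp,h -- llvm/lib/Target/X86/X86TargetTransformInfo.cpp llvm/lib/Target/X86/X86TargetTransformInfo.h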
The diff from clang-format:
diff --git a/llvm/lib/Target/X86/X86TargetTransformInfo.cpp b/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
index bae223243b..9abc2051bf 100644
--- a/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
+++ b/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
@@ -2293,143 +2293,221 @@ InstructionCost X86TTIImpl::getCastInstrCost(unsigned Opcode, Type *Dst,
   // 256-bit wide vectors.
 
   static const TypeConversionCostKindTblEntry AVX512FConversionTbl[] = {
-    { ISD::FP_EXTEND, MVT::v8f64,   MVT::v8f32,   { 1, 1, 1, 1 } },
-    { ISD::FP_EXTEND, MVT::v8f64,   MVT::v16f32,  { 3, 1, 1, 1 } },
-    { ISD::FP_EXTEND, MVT::v16f64,  MVT::v16f32,  { 4, 1, 1, 1 } }, // 2*vcvtps2pd+vextractf64x4
-    { ISD::FP_EXTEND, MVT::v16f32,  MVT::v16f16,  { 1, 1, 1, 1 } }, // vcvtph2ps
-    { ISD::FP_EXTEND, MVT::v8f64,   MVT::v8f16,   { 2, 1, 1, 1 } }, // vcvtph2ps+vcvtps2pd
-    { ISD::FP_ROUND,  MVT::v8f32,   MVT::v8f64,   { 1, 1, 1, 1 } },
-    { ISD::FP_ROUND,  MVT::v16f16,  MVT::v16f32,  { 1, 1, 1, 1 } }, // vcvtps2ph
-
-    { ISD::TRUNCATE,  MVT::v2i1,    MVT::v2i8,    { 3, 1, 1, 1 } }, // sext+vpslld+vptestmd
-    { ISD::TRUNCATE,  MVT::v4i1,    MVT::v4i8,    { 3, 1, 1, 1 } }, // sext+vpslld+vptestmd
-    { ISD::TRUNCATE,  MVT::v8i1,    MVT::v8i8,    { 3, 1, 1, 1 } }, // sext+vpslld+vptestmd
-    { ISD::TRUNCATE,  MVT::v16i1,   MVT::v16i8,   { 3, 1, 1, 1 } }, // sext+vpslld+vptestmd
-    { ISD::TRUNCATE,  MVT::v2i1,    MVT::v2i16,   { 3, 1, 1, 1 } }, // sext+vpsllq+vptestmq
-    { ISD::TRUNCATE,  MVT::v4i1,    MVT::v4i16,   { 3, 1, 1, 1 } }, // sext+vpsllq+vptestmq
-    { ISD::TRUNCATE,  MVT::v8i1,    MVT::v8i16,   { 3, 1, 1, 1 } }, // sext+vpsllq+vptestmq
-    { ISD::TRUNCATE,  MVT::v16i1,   MVT::v16i16,  { 3, 1, 1, 1 } }, // sext+vpslld+vptestmd
-    { ISD::TRUNCATE,  MVT::v2i1,    MVT::v2i32,   { 2, 1, 1, 1 } }, // zmm vpslld+vptestmd
-    { ISD::TRUNCATE,  MVT::v4i1,    MVT::v4i32,   { 2, 1, 1, 1 } }, // zmm vpslld+vptestmd
-    { ISD::TRUNCATE,  MVT::v8i1,    MVT::v8i32,   { 2, 1, 1, 1 } }, // zmm vpslld+vptestmd
-    { ISD::TRUNCATE,  MVT::v16i1,   MVT::v16i32,  { 2, 1, 1, 1 } }, // vpslld+vptestmd
-    { ISD::TRUNCATE,  MVT::v2i1,    MVT::v2i64,   { 2, 1, 1, 1 } }, // zmm vpsllq+vptestmq
-    { ISD::TRUNCATE,  MVT::v4i1,    MVT::v4i64,   { 2, 1, 1, 1 } }, // zmm vpsllq+vptestmq
-    { ISD::TRUNCATE,  MVT::v8i1,    MVT::v8i64,   { 2, 1, 1, 1 } }, // vpsllq+vptestmq
-    { ISD::TRUNCATE,  MVT::v2i8,    MVT::v2i32,   { 2, 1, 1, 1 } }, // vpmovdb
-    { ISD::TRUNCATE,  MVT::v4i8,    MVT::v4i32,   { 2, 1, 1, 1 } }, // vpmovdb
-    { ISD::TRUNCATE,  MVT::v16i8,   MVT::v16i32,  { 2, 1, 1, 1 } }, // vpmovdb
-    { ISD::TRUNCATE,  MVT::v32i8,   MVT::v16i32,  { 2, 1, 1, 1 } }, // vpmovdb
-    { ISD::TRUNCATE,  MVT::v64i8,   MVT::v16i32,  { 2, 1, 1, 1 } }, // vpmovdb
-    { ISD::TRUNCATE,  MVT::v16i16,  MVT::v16i32,  { 2, 1, 1, 1 } }, // vpmovdw
-    { ISD::TRUNCATE,  MVT::v32i16,  MVT::v16i32,  { 2, 1, 1, 1 } }, // vpmovdw
-    { ISD::TRUNCATE,  MVT::v2i8,    MVT::v2i64,   { 2, 1, 1, 1 } }, // vpmovqb
-    { ISD::TRUNCATE,  MVT::v2i16,   MVT::v2i64,   { 1, 1, 1, 1 } }, // vpshufb
-    { ISD::TRUNCATE,  MVT::v8i8,    MVT::v8i64,   { 2, 1, 1, 1 } }, // vpmovqb
-    { ISD::TRUNCATE,  MVT::v16i8,   MVT::v8i64,   { 2, 1, 1, 1 } }, // vpmovqb
-    { ISD::TRUNCATE,  MVT::v32i8,   MVT::v8i64,   { 2, 1, 1, 1 } }, // vpmovqb
-    { ISD::TRUNCATE,  MVT::v64i8,   MVT::v8i64,   { 2, 1, 1, 1 } }, // vpmovqb
-    { ISD::TRUNCATE,  MVT::v8i16,   MVT::v8i64,   { 2, 1, 1, 1 } }, // vpmovqw
-    { ISD::TRUNCATE,  MVT::v16i16,  MVT::v8i64,   { 2, 1, 1, 1 } }, // vpmovqw
-    { ISD::TRUNCATE,  MVT::v32i16,  MVT::v8i64,   { 2, 1, 1, 1 } }, // vpmovqw
-    { ISD::TRUNCATE,  MVT::v8i32,   MVT::v8i64,   { 1, 1, 1, 1 } }, // vpmovqd
-    { ISD::TRUNCATE,  MVT::v4i32,   MVT::v4i64,   { 1, 1, 1, 1 } }, // zmm vpmovqd
-    { ISD::TRUNCATE,  MVT::v16i8,   MVT::v16i64,  { 5, 1, 1, 1 } },// 2*vpmovqd+concat+vpmovdb
-
-    { ISD::TRUNCATE,  MVT::v16i8,  MVT::v16i16,   { 3, 1, 1, 1 } }, // extend to v16i32
-    { ISD::TRUNCATE,  MVT::v32i8,  MVT::v32i16,   { 8, 1, 1, 1 } },
-    { ISD::TRUNCATE,  MVT::v64i8,  MVT::v32i16,   { 8, 1, 1, 1 } },
-
-    // Sign extend is zmm vpternlogd+vptruncdb.
-    // Zero extend is zmm broadcast load+vptruncdw.
-    { ISD::SIGN_EXTEND, MVT::v2i8,   MVT::v2i1,   { 3, 1, 1, 1 } },
-    { ISD::ZERO_EXTEND, MVT::v2i8,   MVT::v2i1,   { 4, 1, 1, 1 } },
-    { ISD::SIGN_EXTEND, MVT::v4i8,   MVT::v4i1,   { 3, 1, 1, 1 } },
-    { ISD::ZERO_EXTEND, MVT::v4i8,   MVT::v4i1,   { 4, 1, 1, 1 } },
-    { ISD::SIGN_EXTEND, MVT::v8i8,   MVT::v8i1,   { 3, 1, 1, 1 } },
-    { ISD::ZERO_EXTEND, MVT::v8i8,   MVT::v8i1,   { 4, 1, 1, 1 } },
-    { ISD::SIGN_EXTEND, MVT::v16i8,  MVT::v16i1,  { 3, 1, 1, 1 } },
-    { ISD::ZERO_EXTEND, MVT::v16i8,  MVT::v16i1,  { 4, 1, 1, 1 } },
-
-    // Sign extend is zmm vpternlogd+vptruncdw.
-    // Zero extend is zmm vpternlogd+vptruncdw+vpsrlw.
-    { ISD::SIGN_EXTEND, MVT::v2i16,  MVT::v2i1,   { 3, 1, 1, 1 } },
-    { ISD::ZERO_EXTEND, MVT::v2i16,  MVT::v2i1,   { 4, 1, 1, 1 } },
-    { ISD::SIGN_EXTEND, MVT::v4i16,  MVT::v4i1,   { 3, 1, 1, 1 } },
-    { ISD::ZERO_EXTEND, MVT::v4i16,  MVT::v4i1,   { 4, 1, 1, 1 } },
-    { ISD::SIGN_EXTEND, MVT::v8i16,  MVT::v8i1,   { 3, 1, 1, 1 } },
-    { ISD::ZERO_EXTEND, MVT::v8i16,  MVT::v8i1,   { 4, 1, 1, 1 } },
-    { ISD::SIGN_EXTEND, MVT::v16i16, MVT::v16i1,  { 3, 1, 1, 1 } },
-    { ISD::ZERO_EXTEND, MVT::v16i16, MVT::v16i1,  { 4, 1, 1, 1 } },
-
-    { ISD::SIGN_EXTEND, MVT::v2i32,  MVT::v2i1,   { 1, 1, 1, 1 } }, // zmm vpternlogd
-    { ISD::ZERO_EXTEND, MVT::v2i32,  MVT::v2i1,   { 2, 1, 1, 1 } }, // zmm vpternlogd+psrld
-    { ISD::SIGN_EXTEND, MVT::v4i32,  MVT::v4i1,   { 1, 1, 1, 1 } }, // zmm vpternlogd
-    { ISD::ZERO_EXTEND, MVT::v4i32,  MVT::v4i1,   { 2, 1, 1, 1 } }, // zmm vpternlogd+psrld
-    { ISD::SIGN_EXTEND, MVT::v8i32,  MVT::v8i1,   { 1, 1, 1, 1 } }, // zmm vpternlogd
-    { ISD::ZERO_EXTEND, MVT::v8i32,  MVT::v8i1,   { 2, 1, 1, 1 } }, // zmm vpternlogd+psrld
-    { ISD::SIGN_EXTEND, MVT::v2i64,  MVT::v2i1,   { 1, 1, 1, 1 } }, // zmm vpternlogq
-    { ISD::ZERO_EXTEND, MVT::v2i64,  MVT::v2i1,   { 2, 1, 1, 1 } }, // zmm vpternlogq+psrlq
-    { ISD::SIGN_EXTEND, MVT::v4i64,  MVT::v4i1,   { 1, 1, 1, 1 } }, // zmm vpternlogq
-    { ISD::ZERO_EXTEND, MVT::v4i64,  MVT::v4i1,   { 2, 1, 1, 1 } }, // zmm vpternlogq+psrlq
-
-    { ISD::SIGN_EXTEND, MVT::v16i32, MVT::v16i1,  { 1, 1, 1, 1 } }, // vpternlogd
-    { ISD::ZERO_EXTEND, MVT::v16i32, MVT::v16i1,  { 2, 1, 1, 1 } }, // vpternlogd+psrld
-    { ISD::SIGN_EXTEND, MVT::v8i64,  MVT::v8i1,   { 1, 1, 1, 1 } }, // vpternlogq
-    { ISD::ZERO_EXTEND, MVT::v8i64,  MVT::v8i1,   { 2, 1, 1, 1 } }, // vpternlogq+psrlq
-
-    { ISD::SIGN_EXTEND, MVT::v16i32, MVT::v16i8,  { 1, 1, 1, 1 } },
-    { ISD::ZERO_EXTEND, MVT::v16i32, MVT::v16i8,  { 1, 1, 1, 1 } },
-    { ISD::SIGN_EXTEND, MVT::v16i32, MVT::v16i16, { 1, 1, 1, 1 } },
-    { ISD::ZERO_EXTEND, MVT::v16i32, MVT::v16i16, { 1, 1, 1, 1 } },
-    { ISD::SIGN_EXTEND, MVT::v8i64,  MVT::v8i8,   { 1, 1, 1, 1 } },
-    { ISD::ZERO_EXTEND, MVT::v8i64,  MVT::v8i8,   { 1, 1, 1, 1 } },
-    { ISD::SIGN_EXTEND, MVT::v8i64,  MVT::v8i16,  { 1, 1, 1, 1 } },
-    { ISD::ZERO_EXTEND, MVT::v8i64,  MVT::v8i16,  { 1, 1, 1, 1 } },
-    { ISD::SIGN_EXTEND, MVT::v8i64,  MVT::v8i32,  { 1, 1, 1, 1 } },
-    { ISD::ZERO_EXTEND, MVT::v8i64,  MVT::v8i32,  { 1, 1, 1, 1 } },
-
-    { ISD::SIGN_EXTEND, MVT::v32i16, MVT::v32i8,  { 3, 1, 1, 1 } }, // FIXME: May not be right
-    { ISD::ZERO_EXTEND, MVT::v32i16, MVT::v32i8,  { 3, 1, 1, 1 } }, // FIXME: May not be right
-
-    { ISD::SINT_TO_FP,  MVT::v8f64,  MVT::v8i1,   { 4, 1, 1, 1 } },
-    { ISD::SINT_TO_FP,  MVT::v16f32, MVT::v16i1,  { 3, 1, 1, 1 } },
-    { ISD::SINT_TO_FP,  MVT::v8f64,  MVT::v16i8,  { 2, 1, 1, 1 } },
-    { ISD::SINT_TO_FP,  MVT::v16f32, MVT::v16i8,  { 1, 1, 1, 1 } },
-    { ISD::SINT_TO_FP,  MVT::v8f64,  MVT::v8i16,  { 2, 1, 1, 1 } },
-    { ISD::SINT_TO_FP,  MVT::v16f32, MVT::v16i16, { 1, 1, 1, 1 } },
-    { ISD::SINT_TO_FP,  MVT::v8f64,  MVT::v8i32,  { 1, 1, 1, 1 } },
-    { ISD::SINT_TO_FP,  MVT::v16f32, MVT::v16i32, { 1, 1, 1, 1 } },
-
-    { ISD::UINT_TO_FP,  MVT::v8f64,  MVT::v8i1,   { 4, 1, 1, 1 } },
-    { ISD::UINT_TO_FP,  MVT::v16f32, MVT::v16i1,  { 3, 1, 1, 1 } },
-    { ISD::UINT_TO_FP,  MVT::v8f64,  MVT::v16i8,  { 2, 1, 1, 1 } },
-    { ISD::UINT_TO_FP,  MVT::v16f32, MVT::v16i8,  { 1, 1, 1, 1 } },
-    { ISD::UINT_TO_FP,  MVT::v8f64,  MVT::v8i16,  { 2, 1, 1, 1 } },
-    { ISD::UINT_TO_FP,  MVT::v16f32, MVT::v16i16, { 1, 1, 1, 1 } },
-    { ISD::UINT_TO_FP,  MVT::v8f64,  MVT::v8i32,  { 1, 1, 1, 1 } },
-    { ISD::UINT_TO_FP,  MVT::v16f32, MVT::v16i32, { 1, 1, 1, 1 } },
-    { ISD::UINT_TO_FP,  MVT::v8f32,  MVT::v8i64,  {26, 1, 1, 1 } },
-    { ISD::UINT_TO_FP,  MVT::v8f64,  MVT::v8i64,  { 5, 1, 1, 1 } },
-
-    { ISD::FP_TO_SINT,  MVT::v16i8,  MVT::v16f32, { 2, 1, 1, 1 } },
-    { ISD::FP_TO_SINT,  MVT::v16i8,  MVT::v16f64, { 7, 1, 1, 1 } },
-    { ISD::FP_TO_SINT,  MVT::v32i8,  MVT::v32f64, {15, 1, 1, 1 } },
-    { ISD::FP_TO_SINT,  MVT::v64i8,  MVT::v64f32, {11, 1, 1, 1 } },
-    { ISD::FP_TO_SINT,  MVT::v64i8,  MVT::v64f64, {31, 1, 1, 1 } },
-    { ISD::FP_TO_SINT,  MVT::v8i16,  MVT::v8f64,  { 3, 1, 1, 1 } },
-    { ISD::FP_TO_SINT,  MVT::v16i16, MVT::v16f64, { 7, 1, 1, 1 } },
-    { ISD::FP_TO_SINT,  MVT::v32i16, MVT::v32f32, { 5, 1, 1, 1 } },
-    { ISD::FP_TO_SINT,  MVT::v32i16, MVT::v32f64, {15, 1, 1, 1 } },
-    { ISD::FP_TO_SINT,  MVT::v8i32,  MVT::v8f64,  { 1, 1, 1, 1 } },
-    { ISD::FP_TO_SINT,  MVT::v16i32, MVT::v16f64, { 3, 1, 1, 1 } },
-
-    { ISD::FP_TO_UINT,  MVT::v8i32,  MVT::v8f64,  { 1, 1, 1, 1 } },
-    { ISD::FP_TO_UINT,  MVT::v8i16,  MVT::v8f64,  { 3, 1, 1, 1 } },
-    { ISD::FP_TO_UINT,  MVT::v8i8,   MVT::v8f64,  { 3, 1, 1, 1 } },
-    { ISD::FP_TO_UINT,  MVT::v16i32, MVT::v16f32, { 1, 1, 1, 1 } },
-    { ISD::FP_TO_UINT,  MVT::v16i16, MVT::v16f32, { 3, 1, 1, 1 } },
-    { ISD::FP_TO_UINT,  MVT::v16i8,  MVT::v16f32, { 3, 1, 1, 1 } },
+      {ISD::FP_EXTEND, MVT::v8f64, MVT::v8f32, {1, 1, 1, 1}},
+      {ISD::FP_EXTEND, MVT::v8f64, MVT::v16f32, {3, 1, 1, 1}},
+      {ISD::FP_EXTEND,
+       MVT::v16f64,
+       MVT::v16f32,
+       {4, 1, 1, 1}}, // 2*vcvtps2pd+vextractf64x4
+      {ISD::FP_EXTEND, MVT::v16f32, MVT::v16f16, {1, 1, 1, 1}}, // vcvtph2ps
+      {ISD::FP_EXTEND,
+       MVT::v8f64,
+       MVT::v8f16,
+       {2, 1, 1, 1}}, // vcvtph2ps+vcvtps2pd
+      {ISD::FP_ROUND, MVT::v8f32, MVT::v8f64, {1, 1, 1, 1}},
+      {ISD::FP_ROUND, MVT::v16f16, MVT::v16f32, {1, 1, 1, 1}}, // vcvtps2ph
+
+      {ISD::TRUNCATE, MVT::v2i1, MVT::v2i8, {3, 1, 1, 1}},     // sext+vpslld+vptestmd
+      {ISD::TRUNCATE, MVT::v4i1, MVT::v4i8, {3, 1, 1, 1}},     // sext+vpslld+vptestmd
+      {ISD::TRUNCATE, MVT::v8i1, MVT::v8i8, {3, 1, 1, 1}},     // sext+vpslld+vptestmd
+      {ISD::TRUNCATE, MVT::v16i1, MVT::v16i8, {3, 1, 1, 1}},   // sext+vpslld+vptestmd
+      {ISD::TRUNCATE, MVT::v2i1, MVT::v2i16, {3, 1, 1, 1}},    // sext+vpsllq+vptestmq
+      {ISD::TRUNCATE, MVT::v4i1, MVT::v4i16, {3, 1, 1, 1}},    // sext+vpsllq+vptestmq
+      {ISD::TRUNCATE, MVT::v8i1, MVT::v8i16, {3, 1, 1, 1}},    // sext+vpsllq+vptestmq
+      {ISD::TRUNCATE, MVT::v16i1, MVT::v16i16, {3, 1, 1, 1}},  // sext+vpslld+vptestmd
+      {ISD::TRUNCATE, MVT::v2i1, MVT::v2i32, {2, 1, 1, 1}},    // zmm vpslld+vptestmd
+      {ISD::TRUNCATE, MVT::v4i1, MVT::v4i32, {2, 1, 1, 1}},    // zmm vpslld+vptestmd
+      {ISD::TRUNCATE, MVT::v8i1, MVT::v8i32, {2, 1, 1, 1}},    // zmm vpslld+vptestmd
+      {ISD::TRUNCATE, MVT::v16i1, MVT::v16i32, {2, 1, 1, 1}},  // vpslld+vptestmd
+      {ISD::TRUNCATE, MVT::v2i1, MVT::v2i64, {2, 1, 1, 1}},    // zmm vpsllq+vptestmq
+      {ISD::TRUNCATE, MVT::v4i1, MVT::v4i64, {2, 1, 1, 1}},    // zmm vpsllq+vptestmq
+      {ISD::TRUNCATE, MVT::v8i1, MVT::v8i64, {2, 1, 1, 1}},    // vpsllq+vptestmq
+      {ISD::TRUNCATE, MVT::v2i8, MVT::v2i32, {2, 1, 1, 1}},    // vpmovdb
+      {ISD::TRUNCATE, MVT::v4i8, MVT::v4i32, {2, 1, 1, 1}},    // vpmovdb
+      {ISD::TRUNCATE, MVT::v16i8, MVT::v16i32, {2, 1, 1, 1}},  // vpmovdb
+      {ISD::TRUNCATE, MVT::v32i8, MVT::v16i32, {2, 1, 1, 1}},  // vpmovdb
+      {ISD::TRUNCATE, MVT::v64i8, MVT::v16i32, {2, 1, 1, 1}},  // vpmovdb
+      {ISD::TRUNCATE, MVT::v16i16, MVT::v16i32, {2, 1, 1, 1}}, // vpmovdw
+      {ISD::TRUNCATE, MVT::v32i16, MVT::v16i32, {2, 1, 1, 1}}, // vpmovdw
+      {ISD::TRUNCATE, MVT::v2i8, MVT::v2i64, {2, 1, 1, 1}},    // vpmovqb
+      {ISD::TRUNCATE, MVT::v2i16, MVT::v2i64, {1, 1, 1, 1}},   // vpshufb
+      {ISD::TRUNCATE, MVT::v8i8, MVT::v8i64, {2, 1, 1, 1}},    // vpmovqb
+      {ISD::TRUNCATE, MVT::v16i8, MVT::v8i64, {2, 1, 1, 1}},   // vpmovqb
+      {ISD::TRUNCATE, MVT::v32i8, MVT::v8i64, {2, 1, 1, 1}},   // vpmovqb
+      {ISD::TRUNCATE, MVT::v64i8, MVT::v8i64, {2, 1, 1, 1}},   // vpmovqb
+      {ISD::TRUNCATE, MVT::v8i16, MVT::v8i64, {2, 1, 1, 1}},   // vpmovqw
+      {ISD::TRUNCATE, MVT::v16i16, MVT::v8i64, {2, 1, 1, 1}},  // vpmovqw
+      {ISD::TRUNCATE, MVT::v32i16, MVT::v8i64, {2, 1, 1, 1}},  // vpmovqw
+      {ISD::TRUNCATE, MVT::v8i32, MVT::v8i64, {1, 1, 1, 1}},   // vpmovqd
+      {ISD::TRUNCATE, MVT::v4i32, MVT::v4i64, {1, 1, 1, 1}},   // zmm vpmovqd
+      {ISD::TRUNCATE, MVT::v16i8, MVT::v16i64, {5, 1, 1, 1}},  // 2*vpmovqd+concat+vpmovdb
+
+      {ISD::TRUNCATE, MVT::v16i8, MVT::v16i16, {3, 1, 1, 1}}, // extend to v16i32
+      {ISD::TRUNCATE, MVT::v32i8, MVT::v32i16, {8, 1, 1, 1}},
+      {ISD::TRUNCATE, MVT::v64i8, MVT::v32i16, {8, 1, 1, 1}},
+
+      // Sign extend is zmm vpternlogd+vptruncdb.
+      // Zero extend is zmm broadcast load+vptruncdw.
+      {ISD::SIGN_EXTEND, MVT::v2i8, MVT::v2i1, {3, 1, 1, 1}},
+      {ISD::ZERO_EXTEND, MVT::v2i8, MVT::v2i1, {4, 1, 1, 1}},
+      {ISD::SIGN_EXTEND, MVT::v4i8, MVT::v4i1, {3, 1, 1, 1}},
+      {ISD::ZERO_EXTEND, MVT::v4i8, MVT::v4i1, {4, 1, 1, 1}},
+      {ISD::SIGN_EXTEND, MVT::v8i8, MVT::v8i1, {3, 1, 1, 1}},
+      {ISD::ZERO_EXTEND, MVT::v8i8, MVT::v8i1, {4, 1, 1, 1}},
+      {ISD::SIGN_EXTEND, MVT::v16i8, MVT::v16i1, {3, 1, 1, 1}},
+      {ISD::ZERO_EXTEND, MVT::v16i8, MVT::v16i1, {4, 1, 1, 1}},
+
+      // Sign extend is zmm vpternlogd+vptruncdw.
+      // Zero extend is zmm vpternlogd+vptruncdw+vpsrlw.
+      {ISD::SIGN_EXTEND, MVT::v2i16, MVT::v2i1, {3, 1, 1, 1}},
+      {ISD::ZERO_EXTEND, MVT::v2i16, MVT::v2i1, {4, 1, 1, 1}},
+      {ISD::SIGN_EXTEND, MVT::v4i16, MVT::v4i1, {3, 1, 1, 1}},
+      {ISD::ZERO_EXTEND, MVT::v4i16, MVT::v4i1, {4, 1, 1, 1}},
+      {ISD::SIGN_EXTEND, MVT::v8i16, MVT::v8i1, {3, 1, 1, 1}},
+      {ISD::ZERO_EXTEND, MVT::v8i16, MVT::v8i1, {4, 1, 1, 1}},
+      {ISD::SIGN_EXTEND, MVT::v16i16, MVT::v16i1, {3, 1, 1, 1}},
+      {ISD::ZERO_EXTEND, MVT::v16i16, MVT::v16i1, {4, 1, 1, 1}},
+
+      {ISD::SIGN_EXTEND, MVT::v2i32, MVT::v2i1, {1, 1, 1, 1}},   // zmm vpternlogd
+      {ISD::ZERO_EXTEND, MVT::v2i32, MVT::v2i1, {2, 1, 1, 1}},   // zmm vpternlogd+psrld
+      {ISD::SIGN_EXTEND, MVT::v4i32, MVT::v4i1, {1, 1, 1, 1}},   // zmm vpternlogd
+      {ISD::ZERO_EXTEND, MVT::v4i32, MVT::v4i1, {2, 1, 1, 1}},   // zmm vpternlogd+psrld
+      {ISD::SIGN_EXTEND, MVT::v8i32, MVT::v8i1, {1, 1, 1, 1}},   // zmm vpternlogd
+      {ISD::ZERO_EXTEND, MVT::v8i32, MVT::v8i1, {2, 1, 1, 1}},   // zmm vpternlogd+psrld
+      {ISD::SIGN_EXTEND, MVT::v2i64, MVT::v2i1, {1, 1, 1, 1}},   // zmm vpternlogq
+      {ISD::ZERO_EXTEND, MVT::v2i64, MVT::v2i1, {2, 1, 1, 1}},   // zmm vpternlogq+psrlq
+      {ISD::SIGN_EXTEND, MVT::v4i64, MVT::v4i1, {1, 1, 1, 1}},   // zmm vpternlogq
+      {ISD::ZERO_EXTEND, MVT::v4i64, MVT::v4i1, {2, 1, 1, 1}},   // zmm vpternlogq+psrlq
+
+      {ISD::SIGN_EXTEND, MVT::v16i32, MVT::v16i1, {1, 1, 1, 1}}, // vpternlogd
+      {ISD::ZERO_EXTEND, MVT::v16i32, MVT::v16i1, {2, 1, 1, 1}}, // vpternlogd+psrld
+      {ISD::SIGN_EXTEND, MVT::v8i64, MVT::v8i1, {1, 1, 1, 1}},   // vpternlogq
+      {ISD::ZERO_EXTEND, MVT::v8i64, MVT::v8i1, {2, 1, 1, 1}},   // vpternlogq+psrlq
+
+      {ISD::SIGN_EXTEND, MVT::v16i32, MVT::v16i8, {1, 1, 1, 1}},
+      {ISD::ZERO_EXTEND, MVT::v16i32, MVT::v16i8, {1, 1, 1, 1}},
+      {ISD::SIGN_EXTEND, MVT::v16i32, MVT::v16i16, {1, 1, 1, 1}},
+      {ISD::ZERO_EXTEND, MVT::v16i32, MVT::v16i16, {1, 1, 1, 1}},
+      {ISD::SIGN_EXTEND, MVT::v8i64, MVT::v8i8, {1, 1, 1, 1}},
+      {ISD::ZERO_EXTEND, MVT::v8i64, MVT::v8i8, {1, 1, 1, 1}},
+      {ISD::SIGN_EXTEND, MVT::v8i64, MVT::v8i16, {1, 1, 1, 1}},
+      {ISD::ZERO_EXTEND, MVT::v8i64, MVT::v8i16, {1, 1, 1, 1}},
+      {ISD::SIGN_EXTEND, MVT::v8i64, MVT::v8i32, {1, 1, 1, 1}},
+      {ISD::ZERO_EXTEND, MVT::v8i64, MVT::v8i32, {1, 1, 1, 1}},
+
+      {ISD::SIGN_EXTEND, MVT::v32i16, MVT::v32i8, {3, 1, 1, 1}}, // FIXME: May not be right
+      {ISD::ZERO_EXTEND, MVT::v32i16, MVT::v32i8, {3, 1, 1, 1}}, // FIXME: May not be right
+
+      {ISD::SINT_TO_FP, MVT::v8f64, MVT::v8i1, {4, 1, 1, 1}},
+      {ISD::SINT_TO_FP, MVT::v16f32, MVT::v16i1, {3, 1, 1, 1}},
+      {ISD::SINT_TO_FP, MVT::v8f64, MVT::v16i8, {2, 1, 1, 1}},
+      {ISD::SINT_TO_FP, MVT::v16f32, MVT::v16i8, {1, 1, 1, 1}},
+      {ISD::SINT_TO_FP, MVT::v8f64, MVT::v8i16, {2, 1, 1, 1}},
+      {ISD::SINT_TO_FP, MVT::v16f32, MVT::v16i16, {1, 1, 1, 1}},
+      {ISD::SINT_TO_FP, MVT::v8f64, MVT::v8i32, {1, 1, 1, 1}},
+      {ISD::SINT_TO_FP, MVT::v16f32, MVT::v16i32, {1, 1, 1, 1}},
+
+      {ISD::UINT_TO_FP, MVT::v8f64, MVT::v8i1, {4, 1, 1, 1}},
+      {ISD::UINT_TO_FP, MVT::v16f32, MVT::v16i1, {3, 1, 1, 1}},
+      {ISD::UINT_TO_FP, MVT::v8f64, MVT::v16i8, {2, 1, 1, 1}},
+      {ISD::UINT_TO_FP, MVT::v16f32, MVT::v16i8, {1, 1, 1, 1}},
+      {ISD::UINT_TO_FP, MVT::v8f64, MVT::v8i16, {2, 1, 1, 1}},
+      {ISD::UINT_TO_FP, MVT::v16f32, MVT::v16i16, {1, 1, 1, 1}},
+      {ISD::UINT_TO_FP, MVT::v8f64, MVT::v8i32, {1, 1, 1, 1}},
+      {ISD::UINT_TO_FP, MVT::v16f32, MVT::v16i32, {1, 1, 1, 1}},
+      {ISD::UINT_TO_FP, MVT::v8f32, MVT::v8i64, {26, 1, 1, 1}},
+      {ISD::UINT_TO_FP, MVT::v8f64, MVT::v8i64, {5, 1, 1, 1}},
+
+      {ISD::FP_TO_SINT, MVT::v16i8, MVT::v16f32, {2, 1, 1, 1}},
+      {ISD::FP_TO_SINT, MVT::v16i8, MVT::v16f64, {7, 1, 1, 1}},
+      {ISD::FP_TO_SINT, MVT::v32i8, MVT::v32f64, {15, 1, 1, 1}},
+      {ISD::FP_TO_SINT, MVT::v64i8, MVT::v64f32, {11, 1, 1, 1}},
+      {ISD::FP_TO_SINT, MVT::v64i8, MVT::v64f64, {31, 1, 1, 1}},
+      {ISD::FP_TO_SINT, MVT::v8i16, MVT::v8f64, {3, 1, 1, 1}},
+      {ISD::FP_TO_SINT, MVT::v16i16, MVT::v16f64, {7, 1, 1, 1}},
+      {ISD::FP_TO_SINT, MVT::v32i16, MVT::v32f32, {5, 1, 1, 1}},
+      {ISD::FP_TO_SINT, MVT::v32i16, MVT::v32f64, {15, 1, 1, 1}},
+      {ISD::FP_TO_SINT, MVT::v8i32, MVT::v8f64, {1, 1, 1, 1}},
+      {ISD::FP_TO_SINT, MVT::v16i32, MVT::v16f64, {3, 1, 1, 1}},
+
+      {ISD::FP_TO_UINT, MVT::v8i32, MVT::v8f64, {1, 1, 1, 1}},
+      {ISD::FP_TO_UINT, MVT::v8i16, MVT::v8f64, {3, 1, 1, 1}},
+      {ISD::FP_TO_UINT, MVT::v8i8, MVT::v8f64, {3, 1, 1, 1}},
+      {ISD::FP_TO_UINT, MVT::v16i32, MVT::v16f32, {1, 1, 1, 1}},
+      {ISD::FP_TO_UINT, MVT::v16i16, MVT::v16f32, {3, 1, 1, 1}},
+      {ISD::FP_TO_UINT, MVT::v16i8, MVT::v16f32, {3, 1, 1, 1}},
   };
 
   static const TypeConversionCostKindTblEntry AVX512BWVLConversionTbl[] {
@@ -2977,14 +3055,17 @@ InstructionCost X86TTIImpl::getCastInstrCost(unsigned Opcode, Type *Dst,
   };
 
   static const TypeConversionCostKindTblEntry F16ConversionTbl[] = {
-    { ISD::FP_ROUND,  MVT::f16,     MVT::f32,     { 1, 1, 1, 1 } },
-    { ISD::FP_ROUND,  MVT::v8f16,   MVT::v8f32,   { 1, 1, 1, 1 } },
-    { ISD::FP_ROUND,  MVT::v4f16,   MVT::v4f32,   { 1, 1, 1, 1 } },
-    { ISD::FP_EXTEND, MVT::f32,     MVT::f16,     { 1, 1, 1, 1 } },
-    { ISD::FP_EXTEND, MVT::f64,     MVT::f16,     { 2, 1, 1, 1 } }, // vcvtph2ps+vcvtps2pd
-    { ISD::FP_EXTEND, MVT::v8f32,   MVT::v8f16,   { 1, 1, 1, 1 } },
-    { ISD::FP_EXTEND, MVT::v4f32,   MVT::v4f16,   { 1, 1, 1, 1 } },
-    { ISD::FP_EXTEND, MVT::v4f64,   MVT::v4f16,   { 2, 1, 1, 1 } }, // vcvtph2ps+vcvtps2pd
+      {ISD::FP_ROUND, MVT::f16, MVT::f32, {1, 1, 1, 1}},
+      {ISD::FP_ROUND, MVT::v8f16, MVT::v8f32, {1, 1, 1, 1}},
+      {ISD::FP_ROUND, MVT::v4f16, MVT::v4f32, {1, 1, 1, 1}},
+      {ISD::FP_EXTEND, MVT::f32, MVT::f16, {1, 1, 1, 1}},
+      {ISD::FP_EXTEND, MVT::f64, MVT::f16, {2, 1, 1, 1}}, // vcvtph2ps+vcvtps2pd
+      {ISD::FP_EXTEND, MVT::v8f32, MVT::v8f16, {1, 1, 1, 1}},
+      {ISD::FP_EXTEND, MVT::v4f32, MVT::v4f16, {1, 1, 1, 1}},
+      {ISD::FP_EXTEND, MVT::v4f64, MVT::v4f16, {2, 1, 1, 1}}, // vcvtph2ps+vcvtps2pd
   };
 
   // Attempt to map directly to (simple) MVT types to let us match custom entries.

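For readers skimming these tables: each entry's four-number initializer is a per-cost-kind tuple. Below is a minimal sketch of how such an entry is consulted, assuming LLVM's ConvertCostTableLookup helper from llvm/Target/CostTable.h and the CostKindCosts indexing used in this file; the wrapper function itself is illustrative, not code from the patch:

  // Each {a, b, c, d} tuple encodes one cost per TargetCostKind:
  // a = reciprocal throughput, b = latency, c = code size,
  // d = size-and-latency.
  InstructionCost lookupF16CastCost(int ISD, MVT Dst, MVT Src,
                                    TTI::TargetCostKind CostKind) {
    if (const auto *Entry =
            ConvertCostTableLookup(F16ConversionTbl, ISD, Dst, Src))
      if (auto KindCost = Entry->Cost[CostKind]) // picks one of the four
        return *KindCost;
    return InstructionCost::getInvalid(); // no entry: caller falls back
  }
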
MatzeB (Contributor, Author) commented Oct 21, 2024

(seems the setOperationAction() parts break some things. Working on that right now)

MatzeB merged commit 054c23d into llvm:main on Oct 25, 2024
7 of 8 checks passed
winner245 pushed a commit to winner245/llvm-project that referenced this pull request Oct 26, 2024
Improve cost-modeling for x86 __fp16 conversions so the SLPVectorizer
transforms the patterns:

- Override `X86TTIImpl::getStoreMinimumVF` to report a minimum VF of 4 (an SSE
  register can hold 4xfloat converted/stored to 4xf16). This is necessary
  because fp16 stores are neither modeled as trunc-stores nor can we mark
  direct Xxfp16 stores as legal, since we generally expand fp16 operations
  (see the sketch below).
- Add missing cost entries to `X86TTIImpl::getCastInstrCost` for
  conversion from/to fp16. Note that conversion from f64 to f16 is not
  supported by an X86 instruction.
searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Oct 27, 2024
Revert: breaks the hipRuntime build
054c23d X86: Improve cost model of fp16 conversion (llvm#113195)

Change-Id: I2dbd30b82c6b355ff83368c9fd5c8b2f83ce5db1

  if (ISD == ISD::FP_ROUND && LTDest.second.getScalarType() == MVT::f16) {
    // Conversion requires a libcall.
    return InstructionCost::getInvalid();
  }
Contributor

This is breaking https://github.com/google/jax/blob/main/tests/lax_test.py#L3630 LazyConstantTest.testConvertElementTypeAvoidsCopies21 (dtype_in=<class 'numpy.float64'>, dtype_out=<class 'numpy.float16'>).

With

F1029 08:45:30.640847    4013 logging.cc:62] assert.h assertion failed at third_party/llvm/llvm-project/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:4569 in VectorizationFactor llvm::LoopVectorizationPlanner::selectVectorizationFactor(): ExpectedCost.isValid() && "Unexpected invalid cost for scalar loop"
*** Check failure stack trace: ***
    @     0x7ef66f09cf59  absl::log_internal::LogMessage::SendToLog()
    @     0x7ef66f09c4fe  absl::log_internal::LogMessage::Flush()
    @     0x7ef66f09d519  absl::log_internal::LogMessageFatal::~LogMessageFatal()
    @     0x7ef67ade7314  __assert_fail
    @     0x7efa86da3f10  llvm::LoopVectorizationPlanner::selectVectorizationFactor()
    @     0x7efa86db95df  llvm::LoopVectorizationPlanner::computeBestVF()
    @     0x7efa86dcbdfd  llvm::LoopVectorizePass::processLoop()
    @     0x7efa86dd2c3d  llvm::LoopVectorizePass::runImpl()
    @     0x7efa86dd3875  llvm::LoopVectorizePass::run()
    @     0x7efa8ceb7332  llvm::detail::PassModel<>::run()
    @     0x7ef9520b9050  llvm::PassManager<>::run()
    @     0x7efaff179412  llvm::detail::PassModel<>::run()
    @     0x7ef9520be28a  llvm::ModuleToFunctionPassAdaptor::run()
    @     0x7efaff179192  llvm::detail::PassModel<>::run()
    @     0x7ef9520b7d7c  llvm::PassManager<>::run()
    @     0x7efaa12ec861  xla::cpu::CompilerFunctor::operator()()
    @     0x7efa913b0271  llvm::orc::ThreadSafeModule::withModuleDo<>()
    @     0x7efa913b000b  llvm::orc::IRCompileLayer::emit()
    @     0x7efa913e6d45  llvm::orc::BasicIRLayerMaterializationUnit::materialize()
    @     0x7efa91454337  llvm::orc::InPlaceTaskDispatcher::dispatch()
    @     0x7efa91349466  llvm::orc::ExecutionSession::dispatchOutstandingMUs()
    @     0x7efa9134e9e6  llvm::orc::ExecutionSession::OL_completeLookup()
    @     0x7efa91369a89  llvm::orc::InProgressFullLookupState::complete()
    @     0x7efa9133a0f0  llvm::orc::ExecutionSession::OL_applyQueryPhase1()
    @     0x7efa91337234  llvm::orc::ExecutionSession::lookup()
    @     0x7efa9134991e  llvm::orc::ExecutionSession::lookup()
    @     0x7efa91349de8  llvm::orc::ExecutionSession::lookup()
    @     0x7efa9134a30e  llvm::orc::ExecutionSession::lookup()
    @     0x7efa9134a459  llvm::orc::ExecutionSession::lookup()
    @     0x7efaa1719abf  xla::cpu::SimpleOrcJIT::FindCompiledSymbol()
    @     0x7efaddc247c0  absl::internal_any_invocable::RemoteInvoker<>()
    @     0x7efaddc0fb68  std::__u::__function::__policy_invoker<>::__call_impl<>()
    @     0x7ef89847e1b6  tsl::thread::EigenEnvironment::ExecuteTask()
    @     0x7ef89847dd10  Eigen::ThreadPoolTempl<>::WorkerLoop()
    @     0x7ef89847d940  std::__u::invoke<>()
    @     0x7ef6a5f9e25e  Thread::ThreadBody()
    @     0x7efafb6827db  start_thread
    @     0x7efabc18e05f  clone

I dumped the LLVM IR beforehand:

; Function Attrs: nofree norecurse nosync nounwind memory(readwrite, inaccessiblemem: none) uwtable
define noalias noundef ptr @convert.2(ptr nocapture readonly %0) local_unnamed_addr #0 {
  %args_gep = getelementptr inbounds nuw i8, ptr %0, i64 24
  %args = load ptr, ptr %args_gep, align 8
  %arg0 = load ptr, ptr %args, align 8, !invariant.load !0, !dereferenceable !1, !align !2
  %arg1_gep = getelementptr i8, ptr %args, i64 16
  %arg1 = load ptr, ptr %arg1_gep, align 8, !invariant.load !0, !dereferenceable !3, !align !2
  br label %convert.2.loop_body.dim.0

convert.2.loop_body.dim.0:                        ; preds = %1, %convert.2.loop_body.dim.0
  %convert.2.invar_address.dim.0.03 = phi i64 [ 0, %1 ], [ %invar.inc, %convert.2.loop_body.dim.0 ]
  %2 = getelementptr inbounds [5 x double], ptr %arg0, i64 0, i64 %convert.2.invar_address.dim.0.03
  %3 = load double, ptr %2, align 8, !invariant.load !0, !noalias !4
  %4 = fptrunc double %3 to half
  %5 = getelementptr inbounds [5 x half], ptr %arg1, i64 0, i64 %convert.2.invar_address.dim.0.03
  store half %4, ptr %5, align 2, !alias.scope !4
  %invar.inc = add nuw nsw i64 %convert.2.invar_address.dim.0.03, 1
  %exitcond = icmp eq i64 %invar.inc, 5
  br i1 %exitcond, label %return, label %convert.2.loop_body.dim.0

return:                                           ; preds = %convert.2.loop_body.dim.0
  ret ptr null
}

Could we fix this?

MatzeB (Contributor, Author)

I am not able to reproduce this so far. Unfortunately the dump does not contain some of the referenced metadata, so I have to make guesses for that. Running opt -S -o - -passes=loop-vectorize /tmp/x.ll works just fine, and I guess I need some target setup (I played with -mtriple=x86_64 -mattr=+avx512f,+f16c, but that doesn't repro either).

That said, could you try whether replacing the InstructionCost::getInvalid(); with InstructionCost::getMax() helps, or, if that doesn't work, with a big number like 128?

MatzeB (Contributor, Author)

I hope #114128 fixes this.

Contributor

Yes, it looks like it does. Thanks! (And apologies about the bad dump.)

MatzeB added a commit that referenced this pull request Oct 30, 2024
Returning invalid instruction costs when converting from/to fp16 in
`X86TTIImpl::getCastInstrCost` when there is no hardware support
available was triggering asserts. This changes the code to return a
large (arbitrary) number to model the fact that libcalls are used to
implement the conversion.

This also simplifies the code by only reporting costs for the scalar
fp16 conversion; vectorized costs are left to the fallback, which
assumes scalarization.

This is a follow-up to the assertion issues reported for the changes in
#113195.
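
The shape of that follow-up, as a hedged sketch (the guard and the constant are illustrative assumptions, not the verbatim patch from #114128):

  // Sketch: model the fp16 conversion libcall with a large finite cost
  // instead of an invalid one, so callers such as the LoopVectorizer never
  // see an invalid scalar cost. The constant is arbitrary but "expensive".
  if (ISD == ISD::FP_ROUND && LTDest.second.getScalarType() == MVT::f16) {
    // Conversion requires a libcall.
    return InstructionCost(128);
  }
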
searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Nov 3, 2024
NoumanAmir657 pushed a commit to NoumanAmir657/llvm-project that referenced this pull request Nov 4, 2024
NoumanAmir657 pushed a commit to NoumanAmir657/llvm-project that referenced this pull request Nov 4, 2024
searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Aug 13, 2025

upstream commit: 255e441