
Conversation

MatzeB (Contributor) commented Oct 21, 2024

Improve cost modeling for x86 __fp16 conversions so the SLPVectorizer
vectorizes scalar fp16 conversion patterns (a sketch of such a pattern
follows the list):

- `setOperationAction` of `ISD::STORE` for v4f16, v8f16 and v16f16 to `Custom` so
  `TargetTransformInfo::getStoreMinimumVF` reports them as acceptable.
- Add missing cost entries to `X86TTIImpl::getCastInstrCost` for
  conversions from/to fp16. Note that conversion from f64 to f16 is not
  supported by a single X86 instruction.
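
For context, a minimal sketch of the kind of scalar pattern this lets the
SLPVectorizer vectorize on F16C targets. The functions below are hypothetical
(not from the patch), assume Clang's `__fp16` extension, and mirror the
straight-line code in the new test file:

    // Widening: with the new cost entries, SLP can turn the four scalar
    // conversions into one <4 x half> load, a <4 x float> fpext, and a
    // <4 x float> store (vcvtph2ps on F16C).
    void widen4(const __fp16 *s, float *d) {
      d[0] = s[0];
      d[1] = s[1];
      d[2] = s[2];
      d[3] = s[3];
    }

    // Narrowing: the vectorized fptrunc+store maps to vcvtps2ph via the new
    // Custom store lowering.
    void narrow4(const float *s, __fp16 *d) {
      d[0] = (__fp16)s[0];
      d[1] = (__fp16)s[1];
      d[2] = (__fp16)s[2];
      d[3] = (__fp16)s[3];
    }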
llvmbot (Member) commented Oct 21, 2024

@llvm/pr-subscribers-llvm-transforms

Author: Matthias Braun (MatzeB)

Changes

Improve cost modeling for x86 __fp16 conversions so the SLPVectorizer
vectorizes scalar fp16 conversion patterns:

  • setOperationAction of ISD::STORE for v4f16, v8f16 and v16f16 to Custom so
    TargetTransformInfo::getStoreMinimumVF reports them as acceptable.
  • Add missing cost entries to X86TTIImpl::getCastInstrCost for
    conversions from/to fp16. Note that conversion from f64 to f16 is not
    supported by a single X86 instruction.

Patch is 35.46 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/113195.diff

3 Files Affected:

  • (modified) llvm/lib/Target/X86/X86ISelLowering.cpp (+6)
  • (modified) llvm/lib/Target/X86/X86TargetTransformInfo.cpp (+25)
  • (added) llvm/test/Transforms/SLPVectorizer/X86/conversion-fp16.ll (+601)
diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index bcb84add65d83e..da88a1a0a5a3b8 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -1714,6 +1714,9 @@ X86TargetLowering::X86TargetLowering(const X86TargetMachine &TM,
       setOperationPromotedToType(Opc, MVT::v8f16, MVT::v8f32);
       setOperationPromotedToType(Opc, MVT::v16f16, MVT::v16f32);
     }
+    // trunc+store via vcvtps2ph
+    setOperationAction(ISD::STORE, MVT::v4f16, Custom);
+    setOperationAction(ISD::STORE, MVT::v8f16, Custom);
   }
 
   // This block controls legalization of the mask vector sizes that are
@@ -1784,6 +1787,9 @@ X86TargetLowering::X86TargetLowering(const X86TargetMachine &TM,
 
     for (auto VT : { MVT::v1i1, MVT::v2i1, MVT::v4i1, MVT::v8i1 })
       setOperationAction(ISD::EXTRACT_SUBVECTOR, VT, Custom);
+
+    // trunc+store via vcvtps2ph
+    setOperationAction(ISD::STORE, MVT::v16f16, Custom);
   }
   if (Subtarget.hasDQI() && Subtarget.hasVLX()) {
     for (MVT VT : {MVT::v4f32, MVT::v8f32, MVT::v2f64, MVT::v4f64}) {
diff --git a/llvm/lib/Target/X86/X86TargetTransformInfo.cpp b/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
index 413ef0136d5c06..2d2c804ed46e54 100644
--- a/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
+++ b/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
@@ -2296,7 +2296,10 @@ InstructionCost X86TTIImpl::getCastInstrCost(unsigned Opcode, Type *Dst,
     { ISD::FP_EXTEND, MVT::v8f64,   MVT::v8f32,   { 1, 1, 1, 1 } },
     { ISD::FP_EXTEND, MVT::v8f64,   MVT::v16f32,  { 3, 1, 1, 1 } },
     { ISD::FP_EXTEND, MVT::v16f64,  MVT::v16f32,  { 4, 1, 1, 1 } }, // 2*vcvtps2pd+vextractf64x4
+    { ISD::FP_EXTEND, MVT::v16f32,  MVT::v16f16,  { 1, 1, 1, 1 } }, // vcvtph2ps
+    { ISD::FP_EXTEND, MVT::v8f64,   MVT::v8f16,   { 2, 1, 1, 1 } }, // vcvtph2ps+vcvtps2pd
     { ISD::FP_ROUND,  MVT::v8f32,   MVT::v8f64,   { 1, 1, 1, 1 } },
+    { ISD::FP_ROUND,  MVT::v16f16,  MVT::v16f32,  { 1, 1, 1, 1 } }, // vcvtps2ph
 
     { ISD::TRUNCATE,  MVT::v2i1,    MVT::v2i8,    { 3, 1, 1, 1 } }, // sext+vpslld+vptestmd
     { ISD::TRUNCATE,  MVT::v4i1,    MVT::v4i8,    { 3, 1, 1, 1 } }, // sext+vpslld+vptestmd
@@ -2973,6 +2976,14 @@ InstructionCost X86TTIImpl::getCastInstrCost(unsigned Opcode, Type *Dst,
     { ISD::TRUNCATE,    MVT::v4i32,  MVT::v2i64,  { 1, 1, 1, 1 } }, // PSHUFD
   };
 
+  static const TypeConversionCostKindTblEntry F16ConversionTbl[] = {
+    { ISD::FP_ROUND,  MVT::v8f16,   MVT::v8f32,   { 1, 1, 1, 1 } }, // vcvtps2ph
+    { ISD::FP_ROUND,  MVT::v4f16,   MVT::v4f32,   { 1, 1, 1, 1 } }, // vcvtps2ph
+    { ISD::FP_EXTEND, MVT::v8f32,   MVT::v8f16,   { 1, 1, 1, 1 } }, // vcvtph2ps
+    { ISD::FP_EXTEND, MVT::v4f32,   MVT::v4f16,   { 1, 1, 1, 1 } }, // vcvtph2ps
+    { ISD::FP_EXTEND, MVT::v4f64,   MVT::v4f16,   { 2, 1, 1, 1 } }, // vcvtph2ps+vcvtps2pd
+  };
+
   // Attempt to map directly to (simple) MVT types to let us match custom entries.
   EVT SrcTy = TLI->getValueType(DL, Src);
   EVT DstTy = TLI->getValueType(DL, Dst);
@@ -3034,6 +3045,13 @@ InstructionCost X86TTIImpl::getCastInstrCost(unsigned Opcode, Type *Dst,
           return *KindCost;
     }
 
+    if (ST->hasF16C()) {
+      if (const auto *Entry = ConvertCostTableLookup(F16ConversionTbl, ISD,
+                                                     SimpleDstTy, SimpleSrcTy))
+        if (auto KindCost = Entry->Cost[CostKind])
+          return *KindCost;
+    }
+
     if (ST->hasSSE41()) {
       if (const auto *Entry = ConvertCostTableLookup(SSE41ConversionTbl, ISD,
                                                      SimpleDstTy, SimpleSrcTy))
@@ -3107,6 +3125,13 @@ InstructionCost X86TTIImpl::getCastInstrCost(unsigned Opcode, Type *Dst,
       if (auto KindCost = Entry->Cost[CostKind])
         return std::max(LTSrc.first, LTDest.first) * *KindCost;
 
+  if (ST->hasF16C()) {
+    if (const auto *Entry = ConvertCostTableLookup(F16ConversionTbl, ISD,
+                                                   LTDest.second, LTSrc.second))
+      if (auto KindCost = Entry->Cost[CostKind])
+        return std::max(LTSrc.first, LTDest.first) * *KindCost;
+  }
+
   if (ST->hasSSE41())
     if (const auto *Entry = ConvertCostTableLookup(SSE41ConversionTbl, ISD,
                                                    LTDest.second, LTSrc.second))
diff --git a/llvm/test/Transforms/SLPVectorizer/X86/conversion-fp16.ll b/llvm/test/Transforms/SLPVectorizer/X86/conversion-fp16.ll
new file mode 100644
index 00000000000000..1d5dee6cb8121c
--- /dev/null
+++ b/llvm/test/Transforms/SLPVectorizer/X86/conversion-fp16.ll
@@ -0,0 +1,601 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+; RUN: opt < %s -mtriple=x86_64-- -passes=slp-vectorizer -S -mattr=+avx2 | FileCheck %s --check-prefix=CHECK
+; RUN: opt < %s -mtriple=x86_64-- -passes=slp-vectorizer -S -mattr=+avx2 -mattr=+f16c | FileCheck %s --check-prefix=CHECK-F16C
+; RUN: opt < %s -mtriple=x86_64-- -passes=slp-vectorizer -S -mattr=+avx512f | FileCheck %s --check-prefix=CHECK-AVX512
+
+define void @fpext_v4xf16_v4xf32(ptr %s0, ptr %d0) {
+; CHECK-LABEL: define void @fpext_v4xf16_v4xf32(
+; CHECK-SAME: ptr [[S0:%.*]], ptr [[D0:%.*]]) #[[ATTR0:[0-9]+]] {
+; CHECK-NEXT:    [[S1:%.*]] = getelementptr inbounds half, ptr [[S0]], i64 1
+; CHECK-NEXT:    [[S2:%.*]] = getelementptr inbounds half, ptr [[S0]], i64 2
+; CHECK-NEXT:    [[S3:%.*]] = getelementptr inbounds half, ptr [[S0]], i64 3
+; CHECK-NEXT:    [[L0:%.*]] = load half, ptr [[S0]], align 2
+; CHECK-NEXT:    [[L1:%.*]] = load half, ptr [[S1]], align 2
+; CHECK-NEXT:    [[L2:%.*]] = load half, ptr [[S2]], align 2
+; CHECK-NEXT:    [[L3:%.*]] = load half, ptr [[S3]], align 2
+; CHECK-NEXT:    [[E0:%.*]] = fpext half [[L0]] to float
+; CHECK-NEXT:    [[E1:%.*]] = fpext half [[L1]] to float
+; CHECK-NEXT:    [[E2:%.*]] = fpext half [[L2]] to float
+; CHECK-NEXT:    [[E3:%.*]] = fpext half [[L3]] to float
+; CHECK-NEXT:    [[D1:%.*]] = getelementptr inbounds float, ptr [[D0]], i64 1
+; CHECK-NEXT:    [[D2:%.*]] = getelementptr inbounds float, ptr [[D0]], i64 2
+; CHECK-NEXT:    [[D3:%.*]] = getelementptr inbounds float, ptr [[D0]], i64 3
+; CHECK-NEXT:    store float [[E0]], ptr [[D0]], align 8
+; CHECK-NEXT:    store float [[E1]], ptr [[D1]], align 8
+; CHECK-NEXT:    store float [[E2]], ptr [[D2]], align 8
+; CHECK-NEXT:    store float [[E3]], ptr [[D3]], align 8
+; CHECK-NEXT:    ret void
+;
+; CHECK-F16C-LABEL: define void @fpext_v4xf16_v4xf32(
+; CHECK-F16C-SAME: ptr [[S0:%.*]], ptr [[D0:%.*]]) #[[ATTR0:[0-9]+]] {
+; CHECK-F16C-NEXT:    [[TMP1:%.*]] = load <4 x half>, ptr [[S0]], align 2
+; CHECK-F16C-NEXT:    [[TMP2:%.*]] = fpext <4 x half> [[TMP1]] to <4 x float>
+; CHECK-F16C-NEXT:    store <4 x float> [[TMP2]], ptr [[D0]], align 8
+; CHECK-F16C-NEXT:    ret void
+;
+; CHECK-AVX512-LABEL: define void @fpext_v4xf16_v4xf32(
+; CHECK-AVX512-SAME: ptr [[S0:%.*]], ptr [[D0:%.*]]) #[[ATTR0:[0-9]+]] {
+; CHECK-AVX512-NEXT:    [[TMP1:%.*]] = load <4 x half>, ptr [[S0]], align 2
+; CHECK-AVX512-NEXT:    [[TMP2:%.*]] = fpext <4 x half> [[TMP1]] to <4 x float>
+; CHECK-AVX512-NEXT:    store <4 x float> [[TMP2]], ptr [[D0]], align 8
+; CHECK-AVX512-NEXT:    ret void
+;
+  %s1 = getelementptr inbounds half, ptr %s0, i64 1
+  %s2 = getelementptr inbounds half, ptr %s0, i64 2
+  %s3 = getelementptr inbounds half, ptr %s0, i64 3
+  %l0 = load half, ptr %s0, align 2
+  %l1 = load half, ptr %s1, align 2
+  %l2 = load half, ptr %s2, align 2
+  %l3 = load half, ptr %s3, align 2
+
+  %e0 = fpext half %l0 to float
+  %e1 = fpext half %l1 to float
+  %e2 = fpext half %l2 to float
+  %e3 = fpext half %l3 to float
+
+  %d1 = getelementptr inbounds float, ptr %d0, i64 1
+  %d2 = getelementptr inbounds float, ptr %d0, i64 2
+  %d3 = getelementptr inbounds float, ptr %d0, i64 3
+  store float %e0, ptr %d0, align 8
+  store float %e1, ptr %d1, align 8
+  store float %e2, ptr %d2, align 8
+  store float %e3, ptr %d3, align 8
+  ret void
+}
+
+define void @fpext_v4xf16_v4xf64(ptr %s0, ptr %d0) {
+; CHECK-LABEL: define void @fpext_v4xf16_v4xf64(
+; CHECK-SAME: ptr [[S0:%.*]], ptr [[D0:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT:    [[S1:%.*]] = getelementptr inbounds half, ptr [[S0]], i64 1
+; CHECK-NEXT:    [[S2:%.*]] = getelementptr inbounds half, ptr [[S0]], i64 2
+; CHECK-NEXT:    [[S3:%.*]] = getelementptr inbounds half, ptr [[S0]], i64 3
+; CHECK-NEXT:    [[L0:%.*]] = load half, ptr [[S0]], align 2
+; CHECK-NEXT:    [[L1:%.*]] = load half, ptr [[S1]], align 2
+; CHECK-NEXT:    [[L2:%.*]] = load half, ptr [[S2]], align 2
+; CHECK-NEXT:    [[L3:%.*]] = load half, ptr [[S3]], align 2
+; CHECK-NEXT:    [[E0:%.*]] = fpext half [[L0]] to double
+; CHECK-NEXT:    [[E1:%.*]] = fpext half [[L1]] to double
+; CHECK-NEXT:    [[E2:%.*]] = fpext half [[L2]] to double
+; CHECK-NEXT:    [[E3:%.*]] = fpext half [[L3]] to double
+; CHECK-NEXT:    [[D1:%.*]] = getelementptr inbounds double, ptr [[D0]], i64 1
+; CHECK-NEXT:    [[D2:%.*]] = getelementptr inbounds double, ptr [[D0]], i64 2
+; CHECK-NEXT:    [[D3:%.*]] = getelementptr inbounds double, ptr [[D0]], i64 3
+; CHECK-NEXT:    store double [[E0]], ptr [[D0]], align 8
+; CHECK-NEXT:    store double [[E1]], ptr [[D1]], align 8
+; CHECK-NEXT:    store double [[E2]], ptr [[D2]], align 8
+; CHECK-NEXT:    store double [[E3]], ptr [[D3]], align 8
+; CHECK-NEXT:    ret void
+;
+; CHECK-F16C-LABEL: define void @fpext_v4xf16_v4xf64(
+; CHECK-F16C-SAME: ptr [[S0:%.*]], ptr [[D0:%.*]]) #[[ATTR0]] {
+; CHECK-F16C-NEXT:    [[TMP1:%.*]] = load <4 x half>, ptr [[S0]], align 2
+; CHECK-F16C-NEXT:    [[TMP2:%.*]] = fpext <4 x half> [[TMP1]] to <4 x double>
+; CHECK-F16C-NEXT:    store <4 x double> [[TMP2]], ptr [[D0]], align 8
+; CHECK-F16C-NEXT:    ret void
+;
+; CHECK-AVX512-LABEL: define void @fpext_v4xf16_v4xf64(
+; CHECK-AVX512-SAME: ptr [[S0:%.*]], ptr [[D0:%.*]]) #[[ATTR0]] {
+; CHECK-AVX512-NEXT:    [[TMP1:%.*]] = load <4 x half>, ptr [[S0]], align 2
+; CHECK-AVX512-NEXT:    [[TMP2:%.*]] = fpext <4 x half> [[TMP1]] to <4 x double>
+; CHECK-AVX512-NEXT:    store <4 x double> [[TMP2]], ptr [[D0]], align 8
+; CHECK-AVX512-NEXT:    ret void
+;
+  %s1 = getelementptr inbounds half, ptr %s0, i64 1
+  %s2 = getelementptr inbounds half, ptr %s0, i64 2
+  %s3 = getelementptr inbounds half, ptr %s0, i64 3
+  %l0 = load half, ptr %s0, align 2
+  %l1 = load half, ptr %s1, align 2
+  %l2 = load half, ptr %s2, align 2
+  %l3 = load half, ptr %s3, align 2
+
+  %e0 = fpext half %l0 to double
+  %e1 = fpext half %l1 to double
+  %e2 = fpext half %l2 to double
+  %e3 = fpext half %l3 to double
+
+  %d1 = getelementptr inbounds double, ptr %d0, i64 1
+  %d2 = getelementptr inbounds double, ptr %d0, i64 2
+  %d3 = getelementptr inbounds double, ptr %d0, i64 3
+  store double %e0, ptr %d0, align 8
+  store double %e1, ptr %d1, align 8
+  store double %e2, ptr %d2, align 8
+  store double %e3, ptr %d3, align 8
+  ret void
+}
+
+define void @fpext_v16xf16_v16xf32(ptr %s0, ptr %d0) {
+; CHECK-LABEL: define void @fpext_v16xf16_v16xf32(
+; CHECK-SAME: ptr [[S0:%.*]], ptr [[D0:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT:    [[S1:%.*]] = getelementptr inbounds half, ptr [[S0]], i64 1
+; CHECK-NEXT:    [[S2:%.*]] = getelementptr inbounds half, ptr [[S0]], i64 2
+; CHECK-NEXT:    [[S3:%.*]] = getelementptr inbounds half, ptr [[S0]], i64 3
+; CHECK-NEXT:    [[S4:%.*]] = getelementptr inbounds half, ptr [[S0]], i64 4
+; CHECK-NEXT:    [[S5:%.*]] = getelementptr inbounds half, ptr [[S0]], i64 5
+; CHECK-NEXT:    [[S6:%.*]] = getelementptr inbounds half, ptr [[S0]], i64 6
+; CHECK-NEXT:    [[S7:%.*]] = getelementptr inbounds half, ptr [[S0]], i64 7
+; CHECK-NEXT:    [[S8:%.*]] = getelementptr inbounds half, ptr [[S0]], i64 8
+; CHECK-NEXT:    [[S9:%.*]] = getelementptr inbounds half, ptr [[S0]], i64 9
+; CHECK-NEXT:    [[S10:%.*]] = getelementptr inbounds half, ptr [[S0]], i64 10
+; CHECK-NEXT:    [[S11:%.*]] = getelementptr inbounds half, ptr [[S0]], i64 11
+; CHECK-NEXT:    [[S12:%.*]] = getelementptr inbounds half, ptr [[S0]], i64 12
+; CHECK-NEXT:    [[S13:%.*]] = getelementptr inbounds half, ptr [[S0]], i64 13
+; CHECK-NEXT:    [[S14:%.*]] = getelementptr inbounds half, ptr [[S0]], i64 14
+; CHECK-NEXT:    [[S15:%.*]] = getelementptr inbounds half, ptr [[S0]], i64 15
+; CHECK-NEXT:    [[L0:%.*]] = load half, ptr [[S0]], align 2
+; CHECK-NEXT:    [[L1:%.*]] = load half, ptr [[S1]], align 2
+; CHECK-NEXT:    [[L2:%.*]] = load half, ptr [[S2]], align 2
+; CHECK-NEXT:    [[L3:%.*]] = load half, ptr [[S3]], align 2
+; CHECK-NEXT:    [[L4:%.*]] = load half, ptr [[S4]], align 2
+; CHECK-NEXT:    [[L5:%.*]] = load half, ptr [[S5]], align 2
+; CHECK-NEXT:    [[L6:%.*]] = load half, ptr [[S6]], align 2
+; CHECK-NEXT:    [[L7:%.*]] = load half, ptr [[S7]], align 2
+; CHECK-NEXT:    [[L8:%.*]] = load half, ptr [[S8]], align 2
+; CHECK-NEXT:    [[L9:%.*]] = load half, ptr [[S9]], align 2
+; CHECK-NEXT:    [[L10:%.*]] = load half, ptr [[S10]], align 2
+; CHECK-NEXT:    [[L11:%.*]] = load half, ptr [[S11]], align 2
+; CHECK-NEXT:    [[L12:%.*]] = load half, ptr [[S12]], align 2
+; CHECK-NEXT:    [[L13:%.*]] = load half, ptr [[S13]], align 2
+; CHECK-NEXT:    [[L14:%.*]] = load half, ptr [[S14]], align 2
+; CHECK-NEXT:    [[L15:%.*]] = load half, ptr [[S15]], align 2
+; CHECK-NEXT:    [[E0:%.*]] = fpext half [[L0]] to float
+; CHECK-NEXT:    [[E1:%.*]] = fpext half [[L1]] to float
+; CHECK-NEXT:    [[E2:%.*]] = fpext half [[L2]] to float
+; CHECK-NEXT:    [[E3:%.*]] = fpext half [[L3]] to float
+; CHECK-NEXT:    [[E4:%.*]] = fpext half [[L4]] to float
+; CHECK-NEXT:    [[E5:%.*]] = fpext half [[L5]] to float
+; CHECK-NEXT:    [[E6:%.*]] = fpext half [[L6]] to float
+; CHECK-NEXT:    [[E7:%.*]] = fpext half [[L7]] to float
+; CHECK-NEXT:    [[E8:%.*]] = fpext half [[L8]] to float
+; CHECK-NEXT:    [[E9:%.*]] = fpext half [[L9]] to float
+; CHECK-NEXT:    [[E10:%.*]] = fpext half [[L10]] to float
+; CHECK-NEXT:    [[E11:%.*]] = fpext half [[L11]] to float
+; CHECK-NEXT:    [[E12:%.*]] = fpext half [[L12]] to float
+; CHECK-NEXT:    [[E13:%.*]] = fpext half [[L13]] to float
+; CHECK-NEXT:    [[E14:%.*]] = fpext half [[L14]] to float
+; CHECK-NEXT:    [[E15:%.*]] = fpext half [[L15]] to float
+; CHECK-NEXT:    [[D1:%.*]] = getelementptr inbounds float, ptr [[D0]], i64 1
+; CHECK-NEXT:    [[D2:%.*]] = getelementptr inbounds float, ptr [[D0]], i64 2
+; CHECK-NEXT:    [[D15:%.*]] = getelementptr inbounds float, ptr [[D0]], i64 3
+; CHECK-NEXT:    [[D4:%.*]] = getelementptr inbounds float, ptr [[D0]], i64 4
+; CHECK-NEXT:    [[D5:%.*]] = getelementptr inbounds float, ptr [[D0]], i64 5
+; CHECK-NEXT:    [[D6:%.*]] = getelementptr inbounds float, ptr [[D0]], i64 6
+; CHECK-NEXT:    [[D7:%.*]] = getelementptr inbounds float, ptr [[D0]], i64 7
+; CHECK-NEXT:    [[D8:%.*]] = getelementptr inbounds float, ptr [[D0]], i64 8
+; CHECK-NEXT:    [[D9:%.*]] = getelementptr inbounds float, ptr [[D0]], i64 9
+; CHECK-NEXT:    [[D10:%.*]] = getelementptr inbounds float, ptr [[D0]], i64 10
+; CHECK-NEXT:    [[D11:%.*]] = getelementptr inbounds float, ptr [[D0]], i64 11
+; CHECK-NEXT:    [[D12:%.*]] = getelementptr inbounds float, ptr [[D0]], i64 12
+; CHECK-NEXT:    [[D13:%.*]] = getelementptr inbounds float, ptr [[D0]], i64 13
+; CHECK-NEXT:    [[D14:%.*]] = getelementptr inbounds float, ptr [[D0]], i64 14
+; CHECK-NEXT:    [[D16:%.*]] = getelementptr inbounds float, ptr [[D0]], i64 15
+; CHECK-NEXT:    store float [[E0]], ptr [[D0]], align 8
+; CHECK-NEXT:    store float [[E1]], ptr [[D1]], align 8
+; CHECK-NEXT:    store float [[E2]], ptr [[D2]], align 8
+; CHECK-NEXT:    store float [[E3]], ptr [[D15]], align 8
+; CHECK-NEXT:    store float [[E4]], ptr [[D4]], align 8
+; CHECK-NEXT:    store float [[E5]], ptr [[D5]], align 8
+; CHECK-NEXT:    store float [[E6]], ptr [[D6]], align 8
+; CHECK-NEXT:    store float [[E7]], ptr [[D7]], align 8
+; CHECK-NEXT:    store float [[E8]], ptr [[D8]], align 8
+; CHECK-NEXT:    store float [[E9]], ptr [[D9]], align 8
+; CHECK-NEXT:    store float [[E10]], ptr [[D10]], align 8
+; CHECK-NEXT:    store float [[E11]], ptr [[D11]], align 8
+; CHECK-NEXT:    store float [[E12]], ptr [[D12]], align 8
+; CHECK-NEXT:    store float [[E13]], ptr [[D13]], align 8
+; CHECK-NEXT:    store float [[E14]], ptr [[D14]], align 8
+; CHECK-NEXT:    store float [[E15]], ptr [[D16]], align 8
+; CHECK-NEXT:    ret void
+;
+; CHECK-F16C-LABEL: define void @fpext_v16xf16_v16xf32(
+; CHECK-F16C-SAME: ptr [[S0:%.*]], ptr [[D0:%.*]]) #[[ATTR0]] {
+; CHECK-F16C-NEXT:    [[S8:%.*]] = getelementptr inbounds half, ptr [[S0]], i64 8
+; CHECK-F16C-NEXT:    [[D8:%.*]] = getelementptr inbounds float, ptr [[D0]], i64 8
+; CHECK-F16C-NEXT:    [[TMP1:%.*]] = load <8 x half>, ptr [[S0]], align 2
+; CHECK-F16C-NEXT:    [[TMP2:%.*]] = fpext <8 x half> [[TMP1]] to <8 x float>
+; CHECK-F16C-NEXT:    [[TMP3:%.*]] = load <8 x half>, ptr [[S8]], align 2
+; CHECK-F16C-NEXT:    [[TMP4:%.*]] = fpext <8 x half> [[TMP3]] to <8 x float>
+; CHECK-F16C-NEXT:    store <8 x float> [[TMP2]], ptr [[D0]], align 8
+; CHECK-F16C-NEXT:    store <8 x float> [[TMP4]], ptr [[D8]], align 8
+; CHECK-F16C-NEXT:    ret void
+;
+; CHECK-AVX512-LABEL: define void @fpext_v16xf16_v16xf32(
+; CHECK-AVX512-SAME: ptr [[S0:%.*]], ptr [[D0:%.*]]) #[[ATTR0]] {
+; CHECK-AVX512-NEXT:    [[TMP1:%.*]] = load <16 x half>, ptr [[S0]], align 2
+; CHECK-AVX512-NEXT:    [[TMP2:%.*]] = fpext <16 x half> [[TMP1]] to <16 x float>
+; CHECK-AVX512-NEXT:    store <16 x float> [[TMP2]], ptr [[D0]], align 8
+; CHECK-AVX512-NEXT:    ret void
+;
+  %s1 = getelementptr inbounds half, ptr %s0, i64 1
+  %s2 = getelementptr inbounds half, ptr %s0, i64 2
+  %s3 = getelementptr inbounds half, ptr %s0, i64 3
+  %s4 = getelementptr inbounds half, ptr %s0, i64 4
+  %s5 = getelementptr inbounds half, ptr %s0, i64 5
+  %s6 = getelementptr inbounds half, ptr %s0, i64 6
+  %s7 = getelementptr inbounds half, ptr %s0, i64 7
+  %s8 = getelementptr inbounds half, ptr %s0, i64 8
+  %s9 = getelementptr inbounds half, ptr %s0, i64 9
+  %s10 = getelementptr inbounds half, ptr %s0, i64 10
+  %s11 = getelementptr inbounds half, ptr %s0, i64 11
+  %s12 = getelementptr inbounds half, ptr %s0, i64 12
+  %s13 = getelementptr inbounds half, ptr %s0, i64 13
+  %s14 = getelementptr inbounds half, ptr %s0, i64 14
+  %s15 = getelementptr inbounds half, ptr %s0, i64 15
+  %l0 = load half, ptr %s0, align 2
+  %l1 = load half, ptr %s1, align 2
+  %l2 = load half, ptr %s2, align 2
+  %l3 = load half, ptr %s3, align 2
+  %l4 = load half, ptr %s4, align 2
+  %l5 = load half, ptr %s5, align 2
+  %l6 = load half, ptr %s6, align 2
+  %l7 = load half, ptr %s7, align 2
+  %l8 = load half, ptr %s8, align 2
+  %l9 = load half, ptr %s9, align 2
+  %l10 = load half, ptr %s10, align 2
+  %l11 = load half, ptr %s11, align 2
+  %l12 = load half, ptr %s12, align 2
+  %l13 = load half, ptr %s13, align 2
+  %l14 = load half, ptr %s14, align 2
+  %l15 = load half, ptr %s15, align 2
+
+  %e0 = fpext half %l0 to float
+  %e1 = fpext half %l1 to float
+  %e2 = fpext half %l2 to float
+  %e3 = fpext half %l3 to float
+  %e4 = fpext half %l4 to float
+  %e5 = fpext half %l5 to float
+  %e6 = fpext half %l6 to float
+  %e7 = fpext half %l7 to float
+  %e8 = fpext half %l8 to float
+  %e9 = fpext half %l9 to float
+  %e10 = fpext half %l10 to float
+  %e11 = fpext half %l11 to float
+  %e12 = fpext half %l12 to float
+  %e13 = fpext half %l13 to float
+  %e14 = fpext half %l14 to float
+  %e15 = fpext half %l15 to float
+
+  %d1 = getelementptr inbounds float, ptr %d0, i64 1
+  %d2 = getelementptr inbounds float, ptr %d0, i64 2
+  %d3 = getelementptr inbounds float, ptr %d0, i64 3
+  %d4 = getelementptr inbounds float, ptr %d0, i64 4
+  %d5 = getelementptr inbounds float, ptr %d0, i64 5
+  %d6 = getelementptr inbounds float, ptr %d0, i64 6
+  %d7 = getelementptr inbounds float, ptr %d0, i64 7
+  %d8 = getelementptr inbounds float, ptr %d0, i64 8
+  %d9 = getelementptr inbounds float, ptr %d0, i64 9
+  %d10 = getelementptr inbounds float, ptr %d0, i64 10
+  %d11 = getelementptr inbounds float, ptr %d0, i64 11
+  %d12 = getelementptr inbounds float, ptr %d0, i64 12
+  %d13 = getelementptr inbounds float, ptr %d0, i64 13
+  %d14 = getelementptr inbounds float, ptr %d0, i64 14
+  %d15 = getelementptr inbounds float, ptr %d0, i64 15
+  store float %e0, ptr %d0, align 8
+  store float %e1, ptr %d1, ali...
[truncated]
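
For readers unfamiliar with the cost-table scheme used in the diff above: the
new F16ConversionTbl is consulted before the SSE4.1 table when the subtarget
has F16C. A simplified, self-contained sketch of the lookup mechanism (types
and names are stand-ins, not LLVM's actual declarations):

    #include <cstddef>
    #include <optional>

    // Stand-in for TypeConversionCostKindTblEntry: one row per
    // (opcode, destination type, source type), with one cost per cost kind
    // (throughput, latency, code size, size-and-latency).
    struct ConvEntry {
      int ISD;          // e.g. FP_ROUND or FP_EXTEND
      int DstTy, SrcTy; // e.g. v8f16, v8f32
      int Cost[4];
    };

    // Mirrors what ConvertCostTableLookup does: linearly scan for the first
    // row matching (ISD, DstTy, SrcTy). No match means the caller falls
    // through to the next, more generic table.
    template <std::size_t N>
    std::optional<int> lookupCost(const ConvEntry (&Tbl)[N], int ISD,
                                  int DstTy, int SrcTy, int CostKind) {
      for (const ConvEntry &E : Tbl)
        if (E.ISD == ISD && E.DstTy == DstTy && E.SrcTy == SrcTy)
          return E.Cost[CostKind];
      return std::nullopt;
    }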

llvmbot (Member) commented Oct 21, 2024

@llvm/pr-subscribers-backend-x86

Author: Matthias Braun (MatzeB)


github-actions bot commented Oct 21, 2024

⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:
git-clang-format --diff 8ae39c8e34de2d24c46827b324c76bac845c18b0 84a22ef1a5ced6fc2383c38b99fb040df17db76e --extensions cpp,h -- llvm/lib/Target/X86/X86TargetTransformInfo.cpp llvm/lib/Target/X86/X86TargetTransformInfo.h
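
As a usage note (standard git-clang-format behavior, not specific to this PR):
running the same command with only the base commit and without --diff applies
the formatting to the working tree instead of printing it:

    git-clang-format 8ae39c8e34de2d24c46827b324c76bac845c18b0 --extensions cpp,h -- llvm/lib/Target/X86/X86TargetTransformInfo.cpp llvm/lib/Target/X86/X86TargetTransformInfo.h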
The diff from clang-format:
diff --git a/llvm/lib/Target/X86/X86TargetTransformInfo.cpp b/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
index bae223243b..9abc2051bf 100644
--- a/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
+++ b/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
@@ -2293,143 +2293,221 @@ InstructionCost X86TTIImpl::getCastInstrCost(unsigned Opcode, Type *Dst,
   // 256-bit wide vectors.
 
   static const TypeConversionCostKindTblEntry AVX512FConversionTbl[] = {
-    { ISD::FP_EXTEND, MVT::v8f64,   MVT::v8f32,   { 1, 1, 1, 1 } },
-    { ISD::FP_EXTEND, MVT::v8f64,   MVT::v16f32,  { 3, 1, 1, 1 } },
-    { ISD::FP_EXTEND, MVT::v16f64,  MVT::v16f32,  { 4, 1, 1, 1 } }, // 2*vcvtps2pd+vextractf64x4
-    { ISD::FP_EXTEND, MVT::v16f32,  MVT::v16f16,  { 1, 1, 1, 1 } }, // vcvtph2ps
-    { ISD::FP_EXTEND, MVT::v8f64,   MVT::v8f16,   { 2, 1, 1, 1 } }, // vcvtph2ps+vcvtps2pd
-    { ISD::FP_ROUND,  MVT::v8f32,   MVT::v8f64,   { 1, 1, 1, 1 } },
-    { ISD::FP_ROUND,  MVT::v16f16,  MVT::v16f32,  { 1, 1, 1, 1 } }, // vcvtps2ph
-
-    { ISD::TRUNCATE,  MVT::v2i1,    MVT::v2i8,    { 3, 1, 1, 1 } }, // sext+vpslld+vptestmd
-    { ISD::TRUNCATE,  MVT::v4i1,    MVT::v4i8,    { 3, 1, 1, 1 } }, // sext+vpslld+vptestmd
-    { ISD::TRUNCATE,  MVT::v8i1,    MVT::v8i8,    { 3, 1, 1, 1 } }, // sext+vpslld+vptestmd
-    { ISD::TRUNCATE,  MVT::v16i1,   MVT::v16i8,   { 3, 1, 1, 1 } }, // sext+vpslld+vptestmd
-    { ISD::TRUNCATE,  MVT::v2i1,    MVT::v2i16,   { 3, 1, 1, 1 } }, // sext+vpsllq+vptestmq
-    { ISD::TRUNCATE,  MVT::v4i1,    MVT::v4i16,   { 3, 1, 1, 1 } }, // sext+vpsllq+vptestmq
-    { ISD::TRUNCATE,  MVT::v8i1,    MVT::v8i16,   { 3, 1, 1, 1 } }, // sext+vpsllq+vptestmq
-    { ISD::TRUNCATE,  MVT::v16i1,   MVT::v16i16,  { 3, 1, 1, 1 } }, // sext+vpslld+vptestmd
-    { ISD::TRUNCATE,  MVT::v2i1,    MVT::v2i32,   { 2, 1, 1, 1 } }, // zmm vpslld+vptestmd
-    { ISD::TRUNCATE,  MVT::v4i1,    MVT::v4i32,   { 2, 1, 1, 1 } }, // zmm vpslld+vptestmd
-    { ISD::TRUNCATE,  MVT::v8i1,    MVT::v8i32,   { 2, 1, 1, 1 } }, // zmm vpslld+vptestmd
-    { ISD::TRUNCATE,  MVT::v16i1,   MVT::v16i32,  { 2, 1, 1, 1 } }, // vpslld+vptestmd
-    { ISD::TRUNCATE,  MVT::v2i1,    MVT::v2i64,   { 2, 1, 1, 1 } }, // zmm vpsllq+vptestmq
-    { ISD::TRUNCATE,  MVT::v4i1,    MVT::v4i64,   { 2, 1, 1, 1 } }, // zmm vpsllq+vptestmq
-    { ISD::TRUNCATE,  MVT::v8i1,    MVT::v8i64,   { 2, 1, 1, 1 } }, // vpsllq+vptestmq
-    { ISD::TRUNCATE,  MVT::v2i8,    MVT::v2i32,   { 2, 1, 1, 1 } }, // vpmovdb
-    { ISD::TRUNCATE,  MVT::v4i8,    MVT::v4i32,   { 2, 1, 1, 1 } }, // vpmovdb
-    { ISD::TRUNCATE,  MVT::v16i8,   MVT::v16i32,  { 2, 1, 1, 1 } }, // vpmovdb
-    { ISD::TRUNCATE,  MVT::v32i8,   MVT::v16i32,  { 2, 1, 1, 1 } }, // vpmovdb
-    { ISD::TRUNCATE,  MVT::v64i8,   MVT::v16i32,  { 2, 1, 1, 1 } }, // vpmovdb
-    { ISD::TRUNCATE,  MVT::v16i16,  MVT::v16i32,  { 2, 1, 1, 1 } }, // vpmovdw
-    { ISD::TRUNCATE,  MVT::v32i16,  MVT::v16i32,  { 2, 1, 1, 1 } }, // vpmovdw
-    { ISD::TRUNCATE,  MVT::v2i8,    MVT::v2i64,   { 2, 1, 1, 1 } }, // vpmovqb
-    { ISD::TRUNCATE,  MVT::v2i16,   MVT::v2i64,   { 1, 1, 1, 1 } }, // vpshufb
-    { ISD::TRUNCATE,  MVT::v8i8,    MVT::v8i64,   { 2, 1, 1, 1 } }, // vpmovqb
-    { ISD::TRUNCATE,  MVT::v16i8,   MVT::v8i64,   { 2, 1, 1, 1 } }, // vpmovqb
-    { ISD::TRUNCATE,  MVT::v32i8,   MVT::v8i64,   { 2, 1, 1, 1 } }, // vpmovqb
-    { ISD::TRUNCATE,  MVT::v64i8,   MVT::v8i64,   { 2, 1, 1, 1 } }, // vpmovqb
-    { ISD::TRUNCATE,  MVT::v8i16,   MVT::v8i64,   { 2, 1, 1, 1 } }, // vpmovqw
-    { ISD::TRUNCATE,  MVT::v16i16,  MVT::v8i64,   { 2, 1, 1, 1 } }, // vpmovqw
-    { ISD::TRUNCATE,  MVT::v32i16,  MVT::v8i64,   { 2, 1, 1, 1 } }, // vpmovqw
-    { ISD::TRUNCATE,  MVT::v8i32,   MVT::v8i64,   { 1, 1, 1, 1 } }, // vpmovqd
-    { ISD::TRUNCATE,  MVT::v4i32,   MVT::v4i64,   { 1, 1, 1, 1 } }, // zmm vpmovqd
-    { ISD::TRUNCATE,  MVT::v16i8,   MVT::v16i64,  { 5, 1, 1, 1 } },// 2*vpmovqd+concat+vpmovdb
-
-    { ISD::TRUNCATE,  MVT::v16i8,  MVT::v16i16,   { 3, 1, 1, 1 } }, // extend to v16i32
-    { ISD::TRUNCATE,  MVT::v32i8,  MVT::v32i16,   { 8, 1, 1, 1 } },
-    { ISD::TRUNCATE,  MVT::v64i8,  MVT::v32i16,   { 8, 1, 1, 1 } },
-
-    // Sign extend is zmm vpternlogd+vptruncdb.
-    // Zero extend is zmm broadcast load+vptruncdw.
-    { ISD::SIGN_EXTEND, MVT::v2i8,   MVT::v2i1,   { 3, 1, 1, 1 } },
-    { ISD::ZERO_EXTEND, MVT::v2i8,   MVT::v2i1,   { 4, 1, 1, 1 } },
-    { ISD::SIGN_EXTEND, MVT::v4i8,   MVT::v4i1,   { 3, 1, 1, 1 } },
-    { ISD::ZERO_EXTEND, MVT::v4i8,   MVT::v4i1,   { 4, 1, 1, 1 } },
-    { ISD::SIGN_EXTEND, MVT::v8i8,   MVT::v8i1,   { 3, 1, 1, 1 } },
-    { ISD::ZERO_EXTEND, MVT::v8i8,   MVT::v8i1,   { 4, 1, 1, 1 } },
-    { ISD::SIGN_EXTEND, MVT::v16i8,  MVT::v16i1,  { 3, 1, 1, 1 } },
-    { ISD::ZERO_EXTEND, MVT::v16i8,  MVT::v16i1,  { 4, 1, 1, 1 } },
-
-    // Sign extend is zmm vpternlogd+vptruncdw.
-    // Zero extend is zmm vpternlogd+vptruncdw+vpsrlw.
-    { ISD::SIGN_EXTEND, MVT::v2i16,  MVT::v2i1,   { 3, 1, 1, 1 } },
-    { ISD::ZERO_EXTEND, MVT::v2i16,  MVT::v2i1,   { 4, 1, 1, 1 } },
-    { ISD::SIGN_EXTEND, MVT::v4i16,  MVT::v4i1,   { 3, 1, 1, 1 } },
-    { ISD::ZERO_EXTEND, MVT::v4i16,  MVT::v4i1,   { 4, 1, 1, 1 } },
-    { ISD::SIGN_EXTEND, MVT::v8i16,  MVT::v8i1,   { 3, 1, 1, 1 } },
-    { ISD::ZERO_EXTEND, MVT::v8i16,  MVT::v8i1,   { 4, 1, 1, 1 } },
-    { ISD::SIGN_EXTEND, MVT::v16i16, MVT::v16i1,  { 3, 1, 1, 1 } },
-    { ISD::ZERO_EXTEND, MVT::v16i16, MVT::v16i1,  { 4, 1, 1, 1 } },
-
-    { ISD::SIGN_EXTEND, MVT::v2i32,  MVT::v2i1,   { 1, 1, 1, 1 } }, // zmm vpternlogd
-    { ISD::ZERO_EXTEND, MVT::v2i32,  MVT::v2i1,   { 2, 1, 1, 1 } }, // zmm vpternlogd+psrld
-    { ISD::SIGN_EXTEND, MVT::v4i32,  MVT::v4i1,   { 1, 1, 1, 1 } }, // zmm vpternlogd
-    { ISD::ZERO_EXTEND, MVT::v4i32,  MVT::v4i1,   { 2, 1, 1, 1 } }, // zmm vpternlogd+psrld
-    { ISD::SIGN_EXTEND, MVT::v8i32,  MVT::v8i1,   { 1, 1, 1, 1 } }, // zmm vpternlogd
-    { ISD::ZERO_EXTEND, MVT::v8i32,  MVT::v8i1,   { 2, 1, 1, 1 } }, // zmm vpternlogd+psrld
-    { ISD::SIGN_EXTEND, MVT::v2i64,  MVT::v2i1,   { 1, 1, 1, 1 } }, // zmm vpternlogq
-    { ISD::ZERO_EXTEND, MVT::v2i64,  MVT::v2i1,   { 2, 1, 1, 1 } }, // zmm vpternlogq+psrlq
-    { ISD::SIGN_EXTEND, MVT::v4i64,  MVT::v4i1,   { 1, 1, 1, 1 } }, // zmm vpternlogq
-    { ISD::ZERO_EXTEND, MVT::v4i64,  MVT::v4i1,   { 2, 1, 1, 1 } }, // zmm vpternlogq+psrlq
-
-    { ISD::SIGN_EXTEND, MVT::v16i32, MVT::v16i1,  { 1, 1, 1, 1 } }, // vpternlogd
-    { ISD::ZERO_EXTEND, MVT::v16i32, MVT::v16i1,  { 2, 1, 1, 1 } }, // vpternlogd+psrld
-    { ISD::SIGN_EXTEND, MVT::v8i64,  MVT::v8i1,   { 1, 1, 1, 1 } }, // vpternlogq
-    { ISD::ZERO_EXTEND, MVT::v8i64,  MVT::v8i1,   { 2, 1, 1, 1 } }, // vpternlogq+psrlq
-
-    { ISD::SIGN_EXTEND, MVT::v16i32, MVT::v16i8,  { 1, 1, 1, 1 } },
-    { ISD::ZERO_EXTEND, MVT::v16i32, MVT::v16i8,  { 1, 1, 1, 1 } },
-    { ISD::SIGN_EXTEND, MVT::v16i32, MVT::v16i16, { 1, 1, 1, 1 } },
-    { ISD::ZERO_EXTEND, MVT::v16i32, MVT::v16i16, { 1, 1, 1, 1 } },
-    { ISD::SIGN_EXTEND, MVT::v8i64,  MVT::v8i8,   { 1, 1, 1, 1 } },
-    { ISD::ZERO_EXTEND, MVT::v8i64,  MVT::v8i8,   { 1, 1, 1, 1 } },
-    { ISD::SIGN_EXTEND, MVT::v8i64,  MVT::v8i16,  { 1, 1, 1, 1 } },
-    { ISD::ZERO_EXTEND, MVT::v8i64,  MVT::v8i16,  { 1, 1, 1, 1 } },
-    { ISD::SIGN_EXTEND, MVT::v8i64,  MVT::v8i32,  { 1, 1, 1, 1 } },
-    { ISD::ZERO_EXTEND, MVT::v8i64,  MVT::v8i32,  { 1, 1, 1, 1 } },
-
-    { ISD::SIGN_EXTEND, MVT::v32i16, MVT::v32i8,  { 3, 1, 1, 1 } }, // FIXME: May not be right
-    { ISD::ZERO_EXTEND, MVT::v32i16, MVT::v32i8,  { 3, 1, 1, 1 } }, // FIXME: May not be right
-
-    { ISD::SINT_TO_FP,  MVT::v8f64,  MVT::v8i1,   { 4, 1, 1, 1 } },
-    { ISD::SINT_TO_FP,  MVT::v16f32, MVT::v16i1,  { 3, 1, 1, 1 } },
-    { ISD::SINT_TO_FP,  MVT::v8f64,  MVT::v16i8,  { 2, 1, 1, 1 } },
-    { ISD::SINT_TO_FP,  MVT::v16f32, MVT::v16i8,  { 1, 1, 1, 1 } },
-    { ISD::SINT_TO_FP,  MVT::v8f64,  MVT::v8i16,  { 2, 1, 1, 1 } },
-    { ISD::SINT_TO_FP,  MVT::v16f32, MVT::v16i16, { 1, 1, 1, 1 } },
-    { ISD::SINT_TO_FP,  MVT::v8f64,  MVT::v8i32,  { 1, 1, 1, 1 } },
-    { ISD::SINT_TO_FP,  MVT::v16f32, MVT::v16i32, { 1, 1, 1, 1 } },
-
-    { ISD::UINT_TO_FP,  MVT::v8f64,  MVT::v8i1,   { 4, 1, 1, 1 } },
-    { ISD::UINT_TO_FP,  MVT::v16f32, MVT::v16i1,  { 3, 1, 1, 1 } },
-    { ISD::UINT_TO_FP,  MVT::v8f64,  MVT::v16i8,  { 2, 1, 1, 1 } },
-    { ISD::UINT_TO_FP,  MVT::v16f32, MVT::v16i8,  { 1, 1, 1, 1 } },
-    { ISD::UINT_TO_FP,  MVT::v8f64,  MVT::v8i16,  { 2, 1, 1, 1 } },
-    { ISD::UINT_TO_FP,  MVT::v16f32, MVT::v16i16, { 1, 1, 1, 1 } },
-    { ISD::UINT_TO_FP,  MVT::v8f64,  MVT::v8i32,  { 1, 1, 1, 1 } },
-    { ISD::UINT_TO_FP,  MVT::v16f32, MVT::v16i32, { 1, 1, 1, 1 } },
-    { ISD::UINT_TO_FP,  MVT::v8f32,  MVT::v8i64,  {26, 1, 1, 1 } },
-    { ISD::UINT_TO_FP,  MVT::v8f64,  MVT::v8i64,  { 5, 1, 1, 1 } },
-
-    { ISD::FP_TO_SINT,  MVT::v16i8,  MVT::v16f32, { 2, 1, 1, 1 } },
-    { ISD::FP_TO_SINT,  MVT::v16i8,  MVT::v16f64, { 7, 1, 1, 1 } },
-    { ISD::FP_TO_SINT,  MVT::v32i8,  MVT::v32f64, {15, 1, 1, 1 } },
-    { ISD::FP_TO_SINT,  MVT::v64i8,  MVT::v64f32, {11, 1, 1, 1 } },
-    { ISD::FP_TO_SINT,  MVT::v64i8,  MVT::v64f64, {31, 1, 1, 1 } },
-    { ISD::FP_TO_SINT,  MVT::v8i16,  MVT::v8f64,  { 3, 1, 1, 1 } },
-    { ISD::FP_TO_SINT,  MVT::v16i16, MVT::v16f64, { 7, 1, 1, 1 } },
-    { ISD::FP_TO_SINT,  MVT::v32i16, MVT::v32f32, { 5, 1, 1, 1 } },
-    { ISD::FP_TO_SINT,  MVT::v32i16, MVT::v32f64, {15, 1, 1, 1 } },
-    { ISD::FP_TO_SINT,  MVT::v8i32,  MVT::v8f64,  { 1, 1, 1, 1 } },
-    { ISD::FP_TO_SINT,  MVT::v16i32, MVT::v16f64, { 3, 1, 1, 1 } },
-
-    { ISD::FP_TO_UINT,  MVT::v8i32,  MVT::v8f64,  { 1, 1, 1, 1 } },
-    { ISD::FP_TO_UINT,  MVT::v8i16,  MVT::v8f64,  { 3, 1, 1, 1 } },
-    { ISD::FP_TO_UINT,  MVT::v8i8,   MVT::v8f64,  { 3, 1, 1, 1 } },
-    { ISD::FP_TO_UINT,  MVT::v16i32, MVT::v16f32, { 1, 1, 1, 1 } },
-    { ISD::FP_TO_UINT,  MVT::v16i16, MVT::v16f32, { 3, 1, 1, 1 } },
-    { ISD::FP_TO_UINT,  MVT::v16i8,  MVT::v16f32, { 3, 1, 1, 1 } },
+      {ISD::FP_EXTEND, MVT::v8f64, MVT::v8f32, {1, 1, 1, 1}},
+      {ISD::FP_EXTEND, MVT::v8f64, MVT::v16f32, {3, 1, 1, 1}},
+      {ISD::FP_EXTEND,
+       MVT::v16f64,
+       MVT::v16f32,
+       {4, 1, 1, 1}}, // 2*vcvtps2pd+vextractf64x4
+      {ISD::FP_EXTEND, MVT::v16f32, MVT::v16f16, {1, 1, 1, 1}}, // vcvtph2ps
+      {ISD::FP_EXTEND,
+       MVT::v8f64,
+       MVT::v8f16,
+       {2, 1, 1, 1}}, // vcvtph2ps+vcvtps2pd
+      {ISD::FP_ROUND, MVT::v8f32, MVT::v8f64, {1, 1, 1, 1}},
+      {ISD::FP_ROUND, MVT::v16f16, MVT::v16f32, {1, 1, 1, 1}}, // vcvtps2ph
+
+      {ISD::TRUNCATE, MVT::v2i1, MVT::v2i8, {3, 1, 1, 1}},     // sext+vpslld+vptestmd
+      {ISD::TRUNCATE, MVT::v4i1, MVT::v4i8, {3, 1, 1, 1}},     // sext+vpslld+vptestmd
+      {ISD::TRUNCATE, MVT::v8i1, MVT::v8i8, {3, 1, 1, 1}},     // sext+vpslld+vptestmd
+      {ISD::TRUNCATE, MVT::v16i1, MVT::v16i8, {3, 1, 1, 1}},   // sext+vpslld+vptestmd
+      {ISD::TRUNCATE, MVT::v2i1, MVT::v2i16, {3, 1, 1, 1}},    // sext+vpsllq+vptestmq
+      {ISD::TRUNCATE, MVT::v4i1, MVT::v4i16, {3, 1, 1, 1}},    // sext+vpsllq+vptestmq
+      {ISD::TRUNCATE, MVT::v8i1, MVT::v8i16, {3, 1, 1, 1}},    // sext+vpsllq+vptestmq
+      {ISD::TRUNCATE, MVT::v16i1, MVT::v16i16, {3, 1, 1, 1}},  // sext+vpslld+vptestmd
+      {ISD::TRUNCATE, MVT::v2i1, MVT::v2i32, {2, 1, 1, 1}},    // zmm vpslld+vptestmd
+      {ISD::TRUNCATE, MVT::v4i1, MVT::v4i32, {2, 1, 1, 1}},    // zmm vpslld+vptestmd
+      {ISD::TRUNCATE, MVT::v8i1, MVT::v8i32, {2, 1, 1, 1}},    // zmm vpslld+vptestmd
+      {ISD::TRUNCATE, MVT::v16i1, MVT::v16i32, {2, 1, 1, 1}},  // vpslld+vptestmd
+      {ISD::TRUNCATE, MVT::v2i1, MVT::v2i64, {2, 1, 1, 1}},    // zmm vpsllq+vptestmq
+      {ISD::TRUNCATE, MVT::v4i1, MVT::v4i64, {2, 1, 1, 1}},    // zmm vpsllq+vptestmq
+      {ISD::TRUNCATE, MVT::v8i1, MVT::v8i64, {2, 1, 1, 1}},    // vpsllq+vptestmq
+      {ISD::TRUNCATE, MVT::v2i8, MVT::v2i32, {2, 1, 1, 1}},    // vpmovdb
+      {ISD::TRUNCATE, MVT::v4i8, MVT::v4i32, {2, 1, 1, 1}},    // vpmovdb
+      {ISD::TRUNCATE, MVT::v16i8, MVT::v16i32, {2, 1, 1, 1}},  // vpmovdb
+      {ISD::TRUNCATE, MVT::v32i8, MVT::v16i32, {2, 1, 1, 1}},  // vpmovdb
+      {ISD::TRUNCATE, MVT::v64i8, MVT::v16i32, {2, 1, 1, 1}},  // vpmovdb
+      {ISD::TRUNCATE, MVT::v16i16, MVT::v16i32, {2, 1, 1, 1}}, // vpmovdw
+      {ISD::TRUNCATE, MVT::v32i16, MVT::v16i32, {2, 1, 1, 1}}, // vpmovdw
+      {ISD::TRUNCATE, MVT::v2i8, MVT::v2i64, {2, 1, 1, 1}},    // vpmovqb
+      {ISD::TRUNCATE, MVT::v2i16, MVT::v2i64, {1, 1, 1, 1}},   // vpshufb
+      {ISD::TRUNCATE, MVT::v8i8, MVT::v8i64, {2, 1, 1, 1}},    // vpmovqb
+      {ISD::TRUNCATE, MVT::v16i8, MVT::v8i64, {2, 1, 1, 1}},   // vpmovqb
+      {ISD::TRUNCATE, MVT::v32i8, MVT::v8i64, {2, 1, 1, 1}},   // vpmovqb
+      {ISD::TRUNCATE, MVT::v64i8, MVT::v8i64, {2, 1, 1, 1}},   // vpmovqb
+      {ISD::TRUNCATE, MVT::v8i16, MVT::v8i64, {2, 1, 1, 1}},   // vpmovqw
+      {ISD::TRUNCATE, MVT::v16i16, MVT::v8i64, {2, 1, 1, 1}},  // vpmovqw
+      {ISD::TRUNCATE, MVT::v32i16, MVT::v8i64, {2, 1, 1, 1}},  // vpmovqw
+      {ISD::TRUNCATE, MVT::v8i32, MVT::v8i64, {1, 1, 1, 1}},   // vpmovqd
+      {ISD::TRUNCATE, MVT::v4i32, MVT::v4i64, {1, 1, 1, 1}},   // zmm vpmovqd
+      {ISD::TRUNCATE, MVT::v16i8, MVT::v16i64, {5, 1, 1, 1}},  // 2*vpmovqd+concat+vpmovdb
+
+      {ISD::TRUNCATE, MVT::v16i8, MVT::v16i16, {3, 1, 1, 1}}, // extend to v16i32
+      {ISD::TRUNCATE, MVT::v32i8, MVT::v32i16, {8, 1, 1, 1}},
+      {ISD::TRUNCATE, MVT::v64i8, MVT::v32i16, {8, 1, 1, 1}},
+
+      // Sign extend is zmm vpternlogd+vptruncdb.
+      // Zero extend is zmm broadcast load+vptruncdw.
+      {ISD::SIGN_EXTEND, MVT::v2i8, MVT::v2i1, {3, 1, 1, 1}},
+      {ISD::ZERO_EXTEND, MVT::v2i8, MVT::v2i1, {4, 1, 1, 1}},
+      {ISD::SIGN_EXTEND, MVT::v4i8, MVT::v4i1, {3, 1, 1, 1}},
+      {ISD::ZERO_EXTEND, MVT::v4i8, MVT::v4i1, {4, 1, 1, 1}},
+      {ISD::SIGN_EXTEND, MVT::v8i8, MVT::v8i1, {3, 1, 1, 1}},
+      {ISD::ZERO_EXTEND, MVT::v8i8, MVT::v8i1, {4, 1, 1, 1}},
+      {ISD::SIGN_EXTEND, MVT::v16i8, MVT::v16i1, {3, 1, 1, 1}},
+      {ISD::ZERO_EXTEND, MVT::v16i8, MVT::v16i1, {4, 1, 1, 1}},
+
+      // Sign extend is zmm vpternlogd+vptruncdw.
+      // Zero extend is zmm vpternlogd+vptruncdw+vpsrlw.
+      {ISD::SIGN_EXTEND, MVT::v2i16, MVT::v2i1, {3, 1, 1, 1}},
+      {ISD::ZERO_EXTEND, MVT::v2i16, MVT::v2i1, {4, 1, 1, 1}},
+      {ISD::SIGN_EXTEND, MVT::v4i16, MVT::v4i1, {3, 1, 1, 1}},
+      {ISD::ZERO_EXTEND, MVT::v4i16, MVT::v4i1, {4, 1, 1, 1}},
+      {ISD::SIGN_EXTEND, MVT::v8i16, MVT::v8i1, {3, 1, 1, 1}},
+      {ISD::ZERO_EXTEND, MVT::v8i16, MVT::v8i1, {4, 1, 1, 1}},
+      {ISD::SIGN_EXTEND, MVT::v16i16, MVT::v16i1, {3, 1, 1, 1}},
+      {ISD::ZERO_EXTEND, MVT::v16i16, MVT::v16i1, {4, 1, 1, 1}},
+
+      {ISD::SIGN_EXTEND, MVT::v2i32, MVT::v2i1, {1, 1, 1, 1}},   // zmm vpternlogd
+      {ISD::ZERO_EXTEND, MVT::v2i32, MVT::v2i1, {2, 1, 1, 1}},   // zmm vpternlogd+psrld
+      {ISD::SIGN_EXTEND, MVT::v4i32, MVT::v4i1, {1, 1, 1, 1}},   // zmm vpternlogd
+      {ISD::ZERO_EXTEND, MVT::v4i32, MVT::v4i1, {2, 1, 1, 1}},   // zmm vpternlogd+psrld
+      {ISD::SIGN_EXTEND, MVT::v8i32, MVT::v8i1, {1, 1, 1, 1}},   // zmm vpternlogd
+      {ISD::ZERO_EXTEND, MVT::v8i32, MVT::v8i1, {2, 1, 1, 1}},   // zmm vpternlogd+psrld
+      {ISD::SIGN_EXTEND, MVT::v2i64, MVT::v2i1, {1, 1, 1, 1}},   // zmm vpternlogq
+      {ISD::ZERO_EXTEND, MVT::v2i64, MVT::v2i1, {2, 1, 1, 1}},   // zmm vpternlogq+psrlq
+      {ISD::SIGN_EXTEND, MVT::v4i64, MVT::v4i1, {1, 1, 1, 1}},   // zmm vpternlogq
+      {ISD::ZERO_EXTEND, MVT::v4i64, MVT::v4i1, {2, 1, 1, 1}},   // zmm vpternlogq+psrlq
+
+      {ISD::SIGN_EXTEND, MVT::v16i32, MVT::v16i1, {1, 1, 1, 1}}, // vpternlogd
+      {ISD::ZERO_EXTEND, MVT::v16i32, MVT::v16i1, {2, 1, 1, 1}}, // vpternlogd+psrld
+      {ISD::SIGN_EXTEND, MVT::v8i64, MVT::v8i1, {1, 1, 1, 1}},   // vpternlogq
+      {ISD::ZERO_EXTEND, MVT::v8i64, MVT::v8i1, {2, 1, 1, 1}},   // vpternlogq+psrlq
+
+      {ISD::SIGN_EXTEND, MVT::v16i32, MVT::v16i8, {1, 1, 1, 1}},
+      {ISD::ZERO_EXTEND, MVT::v16i32, MVT::v16i8, {1, 1, 1, 1}},
+      {ISD::SIGN_EXTEND, MVT::v16i32, MVT::v16i16, {1, 1, 1, 1}},
+      {ISD::ZERO_EXTEND, MVT::v16i32, MVT::v16i16, {1, 1, 1, 1}},
+      {ISD::SIGN_EXTEND, MVT::v8i64, MVT::v8i8, {1, 1, 1, 1}},
+      {ISD::ZERO_EXTEND, MVT::v8i64, MVT::v8i8, {1, 1, 1, 1}},
+      {ISD::SIGN_EXTEND, MVT::v8i64, MVT::v8i16, {1, 1, 1, 1}},
+      {ISD::ZERO_EXTEND, MVT::v8i64, MVT::v8i16, {1, 1, 1, 1}},
+      {ISD::SIGN_EXTEND, MVT::v8i64, MVT::v8i32, {1, 1, 1, 1}},
+      {ISD::ZERO_EXTEND, MVT::v8i64, MVT::v8i32, {1, 1, 1, 1}},
+
+      {ISD::SIGN_EXTEND, MVT::v32i16, MVT::v32i8, {3, 1, 1, 1}}, // FIXME: May not be right
+      {ISD::ZERO_EXTEND, MVT::v32i16, MVT::v32i8, {3, 1, 1, 1}}, // FIXME: May not be right
+
+      {ISD::SINT_TO_FP, MVT::v8f64, MVT::v8i1, {4, 1, 1, 1}},
+      {ISD::SINT_TO_FP, MVT::v16f32, MVT::v16i1, {3, 1, 1, 1}},
+      {ISD::SINT_TO_FP, MVT::v8f64, MVT::v16i8, {2, 1, 1, 1}},
+      {ISD::SINT_TO_FP, MVT::v16f32, MVT::v16i8, {1, 1, 1, 1}},
+      {ISD::SINT_TO_FP, MVT::v8f64, MVT::v8i16, {2, 1, 1, 1}},
+      {ISD::SINT_TO_FP, MVT::v16f32, MVT::v16i16, {1, 1, 1, 1}},
+      {ISD::SINT_TO_FP, MVT::v8f64, MVT::v8i32, {1, 1, 1, 1}},
+      {ISD::SINT_TO_FP, MVT::v16f32, MVT::v16i32, {1, 1, 1, 1}},
+
+      {ISD::UINT_TO_FP, MVT::v8f64, MVT::v8i1, {4, 1, 1, 1}},
+      {ISD::UINT_TO_FP, MVT::v16f32, MVT::v16i1, {3, 1, 1, 1}},
+      {ISD::UINT_TO_FP, MVT::v8f64, MVT::v16i8, {2, 1, 1, 1}},
+      {ISD::UINT_TO_FP, MVT::v16f32, MVT::v16i8, {1, 1, 1, 1}},
+      {ISD::UINT_TO_FP, MVT::v8f64, MVT::v8i16, {2, 1, 1, 1}},
+      {ISD::UINT_TO_FP, MVT::v16f32, MVT::v16i16, {1, 1, 1, 1}},
+      {ISD::UINT_TO_FP, MVT::v8f64, MVT::v8i32, {1, 1, 1, 1}},
+      {ISD::UINT_TO_FP, MVT::v16f32, MVT::v16i32, {1, 1, 1, 1}},
+      {ISD::UINT_TO_FP, MVT::v8f32, MVT::v8i64, {26, 1, 1, 1}},
+      {ISD::UINT_TO_FP, MVT::v8f64, MVT::v8i64, {5, 1, 1, 1}},
+
+      {ISD::FP_TO_SINT, MVT::v16i8, MVT::v16f32, {2, 1, 1, 1}},
+      {ISD::FP_TO_SINT, MVT::v16i8, MVT::v16f64, {7, 1, 1, 1}},
+      {ISD::FP_TO_SINT, MVT::v32i8, MVT::v32f64, {15, 1, 1, 1}},
+      {ISD::FP_TO_SINT, MVT::v64i8, MVT::v64f32, {11, 1, 1, 1}},
+      {ISD::FP_TO_SINT, MVT::v64i8, MVT::v64f64, {31, 1, 1, 1}},
+      {ISD::FP_TO_SINT, MVT::v8i16, MVT::v8f64, {3, 1, 1, 1}},
+      {ISD::FP_TO_SINT, MVT::v16i16, MVT::v16f64, {7, 1, 1, 1}},
+      {ISD::FP_TO_SINT, MVT::v32i16, MVT::v32f32, {5, 1, 1, 1}},
+      {ISD::FP_TO_SINT, MVT::v32i16, MVT::v32f64, {15, 1, 1, 1}},
+      {ISD::FP_TO_SINT, MVT::v8i32, MVT::v8f64, {1, 1, 1, 1}},
+      {ISD::FP_TO_SINT, MVT::v16i32, MVT::v16f64, {3, 1, 1, 1}},
+
+      {ISD::FP_TO_UINT, MVT::v8i32, MVT::v8f64, {1, 1, 1, 1}},
+      {ISD::FP_TO_UINT, MVT::v8i16, MVT::v8f64, {3, 1, 1, 1}},
+      {ISD::FP_TO_UINT, MVT::v8i8, MVT::v8f64, {3, 1, 1, 1}},
+      {ISD::FP_TO_UINT, MVT::v16i32, MVT::v16f32, {1, 1, 1, 1}},
+      {ISD::FP_TO_UINT, MVT::v16i16, MVT::v16f32, {3, 1, 1, 1}},
+      {ISD::FP_TO_UINT, MVT::v16i8, MVT::v16f32, {3, 1, 1, 1}},
   };
 
   static const TypeConversionCostKindTblEntry AVX512BWVLConversionTbl[] {
@@ -2977,14 +3055,17 @@ InstructionCost X86TTIImpl::getCastInstrCost(unsigned Opcode, Type *Dst,
   };
 
   static const TypeConversionCostKindTblEntry F16ConversionTbl[] = {
-    { ISD::FP_ROUND,  MVT::f16,     MVT::f32,     { 1, 1, 1, 1 } },
-    { ISD::FP_ROUND,  MVT::v8f16,   MVT::v8f32,   { 1, 1, 1, 1 } },
-    { ISD::FP_ROUND,  MVT::v4f16,   MVT::v4f32,   { 1, 1, 1, 1 } },
-    { ISD::FP_EXTEND, MVT::f32,     MVT::f16,     { 1, 1, 1, 1 } },
-    { ISD::FP_EXTEND, MVT::f64,     MVT::f16,     { 2, 1, 1, 1 } }, // vcvtph2ps+vcvtps2pd
-    { ISD::FP_EXTEND, MVT::v8f32,   MVT::v8f16,   { 1, 1, 1, 1 } },
-    { ISD::FP_EXTEND, MVT::v4f32,   MVT::v4f16,   { 1, 1, 1, 1 } },
-    { ISD::FP_EXTEND, MVT::v4f64,   MVT::v4f16,   { 2, 1, 1, 1 } }, // vcvtph2ps+vcvtps2pd
+      {ISD::FP_ROUND, MVT::f16, MVT::f32, {1, 1, 1, 1}},
+      {ISD::FP_ROUND, MVT::v8f16, MVT::v8f32, {1, 1, 1, 1}},
+      {ISD::FP_ROUND, MVT::v4f16, MVT::v4f32, {1, 1, 1, 1}},
+      {ISD::FP_EXTEND, MVT::f32, MVT::f16, {1, 1, 1, 1}},
+      {ISD::FP_EXTEND, MVT::f64, MVT::f16, {2, 1, 1, 1}}, // vcvtph2ps+vcvtps2pd
+      {ISD::FP_EXTEND, MVT::v8f32, MVT::v8f16, {1, 1, 1, 1}},
+      {ISD::FP_EXTEND, MVT::v4f32, MVT::v4f16, {1, 1, 1, 1}},
+      {ISD::FP_EXTEND, MVT::v4f64, MVT::v4f16, {2, 1, 1, 1}}, // vcvtph2ps+vcvtps2pd
   };
 
   // Attempt to map directly to (simple) MVT types to let us match custom entries.

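For readers skimming these tables: each entry's four-number initializer is a per-cost-kind tuple. Below is a minimal sketch of how such an entry is consulted, assuming LLVM's ConvertCostTableLookup helper from llvm/Target/CostTable.h and the CostKindCosts indexing used in this file; the wrapper function itself is illustrative, not code from the patch:

  // Each {a, b, c, d} tuple encodes one cost per TargetCostKind:
  // a = reciprocal throughput, b = latency, c = code size,
  // d = size-and-latency.
  InstructionCost lookupF16CastCost(int ISD, MVT Dst, MVT Src,
                                    TTI::TargetCostKind CostKind) {
    if (const auto *Entry =
            ConvertCostTableLookup(F16ConversionTbl, ISD, Dst, Src))
      if (auto KindCost = Entry->Cost[CostKind]) // picks one of the four
        return *KindCost;
    return InstructionCost::getInvalid(); // no entry: caller falls back
  }
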
MatzeB (Contributor, Author) commented Oct 21, 2024

(seems the setOperationAction() parts break some things. Working on that right now)

MatzeB merged commit 054c23d into llvm:main on Oct 25, 2024
7 of 8 checks passed
winner245 pushed a commit to winner245/llvm-project that referenced this pull request Oct 26, 2024
Improve cost-modeling for x86 __fp16 conversions so the SLPVectorizer
transforms the patterns:

- Override `X86TTIImpl::getStoreMinimumVF` to report a minimum VF of 4 (an SSE
  register can hold 4xfloat converted/stored to 4xf16). This is necessary
  because fp16 stores are neither modeled as trunc-stores nor can we mark
  direct Xxfp16 stores as legal, since we generally expand fp16 operations
  (see the sketch below).
- Add missing cost entries to `X86TTIImpl::getCastInstrCost` for
  conversion from/to fp16. Note that conversion from f64 to f16 is not
  supported by an X86 instruction.
searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Oct 27, 2024
Revert: breaks the hipRuntime build
054c23d X86: Improve cost model of fp16 conversion (llvm#113195)

Change-Id: I2dbd30b82c6b355ff83368c9fd5c8b2f83ce5db1

  if (ISD == ISD::FP_ROUND && LTDest.second.getScalarType() == MVT::f16) {
    // Conversion requires a libcall.
    return InstructionCost::getInvalid();
  }
Contributor

This is breaking https://github.com/google/jax/blob/main/tests/lax_test.py#L3630 LazyConstantTest.testConvertElementTypeAvoidsCopies21 (dtype_in=<class 'numpy.float64'>, dtype_out=<class 'numpy.float16'>).

With

F1029 08:45:30.640847    4013 logging.cc:62] assert.h assertion failed at third_party/llvm/llvm-project/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:4569 in VectorizationFactor llvm::LoopVectorizationPlanner::selectVectorizationFactor(): ExpectedCost.isValid() && "Unexpected invalid cost for scalar loop"
*** Check failure stack trace: ***
    @     0x7ef66f09cf59  absl::log_internal::LogMessage::SendToLog()
    @     0x7ef66f09c4fe  absl::log_internal::LogMessage::Flush()
    @     0x7ef66f09d519  absl::log_internal::LogMessageFatal::~LogMessageFatal()
    @     0x7ef67ade7314  __assert_fail
    @     0x7efa86da3f10  llvm::LoopVectorizationPlanner::selectVectorizationFactor()
    @     0x7efa86db95df  llvm::LoopVectorizationPlanner::computeBestVF()
    @     0x7efa86dcbdfd  llvm::LoopVectorizePass::processLoop()
    @     0x7efa86dd2c3d  llvm::LoopVectorizePass::runImpl()
    @     0x7efa86dd3875  llvm::LoopVectorizePass::run()
    @     0x7efa8ceb7332  llvm::detail::PassModel<>::run()
    @     0x7ef9520b9050  llvm::PassManager<>::run()
    @     0x7efaff179412  llvm::detail::PassModel<>::run()
    @     0x7ef9520be28a  llvm::ModuleToFunctionPassAdaptor::run()
    @     0x7efaff179192  llvm::detail::PassModel<>::run()
    @     0x7ef9520b7d7c  llvm::PassManager<>::run()
    @     0x7efaa12ec861  xla::cpu::CompilerFunctor::operator()()
    @     0x7efa913b0271  llvm::orc::ThreadSafeModule::withModuleDo<>()
    @     0x7efa913b000b  llvm::orc::IRCompileLayer::emit()
    @     0x7efa913e6d45  llvm::orc::BasicIRLayerMaterializationUnit::materialize()
    @     0x7efa91454337  llvm::orc::InPlaceTaskDispatcher::dispatch()
    @     0x7efa91349466  llvm::orc::ExecutionSession::dispatchOutstandingMUs()
    @     0x7efa9134e9e6  llvm::orc::ExecutionSession::OL_completeLookup()
    @     0x7efa91369a89  llvm::orc::InProgressFullLookupState::complete()
    @     0x7efa9133a0f0  llvm::orc::ExecutionSession::OL_applyQueryPhase1()
    @     0x7efa91337234  llvm::orc::ExecutionSession::lookup()
    @     0x7efa9134991e  llvm::orc::ExecutionSession::lookup()
    @     0x7efa91349de8  llvm::orc::ExecutionSession::lookup()
    @     0x7efa9134a30e  llvm::orc::ExecutionSession::lookup()
    @     0x7efa9134a459  llvm::orc::ExecutionSession::lookup()
    @     0x7efaa1719abf  xla::cpu::SimpleOrcJIT::FindCompiledSymbol()
    @     0x7efaddc247c0  absl::internal_any_invocable::RemoteInvoker<>()
    @     0x7efaddc0fb68  std::__u::__function::__policy_invoker<>::__call_impl<>()
    @     0x7ef89847e1b6  tsl::thread::EigenEnvironment::ExecuteTask()
    @     0x7ef89847dd10  Eigen::ThreadPoolTempl<>::WorkerLoop()
    @     0x7ef89847d940  std::__u::invoke<>()
    @     0x7ef6a5f9e25e  Thread::ThreadBody()
    @     0x7efafb6827db  start_thread
    @     0x7efabc18e05f  clone

I dumped the LLVM IR beforehand:

; Function Attrs: nofree norecurse nosync nounwind memory(readwrite, inaccessiblemem: none) uwtable
define noalias noundef ptr @convert.2(ptr nocapture readonly %0) local_unnamed_addr #0 {
  %args_gep = getelementptr inbounds nuw i8, ptr %0, i64 24
  %args = load ptr, ptr %args_gep, align 8
  %arg0 = load ptr, ptr %args, align 8, !invariant.load !0, !dereferenceable !1, !align !2
  %arg1_gep = getelementptr i8, ptr %args, i64 16
  %arg1 = load ptr, ptr %arg1_gep, align 8, !invariant.load !0, !dereferenceable !3, !align !2
  br label %convert.2.loop_body.dim.0

convert.2.loop_body.dim.0:                        ; preds = %1, %convert.2.loop_body.dim.0
  %convert.2.invar_address.dim.0.03 = phi i64 [ 0, %1 ], [ %invar.inc, %convert.2.loop_body.dim.0 ]
  %2 = getelementptr inbounds [5 x double], ptr %arg0, i64 0, i64 %convert.2.invar_address.dim.0.03
  %3 = load double, ptr %2, align 8, !invariant.load !0, !noalias !4
  %4 = fptrunc double %3 to half
  %5 = getelementptr inbounds [5 x half], ptr %arg1, i64 0, i64 %convert.2.invar_address.dim.0.03
  store half %4, ptr %5, align 2, !alias.scope !4
  %invar.inc = add nuw nsw i64 %convert.2.invar_address.dim.0.03, 1
  %exitcond = icmp eq i64 %invar.inc, 5
  br i1 %exitcond, label %return, label %convert.2.loop_body.dim.0

return:                                           ; preds = %convert.2.loop_body.dim.0
  ret ptr null
}

Could we fix this?

MatzeB (Contributor, Author)

I am not able to reproduce this so far. Unfortunately the dump does not contain some of the referenced metadata, so I have to make guesses for that. Running opt -S -o - -passes=loop-vectorize /tmp/x.ll works just fine, and I guess I need some target setup (I played with -mtriple=x86_64 -mattr=+avx512f,+f16c, but that doesn't repro either).

That said, could you try whether replacing the InstructionCost::getInvalid(); with InstructionCost::getMax() helps, or, if that doesn't work, with a big number like 128?

MatzeB (Contributor, Author)

I hope #114128 fixes this.

Contributor

Yes, it looks like it does. Thanks! (And apologies about the bad dump.)

MatzeB added a commit that referenced this pull request Oct 30, 2024
Returning invalid instruction costs when converting from/to fp16 in
`X86TTIImpl::getCastInstrCost` when there is no hardware support
available was triggering asserts. This changes the code to return a
large (arbitrary) number to model the fact that libcalls are used to
implement the conversion.

This also simplifies the code by only reporting costs for the scalar
fp16 conversion; vectorized costs are left to the fallback, which
assumes scalarization.

This is a follow-up to the assertion issues reported for the changes in
#113195.
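
The shape of that follow-up, as a hedged sketch (the guard and the constant are illustrative assumptions, not the verbatim patch from #114128):

  // Sketch: model the fp16 conversion libcall with a large finite cost
  // instead of an invalid one, so callers such as the LoopVectorizer never
  // see an invalid scalar cost. The constant is arbitrary but "expensive".
  if (ISD == ISD::FP_ROUND && LTDest.second.getScalarType() == MVT::f16) {
    // Conversion requires a libcall.
    return InstructionCost(128);
  }
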
searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Nov 3, 2024
NoumanAmir657 pushed a commit to NoumanAmir657/llvm-project that referenced this pull request Nov 4, 2024
NoumanAmir657 pushed a commit to NoumanAmir657/llvm-project that referenced this pull request Nov 4, 2024
searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Aug 13, 2025

upstream commit: 255e441