Skip to content
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
15 changes: 15 additions & 0 deletions llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -3144,6 +3144,21 @@ InstructionCost AArch64TTIImpl::getCastInstrCost(unsigned Opcode, Type *Dst,
{ISD::SIGN_EXTEND, MVT::nxv8i32, MVT::nxv8i16, 2},
{ISD::SIGN_EXTEND, MVT::nxv8i64, MVT::nxv8i16, 6},
{ISD::SIGN_EXTEND, MVT::nxv4i64, MVT::nxv4i32, 2},

// Add cost for extending and converting to illegal -too wide- scalable
// Extending one size (e.g. i32 -> f64) takes 2 unpacks and 2 fcvts, while
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know that the cost-model is a bit of a guessing game, but is there any rationale behind picking a factor of 3? (i.e. why the cost is 12 instead of 4)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fcvt instructions seem to have 1/2 to 1/8 the throughput (depending on type) compared to simple arithmetic instructions, e.g. add, so I bumped the cost of those. The numbers may not be the best overall, but don't seem to lead to regressions at present. We may want to try a range of values at some point to see if there's a better estimate.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the cost of converts of the 'not too wide' types then also be increased to reflect a higher reciprocal cost?
e.g. I see a cost of 1 for:

CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nv2si16_to_f64 = sitofp <vscale x 2 x i16> undef to <vscale x 2 x double>

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We just discussed this offline, but just sharing my thoughts here: IMO the table should represent the cost of casts of legal types. Illegal types should be handled by generic code that multiplies the cost by the 'type legalization cost'. This is actually what happens for fixed-length types (see the code just below the table), but not (yet) for scalable types. Otherwise, any other illegal types that are not in the table (which includes types that cannot be represented by MVTs because they're "too wide") will get some default cost, which may be far too low.

It also seems that SINT_TO_FP records are missing in the table for scalable vector types (only FP_TO_SINT is handled). This is probably just a historical omission because this table gets updated/botched on an ad-hoc basis when people find that the cost is wrong for some workload, for some type and operation. It would be nice to clean this up.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I've changed the approach slightly to use the pseudo-legalization from the base getCastInstrCost that NEON uses (note the code below the table is about using SVE for fixed-length, so doesn't always apply).

Using this approach, we'll still get some illegal types (e.g. mapping nxv2i16 -> nxv2f64, the input would be promoted to nxv2i64 but that's not done in the current code for NEON), but I'm covering the cases where the destination type is legal.

I've decided to back away from increasing the cost of direct fcvts here – even though they have less throughput than add, the NEON values are not written with that in mind so we might incorrectly decide to favour NEON (or scalar) code.

I'll rerun some benchmarking with these adjusted values to see whether there's any regression from doing this.

// extending twice (e.g. i16 -> f64) takes 6 unpacks and 4 fcvts.
{ISD::SINT_TO_FP, MVT::nxv16f16, MVT::nxv16i8, 12},
{ISD::SINT_TO_FP, MVT::nxv16f32, MVT::nxv16i8, 22},
{ISD::SINT_TO_FP, MVT::nxv8f32, MVT::nxv8i16, 12},
{ISD::SINT_TO_FP, MVT::nxv8f64, MVT::nxv8i16, 22},
{ISD::SINT_TO_FP, MVT::nxv4f64, MVT::nxv4i32, 12},

{ISD::UINT_TO_FP, MVT::nxv16f16, MVT::nxv16i8, 12},
{ISD::UINT_TO_FP, MVT::nxv16f32, MVT::nxv16i8, 22},
{ISD::UINT_TO_FP, MVT::nxv8f32, MVT::nxv8i16, 12},
{ISD::UINT_TO_FP, MVT::nxv8f64, MVT::nxv8i16, 22},
{ISD::UINT_TO_FP, MVT::nxv4f64, MVT::nxv4i32, 12},
};

// We have to estimate a cost of fixed length operation upon
Expand Down
Loading
Loading