Skip to content
Merged
Show file tree
Hide file tree
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
54 changes: 50 additions & 4 deletions llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -1458,16 +1458,62 @@ InstructionCost ARMTTIImpl::getArithmeticInstrCost(
if (LooksLikeAFreeShift())
return 0;

// When targets have both DSP and MVE we find that the
// the compiler will attempt to vectorize as well as using
// scalar SMLAL operations. This is in cases where we have
// the pattern ext(mul(ext(i16), ext(i16))) we find
// that generated codegen performs better when only using SMLAL scalar
// ops instead of trying to mix vector ops with SMLAL ops. We therefore
// check if a mul instruction is used in a SMLAL pattern.
auto MulInSMLALPattern = [&](const Instruction *I, unsigned Opcode,
Type *Ty) -> bool {
if (!ST->hasDSP() || !ST->hasMVEIntegerOps())
return false;
if (!I)
return false;

if (Opcode != Instruction::Mul)
return false;

if (Ty->isVectorTy())
return false;

auto IsSExtInst = [](const Value *V) -> bool {
return (dyn_cast<SExtInst>(V)) ? true : false;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return isa(V);

Is it worth adding ZExt at the same time for UMULL?

Copy link
Contributor Author

@nasherm nasherm Mar 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Surprisingly no. Vectorization for my reproducer using unsigned types is done quite well

Correction. I double checked and yes this may be beneficial

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So by adding ZExt we do avoid mixing vector and scalar instructions. The generated code doesn't take advantage of UMULL instructions however and just sticks with simple scalar ops. This was the behavior before the changes to the SLPVectorizer were made.

I do think that once this is merged there is further improvement to be done on SMLAL code gen which I intend to investigate i.e further code folding using DSP instructions. I could add UMULL codegen into my investigation as well.

Regardless adding ZExt support will just return previous behavior so I don't believe it to be a concern.

};

// We check the arguments of the function to see if they're extends
auto *BinOp = dyn_cast<BinaryOperator>(I);
if (!BinOp)
return false;
auto *Op0 = BinOp->getOperand(0);
auto *Op1 = BinOp->getOperand(1);
if (Op0 && Op1 && IsSExtInst(Op0) && IsSExtInst(Op1)) {
// In this case we're interested in an ext of an i16
if (!Op0->getType()->isIntegerTy(32) || !Op1->getType()->isIntegerTy(32))
return false;
// We need to check if this result will be further extended to i64
for (auto *U : I->users())
if (IsSExtInst(dyn_cast<Value>(U)))
return true;
}

return false;
};

if (MulInSMLALPattern(CxtI, Opcode, Ty))
return 0;

// Default to cheap (throughput/size of 1 instruction) but adjust throughput
// for "multiple beats" potentially needed by MVE instructions.
int BaseCost = 1;
if (ST->hasMVEIntegerOps() && Ty->isVectorTy())
BaseCost = ST->getMVEVectorCostFactor(CostKind);

// The rest of this mostly follows what is done in BaseT::getArithmeticInstrCost,
// without treating floats as more expensive that scalars or increasing the
// costs for custom operations. The results is also multiplied by the
// MVEVectorCostFactor where appropriate.
// The rest of this mostly follows what is done in
// BaseT::getArithmeticInstrCost, without treating floats as more expensive
// that scalars or increasing the costs for custom operations. The results is
// also multiplied by the MVEVectorCostFactor where appropriate.
if (TLI->isOperationLegalOrCustomOrPromote(ISDOpcode, LT.second))
return LT.first * BaseCost;

Expand Down
37 changes: 37 additions & 0 deletions llvm/test/Analysis/CostModel/ARM/muls-in-smlal-patterns.ll
Original file line number Diff line number Diff line change
@@ -0,0 +1,37 @@
; RUN: opt -passes="print<cost-model>" 2>&1 -disable-output -mtriple thumbv8.1-m.main -mattr=+mve,+dsp < %s | FileCheck %s
define i64 @test(i16 %a, i16 %b) {
; CHECK-LABEL: 'test'
; CHECK: Cost Model: Found an estimated cost of 0 for instruction: %m = mul i32 %as, %bs
;
%as = sext i16 %a to i32
%bs = sext i16 %b to i32
%m = mul i32 %as, %bs
%ms = sext i32 %m to i64
ret i64 %ms
}

define i64 @withadd(i16 %a, i16 %b, i64 %c) {
; CHECK-LABEL: 'withadd'
; CHECK: Cost Model: Found an estimated cost of 0 for instruction: %m = mul i32 %as, %bs
;
%as = sext i16 %a to i32
%bs = sext i16 %b to i32
%m = mul i32 %as, %bs
%ms = sext i32 %m to i64
%r = add i64 %c, %ms
ret i64 %r
}

define i64 @withloads(ptr %pa, ptr %pb, i64 %c) {
; CHECK-LABEL: 'withloads'
; CHECK: Cost Model: Found an estimated cost of 0 for instruction: %m = mul i32 %as, %bs
;
%a = load i16, ptr %pa
%b = load i16, ptr %pb
%as = sext i16 %a to i32
%bs = sext i16 %b to i32
%m = mul i32 %as, %bs
%ms = sext i32 %m to i64
%r = add i64 %c, %ms
ret i64 %r
}
Loading