Commit e3fd8f8
AMDGPU: Correctly expand f64 sqrt intrinsic
rocm-device-libs and llpc were avoiding using f64 sqrt intrinsics in favor of
their own expansions. Port the expansion into the backend. Both of these users
should be updated to call the intrinsic instead.

The library and llpc expansions are slightly different. llpc uses an ldexp to
do the scale; the library uses a multiply. Use ldexp to do the scale instead
of the multiply. I believe v_ldexp_f64 and v_mul_f64 are always the same
number of cycles, but it's cheaper to materialize the 32-bit integer constant
than the 64-bit double constant.

The libraries have another fast version of sqrt which will be handled
separately. I am tempted to do this in an IR expansion instead. In the IR we
could take advantage of computeKnownFPClass to avoid the 0-or-inf argument
check.

1 parent 47b3ada, commit e3fd8f8

13 files changed, +5932 −881 lines

llvm/docs/AMDGPUUsage.rst

Lines changed: 5 additions & 0 deletions

@@ -965,6 +965,9 @@ The AMDGPU backend implements the following LLVM IR intrinsics.
 ========================================= ==========================================================
 LLVM Intrinsic                            Description
 ========================================= ==========================================================
+llvm.amdgcn.sqrt                          Provides direct access to v_sqrt_f64, v_sqrt_f32 and v_sqrt_f16
+                                          (on targets with half support). Performs sqrt function.
+
 llvm.amdgcn.log                           Provides direct access to v_log_f32 and v_log_f16
                                           (on targets with half support). Performs log2 function.

@@ -980,6 +983,8 @@ The AMDGPU backend implements the following LLVM IR intrinsics.
                                           inputs. Backend will optimize out denormal scaling if
                                           marked with the :ref:`afn <fastmath_afn>` flag.

+:ref:`llvm.sqrt <int_sqrt>`               Implemented for double, float and half (and vectors).
+
 :ref:`llvm.log <int_log>`                 Implemented for float and half (and vectors).

llvm/docs/ReleaseNotes.rst

Lines changed: 3 additions & 0 deletions

@@ -173,6 +173,9 @@ Changes to the AMDGPU Backend
 * Implemented new 2ulp IEEE lowering strategy for float
   reciprocal. This is used by default for OpenCL on gfx9+.

+* `llvm.sqrt.f64` is now lowered correctly. Use `llvm.amdgcn.sqrt.f64`
+  for raw instruction access.
+
 Changes to the ARM Backend
 --------------------------
llvm/include/llvm/CodeGen/GlobalISel/MachineIRBuilder.h

Lines changed: 7 additions & 0 deletions

@@ -1181,6 +1181,13 @@ class MachineIRBuilder {
                                  const SrcOp &Op0, const SrcOp &Op1,
                                  std::optional<unsigned> Flags = std::nullopt);

+  /// Build and insert a \p Res = G_IS_FPCLASS \p Src, \p Mask
+  MachineInstrBuilder buildIsFPClass(const DstOp &Res, const SrcOp &Src,
+                                     unsigned Mask) {
+    return buildInstr(TargetOpcode::G_IS_FPCLASS, {Res},
+                      {Src, SrcOp(static_cast<int64_t>(Mask))});
+  }
+
   /// Build and insert a \p Res = G_SELECT \p Tst, \p Op0, \p Op1
   ///
   /// \pre setBasicBlock or setMI must have been called.

llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp

Lines changed: 94 additions & 2 deletions

@@ -907,7 +907,12 @@ AMDGPULegalizerInfo::AMDGPULegalizerInfo(const GCNSubtarget &ST_,
     .clampScalar(0, S16, S64);

   if (ST.has16BitInsts()) {
-    getActionDefinitionsBuilder({G_FSQRT, G_FFLOOR})
+    getActionDefinitionsBuilder(G_FSQRT)
+      .legalFor({S32, S16})
+      .customFor({S64})
+      .scalarize(0)
+      .clampScalar(0, S16, S64);
+    getActionDefinitionsBuilder(G_FFLOOR)
       .legalFor({S32, S64, S16})
       .scalarize(0)
       .clampScalar(0, S16, S64);

@@ -925,7 +930,8 @@ AMDGPULegalizerInfo::AMDGPULegalizerInfo(const GCNSubtarget &ST_,
       .lower();
   } else {
     getActionDefinitionsBuilder(G_FSQRT)
-      .legalFor({S32, S64})
+      .legalFor({S32})
+      .customFor({S64})
       .scalarize(0)
       .clampScalar(0, S32, S64);

@@ -1996,6 +2002,8 @@ bool AMDGPULegalizerInfo::legalizeCustom(LegalizerHelper &Helper,
     return legalizeFDIV(MI, MRI, B);
   case TargetOpcode::G_FFREXP:
     return legalizeFFREXP(MI, MRI, B);
+  case TargetOpcode::G_FSQRT:
+    return legalizeFSQRT(MI, MRI, B);
   case TargetOpcode::G_UDIV:
   case TargetOpcode::G_UREM:
   case TargetOpcode::G_UDIVREM:

@@ -4829,6 +4837,90 @@ bool AMDGPULegalizerInfo::legalizeFDIVFastIntrin(MachineInstr &MI,
   return true;
 }

+bool AMDGPULegalizerInfo::legalizeFSQRT(MachineInstr &MI,
+                                        MachineRegisterInfo &MRI,
+                                        MachineIRBuilder &B) const {
+  // For double type, the SQRT and RSQ instructions don't have required
+  // precision, we apply Goldschmidt's algorithm to improve the result:
+  //
+  //   y0 = rsq(x)
+  //   g0 = x * y0
+  //   h0 = 0.5 * y0
+  //
+  //   r0 = 0.5 - h0 * g0
+  //   g1 = g0 * r0 + g0
+  //   h1 = h0 * r0 + h0
+  //
+  //   r1 = 0.5 - h1 * g1 => d0 = x - g1 * g1
+  //   g2 = g1 * r1 + g1     g2 = d0 * h1 + g1
+  //   h2 = h1 * r1 + h1
+  //
+  //   r2 = 0.5 - h2 * g2 => d1 = x - g2 * g2
+  //   g3 = g2 * r2 + g2     g3 = d1 * h1 + g2
+  //
+  //   sqrt(x) = g3
+
+  const LLT S1 = LLT::scalar(1);
+  const LLT S32 = LLT::scalar(32);
+  const LLT F64 = LLT::scalar(64);
+
+  Register Dst = MI.getOperand(0).getReg();
+  assert(MRI.getType(Dst) == F64 && "only expect to lower f64 sqrt");
+
+  Register X = MI.getOperand(1).getReg();
+  unsigned Flags = MI.getFlags();
+
+  auto ScaleConstant = B.buildFConstant(F64, 0x1.0p-767);
+
+  auto ZeroInt = B.buildConstant(S32, 0);
+  auto Scaling = B.buildFCmp(FCmpInst::FCMP_OLT, S1, X, ScaleConstant);
+
+  // Scale up input if it is too small.
+  auto ScaleUpFactor = B.buildConstant(S32, 256);
+  auto ScaleUp = B.buildSelect(S32, Scaling, ScaleUpFactor, ZeroInt);
+  auto SqrtX = B.buildFLdexp(F64, X, ScaleUp, Flags);
+
+  auto SqrtY = B.buildIntrinsic(Intrinsic::amdgcn_rsq, {F64}, false)
+                   .addReg(SqrtX.getReg(0));
+
+  auto Half = B.buildFConstant(F64, 0.5);
+  auto SqrtH0 = B.buildFMul(F64, SqrtY, Half);
+  auto SqrtS0 = B.buildFMul(F64, SqrtX, SqrtY);
+
+  auto NegSqrtH0 = B.buildFNeg(F64, SqrtH0);
+  auto SqrtR0 = B.buildFMA(F64, NegSqrtH0, SqrtS0, Half);
+
+  auto SqrtS1 = B.buildFMA(F64, SqrtS0, SqrtR0, SqrtS0);
+  auto SqrtH1 = B.buildFMA(F64, SqrtH0, SqrtR0, SqrtH0);
+
+  auto NegSqrtS1 = B.buildFNeg(F64, SqrtS1);
+  auto SqrtD0 = B.buildFMA(F64, NegSqrtS1, SqrtS1, SqrtX);
+
+  auto SqrtS2 = B.buildFMA(F64, SqrtD0, SqrtH1, SqrtS1);
+
+  auto NegSqrtS2 = B.buildFNeg(F64, SqrtS2);
+  auto SqrtD1 = B.buildFMA(F64, NegSqrtS2, SqrtS2, SqrtX);
+
+  auto SqrtRet = B.buildFMA(F64, SqrtD1, SqrtH1, SqrtS2);
+
+  // Scale down the result.
+  auto ScaleDownFactor = B.buildConstant(S32, -128);
+  auto ScaleDown = B.buildSelect(S32, Scaling, ScaleDownFactor, ZeroInt);
+  SqrtRet = B.buildFLdexp(F64, SqrtRet, ScaleDown, Flags);
+
+  // TODO: Switch to fcmp oeq 0 for finite only. Can't fully remove this check
+  // with finite only or nsz because rsq(+/-0) = +/-inf
+
+  // TODO: Check for DAZ and expand to subnormals
+  auto IsZeroOrInf = B.buildIsFPClass(LLT::scalar(1), SqrtX, fcZero | fcPosInf);
+
+  // If x is +INF, +0, or -0, use its original value
+  B.buildSelect(Dst, IsZeroOrInf, SqrtX, SqrtRet, Flags);
+
+  MI.eraseFromParent();
+  return true;
+}
+
 // Expand llvm.amdgcn.rsq.clamp on targets that don't support the instruction.
 // FIXME: Why do we handle this one but not other removed instructions?
 //

llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.h

Lines changed: 3 additions & 0 deletions

@@ -157,6 +157,9 @@ class AMDGPULegalizerInfo final : public LegalizerInfo {
   bool legalizeFDIVFastIntrin(MachineInstr &MI, MachineRegisterInfo &MRI,
                               MachineIRBuilder &B) const;

+  bool legalizeFSQRT(MachineInstr &MI, MachineRegisterInfo &MRI,
+                     MachineIRBuilder &B) const;
+
   bool legalizeRsqClampIntrinsic(MachineInstr &MI, MachineRegisterInfo &MRI,
                                  MachineIRBuilder &B) const;

llvm/lib/Target/AMDGPU/SIISelLowering.cpp

Lines changed: 87 additions & 1 deletion

@@ -219,6 +219,8 @@ SITargetLowering::SITargetLowering(const TargetMachine &TM,
   setOperationAction(ISD::SELECT, MVT::f64, Promote);
   AddPromotedToType(ISD::SELECT, MVT::f64, MVT::i64);

+  setOperationAction(ISD::FSQRT, MVT::f64, Custom);
+
   setOperationAction(ISD::SELECT_CC,
                      {MVT::f32, MVT::i32, MVT::i64, MVT::f64, MVT::i1}, Expand);

@@ -4924,7 +4926,10 @@ SDValue SITargetLowering::LowerOperation(SDValue Op, SelectionDAG &DAG) const {
            "Load should return a value and a chain");
     return Result;
   }
-
+  case ISD::FSQRT:
+    if (Op.getValueType() == MVT::f64)
+      return lowerFSQRTF64(Op, DAG);
+    return SDValue();
   case ISD::FSIN:
   case ISD::FCOS:
     return LowerTrig(Op, DAG);

@@ -9749,6 +9754,87 @@ SDValue SITargetLowering::LowerSTORE(SDValue Op, SelectionDAG &DAG) const {
   return SDValue();
 }

+SDValue SITargetLowering::lowerFSQRTF64(SDValue Op, SelectionDAG &DAG) const {
+  // For double type, the SQRT and RSQ instructions don't have required
+  // precision, we apply Goldschmidt's algorithm to improve the result:
+  //
+  //   y0 = rsq(x)
+  //   g0 = x * y0
+  //   h0 = 0.5 * y0
+  //
+  //   r0 = 0.5 - h0 * g0
+  //   g1 = g0 * r0 + g0
+  //   h1 = h0 * r0 + h0
+  //
+  //   r1 = 0.5 - h1 * g1 => d0 = x - g1 * g1
+  //   g2 = g1 * r1 + g1     g2 = d0 * h1 + g1
+  //   h2 = h1 * r1 + h1
+  //
+  //   r2 = 0.5 - h2 * g2 => d1 = x - g2 * g2
+  //   g3 = g2 * r2 + g2     g3 = d1 * h1 + g2
+  //
+  //   sqrt(x) = g3
+
+  SDNodeFlags Flags = Op->getFlags();
+
+  SDLoc DL(Op);
+
+  SDValue X = Op.getOperand(0);
+  SDValue ScaleConstant = DAG.getConstantFP(0x1.0p-767, DL, MVT::f64);
+
+  SDValue Scaling = DAG.getSetCC(DL, MVT::i1, X, ScaleConstant, ISD::SETOLT);
+
+  SDValue ZeroInt = DAG.getConstant(0, DL, MVT::i32);
+
+  // Scale up input if it is too small.
+  SDValue ScaleUpFactor = DAG.getConstant(256, DL, MVT::i32);
+  SDValue ScaleUp =
+      DAG.getNode(ISD::SELECT, DL, MVT::i32, Scaling, ScaleUpFactor, ZeroInt);
+  SDValue SqrtX = DAG.getNode(ISD::FLDEXP, DL, MVT::f64, X, ScaleUp, Flags);
+
+  SDValue SqrtY = DAG.getNode(AMDGPUISD::RSQ, DL, MVT::f64, SqrtX);
+
+  SDValue SqrtS0 = DAG.getNode(ISD::FMUL, DL, MVT::f64, SqrtX, SqrtY);
+
+  SDValue Half = DAG.getConstantFP(0.5, DL, MVT::f64);
+  SDValue SqrtH0 = DAG.getNode(ISD::FMUL, DL, MVT::f64, SqrtY, Half);
+
+  SDValue NegSqrtH0 = DAG.getNode(ISD::FNEG, DL, MVT::f64, SqrtH0);
+  SDValue SqrtR0 = DAG.getNode(ISD::FMA, DL, MVT::f64, NegSqrtH0, SqrtS0, Half);
+
+  SDValue SqrtH1 = DAG.getNode(ISD::FMA, DL, MVT::f64, SqrtH0, SqrtR0, SqrtH0);
+
+  SDValue SqrtS1 = DAG.getNode(ISD::FMA, DL, MVT::f64, SqrtS0, SqrtR0, SqrtS0);
+
+  SDValue NegSqrtS1 = DAG.getNode(ISD::FNEG, DL, MVT::f64, SqrtS1);
+  SDValue SqrtD0 = DAG.getNode(ISD::FMA, DL, MVT::f64, NegSqrtS1, SqrtS1, SqrtX);
+
+  SDValue SqrtS2 = DAG.getNode(ISD::FMA, DL, MVT::f64, SqrtD0, SqrtH1, SqrtS1);
+
+  SDValue NegSqrtS2 = DAG.getNode(ISD::FNEG, DL, MVT::f64, SqrtS2);
+  SDValue SqrtD1 =
+      DAG.getNode(ISD::FMA, DL, MVT::f64, NegSqrtS2, SqrtS2, SqrtX);
+
+  SDValue SqrtRet = DAG.getNode(ISD::FMA, DL, MVT::f64, SqrtD1, SqrtH1, SqrtS2);
+
+  SDValue ScaleDownFactor = DAG.getConstant(-128, DL, MVT::i32);
+  SDValue ScaleDown =
+      DAG.getNode(ISD::SELECT, DL, MVT::i32, Scaling, ScaleDownFactor, ZeroInt);
+  SqrtRet = DAG.getNode(ISD::FLDEXP, DL, MVT::f64, SqrtRet, ScaleDown, Flags);
+
+  // TODO: Switch to fcmp oeq 0 for finite only. Can't fully remove this check
+  // with finite only or nsz because rsq(+/-0) = +/-inf
+
+  // TODO: Check for DAZ and expand to subnormals
+  SDValue IsZeroOrInf =
+      DAG.getNode(ISD::IS_FPCLASS, DL, MVT::i1, SqrtX,
+                  DAG.getTargetConstant(fcZero | fcPosInf, DL, MVT::i32));
+
+  // If x is +INF, +0, or -0, use its original value
+  return DAG.getNode(ISD::SELECT, DL, MVT::f64, IsZeroOrInf, SqrtX, SqrtRet,
+                     Flags);
+}
+
 SDValue SITargetLowering::LowerTrig(SDValue Op, SelectionDAG &DAG) const {
   SDLoc DL(Op);
   EVT VT = Op.getValueType();

llvm/lib/Target/AMDGPU/SIISelLowering.h

Lines changed: 1 addition & 0 deletions

@@ -109,6 +109,7 @@ class SITargetLowering final : public AMDGPUTargetLowering {
   SDValue LowerFFREXP(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerSTORE(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerTrig(SDValue Op, SelectionDAG &DAG) const;
+  SDValue lowerFSQRTF64(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerATOMIC_CMP_SWAP(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerBRCOND(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerRETURNADDR(SDValue Op, SelectionDAG &DAG) const;

llvm/lib/Target/AMDGPU/VOP1Instructions.td

Lines changed: 1 addition & 1 deletion

@@ -332,7 +332,7 @@ defm V_SQRT_F32 : VOP1Inst <"v_sqrt_f32", VOP_F32_F32, any_amdgcn_sqrt>;
 let TRANS = 1, SchedRW = [WriteTrans64] in {
 defm V_RCP_F64 : VOP1Inst <"v_rcp_f64", VOP_F64_F64, AMDGPUrcp>;
 defm V_RSQ_F64 : VOP1Inst <"v_rsq_f64", VOP_F64_F64, AMDGPUrsq>;
-defm V_SQRT_F64 : VOP1Inst <"v_sqrt_f64", VOP_F64_F64, any_amdgcn_sqrt>;
+defm V_SQRT_F64 : VOP1Inst <"v_sqrt_f64", VOP_F64_F64, int_amdgcn_sqrt>;
 } // End TRANS = 1, SchedRW = [WriteTrans64]

 let TRANS = 1, SchedRW = [WriteTrans32] in {

llvm/test/Analysis/CostModel/AMDGPU/arith-fp.ll

Lines changed: 8 additions & 8 deletions

@@ -52,21 +52,21 @@ define i32 @fsqrt(i32 %arg) {
 ; ALL-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %V4F32 = call <4 x float> @llvm.sqrt.v4f32(<4 x float> undef)
 ; ALL-NEXT: Cost Model: Found an estimated cost of 8 for instruction: %V8F32 = call <8 x float> @llvm.sqrt.v8f32(<8 x float> undef)
 ; ALL-NEXT: Cost Model: Found an estimated cost of 16 for instruction: %V16F32 = call <16 x float> @llvm.sqrt.v16f32(<16 x float> undef)
-; ALL-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %F64 = call double @llvm.sqrt.f64(double undef)
-; ALL-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %V2F64 = call <2 x double> @llvm.sqrt.v2f64(<2 x double> undef)
-; ALL-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %V4F64 = call <4 x double> @llvm.sqrt.v4f64(<4 x double> undef)
-; ALL-NEXT: Cost Model: Found an estimated cost of 8 for instruction: %V8F64 = call <8 x double> @llvm.sqrt.v8f64(<8 x double> undef)
+; ALL-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %F64 = call double @llvm.sqrt.f64(double undef)
+; ALL-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %V2F64 = call <2 x double> @llvm.sqrt.v2f64(<2 x double> undef)
+; ALL-NEXT: Cost Model: Found an estimated cost of 8 for instruction: %V4F64 = call <4 x double> @llvm.sqrt.v4f64(<4 x double> undef)
+; ALL-NEXT: Cost Model: Found an estimated cost of 16 for instruction: %V8F64 = call <8 x double> @llvm.sqrt.v8f64(<8 x double> undef)
 ; ALL-NEXT: Cost Model: Found an estimated cost of 10 for instruction: ret i32 undef
 ;
 ; ALL-SIZE-LABEL: 'fsqrt'
 ; ALL-SIZE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %F32 = call float @llvm.sqrt.f32(float undef)
 ; ALL-SIZE-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %V4F32 = call <4 x float> @llvm.sqrt.v4f32(<4 x float> undef)
 ; ALL-SIZE-NEXT: Cost Model: Found an estimated cost of 8 for instruction: %V8F32 = call <8 x float> @llvm.sqrt.v8f32(<8 x float> undef)
 ; ALL-SIZE-NEXT: Cost Model: Found an estimated cost of 16 for instruction: %V16F32 = call <16 x float> @llvm.sqrt.v16f32(<16 x float> undef)
-; ALL-SIZE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %F64 = call double @llvm.sqrt.f64(double undef)
-; ALL-SIZE-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %V2F64 = call <2 x double> @llvm.sqrt.v2f64(<2 x double> undef)
-; ALL-SIZE-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %V4F64 = call <4 x double> @llvm.sqrt.v4f64(<4 x double> undef)
-; ALL-SIZE-NEXT: Cost Model: Found an estimated cost of 8 for instruction: %V8F64 = call <8 x double> @llvm.sqrt.v8f64(<8 x double> undef)
+; ALL-SIZE-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %F64 = call double @llvm.sqrt.f64(double undef)
+; ALL-SIZE-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %V2F64 = call <2 x double> @llvm.sqrt.v2f64(<2 x double> undef)
+; ALL-SIZE-NEXT: Cost Model: Found an estimated cost of 8 for instruction: %V4F64 = call <4 x double> @llvm.sqrt.v4f64(<4 x double> undef)
+; ALL-SIZE-NEXT: Cost Model: Found an estimated cost of 16 for instruction: %V8F64 = call <8 x double> @llvm.sqrt.v8f64(<8 x double> undef)
 ; ALL-SIZE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: ret i32 undef
 ;
 %F32 = call float @llvm.sqrt.f32(float undef)
