[ROCDL] Added rocdl.fmed3 -> Intrinsic::amdgcn_fmed3 #157748
Conversation
Thank you for submitting a Pull Request (PR) to the LLVM Project! This PR will be automatically labeled and the relevant teams will be notified. If you wish, you can add reviewers by using the "Reviewers" section on this page. If this is not working for you, it is probably because you do not have write permissions for the repository, in which case you can instead tag reviewers by name in a comment by using "@" followed by their GitHub username. If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment "Ping". The common courtesy "ping" rate is once a week; please remember that you are asking for valuable time from other developers. If you have further questions, they may be answered by the LLVM GitHub User Guide. You can also ask questions in a comment on this PR, on the LLVM Discord, or on the forums.
@llvm/pr-subscribers-mlir @llvm/pr-subscribers-clang-codegen @llvm/pr-subscribers-backend-amdgpu @llvm/pr-subscribers-clang

Author: Keshav Vinayak Jha (keshavvinayak01)

Changes

Description: Added support for the AMDGPU signed and unsigned integer med3 intrinsics (smed3/umed3).

Implemented: the overloaded llvm.amdgcn.smed3 / llvm.amdgcn.umed3 intrinsics, the corresponding Clang builtins (__builtin_amdgcn_smed3/umed3 plus the 16-bit smed3h/umed3h variants), InstCombine constant folding and i16 width reduction, and SelectionDAG/GlobalISel handling.

Testing: new lit tests for the intrinsics under llvm/test/CodeGen/AMDGPU (llvm.amdgcn.smed3.ll, llvm.amdgcn.smed3.i16.ll, llvm.amdgcn.umed3.ll, llvm.amdgcn.umed3.i16.ll).
Addresses #157052. Patch is 27.92 KiB, truncated to 20.00 KiB below; full version: https://github.com/llvm/llvm-project/pull/157748.diff. 14 Files Affected:
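For orientation before the diff: med3 picks the median of three values, so med3(x, lo, hi) is the usual integer clamp when lo <= hi. A minimal usage sketch of the new builtins (hypothetical example, not part of the patch; assumes the patch is applied and the code is compiled for an amdgcn target):

// Hypothetical example: integer clamp via the new med3 builtins.
// smed3/umed3 return the median of their three operands, so
// med3(x, lo, hi) == min(max(x, lo), hi) whenever lo <= hi.
int clamp_signed(int X, int Lo, int Hi) {
  return __builtin_amdgcn_smed3(X, Lo, Hi);
}
unsigned clamp_unsigned(unsigned X, unsigned Lo, unsigned Hi) {
  return __builtin_amdgcn_umed3(X, Lo, Hi);
}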
diff --git a/clang/include/clang/Basic/BuiltinsAMDGPU.def b/clang/include/clang/Basic/BuiltinsAMDGPU.def
index fda16e42d2c6b4..4f6ab2a36cd853 100644
--- a/clang/include/clang/Basic/BuiltinsAMDGPU.def
+++ b/clang/include/clang/Basic/BuiltinsAMDGPU.def
@@ -139,6 +139,8 @@ BUILTIN(__builtin_amdgcn_ds_bpermute, "iii", "nc")
BUILTIN(__builtin_amdgcn_readfirstlane, "ii", "nc")
BUILTIN(__builtin_amdgcn_readlane, "iii", "nc")
BUILTIN(__builtin_amdgcn_fmed3f, "ffff", "nc")
+BUILTIN(__builtin_amdgcn_smed3, "iiii", "nc")
+BUILTIN(__builtin_amdgcn_umed3, "UiUiUiUi", "nc")
BUILTIN(__builtin_amdgcn_ds_faddf, "ff*3fIiIiIb", "n")
BUILTIN(__builtin_amdgcn_ds_fminf, "ff*3fIiIiIb", "n")
BUILTIN(__builtin_amdgcn_ds_fmaxf, "ff*3fIiIiIb", "n")
@@ -265,6 +267,8 @@ TARGET_BUILTIN(__builtin_amdgcn_perm, "UiUiUiUi", "nc", "gfx8-insts")
//===----------------------------------------------------------------------===//
TARGET_BUILTIN(__builtin_amdgcn_fmed3h, "hhhh", "nc", "gfx9-insts")
+TARGET_BUILTIN(__builtin_amdgcn_smed3h, "ssss", "nc", "gfx9-insts")
+TARGET_BUILTIN(__builtin_amdgcn_umed3h, "UsUsUsUs", "nc", "gfx9-insts")
TARGET_BUILTIN(__builtin_amdgcn_global_atomic_fadd_f64, "dd*1d", "t", "gfx90a-insts")
TARGET_BUILTIN(__builtin_amdgcn_global_atomic_fadd_f32, "ff*1f", "t", "atomic-fadd-rtn-insts")
diff --git a/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp b/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
index 07cf08c54985a1..5d4c980c7c63e3 100644
--- a/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
+++ b/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
@@ -606,6 +606,14 @@ Value *CodeGenFunction::EmitAMDGPUBuiltinExpr(unsigned BuiltinID,
case AMDGPU::BI__builtin_amdgcn_fmed3h:
return emitBuiltinWithOneOverloadedType<3>(*this, E,
Intrinsic::amdgcn_fmed3);
+ case AMDGPU::BI__builtin_amdgcn_smed3:
+ case AMDGPU::BI__builtin_amdgcn_smed3h:
+ return emitBuiltinWithOneOverloadedType<3>(*this, E,
+ Intrinsic::amdgcn_smed3);
+ case AMDGPU::BI__builtin_amdgcn_umed3:
+ case AMDGPU::BI__builtin_amdgcn_umed3h:
+ return emitBuiltinWithOneOverloadedType<3>(*this, E,
+ Intrinsic::amdgcn_umed3);
case AMDGPU::BI__builtin_amdgcn_ds_append:
case AMDGPU::BI__builtin_amdgcn_ds_consume: {
Intrinsic::ID Intrin = BuiltinID == AMDGPU::BI__builtin_amdgcn_ds_append ?
diff --git a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
index 030d01d7a5f3f0..52ba06ed4be25f 100644
--- a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
+++ b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
@@ -543,6 +543,18 @@ def int_amdgcn_fmed3 :
[IntrNoMem, IntrSpeculatable]
>;
+def int_amdgcn_smed3 :
+ DefaultAttrsIntrinsic<[llvm_anyint_ty],
+ [LLVMMatchType<0>, LLVMMatchType<0>, LLVMMatchType<0>],
+ [IntrNoMem, IntrSpeculatable]
+>;
+
+def int_amdgcn_umed3 :
+ DefaultAttrsIntrinsic<[llvm_anyint_ty],
+ [LLVMMatchType<0>, LLVMMatchType<0>, LLVMMatchType<0>],
+ [IntrNoMem, IntrSpeculatable]
+>;
+
def int_amdgcn_cubeid : ClangBuiltin<"__builtin_amdgcn_cubeid">,
DefaultAttrsIntrinsic<[llvm_float_ty],
[llvm_float_ty, llvm_float_ty, llvm_float_ty],
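A note on the definitions above: because int_amdgcn_smed3/umed3 take llvm_anyint_ty with all operands tied through LLVMMatchType<0>, a single overloaded type instantiates the result and all three sources (llvm.amdgcn.smed3.i32, .i16, and so on). A hedged C++ sketch of emitting such a call through IRBuilder (illustrative only; assumes this patch's intrinsic definitions are in tree):

#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/IntrinsicsAMDGPU.h"

using namespace llvm;

// Emit llvm.amdgcn.smed3.<ty> for the common integer type of the operands;
// the single overloaded type fixes both the result and the three sources.
static Value *emitSMed3(IRBuilder<> &B, Value *S0, Value *S1, Value *S2) {
  return B.CreateIntrinsic(Intrinsic::amdgcn_smed3, {S0->getType()},
                           {S0, S1, S2});
}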
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUGISel.td b/llvm/lib/Target/AMDGPU/AMDGPUGISel.td
index bb4bf742fb8611..486ec90edcaefb 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUGISel.td
+++ b/llvm/lib/Target/AMDGPU/AMDGPUGISel.td
@@ -256,8 +256,8 @@ def : GINodeEquiv<G_AMDGPU_CVT_F32_UBYTE2, AMDGPUcvt_f32_ubyte2>;
def : GINodeEquiv<G_AMDGPU_CVT_F32_UBYTE3, AMDGPUcvt_f32_ubyte3>;
def : GINodeEquiv<G_AMDGPU_CVT_PK_I16_I32, AMDGPUpk_i16_i32_impl>;
-def : GINodeEquiv<G_AMDGPU_SMED3, AMDGPUsmed3>;
-def : GINodeEquiv<G_AMDGPU_UMED3, AMDGPUumed3>;
+def : GINodeEquiv<G_AMDGPU_SMED3, AMDGPUsmed3_impl>;
+def : GINodeEquiv<G_AMDGPU_UMED3, AMDGPUumed3_impl>;
def : GINodeEquiv<G_AMDGPU_FMED3, AMDGPUfmed3_impl>;
def : GINodeEquiv<G_AMDGPU_CLAMP, AMDGPUclamp>;
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp b/llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp
index 4fe5d006794360..c6cb4736f95df4 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp
@@ -60,6 +60,26 @@ static APFloat fmed3AMDGCN(const APFloat &Src0, const APFloat &Src1,
return maxnum(Src0, Src1);
}
+// Constant fold llvm.amdgcn.smed3 intrinsics.
+static APInt smed3AMDGCN(const APInt &Src0, const APInt &Src1, const APInt &Src2) {
+ APInt Max3 = Src0.sgt(Src1) ? (Src0.sgt(Src2) ? Src0 : Src2)
+ : (Src1.sgt(Src2) ? Src1 : Src2);
+
+ if (Max3 == Src0) return Src1.sgt(Src2) ? Src1 : Src2;
+ if (Max3 == Src1) return Src0.sgt(Src2) ? Src0 : Src2;
+ return Src0.sgt(Src1) ? Src0 : Src1;
+}
+
+// Constant fold llvm.amdgcn.umed3 intrinsics.
+static APInt umed3AMDGCN(const APInt &Src0, const APInt &Src1, const APInt &Src2) {
+ APInt Max3 = Src0.ugt(Src1) ? (Src0.ugt(Src2) ? Src0 : Src2)
+ : (Src1.ugt(Src2) ? Src1 : Src2);
+
+ if (Max3 == Src0) return Src1.ugt(Src2) ? Src1 : Src2;
+ if (Max3 == Src1) return Src0.ugt(Src2) ? Src0 : Src2;
+ return Src0.ugt(Src1) ? Src0 : Src1;
+}
+
// Check if a value can be converted to a 16-bit value without losing
// precision.
// The value is expected to be either a float (IsFloat = true) or an unsigned
@@ -427,6 +447,36 @@ static Value *matchFPExtFromF16(Value *Arg) {
return nullptr;
}
+/// Match an sext from i16 to i32, or a constant we can convert.
+static Value *matchSExtFromI16(Value *Arg) {
+ Value *Src = nullptr;
+ ConstantInt *CInt = nullptr;
+ if (match(Arg, m_OneUse(m_SExt(m_Value(Src))))) {
+ if (Src->getType()->isIntegerTy(16))
+ return Src;
+ } else if (match(Arg, m_ConstantInt(CInt))) {
+ // Check that the constant is representable as a signed i16. Using
+ // getActiveBits() here would wrongly accept values such as 0x8000,
+ // which change value when truncated and then sign-extended back.
+ if (CInt->getValue().getSignificantBits() <= 16)
+ return ConstantInt::get(Type::getInt16Ty(Arg->getContext()),
+ CInt->getValue().trunc(16));
+ }
+ return nullptr;
+}
+
+/// Match a zext from i16 to i32, or a constant we can convert.
+static Value *matchZExtFromI16(Value *Arg) {
+ Value *Src = nullptr;
+ ConstantInt *CInt = nullptr;
+ if (match(Arg, m_OneUse(m_ZExt(m_Value(Src))))) {
+ if (Src->getType()->isIntegerTy(16))
+ return Src;
+ } else if (match(Arg, m_ConstantInt(CInt))) {
+ // Check if the constant fits in i16
+ if (CInt->getValue().getActiveBits() <= 16)
+ return ConstantInt::get(Type::getInt16Ty(Arg->getContext()), CInt->getValue().trunc(16));
+ }
+ return nullptr;
+}
+
// Trim all zero components from the end of the vector \p UseV and return
// an appropriate bitset with known elements.
static APInt trimTrailingZerosInVector(InstCombiner &IC, Value *UseV,
@@ -1174,6 +1224,128 @@ GCNTTIImpl::instCombineIntrinsic(InstCombiner &IC, IntrinsicInst &II) const {
break;
}
+ case Intrinsic::amdgcn_smed3: {
+ Value *Src0 = II.getArgOperand(0);
+ Value *Src1 = II.getArgOperand(1);
+ Value *Src2 = II.getArgOperand(2);
+
+ // Propagate poison values.
+ for (Value *Src : {Src0, Src1, Src2}) {
+ if (isa<PoisonValue>(Src))
+ return IC.replaceInstUsesWith(II, Src);
+ }
+
+ bool Swap = false;
+ // Canonicalize constants to RHS operands.
+ //
+ // smed3(c0, x, c1) -> smed3(x, c0, c1)
+ if (isa<Constant>(Src0) && !isa<Constant>(Src1)) {
+ std::swap(Src0, Src1);
+ Swap = true;
+ }
+
+ if (isa<Constant>(Src1) && !isa<Constant>(Src2)) {
+ std::swap(Src1, Src2);
+ Swap = true;
+ }
+
+ if (isa<Constant>(Src0) && !isa<Constant>(Src1)) {
+ std::swap(Src0, Src1);
+ Swap = true;
+ }
+
+ if (Swap) {
+ II.setArgOperand(0, Src0);
+ II.setArgOperand(1, Src1);
+ II.setArgOperand(2, Src2);
+ return &II;
+ }
+
+ // Constant fold smed3 with constant operands.
+ if (const ConstantInt *C0 = dyn_cast<ConstantInt>(Src0)) {
+ if (const ConstantInt *C1 = dyn_cast<ConstantInt>(Src1)) {
+ if (const ConstantInt *C2 = dyn_cast<ConstantInt>(Src2)) {
+ APInt Result = smed3AMDGCN(C0->getValue(), C1->getValue(), C2->getValue());
+ return IC.replaceInstUsesWith(II, ConstantInt::get(II.getType(), Result));
+ }
+ }
+ }
+
+ // Width reduction for integer extensions.
+ // smed3((sext X), (sext Y), (sext Z)) -> sext (smed3(X, Y, Z))
+ if (Value *X = matchSExtFromI16(Src0)) {
+ if (Value *Y = matchSExtFromI16(Src1)) {
+ if (Value *Z = matchSExtFromI16(Src2)) {
+ Value *NewCall = IC.Builder.CreateIntrinsic(
+ IID, {X->getType()}, {X, Y, Z}, &II, II.getName());
+ return new SExtInst(NewCall, II.getType());
+ }
+ }
+ }
+
+ break;
+ }
+ case Intrinsic::amdgcn_umed3: {
+ Value *Src0 = II.getArgOperand(0);
+ Value *Src1 = II.getArgOperand(1);
+ Value *Src2 = II.getArgOperand(2);
+
+ // Propagate poison values.
+ for (Value *Src : {Src0, Src1, Src2}) {
+ if (isa<PoisonValue>(Src))
+ return IC.replaceInstUsesWith(II, Src);
+ }
+
+ bool Swap = false;
+ // Canonicalize constants to RHS operands.
+ //
+ // umed3(c0, x, c1) -> umed3(x, c0, c1)
+ if (isa<Constant>(Src0) && !isa<Constant>(Src1)) {
+ std::swap(Src0, Src1);
+ Swap = true;
+ }
+
+ if (isa<Constant>(Src1) && !isa<Constant>(Src2)) {
+ std::swap(Src1, Src2);
+ Swap = true;
+ }
+
+ if (isa<Constant>(Src0) && !isa<Constant>(Src1)) {
+ std::swap(Src0, Src1);
+ Swap = true;
+ }
+
+ if (Swap) {
+ II.setArgOperand(0, Src0);
+ II.setArgOperand(1, Src1);
+ II.setArgOperand(2, Src2);
+ return &II;
+ }
+
+ // Constant fold umed3 with constant operands.
+ if (const ConstantInt *C0 = dyn_cast<ConstantInt>(Src0)) {
+ if (const ConstantInt *C1 = dyn_cast<ConstantInt>(Src1)) {
+ if (const ConstantInt *C2 = dyn_cast<ConstantInt>(Src2)) {
+ APInt Result = umed3AMDGCN(C0->getValue(), C1->getValue(), C2->getValue());
+ return IC.replaceInstUsesWith(II, ConstantInt::get(II.getType(), Result));
+ }
+ }
+ }
+
+ // Width reduction for integer extensions.
+ // umed3((zext X), (zext Y), (zext Z)) -> zext (umed3(X, Y, Z))
+ if (Value *X = matchZExtFromI16(Src0)) {
+ if (Value *Y = matchZExtFromI16(Src1)) {
+ if (Value *Z = matchZExtFromI16(Src2)) {
+ Value *NewCall = IC.Builder.CreateIntrinsic(
+ IID, {X->getType()}, {X, Y, Z}, &II, II.getName());
+ return new ZExtInst(NewCall, II.getType());
+ }
+ }
+ }
+
+ break;
+ }
case Intrinsic::amdgcn_icmp:
case Intrinsic::amdgcn_fcmp: {
const ConstantInt *CC = cast<ConstantInt>(II.getArgOperand(2));
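An aside on the constant folders added above: they select the median by first computing the maximum of the three values and then returning the larger of the two remaining ones. A self-contained sketch of the same selection logic on plain ints, with a few spot checks (illustrative only, not LLVM code):

#include <cassert>

// Median of three, mirroring smed3AMDGCN: find the maximum, then return
// the larger of the two values that remain.
static int med3(int A, int B, int C) {
  int Max3 = A > B ? (A > C ? A : C) : (B > C ? B : C);
  if (Max3 == A)
    return B > C ? B : C;
  if (Max3 == B)
    return A > C ? A : C;
  return A > B ? A : B;
}

int main() {
  assert(med3(1, 2, 3) == 2);
  assert(med3(3, 1, 2) == 2);
  assert(med3(-4, 7, 0) == 0);
  assert(med3(5, 5, 1) == 5); // with duplicates the median is the repeat
  return 0;
}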
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstrInfo.td b/llvm/lib/Target/AMDGPU/AMDGPUInstrInfo.td
index b8fa6f3fc68676..e9680e062cffaa 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstrInfo.td
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstrInfo.td
@@ -334,16 +334,13 @@ def AMDGPUmad_i24 : SDNode<"AMDGPUISD::MAD_I24", AMDGPUDTIntTernaryOp,
[]
>;
-def AMDGPUsmed3 : SDNode<"AMDGPUISD::SMED3", AMDGPUDTIntTernaryOp,
- []
->;
-
-def AMDGPUumed3 : SDNode<"AMDGPUISD::UMED3", AMDGPUDTIntTernaryOp,
- []
->;
def AMDGPUfmed3_impl : SDNode<"AMDGPUISD::FMED3", SDTFPTernaryOp, []>;
+def AMDGPUsmed3_impl : SDNode<"AMDGPUISD::SMED3", AMDGPUDTIntTernaryOp, []>;
+
+def AMDGPUumed3_impl : SDNode<"AMDGPUISD::UMED3", AMDGPUDTIntTernaryOp, []>;
+
def AMDGPUfdot2_impl : SDNode<"AMDGPUISD::FDOT2",
SDTypeProfile<1, 4, [SDTCisSameAs<0, 3>, SDTCisSameAs<1, 2>,
SDTCisFP<0>, SDTCisVec<1>,
@@ -448,6 +445,14 @@ def AMDGPUfmed3 : PatFrags<(ops node:$src0, node:$src1, node:$src2),
[(int_amdgcn_fmed3 node:$src0, node:$src1, node:$src2),
(AMDGPUfmed3_impl node:$src0, node:$src1, node:$src2)]>;
+def AMDGPUsmed3 : PatFrags<(ops node:$src0, node:$src1, node:$src2),
+ [(int_amdgcn_smed3 node:$src0, node:$src1, node:$src2),
+ (AMDGPUsmed3_impl node:$src0, node:$src1, node:$src2)]>;
+
+def AMDGPUumed3 : PatFrags<(ops node:$src0, node:$src1, node:$src2),
+ [(int_amdgcn_umed3 node:$src0, node:$src1, node:$src2),
+ (AMDGPUumed3_impl node:$src0, node:$src1, node:$src2)]>;
+
def AMDGPUdiv_fixup : PatFrags<(ops node:$src0, node:$src1, node:$src2),
[(int_amdgcn_div_fixup node:$src0, node:$src1, node:$src2),
(AMDGPUdiv_fixup_impl node:$src0, node:$src1, node:$src2)]>;
diff --git a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
index f18536cd4ab938..5da1e04c58bae4 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
@@ -7798,6 +7798,28 @@ bool AMDGPULegalizerInfo::legalizeIntrinsic(LegalizerHelper &Helper,
Observer.changedInstr(MI);
return true;
}
+ case Intrinsic::amdgcn_smed3: {
+ GISelChangeObserver &Observer = Helper.Observer;
+
+ // FIXME: This is to work around the inability of TableGen match combiners
+ // to match intrinsics in patterns.
+ Observer.changingInstr(MI);
+ MI.setDesc(B.getTII().get(AMDGPU::G_AMDGPU_SMED3));
+ MI.removeOperand(1);
+ Observer.changedInstr(MI);
+ return true;
+ }
+ case Intrinsic::amdgcn_umed3: {
+ GISelChangeObserver &Observer = Helper.Observer;
+
+ // FIXME: This is to work around the inability of TableGen match combiners
+ // to match intrinsics in patterns.
+ Observer.changingInstr(MI);
+ MI.setDesc(B.getTII().get(AMDGPU::G_AMDGPU_UMED3));
+ MI.removeOperand(1);
+ Observer.changedInstr(MI);
+ return true;
+ }
case Intrinsic::amdgcn_readlane:
case Intrinsic::amdgcn_writelane:
case Intrinsic::amdgcn_readfirstlane:
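For context on the two legalizer cases above: they rewrite the G_INTRINSIC instruction in place, where setDesc swaps in the target generic opcode and removeOperand(1) drops the intrinsic-ID operand, leaving (dst, src0, src1, src2). A hedged sketch of the same rewrite built as a fresh instruction instead (illustrative only; the in-place mutation used above avoids allocating a new instruction):

// Non-in-place alternative: rebuild as G_AMDGPU_SMED3 from the intrinsic's
// operands (0 = def, 1 = intrinsic ID, 2..4 = sources), then erase it.
B.buildInstr(AMDGPU::G_AMDGPU_SMED3, {MI.getOperand(0).getReg()},
             {MI.getOperand(2).getReg(), MI.getOperand(3).getReg(),
              MI.getOperand(4).getReg()});
MI.eraseFromParent();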
diff --git a/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
index 36b27bef350ed1..63141d065bf653 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
@@ -4136,6 +4136,7 @@ AMDGPURegisterBankInfo::getInstrMapping(const MachineInstr &MI) const {
case AMDGPU::G_AMDGPU_CVT_F32_UBYTE3:
case AMDGPU::G_AMDGPU_CVT_PK_I16_I32:
case AMDGPU::G_AMDGPU_SMED3:
+ case AMDGPU::G_AMDGPU_UMED3:
case AMDGPU::G_AMDGPU_FMED3:
return getDefaultMappingVOP(MI);
case AMDGPU::G_UMULH:
@@ -4660,6 +4661,8 @@ AMDGPURegisterBankInfo::getInstrMapping(const MachineInstr &MI) const {
case Intrinsic::amdgcn_cvt_scalef32_sr_pk16_bf6_bf16:
case Intrinsic::amdgcn_sat_pk4_i4_i8:
case Intrinsic::amdgcn_sat_pk4_u4_u8:
+ case Intrinsic::amdgcn_smed3:
+ case Intrinsic::amdgcn_umed3:
case Intrinsic::amdgcn_fmed3:
case Intrinsic::amdgcn_cubeid:
case Intrinsic::amdgcn_cubema:
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.smed3.i16.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.smed3.i16.ll
new file mode 100644
index 00000000000000..0f6f00309401c5
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.smed3.i16.ll
@@ -0,0 +1,27 @@
+; RUN: llc -mtriple=amdgcn -mcpu=gfx900 < %s | FileCheck -check-prefix=GCN %s
+
+; GCN-LABEL: {{^}}test_smed3_i16:
+; GCN: v_med3_i16 v{{[0-9]+}}, s{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}}
+define amdgpu_kernel void @test_smed3_i16(ptr addrspace(1) %out, i32 %src0.arg, i32 %src1.arg, i32 %src2.arg) #1 {
+ %src0.i16 = trunc i32 %src0.arg to i16
+ %src1.i16 = trunc i32 %src1.arg to i16
+ %src2.i16 = trunc i32 %src2.arg to i16
+ %med3 = call i16 @llvm.amdgcn.smed3.i16(i16 %src0.i16, i16 %src1.i16, i16 %src2.i16)
+ store i16 %med3, ptr addrspace(1) %out
+ ret void
+}
+
+; GCN-LABEL: {{^}}test_smed3_zero_i16:
+; GCN: v_med3_i16 v{{[0-9]+}}, s{{[0-9]+}}, v{{[0-9]+}}, 0
+define amdgpu_kernel void @test_smed3_zero_i16(ptr addrspace(1) %out, i32 %src0.arg, i32 %src1.arg) #1 {
+ %src0.i16 = trunc i32 %src0.arg to i16
+ %src1.i16 = trunc i32 %src1.arg to i16
+ %med3 = call i16 @llvm.amdgcn.smed3.i16(i16 %src0.i16, i16 %src1.i16, i16 0)
+ store i16 %med3, ptr addrspace(1) %out
+ ret void
+}
+
+declare i16 @llvm.amdgcn.smed3.i16(i16, i16, i16) #0
+
+attributes #0 = { nounwind readnone }
+attributes #1 = { nounwind }
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.smed3.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.smed3.ll
new file mode 100644
index 00000000000000..250fdc0d2d78d2
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.smed3.ll
@@ -0,0 +1,42 @@
+; RUN: llc -mtriple=amdgcn < %s | FileCheck -check-prefix=GCN %s
+; RUN: llc -mtriple=amdgcn -mcpu=tonga < %s | FileCheck -check-prefix=GCN %s
+
+; GCN-LABEL: {{^}}test_smed3:
+; GCN: v_med3_i32 v{{[0-9]+}}, s{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}}
+define amdgpu_kernel void @test_smed3(ptr addrspace(1) %out, i32 %src0, i32 %src1, i32 %src2) #1 {
+ %med3 = call i32 @llvm.amdgcn.smed3.i32(i32 %src0, i32 %src1, i32 %src2)
+ store i32 %med3, ptr addrspace(1) %out
+ ret void
+}
+
+; GCN-LABEL: {{^}}test_smed3_multi_use:
+; GCN: v_med3_i32 [[MED3:v[0-9]+]], s{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}}
+; GCN: v_mul_lo_i32 v{{[0-9]+}}, [[MED3]], s{{[0-9]+}}
+define amdgpu_kernel void @test_smed3_multi_use(ptr addrspace(1) %out, i32 %src0, i32 %src1, i32 %src2, i32 %mul.arg) #1 {
+ %med3 = call i32 @llvm.amdgcn.smed3.i32(i32 %src0, i32 %src1, i32 %src2)
+ %med3.user = mul i32 %med3, %mul.arg
+ store volatile i32 %med3.user, ptr addrspace(1) %out
+ store volatile i32 %med3, ptr addrspace(1) %out
+ ret void
+}
+
+; GCN-LABEL: {{^}}test_smed3_constants:
+; GCN: v_med3_i32 v{{[0-9]+}}, s{{[0-9]+}}, v{{[0-9]+}}, 42
+define amdgpu_kernel void @test_smed3_constants(ptr addrspace(1) %out, i32 %src0, i32 %src1) #1 {
+ %med3 = call i32 @llvm.amdgcn.smed3.i32(i32 %src0, i32 %src1, i32 42)
+ store i32 %med3, ptr addrspace(1) %out
+ ret void
+}
+
+; GCN-LABEL: {{^}}test_smed3_zero:
+; GCN: v_med3_i32 v{{[0-9]+}}, s{{[0-9]+}}, v{{[0-9]+}}, 0
+define amdgpu_kernel void @test_smed3_zero(ptr addrspace(1) %out, i32 %src0, i32 %src1) #1 {
+ %med3 = call i32 @llvm.amdgcn.smed3.i32(i32 %src0, i32 %src1, i32 0)
+ store i32 %med3, ptr addrspace(1) %out
+ ret void
+}
+
+declare i32 @llvm.amdgcn.smed3.i32(i32, i32, i32) #0
+
+attributes #0 = { nounwind readnone }
+attributes #1 = { nounwind }
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.umed3.i16.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.umed3.i16.ll
new file mode 100644
index 00000000000000..d484e8a4b08048
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.umed3.i16.ll
@@ -0,0 +1,27 @@
+; RUN: llc -mtriple=amdgcn -mcpu=gfx900 < %s | FileCheck -check-prefix=GCN %s
+
+; GCN-LABEL: {{^}}test_umed3_i16:
+; GCN: v_med3_u16 v{{[0-9]+}}, s{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}}
+define amdgpu_kernel void @test_umed3_i16(ptr addrspace(1) %out, i32 %src0.arg, i32 %src1.arg, i32 %src2.arg) #1 {
+ %src0.i16 = trunc i32 %src0.arg to i16
+ %src1.i16 = trunc i32 %src1.arg to i16
+ %src2.i16 = trunc i32 %src2.arg to i16
+ %med3 = call i16 @llvm.amdgcn.umed3.i16(i16 %src0.i16, i16 %src1.i16, i16 %src2.i16)
+ store i16 %med3, ptr addrspace(1) %out
+ ret void
+}
+
+; GCN-LABEL: {{^}}test_umed3_zero_i16:
+; GCN: v_med3_u16 v{{[0-9]+}}, s{{[0-9]+}}, v{{[0-9]+}}, 0
+define amdgpu_kernel void @test_umed3_zero_i16(ptr addrspace(1) %out, i32 %src0.arg, i32 %src1.arg) #1 {
+ %src0.i16 = trunc i32 %src0.arg to i16
+ %src1.i16 = trunc i32 %src1.arg to i16
+ %med3 = call i16 @llvm.amdgcn.umed3.i16(i16 %src0.i16, i16 %src1.i16, i16 0)
+ store i16 %med3, ptr addrspace(1) %out
+ ret void
+}
+
+declare i16 @llvm.amdgcn.umed3.i16(i16, i16, i16) #0
+
+attributes #0 = { nounwind readnone }
+attributes #1 = { nounwind }
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.umed3.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.umed3.ll
new file mode 100644
index 00000000000000..e1bec276d1fb6e
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.umed3.ll
@@ -0,0 +1,42 @@
+; RUN: llc -mtriple=amdgcn < %s | FileCheck -check-prefix=GCN %s
+; RUN: llc -mtriple=amdgcn -mcpu=tonga < %s | FileCheck -check-prefix=GCN %s
+
+; GCN-LABEL: {{^}}test_umed3:
+; GCN: v_med3_u32 v{{[0-9]+}}, s{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}}
+define amdgpu_kernel void @test_umed3(ptr addrspace(1) %out, i32 %src0, i32 %src1, i32 %src2) #1 {
+ %med3 = call i32 @llvm.amdgcn.umed3.i32(i32 %src0, i32 %src1, i32 %src2)
+ store i32 %med3, ptr addrspace(1) %out
+ ret void
+}
+
+; GCN-LABEL: {{...
[truncated]
+ %med3 = call i32 @llvm.amdgcn.umed3.i32(i32 %src0, i32 %src1, i32 %src2)
+ store i32 %med3, ptr addrspace(1) %out
+ ret void
+}
+
+; GCN-LABEL: {{...
[truncated]
@llvm/pr-subscribers-mlir-llvm
@llvm/pr-subscribers-llvm-ir
arsenm left a comment
It is not necessary to add new intrinsics for these operations. You are better off writing the med3 in terms of min and max and letting the backend deal with it. The effort of fully supporting all analyses and optimizations on a new operation is very high
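For reference, the rewrite suggested here follows from the identity med3(a, b, c) = max(min(a, b), min(max(a, b), c)). Below is a minimal LLVM IR sketch of the signed case; the function name is illustrative and this is only one way to spell it, not code from the patch:

; Sketch: smed3 expressed with the generic min/max intrinsics, using
; med3(a,b,c) = max(min(a,b), min(max(a,b), c)).
define i32 @smed3_via_minmax(i32 %a, i32 %b, i32 %c) {
  %min.ab = call i32 @llvm.smin.i32(i32 %a, i32 %b)       ; min(a, b)
  %max.ab = call i32 @llvm.smax.i32(i32 %a, i32 %b)       ; max(a, b)
  %clamp = call i32 @llvm.smin.i32(i32 %max.ab, i32 %c)   ; min(max(a, b), c)
  %med = call i32 @llvm.smax.i32(i32 %min.ab, i32 %clamp) ; the median
  ret i32 %med
}

declare i32 @llvm.smin.i32(i32, i32)
declare i32 @llvm.smax.i32(i32, i32)

The unsigned and 16-bit variants are the same shape with llvm.umin/llvm.umax and i16. The SelectionDAG path already folds the constant-clamp subset min(max(x, C0), C1) into med3 (performIntMed3ImmCombine); how reliably the fully variable pattern is matched is what the rest of this thread discusses.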
Value *Src0 = II.getArgOperand(0);
Value *Src1 = II.getArgOperand(1);
Value *Src2 = II.getArgOperand(2);
This code is untested and these intrinsics are not universally supported
@arsenm I see that we already have support for lowering fmed3 all the way down to the supported AMDGPU V_MED_F32 / V_MED_F16 ops, see here. Why can't we also add similar intrinsics for SMED and UMED when the hardware already supports those instructions?

We probably shouldn't have or use the fmed3 intrinsic. The fmed3 case is special due to the unreasonable signaling NaN behavior. The full work to optimize around that was also never implemented.

So you're suggesting I rewrite the rocdl -> llvmir lowering to use min/max ops in LLVM IR?

If the backend can reliably produce med3 out of min and max, then we shouldn't be adding intrinsics - let the compiler do its thing.

(and if the compiler can't reliably produce these, we may want to investigate if the backend can be made to do so)
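One way to run that investigation, sketched below with hypothetical RUN/CHECK lines in the style of the tests above: feed llc the min/max form and see whether v_med3_u32 comes out. The CHECK encodes the hoped-for result, not a verified one.

; Experiment, not a regression test: does the backend already select
; v_med3_u32 for the min/max form of umed3, without a dedicated intrinsic?
; RUN: llc -mtriple=amdgcn -mcpu=gfx900 < %s | FileCheck %s
; CHECK: v_med3_u32
define i32 @umed3_via_minmax(i32 %a, i32 %b, i32 %c) {
  %min.ab = call i32 @llvm.umin.i32(i32 %a, i32 %b)
  %max.ab = call i32 @llvm.umax.i32(i32 %a, i32 %b)
  %clamp = call i32 @llvm.umin.i32(i32 %max.ab, i32 %c)
  %med = call i32 @llvm.umax.i32(i32 %min.ab, i32 %clamp)
  ret i32 %med
}

declare i32 @llvm.umin.i32(i32, i32)
declare i32 @llvm.umax.i32(i32, i32)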
Signed-off-by: keshavvinayak01 <[email protected]>
Signed-off-by: keshavvinayak01 <[email protected]>
// MED3 operations
//===----------------------------------------------------------------------===//

def ROCDL_Med3F16Op : ROCDL_ConcreteNonMemIntrOp<"med3.f16", [Pure], 1>,
Strong reject of having two ops that differ only in type. Please make this one operation with a variadic type. There should be existing examples.
Description
Added support for AMDGPU signed and unsigned med3 intrinsics. Implemented smed3 and umed3 ISA instructions from ROCDL with complete end-to-end support including LLVM intrinsics, Clang builtins, AMDGPU backend instruction selection, and MLIR ROCDL dialect operations.
Testing
- llvm/test/CodeGen/AMDGPU/rocdl.med3.<dtype>
- ops in /test/Target/LLVMIR/rocdl.mlir
Addresses #157052