
Conversation

@keshavvinayak01
Contributor

@keshavvinayak01 keshavvinayak01 commented Sep 9, 2025

Description

Added support for the AMDGPU signed and unsigned median-of-three (med3) intrinsics. Implemented the smed3 and umed3 ISA instructions, exposed through ROCDL, with complete end-to-end support: LLVM intrinsics, Clang builtins, AMDGPU backend instruction selection, and MLIR ROCDL dialect operations.

Testing

  • 4 Lit test files in llvm/test/CodeGen/AMDGPU/
  • ROCDL -> LLVMIR lit tests for new rocdl.med3.<dtype> ops in /test/Target/LLVMIR/rocdl.mlir

Addresses #157052
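
As background, v_med3 returns the median of its three operands, which also makes med3(x, lo, hi) a single-instruction clamp whenever lo <= hi. A minimal host-side sketch of that semantics in plain C++ (the helper names here are hypothetical; the patch itself exposes the operation as the `__builtin_amdgcn_smed3` / `__builtin_amdgcn_umed3` builtins, which only compile for an amdgcn target):

```cpp
#include <algorithm>

// Reference semantics of v_med3_i32 / v_med3_u32: the median of three values,
// expressed with the classic min/max identity
//   med3(a, b, c) = max(min(a, b), min(max(a, b), c)).
// Hypothetical helder names for illustration only.
int smed3_ref(int a, int b, int c) {
  return std::max(std::min(a, b), std::min(std::max(a, b), c));
}

unsigned umed3_ref(unsigned a, unsigned b, unsigned c) {
  return std::max(std::min(a, b), std::min(std::max(a, b), c));
}
```

Note in particular the clamp idiom: smed3_ref(x, lo, hi) with lo <= hi returns x clamped into [lo, hi], which is one motivation for exposing these instructions directly.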

@github-actions

github-actions bot commented Sep 9, 2025

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository. In which case you can instead tag reviewers by name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment "Ping". The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@keshavvinayak01 keshavvinayak01 changed the title Added Intrinsics for smed, umed, to support ISA instructions from ROCDL [AMDGPU] [ROCDL] Added Intrinsics for smed, umed, to support ISA instructions from ROCDL Sep 10, 2025
@keshavvinayak01 keshavvinayak01 marked this pull request as ready for review September 11, 2025 13:48
@llvmbot llvmbot added clang Clang issues not falling into any other category backend:AMDGPU clang:frontend Language frontend issues, e.g. anything involving "Sema" clang:codegen IR generation bugs: mangling, exceptions, etc. mlir:llvm mlir llvm:ir labels Sep 11, 2025
@llvmbot
Member

llvmbot commented Sep 11, 2025

@llvm/pr-subscribers-mlir

@llvm/pr-subscribers-clang-codegen

Author: Keshav Vinayak Jha (keshavvinayak01)

Changes

Description

Added support for the AMDGPU signed and unsigned median-of-three (med3) intrinsics. Implemented the smed3 and umed3 ISA instructions, exposed through ROCDL, with complete end-to-end support: LLVM intrinsics, Clang builtins, AMDGPU backend instruction selection, and MLIR ROCDL dialect operations.

Testing

  • 4 Lit test files in llvm/test/CodeGen/AMDGPU/
  • ROCDL -> LLVMIR lit tests for new rocdl.med3.<dtype> ops in /test/Target/LLVMIR/rocdl.mlir

Addresses #157052


Patch is 27.92 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/157748.diff

14 Files Affected:

  • (modified) clang/include/clang/Basic/BuiltinsAMDGPU.def (+4)
  • (modified) clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp (+8)
  • (modified) llvm/include/llvm/IR/IntrinsicsAMDGPU.td (+12)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUGISel.td (+2-2)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp (+172)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUInstrInfo.td (+12-7)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp (+22)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp (+3)
  • (added) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.smed3.i16.ll (+27)
  • (added) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.smed3.ll (+42)
  • (added) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.umed3.i16.ll (+27)
  • (added) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.umed3.ll (+42)
  • (modified) mlir/include/mlir/Dialect/LLVMIR/ROCDLOps.td (+88)
  • (modified) mlir/test/Target/LLVMIR/rocdl.mlir (+42)
diff --git a/clang/include/clang/Basic/BuiltinsAMDGPU.def b/clang/include/clang/Basic/BuiltinsAMDGPU.def
index fda16e42d2c6b4..4f6ab2a36cd853 100644
--- a/clang/include/clang/Basic/BuiltinsAMDGPU.def
+++ b/clang/include/clang/Basic/BuiltinsAMDGPU.def
@@ -139,6 +139,8 @@ BUILTIN(__builtin_amdgcn_ds_bpermute, "iii", "nc")
 BUILTIN(__builtin_amdgcn_readfirstlane, "ii", "nc")
 BUILTIN(__builtin_amdgcn_readlane, "iii", "nc")
 BUILTIN(__builtin_amdgcn_fmed3f, "ffff", "nc")
+BUILTIN(__builtin_amdgcn_smed3, "iiii", "nc")
+BUILTIN(__builtin_amdgcn_umed3, "UiUiUiUi", "nc")
 BUILTIN(__builtin_amdgcn_ds_faddf, "ff*3fIiIiIb", "n")
 BUILTIN(__builtin_amdgcn_ds_fminf, "ff*3fIiIiIb", "n")
 BUILTIN(__builtin_amdgcn_ds_fmaxf, "ff*3fIiIiIb", "n")
@@ -265,6 +267,8 @@ TARGET_BUILTIN(__builtin_amdgcn_perm, "UiUiUiUi", "nc", "gfx8-insts")
 //===----------------------------------------------------------------------===//
 
 TARGET_BUILTIN(__builtin_amdgcn_fmed3h, "hhhh", "nc", "gfx9-insts")
+TARGET_BUILTIN(__builtin_amdgcn_smed3h, "ssss", "nc", "gfx9-insts")
+TARGET_BUILTIN(__builtin_amdgcn_umed3h, "UsUsUsUs", "nc", "gfx9-insts")
 
 TARGET_BUILTIN(__builtin_amdgcn_global_atomic_fadd_f64, "dd*1d", "t", "gfx90a-insts")
 TARGET_BUILTIN(__builtin_amdgcn_global_atomic_fadd_f32, "ff*1f", "t", "atomic-fadd-rtn-insts")
diff --git a/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp b/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
index 07cf08c54985a1..5d4c980c7c63e3 100644
--- a/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
+++ b/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
@@ -606,6 +606,14 @@ Value *CodeGenFunction::EmitAMDGPUBuiltinExpr(unsigned BuiltinID,
   case AMDGPU::BI__builtin_amdgcn_fmed3h:
     return emitBuiltinWithOneOverloadedType<3>(*this, E,
                                                Intrinsic::amdgcn_fmed3);
+  case AMDGPU::BI__builtin_amdgcn_smed3:
+  case AMDGPU::BI__builtin_amdgcn_smed3h:
+    return emitBuiltinWithOneOverloadedType<3>(*this, E,
+                                               Intrinsic::amdgcn_smed3);
+  case AMDGPU::BI__builtin_amdgcn_umed3:
+  case AMDGPU::BI__builtin_amdgcn_umed3h:
+    return emitBuiltinWithOneOverloadedType<3>(*this, E,
+                                               Intrinsic::amdgcn_umed3);
   case AMDGPU::BI__builtin_amdgcn_ds_append:
   case AMDGPU::BI__builtin_amdgcn_ds_consume: {
     Intrinsic::ID Intrin = BuiltinID == AMDGPU::BI__builtin_amdgcn_ds_append ?
diff --git a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
index 030d01d7a5f3f0..52ba06ed4be25f 100644
--- a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
+++ b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
@@ -543,6 +543,18 @@ def int_amdgcn_fmed3 :
     [IntrNoMem, IntrSpeculatable]
 >;
 
+def int_amdgcn_smed3 :
+  DefaultAttrsIntrinsic<[llvm_anyint_ty],
+    [LLVMMatchType<0>, LLVMMatchType<0>, LLVMMatchType<0>],
+    [IntrNoMem, IntrSpeculatable]
+>;
+
+def int_amdgcn_umed3 :
+  DefaultAttrsIntrinsic<[llvm_anyint_ty],
+    [LLVMMatchType<0>, LLVMMatchType<0>, LLVMMatchType<0>],
+    [IntrNoMem, IntrSpeculatable]
+>;
+
 def int_amdgcn_cubeid : ClangBuiltin<"__builtin_amdgcn_cubeid">,
   DefaultAttrsIntrinsic<[llvm_float_ty],
     [llvm_float_ty, llvm_float_ty, llvm_float_ty],
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUGISel.td b/llvm/lib/Target/AMDGPU/AMDGPUGISel.td
index bb4bf742fb8611..486ec90edcaefb 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUGISel.td
+++ b/llvm/lib/Target/AMDGPU/AMDGPUGISel.td
@@ -256,8 +256,8 @@ def : GINodeEquiv<G_AMDGPU_CVT_F32_UBYTE2, AMDGPUcvt_f32_ubyte2>;
 def : GINodeEquiv<G_AMDGPU_CVT_F32_UBYTE3, AMDGPUcvt_f32_ubyte3>;
 
 def : GINodeEquiv<G_AMDGPU_CVT_PK_I16_I32, AMDGPUpk_i16_i32_impl>;
-def : GINodeEquiv<G_AMDGPU_SMED3, AMDGPUsmed3>;
-def : GINodeEquiv<G_AMDGPU_UMED3, AMDGPUumed3>;
+def : GINodeEquiv<G_AMDGPU_SMED3, AMDGPUsmed3_impl>;
+def : GINodeEquiv<G_AMDGPU_UMED3, AMDGPUumed3_impl>;
 def : GINodeEquiv<G_AMDGPU_FMED3, AMDGPUfmed3_impl>;
 def : GINodeEquiv<G_AMDGPU_CLAMP, AMDGPUclamp>;
 
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp b/llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp
index 4fe5d006794360..c6cb4736f95df4 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp
@@ -60,6 +60,26 @@ static APFloat fmed3AMDGCN(const APFloat &Src0, const APFloat &Src1,
   return maxnum(Src0, Src1);
 }
 
+// Constant fold llvm.amdgcn.smed3 intrinsics for standard inputs.
+static APInt smed3AMDGCN(const APInt &Src0, const APInt &Src1, const APInt &Src2) {
+  APInt Max3 = Src0.sgt(Src1) ? (Src0.sgt(Src2) ? Src0 : Src2)
+                              : (Src1.sgt(Src2) ? Src1 : Src2);
+
+  if (Max3 == Src0) return Src1.sgt(Src2) ? Src1 : Src2;
+  if (Max3 == Src1) return Src0.sgt(Src2) ? Src0 : Src2;
+  return Src0.sgt(Src1) ? Src0 : Src1;
+}
+
+// Constant fold llvm.amdgcn.umed3 intrinsics for standard inputs.
+static APInt umed3AMDGCN(const APInt &Src0, const APInt &Src1, const APInt &Src2) {
+  APInt Max3 = Src0.ugt(Src1) ? (Src0.ugt(Src2) ? Src0 : Src2)
+                              : (Src1.ugt(Src2) ? Src1 : Src2);
+
+  if (Max3 == Src0) return Src1.ugt(Src2) ? Src1 : Src2;
+  if (Max3 == Src1) return Src0.ugt(Src2) ? Src0 : Src2;
+  return Src0.ugt(Src1) ? Src0 : Src1;
+}
+
 // Check if a value can be converted to a 16-bit value without losing
 // precision.
 // The value is expected to be either a float (IsFloat = true) or an unsigned
@@ -427,6 +447,36 @@ static Value *matchFPExtFromF16(Value *Arg) {
   return nullptr;
 }
 
+/// Match an sext from i16 to i32, or a constant we can convert.
+static Value *matchSExtFromI16(Value *Arg) {
+  Value *Src = nullptr;
+  ConstantInt *CInt = nullptr;
+  if (match(Arg, m_OneUse(m_SExt(m_Value(Src))))) {
+    if (Src->getType()->isIntegerTy(16))
+      return Src;
+  } else if (match(Arg, m_ConstantInt(CInt))) {
+    // Check if the constant fits in i16
+    if (CInt->getValue().getActiveBits() <= 16)
+      return ConstantInt::get(Type::getInt16Ty(Arg->getContext()), CInt->getValue().trunc(16));
+  }
+  return nullptr;
+}
+
+/// Match a zext from i16 to i32, or a constant we can convert.
+static Value *matchZExtFromI16(Value *Arg) {
+  Value *Src = nullptr;
+  ConstantInt *CInt = nullptr;
+  if (match(Arg, m_OneUse(m_ZExt(m_Value(Src))))) {
+    if (Src->getType()->isIntegerTy(16))
+      return Src;
+  } else if (match(Arg, m_ConstantInt(CInt))) {
+    // Check if the constant fits in i16
+    if (CInt->getValue().getActiveBits() <= 16)
+      return ConstantInt::get(Type::getInt16Ty(Arg->getContext()), CInt->getValue().trunc(16));
+  }
+  return nullptr;
+}
+
 // Trim all zero components from the end of the vector \p UseV and return
 // an appropriate bitset with known elements.
 static APInt trimTrailingZerosInVector(InstCombiner &IC, Value *UseV,
@@ -1174,6 +1224,128 @@ GCNTTIImpl::instCombineIntrinsic(InstCombiner &IC, IntrinsicInst &II) const {
 
     break;
   }
+  case Intrinsic::amdgcn_smed3: {
+    Value *Src0 = II.getArgOperand(0);
+    Value *Src1 = II.getArgOperand(1);
+    Value *Src2 = II.getArgOperand(2);
+
+    // Propagate poison values.
+    for (Value *Src : {Src0, Src1, Src2}) {
+      if (isa<PoisonValue>(Src))
+        return IC.replaceInstUsesWith(II, Src);
+    }
+
+    bool Swap = false;
+    // Canonicalize constants to RHS operands.
+    //
+    // smed3(c0, x, c1) -> smed3(x, c0, c1)
+    if (isa<Constant>(Src0) && !isa<Constant>(Src1)) {
+      std::swap(Src0, Src1);
+      Swap = true;
+    }
+
+    if (isa<Constant>(Src1) && !isa<Constant>(Src2)) {
+      std::swap(Src1, Src2);
+      Swap = true;
+    }
+
+    if (isa<Constant>(Src0) && !isa<Constant>(Src1)) {
+      std::swap(Src0, Src1);
+      Swap = true;
+    }
+
+    if (Swap) {
+      II.setArgOperand(0, Src0);
+      II.setArgOperand(1, Src1);
+      II.setArgOperand(2, Src2);
+      return &II;
+    }
+
+    // Constant fold smed3 with constant operands.
+    if (const ConstantInt *C0 = dyn_cast<ConstantInt>(Src0)) {
+      if (const ConstantInt *C1 = dyn_cast<ConstantInt>(Src1)) {
+        if (const ConstantInt *C2 = dyn_cast<ConstantInt>(Src2)) {
+          APInt Result = smed3AMDGCN(C0->getValue(), C1->getValue(), C2->getValue());
+          return IC.replaceInstUsesWith(II, ConstantInt::get(II.getType(), Result));
+        }
+      }
+    }
+
+    // Width reduction for integer extensions.
+    // smed3((sext X), (sext Y), (sext Z)) -> sext (smed3(X, Y, Z))
+    if (Value *X = matchSExtFromI16(Src0)) {
+      if (Value *Y = matchSExtFromI16(Src1)) {
+        if (Value *Z = matchSExtFromI16(Src2)) {
+          Value *NewCall = IC.Builder.CreateIntrinsic(
+              IID, {X->getType()}, {X, Y, Z}, &II, II.getName());
+          return new SExtInst(NewCall, II.getType());
+        }
+      }
+    }
+
+    break;
+  }
+  case Intrinsic::amdgcn_umed3: {
+    Value *Src0 = II.getArgOperand(0);
+    Value *Src1 = II.getArgOperand(1);
+    Value *Src2 = II.getArgOperand(2);
+
+    // Propagate poison values.
+    for (Value *Src : {Src0, Src1, Src2}) {
+      if (isa<PoisonValue>(Src))
+        return IC.replaceInstUsesWith(II, Src);
+    }
+
+    bool Swap = false;
+    // Canonicalize constants to RHS operands.
+    //
+    // umed3(c0, x, c1) -> umed3(x, c0, c1)
+    if (isa<Constant>(Src0) && !isa<Constant>(Src1)) {
+      std::swap(Src0, Src1);
+      Swap = true;
+    }
+
+    if (isa<Constant>(Src1) && !isa<Constant>(Src2)) {
+      std::swap(Src1, Src2);
+      Swap = true;
+    }
+
+    if (isa<Constant>(Src0) && !isa<Constant>(Src1)) {
+      std::swap(Src0, Src1);
+      Swap = true;
+    }
+
+    if (Swap) {
+      II.setArgOperand(0, Src0);
+      II.setArgOperand(1, Src1);
+      II.setArgOperand(2, Src2);
+      return &II;
+    }
+
+    // Constant fold umed3 with constant operands.
+    if (const ConstantInt *C0 = dyn_cast<ConstantInt>(Src0)) {
+      if (const ConstantInt *C1 = dyn_cast<ConstantInt>(Src1)) {
+        if (const ConstantInt *C2 = dyn_cast<ConstantInt>(Src2)) {
+          APInt Result = umed3AMDGCN(C0->getValue(), C1->getValue(), C2->getValue());
+          return IC.replaceInstUsesWith(II, ConstantInt::get(II.getType(), Result));
+        }
+      }
+    }
+
+    // Width reduction for integer extensions.
+    // umed3((zext X), (zext Y), (zext Z)) -> zext (umed3(X, Y, Z))
+    if (Value *X = matchZExtFromI16(Src0)) {
+      if (Value *Y = matchZExtFromI16(Src1)) {
+        if (Value *Z = matchZExtFromI16(Src2)) {
+          Value *NewCall = IC.Builder.CreateIntrinsic(
+              IID, {X->getType()}, {X, Y, Z}, &II, II.getName());
+          return new ZExtInst(NewCall, II.getType());
+        }
+      }
+    }
+
+    break;
+  }
   case Intrinsic::amdgcn_icmp:
   case Intrinsic::amdgcn_fcmp: {
     const ConstantInt *CC = cast<ConstantInt>(II.getArgOperand(2));
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstrInfo.td b/llvm/lib/Target/AMDGPU/AMDGPUInstrInfo.td
index b8fa6f3fc68676..e9680e062cffaa 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstrInfo.td
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstrInfo.td
@@ -334,16 +334,13 @@ def AMDGPUmad_i24 : SDNode<"AMDGPUISD::MAD_I24", AMDGPUDTIntTernaryOp,
   []
 >;
 
-def AMDGPUsmed3 : SDNode<"AMDGPUISD::SMED3", AMDGPUDTIntTernaryOp,
-  []
->;
-
-def AMDGPUumed3 : SDNode<"AMDGPUISD::UMED3", AMDGPUDTIntTernaryOp,
-  []
->;
 
 def AMDGPUfmed3_impl : SDNode<"AMDGPUISD::FMED3", SDTFPTernaryOp, []>;
 
+def AMDGPUsmed3_impl : SDNode<"AMDGPUISD::SMED3", AMDGPUDTIntTernaryOp, []>;
+
+def AMDGPUumed3_impl : SDNode<"AMDGPUISD::UMED3", AMDGPUDTIntTernaryOp, []>;
+
 def AMDGPUfdot2_impl : SDNode<"AMDGPUISD::FDOT2",
                   SDTypeProfile<1, 4, [SDTCisSameAs<0, 3>, SDTCisSameAs<1, 2>,
                                        SDTCisFP<0>, SDTCisVec<1>,
@@ -448,6 +445,14 @@ def AMDGPUfmed3 : PatFrags<(ops node:$src0, node:$src1, node:$src2),
   [(int_amdgcn_fmed3 node:$src0, node:$src1, node:$src2),
    (AMDGPUfmed3_impl node:$src0, node:$src1, node:$src2)]>;
 
+def AMDGPUsmed3 : PatFrags<(ops node:$src0, node:$src1, node:$src2),
+  [(int_amdgcn_smed3 node:$src0, node:$src1, node:$src2),
+   (AMDGPUsmed3_impl node:$src0, node:$src1, node:$src2)]>;
+
+def AMDGPUumed3 : PatFrags<(ops node:$src0, node:$src1, node:$src2),
+  [(int_amdgcn_umed3 node:$src0, node:$src1, node:$src2),
+   (AMDGPUumed3_impl node:$src0, node:$src1, node:$src2)]>;
+
 def AMDGPUdiv_fixup : PatFrags<(ops node:$src0, node:$src1, node:$src2),
   [(int_amdgcn_div_fixup node:$src0, node:$src1, node:$src2),
    (AMDGPUdiv_fixup_impl node:$src0, node:$src1, node:$src2)]>;
diff --git a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
index f18536cd4ab938..5da1e04c58bae4 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
@@ -7798,6 +7798,28 @@ bool AMDGPULegalizerInfo::legalizeIntrinsic(LegalizerHelper &Helper,
     Observer.changedInstr(MI);
     return true;
   }
+  case Intrinsic::amdgcn_smed3: {
+    GISelChangeObserver &Observer = Helper.Observer;
+
+    // FIXME: This is to workaround the inability of tablegen match combiners to
+    // match intrinsics in patterns.
+    Observer.changingInstr(MI);
+    MI.setDesc(B.getTII().get(AMDGPU::G_AMDGPU_SMED3));
+    MI.removeOperand(1);
+    Observer.changedInstr(MI);
+    return true;
+  }
+  case Intrinsic::amdgcn_umed3: {
+    GISelChangeObserver &Observer = Helper.Observer;
+
+    // FIXME: This is to workaround the inability of tablegen match combiners to
+    // match intrinsics in patterns.
+    Observer.changingInstr(MI);
+    MI.setDesc(B.getTII().get(AMDGPU::G_AMDGPU_UMED3));
+    MI.removeOperand(1);
+    Observer.changedInstr(MI);
+    return true;
+  }
   case Intrinsic::amdgcn_readlane:
   case Intrinsic::amdgcn_writelane:
   case Intrinsic::amdgcn_readfirstlane:
diff --git a/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
index 36b27bef350ed1..63141d065bf653 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
@@ -4136,6 +4136,7 @@ AMDGPURegisterBankInfo::getInstrMapping(const MachineInstr &MI) const {
   case AMDGPU::G_AMDGPU_CVT_F32_UBYTE3:
   case AMDGPU::G_AMDGPU_CVT_PK_I16_I32:
   case AMDGPU::G_AMDGPU_SMED3:
+  case AMDGPU::G_AMDGPU_UMED3:
   case AMDGPU::G_AMDGPU_FMED3:
     return getDefaultMappingVOP(MI);
   case AMDGPU::G_UMULH:
@@ -4660,6 +4661,8 @@ AMDGPURegisterBankInfo::getInstrMapping(const MachineInstr &MI) const {
     case Intrinsic::amdgcn_cvt_scalef32_sr_pk16_bf6_bf16:
     case Intrinsic::amdgcn_sat_pk4_i4_i8:
     case Intrinsic::amdgcn_sat_pk4_u4_u8:
+    case Intrinsic::amdgcn_smed3:
+    case Intrinsic::amdgcn_umed3:
     case Intrinsic::amdgcn_fmed3:
     case Intrinsic::amdgcn_cubeid:
     case Intrinsic::amdgcn_cubema:
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.smed3.i16.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.smed3.i16.ll
new file mode 100644
index 00000000000000..0f6f00309401c5
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.smed3.i16.ll
@@ -0,0 +1,27 @@
+; RUN: llc -mtriple=amdgcn -mcpu=gfx900 < %s | FileCheck -check-prefix=GCN %s
+
+; GCN-LABEL: {{^}}test_smed3_i16:
+; GCN: v_med3_i16 v{{[0-9]+}}, s{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}}
+define amdgpu_kernel void @test_smed3_i16(ptr addrspace(1) %out, i32 %src0.arg, i32 %src1.arg, i32 %src2.arg) #1 {
+  %src0.i16 = trunc i32 %src0.arg to i16
+  %src1.i16 = trunc i32 %src1.arg to i16
+  %src2.i16 = trunc i32 %src2.arg to i16
+  %med3 = call i16 @llvm.amdgcn.smed3.i16(i16 %src0.i16, i16 %src1.i16, i16 %src2.i16)
+  store i16 %med3, ptr addrspace(1) %out
+  ret void
+}
+
+; GCN-LABEL: {{^}}test_smed3_zero_i16:
+; GCN: v_med3_i16 v{{[0-9]+}}, s{{[0-9]+}}, v{{[0-9]+}}, 0
+define amdgpu_kernel void @test_smed3_zero_i16(ptr addrspace(1) %out, i32 %src0.arg, i32 %src1.arg) #1 {
+  %src0.i16 = trunc i32 %src0.arg to i16
+  %src1.i16 = trunc i32 %src1.arg to i16
+  %med3 = call i16 @llvm.amdgcn.smed3.i16(i16 %src0.i16, i16 %src1.i16, i16 0)
+  store i16 %med3, ptr addrspace(1) %out
+  ret void
+}
+
+declare i16 @llvm.amdgcn.smed3.i16(i16, i16, i16) #0
+
+attributes #0 = { nounwind readnone }
+attributes #1 = { nounwind }
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.smed3.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.smed3.ll
new file mode 100644
index 00000000000000..250fdc0d2d78d2
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.smed3.ll
@@ -0,0 +1,42 @@
+; RUN: llc -mtriple=amdgcn < %s | FileCheck -check-prefix=GCN %s
+; RUN: llc -mtriple=amdgcn -mcpu=tonga < %s | FileCheck -check-prefix=GCN %s
+
+; GCN-LABEL: {{^}}test_smed3:
+; GCN: v_med3_i32 v{{[0-9]+}}, s{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}}
+define amdgpu_kernel void @test_smed3(ptr addrspace(1) %out, i32 %src0, i32 %src1, i32 %src2) #1 {
+  %med3 = call i32 @llvm.amdgcn.smed3.i32(i32 %src0, i32 %src1, i32 %src2)
+  store i32 %med3, ptr addrspace(1) %out
+  ret void
+}
+
+; GCN-LABEL: {{^}}test_smed3_multi_use:
+; GCN: v_med3_i32 [[MED3:v[0-9]+]], s{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}}
+; GCN: v_mul_lo_i32 v{{[0-9]+}}, [[MED3]], s{{[0-9]+}}
+define amdgpu_kernel void @test_smed3_multi_use(ptr addrspace(1) %out, i32 %src0, i32 %src1, i32 %src2, i32 %mul.arg) #1 {
+  %med3 = call i32 @llvm.amdgcn.smed3.i32(i32 %src0, i32 %src1, i32 %src2)
+  %med3.user = mul i32 %med3, %mul.arg
+  store volatile i32 %med3.user, ptr addrspace(1) %out
+  store volatile i32 %med3, ptr addrspace(1) %out
+  ret void
+}
+
+; GCN-LABEL: {{^}}test_smed3_constants:
+; GCN: v_med3_i32 v{{[0-9]+}}, s{{[0-9]+}}, v{{[0-9]+}}, 42
+define amdgpu_kernel void @test_smed3_constants(ptr addrspace(1) %out, i32 %src0, i32 %src1) #1 {
+  %med3 = call i32 @llvm.amdgcn.smed3.i32(i32 %src0, i32 %src1, i32 42)
+  store i32 %med3, ptr addrspace(1) %out
+  ret void
+}
+
+; GCN-LABEL: {{^}}test_smed3_zero:
+; GCN: v_med3_i32 v{{[0-9]+}}, s{{[0-9]+}}, v{{[0-9]+}}, 0
+define amdgpu_kernel void @test_smed3_zero(ptr addrspace(1) %out, i32 %src0, i32 %src1) #1 {
+  %med3 = call i32 @llvm.amdgcn.smed3.i32(i32 %src0, i32 %src1, i32 0)
+  store i32 %med3, ptr addrspace(1) %out
+  ret void
+}
+
+declare i32 @llvm.amdgcn.smed3.i32(i32, i32, i32) #0
+
+attributes #0 = { nounwind readnone }
+attributes #1 = { nounwind }
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.umed3.i16.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.umed3.i16.ll
new file mode 100644
index 00000000000000..d484e8a4b08048
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.umed3.i16.ll
@@ -0,0 +1,27 @@
+; RUN: llc -mtriple=amdgcn -mcpu=gfx900 < %s | FileCheck -check-prefix=GCN %s
+
+; GCN-LABEL: {{^}}test_umed3_i16:
+; GCN: v_med3_u16 v{{[0-9]+}}, s{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}}
+define amdgpu_kernel void @test_umed3_i16(ptr addrspace(1) %out, i32 %src0.arg, i32 %src1.arg, i32 %src2.arg) #1 {
+  %src0.i16 = trunc i32 %src0.arg to i16
+  %src1.i16 = trunc i32 %src1.arg to i16
+  %src2.i16 = trunc i32 %src2.arg to i16
+  %med3 = call i16 @llvm.amdgcn.umed3.i16(i16 %src0.i16, i16 %src1.i16, i16 %src2.i16)
+  store i16 %med3, ptr addrspace(1) %out
+  ret void
+}
+
+; GCN-LABEL: {{^}}test_umed3_zero_i16:
+; GCN: v_med3_u16 v{{[0-9]+}}, s{{[0-9]+}}, v{{[0-9]+}}, 0
+define amdgpu_kernel void @test_umed3_zero_i16(ptr addrspace(1) %out, i32 %src0.arg, i32 %src1.arg) #1 {
+  %src0.i16 = trunc i32 %src0.arg to i16
+  %src1.i16 = trunc i32 %src1.arg to i16
+  %med3 = call i16 @llvm.amdgcn.umed3.i16(i16 %src0.i16, i16 %src1.i16, i16 0)
+  store i16 %med3, ptr addrspace(1) %out
+  ret void
+}
+
+declare i16 @llvm.amdgcn.umed3.i16(i16, i16, i16) #0
+
+attributes #0 = { nounwind readnone }
+attributes #1 = { nounwind }
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.umed3.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.umed3.ll
new file mode 100644
index 00000000000000..e1bec276d1fb6e
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.umed3.ll
@@ -0,0 +1,42 @@
+; RUN: llc -mtriple=amdgcn < %s | FileCheck -check-prefix=GCN %s
+; RUN: llc -mtriple=amdgcn -mcpu=tonga < %s | FileCheck -check-prefix=GCN %s
+
+; GCN-LABEL: {{^}}test_umed3:
+; GCN: v_med3_u32 v{{[0-9]+}}, s{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}}
+define amdgpu_kernel void @test_umed3(ptr addrspace(1) %out, i32 %src0, i32 %src1, i32 %src2) #1 {
+  %med3 = call i32 @llvm.amdgcn.umed3.i32(i32 %src0, i32 %src1, i32 %src2)
+  store i32 %med3, ptr addrspace(1) %out
+  ret void
+}
+
+; GCN-LABEL: {{...
[truncated]

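The smed3AMDGCN/umed3AMDGCN constant folders in the patch above select the median by first finding the maximum of the three operands and then taking the maximum of the remaining two. A standalone sketch of that same selection logic on plain `int` (hypothetical name `med3_by_max`; the real helpers operate on `APInt` with `sgt`/`ugt` comparisons):

```cpp
// Median-of-three via the same "find the max of all three, then take the
// max of the other two" selection used by the patch's smed3AMDGCN fold.
int med3_by_max(int s0, int s1, int s2) {
  // Maximum of the three operands.
  int max3 = s0 > s1 ? (s0 > s2 ? s0 : s2) : (s1 > s2 ? s1 : s2);
  // Whichever operand is the maximum, the median is the larger of the rest.
  if (max3 == s0)
    return s1 > s2 ? s1 : s2;
  if (max3 == s1)
    return s0 > s2 ? s0 : s2;
  return s0 > s1 ? s0 : s1;
}
```

When operands compare equal the `max3 == s0` test can match a tied operand, but the result is still the correct median since the tied values are interchangeable.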
@llvmbot
Member

llvmbot commented Sep 11, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Keshav Vinayak Jha (keshavvinayak01)

Changes

Description

Added support for AMDGPU signed (med3) intrinsics. Implemented smed3 and umed3 ISA instructions from ROCDL with complete end-to-end support including LLVM intrinsics, Clang builtins, AMDGPU backend instruction selection, and MLIR ROCDL dialect operations.

Testing

  • 4 Lit test files in llvm/test/CodeGen/AMDGPU/
  • ROCDL -> LLVMIR lit tests for new rocdl.med3.&lt;dtype&gt; ops in /test/Target/LLVMIR/rocdl.mlir

Addresses #157052


Patch is 27.89 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/157748.diff

14 Files Affected:

  • (modified) clang/include/clang/Basic/BuiltinsAMDGPU.def (+4)
  • (modified) clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp (+8)
  • (modified) llvm/include/llvm/IR/IntrinsicsAMDGPU.td (+12)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUGISel.td (+2-2)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp (+172)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUInstrInfo.td (+12-7)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp (+22)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp (+3)
  • (added) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.smed3.i16.ll (+27)
  • (added) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.smed3.ll (+42)
  • (added) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.umed3.i16.ll (+27)
  • (added) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.umed3.ll (+42)
  • (modified) mlir/include/mlir/Dialect/LLVMIR/ROCDLOps.td (+88)
  • (modified) mlir/test/Target/LLVMIR/rocdl.mlir (+42)
diff --git a/clang/include/clang/Basic/BuiltinsAMDGPU.def b/clang/include/clang/Basic/BuiltinsAMDGPU.def
index fda16e42d2c6b..4f6ab2a36cd85 100644
--- a/clang/include/clang/Basic/BuiltinsAMDGPU.def
+++ b/clang/include/clang/Basic/BuiltinsAMDGPU.def
@@ -139,6 +139,8 @@ BUILTIN(__builtin_amdgcn_ds_bpermute, "iii", "nc")
 BUILTIN(__builtin_amdgcn_readfirstlane, "ii", "nc")
 BUILTIN(__builtin_amdgcn_readlane, "iii", "nc")
 BUILTIN(__builtin_amdgcn_fmed3f, "ffff", "nc")
+BUILTIN(__builtin_amdgcn_smed3, "iiii", "nc")
+BUILTIN(__builtin_amdgcn_umed3, "UiUiUiUi", "nc")
 BUILTIN(__builtin_amdgcn_ds_faddf, "ff*3fIiIiIb", "n")
 BUILTIN(__builtin_amdgcn_ds_fminf, "ff*3fIiIiIb", "n")
 BUILTIN(__builtin_amdgcn_ds_fmaxf, "ff*3fIiIiIb", "n")
@@ -265,6 +267,8 @@ TARGET_BUILTIN(__builtin_amdgcn_perm, "UiUiUiUi", "nc", "gfx8-insts")
 //===----------------------------------------------------------------------===//
 
 TARGET_BUILTIN(__builtin_amdgcn_fmed3h, "hhhh", "nc", "gfx9-insts")
+TARGET_BUILTIN(__builtin_amdgcn_smed3h, "ssss", "nc", "gfx9-insts")
+TARGET_BUILTIN(__builtin_amdgcn_umed3h, "UsUsUsUs", "nc", "gfx9-insts")
 
 TARGET_BUILTIN(__builtin_amdgcn_global_atomic_fadd_f64, "dd*1d", "t", "gfx90a-insts")
 TARGET_BUILTIN(__builtin_amdgcn_global_atomic_fadd_f32, "ff*1f", "t", "atomic-fadd-rtn-insts")
diff --git a/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp b/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
index 07cf08c54985a..5d4c980c7c63e 100644
--- a/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
+++ b/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
@@ -606,6 +606,14 @@ Value *CodeGenFunction::EmitAMDGPUBuiltinExpr(unsigned BuiltinID,
   case AMDGPU::BI__builtin_amdgcn_fmed3h:
     return emitBuiltinWithOneOverloadedType<3>(*this, E,
                                                Intrinsic::amdgcn_fmed3);
+  case AMDGPU::BI__builtin_amdgcn_smed3:
+  case AMDGPU::BI__builtin_amdgcn_smed3h:
+    return emitBuiltinWithOneOverloadedType<3>(*this, E,
+                                               Intrinsic::amdgcn_smed3);
+  case AMDGPU::BI__builtin_amdgcn_umed3:
+  case AMDGPU::BI__builtin_amdgcn_umed3h:
+    return emitBuiltinWithOneOverloadedType<3>(*this, E,
+                                               Intrinsic::amdgcn_umed3);
   case AMDGPU::BI__builtin_amdgcn_ds_append:
   case AMDGPU::BI__builtin_amdgcn_ds_consume: {
     Intrinsic::ID Intrin = BuiltinID == AMDGPU::BI__builtin_amdgcn_ds_append ?
diff --git a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
index 030d01d7a5f3f..52ba06ed4be25 100644
--- a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
+++ b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
@@ -543,6 +543,18 @@ def int_amdgcn_fmed3 :
     [IntrNoMem, IntrSpeculatable]
 >;
 
+def int_amdgcn_smed3 :
+  DefaultAttrsIntrinsic<[llvm_anyint_ty],
+    [LLVMMatchType<0>, LLVMMatchType<0>, LLVMMatchType<0>],
+    [IntrNoMem, IntrSpeculatable]
+>;
+
+def int_amdgcn_umed3 :
+  DefaultAttrsIntrinsic<[llvm_anyint_ty],
+    [LLVMMatchType<0>, LLVMMatchType<0>, LLVMMatchType<0>],
+    [IntrNoMem, IntrSpeculatable]
+>;
+
 def int_amdgcn_cubeid : ClangBuiltin<"__builtin_amdgcn_cubeid">,
   DefaultAttrsIntrinsic<[llvm_float_ty],
     [llvm_float_ty, llvm_float_ty, llvm_float_ty],
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUGISel.td b/llvm/lib/Target/AMDGPU/AMDGPUGISel.td
index bb4bf742fb861..486ec90edcaef 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUGISel.td
+++ b/llvm/lib/Target/AMDGPU/AMDGPUGISel.td
@@ -256,8 +256,8 @@ def : GINodeEquiv<G_AMDGPU_CVT_F32_UBYTE2, AMDGPUcvt_f32_ubyte2>;
 def : GINodeEquiv<G_AMDGPU_CVT_F32_UBYTE3, AMDGPUcvt_f32_ubyte3>;
 
 def : GINodeEquiv<G_AMDGPU_CVT_PK_I16_I32, AMDGPUpk_i16_i32_impl>;
-def : GINodeEquiv<G_AMDGPU_SMED3, AMDGPUsmed3>;
-def : GINodeEquiv<G_AMDGPU_UMED3, AMDGPUumed3>;
+def : GINodeEquiv<G_AMDGPU_SMED3, AMDGPUsmed3_impl>;
+def : GINodeEquiv<G_AMDGPU_UMED3, AMDGPUumed3_impl>;
 def : GINodeEquiv<G_AMDGPU_FMED3, AMDGPUfmed3_impl>;
 def : GINodeEquiv<G_AMDGPU_CLAMP, AMDGPUclamp>;
 
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp b/llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp
index 4fe5d00679436..c6cb4736f95df 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp
@@ -60,6 +60,26 @@ static APFloat fmed3AMDGCN(const APFloat &Src0, const APFloat &Src1,
   return maxnum(Src0, Src1);
 }
 
+// Constant fold llvm.amdgcn.smed3 intrinsics for standard inputs.
+static APInt smed3AMDGCN(const APInt &Src0, const APInt &Src1, const APInt &Src2) {
+  APInt Max3 = Src0.sgt(Src1) ? (Src0.sgt(Src2) ? Src0 : Src2) 
+                              : (Src1.sgt(Src2) ? Src1 : Src2);
+  
+  if (Max3 == Src0) return Src1.sgt(Src2) ? Src1 : Src2;
+  if (Max3 == Src1) return Src0.sgt(Src2) ? Src0 : Src2;
+  return Src0.sgt(Src1) ? Src0 : Src1;
+}
+
+// Constant fold llvm.amdgcn.umed3 intrinsics for standard inputs.
+static APInt umed3AMDGCN(const APInt &Src0, const APInt &Src1, const APInt &Src2) {
+  APInt Max3 = Src0.ugt(Src1) ? (Src0.ugt(Src2) ? Src0 : Src2) 
+                              : (Src1.ugt(Src2) ? Src1 : Src2);
+  
+  if (Max3 == Src0) return Src1.ugt(Src2) ? Src1 : Src2;
+  if (Max3 == Src1) return Src0.ugt(Src2) ? Src0 : Src2;
+  return Src0.ugt(Src1) ? Src0 : Src1;
+}
+
 // Check if a value can be converted to a 16-bit value without losing
 // precision.
 // The value is expected to be either a float (IsFloat = true) or an unsigned
@@ -427,6 +447,36 @@ static Value *matchFPExtFromF16(Value *Arg) {
   return nullptr;
 }
 
+/// Match an sext from i16 to i32, or a constant we can convert.
+static Value *matchSExtFromI16(Value *Arg) {
+  Value *Src = nullptr;
+  ConstantInt *CInt = nullptr;
+  if (match(Arg, m_OneUse(m_SExt(m_Value(Src))))) {
+    if (Src->getType()->isIntegerTy(16))
+      return Src;
+  } else if (match(Arg, m_ConstantInt(CInt))) {
+    // The constant must survive a sext(trunc) round-trip through i16.
+    if (CInt->getValue().isSignedIntN(16))
+      return ConstantInt::get(Type::getInt16Ty(Arg->getContext()),
+                              CInt->getValue().trunc(16));
+  }
+  return nullptr;
+}
+
+/// Match a zext from i16 to i32, or a constant we can convert.
+static Value *matchZExtFromI16(Value *Arg) {
+  Value *Src = nullptr;
+  ConstantInt *CInt = nullptr;
+  if (match(Arg, m_OneUse(m_ZExt(m_Value(Src))))) {
+    if (Src->getType()->isIntegerTy(16))
+      return Src;
+  } else if (match(Arg, m_ConstantInt(CInt))) {
+    // The constant must survive a zext(trunc) round-trip through i16.
+    if (CInt->getValue().isIntN(16))
+      return ConstantInt::get(Type::getInt16Ty(Arg->getContext()),
+                              CInt->getValue().trunc(16));
+  }
+  return nullptr;
+}
+
 // Trim all zero components from the end of the vector \p UseV and return
 // an appropriate bitset with known elements.
 static APInt trimTrailingZerosInVector(InstCombiner &IC, Value *UseV,
@@ -1174,6 +1224,128 @@ GCNTTIImpl::instCombineIntrinsic(InstCombiner &IC, IntrinsicInst &II) const {
 
     break;
   }
+  case Intrinsic::amdgcn_smed3: {
+    Value *Src0 = II.getArgOperand(0);
+    Value *Src1 = II.getArgOperand(1);
+    Value *Src2 = II.getArgOperand(2);
+
+    // Propagate poison values.
+    for (Value *Src : {Src0, Src1, Src2}) {
+      if (isa<PoisonValue>(Src))
+        return IC.replaceInstUsesWith(II, Src);
+    }
+
+    bool Swap = false;
+    // Canonicalize constants to RHS operands.
+    //
+    // smed3(c0, x, c1) -> smed3(x, c0, c1)
+    if (isa<Constant>(Src0) && !isa<Constant>(Src1)) {
+      std::swap(Src0, Src1);
+      Swap = true;
+    }
+
+    if (isa<Constant>(Src1) && !isa<Constant>(Src2)) {
+      std::swap(Src1, Src2);
+      Swap = true;
+    }
+
+    if (isa<Constant>(Src0) && !isa<Constant>(Src1)) {
+      std::swap(Src0, Src1);
+      Swap = true;
+    }
+
+    if (Swap) {
+      II.setArgOperand(0, Src0);
+      II.setArgOperand(1, Src1);
+      II.setArgOperand(2, Src2);
+      return &II;
+    }
+
+    // Constant fold smed3 with constant operands.
+    if (const ConstantInt *C0 = dyn_cast<ConstantInt>(Src0)) {
+      if (const ConstantInt *C1 = dyn_cast<ConstantInt>(Src1)) {
+        if (const ConstantInt *C2 = dyn_cast<ConstantInt>(Src2)) {
+          APInt Result = smed3AMDGCN(C0->getValue(), C1->getValue(), C2->getValue());
+          return IC.replaceInstUsesWith(II, ConstantInt::get(II.getType(), Result));
+        }
+      }
+    }
+
+    // Width reduction for integer extensions.
+    // smed3((sext X), (sext Y), (sext Z)) -> sext (smed3(X, Y, Z))
+    if (Value *X = matchSExtFromI16(Src0)) {
+      if (Value *Y = matchSExtFromI16(Src1)) {
+        if (Value *Z = matchSExtFromI16(Src2)) {
+          Value *NewCall = IC.Builder.CreateIntrinsic(
+              IID, {X->getType()}, {X, Y, Z}, &II, II.getName());
+          return new SExtInst(NewCall, II.getType());
+        }
+      }
+    }
+
+    break;
+  }
+  case Intrinsic::amdgcn_umed3: {
+    Value *Src0 = II.getArgOperand(0);
+    Value *Src1 = II.getArgOperand(1);
+    Value *Src2 = II.getArgOperand(2);
+
+    // Propagate poison values.
+    for (Value *Src : {Src0, Src1, Src2}) {
+      if (isa<PoisonValue>(Src))
+        return IC.replaceInstUsesWith(II, Src);
+    }
+
+    bool Swap = false;
+    // Canonicalize constants to RHS operands.
+    //
+    // umed3(c0, x, c1) -> umed3(x, c0, c1)
+    if (isa<Constant>(Src0) && !isa<Constant>(Src1)) {
+      std::swap(Src0, Src1);
+      Swap = true;
+    }
+
+    if (isa<Constant>(Src1) && !isa<Constant>(Src2)) {
+      std::swap(Src1, Src2);
+      Swap = true;
+    }
+
+    if (isa<Constant>(Src0) && !isa<Constant>(Src1)) {
+      std::swap(Src0, Src1);
+      Swap = true;
+    }
+
+    if (Swap) {
+      II.setArgOperand(0, Src0);
+      II.setArgOperand(1, Src1);
+      II.setArgOperand(2, Src2);
+      return &II;
+    }
+
+    // Constant fold umed3 with constant operands.
+    if (const ConstantInt *C0 = dyn_cast<ConstantInt>(Src0)) {
+      if (const ConstantInt *C1 = dyn_cast<ConstantInt>(Src1)) {
+        if (const ConstantInt *C2 = dyn_cast<ConstantInt>(Src2)) {
+          APInt Result = umed3AMDGCN(C0->getValue(), C1->getValue(), C2->getValue());
+          return IC.replaceInstUsesWith(II, ConstantInt::get(II.getType(), Result));
+        }
+      }
+    }
+
+    // Width reduction for integer extensions.
+    // umed3((zext X), (zext Y), (zext Z)) -> zext (umed3(X, Y, Z))
+    if (Value *X = matchZExtFromI16(Src0)) {
+      if (Value *Y = matchZExtFromI16(Src1)) {
+        if (Value *Z = matchZExtFromI16(Src2)) {
+          Value *NewCall = IC.Builder.CreateIntrinsic(
+              IID, {X->getType()}, {X, Y, Z}, &II, II.getName());
+          return new ZExtInst(NewCall, II.getType());
+        }
+      }
+    }
+
+    break;
+  }
   case Intrinsic::amdgcn_icmp:
   case Intrinsic::amdgcn_fcmp: {
     const ConstantInt *CC = cast<ConstantInt>(II.getArgOperand(2));
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstrInfo.td b/llvm/lib/Target/AMDGPU/AMDGPUInstrInfo.td
index b8fa6f3fc6867..e9680e062cffa 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstrInfo.td
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstrInfo.td
@@ -334,16 +334,13 @@ def AMDGPUmad_i24 : SDNode<"AMDGPUISD::MAD_I24", AMDGPUDTIntTernaryOp,
   []
 >;
 
-def AMDGPUsmed3 : SDNode<"AMDGPUISD::SMED3", AMDGPUDTIntTernaryOp,
-  []
->;
-
-def AMDGPUumed3 : SDNode<"AMDGPUISD::UMED3", AMDGPUDTIntTernaryOp,
-  []
->;
 
 def AMDGPUfmed3_impl : SDNode<"AMDGPUISD::FMED3", SDTFPTernaryOp, []>;
 
+def AMDGPUsmed3_impl : SDNode<"AMDGPUISD::SMED3", AMDGPUDTIntTernaryOp, []>;
+
+def AMDGPUumed3_impl : SDNode<"AMDGPUISD::UMED3", AMDGPUDTIntTernaryOp, []>;
+
 def AMDGPUfdot2_impl : SDNode<"AMDGPUISD::FDOT2",
                   SDTypeProfile<1, 4, [SDTCisSameAs<0, 3>, SDTCisSameAs<1, 2>,
                                        SDTCisFP<0>, SDTCisVec<1>,
@@ -448,6 +445,14 @@ def AMDGPUfmed3 : PatFrags<(ops node:$src0, node:$src1, node:$src2),
   [(int_amdgcn_fmed3 node:$src0, node:$src1, node:$src2),
    (AMDGPUfmed3_impl node:$src0, node:$src1, node:$src2)]>;
 
+def AMDGPUsmed3 : PatFrags<(ops node:$src0, node:$src1, node:$src2),
+  [(int_amdgcn_smed3 node:$src0, node:$src1, node:$src2),
+   (AMDGPUsmed3_impl node:$src0, node:$src1, node:$src2)]>;
+
+def AMDGPUumed3 : PatFrags<(ops node:$src0, node:$src1, node:$src2),
+  [(int_amdgcn_umed3 node:$src0, node:$src1, node:$src2),
+   (AMDGPUumed3_impl node:$src0, node:$src1, node:$src2)]>;
+
 def AMDGPUdiv_fixup : PatFrags<(ops node:$src0, node:$src1, node:$src2),
   [(int_amdgcn_div_fixup node:$src0, node:$src1, node:$src2),
    (AMDGPUdiv_fixup_impl node:$src0, node:$src1, node:$src2)]>;
diff --git a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
index f18536cd4ab93..5da1e04c58bae 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
@@ -7798,6 +7798,28 @@ bool AMDGPULegalizerInfo::legalizeIntrinsic(LegalizerHelper &Helper,
     Observer.changedInstr(MI);
     return true;
   }
+  case Intrinsic::amdgcn_smed3: {
+    GISelChangeObserver &Observer = Helper.Observer;
+
+    // FIXME: This is to workaround the inability of tablegen match combiners to
+    // match intrinsics in patterns.
+    Observer.changingInstr(MI);
+    MI.setDesc(B.getTII().get(AMDGPU::G_AMDGPU_SMED3));
+    MI.removeOperand(1);
+    Observer.changedInstr(MI);
+    return true;
+  }
+  case Intrinsic::amdgcn_umed3: {
+    GISelChangeObserver &Observer = Helper.Observer;
+
+    // FIXME: This is to workaround the inability of tablegen match combiners to
+    // match intrinsics in patterns.
+    Observer.changingInstr(MI);
+    MI.setDesc(B.getTII().get(AMDGPU::G_AMDGPU_UMED3));
+    MI.removeOperand(1);
+    Observer.changedInstr(MI);
+    return true;
+  }
   case Intrinsic::amdgcn_readlane:
   case Intrinsic::amdgcn_writelane:
   case Intrinsic::amdgcn_readfirstlane:
diff --git a/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
index 36b27bef350ed..63141d065bf65 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
@@ -4136,6 +4136,7 @@ AMDGPURegisterBankInfo::getInstrMapping(const MachineInstr &MI) const {
   case AMDGPU::G_AMDGPU_CVT_F32_UBYTE3:
   case AMDGPU::G_AMDGPU_CVT_PK_I16_I32:
   case AMDGPU::G_AMDGPU_SMED3:
+  case AMDGPU::G_AMDGPU_UMED3:
   case AMDGPU::G_AMDGPU_FMED3:
     return getDefaultMappingVOP(MI);
   case AMDGPU::G_UMULH:
@@ -4660,6 +4661,8 @@ AMDGPURegisterBankInfo::getInstrMapping(const MachineInstr &MI) const {
     case Intrinsic::amdgcn_cvt_scalef32_sr_pk16_bf6_bf16:
     case Intrinsic::amdgcn_sat_pk4_i4_i8:
     case Intrinsic::amdgcn_sat_pk4_u4_u8:
+    case Intrinsic::amdgcn_smed3:
+    case Intrinsic::amdgcn_umed3:
     case Intrinsic::amdgcn_fmed3:
     case Intrinsic::amdgcn_cubeid:
     case Intrinsic::amdgcn_cubema:
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.smed3.i16.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.smed3.i16.ll
new file mode 100644
index 0000000000000..0f6f00309401c
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.smed3.i16.ll
@@ -0,0 +1,27 @@
+; RUN: llc -mtriple=amdgcn -mcpu=gfx900 < %s | FileCheck -check-prefix=GCN %s
+
+; GCN-LABEL: {{^}}test_smed3_i16:
+; GCN: v_med3_i16 v{{[0-9]+}}, s{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}}
+define amdgpu_kernel void @test_smed3_i16(ptr addrspace(1) %out, i32 %src0.arg, i32 %src1.arg, i32 %src2.arg) #1 {
+  %src0.i16 = trunc i32 %src0.arg to i16
+  %src1.i16 = trunc i32 %src1.arg to i16
+  %src2.i16 = trunc i32 %src2.arg to i16
+  %med3 = call i16 @llvm.amdgcn.smed3.i16(i16 %src0.i16, i16 %src1.i16, i16 %src2.i16)
+  store i16 %med3, ptr addrspace(1) %out
+  ret void
+}
+
+; GCN-LABEL: {{^}}test_smed3_zero_i16:
+; GCN: v_med3_i16 v{{[0-9]+}}, s{{[0-9]+}}, v{{[0-9]+}}, 0
+define amdgpu_kernel void @test_smed3_zero_i16(ptr addrspace(1) %out, i32 %src0.arg, i32 %src1.arg) #1 {
+  %src0.i16 = trunc i32 %src0.arg to i16
+  %src1.i16 = trunc i32 %src1.arg to i16
+  %med3 = call i16 @llvm.amdgcn.smed3.i16(i16 %src0.i16, i16 %src1.i16, i16 0)
+  store i16 %med3, ptr addrspace(1) %out
+  ret void
+}
+
+declare i16 @llvm.amdgcn.smed3.i16(i16, i16, i16) #0
+
+attributes #0 = { nounwind readnone }
+attributes #1 = { nounwind }
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.smed3.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.smed3.ll
new file mode 100644
index 0000000000000..250fdc0d2d78d
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.smed3.ll
@@ -0,0 +1,42 @@
+; RUN: llc -mtriple=amdgcn < %s | FileCheck -check-prefix=GCN %s
+; RUN: llc -mtriple=amdgcn -mcpu=tonga < %s | FileCheck -check-prefix=GCN %s
+
+; GCN-LABEL: {{^}}test_smed3:
+; GCN: v_med3_i32 v{{[0-9]+}}, s{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}}
+define amdgpu_kernel void @test_smed3(ptr addrspace(1) %out, i32 %src0, i32 %src1, i32 %src2) #1 {
+  %med3 = call i32 @llvm.amdgcn.smed3.i32(i32 %src0, i32 %src1, i32 %src2)
+  store i32 %med3, ptr addrspace(1) %out
+  ret void
+}
+
+; GCN-LABEL: {{^}}test_smed3_multi_use:
+; GCN: v_med3_i32 [[MED3:v[0-9]+]], s{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}}
+; GCN: v_mul_lo_i32 v{{[0-9]+}}, [[MED3]], s{{[0-9]+}}
+define amdgpu_kernel void @test_smed3_multi_use(ptr addrspace(1) %out, i32 %src0, i32 %src1, i32 %src2, i32 %mul.arg) #1 {
+  %med3 = call i32 @llvm.amdgcn.smed3.i32(i32 %src0, i32 %src1, i32 %src2)
+  %med3.user = mul i32 %med3, %mul.arg
+  store volatile i32 %med3.user, ptr addrspace(1) %out
+  store volatile i32 %med3, ptr addrspace(1) %out
+  ret void
+}
+
+; GCN-LABEL: {{^}}test_smed3_constants:
+; GCN: v_med3_i32 v{{[0-9]+}}, s{{[0-9]+}}, v{{[0-9]+}}, 42
+define amdgpu_kernel void @test_smed3_constants(ptr addrspace(1) %out, i32 %src0, i32 %src1) #1 {
+  %med3 = call i32 @llvm.amdgcn.smed3.i32(i32 %src0, i32 %src1, i32 42)
+  store i32 %med3, ptr addrspace(1) %out
+  ret void
+}
+
+; GCN-LABEL: {{^}}test_smed3_zero:
+; GCN: v_med3_i32 v{{[0-9]+}}, s{{[0-9]+}}, v{{[0-9]+}}, 0
+define amdgpu_kernel void @test_smed3_zero(ptr addrspace(1) %out, i32 %src0, i32 %src1) #1 {
+  %med3 = call i32 @llvm.amdgcn.smed3.i32(i32 %src0, i32 %src1, i32 0)
+  store i32 %med3, ptr addrspace(1) %out
+  ret void
+}
+
+declare i32 @llvm.amdgcn.smed3.i32(i32, i32, i32) #0
+
+attributes #0 = { nounwind readnone }
+attributes #1 = { nounwind }
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.umed3.i16.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.umed3.i16.ll
new file mode 100644
index 0000000000000..d484e8a4b0804
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.umed3.i16.ll
@@ -0,0 +1,27 @@
+; RUN: llc -mtriple=amdgcn -mcpu=gfx900 < %s | FileCheck -check-prefix=GCN %s
+
+; GCN-LABEL: {{^}}test_umed3_i16:
+; GCN: v_med3_u16 v{{[0-9]+}}, s{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}}
+define amdgpu_kernel void @test_umed3_i16(ptr addrspace(1) %out, i32 %src0.arg, i32 %src1.arg, i32 %src2.arg) #1 {
+  %src0.i16 = trunc i32 %src0.arg to i16
+  %src1.i16 = trunc i32 %src1.arg to i16
+  %src2.i16 = trunc i32 %src2.arg to i16
+  %med3 = call i16 @llvm.amdgcn.umed3.i16(i16 %src0.i16, i16 %src1.i16, i16 %src2.i16)
+  store i16 %med3, ptr addrspace(1) %out
+  ret void
+}
+
+; GCN-LABEL: {{^}}test_umed3_zero_i16:
+; GCN: v_med3_u16 v{{[0-9]+}}, s{{[0-9]+}}, v{{[0-9]+}}, 0
+define amdgpu_kernel void @test_umed3_zero_i16(ptr addrspace(1) %out, i32 %src0.arg, i32 %src1.arg) #1 {
+  %src0.i16 = trunc i32 %src0.arg to i16
+  %src1.i16 = trunc i32 %src1.arg to i16
+  %med3 = call i16 @llvm.amdgcn.umed3.i16(i16 %src0.i16, i16 %src1.i16, i16 0)
+  store i16 %med3, ptr addrspace(1) %out
+  ret void
+}
+
+declare i16 @llvm.amdgcn.umed3.i16(i16, i16, i16) #0
+
+attributes #0 = { nounwind readnone }
+attributes #1 = { nounwind }
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.umed3.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.umed3.ll
new file mode 100644
index 0000000000000..e1bec276d1fb6
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.umed3.ll
@@ -0,0 +1,42 @@
+; RUN: llc -mtriple=amdgcn < %s | FileCheck -check-prefix=GCN %s
+; RUN: llc -mtriple=amdgcn -mcpu=tonga < %s | FileCheck -check-prefix=GCN %s
+
+; GCN-LABEL: {{^}}test_umed3:
+; GCN: v_med3_u32 v{{[0-9]+}}, s{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}}
+define amdgpu_kernel void @test_umed3(ptr addrspace(1) %out, i32 %src0, i32 %src1, i32 %src2) #1 {
+  %med3 = call i32 @llvm.amdgcn.umed3.i32(i32 %src0, i32 %src1, i32 %src2)
+  store i32 %med3, ptr addrspace(1) %out
+  ret void
+}
+
+; GCN-LABEL: {{^}}test_umed3_multi_use:...
[truncated]
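The InstCombine narrowing in the patch above rewrites `smed3((sext X), (sext Y), (sext Z))` into `sext (smed3(X, Y, Z))`, and also narrows i32 constants to i16. A small Python sketch (a standalone model, not part of the patch) of why a constant may only be narrowed when it round-trips through a signed i16:

```python
def smed3(a, b, c):
    # Median of three signed integers, as V_MED3_I32 computes it.
    return sorted([a, b, c])[1]

def sext16(v):
    # Reinterpret the low 16 bits of v as a signed i16: sext(trunc(v)).
    v &= 0xFFFF
    return v - 0x10000 if v & 0x8000 else v

# Narrowing preserves the result when every operand fits in a signed i16:
ops = (-1234, 777, 30000)
assert smed3(*ops) == sext16(smed3(*(sext16(o) for o in ops)))

# ...but a constant such as 65535 has 16 active bits yet sign-extends back
# to -1, so "fits in 16 active bits" alone is not a safe narrowing test.
assert sext16(65535) == -1
```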

@llvmbot
Copy link
Member

llvmbot commented Sep 11, 2025

@llvm/pr-subscribers-clang

Author: Keshav Vinayak Jha (keshavvinayak01)

Changes

Description

Added support for AMDGPU signed and unsigned median-of-three (med3) intrinsics. Implemented the smed3 and umed3 ISA instructions from ROCDL with complete end-to-end support: LLVM intrinsics, Clang builtins, AMDGPU backend instruction selection, and MLIR ROCDL dialect operations.

Testing

  • 4 Lit test files in llvm/test/CodeGen/AMDGPU/
  • ROCDL -> LLVMIR lit tests for new rocdl.med3.<dtype> ops in /test/Target/LLVMIR/rocdl.mlir

Addresses #157052
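For reference, the med3 operations select the middle of three values, which makes `med3(x, lo, hi)` the usual clamp idiom. A minimal Python model of the signed and unsigned i32 variants (function names here are illustrative, not from the patch):

```python
def smed3_i32(a, b, c):
    # Signed median of three: drop the max and the min, keep the middle.
    return sorted([a, b, c])[1]

def umed3_i32(a, b, c):
    # Unsigned median: compare the operands as 32-bit unsigned values.
    as_u32 = lambda v: v & 0xFFFFFFFF
    return sorted([a, b, c], key=as_u32)[1]

# med3(x, lo, hi) clamps x into [lo, hi] when lo <= hi:
assert smed3_i32(100, -5, 10) == 10    # clamped to hi
assert smed3_i32(-100, -5, 10) == -5   # clamped to lo
# Signed and unsigned variants disagree once the sign bit is set:
assert smed3_i32(-1, 0, 5) == 0        # -1 is the signed minimum here
assert umed3_i32(-1, 0, 5) == 5        # 0xFFFFFFFF is the unsigned maximum
```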


Patch is 27.92 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/157748.diff

14 Files Affected:

  • (modified) clang/include/clang/Basic/BuiltinsAMDGPU.def (+4)
  • (modified) clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp (+8)
  • (modified) llvm/include/llvm/IR/IntrinsicsAMDGPU.td (+12)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUGISel.td (+2-2)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp (+172)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUInstrInfo.td (+12-7)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp (+22)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp (+3)
  • (added) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.smed3.i16.ll (+27)
  • (added) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.smed3.ll (+42)
  • (added) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.umed3.i16.ll (+27)
  • (added) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.umed3.ll (+42)
  • (modified) mlir/include/mlir/Dialect/LLVMIR/ROCDLOps.td (+88)
  • (modified) mlir/test/Target/LLVMIR/rocdl.mlir (+42)
diff --git a/clang/include/clang/Basic/BuiltinsAMDGPU.def b/clang/include/clang/Basic/BuiltinsAMDGPU.def
index fda16e42d2c6b4..4f6ab2a36cd853 100644
--- a/clang/include/clang/Basic/BuiltinsAMDGPU.def
+++ b/clang/include/clang/Basic/BuiltinsAMDGPU.def
@@ -139,6 +139,8 @@ BUILTIN(__builtin_amdgcn_ds_bpermute, "iii", "nc")
 BUILTIN(__builtin_amdgcn_readfirstlane, "ii", "nc")
 BUILTIN(__builtin_amdgcn_readlane, "iii", "nc")
 BUILTIN(__builtin_amdgcn_fmed3f, "ffff", "nc")
+BUILTIN(__builtin_amdgcn_smed3, "iiii", "nc")
+BUILTIN(__builtin_amdgcn_umed3, "UiUiUiUi", "nc")
 BUILTIN(__builtin_amdgcn_ds_faddf, "ff*3fIiIiIb", "n")
 BUILTIN(__builtin_amdgcn_ds_fminf, "ff*3fIiIiIb", "n")
 BUILTIN(__builtin_amdgcn_ds_fmaxf, "ff*3fIiIiIb", "n")
@@ -265,6 +267,8 @@ TARGET_BUILTIN(__builtin_amdgcn_perm, "UiUiUiUi", "nc", "gfx8-insts")
 //===----------------------------------------------------------------------===//
 
 TARGET_BUILTIN(__builtin_amdgcn_fmed3h, "hhhh", "nc", "gfx9-insts")
+TARGET_BUILTIN(__builtin_amdgcn_smed3h, "ssss", "nc", "gfx9-insts")
+TARGET_BUILTIN(__builtin_amdgcn_umed3h, "UsUsUsUs", "nc", "gfx9-insts")
 
 TARGET_BUILTIN(__builtin_amdgcn_global_atomic_fadd_f64, "dd*1d", "t", "gfx90a-insts")
 TARGET_BUILTIN(__builtin_amdgcn_global_atomic_fadd_f32, "ff*1f", "t", "atomic-fadd-rtn-insts")
diff --git a/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp b/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
index 07cf08c54985a1..5d4c980c7c63e3 100644
--- a/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
+++ b/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
@@ -606,6 +606,14 @@ Value *CodeGenFunction::EmitAMDGPUBuiltinExpr(unsigned BuiltinID,
   case AMDGPU::BI__builtin_amdgcn_fmed3h:
     return emitBuiltinWithOneOverloadedType<3>(*this, E,
                                                Intrinsic::amdgcn_fmed3);
+  case AMDGPU::BI__builtin_amdgcn_smed3:
+  case AMDGPU::BI__builtin_amdgcn_smed3h:
+    return emitBuiltinWithOneOverloadedType<3>(*this, E,
+                                               Intrinsic::amdgcn_smed3);
+  case AMDGPU::BI__builtin_amdgcn_umed3:
+  case AMDGPU::BI__builtin_amdgcn_umed3h:
+    return emitBuiltinWithOneOverloadedType<3>(*this, E,
+                                               Intrinsic::amdgcn_umed3);
   case AMDGPU::BI__builtin_amdgcn_ds_append:
   case AMDGPU::BI__builtin_amdgcn_ds_consume: {
     Intrinsic::ID Intrin = BuiltinID == AMDGPU::BI__builtin_amdgcn_ds_append ?
diff --git a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
index 030d01d7a5f3f0..52ba06ed4be25f 100644
--- a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
+++ b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
@@ -543,6 +543,18 @@ def int_amdgcn_fmed3 :
     [IntrNoMem, IntrSpeculatable]
 >;
 
+def int_amdgcn_smed3 :
+  DefaultAttrsIntrinsic<[llvm_anyint_ty],
+    [LLVMMatchType<0>, LLVMMatchType<0>, LLVMMatchType<0>],
+    [IntrNoMem, IntrSpeculatable]
+>;
+
+def int_amdgcn_umed3 :
+  DefaultAttrsIntrinsic<[llvm_anyint_ty],
+    [LLVMMatchType<0>, LLVMMatchType<0>, LLVMMatchType<0>],
+    [IntrNoMem, IntrSpeculatable]
+>;
+
 def int_amdgcn_cubeid : ClangBuiltin<"__builtin_amdgcn_cubeid">,
   DefaultAttrsIntrinsic<[llvm_float_ty],
     [llvm_float_ty, llvm_float_ty, llvm_float_ty],
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUGISel.td b/llvm/lib/Target/AMDGPU/AMDGPUGISel.td
index bb4bf742fb8611..486ec90edcaefb 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUGISel.td
+++ b/llvm/lib/Target/AMDGPU/AMDGPUGISel.td
@@ -256,8 +256,8 @@ def : GINodeEquiv<G_AMDGPU_CVT_F32_UBYTE2, AMDGPUcvt_f32_ubyte2>;
 def : GINodeEquiv<G_AMDGPU_CVT_F32_UBYTE3, AMDGPUcvt_f32_ubyte3>;
 
 def : GINodeEquiv<G_AMDGPU_CVT_PK_I16_I32, AMDGPUpk_i16_i32_impl>;
-def : GINodeEquiv<G_AMDGPU_SMED3, AMDGPUsmed3>;
-def : GINodeEquiv<G_AMDGPU_UMED3, AMDGPUumed3>;
+def : GINodeEquiv<G_AMDGPU_SMED3, AMDGPUsmed3_impl>;
+def : GINodeEquiv<G_AMDGPU_UMED3, AMDGPUumed3_impl>;
 def : GINodeEquiv<G_AMDGPU_FMED3, AMDGPUfmed3_impl>;
 def : GINodeEquiv<G_AMDGPU_CLAMP, AMDGPUclamp>;
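The InstCombine portion of this patch canonicalizes constant operands toward src2 using three conditional swaps, which amounts to one bubble pass over the operand list. A Python sketch of that swap sequence (a standalone model, not LLVM API):

```python
def canonicalize_constants(src0, src1, src2, is_const):
    # Mirror the three conditional swaps in the patch:
    # med3(c0, x, c1) -> med3(x, c0, c1), and so on.
    if is_const(src0) and not is_const(src1):
        src0, src1 = src1, src0
    if is_const(src1) and not is_const(src2):
        src1, src2 = src2, src1
    if is_const(src0) and not is_const(src1):
        src0, src1 = src1, src0
    return src0, src1, src2

is_const = lambda v: isinstance(v, int)  # strings stand in for non-constants
# A single constant migrates to src2; with two constants the lone
# variable operand ends up in src0.
assert canonicalize_constants(7, "x", "y", is_const) == ("x", "y", 7)
assert canonicalize_constants(7, "x", 9, is_const) == ("x", 7, 9)
assert canonicalize_constants(7, 9, "x", is_const) == ("x", 7, 9)
```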
 
+define amdgpu_kernel void @test_smed3_zero(ptr addrspace(1) %out, i32 %src0, i32 %src1) #1 {
+  %med3 = call i32 @llvm.amdgcn.smed3.i32(i32 %src0, i32 %src1, i32 0)
+  store i32 %med3, ptr addrspace(1) %out
+  ret void
+}
+
+declare i32 @llvm.amdgcn.smed3.i32(i32, i32, i32) #0
+
+attributes #0 = { nounwind readnone }
+attributes #1 = { nounwind }
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.umed3.i16.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.umed3.i16.ll
new file mode 100644
index 00000000000000..d484e8a4b08048
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.umed3.i16.ll
@@ -0,0 +1,27 @@
+; RUN: llc -mtriple=amdgcn -mcpu=gfx900 < %s | FileCheck -check-prefix=GCN %s
+
+; GCN-LABEL: {{^}}test_umed3_i16:
+; GCN: v_med3_u16 v{{[0-9]+}}, s{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}}
+define amdgpu_kernel void @test_umed3_i16(ptr addrspace(1) %out, i32 %src0.arg, i32 %src1.arg, i32 %src2.arg) #1 {
+  %src0.i16 = trunc i32 %src0.arg to i16
+  %src1.i16 = trunc i32 %src1.arg to i16
+  %src2.i16 = trunc i32 %src2.arg to i16
+  %med3 = call i16 @llvm.amdgcn.umed3.i16(i16 %src0.i16, i16 %src1.i16, i16 %src2.i16)
+  store i16 %med3, ptr addrspace(1) %out
+  ret void
+}
+
+; GCN-LABEL: {{^}}test_umed3_zero_i16:
+; GCN: v_med3_u16 v{{[0-9]+}}, s{{[0-9]+}}, v{{[0-9]+}}, 0
+define amdgpu_kernel void @test_umed3_zero_i16(ptr addrspace(1) %out, i32 %src0.arg, i32 %src1.arg) #1 {
+  %src0.i16 = trunc i32 %src0.arg to i16
+  %src1.i16 = trunc i32 %src1.arg to i16
+  %med3 = call i16 @llvm.amdgcn.umed3.i16(i16 %src0.i16, i16 %src1.i16, i16 0)
+  store i16 %med3, ptr addrspace(1) %out
+  ret void
+}
+
+declare i16 @llvm.amdgcn.umed3.i16(i16, i16, i16) #0
+
+attributes #0 = { nounwind readnone }
+attributes #1 = { nounwind }
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.umed3.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.umed3.ll
new file mode 100644
index 00000000000000..e1bec276d1fb6e
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.umed3.ll
@@ -0,0 +1,42 @@
+; RUN: llc -mtriple=amdgcn < %s | FileCheck -check-prefix=GCN %s
+; RUN: llc -mtriple=amdgcn -mcpu=tonga < %s | FileCheck -check-prefix=GCN %s
+
+; GCN-LABEL: {{^}}test_umed3:
+; GCN: v_med3_u32 v{{[0-9]+}}, s{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}}
+define amdgpu_kernel void @test_umed3(ptr addrspace(1) %out, i32 %src0, i32 %src1, i32 %src2) #1 {
+  %med3 = call i32 @llvm.amdgcn.umed3.i32(i32 %src0, i32 %src1, i32 %src2)
+  store i32 %med3, ptr addrspace(1) %out
+  ret void
+}
+
+; GCN-LABEL: {{...
[truncated]

@llvmbot
Member

llvmbot commented Sep 11, 2025

@llvm/pr-subscribers-mlir-llvm

Author: Keshav Vinayak Jha (keshavvinayak01)

Changes

Description

Added support for the AMDGPU signed and unsigned med3 intrinsics. Implemented smed3 and umed3 (the v_med3_i32/v_med3_u32 and v_med3_i16/v_med3_u16 ISA instructions) with complete end-to-end support including LLVM intrinsics, Clang builtins, AMDGPU backend instruction selection, and MLIR ROCDL dialect operations.

Testing

  • 4 Lit test files in llvm/test/CodeGen/AMDGPU/
  • ROCDL -> LLVMIR lit tests for new rocdl.med3.<dtype> ops in /test/Target/LLVMIR/rocdl.mlir

Addresses #157052
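For reference, here is a minimal host-side C model of the median-of-three semantics the new builtins expose. The `smed3`/`umed3` helpers below are illustrative stand-ins, not the real builtins: `__builtin_amdgcn_smed3`/`__builtin_amdgcn_umed3` are only available when compiling for an amdgcn target.

```c
#include <assert.h>
#include <stdint.h>

/* med3(a, b, c) = max(min(a, b), min(max(a, b), c)), i.e. the middle
 * value of the three under the given (signed or unsigned) ordering. */
static int32_t smed3(int32_t a, int32_t b, int32_t c) {
  int32_t mn = a < b ? a : b;
  int32_t mx = a < b ? b : a;
  /* If c is below both, the smaller of a/b is the median; otherwise
   * the median is min(max(a, b), c). */
  return mn > c ? mn : (mx < c ? mx : c);
}

static uint32_t umed3(uint32_t a, uint32_t b, uint32_t c) {
  uint32_t mn = a < b ? a : b;
  uint32_t mx = a < b ? b : a;
  return mn > c ? mn : (mx < c ? mx : c);
}
```

The same model, applied per element width, describes the i16 variants lowered to v_med3_i16/v_med3_u16.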


Patch is 27.89 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/157748.diff

14 Files Affected:

  • (modified) clang/include/clang/Basic/BuiltinsAMDGPU.def (+4)
  • (modified) clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp (+8)
  • (modified) llvm/include/llvm/IR/IntrinsicsAMDGPU.td (+12)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUGISel.td (+2-2)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp (+172)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUInstrInfo.td (+12-7)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp (+22)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp (+3)
  • (added) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.smed3.i16.ll (+27)
  • (added) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.smed3.ll (+42)
  • (added) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.umed3.i16.ll (+27)
  • (added) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.umed3.ll (+42)
  • (modified) mlir/include/mlir/Dialect/LLVMIR/ROCDLOps.td (+88)
  • (modified) mlir/test/Target/LLVMIR/rocdl.mlir (+42)
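The InstCombine width reduction in this patch (smed3 of sext'ed i16 operands folds to a sext of an i16 smed3) relies on med3 commuting with sign extension, since sext is strictly order-preserving for signed comparison. A quick host-side sanity check of that property, using local reference models rather than the real intrinsics:

```c
#include <assert.h>
#include <stdint.h>

/* Local reference model of signed med3: max(min(a,b), min(max(a,b), c)). */
static int32_t med3_i32(int32_t a, int32_t b, int32_t c) {
  int32_t mn = a < b ? a : b, mx = a < b ? b : a;
  return mn > c ? mn : (mx < c ? mx : c);
}

static int16_t med3_i16(int16_t a, int16_t b, int16_t c) {
  int16_t mn = a < b ? a : b, mx = a < b ? b : a;
  return mn > c ? mn : (mx < c ? mx : c);
}

/* Checks smed3(sext x, sext y, sext z) == sext(smed3(x, y, z)) for one
 * input triple; the fold is legal because sext preserves signed order. */
static int sext_fold_holds(int16_t x, int16_t y, int16_t z) {
  return med3_i32((int32_t)x, (int32_t)y, (int32_t)z) ==
         (int32_t)med3_i16(x, y, z);
}
```

The analogous zext property justifies the umed3 fold.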
diff --git a/clang/include/clang/Basic/BuiltinsAMDGPU.def b/clang/include/clang/Basic/BuiltinsAMDGPU.def
index fda16e42d2c6b..4f6ab2a36cd85 100644
--- a/clang/include/clang/Basic/BuiltinsAMDGPU.def
+++ b/clang/include/clang/Basic/BuiltinsAMDGPU.def
@@ -139,6 +139,8 @@ BUILTIN(__builtin_amdgcn_ds_bpermute, "iii", "nc")
 BUILTIN(__builtin_amdgcn_readfirstlane, "ii", "nc")
 BUILTIN(__builtin_amdgcn_readlane, "iii", "nc")
 BUILTIN(__builtin_amdgcn_fmed3f, "ffff", "nc")
+BUILTIN(__builtin_amdgcn_smed3, "iiii", "nc")
+BUILTIN(__builtin_amdgcn_umed3, "UiUiUiUi", "nc")
 BUILTIN(__builtin_amdgcn_ds_faddf, "ff*3fIiIiIb", "n")
 BUILTIN(__builtin_amdgcn_ds_fminf, "ff*3fIiIiIb", "n")
 BUILTIN(__builtin_amdgcn_ds_fmaxf, "ff*3fIiIiIb", "n")
@@ -265,6 +267,8 @@ TARGET_BUILTIN(__builtin_amdgcn_perm, "UiUiUiUi", "nc", "gfx8-insts")
 //===----------------------------------------------------------------------===//
 
 TARGET_BUILTIN(__builtin_amdgcn_fmed3h, "hhhh", "nc", "gfx9-insts")
+TARGET_BUILTIN(__builtin_amdgcn_smed3h, "ssss", "nc", "gfx9-insts")
+TARGET_BUILTIN(__builtin_amdgcn_umed3h, "UsUsUsUs", "nc", "gfx9-insts")
 
 TARGET_BUILTIN(__builtin_amdgcn_global_atomic_fadd_f64, "dd*1d", "t", "gfx90a-insts")
 TARGET_BUILTIN(__builtin_amdgcn_global_atomic_fadd_f32, "ff*1f", "t", "atomic-fadd-rtn-insts")
diff --git a/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp b/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
index 07cf08c54985a..5d4c980c7c63e 100644
--- a/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
+++ b/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
@@ -606,6 +606,14 @@ Value *CodeGenFunction::EmitAMDGPUBuiltinExpr(unsigned BuiltinID,
   case AMDGPU::BI__builtin_amdgcn_fmed3h:
     return emitBuiltinWithOneOverloadedType<3>(*this, E,
                                                Intrinsic::amdgcn_fmed3);
+  case AMDGPU::BI__builtin_amdgcn_smed3:
+  case AMDGPU::BI__builtin_amdgcn_smed3h:
+    return emitBuiltinWithOneOverloadedType<3>(*this, E,
+                                               Intrinsic::amdgcn_smed3);
+  case AMDGPU::BI__builtin_amdgcn_umed3:
+  case AMDGPU::BI__builtin_amdgcn_umed3h:
+    return emitBuiltinWithOneOverloadedType<3>(*this, E,
+                                               Intrinsic::amdgcn_umed3);
   case AMDGPU::BI__builtin_amdgcn_ds_append:
   case AMDGPU::BI__builtin_amdgcn_ds_consume: {
     Intrinsic::ID Intrin = BuiltinID == AMDGPU::BI__builtin_amdgcn_ds_append ?
diff --git a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
index 030d01d7a5f3f..52ba06ed4be25 100644
--- a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
+++ b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
@@ -543,6 +543,18 @@ def int_amdgcn_fmed3 :
     [IntrNoMem, IntrSpeculatable]
 >;
 
+def int_amdgcn_smed3 :
+  DefaultAttrsIntrinsic<[llvm_anyint_ty],
+    [LLVMMatchType<0>, LLVMMatchType<0>, LLVMMatchType<0>],
+    [IntrNoMem, IntrSpeculatable]
+>;
+
+def int_amdgcn_umed3 :
+  DefaultAttrsIntrinsic<[llvm_anyint_ty],
+    [LLVMMatchType<0>, LLVMMatchType<0>, LLVMMatchType<0>],
+    [IntrNoMem, IntrSpeculatable]
+>;
+
 def int_amdgcn_cubeid : ClangBuiltin<"__builtin_amdgcn_cubeid">,
   DefaultAttrsIntrinsic<[llvm_float_ty],
     [llvm_float_ty, llvm_float_ty, llvm_float_ty],
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUGISel.td b/llvm/lib/Target/AMDGPU/AMDGPUGISel.td
index bb4bf742fb861..486ec90edcaef 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUGISel.td
+++ b/llvm/lib/Target/AMDGPU/AMDGPUGISel.td
@@ -256,8 +256,8 @@ def : GINodeEquiv<G_AMDGPU_CVT_F32_UBYTE2, AMDGPUcvt_f32_ubyte2>;
 def : GINodeEquiv<G_AMDGPU_CVT_F32_UBYTE3, AMDGPUcvt_f32_ubyte3>;
 
 def : GINodeEquiv<G_AMDGPU_CVT_PK_I16_I32, AMDGPUpk_i16_i32_impl>;
-def : GINodeEquiv<G_AMDGPU_SMED3, AMDGPUsmed3>;
-def : GINodeEquiv<G_AMDGPU_UMED3, AMDGPUumed3>;
+def : GINodeEquiv<G_AMDGPU_SMED3, AMDGPUsmed3_impl>;
+def : GINodeEquiv<G_AMDGPU_UMED3, AMDGPUumed3_impl>;
 def : GINodeEquiv<G_AMDGPU_FMED3, AMDGPUfmed3_impl>;
 def : GINodeEquiv<G_AMDGPU_CLAMP, AMDGPUclamp>;
 
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp b/llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp
index 4fe5d00679436..c6cb4736f95df 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp
@@ -60,6 +60,26 @@ static APFloat fmed3AMDGCN(const APFloat &Src0, const APFloat &Src1,
   return maxnum(Src0, Src1);
 }
 
+// Constant fold llvm.amdgcn.smed3 intrinsics for standard inputs.
+static APInt smed3AMDGCN(const APInt &Src0, const APInt &Src1, const APInt &Src2) {
+  APInt Max3 = Src0.sgt(Src1) ? (Src0.sgt(Src2) ? Src0 : Src2) 
+                              : (Src1.sgt(Src2) ? Src1 : Src2);
+  
+  if (Max3 == Src0) return Src1.sgt(Src2) ? Src1 : Src2;
+  if (Max3 == Src1) return Src0.sgt(Src2) ? Src0 : Src2;
+  return Src0.sgt(Src1) ? Src0 : Src1;
+}
+
+// Constant fold llvm.amdgcn.umed3 intrinsics for standard inputs.
+static APInt umed3AMDGCN(const APInt &Src0, const APInt &Src1, const APInt &Src2) {
+  APInt Max3 = Src0.ugt(Src1) ? (Src0.ugt(Src2) ? Src0 : Src2) 
+                              : (Src1.ugt(Src2) ? Src1 : Src2);
+  
+  if (Max3 == Src0) return Src1.ugt(Src2) ? Src1 : Src2;
+  if (Max3 == Src1) return Src0.ugt(Src2) ? Src0 : Src2;
+  return Src0.ugt(Src1) ? Src0 : Src1;
+}
+
 // Check if a value can be converted to a 16-bit value without losing
 // precision.
 // The value is expected to be either a float (IsFloat = true) or an unsigned
@@ -427,6 +447,36 @@ static Value *matchFPExtFromF16(Value *Arg) {
   return nullptr;
 }
 
+/// Match an sext from i16 to i32, or a constant we can convert.
+static Value *matchSExtFromI16(Value *Arg) {
+  Value *Src = nullptr;
+  ConstantInt *CInt = nullptr;
+  if (match(Arg, m_OneUse(m_SExt(m_Value(Src))))) {
+    if (Src->getType()->isIntegerTy(16))
+      return Src;
+  } else if (match(Arg, m_ConstantInt(CInt))) {
+    // Check if the constant fits in i16
+    if (CInt->getValue().getActiveBits() <= 16)
+      return ConstantInt::get(Type::getInt16Ty(Arg->getContext()), CInt->getValue().trunc(16));
+  }
+  return nullptr;
+}
+
+/// Match a zext from i16 to i32, or a constant we can convert.
+static Value *matchZExtFromI16(Value *Arg) {
+  Value *Src = nullptr;
+  ConstantInt *CInt = nullptr;
+  if (match(Arg, m_OneUse(m_ZExt(m_Value(Src))))) {
+    if (Src->getType()->isIntegerTy(16))
+      return Src;
+  } else if (match(Arg, m_ConstantInt(CInt))) {
+    // Check if the constant fits in i16
+    if (CInt->getValue().getActiveBits() <= 16)
+      return ConstantInt::get(Type::getInt16Ty(Arg->getContext()), CInt->getValue().trunc(16));
+  }
+  return nullptr;
+}
+
 // Trim all zero components from the end of the vector \p UseV and return
 // an appropriate bitset with known elements.
 static APInt trimTrailingZerosInVector(InstCombiner &IC, Value *UseV,
@@ -1174,6 +1224,128 @@ GCNTTIImpl::instCombineIntrinsic(InstCombiner &IC, IntrinsicInst &II) const {
 
     break;
   }
+  case Intrinsic::amdgcn_smed3: {
+    Value *Src0 = II.getArgOperand(0);
+    Value *Src1 = II.getArgOperand(1);
+    Value *Src2 = II.getArgOperand(2);
+
+    // Propagate poison values.
+    for (Value *Src : {Src0, Src1, Src2}) {
+      if (isa<PoisonValue>(Src))
+        return IC.replaceInstUsesWith(II, Src);
+    }
+
+    bool Swap = false;
+    // Canonicalize constants to RHS operands.
+    //
+    // smed3(c0, x, c1) -> smed3(x, c0, c1)
+    if (isa<Constant>(Src0) && !isa<Constant>(Src1)) {
+      std::swap(Src0, Src1);
+      Swap = true;
+    }
+
+    if (isa<Constant>(Src1) && !isa<Constant>(Src2)) {
+      std::swap(Src1, Src2);
+      Swap = true;
+    }
+
+    if (isa<Constant>(Src0) && !isa<Constant>(Src1)) {
+      std::swap(Src0, Src1);
+      Swap = true;
+    }
+
+    if (Swap) {
+      II.setArgOperand(0, Src0);
+      II.setArgOperand(1, Src1);
+      II.setArgOperand(2, Src2);
+      return &II;
+    }
+
+    // Constant fold smed3 with constant operands.
+    if (const ConstantInt *C0 = dyn_cast<ConstantInt>(Src0)) {
+      if (const ConstantInt *C1 = dyn_cast<ConstantInt>(Src1)) {
+        if (const ConstantInt *C2 = dyn_cast<ConstantInt>(Src2)) {
+          APInt Result = smed3AMDGCN(C0->getValue(), C1->getValue(), C2->getValue());
+          return IC.replaceInstUsesWith(II, ConstantInt::get(II.getType(), Result));
+        }
+      }
+    }
+
+    // Width reduction for integer extensions.
+    // smed3((sext X), (sext Y), (sext Z)) -> sext (smed3(X, Y, Z))
+    if (Value *X = matchSExtFromI16(Src0)) {
+      if (Value *Y = matchSExtFromI16(Src1)) {
+        if (Value *Z = matchSExtFromI16(Src2)) {
+          Value *NewCall = IC.Builder.CreateIntrinsic(
+              IID, {X->getType()}, {X, Y, Z}, &II, II.getName());
+          return new SExtInst(NewCall, II.getType());
+        }
+      }
+    }
+
+    break;
+  }
+  case Intrinsic::amdgcn_umed3: {
+    Value *Src0 = II.getArgOperand(0);
+    Value *Src1 = II.getArgOperand(1);
+    Value *Src2 = II.getArgOperand(2);
+
+    // Propagate poison values.
+    for (Value *Src : {Src0, Src1, Src2}) {
+      if (isa<PoisonValue>(Src))
+        return IC.replaceInstUsesWith(II, Src);
+    }
+
+    bool Swap = false;
+    // Canonicalize constants to RHS operands.
+    //
+    // umed3(c0, x, c1) -> umed3(x, c0, c1)
+    if (isa<Constant>(Src0) && !isa<Constant>(Src1)) {
+      std::swap(Src0, Src1);
+      Swap = true;
+    }
+
+    if (isa<Constant>(Src1) && !isa<Constant>(Src2)) {
+      std::swap(Src1, Src2);
+      Swap = true;
+    }
+
+    if (isa<Constant>(Src0) && !isa<Constant>(Src1)) {
+      std::swap(Src0, Src1);
+      Swap = true;
+    }
+
+    if (Swap) {
+      II.setArgOperand(0, Src0);
+      II.setArgOperand(1, Src1);
+      II.setArgOperand(2, Src2);
+      return &II;
+    }
+
+    // Constant fold umed3 with constant operands.
+    if (const ConstantInt *C0 = dyn_cast<ConstantInt>(Src0)) {
+      if (const ConstantInt *C1 = dyn_cast<ConstantInt>(Src1)) {
+        if (const ConstantInt *C2 = dyn_cast<ConstantInt>(Src2)) {
+          APInt Result = umed3AMDGCN(C0->getValue(), C1->getValue(), C2->getValue());
+          return IC.replaceInstUsesWith(II, ConstantInt::get(II.getType(), Result));
+        }
+      }
+    }
+
+    // Width reduction for integer extensions.
+    // umed3((zext X), (zext Y), (zext Z)) -> zext (umed3(X, Y, Z))
+    if (Value *X = matchZExtFromI16(Src0)) {
+      if (Value *Y = matchZExtFromI16(Src1)) {
+        if (Value *Z = matchZExtFromI16(Src2)) {
+          Value *NewCall = IC.Builder.CreateIntrinsic(
+              IID, {X->getType()}, {X, Y, Z}, &II, II.getName());
+          return new ZExtInst(NewCall, II.getType());
+        }
+      }
+    }
+
+    break;
+  }
   case Intrinsic::amdgcn_icmp:
   case Intrinsic::amdgcn_fcmp: {
     const ConstantInt *CC = cast<ConstantInt>(II.getArgOperand(2));
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstrInfo.td b/llvm/lib/Target/AMDGPU/AMDGPUInstrInfo.td
index b8fa6f3fc6867..e9680e062cffa 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstrInfo.td
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstrInfo.td
@@ -334,16 +334,13 @@ def AMDGPUmad_i24 : SDNode<"AMDGPUISD::MAD_I24", AMDGPUDTIntTernaryOp,
   []
 >;
 
-def AMDGPUsmed3 : SDNode<"AMDGPUISD::SMED3", AMDGPUDTIntTernaryOp,
-  []
->;
-
-def AMDGPUumed3 : SDNode<"AMDGPUISD::UMED3", AMDGPUDTIntTernaryOp,
-  []
->;
 
 def AMDGPUfmed3_impl : SDNode<"AMDGPUISD::FMED3", SDTFPTernaryOp, []>;
 
+def AMDGPUsmed3_impl : SDNode<"AMDGPUISD::SMED3", AMDGPUDTIntTernaryOp, []>;
+
+def AMDGPUumed3_impl : SDNode<"AMDGPUISD::UMED3", AMDGPUDTIntTernaryOp, []>;
+
 def AMDGPUfdot2_impl : SDNode<"AMDGPUISD::FDOT2",
                   SDTypeProfile<1, 4, [SDTCisSameAs<0, 3>, SDTCisSameAs<1, 2>,
                                        SDTCisFP<0>, SDTCisVec<1>,
@@ -448,6 +445,14 @@ def AMDGPUfmed3 : PatFrags<(ops node:$src0, node:$src1, node:$src2),
   [(int_amdgcn_fmed3 node:$src0, node:$src1, node:$src2),
    (AMDGPUfmed3_impl node:$src0, node:$src1, node:$src2)]>;
 
+def AMDGPUsmed3 : PatFrags<(ops node:$src0, node:$src1, node:$src2),
+  [(int_amdgcn_smed3 node:$src0, node:$src1, node:$src2),
+   (AMDGPUsmed3_impl node:$src0, node:$src1, node:$src2)]>;
+
+def AMDGPUumed3 : PatFrags<(ops node:$src0, node:$src1, node:$src2),
+  [(int_amdgcn_umed3 node:$src0, node:$src1, node:$src2),
+   (AMDGPUumed3_impl node:$src0, node:$src1, node:$src2)]>;
+
 def AMDGPUdiv_fixup : PatFrags<(ops node:$src0, node:$src1, node:$src2),
   [(int_amdgcn_div_fixup node:$src0, node:$src1, node:$src2),
    (AMDGPUdiv_fixup_impl node:$src0, node:$src1, node:$src2)]>;
diff --git a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
index f18536cd4ab93..5da1e04c58bae 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
@@ -7798,6 +7798,28 @@ bool AMDGPULegalizerInfo::legalizeIntrinsic(LegalizerHelper &Helper,
     Observer.changedInstr(MI);
     return true;
   }
+  case Intrinsic::amdgcn_smed3: {
+    GISelChangeObserver &Observer = Helper.Observer;
+
+    // FIXME: This is to workaround the inability of tablegen match combiners to
+    // match intrinsics in patterns.
+    Observer.changingInstr(MI);
+    MI.setDesc(B.getTII().get(AMDGPU::G_AMDGPU_SMED3));
+    MI.removeOperand(1);
+    Observer.changedInstr(MI);
+    return true;
+  }
+  case Intrinsic::amdgcn_umed3: {
+    GISelChangeObserver &Observer = Helper.Observer;
+
+    // FIXME: This is to workaround the inability of tablegen match combiners to
+    // match intrinsics in patterns.
+    Observer.changingInstr(MI);
+    MI.setDesc(B.getTII().get(AMDGPU::G_AMDGPU_UMED3));
+    MI.removeOperand(1);
+    Observer.changedInstr(MI);
+    return true;
+  }
   case Intrinsic::amdgcn_readlane:
   case Intrinsic::amdgcn_writelane:
   case Intrinsic::amdgcn_readfirstlane:
[truncated]

@llvmbot
Member

llvmbot commented Sep 11, 2025

@llvm/pr-subscribers-llvm-ir

Author: Keshav Vinayak Jha (keshavvinayak01)

Changes

Description

Added support for AMDGPU signed (med3) intrinsics. Implemented smed3 and umed3 ISA instructions from ROCDL with complete end-to-end support including LLVM intrinsics, Clang builtins, AMDGPU backend instruction selection, and MLIR ROCDL dialect operations.

Testing

  • 4 Lit test files in llvm/test/CodeGen/AMDGPU/
  • ROCDL -> LLVMIR lit tests for new rocdl.med3.&lt;dtype&gt; ops in /test/Target/LLVMIR/rocdl.mlir

Addresses #157052


Patch is 27.89 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/157748.diff

14 Files Affected:

  • (modified) clang/include/clang/Basic/BuiltinsAMDGPU.def (+4)
  • (modified) clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp (+8)
  • (modified) llvm/include/llvm/IR/IntrinsicsAMDGPU.td (+12)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUGISel.td (+2-2)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp (+172)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUInstrInfo.td (+12-7)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp (+22)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp (+3)
  • (added) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.smed3.i16.ll (+27)
  • (added) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.smed3.ll (+42)
  • (added) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.umed3.i16.ll (+27)
  • (added) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.umed3.ll (+42)
  • (modified) mlir/include/mlir/Dialect/LLVMIR/ROCDLOps.td (+88)
  • (modified) mlir/test/Target/LLVMIR/rocdl.mlir (+42)
diff --git a/clang/include/clang/Basic/BuiltinsAMDGPU.def b/clang/include/clang/Basic/BuiltinsAMDGPU.def
index fda16e42d2c6b..4f6ab2a36cd85 100644
--- a/clang/include/clang/Basic/BuiltinsAMDGPU.def
+++ b/clang/include/clang/Basic/BuiltinsAMDGPU.def
@@ -139,6 +139,8 @@ BUILTIN(__builtin_amdgcn_ds_bpermute, "iii", "nc")
 BUILTIN(__builtin_amdgcn_readfirstlane, "ii", "nc")
 BUILTIN(__builtin_amdgcn_readlane, "iii", "nc")
 BUILTIN(__builtin_amdgcn_fmed3f, "ffff", "nc")
+BUILTIN(__builtin_amdgcn_smed3, "iiii", "nc")
+BUILTIN(__builtin_amdgcn_umed3, "UiUiUiUi", "nc")
 BUILTIN(__builtin_amdgcn_ds_faddf, "ff*3fIiIiIb", "n")
 BUILTIN(__builtin_amdgcn_ds_fminf, "ff*3fIiIiIb", "n")
 BUILTIN(__builtin_amdgcn_ds_fmaxf, "ff*3fIiIiIb", "n")
@@ -265,6 +267,8 @@ TARGET_BUILTIN(__builtin_amdgcn_perm, "UiUiUiUi", "nc", "gfx8-insts")
 //===----------------------------------------------------------------------===//
 
 TARGET_BUILTIN(__builtin_amdgcn_fmed3h, "hhhh", "nc", "gfx9-insts")
+TARGET_BUILTIN(__builtin_amdgcn_smed3h, "ssss", "nc", "gfx9-insts")
+TARGET_BUILTIN(__builtin_amdgcn_umed3h, "UsUsUsUs", "nc", "gfx9-insts")
 
 TARGET_BUILTIN(__builtin_amdgcn_global_atomic_fadd_f64, "dd*1d", "t", "gfx90a-insts")
 TARGET_BUILTIN(__builtin_amdgcn_global_atomic_fadd_f32, "ff*1f", "t", "atomic-fadd-rtn-insts")
diff --git a/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp b/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
index 07cf08c54985a..5d4c980c7c63e 100644
--- a/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
+++ b/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
@@ -606,6 +606,14 @@ Value *CodeGenFunction::EmitAMDGPUBuiltinExpr(unsigned BuiltinID,
   case AMDGPU::BI__builtin_amdgcn_fmed3h:
     return emitBuiltinWithOneOverloadedType<3>(*this, E,
                                                Intrinsic::amdgcn_fmed3);
+  case AMDGPU::BI__builtin_amdgcn_smed3:
+  case AMDGPU::BI__builtin_amdgcn_smed3h:
+    return emitBuiltinWithOneOverloadedType<3>(*this, E,
+                                               Intrinsic::amdgcn_smed3);
+  case AMDGPU::BI__builtin_amdgcn_umed3:
+  case AMDGPU::BI__builtin_amdgcn_umed3h:
+    return emitBuiltinWithOneOverloadedType<3>(*this, E,
+                                               Intrinsic::amdgcn_umed3);
   case AMDGPU::BI__builtin_amdgcn_ds_append:
   case AMDGPU::BI__builtin_amdgcn_ds_consume: {
     Intrinsic::ID Intrin = BuiltinID == AMDGPU::BI__builtin_amdgcn_ds_append ?
diff --git a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
index 030d01d7a5f3f..52ba06ed4be25 100644
--- a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
+++ b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
@@ -543,6 +543,18 @@ def int_amdgcn_fmed3 :
     [IntrNoMem, IntrSpeculatable]
 >;
 
+def int_amdgcn_smed3 :
+  DefaultAttrsIntrinsic<[llvm_anyint_ty],
+    [LLVMMatchType<0>, LLVMMatchType<0>, LLVMMatchType<0>],
+    [IntrNoMem, IntrSpeculatable]
+>;
+
+def int_amdgcn_umed3 :
+  DefaultAttrsIntrinsic<[llvm_anyint_ty],
+    [LLVMMatchType<0>, LLVMMatchType<0>, LLVMMatchType<0>],
+    [IntrNoMem, IntrSpeculatable]
+>;
+
 def int_amdgcn_cubeid : ClangBuiltin<"__builtin_amdgcn_cubeid">,
   DefaultAttrsIntrinsic<[llvm_float_ty],
     [llvm_float_ty, llvm_float_ty, llvm_float_ty],
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUGISel.td b/llvm/lib/Target/AMDGPU/AMDGPUGISel.td
index bb4bf742fb861..486ec90edcaef 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUGISel.td
+++ b/llvm/lib/Target/AMDGPU/AMDGPUGISel.td
@@ -256,8 +256,8 @@ def : GINodeEquiv<G_AMDGPU_CVT_F32_UBYTE2, AMDGPUcvt_f32_ubyte2>;
 def : GINodeEquiv<G_AMDGPU_CVT_F32_UBYTE3, AMDGPUcvt_f32_ubyte3>;
 
 def : GINodeEquiv<G_AMDGPU_CVT_PK_I16_I32, AMDGPUpk_i16_i32_impl>;
-def : GINodeEquiv<G_AMDGPU_SMED3, AMDGPUsmed3>;
-def : GINodeEquiv<G_AMDGPU_UMED3, AMDGPUumed3>;
+def : GINodeEquiv<G_AMDGPU_SMED3, AMDGPUsmed3_impl>;
+def : GINodeEquiv<G_AMDGPU_UMED3, AMDGPUumed3_impl>;
 def : GINodeEquiv<G_AMDGPU_FMED3, AMDGPUfmed3_impl>;
 def : GINodeEquiv<G_AMDGPU_CLAMP, AMDGPUclamp>;
 
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp b/llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp
index 4fe5d00679436..c6cb4736f95df 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp
@@ -60,6 +60,26 @@ static APFloat fmed3AMDGCN(const APFloat &Src0, const APFloat &Src1,
   return maxnum(Src0, Src1);
 }
 
+// Constant fold llvm.amdgcn.smed3 intrinsics for standard inputs.
+static APInt smed3AMDGCN(const APInt &Src0, const APInt &Src1,
+                         const APInt &Src2) {
+  APInt Max3 = Src0.sgt(Src1) ? (Src0.sgt(Src2) ? Src0 : Src2)
+                              : (Src1.sgt(Src2) ? Src1 : Src2);
+
+  if (Max3 == Src0)
+    return Src1.sgt(Src2) ? Src1 : Src2;
+  if (Max3 == Src1)
+    return Src0.sgt(Src2) ? Src0 : Src2;
+  return Src0.sgt(Src1) ? Src0 : Src1;
+}
+
+// Constant fold llvm.amdgcn.umed3 intrinsics for standard inputs.
+static APInt umed3AMDGCN(const APInt &Src0, const APInt &Src1,
+                         const APInt &Src2) {
+  APInt Max3 = Src0.ugt(Src1) ? (Src0.ugt(Src2) ? Src0 : Src2)
+                              : (Src1.ugt(Src2) ? Src1 : Src2);
+
+  if (Max3 == Src0)
+    return Src1.ugt(Src2) ? Src1 : Src2;
+  if (Max3 == Src1)
+    return Src0.ugt(Src2) ? Src0 : Src2;
+  return Src0.ugt(Src1) ? Src0 : Src1;
+}
+
 // Check if a value can be converted to a 16-bit value without losing
 // precision.
 // The value is expected to be either a float (IsFloat = true) or an unsigned
@@ -427,6 +447,36 @@ static Value *matchFPExtFromF16(Value *Arg) {
   return nullptr;
 }
 
+/// Match an sext from i16 to i32, or a constant we can convert.
+static Value *matchSExtFromI16(Value *Arg) {
+  Value *Src = nullptr;
+  ConstantInt *CInt = nullptr;
+  if (match(Arg, m_OneUse(m_SExt(m_Value(Src))))) {
+    if (Src->getType()->isIntegerTy(16))
+      return Src;
+  } else if (match(Arg, m_ConstantInt(CInt))) {
+    // Check if the constant is representable as a signed i16.
+    if (CInt->getValue().isSignedIntN(16))
+      return ConstantInt::get(Type::getInt16Ty(Arg->getContext()),
+                              CInt->getValue().trunc(16));
+  }
+  return nullptr;
+}
+
+/// Match a zext from i16 to i32, or a constant we can convert.
+static Value *matchZExtFromI16(Value *Arg) {
+  Value *Src = nullptr;
+  ConstantInt *CInt = nullptr;
+  if (match(Arg, m_OneUse(m_ZExt(m_Value(Src))))) {
+    if (Src->getType()->isIntegerTy(16))
+      return Src;
+  } else if (match(Arg, m_ConstantInt(CInt))) {
+    // Check if the constant is representable as an unsigned i16.
+    if (CInt->getValue().isIntN(16))
+      return ConstantInt::get(Type::getInt16Ty(Arg->getContext()),
+                              CInt->getValue().trunc(16));
+  }
+  return nullptr;
+}
+
 // Trim all zero components from the end of the vector \p UseV and return
 // an appropriate bitset with known elements.
 static APInt trimTrailingZerosInVector(InstCombiner &IC, Value *UseV,
@@ -1174,6 +1224,128 @@ GCNTTIImpl::instCombineIntrinsic(InstCombiner &IC, IntrinsicInst &II) const {
 
     break;
   }
+  case Intrinsic::amdgcn_smed3: {
+    Value *Src0 = II.getArgOperand(0);
+    Value *Src1 = II.getArgOperand(1);
+    Value *Src2 = II.getArgOperand(2);
+
+    // Propagate poison values.
+    for (Value *Src : {Src0, Src1, Src2}) {
+      if (isa<PoisonValue>(Src))
+        return IC.replaceInstUsesWith(II, Src);
+    }
+
+    bool Swap = false;
+    // Canonicalize constants to RHS operands.
+    //
+    // smed3(c0, x, c1) -> smed3(x, c0, c1)
+    if (isa<Constant>(Src0) && !isa<Constant>(Src1)) {
+      std::swap(Src0, Src1);
+      Swap = true;
+    }
+
+    if (isa<Constant>(Src1) && !isa<Constant>(Src2)) {
+      std::swap(Src1, Src2);
+      Swap = true;
+    }
+
+    if (isa<Constant>(Src0) && !isa<Constant>(Src1)) {
+      std::swap(Src0, Src1);
+      Swap = true;
+    }
+
+    if (Swap) {
+      II.setArgOperand(0, Src0);
+      II.setArgOperand(1, Src1);
+      II.setArgOperand(2, Src2);
+      return &II;
+    }
+
+    // Constant fold smed3 with constant operands.
+    if (const ConstantInt *C0 = dyn_cast<ConstantInt>(Src0)) {
+      if (const ConstantInt *C1 = dyn_cast<ConstantInt>(Src1)) {
+        if (const ConstantInt *C2 = dyn_cast<ConstantInt>(Src2)) {
+          APInt Result = smed3AMDGCN(C0->getValue(), C1->getValue(),
+                                     C2->getValue());
+          return IC.replaceInstUsesWith(
+              II, ConstantInt::get(II.getType(), Result));
+        }
+      }
+    }
+
+    // Width reduction for integer extensions.
+    // smed3((sext X), (sext Y), (sext Z)) -> sext (smed3(X, Y, Z))
+    if (Value *X = matchSExtFromI16(Src0)) {
+      if (Value *Y = matchSExtFromI16(Src1)) {
+        if (Value *Z = matchSExtFromI16(Src2)) {
+          Value *NewCall = IC.Builder.CreateIntrinsic(
+              IID, {X->getType()}, {X, Y, Z}, &II, II.getName());
+          return new SExtInst(NewCall, II.getType());
+        }
+      }
+    }
+
+    break;
+  }
+  case Intrinsic::amdgcn_umed3: {
+    Value *Src0 = II.getArgOperand(0);
+    Value *Src1 = II.getArgOperand(1);
+    Value *Src2 = II.getArgOperand(2);
+
+    // Propagate poison values.
+    for (Value *Src : {Src0, Src1, Src2}) {
+      if (isa<PoisonValue>(Src))
+        return IC.replaceInstUsesWith(II, Src);
+    }
+
+    bool Swap = false;
+    // Canonicalize constants to RHS operands.
+    //
+    // umed3(c0, x, c1) -> umed3(x, c0, c1)
+    if (isa<Constant>(Src0) && !isa<Constant>(Src1)) {
+      std::swap(Src0, Src1);
+      Swap = true;
+    }
+
+    if (isa<Constant>(Src1) && !isa<Constant>(Src2)) {
+      std::swap(Src1, Src2);
+      Swap = true;
+    }
+
+    if (isa<Constant>(Src0) && !isa<Constant>(Src1)) {
+      std::swap(Src0, Src1);
+      Swap = true;
+    }
+
+    if (Swap) {
+      II.setArgOperand(0, Src0);
+      II.setArgOperand(1, Src1);
+      II.setArgOperand(2, Src2);
+      return &II;
+    }
+
+    // Constant fold umed3 with constant operands.
+    if (const ConstantInt *C0 = dyn_cast<ConstantInt>(Src0)) {
+      if (const ConstantInt *C1 = dyn_cast<ConstantInt>(Src1)) {
+        if (const ConstantInt *C2 = dyn_cast<ConstantInt>(Src2)) {
+          APInt Result = umed3AMDGCN(C0->getValue(), C1->getValue(),
+                                     C2->getValue());
+          return IC.replaceInstUsesWith(
+              II, ConstantInt::get(II.getType(), Result));
+        }
+      }
+    }
+
+    // Width reduction for integer extensions.
+    // umed3((zext X), (zext Y), (zext Z)) -> zext (umed3(X, Y, Z))
+    if (Value *X = matchZExtFromI16(Src0)) {
+      if (Value *Y = matchZExtFromI16(Src1)) {
+        if (Value *Z = matchZExtFromI16(Src2)) {
+          Value *NewCall = IC.Builder.CreateIntrinsic(
+              IID, {X->getType()}, {X, Y, Z}, &II, II.getName());
+          return new ZExtInst(NewCall, II.getType());
+        }
+      }
+    }
+
+    break;
+  }
   case Intrinsic::amdgcn_icmp:
   case Intrinsic::amdgcn_fcmp: {
     const ConstantInt *CC = cast<ConstantInt>(II.getArgOperand(2));
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstrInfo.td b/llvm/lib/Target/AMDGPU/AMDGPUInstrInfo.td
index b8fa6f3fc6867..e9680e062cffa 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstrInfo.td
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstrInfo.td
@@ -334,16 +334,13 @@ def AMDGPUmad_i24 : SDNode<"AMDGPUISD::MAD_I24", AMDGPUDTIntTernaryOp,
   []
 >;
 
-def AMDGPUsmed3 : SDNode<"AMDGPUISD::SMED3", AMDGPUDTIntTernaryOp,
-  []
->;
-
-def AMDGPUumed3 : SDNode<"AMDGPUISD::UMED3", AMDGPUDTIntTernaryOp,
-  []
->;
 
 def AMDGPUfmed3_impl : SDNode<"AMDGPUISD::FMED3", SDTFPTernaryOp, []>;
 
+def AMDGPUsmed3_impl : SDNode<"AMDGPUISD::SMED3", AMDGPUDTIntTernaryOp, []>;
+
+def AMDGPUumed3_impl : SDNode<"AMDGPUISD::UMED3", AMDGPUDTIntTernaryOp, []>;
+
 def AMDGPUfdot2_impl : SDNode<"AMDGPUISD::FDOT2",
                   SDTypeProfile<1, 4, [SDTCisSameAs<0, 3>, SDTCisSameAs<1, 2>,
                                        SDTCisFP<0>, SDTCisVec<1>,
@@ -448,6 +445,14 @@ def AMDGPUfmed3 : PatFrags<(ops node:$src0, node:$src1, node:$src2),
   [(int_amdgcn_fmed3 node:$src0, node:$src1, node:$src2),
    (AMDGPUfmed3_impl node:$src0, node:$src1, node:$src2)]>;
 
+def AMDGPUsmed3 : PatFrags<(ops node:$src0, node:$src1, node:$src2),
+  [(int_amdgcn_smed3 node:$src0, node:$src1, node:$src2),
+   (AMDGPUsmed3_impl node:$src0, node:$src1, node:$src2)]>;
+
+def AMDGPUumed3 : PatFrags<(ops node:$src0, node:$src1, node:$src2),
+  [(int_amdgcn_umed3 node:$src0, node:$src1, node:$src2),
+   (AMDGPUumed3_impl node:$src0, node:$src1, node:$src2)]>;
+
 def AMDGPUdiv_fixup : PatFrags<(ops node:$src0, node:$src1, node:$src2),
   [(int_amdgcn_div_fixup node:$src0, node:$src1, node:$src2),
    (AMDGPUdiv_fixup_impl node:$src0, node:$src1, node:$src2)]>;
diff --git a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
index f18536cd4ab93..5da1e04c58bae 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
@@ -7798,6 +7798,28 @@ bool AMDGPULegalizerInfo::legalizeIntrinsic(LegalizerHelper &Helper,
     Observer.changedInstr(MI);
     return true;
   }
+  case Intrinsic::amdgcn_smed3: {
+    GISelChangeObserver &Observer = Helper.Observer;
+
+    // FIXME: This is to workaround the inability of tablegen match combiners to
+    // match intrinsics in patterns.
+    Observer.changingInstr(MI);
+    MI.setDesc(B.getTII().get(AMDGPU::G_AMDGPU_SMED3));
+    MI.removeOperand(1);
+    Observer.changedInstr(MI);
+    return true;
+  }
+  case Intrinsic::amdgcn_umed3: {
+    GISelChangeObserver &Observer = Helper.Observer;
+
+    // FIXME: This is to workaround the inability of tablegen match combiners to
+    // match intrinsics in patterns.
+    Observer.changingInstr(MI);
+    MI.setDesc(B.getTII().get(AMDGPU::G_AMDGPU_UMED3));
+    MI.removeOperand(1);
+    Observer.changedInstr(MI);
+    return true;
+  }
   case Intrinsic::amdgcn_readlane:
   case Intrinsic::amdgcn_writelane:
   case Intrinsic::amdgcn_readfirstlane:
diff --git a/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
index 36b27bef350ed..63141d065bf65 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
@@ -4136,6 +4136,7 @@ AMDGPURegisterBankInfo::getInstrMapping(const MachineInstr &MI) const {
   case AMDGPU::G_AMDGPU_CVT_F32_UBYTE3:
   case AMDGPU::G_AMDGPU_CVT_PK_I16_I32:
   case AMDGPU::G_AMDGPU_SMED3:
+  case AMDGPU::G_AMDGPU_UMED3:
   case AMDGPU::G_AMDGPU_FMED3:
     return getDefaultMappingVOP(MI);
   case AMDGPU::G_UMULH:
@@ -4660,6 +4661,8 @@ AMDGPURegisterBankInfo::getInstrMapping(const MachineInstr &MI) const {
     case Intrinsic::amdgcn_cvt_scalef32_sr_pk16_bf6_bf16:
     case Intrinsic::amdgcn_sat_pk4_i4_i8:
     case Intrinsic::amdgcn_sat_pk4_u4_u8:
+    case Intrinsic::amdgcn_smed3:
+    case Intrinsic::amdgcn_umed3:
     case Intrinsic::amdgcn_fmed3:
     case Intrinsic::amdgcn_cubeid:
     case Intrinsic::amdgcn_cubema:
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.smed3.i16.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.smed3.i16.ll
new file mode 100644
index 0000000000000..0f6f00309401c
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.smed3.i16.ll
@@ -0,0 +1,27 @@
+; RUN: llc -mtriple=amdgcn -mcpu=gfx900 < %s | FileCheck -check-prefix=GCN %s
+
+; GCN-LABEL: {{^}}test_smed3_i16:
+; GCN: v_med3_i16 v{{[0-9]+}}, s{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}}
+define amdgpu_kernel void @test_smed3_i16(ptr addrspace(1) %out, i32 %src0.arg, i32 %src1.arg, i32 %src2.arg) #1 {
+  %src0.i16 = trunc i32 %src0.arg to i16
+  %src1.i16 = trunc i32 %src1.arg to i16
+  %src2.i16 = trunc i32 %src2.arg to i16
+  %med3 = call i16 @llvm.amdgcn.smed3.i16(i16 %src0.i16, i16 %src1.i16, i16 %src2.i16)
+  store i16 %med3, ptr addrspace(1) %out
+  ret void
+}
+
+; GCN-LABEL: {{^}}test_smed3_zero_i16:
+; GCN: v_med3_i16 v{{[0-9]+}}, s{{[0-9]+}}, v{{[0-9]+}}, 0
+define amdgpu_kernel void @test_smed3_zero_i16(ptr addrspace(1) %out, i32 %src0.arg, i32 %src1.arg) #1 {
+  %src0.i16 = trunc i32 %src0.arg to i16
+  %src1.i16 = trunc i32 %src1.arg to i16
+  %med3 = call i16 @llvm.amdgcn.smed3.i16(i16 %src0.i16, i16 %src1.i16, i16 0)
+  store i16 %med3, ptr addrspace(1) %out
+  ret void
+}
+
+declare i16 @llvm.amdgcn.smed3.i16(i16, i16, i16) #0
+
+attributes #0 = { nounwind readnone }
+attributes #1 = { nounwind }
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.smed3.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.smed3.ll
new file mode 100644
index 0000000000000..250fdc0d2d78d
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.smed3.ll
@@ -0,0 +1,42 @@
+; RUN: llc -mtriple=amdgcn < %s | FileCheck -check-prefix=GCN %s
+; RUN: llc -mtriple=amdgcn -mcpu=tonga < %s | FileCheck -check-prefix=GCN %s
+
+; GCN-LABEL: {{^}}test_smed3:
+; GCN: v_med3_i32 v{{[0-9]+}}, s{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}}
+define amdgpu_kernel void @test_smed3(ptr addrspace(1) %out, i32 %src0, i32 %src1, i32 %src2) #1 {
+  %med3 = call i32 @llvm.amdgcn.smed3.i32(i32 %src0, i32 %src1, i32 %src2)
+  store i32 %med3, ptr addrspace(1) %out
+  ret void
+}
+
+; GCN-LABEL: {{^}}test_smed3_multi_use:
+; GCN: v_med3_i32 [[MED3:v[0-9]+]], s{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}}
+; GCN: v_mul_lo_i32 v{{[0-9]+}}, [[MED3]], s{{[0-9]+}}
+define amdgpu_kernel void @test_smed3_multi_use(ptr addrspace(1) %out, i32 %src0, i32 %src1, i32 %src2, i32 %mul.arg) #1 {
+  %med3 = call i32 @llvm.amdgcn.smed3.i32(i32 %src0, i32 %src1, i32 %src2)
+  %med3.user = mul i32 %med3, %mul.arg
+  store volatile i32 %med3.user, ptr addrspace(1) %out
+  store volatile i32 %med3, ptr addrspace(1) %out
+  ret void
+}
+
+; GCN-LABEL: {{^}}test_smed3_constants:
+; GCN: v_med3_i32 v{{[0-9]+}}, s{{[0-9]+}}, v{{[0-9]+}}, 42
+define amdgpu_kernel void @test_smed3_constants(ptr addrspace(1) %out, i32 %src0, i32 %src1) #1 {
+  %med3 = call i32 @llvm.amdgcn.smed3.i32(i32 %src0, i32 %src1, i32 42)
+  store i32 %med3, ptr addrspace(1) %out
+  ret void
+}
+
+; GCN-LABEL: {{^}}test_smed3_zero:
+; GCN: v_med3_i32 v{{[0-9]+}}, s{{[0-9]+}}, v{{[0-9]+}}, 0
+define amdgpu_kernel void @test_smed3_zero(ptr addrspace(1) %out, i32 %src0, i32 %src1) #1 {
+  %med3 = call i32 @llvm.amdgcn.smed3.i32(i32 %src0, i32 %src1, i32 0)
+  store i32 %med3, ptr addrspace(1) %out
+  ret void
+}
+
+declare i32 @llvm.amdgcn.smed3.i32(i32, i32, i32) #0
+
+attributes #0 = { nounwind readnone }
+attributes #1 = { nounwind }
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.umed3.i16.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.umed3.i16.ll
new file mode 100644
index 0000000000000..d484e8a4b0804
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.umed3.i16.ll
@@ -0,0 +1,27 @@
+; RUN: llc -mtriple=amdgcn -mcpu=gfx900 < %s | FileCheck -check-prefix=GCN %s
+
+; GCN-LABEL: {{^}}test_umed3_i16:
+; GCN: v_med3_u16 v{{[0-9]+}}, s{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}}
+define amdgpu_kernel void @test_umed3_i16(ptr addrspace(1) %out, i32 %src0.arg, i32 %src1.arg, i32 %src2.arg) #1 {
+  %src0.i16 = trunc i32 %src0.arg to i16
+  %src1.i16 = trunc i32 %src1.arg to i16
+  %src2.i16 = trunc i32 %src2.arg to i16
+  %med3 = call i16 @llvm.amdgcn.umed3.i16(i16 %src0.i16, i16 %src1.i16, i16 %src2.i16)
+  store i16 %med3, ptr addrspace(1) %out
+  ret void
+}
+
+; GCN-LABEL: {{^}}test_umed3_zero_i16:
+; GCN: v_med3_u16 v{{[0-9]+}}, s{{[0-9]+}}, v{{[0-9]+}}, 0
+define amdgpu_kernel void @test_umed3_zero_i16(ptr addrspace(1) %out, i32 %src0.arg, i32 %src1.arg) #1 {
+  %src0.i16 = trunc i32 %src0.arg to i16
+  %src1.i16 = trunc i32 %src1.arg to i16
+  %med3 = call i16 @llvm.amdgcn.umed3.i16(i16 %src0.i16, i16 %src1.i16, i16 0)
+  store i16 %med3, ptr addrspace(1) %out
+  ret void
+}
+
+declare i16 @llvm.amdgcn.umed3.i16(i16, i16, i16) #0
+
+attributes #0 = { nounwind readnone }
+attributes #1 = { nounwind }
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.umed3.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.umed3.ll
new file mode 100644
index 0000000000000..e1bec276d1fb6
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.umed3.ll
@@ -0,0 +1,42 @@
+; RUN: llc -mtriple=amdgcn < %s | FileCheck -check-prefix=GCN %s
+; RUN: llc -mtriple=amdgcn -mcpu=tonga < %s | FileCheck -check-prefix=GCN %s
+
+; GCN-LABEL: {{^}}test_umed3:
+; GCN: v_med3_u32 v{{[0-9]+}}, s{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}}
+define amdgpu_kernel void @test_umed3(ptr addrspace(1) %out, i32 %src0, i32 %src1, i32 %src2) #1 {
+  %med3 = call i32 @llvm.amdgcn.umed3.i32(i32 %src0, i32 %src1, i32 %src2)
+  store i32 %med3, ptr addrspace(1) %out
+  ret void
+}
+
+; GCN-LABEL: {{^}}test_umed3_multi_use:...
[truncated]

@keshavvinayak01 keshavvinayak01 marked this pull request as draft September 11, 2025 13:51
@arsenm arsenm left a comment

It is not necessary to add new intrinsics for these operations. You are better off writing the med3 in terms of min and max and letting the backend deal with it. The effort of fully supporting all analyses and optimizations on a new operation is very high

Value *Src0 = II.getArgOperand(0);
Value *Src1 = II.getArgOperand(1);
Value *Src2 = II.getArgOperand(2);

Contributor

This code is untested and these intrinsics are not universally supported

@keshavvinayak01
Contributor Author

It is not necessary to add new intrinsics for these operations. You are better off writing the med3 in terms of min and max and letting the backend deal with it. The effort of fully supporting all analyses and optimizations on a new operation is very high

@arsenm I see that we already have support for lowering fmed3 all the way down to the supported AMDGPU V_MED_F32 / V_MED_F16 ops, see here. Why can't we also add similar intrinsics for SMED and UMED when the hardware already supports those instructions?

@arsenm
Contributor

arsenm commented Sep 11, 2025

Why can't we also add similar intrinsics for SMED and UMED when the hardware already supports those instructions?

We probably shouldn't have or use the fmed3 intrinsic. The fmed3 case is special due to the unreasonable signaling nan behavior. The full work to optimize around that was also never implemented

@keshavvinayak01
Contributor Author

It is not necessary to add new intrinsics for these operations. You are better off writing the med3 in terms of min and max and letting the backend deal with it. The effort of fully supporting all analyses and optimizations on a new operation is very high

So you're suggesting I rewrite the rocdl -> llvmir to use min/max ops in LLVMIR?
I do think it's still useful to introduce the rocdl ops, see #157052

@krzysz00
Contributor

If the backend can reliably produce med3 out of min and max, then we shouldn't be adding intrinsics - let the compiler do its thing

@krzysz00
Contributor

(and if the compiler can't reliably produce these, we may want to investigate if the backend can be made to do so)

@keshavvinayak01 keshavvinayak01 changed the title [AMDGPU] [ROCDL] Added Intrinsics for smed, umed, to support ISA instructions from ROCDL [ROCDL] Added rocdl.fmed3 -> Intrinsic::amdgcn_fmed3 Sep 15, 2025
// MED3 operations
//===----------------------------------------------------------------------===//

def ROCDL_Med3F16Op : ROCDL_ConcreteNonMemIntrOp<"med3.f16", [Pure], 1>,
Contributor

Strong reject of having two ops that differ only in type. Please make this one operation with a variadic type. There should be existing examples.
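One way to follow this suggestion is a single op whose element type is overloaded rather than baked into the mnemonic. A rough TableGen sketch, modeled on how other ROCDL intrinsic wrappers are written; the exact helper class names (`ROCDL_IntrOp`, `LLVM_ScalarOrVectorOf`) and their parameter lists are assumptions and may be spelled differently in ROCDLOps.td:

```tablegen
// Hypothetical: one rocdl.fmed3 op with an overloaded result type
// (f16 or f32, scalar or vector), replacing med3.f16 / med3.f32 twins.
// The [0] marks result 0 as the overloaded type of the LLVM intrinsic.
def ROCDL_FMed3Op : ROCDL_IntrOp<"fmed3", [0], [], [Pure], 1>,
    Arguments<(ins LLVM_ScalarOrVectorOf<AnyFloat>:$src0,
                   LLVM_ScalarOrVectorOf<AnyFloat>:$src1,
                   LLVM_ScalarOrVectorOf<AnyFloat>:$src2)> {
  let assemblyFormat =
      "$src0 `,` $src1 `,` $src2 attr-dict `:` type($res)";
}
```

With the overloaded result, translation to LLVM IR picks the right `llvm.amdgcn.fmed3.*` overload from the operand type, so no per-type op definitions are needed.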
