[AArch64][ARM] Optimize more `tbl`/`tbx` calls into `shufflevector` #169748

valadaptive · 2025-11-26T23:48:11Z

Resolves #169701. This PR depends on #169589; the last two commits are new.

This PR extends the existing InstCombine fold that converts tbl1 intrinsics to shufflevector when the mask operand is constant. Before this change, the fold only handled 64-bit tbl1 intrinsics with no out-of-bounds indices. It now supports both 64-bit and 128-bit vectors and handles the full range of tbl1-tbl4 and tbx1-tbx4, as long as the mask actually indexes into at most two of the input operands.
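To make the out-of-bounds behavior concrete, here is a minimal reference model of the lookup in plain C++. It is only a sketch: `tableLookup` and its parameters are hypothetical names, not part of the patch or of LLVM, and the table is modelled as the concatenation of all source operands.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical reference model, not LLVM code.
// `table` is the concatenation of every source operand.
// `fallback` is null for tbl and points at the fallback vector for tbx.
std::vector<uint8_t> tableLookup(const std::vector<uint8_t> &table,
                                 const std::vector<uint8_t> &indices,
                                 const std::vector<uint8_t> *fallback) {
  // tbx needs exactly one fallback lane per result lane.
  assert(!fallback || fallback->size() == indices.size());

  std::vector<uint8_t> result(indices.size());
  for (std::size_t lane = 0; lane < indices.size(); ++lane) {
    if (indices[lane] < table.size()) {
      // In-bounds index: plain byte lookup.
      result[lane] = table[indices[lane]];
    } else {
      // Out-of-bounds index: tbx keeps the fallback lane, tbl yields zero.
      result[lane] = fallback ? (*fallback)[lane] : 0;
    }
  }
  return result;
}
```

Calling it with a null `fallback` models tbl; passing a vector models tbx, where an out-of-bounds index at lane `i` selects lane `i` of the fallback.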

For tbl, any out-of-bounds index requires a dummy vector of zeroes; for tbx, it requires the "fallback" operand. Either one takes up one of the two shufflevector operands.
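The operand accounting described above can be sketched in plain C++, without any LLVM types. This is an illustrative model, not the patch's code: `ShufflePlan` and `planShuffle` are hypothetical names, operands are represented by small integer ids (source `k` is id `k`, and the zero vector or fallback is id `numSources`), and an undef mask element is modelled as `std::nullopt`. Unlike the patch, it does not deduplicate a value that is passed as several operands.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical result type: which operands the shuffle uses, and its mask.
struct ShufflePlan {
  std::vector<unsigned> operands; // at most two ids, in first-use order
  std::vector<int> mask;          // shufflevector-style mask, -1 = undefined
};

// Returns std::nullopt when the lookup cannot be expressed as a two-operand
// shuffle under this model's assumptions.
std::optional<ShufflePlan>
planShuffle(const std::vector<std::optional<uint64_t>> &indices,
            unsigned numSources, unsigned elemsPerSource, bool isExtension) {
  assert(elemsPerSource > 0 && "sources must have at least one element");
  // Bail out if the result is wider than a single source.
  if (indices.size() > elemsPerSource)
    return std::nullopt;

  ShufflePlan plan;
  for (std::size_t lane = 0; lane < indices.size(); ++lane) {
    if (!indices[lane]) { // undef index becomes an undefined mask lane
      plan.mask.push_back(-1);
      continue;
    }
    uint64_t operand = *indices[lane] / elemsPerSource;
    uint64_t element = *indices[lane] % elemsPerSource;
    if (operand >= numSources) {
      // Out of bounds: use the extra operand (zeroes for tbl, fallback for
      // tbx). The fallback must be as wide as a source in this model.
      if (isExtension && indices.size() != elemsPerSource)
        return std::nullopt;
      operand = numSources;
      element = isExtension ? lane : 0;
    }
    // Find the shuffle slot for this operand, allocating one if needed.
    std::size_t slot = 0;
    while (slot < plan.operands.size() && plan.operands[slot] != operand)
      ++slot;
    if (slot == plan.operands.size()) {
      if (slot == 2) // a third distinct operand cannot be represented
        return std::nullopt;
      plan.operands.push_back(static_cast<unsigned>(operand));
    }
    plan.mask.push_back(static_cast<int>(slot * elemsPerSource + element));
  }
  return plan;
}
```

For a tbl2-style lookup over two 4-element sources with mask `{5, 0, undef, 7}`, the model picks source 1 as the first shuffle operand and source 0 as the second, and produces the mask `{1, 4, -1, 3}`. A mask that touches both sources and also contains an out-of-bounds index needs a third operand, so the model gives up, which is the "at most two operands" restriction from the description.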

This works a lot like #169110, with some added complexity because we need to handle multiple operands. I raised a couple of questions in that PR that still need to be answered:

Is it correct to check isa<UndefValue> for each mask index and set the output mask index to -1 when the check succeeds? This is later folded to a poison value, and I'm not sure about the subtle differences between poison and undef, or when one can be substituted for the other. As I mentioned in [WebAssembly] Fold constant i8x16.swizzle and i8x16.relaxed.swizzle to shufflevector #169110, the existing x86 pass (simplifyX86vpermilvar) already treats undef this way.
How can I write an Alive2 proof for this? It's very hard to find good documentation or tutorials about Alive2.

As with #169110, most of the regression test cases were generated using Claude. Everything else was written by me.

llvmbot · 2025-11-26T23:48:55Z

@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-backend-arm

Author: None (valadaptive)

Changes

Patch is 57.21 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/169748.diff

16 Files Affected:

(modified) llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp (+21)
(modified) llvm/lib/Target/AArch64/CMakeLists.txt (+1)
(modified) llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp (+23)
(modified) llvm/lib/Target/ARM/CMakeLists.txt (+1)
(added) llvm/lib/Target/ARMCommon/ARMCommonInstCombineIntrinsic.cpp (+219)
(added) llvm/lib/Target/ARMCommon/ARMCommonInstCombineIntrinsic.h (+56)
(added) llvm/lib/Target/ARMCommon/CMakeLists.txt (+8)
(modified) llvm/lib/Target/CMakeLists.txt (+5)
(modified) llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp (-104)
(modified) llvm/test/Transforms/InstCombine/AArch64/aes-intrinsics.ll (+1-1)
(added) llvm/test/Transforms/InstCombine/AArch64/tbl.ll (+269)
(removed) llvm/test/Transforms/InstCombine/AArch64/tbl1.ll (-65)
(modified) llvm/test/Transforms/InstCombine/ARM/2012-04-23-Neon-Intrinsics.ll (+1-1)
(modified) llvm/test/Transforms/InstCombine/ARM/aes-intrinsics.ll (+1-1)
(added) llvm/test/Transforms/InstCombine/ARM/tbl.ll (+215)
(removed) llvm/test/Transforms/InstCombine/ARM/tbl1.ll (-35)

diff --git a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
index 0bae00bafee3c..4a53e5bd49c70 100644
--- a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
+++ b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
@@ -7,6 +7,7 @@
 //===----------------------------------------------------------------------===//
 
 #include "AArch64TargetTransformInfo.h"
+#include "../ARMCommon/ARMCommonInstCombineIntrinsic.h"
 #include "AArch64ExpandImm.h"
 #include "AArch64PerfectShuffle.h"
 #include "AArch64SMEAttributes.h"
@@ -2856,6 +2857,26 @@ AArch64TTIImpl::instCombineIntrinsic(InstCombiner &IC,
   case Intrinsic::aarch64_neon_fmaxnm:
   case Intrinsic::aarch64_neon_fminnm:
     return instCombineMaxMinNM(IC, II);
+  case Intrinsic::aarch64_neon_tbl1:
+  case Intrinsic::aarch64_neon_tbl2:
+  case Intrinsic::aarch64_neon_tbl3:
+  case Intrinsic::aarch64_neon_tbl4:
+    return ARMCommon::simplifyNeonTbl(II, IC, /*IsExtension=*/false);
+  case Intrinsic::aarch64_neon_tbx1:
+  case Intrinsic::aarch64_neon_tbx2:
+  case Intrinsic::aarch64_neon_tbx3:
+  case Intrinsic::aarch64_neon_tbx4:
+    return ARMCommon::simplifyNeonTbl(II, IC, /*IsExtension=*/true);
+  case Intrinsic::aarch64_neon_smull:
+  case Intrinsic::aarch64_neon_umull: {
+    bool IsSigned = IID == Intrinsic::aarch64_neon_smull;
+    return ARMCommon::simplifyNeonMultiply(II, IC, IsSigned);
+  }
+  case Intrinsic::aarch64_crypto_aesd:
+  case Intrinsic::aarch64_crypto_aese:
+  case Intrinsic::aarch64_sve_aesd:
+  case Intrinsic::aarch64_sve_aese:
+    return ARMCommon::simplifyAES(II, IC);
   case Intrinsic::aarch64_sve_convert_from_svbool:
     return instCombineConvertFromSVBool(IC, II);
   case Intrinsic::aarch64_sve_dup:
diff --git a/llvm/lib/Target/AArch64/CMakeLists.txt b/llvm/lib/Target/AArch64/CMakeLists.txt
index 285d646293eb7..d27a698ee9e4a 100644
--- a/llvm/lib/Target/AArch64/CMakeLists.txt
+++ b/llvm/lib/Target/AArch64/CMakeLists.txt
@@ -101,6 +101,7 @@ add_llvm_target(AArch64CodeGen
   AArch64Desc
   AArch64Info
   AArch64Utils
+  ARMCommon
   Analysis
   AsmPrinter
   CFGuard
diff --git a/llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp b/llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
index fdb0ec40cb41f..99d57b00315b1 100644
--- a/llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
+++ b/llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
@@ -7,6 +7,7 @@
 //===----------------------------------------------------------------------===//
 
 #include "ARMTargetTransformInfo.h"
+#include "../ARMCommon/ARMCommonInstCombineIntrinsic.h"
 #include "ARMSubtarget.h"
 #include "MCTargetDesc/ARMAddressingModes.h"
 #include "llvm/ADT/APInt.h"
@@ -182,6 +183,28 @@ ARMTTIImpl::instCombineIntrinsic(InstCombiner &IC, IntrinsicInst &II) const {
     break;
   }
 
+  case Intrinsic::arm_neon_vtbl1:
+  case Intrinsic::arm_neon_vtbl2:
+  case Intrinsic::arm_neon_vtbl3:
+  case Intrinsic::arm_neon_vtbl4:
+    return ARMCommon::simplifyNeonTbl(II, IC, /*IsExtension=*/false);
+
+  case Intrinsic::arm_neon_vtbx1:
+  case Intrinsic::arm_neon_vtbx2:
+  case Intrinsic::arm_neon_vtbx3:
+  case Intrinsic::arm_neon_vtbx4:
+    return ARMCommon::simplifyNeonTbl(II, IC, /*IsExtension=*/true);
+
+  case Intrinsic::arm_neon_vmulls:
+  case Intrinsic::arm_neon_vmullu: {
+    bool IsSigned = IID == Intrinsic::arm_neon_vmulls;
+    return ARMCommon::simplifyNeonMultiply(II, IC, IsSigned);
+  }
+
+  case Intrinsic::arm_neon_aesd:
+  case Intrinsic::arm_neon_aese:
+    return ARMCommon::simplifyAES(II, IC);
+
   case Intrinsic::arm_mve_pred_i2v: {
     Value *Arg = II.getArgOperand(0);
     Value *ArgArg;
diff --git a/llvm/lib/Target/ARM/CMakeLists.txt b/llvm/lib/Target/ARM/CMakeLists.txt
index eb3ad01a54fb2..9fc9bc134e5cc 100644
--- a/llvm/lib/Target/ARM/CMakeLists.txt
+++ b/llvm/lib/Target/ARM/CMakeLists.txt
@@ -73,6 +73,7 @@ add_llvm_target(ARMCodeGen
   Thumb2SizeReduction.cpp
 
   LINK_COMPONENTS
+  ARMCommon
   ARMDesc
   ARMInfo
   ARMUtils
diff --git a/llvm/lib/Target/ARMCommon/ARMCommonInstCombineIntrinsic.cpp b/llvm/lib/Target/ARMCommon/ARMCommonInstCombineIntrinsic.cpp
new file mode 100644
index 0000000000000..df58dbc6df38f
--- /dev/null
+++ b/llvm/lib/Target/ARMCommon/ARMCommonInstCombineIntrinsic.cpp
@@ -0,0 +1,219 @@
+//===- ARMCommonInstCombineIntrinsic.cpp -
+//                  instCombineIntrinsic opts for both ARM and AArch64  ---===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+///
+/// \file
+/// This file contains optimizations for ARM and AArch64 intrinsics that
+/// are shared between both architectures. These functions can be called from:
+/// - ARM TTI's instCombineIntrinsic (for arm_neon_* intrinsics)
+/// - AArch64 TTI's instCombineIntrinsic (for aarch64_neon_* and aarch64_sve_*
+///   intrinsics)
+///
+//===----------------------------------------------------------------------===//
+
+#include "ARMCommonInstCombineIntrinsic.h"
+#include "llvm/IR/Constants.h"
+#include "llvm/IR/DerivedTypes.h"
+#include "llvm/IR/IntrinsicInst.h"
+#include "llvm/IR/Value.h"
+#include "llvm/Transforms/InstCombine/InstCombiner.h"
+
+using namespace llvm;
+using namespace llvm::PatternMatch;
+
+namespace llvm {
+namespace ARMCommon {
+
+/// Convert `tbl`/`tbx` intrinsics to shufflevector if the mask is constant, and
+/// at most two source operands are actually referenced.
+Instruction *simplifyNeonTbl(IntrinsicInst &II, InstCombiner &IC,
+                             bool IsExtension) {
+  // Bail out if the mask is not a constant.
+  auto *C = dyn_cast<Constant>(II.getArgOperand(II.arg_size() - 1));
+  if (!C)
+    return nullptr;
+
+  auto *RetTy = cast<FixedVectorType>(II.getType());
+  unsigned NumIndexes = RetTy->getNumElements();
+
+  // Only perform this transformation for <8 x i8> and <16 x i8> vector types.
+  // Even the language-level intrinsics that operate on u8/p8 should lower to an
+  // LLVM intrinsic that operates on i8.
+  if (!(RetTy->getElementType()->isIntegerTy(8) &&
+        (NumIndexes == 8 || NumIndexes == 16)))
+    return nullptr;
+
+  // For tbx instructions, the first argument is the "fallback" vector, which
+  // has the same length as the mask and return type.
+  unsigned int StartIndex = (unsigned)IsExtension;
+  auto *SourceTy =
+      cast<FixedVectorType>(II.getArgOperand(StartIndex)->getType());
+  // Note that the element count of each source vector does *not* need to be the
+  // same as the element count of the return type and mask! All source vectors
+  // must have the same element count as each other, though.
+  unsigned NumElementsPerSource = SourceTy->getNumElements();
+
+  // There are no tbl/tbx intrinsics for which the destination size exceeds the
+  // source size. However, our definitions of the intrinsics, at least in
+  // IntrinsicsAArch64.td, allow for arbitrary destination vector sizes, so it
+  // *could* technically happen.
+  if (NumIndexes > NumElementsPerSource) {
+    return nullptr;
+  }
+
+  // The tbl/tbx intrinsics take several source operands followed by a mask
+  // operand.
+  unsigned int NumSourceOperands = II.arg_size() - 1 - (unsigned)IsExtension;
+
+  // Map input operands to shuffle indices. This also helpfully deduplicates the
+  // input arguments, in case the same value is passed as an argument multiple
+  // times.
+  SmallDenseMap<Value *, unsigned, 2> ValueToShuffleSlot;
+  Value *ShuffleOperands[2] = {PoisonValue::get(SourceTy),
+                               PoisonValue::get(SourceTy)};
+
+  int Indexes[16];
+  for (unsigned I = 0; I < NumIndexes; ++I) {
+    Constant *COp = C->getAggregateElement(I);
+
+    if (!COp || (!isa<UndefValue>(COp) && !isa<ConstantInt>(COp)))
+      return nullptr;
+
+    if (isa<UndefValue>(COp)) {
+      Indexes[I] = -1;
+      continue;
+    }
+
+    uint64_t Index = cast<ConstantInt>(COp)->getZExtValue();
+    // The index of the input argument that this index references (0 = first
+    // source argument, etc).
+    unsigned SourceOperandIndex = Index / NumElementsPerSource;
+    // The index of the element at that source operand.
+    unsigned SourceOperandElementIndex = Index % NumElementsPerSource;
+
+    Value *SourceOperand;
+    if (SourceOperandIndex >= NumSourceOperands) {
+      // This index is out of bounds. Map it to index into either the fallback
+      // vector (tbx) or vector of zeroes (tbl).
+      SourceOperandIndex = NumSourceOperands;
+      if (IsExtension) {
+        // For out-of-bounds indices in tbx, choose the `I`th element of the
+        // fallback.
+        SourceOperand = II.getArgOperand(0);
+        SourceOperandElementIndex = I;
+      } else {
+        // Otherwise, choose some element from the dummy vector of zeroes (we'll
+        // always choose the first).
+        SourceOperand = Constant::getNullValue(SourceTy);
+        SourceOperandElementIndex = 0;
+      }
+    } else {
+      SourceOperand = II.getArgOperand(SourceOperandIndex + StartIndex);
+    }
+
+    // The source operand may be the fallback vector, which may not have the
+    // same number of elements as the source vector. In that case, we *could*
+    // choose to extend its length with another shufflevector, but it's simpler
+    // to just bail instead.
+    if (cast<FixedVectorType>(SourceOperand->getType())->getNumElements() !=
+        NumElementsPerSource) {
+      return nullptr;
+    }
+
+    // We now know the source operand referenced by this index. Make it a
+    // shufflevector operand, if it isn't already.
+    unsigned NumSlots = ValueToShuffleSlot.size();
+    // This shuffle references more than two sources, and hence cannot be
+    // represented as a shufflevector.
+    if (NumSlots == 2 && !ValueToShuffleSlot.contains(SourceOperand)) {
+      return nullptr;
+    }
+    auto [It, Inserted] =
+        ValueToShuffleSlot.try_emplace(SourceOperand, NumSlots);
+    if (Inserted) {
+      ShuffleOperands[It->getSecond()] = SourceOperand;
+    }
+
+    unsigned RemappedIndex =
+        (It->getSecond() * NumElementsPerSource) + SourceOperandElementIndex;
+    Indexes[I] = RemappedIndex;
+  }
+
+  Value *Shuf = IC.Builder.CreateShuffleVector(
+      ShuffleOperands[0], ShuffleOperands[1], ArrayRef(Indexes, NumIndexes));
+  return IC.replaceInstUsesWith(II, Shuf);
+}
+
+/// Simplify NEON multiply-long intrinsics (smull, umull).
+/// These intrinsics perform widening multiplies: they multiply two vectors of
+/// narrow integers and produce a vector of wider integers. This function
+/// performs algebraic simplifications:
+/// 1. Multiply by zero => zero vector
+/// 2. Multiply by one => zero/sign-extend the non-one operand
+/// 3. Both operands constant => regular multiply that can be constant-folded
+///    later
+Instruction *simplifyNeonMultiply(IntrinsicInst &II, InstCombiner &IC,
+                                  bool IsSigned) {
+  Value *Arg0 = II.getArgOperand(0);
+  Value *Arg1 = II.getArgOperand(1);
+
+  // Handle mul by zero first:
+  if (isa<ConstantAggregateZero>(Arg0) || isa<ConstantAggregateZero>(Arg1)) {
+    return IC.replaceInstUsesWith(II, ConstantAggregateZero::get(II.getType()));
+  }
+
+  // Check for constant LHS & RHS - in this case we just simplify.
+  VectorType *NewVT = cast<VectorType>(II.getType());
+  if (Constant *CV0 = dyn_cast<Constant>(Arg0)) {
+    if (Constant *CV1 = dyn_cast<Constant>(Arg1)) {
+      Value *V0 = IC.Builder.CreateIntCast(CV0, NewVT, IsSigned);
+      Value *V1 = IC.Builder.CreateIntCast(CV1, NewVT, IsSigned);
+      return IC.replaceInstUsesWith(II, IC.Builder.CreateMul(V0, V1));
+    }
+
+    // Couldn't simplify - canonicalize constant to the RHS.
+    std::swap(Arg0, Arg1);
+  }
+
+  // Handle mul by one:
+  if (Constant *CV1 = dyn_cast<Constant>(Arg1))
+    if (ConstantInt *Splat =
+            dyn_cast_or_null<ConstantInt>(CV1->getSplatValue()))
+      if (Splat->isOne())
+        return CastInst::CreateIntegerCast(Arg0, II.getType(), IsSigned);
+
+  return nullptr;
+}
+
+/// Simplify AES encryption/decryption intrinsics (AESE, AESD).
+///
+/// ARM's AES instructions (AESE/AESD) XOR the data and the key, provided as
+/// separate arguments, before performing the encryption/decryption operation.
+/// We can fold that "internal" XOR with a previous one.
+Instruction *simplifyAES(IntrinsicInst &II, InstCombiner &IC) {
+  Value *DataArg = II.getArgOperand(0);
+  Value *KeyArg = II.getArgOperand(1);
+
+  // Accept zero on either operand.
+  if (!match(KeyArg, m_ZeroInt()))
+    std::swap(KeyArg, DataArg);
+
+  // Try to use the builtin XOR in AESE and AESD to eliminate a prior XOR
+  Value *Data, *Key;
+  if (match(KeyArg, m_ZeroInt()) &&
+      match(DataArg, m_Xor(m_Value(Data), m_Value(Key)))) {
+    IC.replaceOperand(II, 0, Data);
+    IC.replaceOperand(II, 1, Key);
+    return &II;
+  }
+
+  return nullptr;
+}
+
+} // namespace ARMCommon
+} // namespace llvm
diff --git a/llvm/lib/Target/ARMCommon/ARMCommonInstCombineIntrinsic.h b/llvm/lib/Target/ARMCommon/ARMCommonInstCombineIntrinsic.h
new file mode 100644
index 0000000000000..319aee48ccb0d
--- /dev/null
+++ b/llvm/lib/Target/ARMCommon/ARMCommonInstCombineIntrinsic.h
@@ -0,0 +1,56 @@
+//===- ARMCommonInstCombineIntrinsic.h -
+// instCombineIntrinsic opts for both ARM and AArch64 -----------*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+///
+/// \file
+/// This file contains optimizations for ARM and AArch64 intrinsics that
+/// are shared between both architectures. These functions can be called from:
+/// - ARM TTI's instCombineIntrinsic (for arm_neon_* intrinsics)
+/// - AArch64 TTI's instCombineIntrinsic (for aarch64_neon_* and aarch64_sve_*
+///   intrinsics)
+///
+//===----------------------------------------------------------------------===//
+
+#ifndef LLVM_LIB_TARGET_ARMCOMMON_ARMCOMMONINSTCOMBINEINTRINSIC_H
+#define LLVM_LIB_TARGET_ARMCOMMON_ARMCOMMONINSTCOMBINEINTRINSIC_H
+
+#include "llvm/IR/IntrinsicInst.h"
+#include "llvm/IR/Value.h"
+#include "llvm/Transforms/InstCombine/InstCombiner.h"
+
+namespace llvm {
+
+namespace ARMCommon {
+
+/// Convert `tbl`/`tbx` intrinsics to shufflevector if the mask is constant, and
+/// at most two source operands are actually referenced.
+Instruction *simplifyNeonTbl(IntrinsicInst &II, InstCombiner &IC,
+                             bool IsExtension);
+
+/// Simplify NEON multiply-long intrinsics (smull, umull).
+/// These intrinsics perform widening multiplies: they multiply two vectors of
+/// narrow integers and produce a vector of wider integers. This function
+/// performs algebraic simplifications:
+/// 1. Multiply by zero => zero vector
+/// 2. Multiply by one => zero/sign-extend the non-one operand
+/// 3. Both operands constant => regular multiply that can be constant-folded
+///    later
+Instruction *simplifyNeonMultiply(IntrinsicInst &II, InstCombiner &IC,
+                                  bool IsSigned);
+
+/// Simplify AES encryption/decryption intrinsics (AESE, AESD).
+///
+/// ARM's AES instructions (AESE/AESD) XOR the data and the key, provided as
+/// separate arguments, before performing the encryption/decryption operation.
+/// We can fold that "internal" XOR with a previous one.
+Instruction *simplifyAES(IntrinsicInst &II, InstCombiner &IC);
+
+} // namespace ARMCommon
+} // namespace llvm
+
+#endif // LLVM_LIB_TARGET_ARMCOMMON_ARMCOMMONINSTCOMBINEINTRINSIC_H
diff --git a/llvm/lib/Target/ARMCommon/CMakeLists.txt b/llvm/lib/Target/ARMCommon/CMakeLists.txt
new file mode 100644
index 0000000000000..1805a5df2f053
--- /dev/null
+++ b/llvm/lib/Target/ARMCommon/CMakeLists.txt
@@ -0,0 +1,8 @@
+add_llvm_component_library(LLVMARMCommon
+  ARMCommonInstCombineIntrinsic.cpp
+
+  LINK_COMPONENTS
+  Core
+  Support
+  TransformUtils
+  )
diff --git a/llvm/lib/Target/CMakeLists.txt b/llvm/lib/Target/CMakeLists.txt
index bcc13f942bf96..e3528014a4be2 100644
--- a/llvm/lib/Target/CMakeLists.txt
+++ b/llvm/lib/Target/CMakeLists.txt
@@ -31,6 +31,11 @@ if (NOT BUILD_SHARED_LIBS AND NOT APPLE AND
   set(CMAKE_CXX_VISIBILITY_PRESET hidden)
 endif()
 
+# Add shared ARM/AArch64 utilities if either target is being built
+if("ARM" IN_LIST LLVM_TARGETS_TO_BUILD OR "AArch64" IN_LIST LLVM_TARGETS_TO_BUILD)
+  add_subdirectory(ARMCommon)
+endif()
+
 foreach(t ${LLVM_TARGETS_TO_BUILD})
   message(STATUS "Targeting ${t}")
   add_subdirectory(${t})
diff --git a/llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp b/llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp
index 8e4edefec42fd..8a54c0dde6be6 100644
--- a/llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp
+++ b/llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp
@@ -737,44 +737,6 @@ static Instruction *foldCtpop(IntrinsicInst &II, InstCombinerImpl &IC) {
   return nullptr;
 }
 
-/// Convert a table lookup to shufflevector if the mask is constant.
-/// This could benefit tbl1 if the mask is { 7,6,5,4,3,2,1,0 }, in
-/// which case we could lower the shufflevector with rev64 instructions
-/// as it's actually a byte reverse.
-static Value *simplifyNeonTbl1(const IntrinsicInst &II,
-                               InstCombiner::BuilderTy &Builder) {
-  // Bail out if the mask is not a constant.
-  auto *C = dyn_cast<Constant>(II.getArgOperand(1));
-  if (!C)
-    return nullptr;
-
-  auto *VecTy = cast<FixedVectorType>(II.getType());
-  unsigned NumElts = VecTy->getNumElements();
-
-  // Only perform this transformation for <8 x i8> vector types.
-  if (!VecTy->getElementType()->isIntegerTy(8) || NumElts != 8)
-    return nullptr;
-
-  int Indexes[8];
-
-  for (unsigned I = 0; I < NumElts; ++I) {
-    Constant *COp = C->getAggregateElement(I);
-
-    if (!COp || !isa<ConstantInt>(COp))
-      return nullptr;
-
-    Indexes[I] = cast<ConstantInt>(COp)->getLimitedValue();
-
-    // Make sure the mask indices are in range.
-    if ((unsigned)Indexes[I] >= NumElts)
-      return nullptr;
-  }
-
-  auto *V1 = II.getArgOperand(0);
-  auto *V2 = Constant::getNullValue(V1->getType());
-  return Builder.CreateShuffleVector(V1, V2, ArrayRef(Indexes));
-}
-
 // Returns true iff the 2 intrinsics have the same operands, limiting the
 // comparison to the first NumOperands.
 static bool haveSameOperands(const IntrinsicInst &I, const IntrinsicInst &E,
@@ -3155,72 +3117,6 @@ Instruction *InstCombinerImpl::visitCallInst(CallInst &CI) {
         Intrinsic::getOrInsertDeclaration(II->getModule(), NewIntrin);
     return CallInst::Create(NewFn, CallArgs);
   }
-  case Intrinsic::arm_neon_vtbl1:
-  case Intrinsic::aarch64_neon_tbl1:
-    if (Value *V = simplifyNeonTbl1(*II, Builder))
-      return replaceInstUsesWith(*II, V);
-    break;
-
-  case Intrinsic::arm_neon_vmulls:
-  case Intrinsic::arm_neon_vmullu:
-  case Intrinsic::aarch64_neon_smull:
-  case Intrinsic::aarch64_neon_umull: {
-    Value *Arg0 = II->getArgOperand(0);
-    Value *Arg1 = II->getArgOperand(1);
-
-    // Handle mul by zero first:
-    if (isa<ConstantAggregateZero>(Arg0) || isa<ConstantAggregateZero>(Arg1)) {
-      return replaceInstUsesWith(CI, ConstantAggregateZero::get(II->getType()));
-    }
-
-    // Check for constant LHS & RHS - in this case we just simplify.
-    bool Zext = (IID == Intrinsic::arm_neon_vmullu ||
-                 IID == Intrinsic::aarch64_neon_umull);
-    VectorType *NewVT = cast<VectorType>(II->getType());
-    if (Constant *CV0 = dyn_cast<Constant>(Arg0)) {
-      if (Constant *CV1 = dyn_cast<Constant>(Arg1)) {
-        Value *V0 = Builder.CreateIntCast(CV0, NewVT, /*isSigned=*/!Zext);
-        Value *V1 = Builder.CreateIntCast(CV1, NewVT, /*isSigned=*/!Zext);
-        return replaceInstUsesWith(CI, Builder.CreateMul(V0, V1));
-      }
-
-      // Couldn't simplify - canonicalize constant to the RHS.
-      std::swap(Arg0, Arg1);
-    }
-
-    // Handle mul by one:
-    if (Constant *CV1 = dyn_cast<Constant>(Arg1))
-      if (ConstantInt *Splat =
-              dyn_cast_or_null<ConstantInt>(CV1->getSplatValue()))
-        if (Splat->isOne())
-          return CastInst::CreateIn...
[truncated]

llvmbot · 2025-11-26T23:48:56Z

@llvm/pr-subscribers-backend-aarch64

Author: None (valadaptive)

+#include "llvm/IR/Value.h"
+#include "llvm/Transforms/InstCombine/InstCombiner.h"
+
+using namespace llvm;
+using namespace llvm::PatternMatch;
+
+namespace llvm {
+namespace ARMCommon {
+
+/// Convert `tbl`/`tbx` intrinsics to shufflevector if the mask is constant, and
+/// at most two source operands are actually referenced.
+Instruction *simplifyNeonTbl(IntrinsicInst &II, InstCombiner &IC,
+                             bool IsExtension) {
+  // Bail out if the mask is not a constant.
+  auto *C = dyn_cast<Constant>(II.getArgOperand(II.arg_size() - 1));
+  if (!C)
+    return nullptr;
+
+  auto *RetTy = cast<FixedVectorType>(II.getType());
+  unsigned NumIndexes = RetTy->getNumElements();
+
+  // Only perform this transformation for <8 x i8> and <16 x i8> vector types.
+  // Even the language-level intrinsics that operate on u8/p8 should lower to an
+  // LLVM intrinsic that operates on i8.
+  if (!(RetTy->getElementType()->isIntegerTy(8) &&
+        (NumIndexes == 8 || NumIndexes == 16)))
+    return nullptr;
+
+  // For tbx instructions, the first argument is the "fallback" vector, which
+  // has the same length as the mask and return type.
+  unsigned int StartIndex = (unsigned)IsExtension;
+  auto *SourceTy =
+      cast<FixedVectorType>(II.getArgOperand(StartIndex)->getType());
+  // Note that the element count of each source vector does *not* need to be the
+  // same as the element count of the return type and mask! All source vectors
+  // must have the same element count as each other, though.
+  unsigned NumElementsPerSource = SourceTy->getNumElements();
+
+  // There are no tbl/tbx intrinsics for which the destination size exceeds the
+  // source size. However, our definitions of the intrinsics, at least in
+  // IntrinsicsAArch64.td, allow for arbitrary destination vector sizes, so it
+  // *could* technically happen.
+  if (NumIndexes > NumElementsPerSource) {
+    return nullptr;
+  }
+
+  // The tbl/tbx intrinsics take several source operands followed by a mask
+  // operand.
+  unsigned int NumSourceOperands = II.arg_size() - 1 - (unsigned)IsExtension;
+
+  // Map input operands to shuffle indices. This also helpfully deduplicates the
+  // input arguments, in case the same value is passed as an argument multiple
+  // times.
+  SmallDenseMap<Value *, unsigned, 2> ValueToShuffleSlot;
+  Value *ShuffleOperands[2] = {PoisonValue::get(SourceTy),
+                               PoisonValue::get(SourceTy)};
+
+  int Indexes[16];
+  for (unsigned I = 0; I < NumIndexes; ++I) {
+    Constant *COp = C->getAggregateElement(I);
+
+    if (!COp || (!isa<UndefValue>(COp) && !isa<ConstantInt>(COp)))
+      return nullptr;
+
+    if (isa<UndefValue>(COp)) {
+      Indexes[I] = -1;
+      continue;
+    }
+
+    uint64_t Index = cast<ConstantInt>(COp)->getZExtValue();
+    // The index of the input argument that this index references (0 = first
+    // source argument, etc).
+    unsigned SourceOperandIndex = Index / NumElementsPerSource;
+    // The index of the element at that source operand.
+    unsigned SourceOperandElementIndex = Index % NumElementsPerSource;
+
+    Value *SourceOperand;
+    if (SourceOperandIndex >= NumSourceOperands) {
+      // This index is out of bounds. Map it to index into either the fallback
+      // vector (tbx) or vector of zeroes (tbl).
+      SourceOperandIndex = NumSourceOperands;
+      if (IsExtension) {
+        // For out-of-bounds indices in tbx, choose the `I`th element of the
+        // fallback.
+        SourceOperand = II.getArgOperand(0);
+        SourceOperandElementIndex = I;
+      } else {
+        // Otherwise, choose some element from the dummy vector of zeroes (we'll
+        // always choose the first).
+        SourceOperand = Constant::getNullValue(SourceTy);
+        SourceOperandElementIndex = 0;
+      }
+    } else {
+      SourceOperand = II.getArgOperand(SourceOperandIndex + StartIndex);
+    }
+
+    // The source operand may be the fallback vector, which may not have the
+    // same number of elements as the source vector. In that case, we *could*
+    // choose to extend its length with another shufflevector, but it's simpler
+    // to just bail instead.
+    if (cast<FixedVectorType>(SourceOperand->getType())->getNumElements() !=
+        NumElementsPerSource) {
+      return nullptr;
+    }
+
+    // We now know the source operand referenced by this index. Make it a
+    // shufflevector operand, if it isn't already.
+    unsigned NumSlots = ValueToShuffleSlot.size();
+    // This shuffle references more than two sources, and hence cannot be
+    // represented as a shufflevector.
+    if (NumSlots == 2 && !ValueToShuffleSlot.contains(SourceOperand)) {
+      return nullptr;
+    }
+    auto [It, Inserted] =
+        ValueToShuffleSlot.try_emplace(SourceOperand, NumSlots);
+    if (Inserted) {
+      ShuffleOperands[It->getSecond()] = SourceOperand;
+    }
+
+    unsigned RemappedIndex =
+        (It->getSecond() * NumElementsPerSource) + SourceOperandElementIndex;
+    Indexes[I] = RemappedIndex;
+  }
+
+  Value *Shuf = IC.Builder.CreateShuffleVector(
+      ShuffleOperands[0], ShuffleOperands[1], ArrayRef(Indexes, NumIndexes));
+  return IC.replaceInstUsesWith(II, Shuf);
+}
+
+/// Simplify NEON multiply-long intrinsics (smull, umull).
+/// These intrinsics perform widening multiplies: they multiply two vectors of
+/// narrow integers and produce a vector of wider integers. This function
+/// performs algebraic simplifications:
+/// 1. Multiply by zero => zero vector
+/// 2. Multiply by one => zero/sign-extend the non-one operand
+/// 3. Both operands constant => regular multiply that can be constant-folded
+///    later
+Instruction *simplifyNeonMultiply(IntrinsicInst &II, InstCombiner &IC,
+                                  bool IsSigned) {
+  Value *Arg0 = II.getArgOperand(0);
+  Value *Arg1 = II.getArgOperand(1);
+
+  // Handle mul by zero first:
+  if (isa<ConstantAggregateZero>(Arg0) || isa<ConstantAggregateZero>(Arg1)) {
+    return IC.replaceInstUsesWith(II, ConstantAggregateZero::get(II.getType()));
+  }
+
+  // Check for constant LHS & RHS - in this case we just simplify.
+  VectorType *NewVT = cast<VectorType>(II.getType());
+  if (Constant *CV0 = dyn_cast<Constant>(Arg0)) {
+    if (Constant *CV1 = dyn_cast<Constant>(Arg1)) {
+      Value *V0 = IC.Builder.CreateIntCast(CV0, NewVT, IsSigned);
+      Value *V1 = IC.Builder.CreateIntCast(CV1, NewVT, IsSigned);
+      return IC.replaceInstUsesWith(II, IC.Builder.CreateMul(V0, V1));
+    }
+
+    // Couldn't simplify - canonicalize constant to the RHS.
+    std::swap(Arg0, Arg1);
+  }
+
+  // Handle mul by one:
+  if (Constant *CV1 = dyn_cast<Constant>(Arg1))
+    if (ConstantInt *Splat =
+            dyn_cast_or_null<ConstantInt>(CV1->getSplatValue()))
+      if (Splat->isOne())
+        return CastInst::CreateIntegerCast(Arg0, II.getType(), IsSigned);
+
+  return nullptr;
+}
+
+/// Simplify AES encryption/decryption intrinsics (AESE, AESD).
+///
+/// ARM's AES instructions (AESE/AESD) XOR the data and the key, provided as
+/// separate arguments, before performing the encryption/decryption operation.
+/// We can fold that "internal" XOR with a previous one.
+Instruction *simplifyAES(IntrinsicInst &II, InstCombiner &IC) {
+  Value *DataArg = II.getArgOperand(0);
+  Value *KeyArg = II.getArgOperand(1);
+
+  // Accept zero on either operand.
+  if (!match(KeyArg, m_ZeroInt()))
+    std::swap(KeyArg, DataArg);
+
+  // Try to use the builtin XOR in AESE and AESD to eliminate a prior XOR
+  Value *Data, *Key;
+  if (match(KeyArg, m_ZeroInt()) &&
+      match(DataArg, m_Xor(m_Value(Data), m_Value(Key)))) {
+    IC.replaceOperand(II, 0, Data);
+    IC.replaceOperand(II, 1, Key);
+    return &II;
+  }
+
+  return nullptr;
+}
+
+} // namespace ARMCommon
+} // namespace llvm
diff --git a/llvm/lib/Target/ARMCommon/ARMCommonInstCombineIntrinsic.h b/llvm/lib/Target/ARMCommon/ARMCommonInstCombineIntrinsic.h
new file mode 100644
index 0000000000000..319aee48ccb0d
--- /dev/null
+++ b/llvm/lib/Target/ARMCommon/ARMCommonInstCombineIntrinsic.h
@@ -0,0 +1,56 @@
+//===- ARMCommonInstCombineIntrinsic.h -
+// instCombineIntrinsic opts for both ARM and AArch64 -----------*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+///
+/// \file
+/// This file contains optimizations for ARM and AArch64 intrinsics that
+/// are shared between both architectures. These functions can be called from:
+/// - ARM TTI's instCombineIntrinsic (for arm_neon_* intrinsics)
+/// - AArch64 TTI's instCombineIntrinsic (for aarch64_neon_* and aarch64_sve_*
+///   intrinsics)
+///
+//===----------------------------------------------------------------------===//
+
+#ifndef LLVM_LIB_TARGET_ARMCOMMON_ARMCOMMONINSTCOMBINEINTRINSIC_H
+#define LLVM_LIB_TARGET_ARMCOMMON_ARMCOMMONINSTCOMBINEINTRINSIC_H
+
+#include "llvm/IR/IntrinsicInst.h"
+#include "llvm/IR/Value.h"
+#include "llvm/Transforms/InstCombine/InstCombiner.h"
+
+namespace llvm {
+
+namespace ARMCommon {
+
+/// Convert `tbl`/`tbx` intrinsics to shufflevector if the mask is constant, and
+/// at most two source operands are actually referenced.
+Instruction *simplifyNeonTbl(IntrinsicInst &II, InstCombiner &IC,
+                             bool IsExtension);
+
+/// Simplify NEON multiply-long intrinsics (smull, umull).
+/// These intrinsics perform widening multiplies: they multiply two vectors of
+/// narrow integers and produce a vector of wider integers. This function
+/// performs algebraic simplifications:
+/// 1. Multiply by zero => zero vector
+/// 2. Multiply by one => zero/sign-extend the non-one operand
+/// 3. Both operands constant => regular multiply that can be constant-folded
+///    later
+Instruction *simplifyNeonMultiply(IntrinsicInst &II, InstCombiner &IC,
+                                  bool IsSigned);
+
+/// Simplify AES encryption/decryption intrinsics (AESE, AESD).
+///
+/// ARM's AES instructions (AESE/AESD) XOR the data and the key, provided as
+/// separate arguments, before performing the encryption/decryption operation.
+/// We can fold that "internal" XOR with a previous one.
+Instruction *simplifyAES(IntrinsicInst &II, InstCombiner &IC);
+
+} // namespace ARMCommon
+} // namespace llvm
+
+#endif // LLVM_LIB_TARGET_ARMCOMMON_ARMCOMMONINSTCOMBINEINTRINSIC_H
diff --git a/llvm/lib/Target/ARMCommon/CMakeLists.txt b/llvm/lib/Target/ARMCommon/CMakeLists.txt
new file mode 100644
index 0000000000000..1805a5df2f053
--- /dev/null
+++ b/llvm/lib/Target/ARMCommon/CMakeLists.txt
@@ -0,0 +1,8 @@
+add_llvm_component_library(LLVMARMCommon
+  ARMCommonInstCombineIntrinsic.cpp
+
+  LINK_COMPONENTS
+  Core
+  Support
+  TransformUtils
+  )
diff --git a/llvm/lib/Target/CMakeLists.txt b/llvm/lib/Target/CMakeLists.txt
index bcc13f942bf96..e3528014a4be2 100644
--- a/llvm/lib/Target/CMakeLists.txt
+++ b/llvm/lib/Target/CMakeLists.txt
@@ -31,6 +31,11 @@ if (NOT BUILD_SHARED_LIBS AND NOT APPLE AND
   set(CMAKE_CXX_VISIBILITY_PRESET hidden)
 endif()
 
+# Add shared ARM/AArch64 utilities if either target is being built
+if("ARM" IN_LIST LLVM_TARGETS_TO_BUILD OR "AArch64" IN_LIST LLVM_TARGETS_TO_BUILD)
+  add_subdirectory(ARMCommon)
+endif()
+
 foreach(t ${LLVM_TARGETS_TO_BUILD})
   message(STATUS "Targeting ${t}")
   add_subdirectory(${t})
diff --git a/llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp b/llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp
index 8e4edefec42fd..8a54c0dde6be6 100644
--- a/llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp
+++ b/llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp
@@ -737,44 +737,6 @@ static Instruction *foldCtpop(IntrinsicInst &II, InstCombinerImpl &IC) {
   return nullptr;
 }
 
-/// Convert a table lookup to shufflevector if the mask is constant.
-/// This could benefit tbl1 if the mask is { 7,6,5,4,3,2,1,0 }, in
-/// which case we could lower the shufflevector with rev64 instructions
-/// as it's actually a byte reverse.
-static Value *simplifyNeonTbl1(const IntrinsicInst &II,
-                               InstCombiner::BuilderTy &Builder) {
-  // Bail out if the mask is not a constant.
-  auto *C = dyn_cast<Constant>(II.getArgOperand(1));
-  if (!C)
-    return nullptr;
-
-  auto *VecTy = cast<FixedVectorType>(II.getType());
-  unsigned NumElts = VecTy->getNumElements();
-
-  // Only perform this transformation for <8 x i8> vector types.
-  if (!VecTy->getElementType()->isIntegerTy(8) || NumElts != 8)
-    return nullptr;
-
-  int Indexes[8];
-
-  for (unsigned I = 0; I < NumElts; ++I) {
-    Constant *COp = C->getAggregateElement(I);
-
-    if (!COp || !isa<ConstantInt>(COp))
-      return nullptr;
-
-    Indexes[I] = cast<ConstantInt>(COp)->getLimitedValue();
-
-    // Make sure the mask indices are in range.
-    if ((unsigned)Indexes[I] >= NumElts)
-      return nullptr;
-  }
-
-  auto *V1 = II.getArgOperand(0);
-  auto *V2 = Constant::getNullValue(V1->getType());
-  return Builder.CreateShuffleVector(V1, V2, ArrayRef(Indexes));
-}
-
 // Returns true iff the 2 intrinsics have the same operands, limiting the
 // comparison to the first NumOperands.
 static bool haveSameOperands(const IntrinsicInst &I, const IntrinsicInst &E,
@@ -3155,72 +3117,6 @@ Instruction *InstCombinerImpl::visitCallInst(CallInst &CI) {
         Intrinsic::getOrInsertDeclaration(II->getModule(), NewIntrin);
     return CallInst::Create(NewFn, CallArgs);
   }
-  case Intrinsic::arm_neon_vtbl1:
-  case Intrinsic::aarch64_neon_tbl1:
-    if (Value *V = simplifyNeonTbl1(*II, Builder))
-      return replaceInstUsesWith(*II, V);
-    break;
-
-  case Intrinsic::arm_neon_vmulls:
-  case Intrinsic::arm_neon_vmullu:
-  case Intrinsic::aarch64_neon_smull:
-  case Intrinsic::aarch64_neon_umull: {
-    Value *Arg0 = II->getArgOperand(0);
-    Value *Arg1 = II->getArgOperand(1);
-
-    // Handle mul by zero first:
-    if (isa<ConstantAggregateZero>(Arg0) || isa<ConstantAggregateZero>(Arg1)) {
-      return replaceInstUsesWith(CI, ConstantAggregateZero::get(II->getType()));
-    }
-
-    // Check for constant LHS & RHS - in this case we just simplify.
-    bool Zext = (IID == Intrinsic::arm_neon_vmullu ||
-                 IID == Intrinsic::aarch64_neon_umull);
-    VectorType *NewVT = cast<VectorType>(II->getType());
-    if (Constant *CV0 = dyn_cast<Constant>(Arg0)) {
-      if (Constant *CV1 = dyn_cast<Constant>(Arg1)) {
-        Value *V0 = Builder.CreateIntCast(CV0, NewVT, /*isSigned=*/!Zext);
-        Value *V1 = Builder.CreateIntCast(CV1, NewVT, /*isSigned=*/!Zext);
-        return replaceInstUsesWith(CI, Builder.CreateMul(V0, V1));
-      }
-
-      // Couldn't simplify - canonicalize constant to the RHS.
-      std::swap(Arg0, Arg1);
-    }
-
-    // Handle mul by one:
-    if (Constant *CV1 = dyn_cast<Constant>(Arg1))
-      if (ConstantInt *Splat =
-              dyn_cast_or_null<ConstantInt>(CV1->getSplatValue()))
-        if (Splat->isOne())
-          return CastInst::CreateIn...
[truncated]
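
As a rough illustration of the multiply-long folds that this diff moves into `ARMCommon`, the per-lane behavior can be modeled in plain C++. This is a sketch under stated assumptions, not LLVM code: `mullLane` and `extendLane` are hypothetical helpers, and the lane width is fixed at 8 bits widening to 16 bits for brevity.

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of one lane of a NEON multiply-long (smull/umull): both 8-bit
// inputs are widened to 16 bits first, then multiplied.
// Hypothetical helper for illustration only; not part of LLVM.
inline uint16_t mullLane(uint8_t A, uint8_t B, bool IsSigned) {
  if (IsSigned) {
    int16_t WA = static_cast<int8_t>(A); // sign-extend
    int16_t WB = static_cast<int8_t>(B);
    return static_cast<uint16_t>(WA * WB);
  }
  uint16_t WA = A; // zero-extend
  uint16_t WB = B;
  return static_cast<uint16_t>(WA * WB);
}

// The extension that remains after the "multiply by one" fold.
// Hypothetical helper for illustration only; not part of LLVM.
inline uint16_t extendLane(uint8_t A, bool IsSigned) {
  if (IsSigned)
    return static_cast<uint16_t>(static_cast<int16_t>(static_cast<int8_t>(A)));
  return A;
}
```

In this model, multiplying a lane by zero yields zero and multiplying by one yields the extended input, which is what cases 1 and 2 in the `simplifyNeonMultiply` comment rely on. `IsSigned` only selects between sign and zero extension.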

valadaptive added 4 commits November 25, 2025 19:25

[AArch64][ARM] Move ARM-specific InstCombine transforms to new module

852e4d3

[AArch64][ARM] !Zext -> IsSigned

234164f

[AArch64][ARM] Make simplifyNeonTbl1 behave like the other transforms

e53963f

[AArch64][ARM] Add new tests for tbl/tbx optimizations

14ae19c
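
Among the transforms these commits move into the shared module, the AES fold depends on AESE/AESD XORing their two operands before doing anything else. The sketch below models that property only. It assumes a stand-in round function (`toyRound`, which is deliberately not real AES), and all names in it are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Block = std::array<uint8_t, 16>;

// Byte-wise XOR of two 128-bit blocks.
inline Block xorBlocks(const Block &A, const Block &B) {
  Block R{};
  for (std::size_t I = 0; I < R.size(); ++I)
    R[I] = static_cast<uint8_t>(A[I] ^ B[I]);
  return R;
}

// Stand-in for the rest of the instruction (SubBytes/ShiftRows in hardware).
// This is NOT real AES; any fixed function of the XORed state shows the point.
inline Block toyRound(const Block &S) {
  Block R{};
  for (std::size_t I = 0; I < S.size(); ++I)
    R[I] = static_cast<uint8_t>(S[(I * 5) % 16] * 7 + 3);
  return R;
}

// Model of AESE/AESD: XOR data with key first, then apply the round.
inline Block aeseModel(const Block &Data, const Block &Key) {
  return toyRound(xorBlocks(Data, Key));
}
```

Because the first step is an XOR, `aeseModel(X ^ K, 0)` and `aeseModel(X, K)` feed the same state into the round. That is why `simplifyAES` can replace a zero operand plus an explicit `xor` with the two `xor` inputs directly.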

valadaptive requested a review from nikic as a code owner November 26, 2025 23:48

llvmbot added the backend:ARM, backend:AArch64, llvm:instcombine, and llvm:transforms labels Nov 26, 2025

[AArch64][ARM] Optimize more tbl/tbx calls into shufflevector

f9b09c8
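
The index remapping in `simplifyNeonTbl` can be sketched outside LLVM as follows. This is a simplified model under assumptions, not the patch itself. `ShufflePlan` and `planTblShuffle` are hypothetical names, and operands are identified by position rather than by `Value *`, so passing the same value twice is not deduplicated here. Undef mask elements and the fallback-length check are also omitted.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical result type: which logical operand fills each of the two
// shufflevector slots, plus the remapped mask.
// Slot values: 0..NumSources-1 = table operand, NumSources = the extra operand
// (zero vector for tbl, fallback for tbx), -1 = slot unused.
struct ShufflePlan {
  int Slots[2] = {-1, -1};
  std::vector<int> Mask;
};

// Returns std::nullopt when more than two distinct operands are referenced,
// mirroring the bail-out in the patch.
inline std::optional<ShufflePlan>
planTblShuffle(const std::vector<uint8_t> &Indices, unsigned NumSources,
               unsigned EltsPerSource, bool IsExtension) {
  ShufflePlan Plan;
  unsigned NumSlots = 0;
  for (unsigned I = 0; I < Indices.size(); ++I) {
    unsigned Src = Indices[I] / EltsPerSource; // which table operand
    unsigned Elt = Indices[I] % EltsPerSource; // element within it
    if (Src >= NumSources) {
      // Out of bounds: lane I of the fallback (tbx) or element 0 of a
      // zero vector (tbl).
      Src = NumSources;
      Elt = IsExtension ? I : 0;
    }
    // Find the slot already holding this operand, or allocate a new one.
    unsigned Slot = 0;
    while (Slot < NumSlots && Plan.Slots[Slot] != static_cast<int>(Src))
      ++Slot;
    if (Slot == NumSlots) {
      if (NumSlots == 2)
        return std::nullopt; // a shufflevector has only two inputs
      Plan.Slots[NumSlots++] = static_cast<int>(Src);
    }
    Plan.Mask.push_back(static_cast<int>(Slot * EltsPerSource + Elt));
  }
  return Plan;
}
```

For example, a `tbl1` mask of `{0, 255}` over one 16-element table uses two slots (the table and the zero vector) and remaps to `{0, 16}`, while a `tbl3` mask that touches all three tables yields no plan.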

valadaptive force-pushed the aarch64-const-tbl branch from b05dd6e to f9b09c8 November 27, 2025 05:43

nasherm self-requested a review November 27, 2025 09:30

This was referenced Nov 27, 2025

Shuffle/swizzle operations linebender/fearless_simd#29 (Open)

[AArch64][ARM] Move ARM-specific InstCombine transforms to new module #169589 (Open)

llvmbot commented Nov 26, 2025 (edited)