Skip to content

Conversation

@jurahul
Copy link
Contributor

@jurahul jurahul commented Aug 8, 2025

This change implements several small improvements to Intrinsic::getAttributes:

  1. Use SequenceToOffsetTable to emit ArgAttrIdTable. This enables reuse of entries when they share a common prefix. This reduces the size of this table from 546 to 484 entries, which is 248 bytes.
  2. Fix AttributeComparator to purely compare argument attributes and not look at function attributes. This avoids unnecessary duplicates in the uniqueing process and eliminates 2 entries from ArgAttributesInfoTable, saving 8 bytes.
  3. Improve Intrinsic::getAttributes code to not initialize all entries of AS always. Currently, we initialize all entries of the array AS even if we may not use all of them. In addition to the runtime cost, for Clang release builds, since the initialization loop is unrolled, it consumes ~330 bytes of code to initialize the AS array. Address this by declaring the storage for AS using just a char array with appropriate alignas (similar to how SmallVectorStorage defines its inline elements).

@jurahul jurahul force-pushed the ie_sequence_to_offset_table branch from 1b717bd to 2dd7ebc Compare August 8, 2025 17:23
@jurahul jurahul force-pushed the ie_sequence_to_offset_table branch from 2dd7ebc to 4ac6a93 Compare August 8, 2025 20:26
@jurahul jurahul marked this pull request as ready for review August 8, 2025 22:04
@jurahul jurahul requested review from arsenm and nikic August 8, 2025 22:04
@llvmbot
Copy link
Member

llvmbot commented Aug 8, 2025

@llvm/pr-subscribers-tablegen

Author: Rahul Joshi (jurahul)

Changes

This change implements several small improvements to Intrinsic::getAttributes:

  1. Use SequenceToOffsetTable to emit ArgAttrIdTable. This enables reuse of entries when they share a common prefix. This reduces the size of this table from 546 to 484 entries, which is 248 bytes.
  2. Fix AttributeComparator to purely compare argument attributes and not look at function attributes. This avoids unnecessary duplicates in the uniqueing process and eliminates 2 entries from ArgAttributesInfoTable, saving 8 bytes.
  3. Improve Intrinsic::getAttributes code to not initialize all entries of AS always. Currently, we initialize all entries of the array AS even if we may not use all of them. In addition to the runtime cost, for Clang release builds, since the initialization loop is unrolled, it consumes ~330 bytes of code to initialize the AS array. Address this by declaring the storage for AS using just a char array with appropriate alignas (similar to how SmallVectorStorage defines its inline elements).

Full diff: https://github.com/llvm/llvm-project/pull/152761.diff

1 Files Affected:

  • (modified) llvm/utils/TableGen/Basic/IntrinsicEmitter.cpp (+51-57)
diff --git a/llvm/utils/TableGen/Basic/IntrinsicEmitter.cpp b/llvm/utils/TableGen/Basic/IntrinsicEmitter.cpp
index 293e64e97cc33..04707084a4166 100644
--- a/llvm/utils/TableGen/Basic/IntrinsicEmitter.cpp
+++ b/llvm/utils/TableGen/Basic/IntrinsicEmitter.cpp
@@ -455,15 +455,9 @@ struct FnAttributeComparator {
 
 struct AttributeComparator {
   bool operator()(const CodeGenIntrinsic *L, const CodeGenIntrinsic *R) const {
-    // Order all intrinsics with no functiona attributes before all intrinsics
-    // with function attributes.
-    bool HasFnAttrLHS = hasFnAttributes(*L);
-    bool HasFnAttrRHS = hasFnAttributes(*R);
-
-    // Order by argument attributes if function `hasFnAttributes` is equal.
-    // This is reliable because each side is already sorted internally.
-    return std::tie(HasFnAttrLHS, L->ArgumentAttributes) <
-           std::tie(HasFnAttrRHS, R->ArgumentAttributes);
+    // This comparator is used to unique just the argument attributes of an
+    // intrinsic without considering any function attributes.
+    return L->ArgumentAttributes < R->ArgumentAttributes;
   }
 };
 } // End anonymous namespace
@@ -659,7 +653,7 @@ static constexpr uint16_t IntrinsicsToAttributesMap[] = {)";
   //
   //   getIntrinsicArgAttributeSet(C, ArgAttrID, FT->getContainedType(ArgNo));
   //
-  // Create a table that records, for each argument attributes, the set of
+  // Create a table that records, for each argument attributes, the list of
   // <ArgNo, ArgAttrID> pairs that are needed to construct its argument
   // attributes. These tables for all intrinsics will be concatenated into one
   // large table and then for each intrinsic, we remember the Staring index and
@@ -667,79 +661,77 @@ static constexpr uint16_t IntrinsicsToAttributesMap[] = {)";
   // non-empty attributes), so that we can build the attribute list for an
   // intrinsic without using a switch-case.
 
-  // Find the max number of attributes to create the local array and create
-  // a concatenated list of <ArgNo, AttrID> pairs.
-  struct ArgNoAttrIDPair {
-    uint16_t ArgNo, ArgAttrID;
-    ArgNoAttrIDPair(uint16_t ArgNo, uint16_t ArgAttrID)
-        : ArgNo(ArgNo), ArgAttrID(ArgAttrID) {}
-  };
+  using ArgNoAttrIDPair = std::pair<uint16_t, uint16_t>;
 
-  // For each unique ID in UniqAttributes, reacord the starting index in the
-  // flattened ArgNoAttrIDPair table, and the number of non-empty arg
-  // attributes.
-  struct ArgAttributesInfo {
-    uint16_t StartIndex;
-    uint16_t NumAttrs;
-    ArgAttributesInfo(uint16_t StartIndex, uint16_t NumAttrs)
-        : StartIndex(StartIndex), NumAttrs(NumAttrs) {}
-  };
-  SmallVector<ArgNoAttrIDPair> ArgAttrIdTable;
-  SmallVector<ArgAttributesInfo> ArgAttributesInfoTable(UniqAttributes.size(),
-                                                        {0, 0});
+  // Emit the table of concatenated <ArgNo, AttrId> using SequenceToOffsetTable
+  // so that entries can be reused if possible. Individual sequences in this
+  // table do not have any terminator.
+  using ArgAttrIDSubTable = SmallVector<ArgNoAttrIDPair>;
+  SequenceToOffsetTable<ArgAttrIDSubTable> ArgAttrIdSequenceTable(std::nullopt);
+  SmallVector<ArgAttrIDSubTable> ArgAttrIdSubTables(
+      UniqAttributes.size()); // Indexed by UniqueID.
 
+  // Find the max number of attributes to create the local array.
   unsigned MaxNumAttrs = 0;
   for (const auto [IntPtr, UniqueID] : UniqAttributes) {
     const CodeGenIntrinsic &Int = *IntPtr;
-    unsigned NumAttrs = 0;
-    unsigned StartIndex = ArgAttrIdTable.size();
+    ArgAttrIDSubTable SubTable;
 
     for (const auto &[ArgNo, Attrs] : enumerate(Int.ArgumentAttributes)) {
       if (Attrs.empty())
         continue;
 
       uint16_t ArgAttrID = UniqArgAttributes.find(Attrs)->second;
-      ArgAttrIdTable.emplace_back((uint16_t)ArgNo, ArgAttrID);
-      ++NumAttrs;
+      SubTable.emplace_back((uint16_t)ArgNo, ArgAttrID);
     }
-
-    // Record the start index and size of the list for this unique ID.
-    if (NumAttrs)
-      ArgAttributesInfoTable[UniqueID] =
-          ArgAttributesInfo(StartIndex, NumAttrs);
-
-    NumAttrs += hasFnAttributes(Int);
+    ArgAttrIdSubTables[UniqueID] = SubTable;
+    if (!SubTable.empty())
+      ArgAttrIdSequenceTable.add(SubTable);
+    unsigned NumAttrs = SubTable.size() + hasFnAttributes(Int);
     MaxNumAttrs = std::max(MaxNumAttrs, NumAttrs);
   }
 
-  if (ArgAttrIdTable.size() >= std::numeric_limits<uint16_t>::max())
+  ArgAttrIdSequenceTable.layout();
+
+  if (ArgAttrIdSequenceTable.size() >= std::numeric_limits<uint16_t>::max())
     PrintFatalError("Size of ArgAttrIdTable exceeds supported limit");
 
-  // Emit the 2 tables (flattened ArgNo, ArgAttrID) and ArgAttrIdTableIndex
-  OS << R"(
-namespace {
-struct ArgNoAttrIDPair {
+  // Emit the 2 tables (flattened ArgNo, ArgAttrID) and ArgAttributesInfoTable.
+  OS << formatv(R"(
+namespace {{
+struct ArgNoAttrIDPair {{
   uint16_t ArgNo, ArgAttrID;
 };
 } // namespace
 
-static constexpr ArgNoAttrIDPair ArgAttrIdTable[] = {
-)";
-  for (const auto &[ArgNo, ArgAttrID] : ArgAttrIdTable)
-    OS << formatv("  {{{}, {}},\n", ArgNo, ArgAttrID);
-  OS << R"(}; // ArgAttrIdTable
+// Number of entries: {}
+static constexpr ArgNoAttrIDPair ArgAttrIdTable[] = {{
+)",
+                ArgAttrIdSequenceTable.size());
 
-namespace {
-struct ArgAttributesInfo {
+  ArgAttrIdSequenceTable.emit(OS, [](raw_ostream &OS, ArgNoAttrIDPair Elem) {
+    OS << formatv("{{{}, {}}", Elem.first, Elem.second);
+  });
+
+  OS << formatv(R"(}; // ArgAttrIdTable
+
+namespace {{
+struct ArgAttributesInfo {{
   uint16_t StartIndex;
   uint16_t NumAttrs;
 };
 } // namespace
- 
-static constexpr ArgAttributesInfo ArgAttributesInfoTable[] = {
-)";
-  for (const auto &[StartIndex, NumAttrs] : ArgAttributesInfoTable)
+
+// Number of entries: {}
+static constexpr ArgAttributesInfo ArgAttributesInfoTable[] = {{
+)",
+                ArgAttrIdSubTables.size());
+
+  for (const auto &SubTable : ArgAttrIdSubTables) {
+    unsigned NumAttrs = SubTable.size();
+    unsigned StartIndex = NumAttrs ? ArgAttrIdSequenceTable.get(SubTable) : 0;
     OS << formatv("  {{{}, {}},\n", StartIndex, NumAttrs);
+  }
   OS << "}; // ArgAttributesInfoTable\n";
 
   // Now emit the Intrinsic::getAttributes function. This will first map
@@ -756,7 +748,9 @@ AttributeList Intrinsic::getAttributes(LLVMContext &C, ID id,
   uint16_t PackedID = IntrinsicsToAttributesMap[id - 1];
   uint8_t FnAttrID = PackedID >> 8;
   uint8_t ArgAttrID = PackedID & 0xFF;
-  std::pair<unsigned, AttributeSet> AS[{}];
+  using PairTy = std::pair<unsigned, AttributeSet>;
+  alignas(PairTy) char ASStorage[sizeof(PairTy) * {}];
+  PairTy *AS = reinterpret_cast<PairTy *>(ASStorage);
 
   // Construct an ArrayRef for easier range checking.
   ArrayRef<ArgAttributesInfo> ArgAttributesInfoTableAR(ArgAttributesInfoTable);

@arsenm
Copy link
Contributor

arsenm commented Aug 12, 2025

I don't really follow point 3. It seems like it's just a more convoluted way of achieving storage with the same size and alignment that works around (or runs into) some kind of optimizer bug?

@jurahul
Copy link
Contributor Author

jurahul commented Aug 12, 2025

I don't really follow point 3. It seems like it's just a more convoluted way of achieving storage with the same size and alignment that works around (or runs into) some kind of optimizer bug?

Its really a micro-optimization for this code. Currently, the size of this array is 17, so using

std::pair<unsigned, AttributeSet> AS[17];

in the code results in running the constructor of the std::pair (which just zero inits the unsigned and the Impl pointer in AttributeSet) on all 17 elements at the start of this function. In a given call, not all 17 elements are used so this is wasteful work. GCC seems to generate a small loop to do this init (using rep stos) that essentially does bzero on 17*16 bytes. Clang decides to unroll this loop, generating 17 pairs of movl and movq and hence consuming ~300 bytes of code space.

If we change this to a char array, we essentially avoid running the constructor for all 17 elements. In the actual code, we will initialize just the elements we will use. Note that this can also be achieved in a more standard way by using SmallVector<Pair, 17> but that seems to rope in all the other SmallVector baggage for heap fallback which we know is not needed here (Maybe SmallVector-like ADT where there is no heap fallback might be useful in such contexts, but it does not exist).

As I said, its a micro opt and should always be beneficial I think, so I have not done any actual measurements. It just popped up since I have been staring at the assembly for this function for the other issues I fixed earlier. I am ok with not doing it as well (to avoid complexity).

@jurahul jurahul merged commit 89ea9df into llvm:main Aug 12, 2025
12 checks passed
@jurahul jurahul deleted the ie_sequence_to_offset_table branch August 12, 2025 14:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants