[MIR2Vec] Add MIR2Vec support to llvm-ir2vec tool #164025
base: users/svkeerthy/10-13-handle_operands
@llvm/pr-subscribers-mlgo

Author: S. VenkataKeerthy (svkeerthy)

Changes

Add MIR2Vec support to the llvm-ir2vec tool, enabling embedding generation for Machine IR alongside the existing LLVM IR functionality.

Patch is 35.27 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/164025.diff

7 Files Affected:
diff --git a/llvm/docs/CommandGuide/llvm-ir2vec.rst b/llvm/docs/CommandGuide/llvm-ir2vec.rst
index fc590a6180316..55fe75d2084b1 100644
--- a/llvm/docs/CommandGuide/llvm-ir2vec.rst
+++ b/llvm/docs/CommandGuide/llvm-ir2vec.rst
@@ -1,5 +1,5 @@
-llvm-ir2vec - IR2Vec Embedding Generation Tool
-==============================================
+llvm-ir2vec - IR2Vec and MIR2Vec Embedding Generation Tool
+===========================================================
.. program:: llvm-ir2vec
@@ -11,9 +11,9 @@ SYNOPSIS
DESCRIPTION
-----------
-:program:`llvm-ir2vec` is a standalone command-line tool for IR2Vec. It
-generates IR2Vec embeddings for LLVM IR and supports triplet generation
-for vocabulary training.
+:program:`llvm-ir2vec` is a standalone command-line tool for IR2Vec and MIR2Vec.
+It generates embeddings for both LLVM IR and Machine IR (MIR) and supports
+triplet generation for vocabulary training.
The tool provides three main subcommands:
@@ -23,23 +23,33 @@ The tool provides three main subcommands:
2. **entities**: Generates entity mapping files (entity2id.txt) for vocabulary
training.
-3. **embeddings**: Generates IR2Vec embeddings using a trained vocabulary
+3. **embeddings**: Generates IR2Vec or MIR2Vec embeddings using a trained vocabulary
at different granularity levels (instruction, basic block, or function).
+The tool supports two operation modes:
+
+* **LLVM IR mode** (``--mode=llvm``): Process LLVM IR bitcode files and generate
+ IR2Vec embeddings
+* **Machine IR mode** (``--mode=mir``): Process Machine IR (.mir) files and generate
+ MIR2Vec embeddings
+
The tool is designed to facilitate machine learning applications that work with
-LLVM IR by converting the IR into numerical representations that can be used by
-ML models. The `triplets` subcommand generates numeric IDs directly instead of string
-triplets, streamlining the training data preparation workflow.
+LLVM IR or Machine IR by converting them into numerical representations that can
+be used by ML models. The `triplets` subcommand generates numeric IDs directly
+instead of string triplets, streamlining the training data preparation workflow.
.. note::
- For information about using IR2Vec programmatically within LLVM passes and
- the C++ API, see the `IR2Vec Embeddings <https://llvm.org/docs/MLGO.html#ir2vec-embeddings>`_
+ For information about using IR2Vec and MIR2Vec programmatically within LLVM
+ passes and the C++ API, see the `IR2Vec Embeddings <https://llvm.org/docs/MLGO.html#ir2vec-embeddings>`_
section in the MLGO documentation.
OPERATION MODES
---------------
+The tool operates in two modes: **LLVM IR mode** and **Machine IR mode**. The mode
+is selected using the ``--mode`` option (default: ``llvm``).
+
Triplet Generation and Entity Mapping Modes are used for preparing
vocabulary and training data for knowledge graph embeddings. The Embedding Mode
is used for generating embeddings from LLVM IR using a pre-trained vocabulary.
@@ -89,18 +99,31 @@ Embedding Generation
~~~~~~~~~~~~~~~~~~~~
With the `embeddings` subcommand, :program:`llvm-ir2vec` uses a pre-trained vocabulary to
-generate numerical embeddings for LLVM IR at different levels of granularity.
+generate numerical embeddings for LLVM IR or Machine IR at different levels of granularity.
+
+Example Usage for LLVM IR:
+
+.. code-block:: bash
+
+ llvm-ir2vec embeddings --mode=llvm --ir2vec-vocab-path=vocab.json --ir2vec-kind=symbolic --level=func input.bc -o embeddings.txt
-Example Usage:
+Example Usage for Machine IR:
.. code-block:: bash
- llvm-ir2vec embeddings --ir2vec-vocab-path=vocab.json --ir2vec-kind=symbolic --level=func input.bc -o embeddings.txt
+ llvm-ir2vec embeddings --mode=mir --mir2vec-vocab-path=vocab.json --level=func input.mir -o embeddings.txt
OPTIONS
-------
-Global options:
+Common options (applicable to both LLVM IR and Machine IR modes):
+
+.. option:: --mode=<mode>
+
+ Specify the operation mode. Valid values are:
+
+ * ``llvm`` - Process LLVM IR bitcode files (default)
+ * ``mir`` - Process Machine IR (.mir) files
.. option:: -o <filename>
@@ -116,8 +139,8 @@ Subcommand-specific options:
.. option:: <input-file>
- The input LLVM IR or bitcode file to process. This positional argument is
- required for the `embeddings` subcommand.
+ The input LLVM IR/bitcode file (.ll/.bc) or Machine IR file (.mir) to process.
+ This positional argument is required for the `embeddings` subcommand.
.. option:: --level=<level>
@@ -131,6 +154,8 @@ Subcommand-specific options:
Process only the specified function instead of all functions in the module.
+**IR2Vec-specific options** (for ``--mode=llvm``):
+
.. option:: --ir2vec-kind=<kind>
Specify the kind of IR2Vec embeddings to generate. Valid values are:
@@ -143,8 +168,8 @@ Subcommand-specific options:
.. option:: --ir2vec-vocab-path=<path>
- Specify the path to the vocabulary file (required for embedding generation).
- The vocabulary file should be in JSON format and contain the trained
+ Specify the path to the IR2Vec vocabulary file (required for LLVM IR embedding
+ generation). The vocabulary file should be in JSON format and contain the trained
vocabulary for embedding generation. See `llvm/lib/Analysis/models`
for pre-trained vocabulary files.
@@ -163,6 +188,35 @@ Subcommand-specific options:
Specify the weight for argument embeddings (default: 0.2). This controls
the relative importance of operand information in the final embedding.
+**MIR2Vec-specific options** (for ``--mode=mir``):
+
+.. option:: --mir2vec-vocab-path=<path>
+
+ Specify the path to the MIR2Vec vocabulary file (required for Machine IR
+ embedding generation). The vocabulary file should be in JSON format and
+ contain the trained vocabulary for embedding generation.
+
+.. option:: --mir2vec-kind=<kind>
+
+ Specify the kind of MIR2Vec embeddings to generate. Valid values are:
+
+ * ``symbolic`` - Generate symbolic embeddings (default)
+
+.. option:: --mir2vec-opc-weight=<weight>
+
+ Specify the weight for machine opcode embeddings (default: 1.0). This controls
+ the relative importance of machine instruction opcodes in the final embedding.
+
+.. option:: --mir2vec-common-operand-weight=<weight>
+
+ Specify the weight for common operand embeddings (default: 1.0). This controls
+ the relative importance of common operand types in the final embedding.
+
+.. option:: --mir2vec-reg-operand-weight=<weight>
+
+ Specify the weight for register operand embeddings (default: 1.0). This controls
+ the relative importance of register operands in the final embedding.
+
**triplets** subcommand:
@@ -240,3 +294,6 @@ SEE ALSO
For more information about the IR2Vec algorithm and approach, see:
`IR2Vec: LLVM IR Based Scalable Program Embeddings <https://doi.org/10.1145/3418463>`_.
+
+For more information about the MIR2Vec algorithm and approach, see:
+`RL4ReAl: Reinforcement Learning for Register Allocation <https://doi.org/10.1145/3578360.3580273>`_.
diff --git a/llvm/include/llvm/CodeGen/MIR2Vec.h b/llvm/include/llvm/CodeGen/MIR2Vec.h
index 953e590a6d64f..f47d9abb042d8 100644
--- a/llvm/include/llvm/CodeGen/MIR2Vec.h
+++ b/llvm/include/llvm/CodeGen/MIR2Vec.h
@@ -7,9 +7,20 @@
//===----------------------------------------------------------------------===//
///
/// \file
-/// This file defines the MIR2Vec vocabulary
-/// analysis(MIR2VecVocabLegacyAnalysis), the core mir2vec::MIREmbedder
-/// interface for generating Machine IR embeddings, and related utilities.
+/// This file defines the MIR2Vec framework for generating Machine IR
+/// embeddings.
+///
+/// Architecture Overview:
+/// ----------------------
+/// 1. MIR2VecVocabProvider - Core vocabulary loading logic (no PM dependency)
+/// - Can be used standalone or wrapped by the pass manager
+/// - Requires MachineModuleInfo with parsed machine functions
+///
+/// 2. MIR2VecVocabLegacyAnalysis - Pass manager wrapper (ImmutablePass)
+/// - Integrated and used by llc -print-mir2vec
+///
+/// 3. MIREmbedder - Generates embeddings from vocabulary
+/// - SymbolicMIREmbedder: MIR2Vec embedding implementation
///
/// MIR2Vec extends IR2Vec to support Machine IR embeddings. It represents the
/// LLVM Machine IR as embeddings which can be used as input to machine learning
@@ -306,26 +317,58 @@ class SymbolicMIREmbedder : public MIREmbedder {
} // namespace mir2vec
+/// MIR2Vec vocabulary provider used by pass managers and standalone tools.
+/// This class encapsulates the core vocabulary loading logic and can be used
+/// independently of the pass manager infrastructure. For pass-based usage,
+/// see MIR2VecVocabLegacyAnalysis.
+///
+/// Note: This provider pattern makes new PM migration straightforward when
+/// needed. A new PM analysis wrapper can be added that delegates to this
+/// provider, similar to how MIR2VecVocabLegacyAnalysis currently wraps it.
+class MIR2VecVocabProvider {
+ using VocabMap = std::map<std::string, mir2vec::Embedding>;
+
+public:
+ MIR2VecVocabProvider(const MachineModuleInfo &MMI) : MMI(MMI) {}
+
+ Expected<mir2vec::MIRVocabulary> getVocabulary(const Module &M);
+
+private:
+ Error readVocabulary(VocabMap &OpcVocab, VocabMap &CommonOperandVocab,
+ VocabMap &PhyRegVocabMap, VocabMap &VirtRegVocabMap);
+ const MachineModuleInfo &MMI;
+};
+
/// Pass to analyze and populate MIR2Vec vocabulary from a module
class MIR2VecVocabLegacyAnalysis : public ImmutablePass {
using VocabVector = std::vector<mir2vec::Embedding>;
using VocabMap = std::map<std::string, mir2vec::Embedding>;
- std::optional<mir2vec::MIRVocabulary> Vocab;
StringRef getPassName() const override;
- Error readVocabulary(VocabMap &OpcVocab, VocabMap &CommonOperandVocab,
- VocabMap &PhyRegVocabMap, VocabMap &VirtRegVocabMap);
protected:
void getAnalysisUsage(AnalysisUsage &AU) const override {
AU.addRequired<MachineModuleInfoWrapperPass>();
AU.setPreservesAll();
}
+ std::unique_ptr<MIR2VecVocabProvider> Provider;
public:
static char ID;
MIR2VecVocabLegacyAnalysis() : ImmutablePass(ID) {}
- Expected<mir2vec::MIRVocabulary> getMIR2VecVocabulary(const Module &M);
+
+ Expected<mir2vec::MIRVocabulary> getMIR2VecVocabulary(const Module &M) {
+ MachineModuleInfo &MMI =
+ getAnalysis<MachineModuleInfoWrapperPass>().getMMI();
+ if (!Provider)
+ Provider = std::make_unique<MIR2VecVocabProvider>(MMI);
+ return Provider->getVocabulary(M);
+ }
+
+ MIR2VecVocabProvider &getProvider() {
+ assert(Provider && "Provider not initialized");
+ return *Provider;
+ }
};
/// This pass prints the embeddings in the MIR2Vec vocabulary
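The header change above splits vocabulary loading (`MIR2VecVocabProvider`) from the pass wrapper (`MIR2VecVocabLegacyAnalysis`), which creates the provider lazily on first use and forwards to it. The following is a minimal, self-contained sketch of that shape. It uses stand-in types only: `ModuleInfo`, `Vocabulary`, `VocabProvider`, and `LegacyWrapper` are hypothetical names, not LLVM classes, and the sketch illustrates the delegation rather than reproducing the patch.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Stand-in for MachineModuleInfo; hypothetical, not an LLVM type.
struct ModuleInfo {
  std::string TargetName;
};

// Stand-in for mir2vec::MIRVocabulary; hypothetical.
struct Vocabulary {
  std::string BuiltFor;
};

// Core logic with no pass-manager dependency: it only needs a reference to
// the module info, so a standalone tool can construct it directly.
class VocabProvider {
public:
  explicit VocabProvider(const ModuleInfo &MI) : MI(MI) {}
  Vocabulary getVocabulary() const { return Vocabulary{MI.TargetName}; }

private:
  const ModuleInfo &MI;
};

// Thin wrapper in the style of a legacy pass: it creates the provider on
// first use and forwards every request to it.
class LegacyWrapper {
public:
  explicit LegacyWrapper(const ModuleInfo &MI) : MI(MI) {}

  Vocabulary getVocabulary() {
    if (!Provider)
      Provider = std::make_unique<VocabProvider>(MI);
    return Provider->getVocabulary();
  }

  // Only valid after getVocabulary() has been called at least once.
  VocabProvider &getProvider() {
    assert(Provider && "Provider not initialized");
    return *Provider;
  }

  bool hasProvider() const { return Provider != nullptr; }

private:
  const ModuleInfo &MI;
  std::unique_ptr<VocabProvider> Provider;
};
```

In this sketch a tool would construct `VocabProvider` directly, while pass-based code would go through `LegacyWrapper::getVocabulary()`. Both paths run the same loading logic.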
diff --git a/llvm/lib/CodeGen/MIR2Vec.cpp b/llvm/lib/CodeGen/MIR2Vec.cpp
index 716221101af9f..69c1e28e55e3b 100644
--- a/llvm/lib/CodeGen/MIR2Vec.cpp
+++ b/llvm/lib/CodeGen/MIR2Vec.cpp
@@ -412,24 +412,39 @@ Expected<MIRVocabulary> MIRVocabulary::createDummyVocabForTest(
}
//===----------------------------------------------------------------------===//
-// MIR2VecVocabLegacyAnalysis Implementation
+// MIR2VecVocabProvider and MIR2VecVocabLegacyAnalysis
//===----------------------------------------------------------------------===//
-char MIR2VecVocabLegacyAnalysis::ID = 0;
-INITIALIZE_PASS_BEGIN(MIR2VecVocabLegacyAnalysis, "mir2vec-vocab-analysis",
- "MIR2Vec Vocabulary Analysis", false, true)
-INITIALIZE_PASS_DEPENDENCY(MachineModuleInfoWrapperPass)
-INITIALIZE_PASS_END(MIR2VecVocabLegacyAnalysis, "mir2vec-vocab-analysis",
- "MIR2Vec Vocabulary Analysis", false, true)
+Expected<mir2vec::MIRVocabulary>
+MIR2VecVocabProvider::getVocabulary(const Module &M) {
+ VocabMap OpcVocab, CommonOperandVocab, PhyRegVocabMap, VirtRegVocabMap;
-StringRef MIR2VecVocabLegacyAnalysis::getPassName() const {
- return "MIR2Vec Vocabulary Analysis";
+ if (Error Err = readVocabulary(OpcVocab, CommonOperandVocab, PhyRegVocabMap,
+ VirtRegVocabMap))
+ return std::move(Err);
+
+ for (const auto &F : M) {
+ if (F.isDeclaration())
+ continue;
+
+ if (auto *MF = MMI.getMachineFunction(F)) {
+ auto &Subtarget = MF->getSubtarget();
+ if (const auto *TII = Subtarget.getInstrInfo())
+ if (const auto *TRI = Subtarget.getRegisterInfo())
+ return mir2vec::MIRVocabulary::create(
+ std::move(OpcVocab), std::move(CommonOperandVocab),
+ std::move(PhyRegVocabMap), std::move(VirtRegVocabMap), *TII, *TRI,
+ MF->getRegInfo());
+ }
+ }
+ return createStringError(errc::invalid_argument,
+ "No machine functions found in module");
}
-Error MIR2VecVocabLegacyAnalysis::readVocabulary(VocabMap &OpcodeVocab,
- VocabMap &CommonOperandVocab,
- VocabMap &PhyRegVocabMap,
- VocabMap &VirtRegVocabMap) {
+Error MIR2VecVocabProvider::readVocabulary(VocabMap &OpcodeVocab,
+ VocabMap &CommonOperandVocab,
+ VocabMap &PhyRegVocabMap,
+ VocabMap &VirtRegVocabMap) {
if (VocabFile.empty())
return createStringError(
errc::invalid_argument,
@@ -478,49 +493,15 @@ Error MIR2VecVocabLegacyAnalysis::readVocabulary(VocabMap &OpcodeVocab,
return Error::success();
}
-Expected<mir2vec::MIRVocabulary>
-MIR2VecVocabLegacyAnalysis::getMIR2VecVocabulary(const Module &M) {
- if (Vocab.has_value())
- return std::move(Vocab.value());
-
- VocabMap OpcMap, CommonOperandMap, PhyRegMap, VirtRegMap;
- if (Error Err =
- readVocabulary(OpcMap, CommonOperandMap, PhyRegMap, VirtRegMap))
- return std::move(Err);
-
- // Get machine module info to access machine functions and target info
- MachineModuleInfo &MMI = getAnalysis<MachineModuleInfoWrapperPass>().getMMI();
-
- // Find first available machine function to get target instruction info
- for (const auto &F : M) {
- if (F.isDeclaration())
- continue;
-
- if (auto *MF = MMI.getMachineFunction(F)) {
- auto &Subtarget = MF->getSubtarget();
- const TargetInstrInfo *TII = Subtarget.getInstrInfo();
- if (!TII) {
- return createStringError(errc::invalid_argument,
- "No TargetInstrInfo available; cannot create "
- "MIR2Vec vocabulary");
- }
-
- const TargetRegisterInfo *TRI = Subtarget.getRegisterInfo();
- if (!TRI) {
- return createStringError(errc::invalid_argument,
- "No TargetRegisterInfo available; cannot "
- "create MIR2Vec vocabulary");
- }
-
- return mir2vec::MIRVocabulary::create(
- std::move(OpcMap), std::move(CommonOperandMap), std::move(PhyRegMap),
- std::move(VirtRegMap), *TII, *TRI, MF->getRegInfo());
- }
- }
+char MIR2VecVocabLegacyAnalysis::ID = 0;
+INITIALIZE_PASS_BEGIN(MIR2VecVocabLegacyAnalysis, "mir2vec-vocab-analysis",
+ "MIR2Vec Vocabulary Analysis", false, true)
+INITIALIZE_PASS_DEPENDENCY(MachineModuleInfoWrapperPass)
+INITIALIZE_PASS_END(MIR2VecVocabLegacyAnalysis, "mir2vec-vocab-analysis",
+ "MIR2Vec Vocabulary Analysis", false, true)
- // No machine functions available - return error
- return createStringError(errc::invalid_argument,
- "No machine functions found in module");
+StringRef MIR2VecVocabLegacyAnalysis::getPassName() const {
+ return "MIR2Vec Vocabulary Analysis";
}
//===----------------------------------------------------------------------===//
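The rewritten `getVocabulary` walks the module and returns from the first defined function whose machine function exposes both instruction info and register info. If no function qualifies, it reports "No machine functions found in module". The removed code instead returned a dedicated error when `TargetInstrInfo` or `TargetRegisterInfo` was missing. The sketch below mimics only the new control flow, using stand-in types: `FakeFunction`, `FakeMachineFunction`, and `findVocabSource` are hypothetical and not part of LLVM.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Stand-ins for the LLVM objects consulted by the loop; all hypothetical.
struct FakeMachineFunction {
  bool HasInstrInfo;
  bool HasRegInfo;
};

struct FakeFunction {
  std::string Name;
  bool IsDeclaration;
  std::optional<FakeMachineFunction> MF; // empty: no machine function parsed
};

// Returns the name of the first defined function that can supply both kinds
// of target info, or std::nullopt when the search falls through.
std::optional<std::string>
findVocabSource(const std::vector<FakeFunction> &Module) {
  for (const FakeFunction &F : Module) {
    if (F.IsDeclaration)
      continue;
    if (F.MF && F.MF->HasInstrInfo && F.MF->HasRegInfo)
      return F.Name;
  }
  return std::nullopt;
}
```

With these stand-ins, a module whose only defined function lacks register info falls through to the empty result, which corresponds to the generic "No machine functions found in module" error in the patch.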
diff --git a/llvm/test/tools/llvm-ir2vec/embeddings-symbolic.mir b/llvm/test/tools/llvm-ir2vec/embeddings-symbolic.mir
new file mode 100644
index 0000000000000..e5f78bfd2090e
--- /dev/null
+++ b/llvm/test/tools/llvm-ir2vec/embeddings-symbolic.mir
@@ -0,0 +1,92 @@
+# REQUIRES: x86_64-linux
+# RUN: llvm-ir2vec embeddings --mode=mir --mir2vec-vocab-path=%S/../../CodeGen/MIR2Vec/Inputs/mir2vec_dummy_3D_vocab.json %s | FileCheck %s -check-prefix=CHECK-DEFAULT
+# RUN: llvm-ir2vec embeddings --mode=mir --level=func --mir2vec-vocab-path=%S/../../CodeGen/MIR2Vec/Inputs/mir2vec_dummy_3D_vocab.json %s | FileCheck %s -check-prefix=CHECK-FUNC-LEVEL
+# RUN: llvm-ir2vec embeddings --mode=mir --level=func --function=add_function --mir2vec-vocab-path=%S/../../CodeGen/MIR2Vec/Inputs/mir2vec_dummy_3D_vocab.json %s | FileCheck %s -check-prefix=CHECK-FUNC-LEVEL-ADD
+# RUN: not llvm-ir2vec embeddings --mode=mir --level=func --function=missing_function --mir2vec-vocab-path=%S/../../CodeGen/MIR2Vec/Inputs/mir2vec_dummy_3D_vocab.json %s 2>&1 | FileCheck %s -check-prefix=CHECK-FUNC-MISSING
+# RUN: llvm-ir2vec embeddings --mode=mir --level=bb --mir2vec-vocab-path=%S/../../CodeGen/MIR2Vec/Inputs/mir2vec_dummy_3D_vocab.json %s | FileCheck %s -check-prefix=CHECK-BB-LEVEL
+# RUN: llvm-ir2vec embeddings --mode=mir --level=inst --function=add_function --mir2vec-vocab-path=%S/../../CodeGen/MIR2Vec/Inputs/mir2vec_dummy_3D_vocab.json %s | FileCheck %s -check-prefix=CHECK-INST-LEVEL
+
+--- |
+ target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128"
+ target triple = "x86_64-unknown-linux-gnu"
+
+ define dso_local noundef i32 @add_function(i32 noundef %a, i32 noundef %b) {
+ entry:
+ %sum = add nsw i32 %a, %b
+ %result = mul nsw i32 %sum, 2
+ ret i32 %result
+ }
+
+ define dso_local void @simple_function() {
+ entry:
+ ret void
+ }
+...
+---
+name: add_function
+alignment: 16
+tracksRegLiveness: true
+registers:
+ - { id: 0, class: gr32 }
+ - { id: 1, class: gr32 }
+ - { id: 2, class: gr32 }
+ - { id: 3, class: gr32 }
+liveins:
+ - { reg: '$edi', virtual-reg: '%0' }
+ - { reg: '$esi', virtual-reg: '%1' }
+body: |
+ bb.0.entry:
+ liveins: $edi, $esi
+
+ %1:gr32 = COPY $esi
+ %0:gr32 = COPY $edi
+ %2:gr32 = nsw ADD32rr %0, %1, implicit-def dead $eflags
+ %3:gr32 = ADD32rr %2, %2, implicit-def dead $eflags
+ $eax = COPY %3
+ RET 0, $eax
+
+---
+name: simple_function
+alignment: 16
+tracksRegLiveness: true
+body: |
+ bb.0.entry:
+ RET 0
+
+# CHECK-DEFAULT: MIR2Vec embeddings for machine function add_function:
+# CHECK-DEFAULT-NEXT: Function vector: [ 26.50 27.10 27.70 ]
+# CHECK-DEFAULT: MIR2Vec embeddings for machine function simple_function:
+# CHECK-DEFAULT-NEXT: Function vector: [ 1.10 1.20 1.30 ]
+
+# CHECK-FUNC-LEVEL: MIR2Vec embeddings for machine function add_function:
+# CHECK-FUNC-LEVEL-NEXT: Function vector: [ 26.50 27.10 27.70 ]
+# CHECK-FUNC-LEVEL: MIR2Vec embeddings for machine function simple_function:
+# CHECK-FUNC-LEVEL-NEXT: Function vector: [ 1.10 1.20 1.30 ]
+
+# CHECK-FUNC-LEVEL-ADD: MIR2Vec embeddings for machine function add_function:
+# CHECK-FUNC-LEVEL-ADD-NEXT: Function vector: [ 26.50 27.10 27.70 ]
+# CHECK-FUNC-LEVEL-ADD-NOT: simple_function
+
+# CHECK-FUNC-MISSING: Error: Function 'missing_function' not found
+
+# CHECK-BB-LEVEL: MIR2Vec embeddings for machine function add_function:
+# CHECK-BB-LEVEL-NEXT: Basic block vectors:
+# CHECK-BB-LEVEL-NEXT: MBB entry: [ 26.50 27.10 27.70 ]
+# CHECK-BB-LEVEL: MIR2Vec embeddings for machine function simple_function:
+# CHECK-BB-LEVEL-NEXT: Basic block vectors:
+# CHECK-BB-LEVEL-NEXT: MBB entry: [ 1.10 1.20 1.30 ]
+
+# CHECK-INST-LEVEL: MIR2Vec embeddings for machine function add_function:
+# CHECK-INST-LEVEL-NEXT: Instruction vectors:
+# CHECK-INST-LEVEL: %1:gr32 = COPY $esi
+# CHECK-INST-LEVEL-NEXT: -> [ 6.00 6.10 6.20 ]
+# CHECK-INST-LEVEL-NEXT: %0:gr32 = COPY $edi
+# CHECK-INST-LEVEL-NEXT: -> [ 6.00 6.10 6.20 ]
+# CHECK-INST-LEVEL: %2:gr32 = nsw ADD32rr
+# CHECK-INST-LEVEL: -> [ 3.70 3.80 3.90 ]
+# CHECK-INST-LEVEL: %3:gr32 = ADD32rr
+# CHECK-INST-LEVEL: -> [ 3.70 3.80 3.90 ]
+# CHECK-INST-LEVEL: $eax = COPY %3:gr32
+# CHECK-INST-LEVEL-NEXT: -> [ 6.00 6.10 6.20 ]
+# CHECK-INST-LEVEL: RET 0, $eax
+# CHECK-INST-LEVEL-NEXT: -> [ 1.10 1.20 1.30 ]
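The expected values in this test are consistent with the function-level vector of `add_function` being the element-wise sum of its six instruction vectors (three `COPY`, two `ADD32rr`, one `RET`). The single basic block carries the same vector. This is an observation drawn from the CHECK lines, not a statement about how the embedder is implemented. The small sketch below redoes the sum; `sumVectors`, `approxEqual`, and `addFunctionInstVectors` are illustrative helpers, not part of the patch.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec3 = std::array<double, 3>;

// Element-wise sum of a list of 3-D vectors.
Vec3 sumVectors(const std::vector<Vec3> &Vs) {
  Vec3 Sum{0.0, 0.0, 0.0};
  for (const Vec3 &V : Vs)
    for (std::size_t I = 0; I < Sum.size(); ++I)
      Sum[I] += V[I];
  return Sum;
}

// Compare with a tolerance, since the test prints two decimal places.
bool approxEqual(const Vec3 &A, const Vec3 &B, double Eps = 1e-6) {
  for (std::size_t I = 0; I < A.size(); ++I)
    if (std::fabs(A[I] - B[I]) > Eps)
      return false;
  return true;
}

// Instruction vectors copied from the CHECK-INST-LEVEL lines, in order.
std::vector<Vec3> addFunctionInstVectors() {
  const Vec3 Copy{6.00, 6.10, 6.20};
  const Vec3 Add{3.70, 3.80, 3.90};
  const Vec3 Ret{1.10, 1.20, 1.30};
  return {Copy, Copy, Add, Add, Copy, Ret};
}
```

Summing the result of `addFunctionInstVectors()` gives `[ 26.50 27.10 27.70 ]`, matching the CHECK-FUNC-LEVEL and CHECK-BB-LEVEL lines. For `simple_function`, the lone `RET` vector `[ 1.10 1.20 1.30 ]` is already the function vector.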
diff --git a/llvm/test/tools/llvm-ir2vec/error-handling.mir b/llvm/test/...
[truncated]
- const TargetInstrInfo *TII = Subtarget.getInstrInfo();
- if (!TII) {
- return createStringError(errc::invalid_argument,
- "No TargetInstrInfo available; cannot create "
- "MIR2Vec vocabulary");
- }
-
- const TargetRegisterInfo *TRI = Subtarget.getRegisterInfo();
- if (!TRI) {
- return createStringError(errc::invalid_argument,
- "No TargetRegisterInfo available; cannot "
- "create MIR2Vec vocabulary");
- }
-
- return mir2vec::MIRVocabulary::create(
- std::move(OpcMap), std::move(CommonOperandMap), std::move(PhyRegMap),
- std::move(VirtRegMap), *TII, *TRI, MF->getRegInfo());
- }
- }
+char MIR2VecVocabLegacyAnalysis::ID = 0;
+INITIALIZE_PASS_BEGIN(MIR2VecVocabLegacyAnalysis, "mir2vec-vocab-analysis",
+ "MIR2Vec Vocabulary Analysis", false, true)
+INITIALIZE_PASS_DEPENDENCY(MachineModuleInfoWrapperPass)
+INITIALIZE_PASS_END(MIR2VecVocabLegacyAnalysis, "mir2vec-vocab-analysis",
+ "MIR2Vec Vocabulary Analysis", false, true)
- // No machine functions available - return error
- return createStringError(errc::invalid_argument,
- "No machine functions found in module");
+StringRef MIR2VecVocabLegacyAnalysis::getPassName() const {
+ return "MIR2Vec Vocabulary Analysis";
}
//===----------------------------------------------------------------------===//
diff --git a/llvm/test/tools/llvm-ir2vec/embeddings-symbolic.mir b/llvm/test/tools/llvm-ir2vec/embeddings-symbolic.mir
new file mode 100644
index 0000000000000..e5f78bfd2090e
--- /dev/null
+++ b/llvm/test/tools/llvm-ir2vec/embeddings-symbolic.mir
@@ -0,0 +1,92 @@
+# REQUIRES: x86_64-linux
+# RUN: llvm-ir2vec embeddings --mode=mir --mir2vec-vocab-path=%S/../../CodeGen/MIR2Vec/Inputs/mir2vec_dummy_3D_vocab.json %s | FileCheck %s -check-prefix=CHECK-DEFAULT
+# RUN: llvm-ir2vec embeddings --mode=mir --level=func --mir2vec-vocab-path=%S/../../CodeGen/MIR2Vec/Inputs/mir2vec_dummy_3D_vocab.json %s | FileCheck %s -check-prefix=CHECK-FUNC-LEVEL
+# RUN: llvm-ir2vec embeddings --mode=mir --level=func --function=add_function --mir2vec-vocab-path=%S/../../CodeGen/MIR2Vec/Inputs/mir2vec_dummy_3D_vocab.json %s | FileCheck %s -check-prefix=CHECK-FUNC-LEVEL-ADD
+# RUN: not llvm-ir2vec embeddings --mode=mir --level=func --function=missing_function --mir2vec-vocab-path=%S/../../CodeGen/MIR2Vec/Inputs/mir2vec_dummy_3D_vocab.json %s 2>&1 | FileCheck %s -check-prefix=CHECK-FUNC-MISSING
+# RUN: llvm-ir2vec embeddings --mode=mir --level=bb --mir2vec-vocab-path=%S/../../CodeGen/MIR2Vec/Inputs/mir2vec_dummy_3D_vocab.json %s | FileCheck %s -check-prefix=CHECK-BB-LEVEL
+# RUN: llvm-ir2vec embeddings --mode=mir --level=inst --function=add_function --mir2vec-vocab-path=%S/../../CodeGen/MIR2Vec/Inputs/mir2vec_dummy_3D_vocab.json %s | FileCheck %s -check-prefix=CHECK-INST-LEVEL
+
+--- |
+ target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128"
+ target triple = "x86_64-unknown-linux-gnu"
+
+ define dso_local noundef i32 @add_function(i32 noundef %a, i32 noundef %b) {
+ entry:
+ %sum = add nsw i32 %a, %b
+ %result = mul nsw i32 %sum, 2
+ ret i32 %result
+ }
+
+ define dso_local void @simple_function() {
+ entry:
+ ret void
+ }
+...
+---
+name: add_function
+alignment: 16
+tracksRegLiveness: true
+registers:
+ - { id: 0, class: gr32 }
+ - { id: 1, class: gr32 }
+ - { id: 2, class: gr32 }
+ - { id: 3, class: gr32 }
+liveins:
+ - { reg: '$edi', virtual-reg: '%0' }
+ - { reg: '$esi', virtual-reg: '%1' }
+body: |
+ bb.0.entry:
+ liveins: $edi, $esi
+
+ %1:gr32 = COPY $esi
+ %0:gr32 = COPY $edi
+ %2:gr32 = nsw ADD32rr %0, %1, implicit-def dead $eflags
+ %3:gr32 = ADD32rr %2, %2, implicit-def dead $eflags
+ $eax = COPY %3
+ RET 0, $eax
+
+---
+name: simple_function
+alignment: 16
+tracksRegLiveness: true
+body: |
+ bb.0.entry:
+ RET 0
+
+# CHECK-DEFAULT: MIR2Vec embeddings for machine function add_function:
+# CHECK-DEFAULT-NEXT: Function vector: [ 26.50 27.10 27.70 ]
+# CHECK-DEFAULT: MIR2Vec embeddings for machine function simple_function:
+# CHECK-DEFAULT-NEXT: Function vector: [ 1.10 1.20 1.30 ]
+
+# CHECK-FUNC-LEVEL: MIR2Vec embeddings for machine function add_function:
+# CHECK-FUNC-LEVEL-NEXT: Function vector: [ 26.50 27.10 27.70 ]
+# CHECK-FUNC-LEVEL: MIR2Vec embeddings for machine function simple_function:
+# CHECK-FUNC-LEVEL-NEXT: Function vector: [ 1.10 1.20 1.30 ]
+
+# CHECK-FUNC-LEVEL-ADD: MIR2Vec embeddings for machine function add_function:
+# CHECK-FUNC-LEVEL-ADD-NEXT: Function vector: [ 26.50 27.10 27.70 ]
+# CHECK-FUNC-LEVEL-ADD-NOT: simple_function
+
+# CHECK-FUNC-MISSING: Error: Function 'missing_function' not found
+
+# CHECK-BB-LEVEL: MIR2Vec embeddings for machine function add_function:
+# CHECK-BB-LEVEL-NEXT: Basic block vectors:
+# CHECK-BB-LEVEL-NEXT: MBB entry: [ 26.50 27.10 27.70 ]
+# CHECK-BB-LEVEL: MIR2Vec embeddings for machine function simple_function:
+# CHECK-BB-LEVEL-NEXT: Basic block vectors:
+# CHECK-BB-LEVEL-NEXT: MBB entry: [ 1.10 1.20 1.30 ]
+
+# CHECK-INST-LEVEL: MIR2Vec embeddings for machine function add_function:
+# CHECK-INST-LEVEL-NEXT: Instruction vectors:
+# CHECK-INST-LEVEL: %1:gr32 = COPY $esi
+# CHECK-INST-LEVEL-NEXT: -> [ 6.00 6.10 6.20 ]
+# CHECK-INST-LEVEL-NEXT: %0:gr32 = COPY $edi
+# CHECK-INST-LEVEL-NEXT: -> [ 6.00 6.10 6.20 ]
+# CHECK-INST-LEVEL: %2:gr32 = nsw ADD32rr
+# CHECK-INST-LEVEL: -> [ 3.70 3.80 3.90 ]
+# CHECK-INST-LEVEL: %3:gr32 = ADD32rr
+# CHECK-INST-LEVEL: -> [ 3.70 3.80 3.90 ]
+# CHECK-INST-LEVEL: $eax = COPY %3:gr32
+# CHECK-INST-LEVEL-NEXT: -> [ 6.00 6.10 6.20 ]
+# CHECK-INST-LEVEL: RET 0, $eax
+# CHECK-INST-LEVEL-NEXT: -> [ 1.10 1.20 1.30 ]
diff --git a/llvm/test/tools/llvm-ir2vec/error-handling.mir b/llvm/test/...
[truncated]
Add MIR2Vec support to the llvm-ir2vec tool, enabling embedding generation for Machine IR alongside the existing LLVM IR functionality.
(This is an initial integration; entity and triplet generation for vocabulary training will follow in separate patches.)