95 changes: 76 additions & 19 deletions llvm/docs/CommandGuide/llvm-ir2vec.rst
@@ -1,5 +1,5 @@
llvm-ir2vec - IR2Vec Embedding Generation Tool
==============================================
llvm-ir2vec - IR2Vec and MIR2Vec Embedding Generation Tool
===========================================================

.. program:: llvm-ir2vec

@@ -11,9 +11,9 @@ SYNOPSIS
DESCRIPTION
-----------

:program:`llvm-ir2vec` is a standalone command-line tool for IR2Vec. It
generates IR2Vec embeddings for LLVM IR and supports triplet generation
for vocabulary training.
:program:`llvm-ir2vec` is a standalone command-line tool for IR2Vec and MIR2Vec.
It generates embeddings for both LLVM IR and Machine IR (MIR) and supports
triplet generation for vocabulary training.

The tool provides three main subcommands:

@@ -23,23 +23,33 @@ The tool provides three main subcommands:
2. **entities**: Generates entity mapping files (entity2id.txt) for vocabulary
training.

3. **embeddings**: Generates IR2Vec embeddings using a trained vocabulary
3. **embeddings**: Generates IR2Vec or MIR2Vec embeddings using a trained vocabulary
at different granularity levels (instruction, basic block, or function).

The tool supports two operation modes:

* **LLVM IR mode** (``--mode=llvm``): Process LLVM IR or bitcode files (.ll/.bc)
and generate IR2Vec embeddings.
* **Machine IR mode** (``--mode=mir``): Process Machine IR (.mir) files and generate
MIR2Vec embeddings.

The tool is designed to facilitate machine learning applications that work with
LLVM IR by converting the IR into numerical representations that can be used by
ML models. The `triplets` subcommand generates numeric IDs directly instead of string
triplets, streamlining the training data preparation workflow.
LLVM IR or Machine IR by converting them into numerical representations that can
be used by ML models. The `triplets` subcommand generates numeric IDs directly
instead of string triplets, streamlining the training data preparation workflow.

.. note::

For information about using IR2Vec programmatically within LLVM passes and
the C++ API, see the `IR2Vec Embeddings <https://llvm.org/docs/MLGO.html#ir2vec-embeddings>`_
For information about using IR2Vec and MIR2Vec programmatically within LLVM
passes and the C++ API, see the `IR2Vec Embeddings <https://llvm.org/docs/MLGO.html#ir2vec-embeddings>`_
section in the MLGO documentation.

OPERATION MODES
---------------

The tool operates in two modes: **LLVM IR mode** and **Machine IR mode**. The mode
is selected using the ``--mode`` option (default: ``llvm``).

The triplet generation and entity mapping subcommands prepare vocabulary and
training data for knowledge graph embeddings. The embeddings subcommand
generates embeddings from LLVM IR or Machine IR using a pre-trained vocabulary.
@@ -89,18 +99,31 @@ Embedding Generation
~~~~~~~~~~~~~~~~~~~~

With the `embeddings` subcommand, :program:`llvm-ir2vec` uses a pre-trained vocabulary to
generate numerical embeddings for LLVM IR at different levels of granularity.
generate numerical embeddings for LLVM IR or Machine IR at different levels of granularity.

Example Usage for LLVM IR:

.. code-block:: bash

llvm-ir2vec embeddings --mode=llvm --ir2vec-vocab-path=vocab.json --ir2vec-kind=symbolic --level=func input.bc -o embeddings.txt

Example Usage:
Example Usage for Machine IR:

.. code-block:: bash

llvm-ir2vec embeddings --ir2vec-vocab-path=vocab.json --ir2vec-kind=symbolic --level=func input.bc -o embeddings.txt
llvm-ir2vec embeddings --mode=mir --mir2vec-vocab-path=vocab.json --level=func input.mir -o embeddings.txt

OPTIONS
-------

Global options:
Common options (applicable to both LLVM IR and Machine IR modes):

.. option:: --mode=<mode>

Specify the operation mode. Valid values are:

* ``llvm`` - Process LLVM IR or bitcode files (.ll/.bc). This is the default.
* ``mir`` - Process Machine IR (.mir) files

.. option:: -o <filename>

@@ -116,8 +139,8 @@ Subcommand-specific options:

.. option:: <input-file>

The input LLVM IR or bitcode file to process. This positional argument is
required for the `embeddings` subcommand.
The input LLVM IR/bitcode file (.ll/.bc) or Machine IR file (.mir) to process.
This positional argument is required for the `embeddings` subcommand.

.. option:: --level=<level>

@@ -131,6 +154,8 @@ Subcommand-specific options:

Process only the specified function instead of all functions in the module.

**IR2Vec-specific options** (for ``--mode=llvm``):

.. option:: --ir2vec-kind=<kind>

Specify the kind of IR2Vec embeddings to generate. Valid values are:
@@ -143,8 +168,8 @@ Subcommand-specific options:

.. option:: --ir2vec-vocab-path=<path>

Specify the path to the vocabulary file (required for embedding generation).
The vocabulary file should be in JSON format and contain the trained
Specify the path to the IR2Vec vocabulary file (required for LLVM IR embedding
generation). The vocabulary file should be in JSON format and contain the trained
vocabulary for embedding generation. See `llvm/lib/Analysis/models`
for pre-trained vocabulary files.

@@ -163,6 +188,35 @@ Subcommand-specific options:
Specify the weight for argument embeddings (default: 0.2). This controls
the relative importance of operand information in the final embedding.

**MIR2Vec-specific options** (for ``--mode=mir``):

.. option:: --mir2vec-vocab-path=<path>

Specify the path to the MIR2Vec vocabulary file (required for Machine IR
embedding generation). The vocabulary file should be in JSON format and
contain the trained vocabulary for embedding generation.

.. option:: --mir2vec-kind=<kind>

Specify the kind of MIR2Vec embeddings to generate. Valid values are:

* ``symbolic`` - Generate symbolic embeddings (default)

.. option:: --mir2vec-opc-weight=<weight>

Specify the weight for machine opcode embeddings (default: 1.0). This controls
the relative importance of machine instruction opcodes in the final embedding.

.. option:: --mir2vec-common-operand-weight=<weight>

Specify the weight for common operand embeddings (default: 1.0). This controls
the relative importance of common operand types in the final embedding.

.. option:: --mir2vec-reg-operand-weight=<weight>

Specify the weight for register operand embeddings (default: 1.0). This controls
the relative importance of register operands in the final embedding.


**triplets** subcommand:

@@ -240,3 +294,6 @@ SEE ALSO

For more information about the IR2Vec algorithm and approach, see:
`IR2Vec: LLVM IR Based Scalable Program Embeddings <https://doi.org/10.1145/3418463>`_.

For more information about the MIR2Vec algorithm and approach, see:
`RL4ReAl: Reinforcement Learning for Register Allocation <https://doi.org/10.1145/3578360.3580273>`_.
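
The three ``--mir2vec-*-weight`` options above are documented as controlling the relative importance of each component in the final embedding. One way to picture that is a weighted element-wise sum, sketched below. This is an assumption for illustration only: the diff does not show how MIR2Vec actually combines the components, and `Embedding`, `Weights`, and `combine` are hypothetical names, not LLVM API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an embedding vector; not the LLVM type.
using Embedding = std::vector<double>;

// Hypothetical weights mirroring the documented defaults of 1.0.
struct Weights {
  double Opc = 1.0;           // --mir2vec-opc-weight
  double CommonOperand = 1.0; // --mir2vec-common-operand-weight
  double RegOperand = 1.0;    // --mir2vec-reg-operand-weight
};

// Assumption: the final embedding is a weighted element-wise sum of the
// opcode, common operand, and register operand embeddings.
Embedding combine(const Embedding &Opc, const Embedding &Common,
                  const Embedding &Reg, const Weights &W) {
  assert(Opc.size() == Common.size() && Opc.size() == Reg.size() &&
         "all components must share one dimension");
  Embedding Out(Opc.size());
  for (std::size_t I = 0; I < Opc.size(); ++I)
    Out[I] = W.Opc * Opc[I] + W.CommonOperand * Common[I] +
             W.RegOperand * Reg[I];
  return Out;
}
```

Under this reading, setting a weight to 0 removes that component's contribution entirely, while the default of 1.0 for all three weights treats the components equally.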
57 changes: 50 additions & 7 deletions llvm/include/llvm/CodeGen/MIR2Vec.h
@@ -7,9 +7,20 @@
//===----------------------------------------------------------------------===//
///
/// \file
/// This file defines the MIR2Vec vocabulary
/// analysis(MIR2VecVocabLegacyAnalysis), the core mir2vec::MIREmbedder
/// interface for generating Machine IR embeddings, and related utilities.
/// This file defines the MIR2Vec framework for generating Machine IR
/// embeddings.
///
/// Architecture Overview:
/// ----------------------
/// 1. MIR2VecVocabProvider - Core vocabulary loading logic (no PM dependency)
/// - Can be used standalone or wrapped by the pass manager
/// - Requires MachineModuleInfo with parsed machine functions
///
/// 2. MIR2VecVocabLegacyAnalysis - Pass manager wrapper (ImmutablePass)
/// - Integrated and used by llc -print-mir2vec
///
/// 3. MIREmbedder - Generates embeddings from vocabulary
/// - SymbolicMIREmbedder: MIR2Vec embedding implementation
///
/// MIR2Vec extends IR2Vec to support Machine IR embeddings. It represents the
/// LLVM Machine IR as embeddings which can be used as input to machine learning
@@ -306,26 +317,58 @@ class SymbolicMIREmbedder : public MIREmbedder {

} // namespace mir2vec

/// MIR2Vec vocabulary provider used by pass managers and standalone tools.
/// This class encapsulates the core vocabulary loading logic and can be used
/// independently of the pass manager infrastructure. For pass-based usage,
/// see MIR2VecVocabLegacyAnalysis.
///
/// Note: This provider pattern keeps a future migration to the new pass
/// manager straightforward. A new PM analysis wrapper can be added that
/// delegates to this provider, similar to how MIR2VecVocabLegacyAnalysis
/// currently wraps it.
class MIR2VecVocabProvider {
  using VocabMap = std::map<std::string, mir2vec::Embedding>;

public:
  MIR2VecVocabProvider(const MachineModuleInfo &MMI) : MMI(MMI) {}

  Expected<mir2vec::MIRVocabulary> getVocabulary(const Module &M);

private:
  Error readVocabulary(VocabMap &OpcVocab, VocabMap &CommonOperandVocab,
                       VocabMap &PhyRegVocabMap, VocabMap &VirtRegVocabMap);
  const MachineModuleInfo &MMI;
};

/// Pass to analyze and populate MIR2Vec vocabulary from a module
class MIR2VecVocabLegacyAnalysis : public ImmutablePass {
using VocabVector = std::vector<mir2vec::Embedding>;
using VocabMap = std::map<std::string, mir2vec::Embedding>;
std::optional<mir2vec::MIRVocabulary> Vocab;

StringRef getPassName() const override;
Error readVocabulary(VocabMap &OpcVocab, VocabMap &CommonOperandVocab,
VocabMap &PhyRegVocabMap, VocabMap &VirtRegVocabMap);

protected:
void getAnalysisUsage(AnalysisUsage &AU) const override {
AU.addRequired<MachineModuleInfoWrapperPass>();
AU.setPreservesAll();
}
std::unique_ptr<MIR2VecVocabProvider> Provider;

public:
static char ID;
MIR2VecVocabLegacyAnalysis() : ImmutablePass(ID) {}
Expected<mir2vec::MIRVocabulary> getMIR2VecVocabulary(const Module &M);

Expected<mir2vec::MIRVocabulary> getMIR2VecVocabulary(const Module &M) {
MachineModuleInfo &MMI =
getAnalysis<MachineModuleInfoWrapperPass>().getMMI();
if (!Provider)
Provider = std::make_unique<MIR2VecVocabProvider>(MMI);
return Provider->getVocabulary(M);
}

MIR2VecVocabProvider &getProvider() {
assert(Provider && "Provider not initialized");
return *Provider;
}
};

/// This pass prints the embeddings in the MIR2Vec vocabulary
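
The split between MIR2VecVocabProvider and MIR2VecVocabLegacyAnalysis in this header follows a provider-plus-wrapper shape: the core logic has no pass-manager dependency, and the wrapper creates the provider lazily on first use. The self-contained sketch below models only that shape. `ModuleInfo`, `VocabProvider`, and `LegacyWrapper` are hypothetical stand-ins, not the LLVM classes, and the string return value replaces the real `Expected<MIRVocabulary>`.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-in for MachineModuleInfo.
struct ModuleInfo {
  std::string Target;
};

// Core logic with no pass-manager dependency; usable from a standalone tool.
class VocabProvider {
public:
  explicit VocabProvider(const ModuleInfo &MI) : MI(MI) {}
  std::string getVocabulary() const { return "vocab for " + MI.Target; }

private:
  const ModuleInfo &MI;
};

// Thin wrapper that creates the provider on first use and delegates to it.
class LegacyWrapper {
public:
  explicit LegacyWrapper(const ModuleInfo &MI) : MI(MI) {}

  std::string getVocabulary() {
    if (!Provider)
      Provider = std::make_unique<VocabProvider>(MI);
    return Provider->getVocabulary();
  }

  bool hasProvider() const { return Provider != nullptr; }

  // Only valid after getVocabulary() has run at least once.
  VocabProvider &getProvider() {
    assert(Provider && "Provider not initialized");
    return *Provider;
  }

private:
  const ModuleInfo &MI;
  std::unique_ptr<VocabProvider> Provider;
};
```

One consequence visible in the sketch, and in the header as written in this diff, is an ordering requirement: `getProvider()` asserts unless the vocabulary getter has already been called, so callers need to request the vocabulary first.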
91 changes: 36 additions & 55 deletions llvm/lib/CodeGen/MIR2Vec.cpp
@@ -412,24 +412,39 @@ Expected<MIRVocabulary> MIRVocabulary::createDummyVocabForTest(
}

//===----------------------------------------------------------------------===//
// MIR2VecVocabLegacyAnalysis Implementation
// MIR2VecVocabProvider and MIR2VecVocabLegacyAnalysis
//===----------------------------------------------------------------------===//

char MIR2VecVocabLegacyAnalysis::ID = 0;
INITIALIZE_PASS_BEGIN(MIR2VecVocabLegacyAnalysis, "mir2vec-vocab-analysis",
"MIR2Vec Vocabulary Analysis", false, true)
INITIALIZE_PASS_DEPENDENCY(MachineModuleInfoWrapperPass)
INITIALIZE_PASS_END(MIR2VecVocabLegacyAnalysis, "mir2vec-vocab-analysis",
"MIR2Vec Vocabulary Analysis", false, true)
Expected<mir2vec::MIRVocabulary>
MIR2VecVocabProvider::getVocabulary(const Module &M) {
VocabMap OpcVocab, CommonOperandVocab, PhyRegVocabMap, VirtRegVocabMap;

StringRef MIR2VecVocabLegacyAnalysis::getPassName() const {
return "MIR2Vec Vocabulary Analysis";
  if (Error Err = readVocabulary(OpcVocab, CommonOperandVocab, PhyRegVocabMap,
                                 VirtRegVocabMap))
    return std::move(Err);

  for (const auto &F : M) {
    if (F.isDeclaration())
      continue;

    if (auto *MF = MMI.getMachineFunction(F)) {
      auto &Subtarget = MF->getSubtarget();
      if (const auto *TII = Subtarget.getInstrInfo())
        if (const auto *TRI = Subtarget.getRegisterInfo())
          return mir2vec::MIRVocabulary::create(
              std::move(OpcVocab), std::move(CommonOperandVocab),
              std::move(PhyRegVocabMap), std::move(VirtRegVocabMap), *TII, *TRI,
              MF->getRegInfo());
    }
  }
  return createStringError(errc::invalid_argument,
                           "No machine functions found in module");
}

Error MIR2VecVocabLegacyAnalysis::readVocabulary(VocabMap &OpcodeVocab,
VocabMap &CommonOperandVocab,
VocabMap &PhyRegVocabMap,
VocabMap &VirtRegVocabMap) {
Error MIR2VecVocabProvider::readVocabulary(VocabMap &OpcodeVocab,
VocabMap &CommonOperandVocab,
VocabMap &PhyRegVocabMap,
VocabMap &VirtRegVocabMap) {
if (VocabFile.empty())
return createStringError(
errc::invalid_argument,
@@ -478,49 +493,15 @@ Error MIR2VecVocabLegacyAnalysis::readVocabulary(VocabMap &OpcodeVocab,
return Error::success();
}

Expected<mir2vec::MIRVocabulary>
MIR2VecVocabLegacyAnalysis::getMIR2VecVocabulary(const Module &M) {
if (Vocab.has_value())
return std::move(Vocab.value());

VocabMap OpcMap, CommonOperandMap, PhyRegMap, VirtRegMap;
if (Error Err =
readVocabulary(OpcMap, CommonOperandMap, PhyRegMap, VirtRegMap))
return std::move(Err);

// Get machine module info to access machine functions and target info
MachineModuleInfo &MMI = getAnalysis<MachineModuleInfoWrapperPass>().getMMI();

// Find first available machine function to get target instruction info
for (const auto &F : M) {
if (F.isDeclaration())
continue;

if (auto *MF = MMI.getMachineFunction(F)) {
auto &Subtarget = MF->getSubtarget();
const TargetInstrInfo *TII = Subtarget.getInstrInfo();
if (!TII) {
return createStringError(errc::invalid_argument,
"No TargetInstrInfo available; cannot create "
"MIR2Vec vocabulary");
}

const TargetRegisterInfo *TRI = Subtarget.getRegisterInfo();
if (!TRI) {
return createStringError(errc::invalid_argument,
"No TargetRegisterInfo available; cannot "
"create MIR2Vec vocabulary");
}

return mir2vec::MIRVocabulary::create(
std::move(OpcMap), std::move(CommonOperandMap), std::move(PhyRegMap),
std::move(VirtRegMap), *TII, *TRI, MF->getRegInfo());
}
}
char MIR2VecVocabLegacyAnalysis::ID = 0;
INITIALIZE_PASS_BEGIN(MIR2VecVocabLegacyAnalysis, "mir2vec-vocab-analysis",
"MIR2Vec Vocabulary Analysis", false, true)
INITIALIZE_PASS_DEPENDENCY(MachineModuleInfoWrapperPass)
INITIALIZE_PASS_END(MIR2VecVocabLegacyAnalysis, "mir2vec-vocab-analysis",
"MIR2Vec Vocabulary Analysis", false, true)

// No machine functions available - return error
return createStringError(errc::invalid_argument,
"No machine functions found in module");
StringRef MIR2VecVocabLegacyAnalysis::getPassName() const {
return "MIR2Vec Vocabulary Analysis";
}

//===----------------------------------------------------------------------===//
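
The control flow of the new MIR2VecVocabProvider::getVocabulary can be modelled without LLVM to make one behavioural detail easier to see. In the removed code, a null TargetInstrInfo or TargetRegisterInfo returned its own error message. With the nested `if` chain as it reads in this diff, such a function appears to be skipped and the loop moves on, so the only error left is the generic "No machine functions found in module". The sketch below is a simplified model under that reading. `Func` and `firstUsable` are hypothetical names, and the boolean fields stand in for the real pointer checks.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Hypothetical model of one function in a module.
struct Func {
  std::string Name;
  bool IsDeclaration;
  bool HasMachineFunction; // models MMI.getMachineFunction(F) != nullptr
  bool HasInstrInfo;       // models Subtarget.getInstrInfo() != nullptr
  bool HasRegisterInfo;    // models Subtarget.getRegisterInfo() != nullptr
};

// Returns the name of the first function usable for building a vocabulary,
// or std::nullopt when the loop finishes without finding one.
std::optional<std::string> firstUsable(const std::vector<Func> &Module) {
  for (const Func &F : Module) {
    assert(!F.Name.empty() && "functions in this model are named");
    if (F.IsDeclaration)
      continue;
    if (F.HasMachineFunction)
      if (F.HasInstrInfo)
        if (F.HasRegisterInfo)
          return F.Name;
    // A missing machine function, instr info, or register info falls
    // through to the next function instead of reporting a specific error.
  }
  return std::nullopt; // models "No machine functions found in module"
}
```

If the more specific diagnostics from the removed code are still wanted, the null checks would need explicit error returns again. Whether that matters in practice depends on whether a subtarget can return null here, which the diff does not settle.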