144 changes: 140 additions & 4 deletions llvm/docs/MLGO.rst
@@ -434,8 +434,27 @@ The latter is also used in tests.
There is no C++ implementation of a log reader. We do not have a scenario
motivating one.

Embeddings
==========

LLVM provides embedding frameworks to generate vector representations of code
at different abstraction levels. These embeddings capture syntactic, semantic,
and structural properties of the code and can be used as features for machine
learning models in various compiler optimization tasks.

Two embedding frameworks are available:

- **IR2Vec**: Generates embeddings for LLVM IR
- **MIR2Vec**: Generates embeddings for Machine IR

Both frameworks follow a similar architecture with vocabulary-based embedding
generation, where a vocabulary maps code entities to n-dimensional floating
point vectors. These embeddings can be computed at multiple granularity levels
(instruction, basic block, and function) and used for ML-guided compiler
optimizations.
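
As a rough illustration of the vocabulary-based scheme described above, the
following self-contained sketch looks up entity vectors in a vocabulary and
aggregates them into coarser embeddings. The names ``ToyEmbedding``,
``ToyVocabulary``, ``addInto``, ``embedInstruction``, and ``sumEmbeddings`` are
hypothetical and are not part of LLVM. Plain element-wise summation is an
assumption made for illustration; the actual embedders may weight or combine
vectors differently.

.. code-block:: c++

  #include <cassert>
  #include <cstddef>
  #include <map>
  #include <string>
  #include <vector>

  // Toy stand-ins for illustration only; these are NOT the LLVM types.
  using ToyEmbedding = std::vector<double>;
  using ToyVocabulary = std::map<std::string, ToyEmbedding>;

  // Element-wise add Src into Dst; both must share one dimension.
  inline void addInto(ToyEmbedding &Dst, const ToyEmbedding &Src) {
    assert(Dst.size() == Src.size() && "dimension mismatch");
    for (std::size_t I = 0; I < Dst.size(); ++I)
      Dst[I] += Src[I];
  }

  // Instruction level: sum the vectors of the entities (opcode, operands)
  // that make up one instruction. Unknown entities contribute nothing here.
  inline ToyEmbedding embedInstruction(const ToyVocabulary &Vocab,
                                       const std::vector<std::string> &Entities,
                                       std::size_t Dim) {
    ToyEmbedding Result(Dim, 0.0);
    for (const std::string &Entity : Entities) {
      auto It = Vocab.find(Entity);
      if (It != Vocab.end())
        addInto(Result, It->second);
    }
    return Result;
  }

  // Basic block and function levels: sum the embeddings one level below.
  inline ToyEmbedding sumEmbeddings(const std::vector<ToyEmbedding> &Parts,
                                    std::size_t Dim) {
    ToyEmbedding Result(Dim, 0.0);
    for (const ToyEmbedding &Part : Parts)
      addInto(Result, Part);
    return Result;
  }

In this sketch a basic block embedding would be ``sumEmbeddings`` over its
instruction embeddings, and a function embedding would be ``sumEmbeddings``
over its basic block embeddings.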

IR2Vec
------

IR2Vec is a program embedding approach designed specifically for LLVM IR. It
is implemented as a function analysis pass in LLVM. The IR2Vec embeddings
@@ -466,7 +485,7 @@ The core components are:
compute embeddings for instructions, basic blocks, and functions.

Using IR2Vec
^^^^^^^^^^^^

.. note::

@@ -526,7 +545,7 @@ embeddings can be computed and accessed via an ``ir2vec::Embedder`` instance.
between different code snippets, or perform other analyses as needed.

Further Details
^^^^^^^^^^^^^^^

For more detailed information about the IR2Vec algorithm, its parameters, and
advanced usage, please refer to the original paper:
@@ -538,6 +557,123 @@ triplets from LLVM IR, see :doc:`CommandGuide/llvm-ir2vec`.
The LLVM source code for ``IR2Vec`` can also be explored to understand the
implementation details.

MIR2Vec
-------

MIR2Vec extends IR2Vec to LLVM Machine IR (MIR). It generates embeddings for
machine instructions, machine basic blocks, and machine functions. Because it
operates on the target-specific machine representation, it captures machine
instruction semantics, including opcodes, operands, and register information.

MIR2Vec extends the vocabulary to include:

- **Machine Opcodes**: Target-specific instruction opcodes derived from
  ``TargetInstrInfo``, grouped by instruction semantics.

- **Common Operands**: All common operand types (excluding register operands),
  defined by the ``MachineOperand::MachineOperandType`` enum.

- **Physical Register Classes**: Register classes defined by the target,
  specialized for physical registers.

- **Virtual Register Classes**: Register classes defined by the target,
  specialized for virtual registers.

The core components are:

- **Vocabulary**: A mapping from machine IR entities (opcodes, operands,
  register classes) to their vector representations. This is managed by
  ``MIR2VecVocabLegacyAnalysis`` for the legacy pass manager, with a
  ``MIR2VecVocabProvider`` that can be used standalone or wrapped by pass
  managers. The vocabulary (a ``.json`` file) contains sections for opcodes,
  common operands, physical register classes, and virtual register classes.

  .. note::

     The vocabulary file must contain all four of these sections to be valid.

- **Embedder**: A class (``mir2vec::MIREmbedder``) that uses the vocabulary to
  compute embeddings for machine instructions, machine basic blocks, and
  machine functions. Currently, ``SymbolicMIREmbedder`` is the only available
  implementation.
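
To make the four-section layout of the vocabulary concrete, the sketch below
mirrors it with plain standard-library containers and checks that no section
is missing. ``ToySection``, ``ToyMIRVocab``, and ``isValidToyVocab`` are
hypothetical names, the field names are not the keys used in the JSON file, and
the additional check that all vectors share one dimension is an assumption made
for illustration rather than a documented rule.

.. code-block:: c++

  #include <cstddef>
  #include <map>
  #include <string>
  #include <vector>

  // Hypothetical in-memory mirror of the four vocabulary sections; the field
  // names are illustrative and are not the keys used in the JSON file.
  using ToySection = std::map<std::string, std::vector<double>>;

  struct ToyMIRVocab {
    ToySection Opcodes;
    ToySection CommonOperands;
    ToySection PhysicalRegisterClasses;
    ToySection VirtualRegisterClasses;
  };

  // Treated as valid here when every section is non-empty and all vectors
  // are non-empty and share one dimension.
  inline bool isValidToyVocab(const ToyMIRVocab &Vocab) {
    const ToySection *Sections[] = {&Vocab.Opcodes, &Vocab.CommonOperands,
                                    &Vocab.PhysicalRegisterClasses,
                                    &Vocab.VirtualRegisterClasses};
    std::size_t Dim = 0;
    for (const ToySection *Section : Sections) {
      if (Section->empty())
        return false;
      for (const auto &Entry : *Section) {
        if (Entry.second.empty())
          return false;
        if (Dim == 0)
          Dim = Entry.second.size();
        else if (Entry.second.size() != Dim)
          return false;
      }
    }
    return true;
  }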

Using MIR2Vec
^^^^^^^^^^^^^

.. note::

   This section describes how to use MIR2Vec within LLVM passes. The
   ``llvm-ir2vec`` tool (see :doc:`CommandGuide/llvm-ir2vec`) can generate
   MIR2Vec embeddings from Machine IR files (``.mir``), which is useful for
   generating embeddings outside of compiler passes.

To generate MIR2Vec embeddings in a compiler pass, first obtain the vocabulary,
then create an embedder instance to compute and access embeddings.

1. **Get the Vocabulary**:
   In a ``MachineFunctionPass``, get the vocabulary from the analysis:

   .. code-block:: c++

      // Assuming MF is the MachineFunction& being processed by the pass.
      auto &VocabAnalysis = getAnalysis<MIR2VecVocabLegacyAnalysis>();
      auto VocabOrErr = VocabAnalysis.getMIR2VecVocabulary(
          *MF.getFunction().getParent());
      if (!VocabOrErr) {
        // Handle error: vocabulary is not available or invalid. If the result
        // is an llvm::Expected, consume or report its error before returning.
        return;
      }
      const mir2vec::MIRVocabulary &Vocabulary = *VocabOrErr;

   Note that ``MIR2VecVocabLegacyAnalysis`` is an immutable pass.

2. **Create Embedder instance**:
   With the vocabulary, create an embedder for a specific machine function:

   .. code-block:: c++

      // Assuming MF is a MachineFunction&
      // For example, using MIR2VecKind::Symbolic:
      std::unique_ptr<mir2vec::MIREmbedder> Emb =
          mir2vec::MIREmbedder::create(MIR2VecKind::Symbolic, MF, Vocabulary);


3. **Compute and Access Embeddings**:
   Call ``getMFunctionVector()`` to get the embedding for the machine function.

   .. code-block:: c++

      mir2vec::Embedding FuncVector = Emb->getMFunctionVector();

   Currently, ``MIREmbedder`` can generate embeddings at three levels: machine
   instructions, machine basic blocks, and machine functions. Appropriate
   getters are provided to access the embeddings at these levels.

   .. note::

      The validity of the ``MIREmbedder`` instance (and the embeddings it
      generates) is tied to the machine function it is associated with. If the
      machine function is modified, the embeddings may become stale and should
      be recomputed accordingly.

4. **Working with Embeddings**:
   Embeddings are represented as ``std::vector<double>``. These vectors can be
   used as features for machine learning models, to compute similarity scores
   between different code snippets, or for other analyses as needed.
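
One common way to compute such a similarity score is cosine similarity. The
sketch below is a minimal, self-contained version that assumes the embedding
values are available as plain ``std::vector<double>`` objects of equal length;
``cosineSimilarity`` is a hypothetical helper and not an LLVM API.

.. code-block:: c++

  #include <cassert>
  #include <cmath>
  #include <cstddef>
  #include <vector>

  // Cosine similarity of two equally sized embeddings. Returns 0.0 if either
  // vector has zero magnitude, so the function never divides by zero.
  inline double cosineSimilarity(const std::vector<double> &A,
                                 const std::vector<double> &B) {
    assert(A.size() == B.size() && "embeddings must share one dimension");
    double Dot = 0.0, NormA = 0.0, NormB = 0.0;
    for (std::size_t I = 0; I < A.size(); ++I) {
      Dot += A[I] * B[I];
      NormA += A[I] * A[I];
      NormB += B[I] * B[I];
    }
    if (NormA == 0.0 || NormB == 0.0)
      return 0.0;
    return Dot / (std::sqrt(NormA) * std::sqrt(NormB));
  }

Values close to 1 indicate embeddings that point in nearly the same direction,
and values close to 0 indicate nearly orthogonal embeddings. Whether cosine
similarity is the right metric depends on the downstream task.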

Further Details
^^^^^^^^^^^^^^^

For more detailed information about the MIR2Vec algorithm, its parameters, and
advanced usage, please refer to the original paper:
`RL4ReAl: Reinforcement Learning for Register Allocation <https://doi.org/10.1145/3578360.3580273>`_.

For information about using the ``llvm-ir2vec`` tool to generate MIR2Vec
embeddings from Machine IR, see :doc:`CommandGuide/llvm-ir2vec`.

The LLVM source code for ``MIR2Vec`` can be explored to understand the
implementation details. See ``llvm/include/llvm/CodeGen/MIR2Vec.h`` and
``llvm/lib/CodeGen/MIR2Vec.cpp``.

Building with ML support
========================
