[MLGO] Add MIR2Vec embedding framework documentation #164033
Open
svkeerthy wants to merge 1 commit into users/svkeerthy/10-17-use_colored_error_messages from users/svkeerthy/10-17-update_mlgo_doc
+140
−4
This was referenced Oct 17, 2025
@llvm/pr-subscribers-mlgo
Author: S. VenkataKeerthy (svkeerthy)
Changes
Full diff: https://github.com/llvm/llvm-project/pull/164033.diff
1 Files Affected:
diff --git a/llvm/docs/MLGO.rst b/llvm/docs/MLGO.rst
index bf3de11a2640e..2443835ea2fff 100644
--- a/llvm/docs/MLGO.rst
+++ b/llvm/docs/MLGO.rst
@@ -434,8 +434,27 @@ The latter is also used in tests.
There is no C++ implementation of a log reader. We do not have a scenario
motivating one.
-IR2Vec Embeddings
-=================
+Embeddings
+==========
+
+LLVM provides embedding frameworks to generate vector representations of code
+at different abstraction levels. These embeddings capture syntactic, semantic,
+and structural properties of the code and can be used as features for machine
+learning models in various compiler optimization tasks.
+
+Two embedding frameworks are available:
+
+- **IR2Vec**: Generates embeddings for LLVM IR
+- **MIR2Vec**: Generates embeddings for Machine IR
+
+Both frameworks follow a similar architecture with vocabulary-based embedding
+generation, where a vocabulary maps code entities to n-dimensional
+floating-point vectors. These embeddings can be computed at multiple
+granularity levels (instruction, basic block, and function) and used for
+ML-guided compiler optimizations.
+
+IR2Vec
+------
IR2Vec is a program embedding approach designed specifically for LLVM IR. It
is implemented as a function analysis pass in LLVM. The IR2Vec embeddings
@@ -466,7 +485,7 @@ The core components are:
compute embeddings for instructions, basic blocks, and functions.
Using IR2Vec
-------------
+^^^^^^^^^^^^
.. note::
@@ -526,7 +545,7 @@ embeddings can be computed and accessed via an ``ir2vec::Embedder`` instance.
between different code snippets, or perform other analyses as needed.
Further Details
----------------
+^^^^^^^^^^^^^^^
For more detailed information about the IR2Vec algorithm, its parameters, and
advanced usage, please refer to the original paper:
@@ -538,6 +557,123 @@ triplets from LLVM IR, see :doc:`CommandGuide/llvm-ir2vec`.
The LLVM source code for ``IR2Vec`` can also be explored to understand the
implementation details.
+MIR2Vec
+-------
+
+MIR2Vec is an extension of IR2Vec designed specifically for LLVM Machine IR
+(MIR). It generates embeddings for machine-level instructions, basic blocks,
+and functions. MIR2Vec operates on the target-specific machine representation,
+capturing machine instruction semantics including opcodes, operands, and
+register information at the machine level.
+
+MIR2Vec extends the vocabulary to include:
+
+- **Machine Opcodes**: Target-specific instruction opcodes derived from
+  ``TargetInstrInfo``, grouped by instruction semantics.
+
+- **Common Operands**: All common operand types (excluding register operands),
+  defined by the ``MachineOperand::MachineOperandType`` enum.
+
+- **Physical Register Classes**: Register classes defined by the target,
+  specialized for physical registers.
+
+- **Virtual Register Classes**: Register classes defined by the target,
+  specialized for virtual registers.
+
+The core components are:
+
+- **Vocabulary**: A mapping from machine IR entities (opcodes, operands, register
+  classes) to their vector representations. This is managed by
+  ``MIR2VecVocabLegacyAnalysis`` for the legacy pass manager, with a
+  ``MIR2VecVocabProvider`` that can be used standalone or wrapped by pass
+  managers. The vocabulary (.json file) contains sections for opcodes, common
+  operands, physical register classes, and virtual register classes.
+
+  .. note::
+
+    The vocabulary file must contain all of these sections to be considered
+    valid.
+
+- **Embedder**: A class (``mir2vec::MIREmbedder``) that uses the vocabulary to
+  compute embeddings for machine instructions, machine basic blocks, and
+  machine functions. Currently, ``SymbolicMIREmbedder`` is the only available
+  implementation.
+
+Using MIR2Vec
+^^^^^^^^^^^^^
+
+.. note::
+
+   This section describes how to use MIR2Vec within LLVM passes. The
+   ``llvm-ir2vec`` tool (see :doc:`CommandGuide/llvm-ir2vec`) can also generate
+   MIR2Vec embeddings from Machine IR files (.mir), which is useful for
+   generating embeddings outside of compiler passes.
+
+To generate MIR2Vec embeddings in a compiler pass, first obtain the vocabulary,
+then create an embedder instance to compute and access embeddings.
+
+1. **Get the Vocabulary**:
+   In a ``MachineFunctionPass``, get the vocabulary from the analysis:
+
+   .. code-block:: c++
+
+      // Assuming MF is a MachineFunction&
+      auto &VocabAnalysis = getAnalysis<MIR2VecVocabLegacyAnalysis>();
+      auto VocabOrErr =
+          VocabAnalysis.getMIR2VecVocabulary(*MF.getFunction().getParent());
+      if (!VocabOrErr) {
+        // Handle error: vocabulary is not available or invalid
+        return;
+      }
+      const mir2vec::MIRVocabulary &Vocabulary = *VocabOrErr;
+
+   Note that ``MIR2VecVocabLegacyAnalysis`` is an immutable pass.
+
+2. **Create Embedder instance**:
+   With the vocabulary, create an embedder for a specific machine function:
+
+   .. code-block:: c++
+
+      // Assuming MF is a MachineFunction&
+      // For example, using MIR2VecKind::Symbolic:
+      std::unique_ptr<mir2vec::MIREmbedder> Emb =
+          mir2vec::MIREmbedder::create(MIR2VecKind::Symbolic, MF, Vocabulary);
+
+
+3. **Compute and Access Embeddings**:
+   Call ``getMFunctionVector()`` to get the embedding for the machine function.
+
+   .. code-block:: c++
+
+      mir2vec::Embedding FuncVector = Emb->getMFunctionVector();
+
+   Currently, ``MIREmbedder`` can generate embeddings at three levels: Machine
+   Instructions, Machine Basic Blocks, and Machine Functions. Appropriate
+   getters are provided to access the embeddings at these levels.
+
+   .. note::
+
+      The validity of the ``MIREmbedder`` instance (and the embeddings it
+      generates) is tied to the machine function it is associated with. If the
+      machine function is modified, the embeddings may become stale and should
+      be recomputed accordingly.
+
+4. **Working with Embeddings**:
+   Embeddings are represented as ``std::vector<double>``. These vectors can be
+   used as features for machine learning models, to compute similarity scores
+   between different code snippets, or to perform other analyses as needed.
+
+Further Details
+^^^^^^^^^^^^^^^
+
+For more detailed information about the MIR2Vec algorithm, its parameters, and
+advanced usage, please refer to the original paper:
+`RL4ReAl: Reinforcement Learning for Register Allocation <https://doi.org/10.1145/3578360.3580273>`_.
+
+For information about using the ``llvm-ir2vec`` tool to generate MIR2Vec
+embeddings from Machine IR, see :doc:`CommandGuide/llvm-ir2vec`.
+
+The LLVM source code for ``MIR2Vec`` can be explored to understand the
+implementation details. See ``llvm/include/llvm/CodeGen/MIR2Vec.h`` and
+``llvm/lib/CodeGen/MIR2Vec.cpp``.
+
Building with ML support
========================
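
Note on the overview text in the diff above: it describes a vocabulary that maps code entities to n-dimensional vectors, with embeddings available per instruction, basic block, and function. The standalone C++ sketch below illustrates that idea without any LLVM dependency. It assumes that each level is a plain element-wise sum of the level below and that entities missing from the vocabulary contribute nothing. Both are simplifying assumptions for illustration, and the actual `mir2vec::MIREmbedder` may combine entities differently. All type and function names (`ToyVocabulary`, `embedInstruction`, `embedBlock`, `embedFunction`) are hypothetical and not part of LLVM.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Toy stand-ins for illustration only; these are NOT LLVM's mir2vec types.
using Embedding = std::vector<double>;
using ToyVocabulary = std::map<std::string, Embedding>; // entity -> vector
using ToyInstruction = std::vector<std::string>;        // entity names
using ToyBlock = std::vector<ToyInstruction>;
using ToyFunction = std::vector<ToyBlock>;

// Adds Src to Dst element-wise; both must have the same dimension.
void accumulate(Embedding &Dst, const Embedding &Src) {
  assert(Dst.size() == Src.size() && "embedding dimension mismatch");
  for (std::size_t I = 0; I < Dst.size(); ++I)
    Dst[I] += Src[I];
}

// Instruction level: sum the vocabulary vectors of the instruction's entities.
// Assumption: entities missing from the vocabulary are skipped.
Embedding embedInstruction(const ToyInstruction &Inst,
                           const ToyVocabulary &Vocab, std::size_t Dim) {
  Embedding Result(Dim, 0.0);
  for (const std::string &Entity : Inst) {
    auto It = Vocab.find(Entity);
    if (It != Vocab.end())
      accumulate(Result, It->second);
  }
  return Result;
}

// Basic block level: sum of the instruction embeddings in the block.
Embedding embedBlock(const ToyBlock &Block, const ToyVocabulary &Vocab,
                     std::size_t Dim) {
  Embedding Result(Dim, 0.0);
  for (const ToyInstruction &Inst : Block)
    accumulate(Result, embedInstruction(Inst, Vocab, Dim));
  return Result;
}

// Function level: sum of the block embeddings in the function.
Embedding embedFunction(const ToyFunction &Func, const ToyVocabulary &Vocab,
                        std::size_t Dim) {
  Embedding Result(Dim, 0.0);
  for (const ToyBlock &Block : Func)
    accumulate(Result, embedBlock(Block, Vocab, Dim));
  return Result;
}
```

With a two-dimensional toy vocabulary, calling `embedFunction(Func, Vocab, 2)` returns one vector that aggregates every known entity in the function, which mirrors the three granularity levels named in the documentation.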
No description provided.
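
Step 4 of the new MIR2Vec section mentions computing similarity scores between embeddings. One common choice for this is cosine similarity. The sketch below assumes the embeddings are available as plain `std::vector<double>` values, as the documentation states, and uses a hypothetical helper named `cosineSimilarity` that is not part of LLVM. Returning 0.0 for a zero-length vector is a convention chosen here, not something the documentation specifies.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not an LLVM API): cosine similarity of two embeddings
// of equal dimension. The result lies in [-1, 1]; 0.0 is returned when either
// vector has zero magnitude, since the angle is undefined in that case.
double cosineSimilarity(const std::vector<double> &A,
                        const std::vector<double> &B) {
  assert(A.size() == B.size() && "embedding dimension mismatch");
  double Dot = 0.0, NormSqA = 0.0, NormSqB = 0.0;
  for (std::size_t I = 0; I < A.size(); ++I) {
    Dot += A[I] * B[I];
    NormSqA += A[I] * A[I];
    NormSqB += B[I] * B[I];
  }
  if (NormSqA == 0.0 || NormSqB == 0.0)
    return 0.0;
  return Dot / (std::sqrt(NormSqA) * std::sqrt(NormSqB));
}
```

In a pass, the two inputs would be the function-level vectors obtained from two embedder instances; how to convert a `mir2vec::Embedding` to `std::vector<double>` is left out here because the diff does not show that accessor.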