svkeerthy

No description provided.

@svkeerthy svkeerthy changed the title Update MLGO Doc [MLGO] Add MIR2Vec embedding framework documentation Oct 17, 2025
@svkeerthy svkeerthy marked this pull request as ready for review October 17, 2025 23:43
@llvmbot llvmbot added the mlgo label Oct 17, 2025

llvmbot commented Oct 17, 2025

@llvm/pr-subscribers-mlgo

Author: S. VenkataKeerthy (svkeerthy)

Changes

Full diff: https://github.com/llvm/llvm-project/pull/164033.diff

1 Files Affected:

  • (modified) llvm/docs/MLGO.rst (+140-4)
diff --git a/llvm/docs/MLGO.rst b/llvm/docs/MLGO.rst
index bf3de11a2640e..2443835ea2fff 100644
--- a/llvm/docs/MLGO.rst
+++ b/llvm/docs/MLGO.rst
@@ -434,8 +434,27 @@ The latter is also used in tests.
 There is no C++ implementation of a log reader. We do not have a scenario
 motivating one.
 
-IR2Vec Embeddings
-=================
+Embeddings
+==========
+
+LLVM provides embedding frameworks to generate vector representations of code
+at different abstraction levels. These embeddings capture syntactic, semantic,
+and structural properties of the code and can be used as features for machine
+learning models in various compiler optimization tasks.
+
+Two embedding frameworks are available:
+
+- **IR2Vec**: Generates embeddings for LLVM IR
+- **MIR2Vec**: Generates embeddings for Machine IR
+
+Both frameworks follow a similar architecture with vocabulary-based embedding
+generation, where a vocabulary maps code entities to n-dimensional
+floating-point vectors. These embeddings can be computed at multiple
+granularity levels (instruction, basic block, and function) and used for
+ML-guided compiler optimizations.
+
+IR2Vec
+------
 
 IR2Vec is a program embedding approach designed specifically for LLVM IR. It
 is implemented as a function analysis pass in LLVM. The IR2Vec embeddings
@@ -466,7 +485,7 @@ The core components are:
     compute embeddings for instructions, basic blocks, and functions.
 
 Using IR2Vec
-------------
+^^^^^^^^^^^^
 
 .. note::
 
@@ -526,7 +545,7 @@ embeddings can be computed and accessed via an ``ir2vec::Embedder`` instance.
    between different code snippets, or perform other analyses as needed.
 
 Further Details
----------------
+^^^^^^^^^^^^^^^
 
 For more detailed information about the IR2Vec algorithm, its parameters, and
 advanced usage, please refer to the original paper:
@@ -538,6 +557,123 @@ triplets from LLVM IR, see :doc:`CommandGuide/llvm-ir2vec`.
 The LLVM source code for ``IR2Vec`` can also be explored to understand the 
 implementation details.
 
+MIR2Vec
+-------
+
+MIR2Vec is an extension of IR2Vec designed specifically for LLVM Machine IR
+(MIR). It generates embeddings for machine-level instructions, basic blocks,
+and functions. MIR2Vec operates on the target-specific machine representation
+and captures machine instruction semantics, including opcodes, operands, and
+register information.
+
+MIR2Vec extends the vocabulary to include:
+
+- **Machine Opcodes**: Target-specific instruction opcodes derived from the
+  TargetInstrInfo, grouped by instruction semantics.
+
+- **Common Operands**: All common operand types (excluding register operands),
+  defined by the ``MachineOperand::MachineOperandType`` enum.
+
+- **Physical Register Classes**: Register classes defined by the target,
+  specialized for physical registers.
+
+- **Virtual Register Classes**: Register classes defined by the target,
+  specialized for virtual registers.
+
+The core components are:
+
+- **Vocabulary**: A mapping from machine IR entities (opcodes, operands, register
+  classes) to their vector representations. This is managed by 
+  ``MIR2VecVocabLegacyAnalysis`` for the legacy pass manager, with a 
+  ``MIR2VecVocabProvider`` that can be used standalone or wrapped by pass 
+  managers. The vocabulary (.json file) contains sections for opcodes, common 
+  operands, physical register classes, and virtual register classes.
+
+  .. note::
+    
+    The vocabulary file should contain these sections for it to be valid.
+
+- **Embedder**: A class (``mir2vec::MIREmbedder``) that uses the vocabulary to
+  compute embeddings for machine instructions, machine basic blocks, and 
+  machine functions. Currently, ``SymbolicMIREmbedder`` is the available 
+  implementation.
+
+Using MIR2Vec
+^^^^^^^^^^^^^
+
+.. note::
+
+   This section describes how to use MIR2Vec within LLVM passes. The
+   ``llvm-ir2vec`` tool (:doc:`CommandGuide/llvm-ir2vec`) can be used to generate
+   MIR2Vec embeddings from Machine IR files (.mir), which is useful for
+   generating embeddings outside of compiler passes.
+
+To generate MIR2Vec embeddings in a compiler pass, first obtain the vocabulary,
+then create an embedder instance to compute and access embeddings.
+
+1. **Get the Vocabulary**:
+   In a MachineFunctionPass, get the vocabulary from the analysis:
+
+   .. code-block:: c++
+
+      auto &VocabAnalysis = getAnalysis<MIR2VecVocabLegacyAnalysis>();
+      auto VocabOrErr = VocabAnalysis.getMIR2VecVocabulary(*MF.getFunction().getParent());
+      if (!VocabOrErr) {
+        // Handle error: vocabulary is not available or invalid
+        return;
+      }
+      const mir2vec::MIRVocabulary &Vocabulary = *VocabOrErr;
+
+   Note that ``MIR2VecVocabLegacyAnalysis`` is an immutable pass.
+
+2. **Create Embedder instance**:
+   With the vocabulary, create an embedder for a specific machine function:
+
+   .. code-block:: c++
+
+      // Assuming MF is a MachineFunction&
+      // For example, using MIR2VecKind::Symbolic:
+      std::unique_ptr<mir2vec::MIREmbedder> Emb =
+          mir2vec::MIREmbedder::create(MIR2VecKind::Symbolic, MF, Vocabulary);
+
+
+3. **Compute and Access Embeddings**:
+   Call ``getMFunctionVector()`` to get the embedding for the machine function.
+
+   .. code-block:: c++
+
+      mir2vec::Embedding FuncVector = Emb->getMFunctionVector();
+
+   Currently, ``MIREmbedder`` can generate embeddings at three levels: Machine
+   Instructions, Machine Basic Blocks, and Machine Functions. Appropriate 
+   getters are provided to access the embeddings at these levels.
+
+   .. note::
+
+    The validity of the ``MIREmbedder`` instance (and the embeddings it 
+    generates) is tied to the machine function it is associated with. If the 
+    machine function is modified, the embeddings may become stale and should 
+    be recomputed accordingly.
+
+4. **Working with Embeddings**:
+   Embeddings are represented as ``std::vector<double>``. These vectors can be
+   used as features for machine learning models, to compute similarity scores
+   between different code snippets, or for other analyses as needed.
+
+Further Details
+^^^^^^^^^^^^^^^
+
+For more detailed information about the MIR2Vec algorithm, its parameters, and
+advanced usage, please refer to the original paper:
+`RL4ReAl: Reinforcement Learning for Register Allocation <https://doi.org/10.1145/3578360.3580273>`_.
+
+For information about using the ``llvm-ir2vec`` tool to generate embeddings
+from Machine IR, see :doc:`CommandGuide/llvm-ir2vec`.
+
+The LLVM source code for ``MIR2Vec`` can be explored to understand the 
+implementation details. See ``llvm/include/llvm/CodeGen/MIR2Vec.h`` and 
+``llvm/lib/CodeGen/MIR2Vec.cpp``.
+
 Building with ML support
 ========================
 

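Step 4 of the added "Using MIR2Vec" section says embeddings can be used to compute similarity scores between code snippets. One common choice is cosine similarity. The sketch below is only an illustration and assumes the embedding values have already been copied into a plain `std::vector<double>`. `cosineSimilarity` is a hypothetical helper and is not part of LLVM or the MIR2Vec API.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not an LLVM API): cosine similarity of two
// equal-length embedding vectors. Returns 0.0 if either vector has zero norm.
double cosineSimilarity(const std::vector<double> &A,
                        const std::vector<double> &B) {
  assert(A.size() == B.size() && "embeddings must have the same dimension");
  double Dot = 0.0, NormA = 0.0, NormB = 0.0;
  for (std::size_t I = 0; I < A.size(); ++I) {
    Dot += A[I] * B[I];   // accumulate the dot product
    NormA += A[I] * A[I]; // accumulate squared norm of A
    NormB += B[I] * B[I]; // accumulate squared norm of B
  }
  if (NormA == 0.0 || NormB == 0.0)
    return 0.0; // similarity is undefined for a zero vector; use 0.0 here
  return Dot / (std::sqrt(NormA) * std::sqrt(NormB));
}
```

The result lies in [-1, 1]. Two function embeddings produced with the same vocabulary can be compared this way. Whether the score is meaningful depends on the vocabulary and the downstream task.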