diff --git a/llvm/docs/MLGO.rst b/llvm/docs/MLGO.rst
index bf3de11a2640e..2443835ea2fff 100644
--- a/llvm/docs/MLGO.rst
+++ b/llvm/docs/MLGO.rst
@@ -434,8 +434,27 @@ The latter is also used in tests.
 There is no C++ implementation of a log reader. We do not have a scenario
 motivating one.
 
-IR2Vec Embeddings
-=================
+Embeddings
+==========
+
+LLVM provides embedding frameworks to generate vector representations of code
+at different abstraction levels. These embeddings capture syntactic, semantic,
+and structural properties of the code and can be used as features for machine
+learning models in various compiler optimization tasks.
+
+Two embedding frameworks are available:
+
+- **IR2Vec**: Generates embeddings for LLVM IR
+- **MIR2Vec**: Generates embeddings for Machine IR
+
+Both frameworks follow a similar architecture with vocabulary-based embedding
+generation, where a vocabulary maps code entities to n-dimensional floating
+point vectors. These embeddings can be computed at multiple granularity levels
+(instruction, basic block, and function) and used for ML-guided compiler
+optimizations.
+
+IR2Vec
+------
 
 IR2Vec is a program embedding approach designed specifically for LLVM IR. It
 is implemented as a function analysis pass in LLVM. The IR2Vec embeddings
@@ -466,7 +485,7 @@ The core components are:
   compute embeddings for instructions, basic blocks, and functions.
 
 Using IR2Vec
-------------
+^^^^^^^^^^^^
 
 .. note::
 
@@ -526,7 +545,7 @@ embeddings can be computed and accessed via an ``ir2vec::Embedder`` instance.
    between different code snippets, or perform other analyses as needed.
 
 Further Details
----------------
+^^^^^^^^^^^^^^^
 
 For more detailed information about the IR2Vec algorithm, its parameters, and
 advanced usage, please refer to the original paper:
@@ -538,6 +557,123 @@ triplets from LLVM IR, see :doc:`CommandGuide/llvm-ir2vec`.
 The LLVM source code for ``IR2Vec`` can also be explored to understand the
 implementation details.
 
+MIR2Vec
+-------
+
+MIR2Vec is an extension of IR2Vec designed specifically for LLVM Machine IR
+(MIR). It generates embeddings for machine-level instructions, basic blocks,
+and functions. MIR2Vec operates on the target-specific machine representation,
+capturing machine instruction semantics including opcodes, operands, and
+register information at the machine level.
+
+MIR2Vec extends the vocabulary to include:
+
+- **Machine Opcodes**: Target-specific instruction opcodes derived from the
+  TargetInstrInfo, grouped by instruction semantics.
+
+- **Common Operands**: All common operand types (excluding register operands),
+  defined by the ``MachineOperand::MachineOperandType`` enum.
+
+- **Physical Register Classes**: Register classes defined by the target,
+  specialized for physical registers.
+
+- **Virtual Register Classes**: Register classes defined by the target,
+  specialized for virtual registers.
+
+The core components are:
+
+- **Vocabulary**: A mapping from machine IR entities (opcodes, operands, register
+  classes) to their vector representations. This is managed by
+  ``MIR2VecVocabLegacyAnalysis`` for the legacy pass manager, with a
+  ``MIR2VecVocabProvider`` that can be used standalone or wrapped by pass
+  managers. The vocabulary (.json file) contains sections for opcodes, common
+  operands, physical register classes, and virtual register classes.
+
+  .. note::
+
+    The vocabulary file should contain these sections for it to be valid.
+
+- **Embedder**: A class (``mir2vec::MIREmbedder``) that uses the vocabulary to
+  compute embeddings for machine instructions, machine basic blocks, and
+  machine functions. Currently, ``SymbolicMIREmbedder`` is the available
+  implementation.
+
+Using MIR2Vec
+^^^^^^^^^^^^^
+
+.. note::
+
+   This section describes how to use MIR2Vec within LLVM passes.
+   The ``llvm-ir2vec`` tool (:doc:`CommandGuide/llvm-ir2vec`) can generate MIR2Vec
+   embeddings from Machine IR files (.mir), which is useful for generating
+   embeddings outside of compiler passes.
+
+To generate MIR2Vec embeddings in a compiler pass, first obtain the vocabulary,
+then create an embedder instance to compute and access embeddings.
+
+1. **Get the Vocabulary**:
+   In a MachineFunctionPass, get the vocabulary from the analysis:
+
+   .. code-block:: c++
+
+      auto &VocabAnalysis = getAnalysis<MIR2VecVocabLegacyAnalysis>();
+      auto VocabOrErr = VocabAnalysis.getMIR2VecVocabulary(*MF.getFunction().getParent());
+      if (!VocabOrErr) {
+        // Handle error: vocabulary is not available or invalid
+        return;
+      }
+      const mir2vec::MIRVocabulary &Vocabulary = *VocabOrErr;
+
+   Note that ``MIR2VecVocabLegacyAnalysis`` is an immutable pass.
+
+2. **Create Embedder instance**:
+   With the vocabulary, create an embedder for a specific machine function:
+
+   .. code-block:: c++
+
+      // Assuming MF is a MachineFunction&
+      // For example, using MIR2VecKind::Symbolic:
+      std::unique_ptr<mir2vec::MIREmbedder> Emb =
+          mir2vec::MIREmbedder::create(MIR2VecKind::Symbolic, MF, Vocabulary);
+
+
+3. **Compute and Access Embeddings**:
+   Call ``getMFunctionVector()`` to get the embedding for the machine function.
+
+   .. code-block:: c++
+
+      mir2vec::Embedding FuncVector = Emb->getMFunctionVector();
+
+   Currently, ``MIREmbedder`` can generate embeddings at three levels: Machine
+   Instructions, Machine Basic Blocks, and Machine Functions. Appropriate
+   getters are provided to access the embeddings at these levels.
+
+   .. note::
+
+      The validity of the ``MIREmbedder`` instance (and the embeddings it
+      generates) is tied to the machine function it is associated with. If the
+      machine function is modified, the embeddings may become stale and should
+      be recomputed accordingly.
+
+4. **Working with Embeddings:**
+   Embeddings are represented as ``std::vector<double>``.
+   These vectors can be used as features for machine learning models, to compute
+   similarity scores between code snippets, or for other analyses as needed.
+
+Further Details
+^^^^^^^^^^^^^^^
+
+For more detailed information about the MIR2Vec algorithm, its parameters, and
+advanced usage, please refer to the original paper:
+`RL4ReAl: Reinforcement Learning for Register Allocation `_.
+
+For information about using the ``llvm-ir2vec`` tool to generate MIR2Vec
+embeddings from Machine IR, see :doc:`CommandGuide/llvm-ir2vec`.
+
+The LLVM source code for ``MIR2Vec`` can be explored to understand the
+implementation details. See ``llvm/include/llvm/CodeGen/MIR2Vec.h`` and
+``llvm/lib/CodeGen/MIR2Vec.cpp``.
+
 
 Building with ML support
 ========================