Adding documentation

svkeerthy · svkeerthy · commit 6e0385e614f8 · 2025-05-11T23:23:44.000Z
diff --git a/llvm/docs/MLGO.rst b/llvm/docs/MLGO.rst
@@ -174,3 +174,151 @@ clang.
     TODO(mtrofin): 
         - logging, and the use in interactive mode.
         - discuss an example (like the inliner)
+
+IR2Vec Embeddings
+=================
+
+IR2Vec is a program embedding approach designed specifically for LLVM IR. It
+is implemented as a function analysis pass in LLVM. The IR2Vec embeddings
+capture syntactic, semantic, and structural properties of the IR through
+learned representations. These representations are provided as a JSON
+vocabulary that maps the entities of the IR (opcodes, types, operands) to
+n-dimensional floating point vectors (embeddings).
+
+With IR2Vec, representations can be obtained at different granularities of the
+IR, such as instructions, basic blocks, and functions. Representations of
+loops and regions can be derived from these. The representations are useful
+for various downstream tasks, including ML-guided compiler optimizations.
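+
+As a minimal sketch of one way to derive such a coarser representation, the
+helper below combines the embeddings of the basic blocks that make up a loop
+or region by element-wise summation. The aggregation rule and the names
+``Embedding`` and ``aggregateEmbeddings`` are assumptions made for
+illustration; they are not part of the LLVM API.
+
+.. code-block:: c++
+
+   #include <cstddef>
+   #include <vector>
+
+   // Hypothetical stand-in for an n-dimensional embedding.
+   using Embedding = std::vector<double>;
+
+   // Sum the given embeddings element-wise. Inputs are assumed to have
+   // dimension Dim; an empty input yields the zero vector.
+   Embedding aggregateEmbeddings(const std::vector<Embedding> &Parts,
+                                 std::size_t Dim) {
+     Embedding Result(Dim, 0.0);
+     for (const Embedding &Part : Parts)
+       for (std::size_t I = 0; I < Dim && I < Part.size(); ++I)
+         Result[I] += Part[I];
+     return Result;
+   }
+
+For a loop, ``Parts`` would hold the embeddings of the basic blocks that
+belong to the loop.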
+
+Currently, to use IR2Vec embeddings, the JSON vocabulary must first be read to
+obtain the vocabulary mapping. This mapping is then used to derive the
+representations. In LLVM, this process is implemented using two
+independent passes: ``IR2VecVocabAnalysis`` and ``IR2VecAnalysis``. The former
+reads the JSON vocabulary and populates ``IR2VecVocabResult``, which is then used
+by ``IR2VecAnalysis``.
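+
+Conceptually, the vocabulary mapping can be pictured as a lookup table from
+entity names to vectors. The sketch below is illustrative only: the ``Vocab``
+alias, the ``lookupEntity`` helper, and the zero-vector fallback for unknown
+entities are assumptions, and do not reflect the actual layout of
+``IR2VecVocabResult``.
+
+.. code-block:: c++
+
+   #include <cstddef>
+   #include <map>
+   #include <string>
+   #include <vector>
+
+   // Hypothetical in-memory form of the vocabulary: entity name -> embedding.
+   using Vocab = std::map<std::string, std::vector<double>>;
+
+   // Return the embedding for Entity, or a zero vector of dimension Dim if
+   // the entity is not present in the vocabulary.
+   std::vector<double> lookupEntity(const Vocab &V, const std::string &Entity,
+                                    std::size_t Dim) {
+     auto It = V.find(Entity);
+     if (It == V.end())
+       return std::vector<double>(Dim, 0.0);
+     return It->second;
+   }
+
+Building such a table once and reusing it for every function mirrors the split
+between the vocabulary pass and the embedding pass.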
+
+It is recommended to run ``IR2VecVocabAnalysis`` once, as the
+vocabulary typically does not change. In the future, we plan
+to improve this process by automatically generating the vocabulary mappings
+during build time, eliminating the need for a separate file read.
+
+IR2VecAnalysis Usage
+--------------------
+
+To use IR2Vec in an LLVM-based tool or pass, access the analysis results
+through the following steps:
+
+1. **Including the Header:**
+
+   First, include the necessary header file in the source code:
+
+   .. code-block:: c++
+
+      #include "llvm/Analysis/IR2VecAnalysis.h"
+
+2. **Accessing the Analysis Results:**
+
+   To access the IR2Vec embeddings, obtain the ``IR2VecAnalysis``
+   result from the Function Analysis Manager (FAM).
+
+   .. code-block:: c++
+
+      llvm::FunctionAnalysisManager &FAM = ...; // The FAM instance
+      llvm::Function &F = ...; // The function to analyze
+      auto &IR2VecResult = FAM.getResult<llvm::IR2VecAnalysis>(F);
+
+3. **Checking for Valid Results:**
+
+   Ensure that the analysis result is valid before accessing the embeddings:
+
+   .. code-block:: c++
+
+      if (IR2VecResult.isValid()) {
+        // Proceed to access embeddings
+      }
+
+4. **Retrieving Embeddings:**
+
+   The ``IR2VecResult`` currently provides access to embeddings at three levels:
+
+   - **Instruction Embeddings:**
+
+     .. code-block:: c++
+
+        const auto &instVecMap = IR2VecResult.getInstVecMap();
+        // instVecMap is a SmallMapVector<const Instruction*, ir2vec::Embedding, 128>
+        for (const auto &it : instVecMap) {
+          const Instruction *I = it.first;
+          const ir2vec::Embedding &embedding = it.second;
+          // Use the instruction embedding
+        }
+
+   - **Basic Block Embeddings:**
+
+     .. code-block:: c++
+
+        const auto &bbVecMap = IR2VecResult.getBBVecMap();
+        // bbVecMap is a SmallMapVector<const BasicBlock*, ir2vec::Embedding, 16>
+        for (const auto &it : bbVecMap) {
+          const BasicBlock *BB = it.first;
+          const ir2vec::Embedding &embedding = it.second;
+          // Use the basic block embedding
+        }
+
+   - **Function Embedding:**
+
+     .. code-block:: c++
+
+        const ir2vec::Embedding &funcEmbedding = IR2VecResult.getFunctionVector();
+        // Use the function embedding
+
+5. **Working with Embeddings:**
+
+   Embeddings are represented as ``std::vector<double>``. These vectors can be
+   used as features for machine learning models, to compute similarity scores
+   between different code snippets, or for other analyses as needed.
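+
+   As an illustration of the similarity use case, the sketch below computes
+   the cosine similarity of two embeddings. Cosine similarity is one common
+   choice, not something mandated by IR2Vec, and ``cosineSimilarity`` is a
+   hypothetical helper rather than an LLVM API. It assumes both embeddings are
+   ``std::vector<double>`` values of equal dimension.
+
+   .. code-block:: c++
+
+      #include <cmath>
+      #include <cstddef>
+      #include <vector>
+
+      // Cosine similarity of A and B; returns 0.0 if the dimensions differ
+      // or either vector has zero magnitude.
+      double cosineSimilarity(const std::vector<double> &A,
+                              const std::vector<double> &B) {
+        if (A.size() != B.size())
+          return 0.0;
+        double Dot = 0.0, NormA = 0.0, NormB = 0.0;
+        for (std::size_t I = 0; I < A.size(); ++I) {
+          Dot += A[I] * B[I];
+          NormA += A[I] * A[I];
+          NormB += B[I] * B[I];
+        }
+        if (NormA == 0.0 || NormB == 0.0)
+          return 0.0;
+        return Dot / (std::sqrt(NormA) * std::sqrt(NormB));
+      }
+
+   Values close to 1 indicate embeddings that point in nearly the same
+   direction, while values close to 0 indicate unrelated ones.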
+
+Example Usage
+^^^^^^^^^^^^^
+
+.. code-block:: c++
+
+   #include "llvm/Analysis/IR2VecAnalysis.h"
+   #include "llvm/IR/Function.h"
+   #include "llvm/IR/Instructions.h"
+   #include "llvm/Passes/PassBuilder.h"
+   #include "llvm/Support/raw_ostream.h"
+
+   // ... other includes and code ...
+
+   void processFunction(llvm::Function &F, llvm::FunctionAnalysisManager &FAM) {
+     auto &IR2VecResult = FAM.getResult<llvm::IR2VecAnalysis>(F);
+
+     if (IR2VecResult.isValid()) {
+       const auto &instVecMap = IR2VecResult.getInstVecMap();
+       for (const auto &it : instVecMap) {
+         const llvm::Instruction *I = it.first;
+         const auto &embedding = it.second;
+         llvm::errs() << "Instruction: " << *I << "\n";
+         llvm::errs() << "Embedding: ";
+         for (double val : embedding) {
+           llvm::errs() << val << " ";
+         }
+         llvm::errs() << "\n";
+       }
+     } else {
+       llvm::errs() << "IR2Vec analysis failed for function " << F.getName() << "\n";
+     }
+   }
+
+   // ... rest of the pass ...
+
+   // In the pass's run method:
+   // processFunction(F, FAM);
+
+Further Details
+---------------
+
+For more detailed information about the IR2Vec algorithm, its parameters, and
+advanced usage, please refer to the original paper:
+`IR2Vec: LLVM IR Based Scalable Program Embeddings <https://doi.org/10.1145/3418463>`_.
+The LLVM source code for ``IR2VecAnalysis`` provides further implementation
+details.